| <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" |
| "http://www.w3.org/TR/html4/strict.dtd"> |
| <!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ --> |
| <html> |
| <head> |
| <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> |
| <title>Polly - Examples</title> |
| <link type="text/css" rel="stylesheet" href="menu.css"> |
| <link type="text/css" rel="stylesheet" href="content.css"> |
| </head> |
| <body> |
| <!--#include virtual="menu.html.incl"--> |
| <div id="content"> |
| <!--=====================================================================--> |
| <h1>Polly: Execute the individual Polly passes manually</h1> |
| <!--=====================================================================--> |
| |
| <p> |
| This example presents the individual passes that are involved when optimizing |
| code with Polly. We show how to execute them individually and explain for each |
| which analysis is performed or what transformation is applied. In this example |
| the polyhedral transformation is user-provided to show how much performance |
| improvement can be expected by an optimal automatic optimizer.</p> |
| |
| The files used and created in this example are available in the Polly checkout |
| in the folder <em>www/experiments/matmul</em>. They can be created automatically |
| by running the <em>www/experiments/matmul/runall.sh</em> script. |
| |
| <ol> |
| <li><h4>Create LLVM-IR from the C code</h4> |
| |
| Polly works on LLVM-IR. Hence it is necessary to translate the source files into |
| LLVM-IR. If more than on file should be optimized the files can be combined into |
| a single file with llvm-link. |
| |
| <pre class="code">clang -S -emit-llvm matmul.c -o matmul.s</pre> |
| </li> |
| |
| |
| <li><h4>Load Polly automatically when calling the 'opt' tool</h4> |
| |
| Polly is not built into opt or bugpoint, but it is a shared library that needs |
| to be loaded into these tools explicitally. The Polly library is called |
| LVMPolly.so. For a cmake build it is available in the build/lib/ directory, |
| autoconf creates the same file in |
| build/tools/polly/{Release+Asserts|Asserts|Debug}/lib. For convenience we create |
| an alias that automatically loads Polly if 'opt' is called. |
| <pre class="code"> |
| export PATH_TO_POLLY_LIB="~/polly/build/lib/" |
| alias opt="opt -load ${PATH_TO_POLLY_LIB}/LLVMPolly.so"</pre> |
| </li> |
| |
| <li><h4>Prepare the LLVM-IR for Polly</h4> |
| |
| Polly is only able to work with code that matches a canonical form. To translate |
| the LLVM-IR into this form we use a set of canonicalication passes. For this |
| example only three passes are necessary. To get good coverage on more |
| complicated input files often more canonicalization passes are needed. pollycc |
| contains a list of passes that have shown to be beneficial. |
| <pre class="code">opt -S -mem2reg -loop-simplify -polly-indvars matmul.s > matmul.preopt.ll</pre></li> |
| |
| <li><h4>Show the SCoPs detected by Polly (optional)</h4> |
| |
| To understand if Polly was able to detect SCoPs, we print the |
| structure of the detected SCoPs. In our example two SCoPs were detected. One in |
| 'init_array' the other in 'main'. |
| |
| <pre class="code">opt -basicaa -polly-cloog -analyze -q matmul.preopt.ll</pre> |
| |
| <pre> |
| init_array(): |
| for (c2=0;c2<=1023;c2++) { |
| for (c4=0;c4<=1023;c4++) { |
| Stmt_5(c2,c4); |
| } |
| } |
| |
| main(): |
| for (c2=0;c2<=1023;c2++) { |
| for (c4=0;c4<=1023;c4++) { |
| Stmt_4(c2,c4); |
| for (c6=0;c6<=1023;c6++) { |
| Stmt_6(c2,c4,c6); |
| } |
| } |
| } |
| </pre> |
| </li> |
| <li><h4>Highlight the detected SCoPs in the CFGs of the program (requires graphviz/dotty)</h4> |
| |
| Polly can use graphviz to graphically show a CFG in which the detected SCoPs are |
| highlighted. It can also create '.dot' files that can be translated by |
| the 'dot' utility into various graphic formats. |
| |
| <pre class="code">opt -basicaa -view-scops -disable-output matmul.preopt.ll |
| opt -basicaa -view-scops-only -disable-output matmul.preopt.ll</pre> |
| The output for the different functions<br /> |
| view-scops: |
| <a href="experiments/matmul/scops.main.dot.png">main</a>, |
| <a href="experiments/matmul/scops.init_array.dot.png">init_array</a>, |
| <a href="experiments/matmul/scops.print_array.dot.png">print_array</a><br /> |
| view-scops-only: |
| <a href="experiments/matmul/scopsonly.main.dot.png">main</a>, |
| <a href="experiments/matmul/scopsonly.init_array.dot.png">init_array</a>, |
| <a href="experiments/matmul/scopsonly.print_array.dot.png">print_array</a> |
| </li> |
| |
| <li><h4>View the polyhedral representation of the SCoPs</h4> |
| <pre class="code">opt -basicaa -polly-scops -analyze matmul.preopt.ll</pre> |
| <pre> |
| [...] |
| Printing analysis 'Polly - Create polyhedral description of Scops' for region: |
| 'for.cond => for.end19' in function 'init_array': |
| Context: |
| { [] } |
| Statements { |
| Stmt_5 |
| Domain := |
| { Stmt_5[i0, i1] : i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 }; |
| Scattering := |
| { Stmt_5[i0, i1] -> scattering[0, i0, 0, i1, 0] }; |
| WriteAccess := |
| { Stmt_5[i0, i1] -> MemRef_A[1037i0 + i1] }; |
| WriteAccess := |
| { Stmt_5[i0, i1] -> MemRef_B[1047i0 + i1] }; |
| FinalRead |
| Domain := |
| { FinalRead[0] }; |
| Scattering := |
| { FinalRead[i0] -> scattering[200000000, o1, o2, o3, o4] }; |
| ReadAccess := |
| { FinalRead[i0] -> MemRef_A[o0] }; |
| ReadAccess := |
| { FinalRead[i0] -> MemRef_B[o0] }; |
| } |
| [...] |
| Printing analysis 'Polly - Create polyhedral description of Scops' for region: |
| 'for.cond => for.end30' in function 'main': |
| Context: |
| { [] } |
| Statements { |
| Stmt_4 |
| Domain := |
| { Stmt_4[i0, i1] : i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 }; |
| Scattering := |
| { Stmt_4[i0, i1] -> scattering[0, i0, 0, i1, 0, 0, 0] }; |
| WriteAccess := |
| { Stmt_4[i0, i1] -> MemRef_C[1067i0 + i1] }; |
| Stmt_6 |
| Domain := |
| { Stmt_6[i0, i1, i2] : i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1023 }; |
| Scattering := |
| { Stmt_6[i0, i1, i2] -> scattering[0, i0, 0, i1, 1, i2, 0] }; |
| ReadAccess := |
| { Stmt_6[i0, i1, i2] -> MemRef_C[1067i0 + i1] }; |
| ReadAccess := |
| { Stmt_6[i0, i1, i2] -> MemRef_A[1037i0 + i2] }; |
| ReadAccess := |
| { Stmt_6[i0, i1, i2] -> MemRef_B[i1 + 1047i2] }; |
| WriteAccess := |
| { Stmt_6[i0, i1, i2] -> MemRef_C[1067i0 + i1] }; |
| FinalRead |
| Domain := |
| { FinalRead[0] }; |
| Scattering := |
| { FinalRead[i0] -> scattering[200000000, o1, o2, o3, o4, o5, o6] }; |
| ReadAccess := |
| { FinalRead[i0] -> MemRef_C[o0] }; |
| ReadAccess := |
| { FinalRead[i0] -> MemRef_A[o0] }; |
| ReadAccess := |
| { FinalRead[i0] -> MemRef_B[o0] }; |
| } |
| [...] |
| </pre> |
| </li> |
| |
| <li><h4>Show the dependences for the SCoPs</h4> |
| <pre class="code">opt -basicaa -polly-dependences -analyze matmul.preopt.ll</pre> |
| <pre>Printing analysis 'Polly - Calculate dependences for SCoP' for region: |
| 'for.cond => for.end19' in function 'init_array': |
| Must dependences: |
| { } |
| May dependences: |
| { } |
| Must no source: |
| { } |
| May no source: |
| { } |
| Printing analysis 'Polly - Calculate dependences for SCoP' for region: |
| 'for.cond => for.end30' in function 'main': |
| Must dependences: |
| { Stmt_4[i0, i1] -> Stmt_6[i0, i1, 0] : |
| i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023; |
| Stmt_6[i0, i1, i2] -> Stmt_6[i0, i1, 1 + i2] : |
| i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1022; |
| Stmt_6[i0, i1, 1023] -> FinalRead[0] : |
| i1 <= 1091540 - 1067i0 and i1 >= -1067i0 and i1 >= 0 and i1 <= 1023; |
| Stmt_6[1023, i1, 1023] -> FinalRead[0] : |
| i1 >= 0 and i1 <= 1023 |
| } |
| May dependences: |
| { } |
| Must no source: |
| { Stmt_6[i0, i1, i2] -> MemRef_A[1037i0 + i2] : |
| i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1023; |
| Stmt_6[i0, i1, i2] -> MemRef_B[i1 + 1047i2] : |
| i0 >= 0 and i0 <= 1023 and i1 >= 0 and i1 <= 1023 and i2 >= 0 and i2 <= 1023; |
| FinalRead[0] -> MemRef_A[o0]; |
| FinalRead[0] -> MemRef_B[o0] |
| FinalRead[0] -> MemRef_C[o0] : |
| o0 >= 1092565 or (exists (e0 = [(o0)/1067]: o0 <= 1091540 and o0 >= 0 |
| and 1067e0 <= -1024 + o0 and 1067e0 >= -1066 + o0)) or o0 <= -1; |
| } |
| May no source: |
| { } |
| </pre></li> |
| |
| <li><h4>Export jscop files</h4> |
| |
| Polly can export the polyhedral representation in so called jscop files. Jscop |
| files contain the polyhedral representation stored in a JSON file. |
| <pre class="code">opt -basicaa -polly-export-jscop matmul.preopt.ll</pre> |
| <pre>Writing SCoP 'for.cond => for.end19' in function 'init_array' to './init_array___%for.cond---%for.end19.jscop'. |
| Writing SCoP 'for.cond => for.end30' in function 'main' to './main___%for.cond---%for.end30.jscop'. |
| </pre></li> |
| |
| <li><h4>Import the changed jscop files and print the updated SCoP structure |
| (optional)</h4> |
| <p>Polly can reimport jscop files, in which the schedules of the statements are |
| changed. These changed schedules are used to descripe transformations. |
| It is possible to import different jscop files by providing the postfix |
| of the jscop file that is imported.</p> |
| <p> We apply three different transformations on the SCoP in the main function. |
| The jscop files describing these transformations are hand written (and available |
| in <em>www/experiments/matmul</em>). |
| |
| <h5>No Polly</h5> |
| |
| <p>As a baseline we do not call any Polly code generation, but only apply the |
| normal -O3 optimizations.</p> |
| |
| <pre class="code"> |
| opt matmul.preopt.ll -basicaa \ |
| -polly-import-jscop \ |
| -polly-cloog -analyze |
| </pre> |
| <pre> |
| [...] |
| main(): |
| for (c2=0;c2<g;=1535;c2++) { |
| for (c4=0;c4<g;=1535;c4++) { |
| Stmt_4(c2,c4); |
| for (c6=0;c6<g;=1535;c6++) { |
| Stmt_6(c2,c4,c6); |
| } |
| } |
| } |
| [...] |
| </pre> |
| <h5>Interchange (and Fission to allow the interchange)</h5> |
| <p>We split the loops and can now apply an interchange of the loop dimensions that |
| enumerate Stmt_6.</p> |
| <pre class="code"> |
| opt matmul.preopt.ll -basicaa \ |
| -polly-import-jscop -polly-import-jscop-postfix=interchanged \ |
| -polly-cloog -analyze |
| </pre> |
| <pre> |
| [...] |
| Reading JScop 'for.cond => for.end30' in function 'main' from './main___%for.cond---%for.end30.jscop.interchanged+tiled'. |
| [...] |
| main(): |
| for (c2=0;c2<=1535;c2++) { |
| for (c4=0;c4<=1535;c4++) { |
| Stmt_4(c2,c4); |
| } |
| } |
| for (c2=0;c2<=1535;c2++) { |
| for (c4=0;c4<=1535;c4++) { |
| for (c6=0;c6<=1535;c6++) { |
| Stmt_6(c2,c6,c4); |
| } |
| } |
| } |
| [...] |
| </pre> |
| <h5>Interchange + Tiling</h5> |
| <p>In addition to the interchange we tile now the second loop nest.</p> |
| |
| <pre class="code"> |
| opt matmul.preopt.ll -basicaa \ |
| -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled \ |
| -polly-cloog -analyze |
| </pre> |
| <pre> |
| [...] |
| Reading JScop 'for.cond => for.end30' in function 'main' from './main___%for.cond---%for.end30.jscop.interchanged+tiled'. |
| [...] |
| main(): |
| for (c2=0;c2<=1535;c2++) { |
| for (c4=0;c4<=1535;c4++) { |
| Stmt_4(c2,c4); |
| } |
| } |
| for (c2=0;c2<=1535;c2+=64) { |
| for (c3=0;c3<=1535;c3+=64) { |
| for (c4=0;c4<=1535;c4+=64) { |
| for (c5=c2;c5<=c2+63;c5++) { |
| for (c6=c4;c6<=c4+63;c6++) { |
| for (c7=c3;c7<=c3+63;c7++) { |
| Stmt_6(c5,c7,c6); |
| } |
| } |
| } |
| } |
| } |
| } |
| [...] |
| </pre> |
| <h5>Interchange + Tiling + Strip-mining to prepare vectorization</h5> |
| To later allow vectorization we create a so called trivially parallelizable |
| loop. It is innermost, parallel and has only four iterations. It can be |
| replaced by 4-element SIMD instructions. |
| <pre class="code"> |
| opt matmul.preopt.ll -basicaa \ |
| -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \ |
| -polly-cloog -analyze </pre> |
| |
| <pre> |
| [...] |
| Reading JScop 'for.cond => for.end30' in function 'main' from './main___%for.cond---%for.end30.jscop.interchanged+tiled+vector'. |
| [...] |
| main(): |
| for (c2=0;c2<=1535;c2++) { |
| for (c4=0;c4<=1535;c4++) { |
| Stmt_4(c2,c4); |
| } |
| } |
| for (c2=0;c2<=1535;c2+=64) { |
| for (c3=0;c3<=1535;c3+=64) { |
| for (c4=0;c4<=1535;c4+=64) { |
| for (c5=c2;c5<=c2+63;c5++) { |
| for (c6=c4;c6<=c4+63;c6++) { |
| for (c7=c3;c7<=c3+63;c7+=4) { |
| for (c8=c7;c8<=c7+3;c8++) { |
| Stmt_6(c5,c8,c6); |
| } |
| } |
| } |
| } |
| } |
| } |
| } |
| [...] |
| </pre> |
| |
| </li> |
| |
| <li><h4>Codegenerate the SCoPs</h4> |
| <p> |
| This generates new code for the SCoPs detected by polly. |
| If -polly-import is present, transformations specified in the imported openscop |
| files will be applied.</p> |
| <pre class="code">opt matmul.preopt.ll | opt -O3 > matmul.normalopt.ll</pre> |
| <pre class="code"> |
| opt -basicaa \ |
| -polly-import-jscop -polly-import-jscop-postfix=interchanged \ |
| -polly-codegen matmul.preopt.ll \ |
| | opt -O3 > matmul.polly.interchanged.ll</pre> |
| <pre> |
| Reading JScop 'for.cond => for.end19' in function 'init_array' from |
| './init_array___%for.cond---%for.end19.jscop.interchanged'. |
| File could not be read: No such file or directory |
| Reading JScop 'for.cond => for.end30' in function 'main' from |
| './main___%for.cond---%for.end30.jscop.interchanged'. |
| </pre> |
| <pre class="code"> |
| opt -basicaa \ |
| -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled \ |
| -polly-codegen matmul.preopt.ll \ |
| | opt -O3 > matmul.polly.interchanged+tiled.ll</pre> |
| <pre> |
| Reading JScop 'for.cond => for.end19' in function 'init_array' from |
| './init_array___%for.cond---%for.end19.jscop.interchanged+tiled'. |
| File could not be read: No such file or directory |
| Reading JScop 'for.cond => for.end30' in function 'main' from |
| './main___%for.cond---%for.end30.jscop.interchanged+tiled'. |
| </pre> |
| <pre class="code"> |
| opt -basicaa \ |
| -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \ |
| -polly-codegen -polly-vectorizer=polly matmul.preopt.ll \ |
| | opt -O3 > matmul.polly.interchanged+tiled+vector.ll</pre> |
| <pre> |
| Reading JScop 'for.cond => for.end19' in function 'init_array' from |
| './init_array___%for.cond---%for.end19.jscop.interchanged+tiled+vector'. |
| File could not be read: No such file or directory |
| Reading JScop 'for.cond => for.end30' in function 'main' from |
| './main___%for.cond---%for.end30.jscop.interchanged+tiled+vector'. |
| </pre> |
| <pre class="code"> |
| opt -basicaa \ |
| -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \ |
| -polly-codegen -polly-vectorizer=polly -enable-polly-openmp matmul.preopt.ll \ |
| | opt -O3 > matmul.polly.interchanged+tiled+openmp.ll</pre> |
| <pre> |
| Reading JScop 'for.cond => for.end19' in function 'init_array' from |
| './init_array___%for.cond---%for.end19.jscop.interchanged+tiled+vector'. |
| File could not be read: No such file or directory |
| Reading JScop 'for.cond => for.end30' in function 'main' from |
| './main___%for.cond---%for.end30.jscop.interchanged+tiled+vector'. |
| </pre> |
| |
| <li><h4>Create the executables</h4> |
| |
| Create one executable optimized with plain -O3 as well as a set of executables |
| optimized in different ways with Polly. One changes only the loop structure, the |
| other adds tiling, the next adds vectorization and finally we use OpenMP |
| parallelism. |
| <pre class="code"> |
| llc matmul.normalopt.ll -o matmul.normalopt.s && \ |
| gcc matmul.normalopt.s -o matmul.normalopt.exe |
| llc matmul.polly.interchanged.ll -o matmul.polly.interchanged.s && \ |
| gcc matmul.polly.interchanged.s -o matmul.polly.interchanged.exe |
| llc matmul.polly.interchanged+tiled.ll -o matmul.polly.interchanged+tiled.s && \ |
| gcc matmul.polly.interchanged+tiled.s -o matmul.polly.interchanged+tiled.exe |
| llc matmul.polly.interchanged+tiled+vector.ll -o matmul.polly.interchanged+tiled+vector.s && \ |
| gcc matmul.polly.interchanged+tiled+vector.s -o matmul.polly.interchanged+tiled+vector.exe |
| llc matmul.polly.interchanged+tiled+vector+openmp.ll -o matmul.polly.interchanged+tiled+vector+openmp.s && \ |
| gcc -lgomp matmul.polly.interchanged+tiled+vector+openmp.s -o matmul.polly.interchanged+tiled+vector+openmp.exe </pre> |
| |
| <li><h4>Compare the runtime of the executables</h4> |
| |
| By comparing the runtimes of the different code snippets we see that a simple |
| loop interchange gives here the largest performance boost. However by adding |
| vectorization and by using OpenMP we can further improve the performance |
| significantly. |
| <pre class="code">time ./matmul.normalopt.exe</pre> |
| <pre>42.68 real, 42.55 user, 0.00 sys</pre> |
| <pre class="code">time ./matmul.polly.interchanged.exe</pre> |
| <pre>04.33 real, 4.30 user, 0.01 sys</pre> |
| <pre class="code">time ./matmul.polly.interchanged+tiled.exe</pre> |
| <pre>04.11 real, 4.10 user, 0.00 sys</pre> |
| <pre class="code">time ./matmul.polly.interchanged+tiled+vector.exe</pre> |
| <pre>01.39 real, 1.36 user, 0.01 sys</pre> |
| <pre class="code">time ./matmul.polly.interchanged+tiled+vector+openmp.exe</pre> |
| <pre>00.66 real, 2.58 user, 0.02 sys</pre> |
| </li> |
| </ol> |
| |
| </div> |
| </body> |
| </html> |