| // |
| // See README-LCALS_license.txt for access and distribution restrictions |
| // |
| ================================================================================ |
| ================================================================================ |
| LCALS: Livermore Compiler Analysis Loop Suite |
| by Rich Hornung (hornung1@llnl.gov), |
| Center for Applied Scientific Computing, |
| Lawrence Livermore National Laboratory |
| ================================================================================ |
| ================================================================================ |
| |
| o This code is under continuing development. Go to http://codesign.llnl.gov |
| to acquire the latest released version. |
| |
| o This loop suite is designed to measure performance for a variety of loops |
| using different compilers and platforms. In particular, the suite |
| helps to understand compiler optimization, run-time performance issues, |
| and platform capabilities. The suite is also useful as a source of |
| example code snippets for interactions with compiler developers. |
| |
| o The loops in the suite are partitioned into three subsets based on their |
| origins (and also to avoid having them all in a single source file). Each |
| loop is implemented using multiple software constructs (i.e., referred |
| to herein as "variants"). The three loop subsets are: |
| |
| - Subset A: Loops representative of those found in application codes. |
| They are implemented in source files named runA<variant>Loops.cxx. |
| |
| - Subset B: Basic loops that help to illustrate compiler optimization |
| issues. They are implemented in source files named runB<variant>Loops.cxx |
| |
| - Subset C: Loops extracted from "Livermore Loops coded in C" developed by |
| Steve Langer, which were derived from the Fortran version by Frank |
| McMahon. They are implemented in source files runC<variant>Loops.cxx |
| |
| Please see the contents of the loop source files to understand the |
| differences among the variants. |
| |
| o New loops may be added to the suite by inserting them into appropriate |
| loop source files and modifying a few other files that control suite |
| execution and parametrization. Details are provided below. |
| |
| o Various parameters can be adjusted to control how loops are defined and run. |
| |
| -- Each loop may be run with different loop lengths (currently up to three |
| lengths for each loop) and will be sampled some number of times to |
| generate execution timing data. Loop length and sampling parameters |
| may be modified to evaluate different platform performance |
| characteristics. Details are provided below. |
| |
| o Various run time statistics can be generated for analysis. Currently, |
| these include: min run time, max run time, average run time, |
| standard deviation across run times, and average execution time relative |
| to a reference loop variant. Here, run time is the time required to |
| execute the loop for one "sampling" pass through the suite. See below. |
| |
| |
| -------------------------------------------------------------------------------- |
| Loop kernels and variants: |
| |
| o Each loop in the suite is defined by its traditional C/C++ for-loop |
| "kernel". Then, each loop appears in multiple variants that use different |
| programming and execution constructs. |
| |
| o Loops that emply traditional C/C++ for-loop syntax are referred to as |
| "Raw" variants. The "Raw" variant of each loop represents the version |
| obtained from its original source, plus minor modifications necessary |
| to plug into the loop suite framework. For example, the loops in the |
| runCRawLoops.cxx file are essentially verbatim from the Livermore Loops |
| Coded in C" suite mentioned above. Typically, the "Raw" loops serve as |
| reference implemenation for runtime comparisons. |
| |
| o Other variants use loop traversal C++ template methods and represent the |
| loop body as a lambda function or functor class. One of the main goals |
| of the suite is to assess how SIMD vectorization, OpenMP multithreading, |
| etc. work with these different loop implementation choices. |
| |
| Note that only a subset of the loops in the suite appear in the OpenMP |
| variants since many of the loops do not benefit from thread parallelism |
| due to OpenMP overheads. OpenMP loops are implmented in source files |
| named runOMP<variant>Loops.cxx; in particular, they are not broken out |
| into separate source files based on the subsets described above. |
| |
| o Although all loop bodies contain only C-syntax, the loop framework |
| uses C++ classes and templates. So a C++ compiler is required to compile |
| the code. All C++ compilers should be able to compile the framework |
| code and "Raw" loop variants. |
| |
| o Not all compilers implement the OpenMP standard. Thus, those loop variants |
| may not be compiled and run depending on the compiler being used. |
| |
| o The intent of the C++ lambda and functor loop variants is to evaluate |
| compilers in the context of C++ abstraction layers using template methods. |
| Not all compilers support standard C++ lambda expressions at this time. |
| Thus, the lambda variants of the loops may not be compiled and run |
| depending on the compiler being used. |
| |
| |
| ******************** Test Suite Note *********************** |
| * * |
| * Below is the original build instructions, the * |
| * test suite replaces this build system with the * |
| * llvm test-suite CMake system. The control of * |
| * loop suite and timing has been altered to use * |
| * the google benchmark library included in the * |
| * MicroBenchmarks directory of the llvm test-suite. * |
| * * |
| ************************************************************ |
| -------------------------------------------------------------------------------- |
| Compiling and running the loop suite: |
| |
| The loop suite is typically compiled by typing 'make' and then executed as |
| |
| ./lcals.exe <optional output directory> |
| |
| o The executable generated by the Makefile accepts an optional argument |
| which is the name of a directory for placing output files that contain |
| detailed timing, checksum, and FOM (when specified) results. Some of |
| these files provide a summary of loop suite performance. Othere |
| contain subsets of this information in comma-delimited text files that may |
| be imported into Microsoft Excel to generate spreadsheets and plots. |
| When no output directory is given, a summary of the results is printed |
| to standard output. |
| |
| o LCALS is highly parametrized to explore many compilation and execution |
| options. Exercising the full range of options can be achieved by making |
| straightforward modifications in a few files, as describe below: |
| |
| -- Makefile: This file contains a simple build system for the code. |
| It has a variety of configurations for current LLNL |
| computing systems. Building for other platforms or changing |
| any compiler options can done by modifying this file. |
| |
| -- LCALS_rules.mk: This file contains "-D" compilation options that |
| conrol some aspects of LCALS parametrization. The effect of |
| these options is described in the comments in this file. |
| It is also helpful to see how they are used in the |
| LCALSParams.hxx file. |
| |
| -- main.cxx: The main program determines many of the LCALS execution |
| options, such as which loops are run (kernels and variants). |
| |
| -- LCALSSuite.cxx: The routine defineLoopSuiteRunInfo() in this file |
| defines loop lengths and sampling parameters for each loop |
| in the suite. It also defines loop weights used in Figure |
| of Merit (FOM) calculations. |
| |
| -- LCALSSuite.hxx: This file contains '#define' preprocessor directives |
| that can be used to turn on/off compilation of individual |
| loop kernels and loop variants in the suite. This can be |
| helpful for generating assembly code in small doses. |
| |
| o Details on many of these items are given in the next section. |
| |
| |
| -------------------------------------------------------------------------------- |
| Controlling loop suite execution and timing output: |
| |
| o The execution of the loop suite follows the pattern described here: |
| |
| Iterate over specified number of passes through the loop suite { |
| |
| Iterate over specified loop variants to run { |
| |
| Iterate over loop lengths to run (e.g., long, medium, short) { |
| |
| Iterate over each loop specified to run { |
| |
| TIMER_START() |
| Iterate over specified number of samples (for loop and length) { |
| |
| Execute loop variant and length. |
| |
| } |
| TIMER_STOP() |
| |
| } // end iteration over loops to run |
| |
| } // end iteration over loop lengths |
| |
| } // end iteration over loop variants |
| |
| } // end iteration over suite passes |
| |
| o The loop suite is parametrized so that its execution may be controlled |
| by editing various items in a small number of source and header files |
| as described below: |
| |
| -- Set number of passes through the suite by setting the variable |
| 'num_suite_passes' in main.cxx. |
| |
| -- Set loop variants to run by adding the corresponding enumeration |
| constants to the vector 'run_variants' in main.cxx. To prevent a |
| variant from running, simply comment out the line which adds the |
| corresponding enum value to the vector. |
| |
| NOTE: The first entry added this array indicates the reference variant |
| for relative execution time statistics. |
| |
| NOTE: An additional argument may be given to the exectuable to run |
| loops outside of the standard LCASL benchmark. This requires |
| that "BUILD_MISC" is defined in the Makefile. |
| |
| -- Set which loop lengths to run by setting the appropriate entry in |
| the array 'run_loop_length' in main.cxx (true/false for each length). |
| |
| -- Set which loop kernels will run be setting entries in the array |
| 'run_loop' in main.cxx (true/false for each loop). |
| |
| -- The lengths and number of samples per pass for each loop are set |
| in the routine defineLoopSuiteRunInfo() in LCALSSuite.cxx. |
| |
| NOTE: The "samples per pass" values for each loop were determined |
| manually to give approximately 1 second of execution time for its |
| serial raw variant on an Intel ES-2670 node. To reduce or increase the |
| total suite execution time, or change the loop lengths used, change |
| the 'sample_frac' and/or 'loop_length_factor' variables in |
| main.cxx. All default loop lengths will be multiplied by the |
| loop_length_factor value. The sample count for each loop will be |
| multiplied by sample_frac/loop_length_factor. |
| |
| -- The "LoopKernelID" and "LoopLength" enumeration types in the file |
| LCALSSuite.hxx are used to identify loops and loop lengths |
| in the suite. Macros are also provided in that file to conditionally |
| compile each loop in the suite. |
| |
| The way in which the loops are compiled can influence execution times. |
| For example, some compilers perform optimizations for loops compiled |
| individually that they do not perform when the same loop is compiled as |
| part of a larger suite. |
| |
| o All loop forms use the same data arrays, which are pre-allocated based |
| on the loop lengths. To help with SIMD vectorization and ensure corretness |
| data arrays are allocated to be aligned width SIMD vector unit boundaries. |
| This can be changed by setting the 'LCALS_DATA_ALIGN' constant in the |
| file LCALSParams.hxx. |
| |
| o To minimize the effects of execution of each loop on the others, |
| data caches are flushed before each loop is run. |
| |
| -- Data cache size is set for some LLNL platforms based on hostname. |
| If unknown, a warning message will appear when loop suite is run. |
| Please edit main.cxx to set the largest data cache size for other |
| platforms. |
| |
| o A simple checksum mechanism is provided to verify that different variants |
| of each loop, and implementation changes made to individual loops, generate |
| the same numerical results. "-D" compiler options are provided in the |
| LCALS_rules.mk file to control this behavior. Note that certain levels |
| and types of compiler optimization will cause slight differences in |
| checksums due to changes in operation order, for example. Thus, the |
| checksums may only be a qualitative indicator of correct execution. |
| |
| -- Note that the routines loopInit() and loopFinalize() in LCALSSuite.cxx |
| initialize data and compute result checksums for each loop. These |
| must remain consistent with the data used in each loop for correctness. |
| |
| |
| o There are two mechanisms available to generate execution timing data for |
| loops in the suite. The choice is made by defining/undefining the |
| associated "-D" option in the LCALS_rules.mk file. See that file for |
| more information. |
| |
| |
| -------------------------------------------------------------------------------- |
| Figures of Merit: |
| |
| o The program output includes a Figure of Merit (FOM) value for each loop |
| variant and loop length that is executed. The intent of the FOM is to |
| complement execution timing data with another measure of performance and |
| compiler optimization. Using the FOM values and total loop suite execution |
| time information in the Figure of Merit report, one can compare different |
| compilers' abilities to optimize on a given platform, performance of |
| different optimization levels for a given compiler, or potential performance |
| of different architectures, etc. |
| |
| o In the FOM calculation, execution time for each loop is weighted by a |
| factor defined in the loop setup routines. The loops are partitioned into |
| six classes depending on their structure; e.g., data-parallel, order- |
| dependent, etc. The weight for each loop class indicates its relative |
| importance based on code constructs we want the suite to emphasize |
| and how easy we think it should be for a compiler to optimize. Each loop |
| in the suite is given a weight, w_i (i is the loop id), based on which |
| class it exists in. Loop classes and weights are defined in the file |
| LCALSSuite.cxx. |
| |
| o The FOM is calculated as follows. |
| |
| - Relative FOM (FOM_rel). The aim of the FOM_rel value is to measure |
| a compiler's ability to optimize different loop constructs. |
| |
| -- When the code is executed, a reference loop execution time, t_ref, is |
| computed using a loop that any compiler should be able to optimize |
| well and which should run faster than any loop in the suite. |
| To help insure this, two simple loops are run, an element-wise vector |
| product and a vector dot product. Then, t_ref is the minimum execution |
| time between the two. |
| |
| -- After the suite is run, FOM_rel is calulated as: |
| |
| FOM_rel = W * t_ref / Sum_i [ w_i * t_i ] |
| |
| The denominator is a weighted sum of execution times for the loops |
| that were run; t_i is the run time for loop i. W = Sum_i ( w_i ) is |
| the sum of loop weights. |
| |
| -- Note that FOM_rel is a dimensionless quantity that satisfies |
| 0 <= FOM_rel <= 1, and FOM_rel increases as loop execution times |
| decrease. In the ideal case, where each loop executes as fast as the |
| reference loop (which should be impossible), t_i = t_ref for each i. |
| So FOM_rel = 1. |