SPhot - Monte Carlo Transport Code
Benchmark-specific Instructions and Constraints
------------------------------------------------------------------------
Contents
* Expected Results
* Optimization and Improvement Challenges
* Parallelism and Scalability Expectations
* Benchmark Specific Instructions and Constraints
* Release and Modification Record
------------------------------------------------------------------------
Expected Results
Please see the Summary writeup and the Sphot Sequoia Specific Benchmark
tests for information about expected results and the reporting of data.
------------------------------------------------------------------------
Optimization and Improvement Challenges
Performance improvement and optimizations for this code fall into the
following categories:
1. *MPI Related*
Very little data is actually transferred between MPI tasks: each
task has its own copy of the mesh, which will typically fit in main
memory or even in cache. Most of the inter-task data transfer exists
to collect timing results and involves very little data. Several
MPI_Barrier calls are also used for synchronization.
Improving MPI-related performance would involve minimizing or
removing the MPI_Barrier calls and finding a way to minimize the
exchange of inter-task timing statistics. With regard to the
timing statistics, task 0 is a clear bottleneck, since it must act
as the "master" task and communicate with all other tasks. This
bottleneck may pose a challenge as the problem is scaled up to
hundreds or thousands of MPI tasks. (A sketch of one way to collect
the timing data collectively appears after this list.)
A very important, platform-dependent optimization consideration
for this code is determining the optimal number of MPI tasks to
use. For example, a cluster of 32-processor SMP machines may
perform best with 4 MPI tasks (each running 8 OpenMP threads) per
SMP machine. The best configuration is usually found by
experimentation on a given platform.
2. *OpenMP Related*
The current implementation does not make use of THREADPRIVATE
common blocks. Instead, all threads access certain COMMON block
variables globally, and that access occurs within the
computational core routine (execute.f) of the code. For SMPs with
few processors this does not appear to pose a performance problem,
but it is anticipated that it may on many-processor SMP machines.
Improving OpenMP performance would involve finding a way to
implement THREADPRIVATE common blocks; currently, the code
produces "wrong" results when this is attempted. (A sketch of the
directive form appears after this list.)
As with MPI, determining the optimal number of OpenMP threads to
use is a very important, platform-dependent optimization
consideration.
3. *Computation Kernel Related*
A single "run" of SPhot requires very little wall-clock or CPU
time--certainly much less than a minute on any current platform.
Given the small problem (grid) size and the embarrassingly
parallel nature of this code, performance gains in this category
would be minimal at best.
For those interested in improving the performance of the
computationally significant routines, the following might be
evaluated:
* execute.f
This routine is responsible for over 50% of the total
execution time. Profiling has shown that most of the cycles
are distributed over a fairly large number of lines that
perform simple arithmetic operations (multiplies, adds,
divides), variable assignments (load/store), and IF
condition testing.
* pranf.f
This routine accounts for approximately 10% of the total
execution time. It is a very small routine in which virtually
all of the execution time can be attributed to a single line
of code, shown below (a sketch of one commonly tried
transformation of this line appears after this list):
RandNum = float( Seed( 4 ) ) / Divisor( 4 ) +
1 float( Seed( 3 ) ) / Divisor( 3 ) +
2 float( Seed( 2 ) ) / Divisor( 2 ) +
3 float( Seed( 1 ) ) / Divisor( 1 )
* ranfmodmult.f
This routine accounts for approximately 10% of the total
execution time. It is another trivially small routine; the
execution time is more or less evenly distributed over its
fewer than 20 lines of executable code, which mostly perform
simple arithmetic and load/store operations.
* sqrt intrinsic function
This will vary according to platform. On one platform tested
using the native Fortran compiler, approximately 10% of the
total execution time was attributed to this intrinsic function.
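As referenced in item 1 above, the following is a minimal sketch (not
SPhot code) of a collective approach to the timing-statistics
exchange. It assumes each task holds its elapsed time in a local
double precision variable; a single MPI_Gather delivers every task's
value to task 0, and the collective itself provides the needed
synchronization, so no explicit MPI_Barrier is required.

      program gathertimes
      implicit none
      include 'mpif.h'
      integer ierr, rank, ntasks, i
      double precision mytime
      double precision, allocatable :: alltimes(:)

      call MPI_Init( ierr )
      call MPI_Comm_rank( MPI_COMM_WORLD, rank, ierr )
      call MPI_Comm_size( MPI_COMM_WORLD, ntasks, ierr )
      allocate( alltimes( ntasks ) )

c     Stand-in for the per-task timing measurement.
      mytime = MPI_Wtime()
c     ... computational work would go here ...
      mytime = MPI_Wtime() - mytime

c     One collective replaces point-to-point exchanges with
c     task 0; no explicit MPI_Barrier is needed.
      call MPI_Gather( mytime, 1, MPI_DOUBLE_PRECISION,
     &                 alltimes, 1, MPI_DOUBLE_PRECISION,
     &                 0, MPI_COMM_WORLD, ierr )

      if ( rank .eq. 0 ) then
         do i = 1, ntasks
            write(*,*) 'task', i-1, ' time (s):', alltimes(i)
         end do
      end if
      call MPI_Finalize( ierr )
      end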
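For item 2, the directive form for THREADPRIVATE common blocks is
sketched below. The common block /photon/ and its contents are
hypothetical stand-ins for the blocks actually used in execute.f and
the include files; as noted above, applying this change to SPhot
currently produces "wrong" results, so any attempt would require
careful verification.

      subroutine track( nphotons )
      implicit none
      integer nphotons, i
      double precision seed( 4 )
      common / photon / seed
c$omp threadprivate(/photon/)

c     COPYIN initializes each thread's private copy of /photon/
c     from the master thread's values on entry to the region.
c$omp parallel do copyin(/photon/)
      do i = 1, nphotons
c        ... per-photon work reads and updates the thread-local
c        copy of seed, with no sharing between threads ...
      end do
c$omp end parallel do
      return
      end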
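For the pranf.f line quoted in item 3, one commonly tried
transformation is sketched below: precompute reciprocals of the four
Divisor entries once, so the hot line performs multiplies instead of
divides. The declared types and the name RandInv are assumptions, and
multiplying by a reciprocal can perturb the low-order bits of the
random stream, so results would need to be re-verified against the
benchmark's expected output.

      subroutine ranfsketch( Seed, Divisor, RandNum )
      implicit none
      integer i
      integer Seed( 4 )
      double precision Divisor( 4 ), RandNum
      double precision RandInv( 4 )
      logical first
      save RandInv, first
      data first / .true. /

c     Precompute the reciprocals once; assumes Divisor is
c     fixed after initialization, as its name suggests.
      if ( first ) then
         do i = 1, 4
            RandInv( i ) = 1.0d0 / Divisor( i )
         end do
         first = .false.
      end if

c     The four divides in the original hot line become multiplies.
      RandNum = float( Seed( 4 ) ) * RandInv( 4 ) +
     &          float( Seed( 3 ) ) * RandInv( 3 ) +
     &          float( Seed( 2 ) ) * RandInv( 2 ) +
     &          float( Seed( 1 ) ) * RandInv( 1 )
      return
      end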
------------------------------------------------------------------------
Parallelism and Scalability Expectations
Because SPhot is trivially parallel, it might be expected that this code
should scale "perfectly" to very large processor counts. Additionally,
I/O operations have been replaced with MPI communications to eliminate
scalability problems common to many I/O systems.
When SPhot is executed in hybrid mode (MPI with OpenMP), the
performance of both shared memory and distributed memory
communications hardware and software will be exercised. CPU time
per "run" should remain relatively constant, since runs are
intended to map one-to-one onto CPUs; the number of runs that can
be done in parallel therefore increases with the number of
available CPUs. Perfect scalability is evidenced by an efficiency
of 1.00: no additional time is required to compute N runs on N CPUs.
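Stated as a formula (an illustrative reading of the metric above, not
notation taken from the benchmark reports): if t(1) is the wall-clock
time for one run on one CPU and t(N) is the wall-clock time for N
runs on N CPUs, then efficiency E = t(1) / t(N), and perfect scaling
gives E = 1.00. For example, if one run takes 40 seconds and 1024
runs on 1024 CPUs take 50 seconds, E = 40 / 50 = 0.80.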
Realistically, the scalability of SPhot may be affected by the MPI
and OpenMP related factors discussed under the Optimization and
Improvement Challenges section. As indicated there, the
application's MPI communications include barriers and a potential
task 0 bottleneck. The OpenMP sharing of global COMMON block data
within the computation loop may also degrade performance on
many-processor SMP machines.
------------------------------------------------------------------------
Benchmark Specific Instructions and Constraints
1. *Code Modifications*
For the purposes of the ASC benchmarks, modifications to this
application's source code are not permitted unless such
modifications are:
* Needed to modify certain default parameters as discussed in
the Building the Code section.
* Minor and not intended for optimization purposes. Permitted
modifications include those required to overcome
platform-specific obstacles to building or running the code.
Such modifications should be documented and reported to the
ASC benchmark point of contact.
2. *Modifications to Input File Parameters*
As discussed in the Running the Code
section, the *Nruns* parameter will almost invariably need to be
modified whenever the number of MPI tasks / OpenMP threads used
in an execution changes. This input parameter and the output
print flag are the only two parameters that may be modified.
Note that Nruns must be at least equal to, and evenly divisible
by, the (Number of MPI tasks * Number of OpenMP threads) used.
For example, with 4 MPI tasks each running 8 OpenMP threads (32
CPUs total), Nruns must be a multiple of 32. Nruns has a default
maximum value of 10001. If this default is too small, consult the
Building the Code section for instructions on how to increase it.
(A sketch of the divisibility check appears after this list.)
3. *Compiler Generated Optimizations*
Benchmarkers are welcome and encouraged to employ
compiler-generated optimizations when building the code.
Specifying these optimizations should not require modification to
the source code; instead, they should take the form of common
compiler "flags". These optimizations may be specified in the two
Makefiles provided for that purpose. See Building the Code
for additional information.
4. *Maximum Number of MPI Tasks and OpenMP Threads*
For the purpose of array dimensioning, particularly for the
collection of timing statistics, maximum values for these two
parameters are defined in the file includes/times.inc. The
distribution software sets these as follows:
PARAMETER( maxMPItasks = 16384 )
PARAMETER( maxThreadsPerMPItask = 128 )
If these values need to be increased, see the Building the Code
section for instructions.
5. *Required Problems*
Please see the Sphot Sequoia Specific Problem set for the problems
to be run, the results to be reported, and the description of the
calculation of the Figure of Merit (FOM) for the runs.
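The Nruns constraint in item 2 above can be expressed as a small
stand-alone check. This is an illustrative sketch only; the variable
names and the example configuration of 4 MPI tasks with 8 OpenMP
threads each are not SPhot's own.

      program checknruns
      implicit none
      integer nruns, ntasks, nthreads, ncpus
      parameter ( ntasks = 4, nthreads = 8, nruns = 128 )

c     Total CPUs is the product of MPI tasks and OpenMP threads.
      ncpus = ntasks * nthreads

c     Nruns must be at least ncpus and evenly divisible by it.
      if ( nruns .ge. ncpus .and. mod( nruns, ncpus ) .eq. 0 ) then
         write(*,*) 'Nruns =', nruns, ' is valid for', ncpus, ' CPUs'
      else
         write(*,*) 'Nruns must be >=', ncpus,
     &              ' and evenly divisible by it'
      end if
      end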
------------------------------------------------------------------------
Release and Modification Record
Version 1.0, this version. No release required. Public domain software.
------------------------------------------------------------------------
For more information about this page, contact:
Tom Spelce <spelce1@llnl.gov>
*UCRL-MI-144211*
September 19, 2001