%==========================================================================
%==========================================================================
Code Description
A. General description:
SMG2000 is a parallel semicoarsening multigrid solver for the linear
systems arising from finite difference, finite volume, or finite
element discretizations of the diffusion equation,
- \nabla \cdot ( D \nabla u ) + \sigma u = f
on logically rectangular grids. The code solves both 2D and 3D
problems with discretization stencils of up to 9 points in 2D and up to
27 points in 3D. See the following paper for details on the algorithm
and its parallel implementation/performance:
P. N. Brown, R. D. Falgout, and J. E. Jones,
"Semicoarsening multigrid on distributed memory machines",
SIAM Journal on Scientific Computing, 21 (2000), pp. 1823-1834.
Also available as LLNL technical report UCRL-JC-130720.
The driver provided with SMG2000 builds linear systems for the special
case of the above equation,
- cx u_xx - cy u_yy - cz u_zz = (1/h)^2 , (in 3D)
- cx u_xx - cy u_yy = (1/h)^2 , (in 2D)
with Dirichlet boundary conditions of u = 0, where h is the mesh
spacing in each direction. Standard finite differences are used to
discretize the equations, yielding 5-pt. and 7-pt. stencils in 2D and
3D, respectively.
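As an illustration (a sketch of the discretization, not text taken from
the code or its documentation), applying central differences to the 3D
equation above on a grid with spacing h gives, at each interior point, a
7-pt. stencil with entries
   center coefficient:       2*(cx + cy + cz) / h^2
   west/east coefficients:   -cx / h^2
   south/north coefficients: -cy / h^2
   bottom/top coefficients:  -cz / h^2
and a right-hand side value of (1/h)^2; dropping the cz terms gives the
corresponding 5-pt. stencil in 2D.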
To determine when the solver has converged, the driver currently uses
the relative-residual stopping criterion,
||r_k||_2 / ||b||_2 < tol
with tol = 10^-6.
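Expressed in C, the test is just a comparison of two norms; the sketch
below uses hypothetical variable names (r_norm and b_norm for ||r_k||_2
and ||b||_2) rather than code from the driver:
   double tol = 1.0e-6;
   if (r_norm / b_norm < tol)
   {
      /* converged: stop iterating */
   }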
This solver can serve as a key component for achieving scalability in
radiation diffusion simulations.
B. Coding:
SMG2000 is written in ISO-C. It is an SPMD code which uses MPI.
Parallelism is achieved by data decomposition. The driver provided
with SMG2000 achieves this decomposition by simply subdividing the
grid into logical P x Q x R (in 3D) chunks of equal size.
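As an illustration, a minimal sketch of such a decomposition is given
below. The function and variable names are hypothetical (this is not
code from the driver); it maps an MPI rank to its position (p, q, r) in
a logical P x Q x R topology and computes the index extents of that
rank's local nx x ny x nz chunk of the global grid:
   /* Hypothetical sketch of a P x Q x R data decomposition. */
   void chunk_extents(int rank, int P, int Q, int R,
                      int nx, int ny, int nz,
                      int ilower[3], int iupper[3])
   {
      int p = rank % P;             /* fastest-varying direction */
      int q = (rank / P) % Q;
      int r = rank / (P * Q);       /* slowest-varying direction */
      ilower[0] = p * nx;   iupper[0] = ilower[0] + nx - 1;
      ilower[1] = q * ny;   iupper[1] = ilower[1] + ny - 1;
      ilower[2] = r * nz;   iupper[2] = ilower[2] + nz - 1;
   }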
C. Parallelism:
SMG2000 is a highly synchronous code. The communication and
computation patterns exhibit the surface-to-volume relationship
common to many parallel scientific codes. Hence, parallel efficiency
is largely determined by the size of the data "chunks" mentioned
above, and the speed of communications and computations on the
machine. SMG2000 is also memory-access bound, doing only about 1-2
computations per memory access, so memory-access speeds will also have
a large impact on performance.
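As a rough example of the surface-to-volume effect (an estimate, not a
measurement), an n x n x n chunk exchanges on the order of 6*n^2 ghost
values per sweep while performing on the order of n^3 local updates, so
the communication-to-computation ratio behaves like 6/n; for the
35x35x35 chunks used in the scaling example later in this document,
that ratio is roughly 17 percent.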
%==========================================================================
%==========================================================================
Files in this Distribution
NOTE: The SMG2000 code is derived directly from the hypre library, a large
linear solver library that is being developed in the Center for Applied
Scientific Computing (CASC) at LLNL.
In the smg2000 directory the following files are included:
COPYRIGHT_and_DISCLAIMER
HYPRE_config.h
Makefile
Makefile.include
The following subdirectories are also included:
docs
krylov
struct_ls
struct_mv
test
utilities
In the 'docs' directory the following files are included:
smg2000.readme
In the 'krylov' directory the following files are included:
HYPRE_pcg.c
Makefile
krylov.h
pcg.c
In the 'struct_ls' directory the following files are included:
HYPRE_struct_ls.h
HYPRE_struct_pcg.c
HYPRE_struct_smg.c
Makefile
coarsen.c
cyclic_reduction.c
general.c
headers.h
pcg_struct.c
point_relax.c
semi_interp.c
semi_restrict.c
smg.c
smg.h
smg2_setup_rap.c
smg3_setup_rap.c
smg_axpy.c
smg_relax.c
smg_residual.c
smg_setup.c
smg_setup_interp.c
smg_setup_rap.c
smg_setup_restrict.c
smg_solve.c
struct_ls.h
In the 'struct_mv' directory the following files are included:
HYPRE_struct_grid.c
HYPRE_struct_matrix.c
HYPRE_struct_mv.h
HYPRE_struct_stencil.c
HYPRE_struct_vector.c
Makefile
box.c
box_algebra.c
box_alloc.c
box_neighbors.c
communication.c
communication_info.c
computation.c
grow.c
headers.h
hypre_box_smp_forloop.h
project.c
struct_axpy.c
struct_copy.c
struct_grid.c
struct_innerprod.c
struct_io.c
struct_matrix.c
struct_matrix_mask.c
struct_matvec.c
struct_mv.h
struct_scale.c
struct_stencil.c
struct_vector.c
In the 'test' directory the following files are included:
Makefile
smg2000.c
In the 'utilities' directory the following files are included:
HYPRE_utilities.h
Makefile
general.h
hypre_smp_forloop.h
memory.c
memory.h
mpistubs.c
mpistubs.h
random.c
threading.c
threading.h
timer.c
timing.c
timing.h
utilities.h
version
%==========================================================================
%==========================================================================
Building the Code
SMG2000 uses a simple Makefile system for building the code. All
compiler and link options are set by modifying the file
'smg2000/Makefile.include' appropriately. This file is then included
in each of the following makefiles:
krylov/Makefile
struct_ls/Makefile
struct_mv/Makefile
test/Makefile
utilities/Makefile
To build the code, first modify the 'Makefile.include' file
appropriately, then type (in the smg2000 directory)
make
Other available targets are
make clean (deletes .o files)
make veryclean (deletes .o files, libraries, and executables)
To configure the code to run with:
1 - OpenMP only, add '-DHYPRE_USING_OPENMP -DHYPRE_SEQUENTIAL' to
the 'INCLUDE_CFLAGS' line in the 'Makefile.include' file and
use a valid OpenMP compiler.
2 - OpenMP with MPI, add '-DHYPRE_USING_OPENMP -DTIMER_USE_MPI'
to the 'INCLUDE_CFLAGS' line in the 'Makefile.include' file
and use a valid OpenMP compiler and MPI library.
3 - MPI only, add '-DTIMER_USE_MPI' to the 'INCLUDE_CFLAGS' line
in the 'Makefile.include' file and use a valid MPI library
(see the example below).
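For example (illustrative values only; the actual compiler name and
optimization flags depend on your system), an MPI-only build might set
the line to
   INCLUDE_CFLAGS = -O2 -DTIMER_USE_MPI
while an OpenMP-with-MPI build would also add -DHYPRE_USING_OPENMP and
the compiler's OpenMP flag to the same line.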
%==========================================================================
%==========================================================================
Optimization and Improvement Challenges
This code is memory-access bound. We believe it would be very
difficult to obtain "good" cache reuse with an optimized version of
the code.
%==========================================================================
%==========================================================================
Parallelism and Scalability Expectations
SMG2000 has been run on the following platforms:
Blue-Pacific - up to 1000 procs
Red - up to 3150 procs
Compaq cluster - up to 64 procs
Sun Sparc Ultra 10's - up to 4 machines
Consider increasing both problem size and number of processors in tandem.
On scalable architectures, time-to-solution for SMG2000 will initially
increase, then level off at a modest number of processors, remaining
roughly constant for larger numbers of processors. Iteration counts
will also increase slightly for small to modest-sized problems, then
level off at a roughly constant number for larger problem sizes.
For example, we get the following results for a 3D problem with
cx = 0.1, cy = 1.0, and cz = 10.0, for a problem distributed on
a logical P x Q x R processor topology, with fixed local problem
size per processor given as 35x35x35:
"P x Q x R" P "iters" "setup time" "solve time"
1x1x1 1 6 1.681680 23.255241
2x2x2 8 6 3.738600 32.262907
3x3x3 27 6 6.601194 41.341892
6x6x6 216 7 12.310776 46.672215
8x8x8 512 7 18.968893 50.051737
10x10x10 1000 7 18.890876 54.094806
14x15x15 3150 8 30.635085 62.725305
These results were obtained on ASCI Red.
%==========================================================================
%==========================================================================
Running the Code
The driver for SMG2000 is called `smg2000', and is located in the
smg2000/test subdirectory. Type
mpirun -np 1 smg2000 -help
to get usage information. This prints out the following:
Usage: .../smg2000/test/smg2000 [<options>]
-n <nx> <ny> <nz> : problem size per block
-P <Px> <Py> <Pz> : processor topology
-b <bx> <by> <bz> : blocking per processor
-c <cx> <cy> <cz> : diffusion coefficients
-v <n_pre> <n_post> : number of pre and post relaxations
-d <dim> : problem dimension (2 or 3)
-solver <ID> : solver ID (default = 0)
0 - SMG
1 - CG with SMG precond
2 - CG with diagonal scaling
3 - CG
All of the arguments are optional. The most important options for the
SMG2000 compact application are the `-n' and `-P' options. The `-n'
option allows one to specify the local problem size per MPI process,
and the `-P' option specifies the process topology on which to run.
The global problem size will be <Px>*<nx> by <Py>*<ny> by <Pz>*<nz>.
When running with OpenMP, the number of threads used per MPI process
is controlled via the OMP_NUM_THREADS environment variable.
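For example (an illustrative command line; shell syntax and the mpirun
launcher depend on your installation), solving a 70 x 70 x 70 global
problem on 8 MPI processes arranged in a 2 x 2 x 2 topology, with 4
OpenMP threads per process, might look like
   export OMP_NUM_THREADS=4
   mpirun -np 8 smg2000 -n 35 35 35 -P 2 2 2
since the global size in each direction is <Px>*<nx> = 2*35 = 70.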
%==========================================================================
%==========================================================================
Timing Issues
If using MPI, the whole code is timed using the MPI timers. If not using
MPI, standard system timers are used. Timing results are printed to
standard out, and are divided into "Setup Phase" times and "Solve Phase"
times. Timings for a few individual routines are also printed out.
%==========================================================================
%==========================================================================
Memory Needed
SMG2000 is a memory intensive code, and its memory needs are somewhat
complicated to describe. For the 3D problems discussed in this
document, memory requirements are roughly 54 times the local problem
size, times the size of a double, plus some overhead for storing ghost
points and other auxiliary data. The overhead required by this version
of the SMG code grows essentially like the logarithm of the problem size.
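For example (a rough estimate only), a 35x35x35 local problem contains
42,875 grid points, so the rule of thumb above gives about
54 * 42,875 * 8 bytes, or roughly 18.5 MB per process, before the
logarithmic overhead is counted.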
%==========================================================================
%==========================================================================
About the Data
SMG2000 does not read in any data files; all control is through
arguments on the execute line.
%==========================================================================
%==========================================================================
Expected Results
Consider the following run:
mpirun -np 1 smg2000 -n 12 12 12 -c 2.0 3.0 40
This is what SMG2000 prints out:
Running with these driver parameters:
(nx, ny, nz) = (12, 12, 12)
(Px, Py, Pz) = (1, 1, 1)
(bx, by, bz) = (1, 1, 1)
(cx, cy, cz) = (2.000000, 3.000000, 40.000000)
(n_pre, n_post) = (1, 1)
dim = 3
solver ID = 0
=============================================
Struct Interface:
=============================================
Struct Interface:
wall clock time = 0.005627 seconds
cpu clock time = 0.010000 seconds
=============================================
Setup phase times:
=============================================
SMG Setup:
wall clock time = 0.330096 seconds
cpu clock time = 0.330000 seconds
=============================================
Solve phase times:
=============================================
SMG Solve:
wall clock time = 0.686244 seconds
cpu clock time = 0.480000 seconds
Iterations = 4
Final Relative Residual Norm = 8.972097e-07
The relative residual norm may differ slightly from machine to machine
or compiler to compiler, but should only differ very slightly (say,
the 6th or 7th decimal place). Also, the code should generate nearly
identical results for a given problem, independent of the data
distribution. The only part of the code that does not guarantee
bitwise identical results is the inner product used to compute norms.
In practice, the above residual norm has remained the same.
%==========================================================================
%==========================================================================
Release and Modification Record
LLNL code release number: UCRL-CODE-2000-022
(c) 2000 The Regents of the University of California
See the file COPYRIGHT_and_DISCLAIMER for a complete copyright notice,
contact person, and disclaimer.