NOTE: These doc pages are still incomplete as of 1Jun11.
NOTE: The USER-CUDA package discussed below has not yet been officially released in LAMMPS.
Accelerated versions of various pair styles, fixes, computes, and other commands have been added to LAMMPS. These will typically run faster than the standard non-accelerated versions if you have the appropriate hardware on your system.
The accelerated styles have the same name as the standard styles, except that a suffix is appended. Otherwise, the syntax for the command is identical, their functionality is the same, and the numerical results they produce should also be identical, except for precision and round-off issues.
For example, all of these variants of the basic Lennard-Jones pair style exist in LAMMPS:
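pair_style lj/cut
pair_style lj/cut/opt
pair_style lj/cut/gpu
pair_style lj/cut/cuda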
Assuming you have built LAMMPS with the appropriate package, these styles can be invoked by specifying them explicitly in your input script. Or you can use the -suffix command-line switch to invoke the accelerated versions automatically, without changing your input script. The suffix command also allows you to set a suffix and to turn the command-line switch setting off/on within your input script.
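As a sketch of how the suffix command interacts with style commands (the Lennard-Jones styles shown are just the examples above; adapt them to your build):

suffix gpu
pair_style lj/cut 2.5        # runs as lj/cut/gpu
suffix off
pair_style lj/cut 2.5        # runs as the standard lj/cut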
Styles with an "opt" suffix are part of the OPT package and typically speed-up the pairwise calculations of your simulation by 5-25%.
Styles with a "gpu" or "cuda" suffix are part of the GPU or USER-CUDA packages, and can be run on NVIDIA GPUs associated with your CPUs. The speed-up due to GPU usage depends on a variety of factors, as discussed below.
To see what styles are currently available in each of the accelerated packages, see this section of the manual. A list of accelerated styles is included in the pair, fix, compute, and kspace sections.
The following sections explain how to build and use each of these accelerated packages.
The final section compares and contrasts the GPU and USER-CUDA packages, since they are both designed to use NVIDIA GPU hardware.
10.1 OPT package

The OPT package was developed by James Fischer (High Performance Technologies), David Richie and Vincent Natoli (Stone Ridge Technologies). It contains a handful of pair styles whose compute() methods were rewritten in C++ templated form to reduce the overhead due to if tests and other conditional code.
The procedure for building LAMMPS with the OPT package is simple. It is the same as for any other package which has no additional library dependencies:
make yes-opt
make machine
If your input script uses one of the OPT pair styles, you can run it as follows:
lmp_machine -sf opt < in.script
mpirun -np 4 lmp_machine -sf opt < in.script
You should see a reduction in the "Pair time" printed out at the end of the run. On most machines and problems, this will typically be a 5 to 20% savings.
10.2 GPU package

The GPU package was developed by Mike Brown at ORNL.
Additional requirements in your input script to run the styles with a gpu suffix are as follows:
The newton pair setting must be off and the fix gpu command must be used. The fix controls the GPU selection and initialization steps.
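As a minimal illustration of these requirements (the device IDs, split value, and pair style are assumptions for a node with a single GPU; the fix arguments are explained below):

newton off                          # turn the newton pair setting off
fix 0 all gpu force/neigh 0 0 1.0   # single GPU (ID 0), all pair force work on the GPU
pair_style lj/cut/gpu 2.5           # a GPU-enabled pair style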
A few LAMMPS pair styles can be run on graphics processing units (GPUs). We plan to add more over time. Currently, they only support NVIDIA GPU cards. To use them you need to install the appropriate NVIDIA CUDA software (driver and toolkit) on your system and build the GPU library in lammps/lib/gpu, as described in the lib/gpu/README file.
When using GPUs, you are restricted to one physical GPU per LAMMPS process. Multiple processes can share a single GPU and in many cases it will be more efficient to run with multiple processes per GPU. Any GPU accelerated style requires that fix gpu be used in the input script to select and initialize the GPUs. The format for the fix is:
fix name all gpu mode first last split
where name is the name for the fix. The gpu fix must be the first fix specified for a given run, otherwise the program will exit with an error. The gpu fix will not have any effect on runs that do not use GPU acceleration; there should be no problem with specifying the fix first in any input script.
mode can be either "force" or "force/neigh". In the former, the neighbor list calculation is performed on the CPU using the standard LAMMPS routines. In the latter, the neighbor list calculation is performed on the GPU. The GPU neighbor list can be used for better performance; however, it cannot be used with a triclinic box or with hybrid pair styles.
There are cases when it might be more efficient to select the CPU for neighbor list builds. If a non-GPU enabled style requires a neighbor list, it will also be built using CPU routines. Redundant CPU and GPU neighbor list calculations will typically be less efficient.
first is the ID (as reported by lammps/lib/gpu/nvc_get_devices) of the first GPU that will be used on each node. last is the ID of the last GPU that will be used on each node. If you have only one GPU per node, first and last will typically both be 0. Selecting a non-sequential set of GPU IDs (e.g. 0,1,3) is not currently supported.
split is the fraction of particles whose forces, torques, energies, and/or virials will be calculated on the GPU. This can be used to perform CPU and GPU force calculations simultaneously. If split is negative, the software will attempt to calculate the optimal fraction automatically every 25 timesteps based on CPU and GPU timings. Because the GPU speedups are dependent on the number of particles, automatic calculation of the split can be less efficient, but typically results in loop times within 20% of an optimal fixed split.
If you have two GPUs per node, 8 CPU cores per node, and would like to run on 4 nodes with dynamic balancing of force calculation across CPU and GPU cores, the fix might be
fix 0 all gpu force/neigh 0 1 -1
with LAMMPS run on 32 processes. In this case, all CPU cores and GPU devices on the nodes would be utilized. Each GPU device would be shared by 4 CPU cores. The CPU cores would perform force calculations for some fraction of the particles at the same time the GPUs performed force calculation for the other particles.
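A corresponding launch command might look like the following (the executable name and MPI launcher are assumptions for your system; the -sf gpu switch can be omitted if the gpu styles are named explicitly in the input script):

mpirun -np 32 lmp_machine -sf gpu < in.script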
Because of the large number of cores on each GPU device, it might be more efficient to run on fewer processes per GPU when the number of particles per process is small (100's of particles); this can be necessary to keep the GPU cores busy.
In order to use GPU acceleration in LAMMPS, the fix gpu command must be used to initialize and configure the GPUs, and GPU-enabled styles must be selected in the input script. Currently, this is limited to a few pair styles and PPPM. Some GPU-enabled styles have additional restrictions listed in their documentation.
The GPU accelerated pair styles can be used to perform pair style force calculation on the GPU while other calculations are performed on the CPU. One method to do this is to specify a split in the gpu fix as described above. In this case, force calculation for the pair style will also be performed on the CPU.
When the CPU work in a GPU pair style has finished, the next force computation will begin, possibly before the GPU has finished. If split is 1.0 in the gpu fix, the next force computation will begin almost immediately. This can be used to run a hybrid GPU pair style at the same time as a hybrid CPU pair style. In this case, the GPU pair style should be first in the hybrid command in order to perform simultaneous calculations. This also allows bond, angle, dihedral, improper, and long-range force computations to be run simultaneously with the GPU pair style. Once all CPU force computations have completed, the gpu fix will block until the GPU has finished all work before continuing the run.
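As a hedged sketch of this simultaneous CPU/GPU setup (the specific styles, cutoffs, and accuracy are only illustrative; mode "force" is used because the GPU neighbor list does not support hybrid pair styles):

fix 0 all gpu force 0 0 1.0
pair_style hybrid/overlay lj/cut/gpu 2.5 coul/long 10.0
kspace_style pppm 1.0e-4

With split set to 1.0, the lj/cut/gpu computation runs on the GPU while coul/long, PPPM, and any bonded terms run simultaneously on the CPU.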
GPU accelerated pair styles can perform computations asynchronously with CPU computations. The "Pair" time reported by LAMMPS will be the maximum of the time required to complete the CPU pair style computations and the time required to complete the GPU pair style computations. Any time spent for GPU-enabled pair styles for computations that run simultaneously with bond, angle, dihedral, improper, and long-range calculations will not be included in the "Pair" time.
When mode for the gpu fix is force/neigh, the time for neighbor list calculations on the GPU will be added into the "Pair" time, not the "Neigh" time. A breakdown of the times required for various tasks on the GPU (data copy, neighbor calculations, force computations, etc.) is output only with the LAMMPS screen output at the end of each run. These timings represent the total time spent on the GPU for each routine, regardless of asynchronous CPU calculations.
See the lammps/lib/gpu/README file for instructions on how to build the LAMMPS gpu library for single, mixed, and double precision. The latter requires that your GPU card supports double precision.
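For reference, a typical build of that library might look like the following (the Makefile name here is an assumption; use the one from the README that matches your compiler and GPU architecture):

cd lib/gpu
make -f Makefile.linux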
10.3 USER-CUDA package

The USER-CUDA package was developed by Christian Trott at U Technology Ilmenau in Germany.
This package will only be of use to you if you have a CUDA-enabled NVIDIA(tm) graphics card. Your GPU needs to support Compute Capability 1.3. This list may help you find out the Compute Capability of your card:
http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units
Install the Nvidia Cuda Toolkit, version 3.2 or higher, and the corresponding GPU drivers. The Nvidia Cuda SDK is not required for LAMMPSCUDA, but we recommend installing it and making sure that the sample projects can be compiled without problems.
You should also be able to compile LAMMPS by typing
make YourMachine
inside the src directory of the LAMMPS root path. If not, consult the LAMMPS documentation.
If your CUDA toolkit is not installed in the default directory /usr/local/cuda, edit the file lib/cuda/Makefile.common accordingly.
Go to lib/cuda/ and type
make OPTIONS
where OPTIONS are one or more of the following:
The settings will be written to lib/cuda/Makefile.defaults. When compiling with make, only those settings will be used.
Go to src, install the USER-CUDA package with make yes-USER-CUDA, and compile the binary with make YourMachine. If you previously compiled without the USER-CUDA package using the same machine file, you might need to delete the old object files first (rm Obj_YourMachine/*). The full sequence is sketched below.
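Putting those steps together, the sequence from the LAMMPS root directory is (YourMachine is a placeholder for your machine makefile name):

cd src
make yes-USER-CUDA
rm -f Obj_YourMachine/*      # only needed if you previously built without USER-CUDA
make YourMachine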
CUDA versions of classes are only installed if the corresponding CPU versions are installed as well. For example, you need to install the KSPACE package to use pppm/cuda.
In order to make use of the GPU acceleration provided by the USER-CUDA package, you only have to add
accelerator cuda
at the top of your input script. See the accelerator command for details of additional options.
When compiling with USER-CUDA support, the -accelerator command-line switch is effectively set to "cuda" by default and does not have to be given. If you want to run simulations with the same binary without using the "cuda" styles, you need to turn it off explicitly by giving "-a none", "-a opt", or "-a gpu" as a command-line argument.
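For example (the executable name is a placeholder):

lmp_YourMachine -a none < in.script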
The kspace style pppm/cuda has to be requested explicitly.
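For example (the accuracy value is only illustrative):

kspace_style pppm/cuda 1.0e-4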
10.4 Comparison of the GPU and USER-CUDA packages

The USER-CUDA package is an alternative package for GPU acceleration that runs as much of the simulation as possible on the GPU. Depending on the simulation, this can provide a significant speedup when the number of atoms per GPU is large.
The styles available for GPU acceleration will be different in each package.
The main difference between the GPU and the USER-CUDA package is that while the latter aims at calculating everything on the device, the GPU package uses the device as an accelerator for the pair force, neighbor list, and PPPM calculations only. As a consequence, either package can be faster depending on the scenario. Generally, the GPU package is faster than the USER-CUDA package if the number of atoms per device is small. The GPU package also profits from oversubscribing devices; hence one usually wants to launch two (or more) MPI processes per device.
The exact crossover where the USER-CUDA package becomes faster depends strongly on the pair style. For example, for a simple Lennard-Jones system the crossover (in single precision) can often be found between 50,000 and 100,000 atoms per device. When performing double precision calculations, this threshold can be significantly smaller. As a result, the GPU package can show better "strong scaling" behaviour in comparison with the USER-CUDA package, as long as this limit of atoms per GPU is not reached.
Another scenario where the GPU package can be faster is when a lot of bonded interactions are calculated. Both packages handle those on the host while the device simultaneously calculates the pair forces. Since the GPU package launches several MPI processes per device, this work is spread over more CPU cores than when running the same simulation with the USER-CUDA package.
As a side note: the GPU package performance depends to some extent on optimal bandwidth between host and device. Hence its performance is reduced if fewer than the full 16 PCIe lanes are available for each device. In HPC environments this can be the case if S2050/70 servers are used, where two devices generally share one PCIe 2.0 16x slot. Many multi-GPU mainboards also do not provide full 16 lanes to each of their PCIe 2.0 16x slots.
While the GPU package uses considerably more device memory than the USER-CUDA package, this is generally not much of a problem. Typically, run times become longer than desired before the memory is exhausted.
Currently the USER-CUDA package supports a wider range of force fields. On the other hand, its performance is considerably reduced if a fix that is not yet available in a "CUDA"-accelerated version has to be invoked at every timestep.
In the end, for each simulation it is best to just try both packages and see which one performs better in your particular situation.
In the following, 4 benchmark systems which are supported by both the GPU and the USER-CUDA package are shown:
1. Lennard-Jones, 2.5 A: 256,000 atoms, 2.5 A cutoff, 0.844 density
2. Lennard-Jones, 5.0 A: 256,000 atoms, 5.0 A cutoff, 0.844 density
3. Rhodopsin model: 256,000 atoms, 10 A cutoff, Coulomb via PPPM
4. Lithium-Phosphate: 295,650 atoms, 15 A cutoff, Coulomb via PPPM
Hardware:

Workstation: 2x GTX 470, i7 950 @ 3 GHz, 24 GB DDR3 @ 1066 MHz, CentOS 5.5, CUDA 3.2, driver 260.19.12

eStella: 6 nodes, 2x C2050, 2x QDR Infiniband interconnect (aggregate bandwidth 80 GBps), Intel X5650 HexCore @ 2.67 GHz, SL 5.5, CUDA 3.2, driver 260.19.26

Keeneland: HP SL-390 (Ariston) cluster, 120 nodes, 2x Intel Westmere hex-core CPUs, 3x C2070s, QDR InfiniBand interconnect