14. Running in Parallel
If you are running a simulation on a computer with multiple cores, multiple sockets, or multiple nodes (i.e., a cluster), you can benefit from the fact that OpenMC is able to use all available hardware resources if configured correctly. OpenMC is capable of using both distributed-memory (MPI) and shared-memory (OpenMP) parallelism. If you are on a single-socket workstation or a laptop, using shared-memory parallelism is likely sufficient. On a multi-socket node, cluster, or supercomputer, chances are you will need to use both distributed-memory (across nodes) and shared-memory (within a single node) parallelism.
14.2. Distributed-Memory Parallelism (MPI)
MPI defines a library specification for message-passing between processes. There are two major implementations of MPI, OpenMPI and MPICH. Both implementations are known to work with OpenMC; there is no obvious reason to prefer one over the other. Building OpenMC with support for MPI requires that you have one of these implementations installed on your system. For instructions on obtaining MPI, see Prerequisites. Once you have an MPI implementation installed, compile OpenMC following Specifying the Build Type.
To run a simulation using MPI, openmc needs to be called using the mpiexec wrapper. For example, to run OpenMC using 32 processes:
mpiexec -n 32 openmc
The same thing can be achieved from the Python API by supplying the mpi_args argument to openmc.run():
openmc.run(mpi_args=['mpiexec', '-n', '32'])
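If OpenMC was also built with OpenMP support, both forms of parallelism can be combined in a single call. The line below is a minimal sketch rather than a recommendation: the counts (8 MPI processes, 4 OpenMP threads each) are illustrative, and the thread count is passed through the threads argument of openmc.run():

openmc.run(threads=4, mpi_args=['mpiexec', '-n', '8'])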
14.3. Maximizing Performance
There are a number of things you can do to ensure that you obtain optimal performance on a machine when running in parallel (a sketch combining these tips follows the list):
Use OpenMP within each NUMA node. Some large server processors have so many cores that the last-level cache is split to reduce memory latency. For example, the Intel Xeon Haswell-EP architecture uses a snoop mode called cluster-on-die in which the L3 cache is split in half. Thus, in general, you should use one MPI process per socket (and OpenMP within each socket), but for these large processors you will want to go one step further and use one process per NUMA node. The Xeon Phi Knights Landing architecture uses a similar concept called sub-NUMA clustering.
Use a sufficiently large number of particles per generation. Between fission generations, a number of synchronization tasks take place. If the number of particles per generation is too low and you are using many processes/threads, the synchronization time may become non-negligible.
Use hardware threading if available.
Use process binding. When running with MPI, you should ensure that processes are bound to a specific hardware region. This can be set using the -bind-to (MPICH) or --bind-to (OpenMPI) option to mpiexec.
Turn off generation of tallies.out. For large simulations with millions of tally bins or more, generating this ASCII file might consume considerable time. You can turn off generation of tallies.out via the Settings.output attribute:

settings = openmc.Settings()
settings.output = {'tallies': False}
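Putting these tips together, the following is a rough sketch for a hypothetical node with four NUMA domains of 16 cores each; the domain count, core count, particle count, and binding option are assumptions to be adjusted for your machine and MPI implementation:

import openmc

# Assumed machine: 4 NUMA domains with 16 cores each (adjust to your hardware).
settings = openmc.Settings()
settings.particles = 1_000_000          # a large generation size keeps synchronization overhead small
settings.output = {'tallies': False}    # skip writing the ASCII tallies.out file
settings.export_to_xml()

# One MPI process per NUMA domain, bound to its domain (--bind-to numa is the
# OpenMPI spelling; MPICH uses -bind-to numa), with one OpenMP thread per core.
openmc.run(threads=16, mpi_args=['mpiexec', '-n', '4', '--bind-to', 'numa'])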