Improving Performance by Vectorizing Particle-in-Cell Codes
A Practical Guide
The basic particle method is a well-established approach to simulating the behavior of charged particles interacting with each other through pairwise electromagnetic forces. At each step, the particle properties are updated according to these calculated forces. For applications on powerful modern supercomputers with deep cache hierarchies, a pure particle method is efficient with respect to both locality and arithmetic intensity (compute-bound).
Unfortunately, the O(N2) complexity makes a particle method impractical for plasma simulations using millions of particles per process. Instead of calculating O(N2) forces, the particle-in-cell (PIC) method uses a grid as the medium to calculate long-range electromagnetic forces. This reduces the complexity from O(N2) to O(N + M log M), where M is the number of grid points, generally much smaller than N. However, achieving high parallel and architectural efficiency is a significant challenge for PIC methods due to the gather/scatter nature of the algorithm.
Attaining performance becomes even more complex as HPC technology moves to the era of multi- and many-core architectures with increased thread and vector parallelism on shared memory processors. A deep understanding of how to improve the associated scalability will have a wide-ranging influence on numerous physical applications that use particle-mesh algorithms―including molecular dynamics, cosmology, accelerator physics, and plasma physics.
This article is a practical guide to improving performance by enabling vectorization for PIC codes.
Optimization for PIC Codes
The code example we use for this demonstration is the particle class in Athena++, an astrophysical magnetohydrodynamics (MHD) code written in C++ 1. The particle class encapsulates basic data structures and functions in PIC methods. Particle properties are represented by phase space position―that is, physical space position (x1,x2,x3) and velocity (v1,v2,v3). The class functions implement three essential particle-based operations in PIC methods:
- Deposit: Charge deposition from particles onto grid
- Move: Interpolation of grid-based fields onto particles and updating of particle properties using the fields
- Shift: Move particles among processes in the distributed environment. Usually, deposit and move take 80 to 90% of the total computational time and are the focus for optimizations.
Data Layout and Alignment
Particle properties can be stored as array-of-structures (AoS) or structure-of-arrays (SoA) data layout (Figure 1).
AoS helps to pack and unpack particles for shift, but doesn’t help to enable vectorization for deposit and move with stride-one memory access. Since deposit and move are the hotspots of the code, we choose SoA data layout for particle representation. Specifically, we allocate the memory with alignment using the posix_memalign() function (Figure 2).
We start the optimization by checking the vectorization report of the original code for deposit.cpp and move.cpp (Figure 3).
We compile the code with the Intel® C++ Compiler 19.0 and the following options: -O3 -g -qopt -report5
-xCORE-AVX512 -qopenmp simd. By default, the compiler will try to vectorize the innermost loop for both deposit (L106) and move (L121). From the vectorization report, we see that the compiler fails to do so because of a potential data dependency in deposit.cpp (Figure 4) and efficiency reasons in move.cpp (Figure 5).
Explicit SIMD via OpenMP Directives
For both deposit and move, the loop count for the outermost loop (e.g., nparticle loop) is large, while the loop count for the innermost loop is small. Therefore, the outermost loop is a more suitable candidate for vectorization. Our first attempt is to enable vectorization for the outermost loop in both deposit and move using OpenMP* SIMD directives. For both kernels, there’s no data dependency with respect to pointer aliasing. In considering vectorization for deposit in the outermost loop, a potential data dependency will appear when two particles within a single vector length try to write on the same memory location for the grid-based array (e.g., mcoup). When compiling for AVX2, it’s important to use #pragma omp ordered simd such that the loop can be safely vectorized. Since AVX512* provides conflict detection instructions (vpconflict), this is no longer necessary for AVX512. Also, we use aligned and simdlen clauses in the OpenMP SIMD directives to have the compiler generate more efficient vector instructions, e.g.:
Note the current Intel C++ Compiler only allows a local variable name in the aligned clause list. Using a class data member name directly will cause a compiler error. Our solution here is to define local variables and assign class data member variables to local variables, e.g.:
When we check the vectorization report after adding the OpenMP SIMD directives, we see that the compiler has now successfully vectorized the outermost loop with aligned access to the particle data (Figure 6). Unfortunately, the vectorized version is quite inefficient and doesn’t give a speedup over the scalar version due to the gather/scatter nature of the kernels. For example, the estimated potential speedup is only 0.55 and 0.67 for deposit and move, respectively.
Strip Mining
Strip mining is a common technique to help the compiler automatically vectorize code by exposing data parallelism2. The idea is to transform a single loop into a nested loop where the outer loop strides through a strip and the inner loop strides all iterations within the strip (Figure 7).
Usually, the strip size is a multiple of the vector length. In some computationally-intensive loops, the strip mining technique alone can lead to a significant performance boost. For deposit and move, replacing the particle loop with two nested loops doesn’t help with performance due to the gather/scatter. After applying strip mining, for example, the estimated potential speedup is still only 0.52 and 0.67 for deposit and move, respectively (Figure 8).
However, strip mining can be a powerful technique when you use it as a basis for other techniques, such as loop fission.
Loop Fission for move
The move loop includes two operations:
- The interpolation of grid-based fields onto particles (gather)
- Updating of particle properties according to the interpolated fields
While the former won’t benefit from vector instructions, the latter will. The idea here is to use loop fission to separate the non-vectorizable operation from the vectorizable one to improve performance. At the same time, by applying strip mining with loop fission, we can significantly reduce the storage requirement for passing local variables from the first loop to the second. Figure 9 shows the new move implementation.
The vectorization report shows that although there is no potential speedup for the first loop, the second achieves a 6.59 estimated potential speedup (Figure 10).
Vectorizable Charge Deposition Algorithm for deposit
Compared with move, deposit is more challenging to optimize. For the scatter operation, the performance challenges involve not only random memory accesses and potential data conflicts, but also increased pressure on the memory subsystem. Also, unlike move, which also includes many independent computationally-intensive operations (i.e., updating particles properties), deposit is mostly memory operations. A key optimization here is to reduce (or even avoid) data conflicts and regularize memory accesses. Motivated by the portable SIMD charge deposition algorithm in the PICSAR* code3, we implemented a vectorizable charge deposition algorithm for deposit (Figure 11).
First, the outermost particle loop is changed to two nested loops using strip mining. In the first nested loop (L69-L133), the particle properties are deposited on a set of local arrays with stride-one memory access, so the compiler should have no trouble generating highly vectorized machine code.
In the second nested loop (L135-L159), the information saved in the local arrays is transferred to an extended global grid array (L163-L205), where each grid point is associated with particle deposition on surrounding grid points (i.e., a 27-point stencil). The price we pay here is the extra storage required for the extended global array. The storage can be reduced if we consider the 27-point stencil as three 9-point stencils.
The grid-based values in the extended global array need to be merged at the end. Though it will be difficult to vectorize the merging operation, the cost isn’t significant, since the loop is over grid indices at least an order of magnitude smaller than the particle one.
Figure 12 shows the vectorization report for the new implementation. The first and second loops now achieve 5.72 and 5.69 estimated potential speedup, respectively.
Performance Evaluation on Intel Architectures
Performance was evaluated on a 10x10x10 test grid with 100 particles per cell. We ran the simulations for 1,001 timesteps and measured the total wall clock time. We measured performance on single-core Intel® Xeon® and Intel® Xeon Phi™ processors: “BDW” (Intel® Xeon® E5-2680 v4 processor, 2.4GHz, 2 sockets, 14 cores), “SKX” (Intel® Xeon® Gold 6148 processor, 2.40GHz, 2 sockets, 20 cores), and “KNL” (Intel® Xeon Phi™ 7250 processor, 1.4GHz, 1 sockets, 68 cores) (Figure 13).
We can see that the actual performance is consistent with the estimated potential speedup from the compiler report. On BDW, we see significant performance overhead in using SIMD for deposit. This is due to the #pragma omp ordered simd clause ensuring the right order of writing to the grid array when compiling with AVX2 instructions. Overall, we see that strip mining, combined with other techniques, has led to 1.3x to 1.9x performance boost on Intel® processors.
Acknowledgements
This work was supported by the Intel® Parallel Computing Center (IPCC) program. We would like to thank Jason Sewall at Intel for providing strong technical support for this work. We also would like to thank Lev Arzamasskiy and Matthew Kunz in the Princeton University Department of Astrophysical Sciences for sharing their source code.
References
1 https://princetonuniversity.github.io/athena/
2 Andrey Vladimirov “Optimization Techniques for the Intel MIC Architecture, part 2 of 3: Strip-Mining for
Vectorization,” Colfax International, June 26, 2015.
3 https://picsar.net/features/optimized-morse-nielson-deposition/