Hiding Communication Latency Using MPI-3* Non-Blocking Collectives
Improving HPC Performance by Overlapping Communication and Computation
An indicator of a well-written HPC application is its ability to scale linearly, or nearly linearly, across a large number of discrete nodes. Linear scalability (speedup) depends largely on minimizing communication overhead.
One way to minimize communication overhead is to hide it. Non-blocking communication functions have always been part of the Message Passing Interface* (MPI*), letting developers overlap communication with computation and thereby hide communication latency. Basic interprocess communication in MPI (e.g., MPI_Send, MPI_Recv) is blocking (i.e., a call doesn’t return until its message buffer is safe to reuse). Non-blocking functions (e.g., MPI_Isend, MPI_Irecv), on the other hand, return immediately, allowing computation to continue while communication completes in the background.
This article focuses on the relatively new non-blocking collective (NBC) communication functions (e.g., MPI_Iallreduce, MPI_Ibcast) in the MPI-3* standard. Despite their potential benefits, NBC functions aren’t yet widely used in MPI applications. We’ll try to make them more approachable and show how to use them to improve performance by overlapping communication and computation.
Understanding Non-Blocking Communication
Using NBC in an MPI code involves more than just replacing a blocking collective with its non-blocking counterpart (e.g., replacing MPI_Bcast with MPI_Ibcast). In fact, blindly making this change might degrade performance because of the additional overhead of checking for completion of the communication. Besides replacing MPI_Bcast with MPI_Ibcast, it’s also necessary to examine the code carefully and identify computation that can execute while the NBC is in progress. That computation must be independent of the data being communicated. Otherwise, parallel correctness is compromised (Figure 1).
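The pattern described above can be sketched in a few lines of Fortran. This is a minimal illustration under stated assumptions, not code from the application discussed later: the module name, the subroutine bcast_with_overlap, its arguments, and the "independent work" loop are all hypothetical. The only MPI calls used are MPI_Ibcast and MPI_Wait. The arrays are explicit-shape so that the buffer handed to the non-blocking call is contiguous.

```fortran
! Sketch only: names, sizes, and the work loop are illustrative assumptions.
module ibcast_overlap_sketch
  use mpi
  implicit none
contains

  ! Broadcast params from rank 0 while updating local, which does not
  ! depend on params. Returns sum(local) + sum(params) as a simple check.
  subroutine bcast_with_overlap(comm, np, params, nl, local, checksum)
    integer, intent(in) :: comm, np, nl
    double precision, intent(inout) :: params(np), local(nl)
    double precision, intent(out) :: checksum
    integer :: request, ierr, i

    ! Start the broadcast; the call returns before the transfer completes.
    call MPI_Ibcast(params, np, MPI_DOUBLE_PRECISION, 0, comm, &
                    request, ierr)

    ! Independent computation: touches only local, never params.
    do i = 1, nl
       local(i) = 2.0d0 * local(i) + 1.0d0
    end do

    ! params must not be read (or, on the root, modified) before this returns.
    call MPI_Wait(request, MPI_STATUS_IGNORE, ierr)

    ! From here on it is safe to use the broadcast data.
    checksum = sum(local) + sum(params)
  end subroutine bcast_with_overlap

end module ibcast_overlap_sketch
```

Every rank in the communicator would call bcast_with_overlap between MPI_Init and MPI_Finalize. If the loop read params before MPI_Wait, non-root ranks could see stale values, which is the correctness hazard noted above.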
It’s also important to consider the execution time of the code in the overlapping compute section. Ideally, this section should take at least as long as the communication it overlaps (MPI_Ibcast in this case). Otherwise only part of the communication time is hidden, and the gain may not justify the additional complexity of non-blocking communication.
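One rough way to reason about this trade-off is a simplified timing model. The model is an assumption made here for illustration, not a measurement from this article. The symbols are introduced only for this sketch: T_comm is the communication time, T_comp is the time of the independent computation, and T_ovh is the extra cost of the non-blocking call and its completion check. The model also assumes the communication progresses fully in the background.

```latex
\begin{align*}
T_{\text{blocking}} &= T_{\text{comm}} + T_{\text{comp}} \\
T_{\text{overlap}}  &\approx \max\left(T_{\text{comm}},\, T_{\text{comp}}\right) + T_{\text{ovh}} \\
T_{\text{blocking}} - T_{\text{overlap}} &\approx \min\left(T_{\text{comm}},\, T_{\text{comp}}\right) - T_{\text{ovh}}
\end{align*}
```

Under these assumptions, the communication is fully hidden once T_comp is at least T_comm, and the change pays off only when the smaller of the two times exceeds the added overhead.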
The MPI implementation used here is Intel® MPI Library 2018 Update 2 (Intel® MPI), which is fully compliant with the latest MPI 3.1 standard. This library is available standalone or as part of Intel® Parallel Studio XE Cluster Edition.
Meeting the Challenges Associated with Non-Blocking Communication
The semantics of blocking communication ensure completion of the associated data transfer. Non-blocking calls, however, use helper functions like MPI_Wait or MPI_Test to check for completion. The fact that the non-blocking semantics don’t ensure completion presents an opportunity to overlap computation and communication. However, the use of helper functions results in an increase in the number of instructions on the critical path due to cascading overheads at several levels.
Figure 2 shows the software stack used in Intel MPI. CH3 in the abstract device interface layer represents Intel MPI. To put the data to be transferred onto the wire (i.e., the communication medium), several software layers (Netmod and Provider) must be traversed before arriving at the hardware layer.
All the auxiliary instructions that run after invoking a communication call to put data on the wire are referred to as the critical path for the sender. MPI_Wait or MPI_Test calls originating from the abstract device interface layer complete with the help of additional calls in the Netmod and Provider layers. This is why non-blocking communication is more expensive than blocking communication. However, this expense can be offset by the performance gains from doing computation while communication is in progress.
Another overhead associated with non-blocking communication comes from making it asynchronous. Although asynchronous progress improves communication-computation overlap, it requires an additional thread per MPI rank. This thread consumes CPU cycles and, ideally, must be pinned to an exclusive core. However, exclusive thread pinning for each rank results in half of the cores being assigned just to accelerate the progress of non-blocking MPI calls. Therefore, through careful experimentation, we must select a certain number of cores per node to be assigned for asynchronous progress without causing a considerable compute penalty.
In Intel MPI, the following environment variables are available to enable asynchronous progress and to pin threads to cores:
$ export I_MPI_ASYNC_PROGRESS=1
$ export I_MPI_ASYNC_PROGRESS_PIN=<CPU list>
Another tuning parameter to be considered is the spin count, which controls the aggressiveness of these threads. This can be controlled using:
$ export I_MPI_SPIN_COUNT=<scount>
The asynchronous approach requires careful consideration and usage. Some alternatives leverage the imbalance in the application to schedule threads on idle cores [1]. A more elegant alternative would be for the network hardware to enable asynchronous progress. When such hardware features become available, it will be much easier to use asynchronous progress to enhance performance.
Lattice Quantum Chromodynamics (QCD)
Now let’s explore the impact of non-blocking communication on a QCD application. In QCD, space-time is discretized on a four-dimensional hypercube lattice, with fermion (quark) fields ascribed to the lattice sites and gauge (gluon) fields ascribed to the links between sites. The Wilson-Dslash operator is used as the gauge covariant derivative in a variety of lattice Dirac Fermion operators. This operator is like a four-dimensional, nearest-neighbor stencil (a nine-point stencil in 4D), with the added elaboration that the spinors on the sites and the gauges ascribed to the links are represented in component complex matrices. A large proportion of time in QCD applications is spent solving linear systems with these operators, typically using sparse iterative solvers such as conjugate gradient (CG) [2].
The communication pattern for the Wilson-Dslash operator is nearest-neighbor point-to-point (send-receive) exchange. This computation and communication is implemented in the fdopr function (not shown) in Figure 3. The CG solver code has two such invocations to the fdopr function for computing the “red” and “black” data (red-black algorithm). Typical multi-node implementations overlap computation of the Wilson-Dslash operator with the communication of the faces using non-blocking MPI functions (MPI_Isend, MPI_Irecv). The CG solver code also uses the collective communication function, MPI_Allreduce, to compute the error residue of the iterative solver.
The performance of the CG solver is expected to be lower than that of Wilson-Dslash because of the MPI_Allreduce collective operations in the CG solver loop. This is an opportunity to experiment with non-blocking collective communication, where the collective communication after the red computation is overlapped with the black computation in every iteration of the solver (Figure 4). MPI_Wait is necessary to ensure completion of the non-blocking communication. (The optional call to MPI_Test checks for completion of the pending request and initiates progress of the non-blocking communication if it hasn’t already begun.) The asynchronous communication is overlapped with the computation (and communication) in function fdopr.
      tstart1 = MPI_Wtime()
      call MPI_Allreduce(alphad,ret,1,MPI_DOUBLE_PRECISION,
     +     MPI_SUM, MPI_COMM_WORLD, ierr)
      call fdopr(atap,p,ap,0,1)
      tend1 = MPI_Wtime()
      tstart1 = MPI_Wtime()
      call MPI_Iallreduce(alphad,ret,1,MPI_DOUBLE_PRECISION,
     +     MPI_SUM, MPI_COMM_WORLD, ired1, ierr)
      call MPI_Test(ired1, flag1, MPI_STATUS_IGNORE, ierr)
      call fdopr(atap,p,ap,0,1)
      call MPI_Wait(ired1,iStatus,ierr)
      tend1 = MPI_Wtime()
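When no asynchronous progress thread is available, one common variation is to split the overlapped work into chunks and call MPI_Test between them, giving the library repeated chances to advance the reduction. The sketch below illustrates that idea and is not what the CG solver above does. The module, the subroutine allreduce_with_polling, its arguments, and the chunk size are hypothetical. Whether MPI_Test actually advances the operation depends on the MPI implementation. The sketch assumes chunk is at least 1.

```fortran
! Sketch only: names and the work loop are illustrative assumptions.
module iallreduce_poll_sketch
  use mpi
  implicit none
contains

  ! Sum local_val across ranks into global_sum while processing work in
  ! chunks, polling the pending request once per chunk.
  subroutine allreduce_with_polling(comm, local_val, global_sum, &
                                    n, work, chunk)
    integer, intent(in) :: comm, n, chunk
    double precision, intent(in) :: local_val
    double precision, intent(out) :: global_sum
    double precision, intent(inout) :: work(n)
    integer :: request, ierr, first, last, i
    logical :: done

    call MPI_Iallreduce(local_val, global_sum, 1, MPI_DOUBLE_PRECISION, &
                        MPI_SUM, comm, request, ierr)

    done = .false.
    do first = 1, n, chunk
       last = min(first + chunk - 1, n)
       do i = first, last
          work(i) = 0.5d0 * work(i)   ! independent of global_sum
       end do
       ! Poll once per chunk until the request reports completion.
       if (.not. done) then
          call MPI_Test(request, done, MPI_STATUS_IGNORE, ierr)
       end if
    end do

    ! Still required: returns at once if MPI_Test already completed the
    ! request (the handle is then MPI_REQUEST_NULL).
    call MPI_Wait(request, MPI_STATUS_IGNORE, ierr)
  end subroutine allreduce_with_polling

end module iallreduce_poll_sketch
```

A smaller chunk polls more often, which may help progress but adds call overhead to the compute loop, so the chunk size is best chosen by measurement.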
The performance improvement from overlapping communication and computation using NBC is shown in Figure 5. The application was run on a cluster of dual Intel® Xeon® Gold 6148 processor nodes connected by the Intel® Omni-Path Fabric (Intel® OP Fabric). The improvement is higher for larger node counts, which is expected because the MPI_Allreduce time also increases with node count. Switching to MPI_Iallreduce hides some of this communication latency.
Improving Performance
NBC in Intel MPI provides opportunities to improve performance by overlapping communication and computation. It’s important to carefully consider the code segments that lend themselves to this optimization.
References
[1] Karthikeyan Vaidyanathan, Dhiraj D. Kalamkar, Kiran Pamnany, Jeff R. Hammond, Pavan Balaji, Dipankar Das, Jongsoo Park and Balint Joo. “Improving Concurrency and Asynchrony in Multithreaded MPI Applications using Software Offloading,” Supercomputing 2015.
[2] M.R. Hestenes and E. Stiefel. “Methods of Conjugate Gradients for Solving Linear Systems,” Journal of Research of the National Bureau of Standards, 1952.