Vectorization Opportunities for Improved Performance with Intel® AVX -512
Examples of How Intel® Compilers Can Vectorize and Speed Up Loops
The days of easy performance improvements from steadily increasing clock speeds are long gone. Moore’s Law instead provides increased parallelism through additional cores and more and wider SIMD registers. Besides the increased SIMD vector width of 512 bits, Intel® Advanced Vector Extensions 512 (Intel® AVX-512) includes new instructions that enable vectorization of some loops that could not be vectorized with older instruction sets and more efficient vectorization of others.
In this article, we’ll explain, with examples, how Intel® C++ and Intel® Fortran compilers can vectorize and speed up loops that compress or expand arrays and loops that fill histograms or perform scatters with potential address conflicts. We will also show how the compilers can convert strided loads or gathers into more efficient unit stride SIMD loads for certain types of loops over arrays of structures. Finally, we’ll show how to recognize this conversion using new features of the optimization report in the compiler version 17.0 from Intel® Parallel Studio XE 2017. These optimizations can benefit a broad range of applications running not only on the Intel® Xeon Phi™ X200 processor, but also on future Intel® Xeon® processors that support the Intel AVX-512 instruction set.
A Vectorization Refresher
Figure 1 illustrates a simple, double-precision floating-point loop. In scalar mode, one instruction produces one result. After vectorization, a single Intel AVX-512 instruction can produce eight results, compared to four for Intel® AVX, or two for Intel® Streaming SIMD Extensions (Intel® SSE). The Intel Xeon Phi x200 processor supports Intel AVX-512 instructions for a wide variety of operations on 32- and 64-bit integer and floating-point data. This may be extended to 8- and 16- bit integers in future Intel Xeon processors.
Automatic vectorization is enabled by default in Intel compilers. However, one of the switches in Figure 2 must be set in order to target the Intel AVX-512 instructions.
Compress and Expand Loops
The Fortran example in Figure 2A illustrates array compression. Only those elements of a large source array that satisfy a condition are copied into a smaller target array. The C example in Figure 2B illustrates the inverse operation (i.e., array expansion), where elements of a smaller source array are copied back into the larger, sparse array.
nb = 0 do ia=1, na ! line 23 if (a(ia) > 0.) then nb = nb + 1 ! dependency b(nb) = a(ia) ! compress endif enddo
int j = 0 for (int i=0; i <N; i++) { if (a[i] > 0) { c[i] = a[k++]; // expand } } // Cross-iteration dependencies via j and k
The conditional increment of the dense array index introduces a dependence between loop iterations. In the past, this would have prevented automatic vectorization. For example, when compiling the loop in Figure 2A targeting Intel® AVX2, the optimization report shows:
ifort -c -xcore-avx2 -qopt-report-file=stderr -qopt-report=3 compress.f90 … LOOP BEGIN at compress.f90(23,3) remark #15344: loop was not vectorized: vector dependence prevents vectorization. … remark #15346: vector dependence: assumed ANTI dependence between nb (25:7) and nb (25:7) LOOP END
The C example in Figure 2B behaves in the same way. An OpenMP* SIMD directive cannot be used either, because the dependency would lead to incorrect results.
Intel AVX-512 overcomes this dependency with the new vcompress instructions that write selected elements of one SIMD register into contiguous elements of another, or to memory. Likewise, the vexpand instructions write contiguous elements from a source register or memory to selected (sparse) elements of a destination SIMD register. These new instructions allow the compiler to vectorize the compression example (Figure 2A) when Intel AVX-512 is enabled:
ifort -c -xmic-avx512 -qopt-report-file=stderr -qopt-report=3 compress.f90 … LOOP BEGIN at compress.f90(23,3) remark #15300: LOOP WAS VECTORIZED remark #15450: unmasked unaligned unit stride loads: 1 remark #15457: masked unaligned unit stride stores: 1 … remark #15497: vector compress: 1 LOOP END
All elements of the source array are loaded, so the load is unmasked. Only selected elements are stored, so the effective store is masked. The optimization report contains remark #15497 to show that a compress idiom was recognized and vectorized. An assembly listing, obtained with the -S option, shows instructions such as:
vcompressps %zmm4, -4(%rsi,%rdx,4){%k1}
Similar results are obtained for the array expansion loop in Figure 2B.
Compressing a single-precision array of 1,000,000 random values by a factor of two, repeated 1,000 times on an Intel Xeon Phi 7250 processor, resulted in an approximately 16x speedup of Intel AVX-512 over Intel AVX2, which corresponds to the width of the SIMD registers and instructions.
Intel AVX-512 Conflict Detection Instructions
Loops containing a store with indirect addressing have a potential dependency that typically prevents vectorization. For example:
for (i=0; i<n; i++) a[index[i]] = …
If index[i] has the same value for two different values of i, the stores conflict and cannot be executed safely at the same time. The vpconflict instruction from Intel AVX-512 resolves this conflict by providing a mask for those SIMD lanes (values of i) that are conflict-free (i.e., no values of index[i] are duplicated). After the SIMD computation has been safely performed for these lanes, the loop is re-executed for any lanes that were masked out.
Histogramming
Filling a histogram is a common operation in many applications (e.g., image processing). The code fragment in Figure 4 fills a histogram of sin(x) in the array h for an array x of input values. With Intel AVX2, this does not vectorize because of a dependency, since two input values could contribute to the same histogram bin, ih:
for (i=0; i<n; i++) { y = sinf(x[i]*twopi); ih = floor((y-bot)*invbinw); ih = ih > 0 ? ih : 0; ih = ih < nbin-1 ? ih : nbin-1; h[ih] = h[ih] + 1; // line 25 }
icc -c -xcore-avx2 histo.c -qopt-report-file=stderr -qopt-report-phase=vec … LOOP BEGIN at histo2.c(20,4) remark #15344: loop was not vectorized: vector dependence prevents vectorization… remark #15346: vector dependence: assumed FLOW dependence between h[ih] (25:7) and h[ih] (25:7) LOOP END
Intel AVX-512 conflict detection instructions allow this loop to be vectorized safely:
ifort -c -xmic-avx512 histo2.f90 -qopt-report-file=stderr -qopt-report=3 -S … LOOP BEGIN at histo2.c(20,4) remark #15300: LOOP WAS VECTORIZED remark #15458: masked indexed (or gather) loads: 1 remark #15459: masked indexed (or scatter) stores: 1 remark #15499: histogram: 2 LOOP END
In the assembly code (not shown), x and index are loaded, and ih is calculated for all SIMD lanes. This is followed by a vpconflict instruction and a masked gather of the unique bins (i.e., elements of h) into a register, which are then incremented. If some values of ih were duplicated, the code loops back and reincrements the corresponding bins. Finally, there is a masked store (scatter) of the incremented bins back into h.
A simple test that filled a histogram of 200 bins from a single-precision array of 100,000,000 random values between 0 and 1, repeated 10 times, was compiled first for Intel AVX2 and then for Intel AVX-512 and run on an Intel Xeon Phi 7250 processor. A nearly 9x speedup was observed. The speedup depends heavily on the problem details. It comes mostly from vectorization of the loop computations, such as the sine function in this example, not from the gather and scatter themselves. The speedup can be large in the common case where conflicts are relatively infrequent. However, it may be smaller if there are many conflicts, as in a histogram containing a singularity or narrow spike.
Gather-to-Shuffle Optimization
Large arrays of small structures or of short vectors occur frequently but present a challenge to efficient vectorization. Vectorization over the structure content or short vectors may be impossible or inefficient due to the low trip count. Vectorization over the large array index is also inefficient because data used by consecutive iterations are not adjacent in memory, so efficient SIMD loads of contiguous data cannot be used. This is illustrated in Figure 5, which simply calculates the sum of squares of the components of an array of 3-vectors.
struct Point { float x; float y; float z; }; float sumsq( struct Point *ptvec, int n) { float t_sum = 0; for (int i=0; i<n; i++) { t_sum += ptvec[i].x * ptvec[i].x; t_sum += ptvec[i].y * ptvec[i].y; t_sum += ptvec[i].z * ptvec[i].z; } return t_sum; }
The older version 15 compiler vectorizes this loop using strided loads or gathers to load each
component of the struct:
icc -std=c99 -xmic-avx512 -qopt-report=4 -qopt-report-file=stderr -qopt-report-phase=vec,cg … LOOP BEGIN at g2s.c(6,5) remark #15415: vectorization support: gather was generated for the variable ptvec: strided by 3 … remark #15300: LOOP WAS VECTORIZED remark #15460: masked strided loads: 6
However, the compiler can do better by noting that the x, y, and z components of the struct are adjacent in memory. The compiler can perform SIMD loads of the components, and then perform permutations and shuffles to transpose the data so it is laid out for vectorization of the loop over i, as illustrated in Figure 6.
When compiled with the newer version 17 compiler, the optimization report contains a “codegeneration” component:
Report from: Code generation optimizations [cg] sumsq.c(10,22):remark #34030: adjacent sparse (strided) loads optimized for speed. Details: stride { 12 }, types { F32-V512, F32-V512, F32-V512 }, number of elements { 16 }, select mask { 0x000000007 }.
This shows that the compiler was able to convert the original strided loads into contiguous SIMD loads followed by shuffles and permutes. A simple test loop over 10,000 random points, repeated 100,000 times, was compiled for Intel AVX-512 and run on an Intel Xeon Phi 7250 processor. The observed speedup using the version 17 compiler, compared to the version 15 compiler, was more than a factor of 2. The same optimization is performed if pt is a twodimensional array pt[10000][3], or in Fortran, if pt is a two-dimensional array pt(3,10000) or a derived type array.
The compiler can sometimes perform this “gather to shuffle” optimization when targeting other instruction sets. However, the powerful new shuffle and permute instructions result in it being generated much more frequently when targeting Intel AVX-512.
Summary
Powerful new instructions in Intel AVX-512 enable improved application performance through the vectorization of some loops that could not be vectorized previously, and more efficient vectorization of others. The compiler optimization report shows when and where these optimizations are performed.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.
Intel compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.