Intel® Advisor Roofline Analysis
A New Way to Visualize Performance Optimization Trade-offs
Software must be both threaded and vectorized to fully utilize today’s and tomorrow’s hardware. Data-driven vectorization design can yield long-term performance growth with less risk and more impact. But even with perfect vector and thread parallelism, developers must often also balance CPU/vector/thread utilization against memory subsystem data bottlenecks. This aspect of optimization can often be addressed using a roofline “bounds and bottlenecks” performance model.
This article provides an overview of Intel® Advisor and discusses the new Intel Advisor Roofline Analysis feature. The roofline model provides an intuitive and powerful representation of how to best address performance issues in your application. Finally, a case study shows the optimization process on a real example.
Roofline Model
Roofline modeling was first proposed by University of California, Berkeley, researchers Samuel Williams, Andrew Waterman, and David Patterson in their 2009 paper, Roofline: An Insightful Visual Performance Model for Multicore Architectures. More recently, it was extended to address all levels of the memory subsystem, as described by Aleksandar Ilic, Frederico Pratas, and Leonel Sousa in their paper Cache-Aware Roofline Model: Upgrading the Loft.
A roofline model provides insight into how your application works by helping you answer these questions:
- Does my application work optimally on the current hardware? If not, what is the most underutilized hardware resource?
- What limits performance? Is my application workload memory or compute bound?
- What is the right strategy to improve application performance?
The model plots data to help you visualize application compute and memory bandwidth ceilings by measuring two parameters (combined in the formula sketched after this list):
- Arithmetic intensity, the number of floating-point operations per byte transferred between CPU and memory
- Floating-point performance measured in billions of floating-point operations per second (GFLOPS)
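Together, these two quantities bound attainable performance. As a minimal sketch of the classical roofline bound (in our own notation, not labels Intel Advisor uses):

    P_attainable = \min\left(P_{peak},\; B_{peak} \times I\right)

where P_peak is the platform’s peak compute rate (GFLOPS), B_peak is its peak memory bandwidth (GB/s), and I is the kernel’s arithmetic intensity (FLOP/byte). On a hypothetical machine with a 100 GFLOPS compute peak and 50 GB/s of bandwidth, a kernel with I = 0.5 FLOP/byte is capped at 25 GFLOPS (memory bound), while a kernel with I = 4 FLOP/byte can, in principle, reach the full 100 GFLOPS (compute bound).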
The proximity of the data points to the model lines (rooflines) shows the degree of optimization (Figure 1). The kernels on the right-hand side, in the blue region, are more compute bound; as you move up the Y axis, they approach the floating-point peak. The performance of these kernels is bounded by the compute capabilities of the platform. To improve the performance of kernel 3, consider migrating it to a highly parallel platform, such as the Intel® Xeon Phi™ processor, where the compute ceiling and memory throughput are higher. For kernel 2, which is still far from the ceiling, consider vectorization as a performance improvement strategy.
Toward the left-hand side of the plot, in the red region, the kernels are memory bound. As you move up the Y axis, they are limited by the DRAM and cache peak bandwidth of the platform. To increase the performance of these kernels, consider improving the algorithm or its implementation to perform more computations per data item, thereby shifting the plot position to the right, where the performance ceiling is higher. These kernels may also run faster on an Intel Xeon Phi processor because of the higher memory bandwidth.
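As a minimal sketch of “more computations per data item,” consider fusing two passes over the same array: the floating-point work stays the same, but the bytes moved from memory are halved, roughly doubling arithmetic intensity for out-of-cache data. (The function names and counts below are our illustration, not code from any benchmark.)

    #include <stddef.h>

    /* Before: two passes over a[]. Combined, they perform 3 FLOPs per
       element while loading each element twice (8 bytes in single
       precision): about 0.375 FLOP/byte. */
    float sum_then_sumsq(const float *a, size_t n, float *sumsq) {
        float s = 0.0f, s2 = 0.0f;
        for (size_t i = 0; i < n; i++) s += a[i];          /* pass 1: 1 add */
        for (size_t i = 0; i < n; i++) s2 += a[i] * a[i];  /* pass 2: mul + add */
        *sumsq = s2;
        return s;
    }

    /* After: one fused pass, still 3 FLOPs per element but only one
       4-byte load: about 0.75 FLOP/byte, shifting the dot to the right. */
    float fused_sum_sumsq(const float *a, size_t n, float *sumsq) {
        float s = 0.0f, s2 = 0.0f;
        for (size_t i = 0; i < n; i++) {
            s += a[i];
            s2 += a[i] * a[i];
        }
        *sumsq = s2;
        return s;
    }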
Intel Advisor Overview
Intel Advisor is a software analysis tool offering a powerful software design and performance characterization platform for applications. Intel Advisor incorporates thread parallelism prototyping (Threading Advisor), vector parallelism optimization (Vectorization Advisor), and memory-versus-compute characterization (Advisor Roofline Automation) capabilities.
In this article, we mainly focus on the Advisor Roofline and Vectorization analyses. When using the Intel Advisor GUI, you can switch between Vectorization and Threading Advisor flows using the “Workflow” toggle. The Roofline chart can be accessed by using the Roofline sidebar (Figure 2).
Vectorization Advisor can help you increase the performance of your code using these steps:
- Survey shows which loops consume the most time, with detailed SIMD statistics.
- FLOPS and Trip Counts measures iteration counts, call counts, and the precise number of floating-point operations for each loop and function.
- Recommendations gives specific advice on how to fix performance issues.
- Dependencies Analysis provides a dynamic dependency analysis to verify if a loop has cross-iteration dependencies that can limit vectorization and parallelization.
- Memory Access Patterns Analysis checks if you are accessing memory in a vector-friendly manner (a brief illustration follows this list).
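As a brief, hypothetical illustration of what the Memory Access Patterns analysis looks for (the functions below are ours, not Intel Advisor’s):

    #include <stddef.h>

    struct point { float x, y, z; };

    /* Unit stride (vector-friendly): consecutive iterations touch
       consecutive addresses, so the compiler can issue packed loads. */
    void scale_unit(float *a, size_t n, float s) {
        for (size_t i = 0; i < n; i++)
            a[i] *= s;
    }

    /* Constant non-unit stride (vector-unfriendly): accessing one field
       of an array of structures leaves a 12-byte stride between loads,
       which is often compiled into gather instructions. */
    void scale_x(struct point *p, size_t n, float s) {
        for (size_t i = 0; i < n; i++)
            p[i].x *= s;
    }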
You can discover many important performance and design insights by combining Vectorization Advisor and Roofline analyses. For example, knowing the Vectorization Efficiency metric provided by the Vectorization Advisor Survey Report is often crucial when interpreting the data on a roofline chart.
Intel Advisor Roofline Analysis
Intel Advisor implements the “cache-aware” flavor of the roofline model, which provides additional insight by addressing all levels of the memory/cache hierarchy (see the formula sketch after this list):
- Sloped rooflines illustrate peak performance levels if all the data fits into the respective cache.
- Horizontal lines show the peak achievable performance levels if vectorization and other CPU resources are used effectively.
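In other words (a sketch in our own notation), each memory level m with measured bandwidth B_m contributes its own sloped roof, and a kernel whose data is served mainly from level m is bounded by:

    P_m(I) = \min\left(P_{peak},\; I \times B_m\right), \qquad m \in \{\text{L1}, \text{L2}, \text{LLC}, \text{DRAM}\}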
Intel Advisor places a dot for every loop on the roofline plot (Figure 3). The sizes and colors of the dots denote the relative execution times of the loops. Most of the loops require further optimization to better utilize cache memory. Some, such as the green dot sitting on the dotted Scalar Add Peak line just to the right of the middle vertical line, may be poorly vectorized loops. As you can see, the roofline chart makes it easy to locate opportunities to improve application performance.
How to Interpret the Intel Advisor Roofline Chart
A roofline plot provides useful information, but it is not a reference table in which you simply locate an input and read off the corresponding output. It is a guide that suggests which factors to investigate; it requires interpretation.
The lines on a roofline chart, such as the ones in Figure 3, represent hardware limits on kernel performance, established by benchmarks that Intel Advisor runs on the host system. The uppermost lines form the roofs that represent the maximum performance of the machine; in Figure 3, these are “L1 Bandwidth” and “Single Vector FMA Peak.” Not every kernel can achieve this performance, however. Depending on the nature of its algorithm, a kernel may ultimately be limited by a lower roof (e.g., a kernel that cannot be vectorized is limited by the maximum performance of scalar computations).
A kernel’s horizontal position on the plot is its arithmetic intensity (the number of floating-point operations per byte transferred between CPU and memory, as measured by operand size), which is primarily determined by its algorithm, though this can be altered somewhat by optimizations that occur during compilation. Figure 4 gives examples of different algorithms and their relative arithmetic intensities. Redesigning a kernel’s algorithm to increase its arithmetic intensity pushes it to the right in the roofline chart, which may be helpful in raising its maximum potential performance, due to the slopes of the memory bandwidth roofs.
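For instance, a simple single-precision triad (a hypothetical kernel chosen for illustration, not one from the case study below) has a low arithmetic intensity that is easy to count by hand:

    #include <stddef.h>

    /* a[i] = b[i] + s * c[i]
       FLOPs per iteration:  2 (one multiply, one add)
       Bytes per iteration: 12 (load b[i], load c[i], store a[i]; 4 bytes each)
       Arithmetic intensity: 2 / 12 ≈ 0.17 FLOP/byte -- far to the left
       on the roofline chart, under the sloped memory roofs. */
    void triad(float *a, const float *b, const float *c, float s, size_t n) {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + s * c[i];
    }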
A kernel’s vertical position relative to the various roofs reveals bottlenecks. If a kernel is placed above a roof, then that roof is not the primary performance bottleneck, although it can still affect performance. The roofs above a kernel are potential bottlenecks, each corresponding to an issue that can be overcome using a particular type of optimization. If a kernel is below the scalar computation peak line, then it is worth investigating the kernel’s vectorization status. If it is unvectorized, or inefficiently vectorized, it is likely that this roof represents the bottleneck and suggests that it would be prudent to improve or implement the kernel’s vectorization if possible. If, on the other hand, this kernel is efficiently vectorized, the scalar computation peak can be ruled out as a bottleneck, and you can move on to investigating the other roofs above the kernel.
Solving Performance Problems Using Intel Advisor
This section offers some practical tips for getting the most out of Intel Advisor.
Tip No. 1: Use the summary view to see the top time-consuming loops (Figures 5 and 6) and tuning recommendations (Figure 7).
Tip No. 2: Use roofline customization to remove the roofs you don’t need (Figure 8).
For example, if you are operating on only single-precision data, you can safely remove the double-precision peaks from your roofline.
Tip No. 3: Use Smart Mode to find the best optimization candidates (Figure 9).
Loops are ordered on the roofline by their Elapsed Self Time, but by activating Smart Mode, you can identify loops that have high total time. The more total time that is spent in a loop, the larger the overall effect of optimizing it can be.
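The distinction matters most for nested loops. In the hypothetical sketch below, the outer loop has almost no self time of its own, yet its total time covers the inner hotspot, making it a prime candidate for higher-level restructuring:

    #include <stdio.h>

    int main(void) {
        double sum = 0.0;
        /* Outer loop: low self time (its own body does little work),
           but high total time, because it contains the inner hotspot. */
        for (int i = 0; i < 1000; i++) {
            sum += i;
            /* Inner loop: where nearly all the self time accrues. */
            for (int j = 0; j < 100000; j++)
                sum += (double)j * 1e-9;
        }
        printf("%f\n", sum);
        return 0;
    }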
Tip No. 4: Use some of the other Intel Advisor features to supplement the information in the roofline chart.
Vectorization efficiency is your vectorization thermometer (Figures 10 and 11). Under Instruction Set Analysis, look at the Traits column to see factors that could be affecting vectorization (Figure 10). Consider running the Intel Advisor memory access pattern collection if you suspect you’re referencing memory in a non-vector-friendly fashion. [Editor’s note: Vladimir Tsymbal’s article “Identify Scalability Problems in Parallel Applications” describes some techniques to analyze memory access.]
Tip No. 5: Isolate vectorized from nonvectorized loops using the loop toggle (Figure 12).
Tip No. 6: Use the source window together with the roofline chart (Figure 13).
Intel Advisor seamlessly integrates your source code into the performance profile.
Case Study: Using Roofline Analysis to Tune an MRI Image Reconstruction Benchmark
The 514.pomriq SPEC ACCEL Benchmark is an MRI image reconstruction kernel described in Stone et al. (2008). MRI image reconstruction is a conversion from sampled radio responses to magnetic field gradients. The sample coordinates are in the space of magnetic field gradients, or K-space. The Q matrix in the MRI image reconstruction is a precomputable value based on the sampling trajectory, the plan of how points in K-space will be sampled. The algorithm examines a large set of input, representing the intended MRI scanning trajectory and the points that will be sampled. Each element of the Q matrix is computed by a summation of contributions from all trajectory sample points. Each contribution involves a three-element vector dot product of the input and output 3-D location plus a few trigonometric operations. The output Q elements are complex numbers but the inputs are multielement vectors. The kernel is fundamentally compute bound because trigonometric functions are expensive, and the regularity of the problem allows for easy management of memory bandwidth. Therefore, once tiling and data layout remove any artificial memory bandwidth bottleneck, the most important optimizations are low-level sequential code optimizations and improving the instruction stream efficiency, such as loop unrolling.
The input to 514.pomriq consists of one file containing the number of K-space values; the number of X-space values; and the list of K-space coordinates, X-space coordinates, and Phi-field complex values for the K-space samples. Each set of coordinates and the complex values are stored as arrays, with each field written contiguously. The 514.pomriq output consists of the resulting Q matrix of complex values, one per line, in “real, imaginary” format. This case study analyzes the 514.pomriq compute kernel and focuses on its optimization. The Intel Advisor summary for 514.pomriq run on an Intel Xeon Phi 7250 processor is shown in Figure 14, where it is easy to see that the loops involved in computing the Q matrix are the hotspot.
We use Intel Advisor Smart Mode to narrow down the best optimization candidates (Figure 15a). The loop where we are spending the most time is vectorized, but it is still below the MCDRAM roof (Figure 15b). This possibly indicates issues with memory use. Let’s examine the loop using Code Analytics and Recommendations.
We can see heavy gather instructions here, along with advice to explore the memory access pattern of the loop (Figures 16 and 17). Running a Memory Access Patterns analysis reveals gather (irregular) accesses for 4 percent of the strides, a tip-off to the likely bottleneck: nonoptimal memory access (Figure 18). Looking at the details, we can verify that there is no need for gather instructions at all, since the stride is actually constant (Figure 19a). The Intel Advisor recommendations point out the same thing (Figure 19b).
#pragma omp simd private(expArg, cosArg, sinArg) reduction(+:QrSum, QiSum)
for (indexK = 0; indexK < numK; indexK++) {
    expArg = PIx2 * (kVals[indexK].Kx * x[indexX] +
                     kVals[indexK].Ky * y[indexX] +
                     kVals[indexK].Kz * z[indexX]);
    cosArg = cosf(expArg);
    sinArg = sinf(expArg);
    float phi = kVals[indexK].PhiMag;
    QrSum += phi * cosArg;
    QiSum += phi * sinArg;
}
Everything becomes clear after checking the source code of the kernel (Figure 20). The code uses an array of structures, so its field accesses become “gathers” after vectorization. However, newer versions of the Intel® compiler can recognize the access pattern and apply optimizations that replace the gathers with more lightweight instructions. This “gathers” replacement is performed by the “Gather to Shuffle/Permutes” compiler transformation and can often be profitable on modern CPUs, especially on platforms with Intel AVX-512 support. [Editor’s note: Martyn Corden’s article “Vectorization Opportunities for Improved Performance with Intel® AVX-512” describes how the Intel® Compiler 2017 takes advantage of Intel AVX-512 to create new opportunities for loop vectorization.]
Let’s take a look at the roofline after recompilation with a newer Intel compiler (e.g., Intel Compiler 2017 Update 1) that supports “Gather to Shuffle/Permutes.” We can see that the dot is now above the MCDRAM roof, the gather instructions are gone (replaced with Intel AVX-512 “2-source permutes”), and the number of floating-point operations per second has increased (Figures 21 and 22).
However, there is a more effective way to resolve such issues: AOS (array of structures) to SOA (structure of arrays) conversion. This optimization lets us use data containers that are more convenient for vector processing. In the past, it involved manually reworking the underlying data structures. Now, using the Intel® SIMD Data Layout Templates library (Figure 23), we can improve performance simply by adding a few lines of code where the kValues structure is declared, where the structure is initialized, and where the K-values are computed.
#include <sdlt/sdlt.h>

struct kValues {
    float Kx;
    float Ky;
    float Kz;
    float PhiMag;
};
SDLT_PRIMITIVE(kValues, Kx, Ky, Kz, PhiMag)

sdlt::soa1d_container<kValues> inputKValues(numK);
auto kValues = inputKValues.access();
for (k = 0; k < numK; k++) {
    kValues[k].Kx() = kx[k];
    kValues[k].Ky() = ky[k];
    kValues[k].Kz() = kz[k];
    kValues[k].PhiMag() = phiMag[k];
}

auto kVals = inputKValues.const_access();
#pragma omp simd private(expArg, cosArg, sinArg) reduction(+:QrSum, QiSum)
for (indexK = 0; indexK < numK; indexK++) {
    expArg = PIx2 * (kVals[indexK].Kx() * x[indexX] +
                     kVals[indexK].Ky() * y[indexX] +
                     kVals[indexK].Kz() * z[indexX]);
    cosArg = cosf(expArg);
    sinArg = sinf(expArg);
    float phi = kVals[indexK].PhiMag();
    QrSum += phi * cosArg;
    QiSum += phi * sinArg;
}
Let’s check the new roofline chart (Figure 24). After applying this optimization, the dot is no longer red, meaning the loop takes less time now, and it delivers more GFLOPS, putting it close to the L2 roof. Additionally, the loop now has unit stride access and, as a result, needs no special memory manipulations (Figures 25a and 25b). The total performance improvement is almost 3x for the kernel and 50 percent for the entire application.
Conclusion
The roofline model provides a new, visually intuitive, and powerful representation of your application’s performance. By using the proper optimization techniques, as indicated by the region of the roofline chart your application is in, you can avoid wasting valuable time on optimizations that will have minimal impact on your performance. The roofline model can answer the following questions:
- Can I get better performance?
- What are the key performance bottlenecks: memory or CPU?
- How much speedup can I get if I optimize a particular bottleneck?
- How much speedup can I get if I use another platform?
As systems get bigger and more complex, getting these answers is nontrivial, but roofline analysis can save you time and effort.
Modernize Your Code
- To get the most out of your hardware, you need to modernize your code with vectorization and threading.
- Taking a methodical approach such as the one outlined in this article, and taking advantage of the powerful tools in Intel® Parallel Studio XE, can make the modernization task dramatically easier.
- Use Intel® Advisor roofline analysis, now available in Intel® Parallel Studio XE 2017 Update 1.
- Send an email to vector_advisor@intel.com to get the latest information on some exciting new capabilities that are currently under development.
Useful Intel Advisor Links
Get started with the Intel® Advisor roofline feature
Self-time-based FLOPS computation, with an important explanation of how to interpret roofline results for nested loops
Analyzing Intel MPI applications with Intel Advisor
References
Stone, S. S., J. P. Haldar, S. C. Tsao, W. W. Hwu, Z. Liang, and B. P. Sutton. “Accelerating advanced MRI reconstructions on GPUs.” In International Conference on Computing Frontiers, pages 261–272, 2008.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.