Floating-Point Reproducibility in Intel® Software Tools
Getting Beyond the Uncertainty
Binary floating-point (FP) representations of most real numbers are inexact―and there’s an inherent uncertainty in the result of most calculations involving FP numbers. Consequently, computations repeated under different conditions may give different results, although the results remain consistent within the expected uncertainty. This usually isn’t concerning, but some contexts demand reproducibility beyond this uncertainty (e.g., for quality assurance, legal issues, or functional safety requirements). However, improved or exact reproducibility typically comes at a cost in performance.
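As a small illustration of this inexactness, the following Python sketch prints the value that is actually stored for the decimal literal 0.1. Python floats are IEEE 754 double-precision numbers on common platforms; the variable names are only for illustration.

```python
from decimal import Decimal
from fractions import Fraction

# Decimal(x) and Fraction(x) expose the exact binary value stored for a float.
stored_tenth = Decimal(0.1)
representation_error = Fraction(0.1) - Fraction(1, 10)

print(stored_tenth)                 # the stored value is slightly larger than 0.1
print(float(representation_error))  # size of the representation error
print(0.1 + 0.2 == 0.3)             # False: each operand and the sum are rounded
```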
What’s Reproducibility?
Reproducibility means different things to different people. At its most basic, it means rerunning the same executable on the same data using the same processor should always yield the exact same result. This is sometimes called repeatability or run-to-run reproducibility. Users are sometimes surprised―or even shocked―to learn that this isn’t automatic, and that results aren’t necessarily deterministic.
Reproducibility can also mean getting identical results when targeting and/or running on different processor types, building at different optimization levels, or running with different types and degrees of parallelism. This is sometimes called conditional numerical reproducibility. The conditions required for exactly reproducible results depend on the context―and may result in some loss of performance.
Many software tools don’t provide exactly reproducible results by default.
Sources of Variability
The primary source of variations in FP results is optimization. Optimizations can include:
- Targeting specific processors and instruction sets at either build- or run-time
- Various forms of parallelism
On modern processors, the performance benefits are so great that users can rarely afford not to optimize a large application. Differences in accuracy can result from:
- Different approximations to math functions or operations such as division
- The accuracy with which intermediate results are calculated and stored
- Denormalized (very small) results being treated as zero
- The use of special instructions such as fused multiply-add (FMA) instructions
Special instructions are typically more accurate than the separate multiply and add instructions they replace, but the consequence is still that the final result may change.
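The effect of an FMA can be sketched in Python by emulating the single rounding with exact rational arithmetic. This is an emulation for illustration only, not a hardware FMA, and the function names are hypothetical. The inputs are chosen so that the product a*b lies exactly halfway between two doubles.

```python
from fractions import Fraction

def multiply_add_separate(a, b, c):
    """Round a*b to double, then round the sum again (two roundings)."""
    return a * b + c

def multiply_add_fused(a, b, c):
    """Emulate an FMA: form a*b + c exactly, then round once at the end."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

separate = multiply_add_separate(a, b, c)  # a*b rounds to 1.0, so the sum is 0.0
fused = multiply_add_fused(a, b, c)        # the exact result -2**-54 survives
print(separate, fused)
```

The fused result is the more accurate one here, yet a program that previously printed 0.0 now prints a different value.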
FMA generation is an optimization that may occur at O1 and above for instruction set targets of Intel® Advanced Vector Extensions 2 (Intel® AVX2) and higher. It’s not covered by language standards, so the compiler may optimize differently in different contexts (e.g., for different processor targets, even when both targets support FMA instructions).
Probably the most important source of variability, especially for parallel applications, is variations in the order of operations. Although different orderings may be mathematically equivalent, in finite precision arithmetic, the rounding errors change and accumulate differently. A different result doesn’t necessarily mean a less accurate one, though users sometimes consider the unoptimized result as correct.
Examples are transformations such as those in Figure 1, which the compiler may make to improve performance.
Figure 1. (x[i] + y) + z → x[i] + (y + z); a*b + a*c → a*(b + c)
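A short Python sketch shows that such mathematically equivalent rewrites can change a double-precision result. The values are picked by hand to make the effect visible; a compiler would apply the same kind of rewrite to whatever data the program happens to process.

```python
# Reassociation: (x + y) + z versus x + (y + z)
x, y, z = 0.1, 0.2, 0.3
left_to_right = (x + y) + z   # 0.6000000000000001
reassociated = x + (y + z)    # 0.6

# With terms of very different magnitude the difference is no longer tiny.
big = 1e16
cancel_first = (big + -big) + 1.0   # 1.0
cancel_last = big + (-big + 1.0)    # 0.0, because -big + 1.0 rounds back to -big

# Factoring: a*b + a*c versus a*(b + c)
a, b, c = 3.0, 1e16, 1.0
distributed = a * b + a * c   # 3.0000000000000004e16
factored = a * (b + c)        # 3e16, because b + c rounds back to b

print(left_to_right, reassociated)
print(cancel_first, cancel_last)
print(distributed, factored)
```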
The optimizations we’ve considered so far impact sequential and parallel applications similarly. For compiled code, they can be controlled or suppressed by compiler options.
Reductions
Reductions are a particularly important example showing how results depend on the order of FP operations. We take summation as an example, but the discussion also applies to other reductions such as product, maximum, and minimum. Parallel implementations of summations break these down into partial sums, one per thread (e.g., for OpenMP*), per process (e.g., for MPI), or per SIMD lane (for vectorization). All of these partial sums can then be safely incremented in parallel. Figure 2 shows an example.
Note that the order in which the elements of A are added―and hence the rounding of intermediate results to machine precision―is very different in the two cases. If there are big cancellations between positive and negative terms, the impact on the final result can be surprisingly large. While users tend to consider the first, serial version to be the “correct” result, the parallel version with multiple partial sums tends to reduce the accumulation of rounding errors and give a result closer to what we’d see with infinite precision, especially for large numbers of elements. The parallel version also runs much faster.
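The dependence on how the sum is split can be sketched in Python. The helper names are hypothetical, and the four-element array is deliberately constructed with large cancellations to exaggerate the effect; it says nothing about which ordering is more accurate in general. An explicit loop is used instead of the built-in sum(), which in recent Python versions applies compensated summation to floats.

```python
import math

def serial_sum(values):
    """Add the elements one by one, left to right."""
    total = 0.0
    for v in values:
        total += v
    return total

def chunked_sum(values, parts):
    """One partial sum per contiguous chunk, as for threads or MPI ranks."""
    size = math.ceil(len(values) / parts)
    partials = [serial_sum(values[i:i + size]) for i in range(0, len(values), size)]
    return serial_sum(partials)

def lane_sum(values, lanes):
    """One partial sum per interleaved lane, as for SIMD vectorization."""
    partials = [serial_sum(values[lane::lanes]) for lane in range(lanes)]
    return serial_sum(partials)

data = [1e16, 1.0, -1e16, 1.0]

serial_result = serial_sum(data)       # 1.0
chunked_result = chunked_sum(data, 2)  # 0.0
lane_result = lane_sum(data, 2)        # 2.0
reference = math.fsum(data)            # 2.0, the correctly rounded sum

print(serial_result, chunked_result, lane_result, reference)
```

Three ways of grouping the same four numbers give three different answers, and only one of them matches the correctly rounded sum.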
Can Reductions be Reproducible?
For reductions to be reproducible, the composition of the partial sums must not change. For vectorization, that means the vector length must not change. For OpenMP, it means that the number of threads must be constant. For Intel® MPI Library, it means the number of ranks must not change. Also, the partial sums must be added together in the same, fixed order. This happens automatically for vectorization.
For OpenMP threading, the standard allows partial sums to be combined in any order. In Intel’s implementation, the default is first come, first served for low numbers of threads (fewer than four for Intel® Xeon® processors, fewer than eight on Intel® Xeon Phi™ processors). To ensure that the partial sums are added in a fixed order, you should set the environment variable KMP_DETERMINISTIC_REDUCTION=true and use static scheduling (the default scheduling protocol).
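Why the combination order matters can be sketched in Python by combining the same partial sums in every possible arrival order. This is a toy model, not the OpenMP runtime: the partial sums are hypothetical values chosen to make the effect visible, and combine() is an illustrative helper.

```python
from itertools import permutations

def combine(partials):
    """Add partial sums in the order given."""
    total = 0.0
    for p in partials:
        total += p
    return total

# Hypothetical per-thread partial sums.
partial_sums = [1e16, 1.0, -1e16, 1.0]

# Fixed order: always thread 0, 1, 2, 3.
fixed_order_result = combine(partial_sums)

# First come, first served: any arrival order is possible.
arrival_order_results = {combine(order) for order in permutations(partial_sums)}

print(fixed_order_result)
print(sorted(arrival_order_results))  # more than one distinct value
```

With a fixed combination order the result is always the same value; with arrival order, any member of the set may appear from one run to the next.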
Intel® Threading Building Blocks (TBB) uses dynamic scheduling, so the parallel_reduce() method does not produce run-to-run reproducible results. However, an alternative method, parallel_deterministic_reduce(), is supported. This creates fixed tasks to compute partial sums and then a fixed, ordered tree for combining them. The dynamic scheduler can then schedule the tasks as it sees fit, provided it respects the dependencies between them. This not only yields reproducible results from run to run in the same environment, it ensures the results remain reproducible, even when the number of worker threads is varied. (The OpenMP standard doesn’t provide for an analogous reduction operation based on a fixed tree, but one can be written by making use of OpenMP tasks and dependencies.)
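The idea of fixed tasks plus a fixed combination tree can be sketched in Python. This is a conceptual model under stated assumptions, not TBB's implementation: the function names, the grain_size parameter, and the use of a thread pool are all illustrative choices.

```python
from concurrent.futures import ThreadPoolExecutor

def serial_sum(values):
    total = 0.0
    for v in values:
        total += v
    return total

def tree_combine(partials):
    """Combine partial sums pairwise in a fixed, ordered binary tree."""
    level = list(partials)
    while len(level) > 1:
        next_level = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an odd element is carried up unchanged
            next_level.append(level[-1])
        level = next_level
    return level[0] if level else 0.0

def deterministic_reduce(values, grain_size, workers):
    # The chunk boundaries depend only on grain_size, never on the worker count.
    chunks = [values[i:i + grain_size] for i in range(0, len(values), grain_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in chunk order, whatever order the tasks finish in.
        partials = list(pool.map(serial_sum, chunks))
    return tree_combine(partials)

data = [0.1 * k for k in range(1, 1001)]
results = {deterministic_reduce(data, grain_size=64, workers=w) for w in (1, 2, 7)}
print(results)  # a single value, independent of the number of workers
```

Because both the chunking and the tree are fixed, the scheduler is free to run the tasks in any order without affecting the result. The result may still differ from a plain left-to-right sum of the same data.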
For Intel MPI Library, the order in which partial results are combined may be optimized according to how MPI ranks are distributed among processor nodes, so results can change when that distribution changes. Unless the distribution is kept fixed, the only way to get reproducible results is to choose from a restricted set of reduction algorithms that are topology unaware (see below).
In all the examples we’ve looked at, the result of a parallel or vectorized reduction will normally be different from the sequential result. If that’s not acceptable, the reduction must be performed by a single thread, process, or SIMD lane. For the latter, that means compiling with /fp:precise (Windows*) or -fp-model precise (Linux* or macOS*) to ensure that reduction loops are not automatically vectorized.
The Intel® Compiler
The high-level option /fp:consistent (Windows) or -fp-model consistent (Linux and macOS) is recommended for best reproducibility between runs, between different optimization levels, and between different processor types of the same architecture. It’s equivalent to the following set of options: /Qfma- (-no-fma) to disable FMA generation; /Qimf-arch-consistency:true (-fimf-arch-consistency=true) to limit math functions to implementations that give the same result on all processor types; and /fp:precise (-fp-model precise) to disable other compiler optimizations that might cause variations in results.
This reproducibility comes at some cost in performance. How much is application-dependent, but performance loss of about 10% is common. The impact is typically greatest for compute-intensive applications with many vectorizable loops containing floating-point reductions or calls to transcendental math functions. It can sometimes be mitigated by adding the option /Qimf-use-svml (-fimf-use-svml), which causes the short vector math library to be used for scalar calls to math functions as well as for vector calls, ensuring consistency and re-enabling automatic vectorization of loops containing math functions.
The default option /fp:fast (-fp-model fast) allows the compiler to optimize without regard for reproducibility. If the only requirement is repeatability (that is, repeated runs of the same executable on the same processor with the same data yield the same result), it may be sufficient to recompile with /Qopt-dynamic-align- (-qno-opt-dynamic-align). This disables only the generation of peel loops that test for data alignment at run-time and has far less impact on performance than the /fp (-fp-model) options discussed above.
Reproducibility between Different Compilers and Operating Systems
Reproducibility is constrained between different compilers and operating systems by the lack of generally accepted requirements for the results of most math functions. Adhering to an eventual standard for math functions (e.g., one that required exact rounding) would improve consistency, but at a significant cost in performance.
There’s currently no systematic testing of the reproducibility of results for code targeting different operating systems such as Windows and Linux. The options /Qimf-use-svml (Windows) and -fimf-use-svml (Linux) address certain known sources of differences related to vectorization of loops containing math functions and are recommended for improving consistency between floating-point results on both Windows and Linux.
There’s no way to ensure consistency between application builds that use different major versions of the Intel® Compiler. Improved implementations of math library functions may lead to results that are more accurate but different from previous implementations, though the /Qimf-precision:high (-fimf-precision=high) option may reduce any such differences. Likewise, there’s no way to ensure reproducibility between builds using the Intel Compiler and builds using compilers from other vendors. Using options such as /fp:consistent (-fp-model consistent) and the equivalent for other compilers can help to reduce differences resulting from compiled code. So may using the same math runtime library with both compilers, where possible.
Intel® Math Kernel Library
Intel® Math Kernel Library (Intel® MKL) contains highly optimized functions for linear algebra, fast Fourier transforms, sparse solvers, statistical analyses, and other domains that may be vectorized and threaded internally using OpenMP or TBB. By default, repeated runs on the same processor might not give identical results due to variations in the order of operations within an optimized function. In addition, Intel MKL functions detect the processor on which they are running and execute a code path that’s optimized for that processor, so runs on different processors may yield different results. To overcome this, Intel MKL has implemented conditional numerical reproducibility. The conditions are:
- Use the version of Intel MKL layered on OpenMP, not on TBB
- Keep the number of threads constant
- Use static scheduling (OMP_SCHEDULE=static, the default)
- Disable dynamic adjustment of the number of active threads (OMP_DYNAMIC=false and MKL_DYNAMIC=false, the default)
- Use the same operating system and architecture (e.g., Intel 64 Linux)
- Use the same microarchitecture or specify a minimum microarchitecture
The minimum microarchitecture may be specified by a function or subroutine call (e.g., mkl_cbwr_set(MKL_CBWR_AVX)) or by setting a run-time environment variable (e.g., MKL_CBWR=AVX). This leads to consistent results on any Intel® processor that supports Intel® AVX or later instruction sets such as Intel® AVX2 or Intel® AVX-512, though at a potential cost in performance on processors that support the more advanced instruction sets. The argument MKL_CBWR_COMPATIBLE would lead to consistent results on any Intel or compatible non-Intel processor of the same architecture. The argument MKL_CBWR_AUTO causes the code path corresponding to the processor detected at runtime to be taken. It ensures that repeated runs on that processor yield the same result, though results on other processor types may differ. If the runtime processor doesn’t support the specified minimum microarchitecture, the executable still runs but takes the code path corresponding to the actual run-time microarchitecture, as if MKL_CBWR_AUTO had been specified. In that case, results may differ from those obtained on other processors, without warning.
The impact on performance from limiting the instruction set can sometimes be substantial for compute-intensive Intel MKL functions. Table 1 shows the relative slowdown of a DGEMM matrix-matrix multiply on an Intel® Xeon® Scalable processor for different choices of the minimum microarchitecture.
Table 1. Effect of instruction set architecture (ISA) on DGEMM running on an Intel Xeon Scalable processor
Performance results are based on testing by Intel as of Sept. 6, 2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Configuration: Instruction set architecture (ISA) on DGEMM running on an Intel Xeon Scalable processor For more complete information visit www.intel.com/benchmarks.
Intel® MPI Library
Results using Intel MPI Library are reproducible provided that:
- Compiled code and library calls respect the reproducibility conditions for the compiler and libraries
- Nothing in the MPI and cluster environment changes, including the number of ranks and the processor type
As usual, collective operations like summations and other reductions are the most sensitive to small changes in the environment. Many implementations of collective operations are optimized according to how the MPI ranks are distributed over cluster nodes, which can lead to changed orders of operations and variations in results. Intel MPI Library supports conditional numerical reproducibility in the sense that an application will get reproducible results for the same binary, even when the distribution of ranks over nodes varies. This requires selecting an algorithm that’s topology unaware―that is, one that doesn’t optimize according to the distribution of ranks over nodes, using the I_MPI_ADJUST_ family of environment variables:
- I_MPI_ADJUST_ALLREDUCE
- I_MPI_ADJUST_REDUCE
- I_MPI_ADJUST_REDUCE_SCATTER
- I_MPI_ADJUST_SCAN
- I_MPI_ADJUST_EXSCAN
- And others
For example, Intel MPI Library Developer Reference documents 11 different implementations of MPI_REDUCE(), of which the first seven are listed in Table 2.
Table 2. Comparison of results from MPI_REDUCE() for different rank distributions
Two nodes of Intel® Core™ i5-4670T processors at 2.30 GHz, 4 cores and 8 GB memory each, one running Red Hat* EL 6.5, the other running Ubuntu* 16.04. The sample code from “Intel® MPI Library Conditional Reproducibility” in The Parallel Universe, issue 21, was used (see references below).
Table 2 compares results from a sample program for a selection of implementations of MPI_REDUCE() and for four different distributions of eight MPI ranks over two cluster nodes. The five colors correspond to five different results that were observed. The differences in results are very small―close to the limit of precision―but small differences can sometimes get amplified by cancellations in a larger computation. The topology-independent implementations gave the same result, no matter what the distribution of ranks over nodes, whereas the topology-aware implementations did not. The default implementation of MPI_REDUCE (not shown) is a blend of algorithms that depend on workload as well as topology. It also gives results that vary with the distribution of ranks over nodes.
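The difference between topology-aware and topology-unaware reductions can be mimicked with a toy model in Python. This is only a simulation under simplifying assumptions, not how Intel MPI Library implements any of its algorithms: the function names, the four rank values, and the two placements are hypothetical, and the values are chosen to exaggerate an effect that is normally close to the limit of precision.

```python
def ordered_sum(values):
    total = 0.0
    for v in values:
        total += v
    return total

def topology_aware_reduce(rank_values, node_of_rank):
    """Sum within each node first, then across nodes.
    The grouping, and hence the rounding, depends on rank placement."""
    per_node = {}
    for rank, value in enumerate(rank_values):
        per_node.setdefault(node_of_rank[rank], []).append(value)
    node_sums = [ordered_sum(per_node[node]) for node in sorted(per_node)]
    return ordered_sum(node_sums)

def topology_unaware_reduce(rank_values, node_of_rank):
    """Combine strictly in rank order; placement is ignored."""
    return ordered_sum(rank_values)

rank_values = [1e16, 1.0, -1e16, 1.0]  # one contribution per rank
block_placement = [0, 0, 1, 1]         # ranks 0-1 on node 0, ranks 2-3 on node 1
round_robin_placement = [0, 1, 0, 1]   # ranks alternate between the two nodes

aware_block = topology_aware_reduce(rank_values, block_placement)              # 0.0
aware_round_robin = topology_aware_reduce(rank_values, round_robin_placement)  # 2.0
unaware_block = topology_unaware_reduce(rank_values, block_placement)          # 1.0
unaware_round_robin = topology_unaware_reduce(rank_values, round_robin_placement)  # 1.0

print(aware_block, aware_round_robin)
print(unaware_block, unaware_round_robin)
```

In this model the topology-unaware result is reproducible across placements, while the topology-aware result changes when the ranks move between nodes.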
Bottom Line
Intel® Software Development Tools provide methods for obtaining reproducible FP results under clearly defined conditions.
References
- “Consistency of Floating-Point Results using the Intel® Compiler”
- Developer Guide for Intel® Math Kernel Library 2019 for Linux*, section “Obtaining Numerically Reproducible Results”
- “Intel® MPI Library Conditional Reproducibility,” The Parallel Universe, issue 21
- “Tuning the Intel MPI Library: Basic Techniques,” section “Tuning for Numerical Stability”