Accelerating the Eigen Math Library for Automated Driving Workloads

Meeting the Need for Speed with Intel® Math Kernel Library

This article recently appeared in Issue 31 of The Parallel Universe magazine.

Automated driving workloads have several matrix operations at their core. Sensor fusion and localization algorithms―such as the different variants of the Kalman* filter―are critical components of the automated driving software pipeline. The Intel® Math Kernel Library (Intel® MKL) is a powerhouse of tuned subprograms for numerous math operations, including a fast DGEMM. The automated driving developer community typically uses Eigen*,1 a C++ math library, for matrix operations. In addition to Intel MKL, LIBXSMM*,2,3 a highly tuned library for high-performance matrix-matrix multiplications, shows potential to speed up matrix operations. In this article, we investigate and improve the performance of native Eigen on matrix multiplication benchmarks and the extended Kalman filter (EKF)4,5 by using Intel MKL and LIBXSMM with GNU* and Intel® compilers on the Intel® Xeon® processor.

The Need for Speed in Kalman Filtering

The automated driving pipeline is a series of computational blocks: perception, which acquires information on the driving environment from sensors such as cameras, RADAR, and LIDAR; sensor fusion and localization; path planning; and finally actuation of vehicle controls such as steering angle and throttle. Performance optimizations across the entire software pipeline are crucial for meeting strict end-to-end latency requirements. Each component of the pipeline is typically assigned a tight latency budget that needs to be met almost 100 percent of the time. In this study, we focus on speeding up the EKF, an important component of sensor fusion and localization.

Extended Kalman Filter Algorithm

The EKF is a simple―yet extremely powerful―algorithm that makes predictions about the state of the vehicle (e.g., Cartesian position coordinates, velocities, yaw angle). The EKF repeats two consecutive steps over several iterations:

  • The prediction step estimates values of current variables and their uncertainties based on motion models, including changes in values over time.
  • The update step occurs when the next set of measurements arrives from the sensors. This phase refines the predicted estimate using a weighted average of the prediction and the estimate from the current measurement, where higher weights correspond to lower uncertainty.6

In particular, this algorithm predicts the position of the vehicle (px, py) and its velocity (vx, vy) from noisy LIDAR and RADAR sensor measurements. The coupled estimate of the vehicle’s position obtained by fusing RADAR and LIDAR is more accurate than the estimate from either noisy sensor alone. LIDAR measurements localize an object directly in Cartesian coordinate form (px, py). RADAR measurements are typically in polar coordinate form and can be converted to Cartesian coordinates, yielding measurements at a lower resolution than those from LIDAR.6
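
For illustration (our own sketch, not code from the article), the standard polar-to-Cartesian mapping for a RADAR measurement with range rho, bearing phi, and range rate rho_dot is:

#include <cmath>

// Convert a polar RADAR measurement to Cartesian position and velocity.
// Note: rho_dot observes only the radial velocity component, so vx and vy
// here are projections, not the full velocity vector.
void polar_to_cartesian(double rho, double phi, double rho_dot,
                        double& px, double& py, double& vx, double& vy) {
  px = rho * std::cos(phi);
  py = rho * std::sin(phi);
  vx = rho_dot * std::cos(phi);
  vy = rho_dot * std::sin(phi);
}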

Table 1 shows the vectors and matrices the EKF uses to represent different states and estimates.4,5

Table 1. Matrices and vectors used by EKF

Predict
x’ = F * x + u              Predicted state estimate
P’ = F * P * F^T + Q        Predicted covariance estimate

Measurement Update
y = z – H * x’              Innovation, or measurement residual
S = H * P’ * H^T + R        Innovation (or residual) covariance
K = P’ * H^T * S^-1         Near-optimal Kalman gain
x = x’ + K * y              Updated state estimate
P = (I – K * H) * P’        Updated covariance estimate
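
To make Table 1 concrete, the predict and update equations map almost one-to-one onto Eigen expressions. The following is a minimal sketch (our own illustration, not the EKF implementation benchmarked later), with matrix names mirroring the table:

#include <Eigen/Dense>
using Eigen::MatrixXd;
using Eigen::VectorXd;

// One EKF iteration following Table 1.
// x: state, P: state covariance, F: state transition model, Q: process noise,
// H: measurement model, R: measurement noise, z: measurement, u: control input.
void ekf_step(VectorXd& x, MatrixXd& P,
              const MatrixXd& F, const MatrixXd& Q,
              const MatrixXd& H, const MatrixXd& R,
              const VectorXd& z, const VectorXd& u) {
  // Predict
  VectorXd x_pred = F * x + u;                        // x' = F*x + u
  MatrixXd P_pred = F * P * F.transpose() + Q;        // P' = F*P*F^T + Q

  // Measurement update
  VectorXd y = z - H * x_pred;                        // innovation
  MatrixXd S = H * P_pred * H.transpose() + R;        // innovation covariance
  MatrixXd K = P_pred * H.transpose() * S.inverse();  // near-optimal Kalman gain
  x = x_pred + K * y;                                 // updated state estimate
  const auto n = x.size();
  P = (MatrixXd::Identity(n, n) - K * H) * P_pred;    // updated covariance
}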

Important Math Libraries

Intel MKL provides highly optimized, threaded, and vectorized math functions that maximize performance on Intel® processor architectures. It is compatible with many different compilers, languages, operating systems, and linking and threading models. Most important for our purposes, it provides a highly tuned DGEMM function for matrix-matrix multiplication. To eliminate the overhead of additional error checking for DGEMM on small matrices, Intel MKL provides the -DMKL_DIRECT_CALL compiler flag, which ensures that the fastest code path is used at runtime.7
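
As a minimal illustration of a direct MKL DGEMM call (the sizes and values are ours), the CBLAS interface can be used as follows; the sketch assumes compilation with -DMKL_DIRECT_CALL and linking against Intel MKL:

#include <mkl.h>  // assumes -DMKL_DIRECT_CALL and the Intel MKL link line

// C = 1.0 * A * B + 0.0 * C for small 4x4 double-precision matrices,
// the size regime where MKL_DIRECT_CALL pays off.
int main() {
  const int n = 4;
  double A[n * n], B[n * n], C[n * n];
  for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }
  cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              n, n, n, 1.0, A, n, B, n, 0.0, C, n);
  return 0;
}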

Eigen is an open-source, easy-to-use C++ library that provides operations ranging from matrix math to geometry algorithms. It supports vectorization across different levels of SSE and AVX. Eigen can take advantage of Intel MKL through the -DEIGEN_USE_MKL_ALL compiler flag.
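
Since the flag requires no source changes, a minimal sketch looks like this; the build commands in the comments are typical invocations we assume, not ones prescribed by Eigen:

#include <Eigen/Dense>
#include <iostream>

// Native Eigen:   g++ -O3 gemm.cpp
// Eigen + MKL:    g++ -O3 -DEIGEN_USE_MKL_ALL gemm.cpp <Intel MKL link line>
int main() {
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(32, 32);
  Eigen::MatrixXd B = Eigen::MatrixXd::Random(32, 32);
  Eigen::MatrixXd C = A * B;  // dispatched to Intel MKL DGEMM when enabled
                              // (subject to the small-size heuristic
                              // discussed below)
  std::cout << C(0, 0) << std::endl;
  return 0;
}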

LIBXSMM is an open-source, high-performance library tuned for fast matrix-matrix multiplication on very small matrices.2,3 It generates just-in-time (JIT) code for small matrix-matrix multiplication kernels for various instruction sets, including SSE, AVX, AVX2, and AVX512. LIBXSMM is best suited to matrices where (M * N * K)^(1/3) is less than 80. It achieves high performance through its modular design―specifically, a separate frontend (high-level language and routine selection) and backend for xGEMM code generation.2 LIBXSMM provides a simple S/DGEMM-style interface that can be integrated into an application with very little effort.

Figures 1, 2, and 3 show three modes in which LIBXSMM can be used for matrix multiplication.

During installation, LIBXSMM can be built explicitly for:

  • Particular M, N, and K values
  • Leading dimension values that differ from M, N, and K values
  • Specific values of α and β
void libxsmm_smm(int m, int n, int k, const float* a, const float* b, float* c);
void libxsmm_dmm(int m, int n, int k, const double* a, const double* b, double* c);

Figure 1. Automatically dispatched matrix multiplication API in LIBXSMM2

void libxsmm_simm(int m, int n, int k, const float* a, const float* b, float* c);
void libxsmm_dimm(int m, int n, int k, const double* a, const double* b, double* c);

Figure 2. Non-dispatched matrix multiplication API in LIBXSMM2

void libxsmm_sblasmm(int m, int n, int k, const float* a, const float* b, float* c);
void libxsmm_dblasmm(int m, int n, int k, const double* a, const double* b, double* c);

Figure 3. LIBXSMM API for matrix multiplication using BLAS8
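
A usage sketch of the automatically dispatched call from Figure 1 follows; it assumes the API exactly as shown above, with the build-time defaults for α, β, and leading dimensions described in the list earlier, and with sizes chosen by us:

#include <libxsmm.h>

// C = A * B with M = N = K = 8, using the dispatched double-precision
// entry point from Figure 1; LIBXSMM JITs a kernel for this (m, n, k).
int main() {
  const int m = 8, n = 8, k = 8;
  double a[m * k], b[k * n], c[m * n];
  for (int i = 0; i < m * k; ++i) a[i] = 1.0;
  for (int i = 0; i < k * n; ++i) b[i] = 2.0;
  for (int i = 0; i < m * n; ++i) c[i] = 0.0;
  libxsmm_dmm(m, n, k, a, b, c);
  return 0;
}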


Enabling Eigen with Intel® MKL and LIBXSMM

Even with Intel MKL enabled, Eigen does not call MKL for small matrix multiplications (specifically, when M+N+K is less than 20). To allow Eigen to call the Intel MKL DGEMM function for all matrix sizes, we modify the Eigen source code to eliminate the M+N+K < 20 heuristic.

To enable LIBXSMM in Eigen, we replace Eigen’s native matrix-matrix multiplication implementation with a call to libxsmm_dgemm.
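
The forwarding call itself can be sketched as below. We assume libxsmm_dgemm follows the standard column-major BLAS DGEMM argument convention; that convention is our assumption, since the article only names the function:

#include <libxsmm.h>

// Forward an (M x K) by (K x N) product to libxsmm_dgemm, the BLAS-style
// entry point named above, in place of Eigen's native kernel.
void multiply(int m, int n, int k,
              const double* a, const double* b, double* c) {
  const char transa = 'N', transb = 'N';
  const double alpha = 1.0, beta = 0.0;
  const libxsmm_blasint M = m, N = n, K = k;
  libxsmm_dgemm(&transa, &transb, &M, &N, &K,
                &alpha, a, &M,  // A is M x K, leading dimension M
                b, &K,          // B is K x N, leading dimension K
                &beta, c, &M);  // C is M x N, leading dimension M
}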

Experiment Setup

We examine the performance of two workloads that use Eigen:

  1. A simple DGEMM benchmark that implements DGEMM on a set of square, double-precision matrices
  2. An implementation of EKF that works on synthetically generated RADAR and LIDAR data

We use native Eigen, Eigen with Intel MKL, and Eigen with LIBXSMM in these experiments. All benchmarks are executed in serial.
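
For reference, the DGEMM benchmark in item 1 above has roughly the following shape (a minimal sketch; the exact sizes, repetition counts, and timing harness in our experiments differ):

#include <Eigen/Dense>
#include <chrono>
#include <iostream>

// Time C = A * B for square double-precision matrices and report GFLOP/s.
int main() {
  for (int s : {2, 4, 8, 16, 32, 64}) {
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(s, s);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(s, s);
    Eigen::MatrixXd C(s, s);
    const int reps = 100000;
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) C.noalias() = A * B;
    auto t1 = std::chrono::steady_clock::now();
    const double sec = std::chrono::duration<double>(t1 - t0).count();
    std::cout << "size " << s << ": "
              << 2.0 * s * s * s * reps / sec * 1e-9 << " GFLOP/s\n";
  }
  return 0;
}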

Table 2 details our library and compiler versions and hardware specifications.

Table 2. Library, compiler, and hardware specifications

Benchmarking DGEMM on Intel® Xeon® Processor

In this DGEMM benchmark, our figure of merit is the improvement in performance (gigaflops) over native Eigen compiled with g++. With the exception of matrix sizes 2 and 4, both Eigen with Intel MKL and Eigen with LIBXSMM provide a speedup over native Eigen across all classes of matrices. Notably, native Eigen has the lowest performance regardless of whether it is compiled with the GNU or Intel® compilers (Figures 4 and 5). In terms of performance improvement, the overall trend is that:

  • Eigen+LIBXSMM produces the highest performance across all matrices (excluding matrix sizes 2 and 4).
  • Eigen+LIBXSMM with g++ produces the highest speedup for matrix sizes smaller than 13.
  • Eigen+LIBXSMM with ICPC produces the highest speedup across all g++ and ICPC variants for matrix sizes greater than 13.

Figure 4. Speedup over native Eigen with Intel MKL and LIBXSMM using g++ 7.2

Figure 5. Speedup over native Eigen with Intel MKL and LIBXSMM using the Intel® C++ Compiler


Evaluating and Speeding Up the Extended Kalman Filter

We evaluate EKF using native Eigen, Eigen with Intel MKL, and Eigen with LIBXSMM. From our earlier DGEMM benchmarking, we saw that g++ provides higher performance for matrix sizes smaller than 13. Since EKF works on such small matrices, we evaluate the EKF speedup using g++, with the native-Eigen EKF as the baseline. Our figure of merit is the median time to predict and update each sensor measurement (a total of 10,000 sensor measurements were processed). As Figure 6 shows, incorporating Intel MKL or LIBXSMM produces a speedup of approximately 1.2X in EKF.

Figure 6. Speedup in Extended Kalman filter from using Eigen with Intel MKL and Eigen with LIBXSMM


Improving Performance

In this article, we concentrated on speeding up EKF, a common automated driving workload used for sensor fusion and localization. We investigated this performance improvement on the Intel Xeon processor in two ways:

  • Speeding up the matrix-matrix multiplication kernel in native Eigen by using Intel MKL and LIBXSMM
  • Improving the performance of the overall EKF workload

We showed a maximum speedup of 3.1X over native Eigen from using Eigen+LIBXSMM with the Intel C++ compiler, and a 1.2X speedup in EKF from using Intel MKL and LIBXSMM.

References

1. Eigen: http://eigen.tuxfamily.org
2. LIBXSMM: http://sc15.supercomputing.org/sites/all/themes/SC15images/tech_poster/poster_files/post137s2-file2.pdf
3. LIBXSMM code repository: https://github.com/hfp/libxsmm
4. Extended Kalman filter: https://en.wikipedia.org/wiki/Extended_Kalman_filter
5. Extended Kalman filter tutorial: https://www.cse.sc.edu/~terejanu/files/tutorialEKF.pdf; Object tracking and fusing sensor measurements using the extended Kalman filter algorithm: https://medium.com/@mithi/object-tracking-and-fusing-sensor-measurements-using-the-extended-kalman-filter-algorithm-part-1-f2158ef1e4f0
6. Intel Math Kernel Library benchmarks: https://software.intel.com/en-us/mkl/features/benchmarks
7. Improve performance of Intel MKL for small problems with MKL_DIRECT_CALL: https://software.intel.com/en-us/node/725700
8. Basic linear algebra subprograms (BLAS): http://www.netlib.org/blas/

For more complete information about compiler optimizations, see our Optimization Notice.