Tuning Autonomous Driving Using Intel® System Studio
Intel® GO™ SDK Offers Automotive Solution Developers an Integrated Solutions Environment
The Internet of Things is a collection of smart devices connected to the cloud. “Things” can be as small and simple as a connected watch or a smartphone, or they can be as large and complex as a car. In fact, cars are rapidly becoming some of the world’s most intelligent connected devices, using sensor technology and powerful processors to sense and continuously respond to their surroundings. Powering these cars requires a complex set of technologies:
- Sensors that pick up LIDAR, sonar, radar, and optical signals
- A sensor fusion hub that gathers millions of data points
- A microprocessor that processes the data
- Machine learning algorithms that require an enormous amount of computing power to make the data intelligent and useful
Successfully realizing the enormous opportunities of these automotive innovations has the potential to not only change driving but also to transform society.
Intel® GO™ Automotive Software Development Kit (SDK)
From car to cloud―and the connectivity in between―there is a need for automated driving solutions that include high-performance platforms, software development tools, and robust technologies for the data center. With Intel GO automotive driving solutions, Intel brings its deep expertise in computing, connectivity, and the cloud to the automotive industry.
Autonomous driving on a global scale takes more than high-performance sensing and computing in the vehicle. It requires an extensive infrastructure of data services and connectivity. This data will be shared with all autonomous vehicles to continuously improve their ability to accurately sense and safely respond to surroundings. To communicate with the data center, infrastructure on the road, and other cars, autonomous vehicles will need high-bandwidth, reliable two-way communication along with extensive data center services to receive, label, process, store, and transmit huge quantities of data every second. The software stack within autonomous driving systems must be able to efficiently handle demanding real-time processing requirements while minimizing power consumption.
The Intel GO automotive SDK helps developers and system designers maximize hardware capabilities with a variety of tools:
- Computer vision, deep learning, and OpenCL™ toolkits to rapidly develop the necessary middleware and algorithms for perception, fusion, and decision-making
- Sensor data labeling tool for the creation of “ground truth” for deep learning training and environment modeling
- Autonomous driving-targeted performance libraries, leading compilers, performance and power analyzers, and debuggers to enable full stack optimization and rapid development in a functional safety compliance workflow
- Sample reference applications, such as lane change detection and object avoidance, to shorten the learning curve for developers
Intel® System Studio
Intel also provides software development tools that help accelerate time to market for automated driving solutions. Intel System Studio is a comprehensive, integrated tool suite that includes compilers, performance libraries, power and performance analyzers, and debuggers. These tools maximize hardware capabilities while speeding the pace of development, helping accelerate the delivery of next-generation, power-efficient, high-performance, and reliable embedded and mobile devices. The suite includes tools to:
- Build and optimize your code
- Debug and trace your code to isolate and resolve defects
- Analyze your code for power, performance, and correctness
Build and Optimize Your Code
- Intel® C++ Compiler: A high-performance, optimized C and C++ cross-compiler that can offload compute-intensive code to Intel® HD Graphics.
- Intel® Math Kernel Library (Intel® MKL): A set of highly optimized linear algebra, fast Fourier transform (FFT), vector math, and statistics functions.
- Intel® Threading Building Blocks (Intel® TBB): C++ parallel computing templates to boost embedded system performance.
- Intel® Integrated Performance Primitives (Intel® IPP): A software library that provides a broad range of highly optimized functionality including general signal and image processing, computer vision, data compression, cryptography, and string manipulation.
Debug and Trace Your Code to Isolate and Resolve Defects
- Intel® System Debugger: Includes a System Debug feature for source-level debugging of OS kernel software, drivers, and firmware, plus a System Trace feature. System Trace is an Eclipse* plug-in that accesses the Intel® Trace Hub and provides advanced SoC-wide instruction and data event tracing through its trace viewer.
- GNU* Project Debugger: This Intel-enhanced GDB is for debugging applications natively and remotely on Intel® architecture-based systems.
Analyze Your Code for Power, Performance, and Correctness
- Intel® VTune™ Amplifier: This software performance analysis tool is for users developing serial and multithreaded applications.
- Intel® Energy Profiler: A platform-wide energy consumption analyzer of power-related data collected on a target platform using the SoC Watch tool.
- Intel® Performance Snapshot: Provides a quick, simple view into performance optimization opportunities.
- Intel® Inspector: A dynamic memory and threading error-checking tool for users developing serial and multithreaded applications on embedded platforms.
- Intel® Graphics Performance Analyzers: Real-time, system-level performance analyzers to optimize CPU/GPU workloads.
Optimizing Performance
Advanced Hotspot Analysis
Matrix multiplication is a commonly used operation in autonomous driving. Intel System Studio tools, mainly the performance analyzers and libraries, can help maximize performance. Consider this example of a very naïve implementation of matrix multiplication using three nested for-loops:
void multiply0(int msize, int tidx, int numt, TYPE a[][NUM], TYPE b[][NUM],
               TYPE c[][NUM], TYPE t[][NUM])
{
    int i, j, k;
    // Basic serial implementation
    for (i = 0; i < msize; i++) {
        for (j = 0; j < msize; j++) {
            for (k = 0; k < msize; k++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }
}
Advanced hotspot analysis is a fast and easy way to identify performance-critical code sections (hotspots). The periodic instruction pointer sampling performed by Intel VTune Amplifier identifies the code locations where an application spends the most time. A function may consume a lot of time either because its code is slow or because it is called frequently. Either way, any improvement in the speed of such a function should have a big impact on overall application performance.
Running an advanced hotspot analysis on the previous matrix multiplication code using Intel VTune Amplifier shows a total elapsed time of 22.9 seconds (Figure 1). Of that time, the CPU was actively executing for 22.6 seconds. The CPI rate (i.e., cycles per instruction) of 1.142 is flagged as a problem. Modern superscalar processors can issue four instructions per cycle, suggesting an ideal CPI of 0.25, but various effects in the pipeline―like long latency memory instructions, branch mispredictions, or instruction starvation in the front end―tend to increase the observed CPI. A CPI of one or less is considered good but different application domains will have different expected values. In our case, we can further analyze the application to see if the CPI can be lowered. Intel VTune Amplifier’s advanced hotspot analysis also indicates the top five hotspot functions to consider for optimization.
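To make the CPI arithmetic concrete, the following sketch computes CPI from raw cycle and instruction counts and compares it with the ideal value for a given issue width. The function names are illustrative assumptions; they are not part of Intel VTune Amplifier, which reports the CPI rate directly.

```c
#include <stdint.h>

/* Cycles per instruction: core clock cycles divided by instructions retired.
   Returns 0.0 when no instructions were retired, to avoid dividing by zero. */
double cpi(uint64_t cycles, uint64_t instructions_retired)
{
    if (instructions_retired == 0)
        return 0.0;
    return (double)cycles / (double)instructions_retired;
}

/* Best-case CPI for a core that can issue issue_width instructions per cycle. */
double ideal_cpi(unsigned issue_width)
{
    return issue_width ? 1.0 / (double)issue_width : 0.0;
}

/* How many times larger the observed CPI is than the ideal CPI. */
double cpi_gap(double observed_cpi, unsigned issue_width)
{
    double ideal = ideal_cpi(issue_width);
    return ideal > 0.0 ? observed_cpi / ideal : 0.0;
}
```

For the run above, cpi_gap(1.142, 4) is 1.142 / 0.25, or roughly 4.6 times the four-wide ideal.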
CPU Utilization
As shown in Figure 2, analysis of the original code indicates that only one of the 88 logical CPUs is being used. This means there is significant room for performance improvement if we can parallelize this sample code.
Parallelizing the sample code as shown below gives an immediate 12x speedup (Figure 3). The CPI has also dropped below 1, a significant improvement.
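void multiply1(int msize, int tidx, int numt, TYPE a[][NUM], TYPE b[][NUM],
               TYPE c[][NUM], TYPE t[][NUM])
{
    int i, j, k;
    // Basic parallel implementation: rows of c are divided among threads.
    // j and k are declared outside the loops, so they must be listed as
    // private; otherwise every thread would share the same two counters.
    #pragma omp parallel for private(j, k)
    for (i = 0; i < msize; i++) {
        for (j = 0; j < msize; j++) {
            for (k = 0; k < msize; k++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }
}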
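Before trusting a speedup, it is worth confirming that the parallel kernel still computes the same product as the serial one. The sketch below does this on a small matrix. NUM, TYPE, the function names, and the input values are assumptions chosen for illustration, and the kernels omit the unused tidx, numt, and t parameters. Threading is enabled only when the code is built with the compiler's OpenMP option; otherwise the directive is compiled out and both kernels run serially.

```c
#define NUM 64            /* assumed matrix dimension for this sketch */
typedef double TYPE;      /* assumed element type */

/* Serial reference kernel: c += a * b. */
void multiply_serial(int msize, TYPE a[][NUM], TYPE b[][NUM], TYPE c[][NUM])
{
    for (int i = 0; i < msize; i++)
        for (int j = 0; j < msize; j++)
            for (int k = 0; k < msize; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Parallel kernel: each thread owns whole rows of c, so no two threads
   write the same element. j and k are declared inside the loops, which
   makes them private to each thread. */
void multiply_parallel(int msize, TYPE a[][NUM], TYPE b[][NUM], TYPE c[][NUM])
{
#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (int i = 0; i < msize; i++)
        for (int j = 0; j < msize; j++)
            for (int k = 0; k < msize; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Runs both kernels on the same inputs and returns the number of entries
   where they disagree or where the serial result differs from the
   closed-form answer NUM * (i + 1) * (j + 1). */
int count_mismatches(void)
{
    static TYPE a[NUM][NUM], b[NUM][NUM], c_ser[NUM][NUM], c_par[NUM][NUM];
    int mismatches = 0;

    for (int i = 0; i < NUM; i++) {
        for (int j = 0; j < NUM; j++) {
            a[i][j] = i + 1;          /* every entry in row i is i + 1 */
            b[i][j] = j + 1;          /* every entry in column j is j + 1 */
            c_ser[i][j] = 0;
            c_par[i][j] = 0;
        }
    }

    multiply_serial(NUM, a, b, c_ser);
    multiply_parallel(NUM, a, b, c_par);

    for (int i = 0; i < NUM; i++) {
        for (int j = 0; j < NUM; j++) {
            TYPE expected = (TYPE)NUM * (i + 1) * (j + 1);
            if (c_ser[i][j] != expected || c_par[i][j] != c_ser[i][j])
                mismatches++;
        }
    }
    return mismatches;
}
```

The inputs are small integers so that every product and sum is exact in double precision, which allows an exact comparison instead of a tolerance.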
General Exploration Analysis
Once you have used Basic Hotspots or Advanced Hotspots analysis to determine hotspots in your code, you can perform General Exploration analysis to understand how efficiently your code is passing through the core pipeline. During General Exploration analysis, Intel VTune Amplifier collects a complete list of events for analyzing a typical client application. It calculates a set of predefined ratios used for the metrics and facilitates identifying hardware-level performance problems. Superscalar processors can be conceptually divided into the front end (where instructions are fetched and decoded into the operations that constitute them) and the back end (where the required computation is performed). Each cycle, the front end can place only a limited number of operations into pipeline slots, so a given number of cycles sets an upper bound on the useful work that can retire. General Exploration analysis estimates how the available pipeline slots were actually spent and breaks them into four categories:
- Pipeline slots containing useful work that issued and retired (retired)
- Pipeline slots containing useful work that issued and canceled (bad speculation)
- Pipeline slots that could not be filled with useful work due to problems in the front end (front-end bound)
- Pipeline slots that could not be filled with useful work due to a backup in the back end (back-end bound)
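As a rough illustration of how such a breakdown can be derived, the sketch below follows one commonly published formulation of the top-down method for a four-wide core. The struct fields are hypothetical stand-ins for hardware event counts, not Intel VTune Amplifier API names, and the events actually used vary by microarchitecture.

```c
#include <stdint.h>

#define SLOTS_PER_CYCLE 4   /* assumed issue width */

/* Hypothetical raw counts gathered over one measurement interval. */
typedef struct {
    uint64_t cycles;               /* unhalted core clock cycles */
    uint64_t slots_not_delivered;  /* slots the front end left empty */
    uint64_t uops_issued;          /* operations issued to the back end */
    uint64_t uops_retired;         /* operations that retired */
    uint64_t recovery_cycles;      /* cycles recovering from mis-speculation */
} PipelineCounts;

/* Fraction of all pipeline slots in each category; the four sum to 1. */
typedef struct {
    double retiring;
    double bad_speculation;
    double front_end_bound;
    double back_end_bound;
} SlotBreakdown;

SlotBreakdown classify_slots(PipelineCounts n)
{
    SlotBreakdown r = {0.0, 0.0, 0.0, 0.0};
    double total_slots = (double)SLOTS_PER_CYCLE * (double)n.cycles;
    if (total_slots == 0.0)
        return r;

    /* Useful work that issued and retired. */
    r.retiring = (double)n.uops_retired / total_slots;

    /* Slots the front end could not fill. */
    r.front_end_bound = (double)n.slots_not_delivered / total_slots;

    /* Work that issued but never retired, plus slots lost while recovering. */
    r.bad_speculation = ((double)n.uops_issued - (double)n.uops_retired
                         + SLOTS_PER_CYCLE * (double)n.recovery_cycles)
                        / total_slots;

    /* Whatever remains was blocked by the back end. */
    r.back_end_bound = 1.0 - r.retiring - r.front_end_bound - r.bad_speculation;
    return r;
}
```

Computing the back-end share as the remainder reflects the fact that every slot falls into exactly one of the four categories.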
Figure 4 shows the results of running a general exploration analysis on the parallelized example code using Intel VTune Amplifier. Notice that 77.2 percent of pipeline slots are blocked by back-end issues. Drilling down into the source code shows where these back-end issues occur (Figure 5, 49.4 + 27.8 = 77.2 percent back-end bound). Memory-bound stalls and L3 latency are both very high. The memory bound metric shows how memory subsystem issues affect performance. The L3 bound metric shows how often the CPU stalled on the L3 cache. Avoiding cache misses (L2 misses/L3 hits) improves latency and increases performance.
Memory Access Analysis
The Intel VTune Amplifier’s Memory Access analysis identifies memory-related issues, like NUMA (non-uniform memory access) problems and bandwidth-limited accesses, and attributes performance events to memory objects (data structures). This information is provided from instrumentation of memory allocations/deallocations and getting static/global variables from symbol information.
By selecting the grouping option of the Function/Memory Object/Allocation stack (Figure 6), you can identify the memory objects that are affecting performance. Out of the three objects listed in the multiply1 function, one has a very high latency of 82 cycles. Double-clicking on this object takes you to the source code, which indicates that array “b” has the highest latency. This is because array “b” is accessed in column-major order: the innermost loop steps through b[k][j] by k, so consecutive accesses land in different rows, NUM elements apart in memory. Interchanging the two inner loops (j and k) changes the access to row-major order and reduces the latency, resulting in better performance (Figure 7).
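The loop interchange described above can be sketched as follows. NUM, TYPE, the function names, and the input values are assumptions for illustration, and the unused tidx, numt, and t parameters are omitted. The helper at the end checks that both loop orders produce the same product.

```c
#define NUM 64            /* assumed matrix dimension for this sketch */
typedef double TYPE;      /* assumed element type */

/* Original i-j-k order: the inner loop walks down a column of b,
   touching a different row (and usually a different cache line) each step. */
void multiply_ijk(int msize, TYPE a[][NUM], TYPE b[][NUM], TYPE c[][NUM])
{
    for (int i = 0; i < msize; i++)
        for (int j = 0; j < msize; j++)
            for (int k = 0; k < msize; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Interchanged i-k-j order: the inner loop walks along row k of b and
   row i of c, so both are read with unit stride. */
void multiply_ikj(int msize, TYPE a[][NUM], TYPE b[][NUM], TYPE c[][NUM])
{
#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (int i = 0; i < msize; i++)
        for (int k = 0; k < msize; k++)
            for (int j = 0; j < msize; j++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Returns the number of entries where the two loop orders disagree. */
int count_interchange_mismatches(void)
{
    static TYPE a[NUM][NUM], b[NUM][NUM], c_ijk[NUM][NUM], c_ikj[NUM][NUM];
    int mismatches = 0;

    for (int i = 0; i < NUM; i++) {
        for (int j = 0; j < NUM; j++) {
            a[i][j] = (i + 2 * j) % 7;   /* small integers: exact arithmetic */
            b[i][j] = (3 * i + j) % 5;
            c_ijk[i][j] = 0;
            c_ikj[i][j] = 0;
        }
    }

    multiply_ijk(NUM, a, b, c_ijk);
    multiply_ikj(NUM, a, b, c_ikj);

    for (int i = 0; i < NUM; i++)
        for (int j = 0; j < NUM; j++)
            if (c_ijk[i][j] != c_ikj[i][j])
                mismatches++;
    return mismatches;
}
```

Only the order of the two inner loops changes; the statement in the loop body is identical, so each element of c still accumulates the same terms.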
After this change, the sample is still back-end bound, but it is no longer memory bound; it is only core bound. The core-bound category covers both a shortage of hardware compute resources and dependencies among the software’s instructions: the machine may have run out of out-of-order resources, certain execution units may be overloaded, or dependencies in the program’s data or instruction flow may be limiting performance. In this case, vector capacity usage is low, which indicates that floating-point scalar or vector instructions are using only part of the available vector capacity. This can be solved by vectorizing the code.
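One way to help the compiler vectorize the loop-interchanged kernel is to state that the rows being read and written do not overlap and to request SIMD code for the unit-stride inner loop. The sketch below uses the C99 restrict qualifier and the OpenMP simd directive for this; it is one possible approach, not necessarily the one used to produce the article's results. NUM, TYPE, and the function names are assumptions, and the kernel must not be called with b and c referring to the same array.

```c
#define NUM 64            /* assumed matrix dimension for this sketch */
typedef double TYPE;      /* assumed element type */

/* i-k-j kernel written so the inner loop is an obvious vector operation:
   c_row[0..msize) += a_ik * b_row[0..msize). */
void multiply_vectorized(int msize, TYPE a[][NUM], TYPE b[][NUM], TYPE c[][NUM])
{
    for (int i = 0; i < msize; i++) {
        TYPE *restrict c_row = c[i];             /* row being updated */
        for (int k = 0; k < msize; k++) {
            const TYPE a_ik = a[i][k];           /* scalar, hoisted out of the j loop */
            const TYPE *restrict b_row = b[k];   /* row being read */
#if defined(_OPENMP) && _OPENMP >= 201307
            #pragma omp simd
#endif
            for (int j = 0; j < msize; j++)
                c_row[j] += a_ik * b_row[j];
        }
    }
}

/* Compares the kernel against a plain dot-product reference and returns
   the number of mismatched entries. */
int count_vectorized_mismatches(void)
{
    static TYPE a[NUM][NUM], b[NUM][NUM], c[NUM][NUM];
    int mismatches = 0;

    for (int i = 0; i < NUM; i++) {
        for (int j = 0; j < NUM; j++) {
            a[i][j] = (i + 2 * j) % 7;   /* small integers: exact arithmetic */
            b[i][j] = (3 * i + j) % 5;
            c[i][j] = 0;
        }
    }

    multiply_vectorized(NUM, a, b, c);

    for (int i = 0; i < NUM; i++) {
        for (int j = 0; j < NUM; j++) {
            TYPE expected = 0;
            for (int k = 0; k < NUM; k++)
                expected += a[i][k] * b[k][j];
            if (c[i][j] != expected)
                mismatches++;
        }
    }
    return mismatches;
}
```

Whether the loop is actually vectorized depends on the compiler and its options, so it is worth confirming with the compiler's optimization report or by rerunning the analysis and checking vector capacity usage.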
Another optimization option is to use Intel Math Kernel Library, which offers highly optimized and threaded implementations of many mathematical operations, including matrix multiplication. The dgemm routine multiplies two double-precision matrices:
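#include <mkl.h>   // declares cblas_dgemm

void multiply5(int msize, int tidx, int numt, TYPE a[][NUM], TYPE b[][NUM],
               TYPE c[][NUM], TYPE t[][NUM])
{
    double alpha = 1.0, beta = 0.0;
    // c = alpha * a * b + beta * c, with all matrices stored row-major.
    // With beta = 0.0, c is overwritten rather than accumulated into.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                NUM, NUM, NUM, alpha,
                (const double *)a, NUM,
                (const double *)b, NUM,
                beta, (double *)c, NUM);
}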
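To clarify what the dgemm arguments mean, the following plain C function spells out the operation for the row-major, no-transpose case: C = alpha * A * B + beta * C, where A is m x k, B is k x n, and C is m x n. The function name is hypothetical, and this is only a statement of the semantics; it says nothing about how Intel MKL implements the routine, which is blocked, vectorized, and threaded.

```c
/* Reference semantics for a row-major dgemm without transposition.
   lda, ldb, and ldc are the leading dimensions: the number of elements
   between the start of one row and the start of the next. */
void dgemm_reference_rowmajor(int m, int n, int k, double alpha,
                              const double *a, int lda,
                              const double *b, int ldb,
                              double beta, double *c, int ldc)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            /* Dot product of row i of A with column j of B. */
            double sum = 0.0;
            for (int p = 0; p < k; p++)
                sum += a[i * lda + p] * b[p * ldb + j];

            /* With beta == 0 the old contents of C are ignored entirely. */
            if (beta == 0.0)
                c[i * ldc + j] = alpha * sum;
            else
                c[i * ldc + j] = alpha * sum + beta * c[i * ldc + j];
        }
    }
}
```

The order of the two input matrices matters because matrix multiplication is not commutative: swapping the A and B arguments computes B * A instead of A * B.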
Performance Analysis and Tuning for Image Resizing
Image resizing is a commonly used operation in the autonomous driving space. As an example, we ran Intel VTune Amplifier’s advanced hotspots analysis on an open-source OpenCV* version of image resize (Figure 8). The elapsed time is 0.33 seconds, and the top hotspot is the cv::HResizeLinear function, which consumes 0.19 seconds of the total CPU time.
Intel IPP offers developers highly optimized, production-ready building blocks for image processing, signal processing, and data processing (data compression/decompression and cryptography) applications. These building blocks are optimized using the Intel® Streaming SIMD Extensions (Intel® SSE) and Intel® Advanced Vector Extensions (Intel® AVX, Intel® AVX2) instruction sets. Figure 9 shows the analysis results for an image resize that takes advantage of Intel IPP. The elapsed time has gone down by a factor of two. Because only one core is currently used, there is an opportunity for further performance improvement through parallelism with Intel Threading Building Blocks.
Conclusion
The Intel System Studio tools in the Intel GO SDK give automotive solution developers an integrated development environment with the ability to build, debug and trace, and tune the performance and power usage of their code. This helps both system and embedded developers meet some of their most daunting challenges:
- Accelerate software development to bring competitive automated driving cars to market faster
- Quickly target and help resolve defects in complex automated driving (AD), advanced driver assistance systems (ADAS), or software-defined cockpit (SDC) systems
- Help speed performance and reduce power consumption
This is all provided in one easy-to-use software package.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.