Boosting the Performance of Graph Analytics Workloads

Analyzing the Graph Benchmarks on Intel® Xeon® Processors

A graph is an intuitive way of representing a big data set and the relationships between its elements. Each vertex represents an element, and edges connect related elements. Representing data sets as a graph lets you build on the rich history of graph theory, and a variety of algorithms exist to extract useful information from a graph. In this article, we’ll explore the implementation characteristics of basic graph analysis algorithms and how they perform on Intel® Xeon® processors.

Graphs and Graph Analytics

A graph is a structured representation of a data set with relationships between elements. Each element is represented as a vertex, and relationships between elements are shown as an edge between two vertices. Both vertices and edges can have attributes representing either the characteristics of the element or the relationship. The vertices to which a vertex is connected through edges are called its neighbors, and the number of neighbors is called the degree of a vertex.
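
For reference, graph frameworks such as GAP typically store a graph in compressed sparse row (CSR) form: every vertex’s neighbor list is stored contiguously in one array, and an offsets array marks where each list begins, which also gives the degree. The following is a minimal, illustrative sketch of that layout (the names are ours, not GAP’s API):

#include <cstdint>
#include <vector>

// Minimal CSR-style graph sketch (illustrative names, not GAP's API).
// The neighbors of vertex v occupy neighbors[row_start[v] .. row_start[v+1]).
struct SimpleCSRGraph {
  std::vector<int64_t> row_start;   // num_vertices + 1 offsets into `neighbors`
  std::vector<int32_t> neighbors;   // all neighbor lists, concatenated

  int64_t degree(int32_t v) const {             // number of neighbors of v
    return row_start[v + 1] - row_start[v];
  }
};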

Graph analysis applications extract characteristics from a graph (or multiple graphs) that provide useful information, e.g.:

  • Vertices with specific features
  • The shortest path between vertices
  • Vertex clusters
  • Higher-order relationships among vertices

With the growing availability of big data sets and the need to extract useful information from them, graph analytics applications are becoming an important workload, both in the data center and at the edge.

For this study, we’ll use the GAP Benchmark Suite [1], a set of common graph algorithms implemented in C++ and optimized by researchers at the University of California, Berkeley for performance on shared-memory multicore processors (Table 1). We evaluate GAP performance on an Intel Xeon processor-based server and investigate opportunities to further improve performance.

Table 1. GAP Benchmark Suite overview

  • PageRank* (pr): calculates the popularity of a vertex by aggregating the popularity of its neighbors
  • Triangle counting (tc): counts the number of triangles (three vertices that are fully connected)
  • Connected components (cc): splits the graph into subgraphs with no edges between them
  • Breadth-first search (bfs): walks through the graph in breadth-first order
  • Single-source shortest path (sssp): calculates the shortest path from one vertex to all others
  • Betweenness centrality (bc): calculates the centrality of a vertex, determined by how many shortest paths go through it

 

Characteristics of Graph Algorithms

Graph algorithms exhibit behavior that is challenging for conventional processor architectures. A typical operation is to fetch the attributes of all neighbors of a vertex. The list of neighbors, determined by the topology of the graph, is usually irregular. This leads to sparse memory accesses: accesses to individual elements scattered across a large data structure. Sparse memory accesses have essentially no locality, leading to poor cache utilization.

Fetching the attributes of the neighbors means using indirect accesses. For example, if N is the array of neighbors of a vertex and A is the array containing the attributes, the attribute of neighbor i is accessed as A[N[i]]. This pattern is difficult to predict and to vectorize, which leads to underutilization of the available compute and memory resources.
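
As a minimal sketch (the array names follow the text; the function and its types are illustrative):

#include <cstddef>

// Illustrative sketch of the indirect-access pattern described above:
// N holds the neighbor IDs of one vertex, A holds a per-vertex attribute.
// Each iteration loads A at a data-dependent index, so consecutive
// iterations typically touch unrelated cache lines.
double SumNeighborAttributes(const double* A, const int* N, std::size_t num_neighbors) {
  double sum = 0.0;
  for (std::size_t i = 0; i < num_neighbors; i++)
    sum += A[N[i]];   // gather-style load: the address depends on N[i]
  return sum;
}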

On the other hand, graph algorithms generally have a lot of parallelism. The algorithms in the GAP suite fall into two categories in terms of parallelism. The first category consists of algorithms that operate on all vertices concurrently (pr, tc, and cc). They have abundant parallelism and can be executed across many threads; their parallelism is limited only by the size of the graph.

The second category is front-based algorithms where, at each iteration, a subset of vertices is analyzed (the current front) and a new front is defined to be processed in the next iteration (bfs, sssp, and bc). These algorithms usually start with a front containing a single vertex. The next front consists of its neighbors, then the neighbors of these neighbors, and so on. In the first iterations, the size of the front (and thus the parallelism that can be exploited) is limited. Also, each iteration ends with a global barrier, which creates additional synchronization overhead. These algorithms scale worse with increasing thread count, especially on smaller graphs.
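
A minimal serial sketch of this structure is shown below (illustrative code, not GAP’s). In the parallel versions, the loop over the current front is divided across threads, and every outer iteration ends with a global barrier, so the exploitable parallelism is bounded by the front size:

#include <cstdint>
#include <vector>

// Rough sketch of a front-based traversal over a CSR graph (illustrative).
// row_start/neighbors are the CSR arrays from the earlier sketch.
void FrontBasedTraversal(const std::vector<int64_t>& row_start,
                         const std::vector<int32_t>& neighbors,
                         int32_t source) {
  std::vector<bool> visited(row_start.size() - 1, false);
  std::vector<int32_t> front = {source};             // iteration 0: a single vertex
  visited[source] = true;
  while (!front.empty()) {
    std::vector<int32_t> next_front;
    for (int32_t u : front) {                        // parallelism limited to |front|
      for (int64_t e = row_start[u]; e < row_start[u + 1]; e++) {
        int32_t v = neighbors[e];
        if (!visited[v]) {
          visited[v] = true;
          next_front.push_back(v);                   // becomes part of the next front
        }
      }
    }
    front.swap(next_front);                          // in parallel code: global barrier here
  }
}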

Running Graph Algorithms on Intel Xeon Processors

Despite the challenging behavior of graph algorithms, there are ways to increase the efficiency of running these applications on a multi-core Intel Xeon processor-based server.

Vectorization

Using vector memory instructions can increase the performance of a graph algorithm by increasing the number of parallel load operations, which hides part of their latency. Specifically, you can use the vector gather instruction (available in AVX2 and AVX-512) to perform multiple indirect loads in one instruction. However, the compiler isn’t always able to detect these indirect access patterns, or it may decide not to vectorize based on its heuristics. Therefore, it can be useful to add #pragma vector always to force the compiler to vectorize and/or to rewrite the code to make the indirect access pattern more apparent to the compiler.

Figure 1 gives an example for cc. The original code in the first listing did not generate vector gather instructions, while the rewritten code in the second listing did. This led to a speedup of 5x for cc on Intel Xeon processors.

for (NodeID v : g.out_neigh(u)) { 
  NodeID comp_v = comp[v]; 
  if ((comp_u < comp_v) && 
      (comp_v == comp[comp_v])) { 
    change = true; 
    comp[comp_v] = comp_u; 
  }
}

NodeID *a = g.out_neigh(u).begin();
int64_t n = g.out_neigh(u).end() - a;   // number of neighbors of u
#pragma vector always
for (int64_t i = 0; i < n; i++) {
  NodeID comp_v = comp[a[i]];
  if ((comp_u < comp_v) &&
      (comp_v == comp[comp_v])) {
    change = true;
    comp[comp_v] = comp_u;
  }
}

Figure 1 – Inner loop of connected components. The first listing is the original code, which didn’t generate vector gather instructions. The second listing is the altered code, which makes the indirect pattern clearer and forces the compiler to vectorize.

It might also be useful to look for other vectorization opportunities and to rewrite the code so that they can be exploited (e.g., using intrinsics). For example, in tc we need to count the number of matches between the elements of two neighbor lists. Katsov describes a way to speed up this matching with SSE instructions [2]. We adapted his approach to AVX-512 and included it in the tc benchmark, leading to a performance increase of 2.5x (code not included for brevity) [3].
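
As an illustration only (the study’s actual kernel follows Katsov’s block-based scheme and is not reproduced here), one simple way to vectorize the matching with AVX-512 is to broadcast each element of one list and compare it against 16 elements of the other list at a time:

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Simplified illustration of vectorized list matching with AVX-512
// (not the kernel used in the study). For every element of list a, compare
// it against 16 elements of list b at once and count the equal lanes.
// A production kernel would also exploit the fact that both lists are sorted
// to skip ahead and stop comparisons early.
std::size_t CountMatches(const int32_t* a, std::size_t na,
                         const int32_t* b, std::size_t nb) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < na; i++) {
    __m512i key = _mm512_set1_epi32(a[i]);                       // broadcast a[i] to 16 lanes
    std::size_t j = 0;
    for (; j + 16 <= nb; j += 16) {
      __m512i block = _mm512_loadu_si512((const void*)(b + j));  // load 16 candidates
      __mmask16 eq = _mm512_cmpeq_epi32_mask(key, block);        // lane-wise equality mask
      count += (std::size_t)_mm_popcnt_u32((unsigned)eq);        // equal lanes = matches
    }
    for (; j < nb; j++)                                          // scalar tail
      if (b[j] == a[i]) count++;
  }
  return count;
}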

Parallelism

The GAP benchmarks are parallelized using OpenMP*. As discussed before, there are two categories of parallelism: vertex-based and front-based. For the vertex-parallel algorithms (pr, cc, and tc), it’s important to use dynamic scheduling in the OpenMP parallel for-loops because the processing time of a vertex depends on its neighbor count, which can differ significantly across vertices. With static scheduling, threads that are assigned vertices with many neighbors execute longer than other threads, leading to load imbalance. To reduce the scheduling overhead while still maintaining enough scheduling flexibility, set the chunk size to 16 or 32.
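
For example, a vertex-parallel update loop with dynamic scheduling and a chunk size of 16 looks like the following sketch (illustrative code, not taken from the GAP sources):

#include <cstdint>
#include <vector>

// Illustrative vertex-parallel loop with dynamic scheduling (not GAP code).
// Chunks of 16 vertices are handed out to idle threads, so a few high-degree
// vertices don't leave the other threads waiting, as they can with static
// scheduling. Compile with OpenMP enabled (e.g., -fopenmp or -qopenmp).
void UpdateAllVertices(const std::vector<int64_t>& row_start,
                       const std::vector<int32_t>& neighbors,
                       const std::vector<double>& attr_in,
                       std::vector<double>& attr_out) {
  const int64_t num_vertices = (int64_t)row_start.size() - 1;
  #pragma omp parallel for schedule(dynamic, 16)
  for (int64_t u = 0; u < num_vertices; u++) {
    double sum = 0.0;
    for (int64_t e = row_start[u]; e < row_start[u + 1]; e++)
      sum += attr_in[neighbors[e]];     // work per vertex scales with its degree
    attr_out[u] = sum;                  // simplified PageRank-style update
  }
}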

The front-based algorithms (bfs, sssp, and bc) are harder to parallelize, which means they don’t use the full capacity of the processor (fewer threads than cores). The current front contains many fewer vertices than the full graph, and the next front can only be processed when the current front is finished. To fully exploit the increasing core count of Intel Xeon processor-based servers, these algorithms need to be revised to increase their parallelism.

An example of this is already implemented in bfs. Instead of looking for neighbors of the current front and checking whether they’ve already been visited (forward algorithm), all non-visited vertices are considered, and each is checked for a neighbor in the current front (backward algorithm). Because there are more non-visited vertices than vertices in the front, there’s more parallelism to exploit. The downside is that the backward algorithm sometimes does unnecessary work when most of the non-visited vertices are not neighbors of the current front.

At each step of the algorithm, we can choose between the two methods. This choice is currently done using the characteristics of the current front and the remaining vertices, but it should also include the available parallelism (core or thread count) to better exploit the capacity of the processor.
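
To make the two step kinds and the per-iteration choice concrete, here is an illustrative sketch on the CSR arrays used earlier (an undirected graph is assumed so that out- and in-neighbors coincide; the names and threshold are placeholders, not GAP’s implementation):

#include <cstdint>
#include <vector>

// Forward (top-down) step: scan the neighbors of every vertex in the front.
// Work is proportional to the number of edges leaving the front.
// parent[v] < 0 marks v as unvisited.
static void ForwardStep(const std::vector<int64_t>& row_start,
                        const std::vector<int32_t>& neighbors,
                        const std::vector<int32_t>& front,
                        std::vector<int32_t>& parent,
                        std::vector<int32_t>& next_front) {
  for (int32_t u : front)
    for (int64_t e = row_start[u]; e < row_start[u + 1]; e++) {
      int32_t v = neighbors[e];
      if (parent[v] < 0) { parent[v] = u; next_front.push_back(v); }
    }
}

// Backward (bottom-up) step: scan all unvisited vertices and check whether any
// neighbor lies in the current front (in_front is a per-vertex flag). There
// are many more candidate vertices, hence more parallelism, but some checks
// may be wasted when few unvisited vertices border the front.
static void BackwardStep(const std::vector<int64_t>& row_start,
                         const std::vector<int32_t>& neighbors,
                         const std::vector<char>& in_front,
                         std::vector<int32_t>& parent,
                         std::vector<int32_t>& next_front) {
  int64_t num_vertices = (int64_t)parent.size();
  for (int32_t v = 0; v < num_vertices; v++) {
    if (parent[v] >= 0) continue;                       // already visited
    for (int64_t e = row_start[v]; e < row_start[v + 1]; e++) {
      int32_t u = neighbors[e];
      if (in_front[u]) { parent[v] = u; next_front.push_back(v); break; }
    }
  }
}

// The choice between the two steps is made every iteration. A common rule is
// to go backward once the front has grown large relative to the unexplored
// part of the graph; as suggested in the text, the available thread count
// could be factored into this decision as well. The threshold below is a
// placeholder, not a tuned value.
static bool UseBackwardStep(int64_t front_edges, int64_t unexplored_edges) {
  return front_edges * 16 > unexplored_edges;
}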

Caches and Input Graphs

Graph workloads generally don’t generate cache-friendly access patterns. The one exception is when the graph, or its most frequently accessed data structure (e.g., the attributes of vertices), fits in the last-level cache (up to 38 MB per socket on high-end Intel Xeon processors). In our experiments, performance (expressed in GTEPS, or giga traversed edges per second) decreases with increasing graph size, since less and less of the data fits into the cache. Because of the nature of graphs and graph algorithms, methods to improve cache locality either don’t work well or spend too much time reorganizing the graph, often more than the runtime of the algorithm itself.

Distributed Graph Processing

The GAP benchmarks are designed for single-node execution only (using OpenMP parallelization). However, based on the insights from our study, we briefly discuss the implications for distributed graph processing (i.e., using multiple nodes). For high-performance multinode execution, it’s crucial to minimize communication and maximize local data access and computation. This is challenging for graph applications because of their irregular, non-localized access patterns. Partitioning a graph to minimize the number of edges between partitions is an NP-complete problem in itself, and often requires more compute time than the algorithm itself. Therefore, when you’re deploying a graph analysis algorithm on multiple nodes, the nodes should be connected by a high-bandwidth, low-latency network such as Intel® Omni-Path Architecture to deal with the unavoidably high amount of communication between the nodes.

Boosting Performance through Analysis

Graph applications form a challenging workload for current processors because of their memory intensiveness and irregularity. By carefully crafting their implementation and exploiting vector units and thread parallelism, we can increase performance significantly. However, more investigation is needed, including redesigning the algorithms to fully exploit the capabilities of an Intel Xeon processor-based server, especially when moving to distributed processing.

References

[1] S. Beamer, K. Asanovic, and D. A. Patterson, “The GAP Benchmark Suite,” 2015. http://gap.cs.berkeley.edu/benchmark.html
[2] I. Katsov, “Fast Intersection of Sorted Lists Using SSE Instructions,” 2012. https://highlyscalable.wordpress.com/2012/06/05/fast-intersection-sorted-lists-sse/
[3] S. Eyerman et al., “Many-Core Graph Workload Analysis,” 2018. https://dl.acm.org/citation.cfm?id=3291686

For more complete information about compiler optimizations, see our Optimization Notice.