The Present and Future of the OpenMP* API Specification
How the Gold Standard Parallel Programming Language Has Improved with Each New Version
There are two decades of history associated with the OpenMP* API and, since its inauguration, OpenMP features have been added to keep up with developments in hardware and software to ensure that you can use it to program the hardware that you have. Since the release of version 4.0 in 2013, the OpenMP language has supported heterogeneous and SIMD programming. Similarly, support for programs with irregular parallelism was improved in 2008 with the addition of tasking constructs. OpenMP Technical Report 4: Version 5.0 Preview 1 (TR4 for short) is the next step in the evolution of the OpenMP language. It adds task reductions, extends SIMD parallel programming, and considerably extends the productivity of heterogeneous programming. In this article, we review existing OpenMP features and provide a preview of what will be coming soon in implementations supporting TR4.
Tasking: Express Yourself with Tasks
Tasking, or task-based programming, is an important concept for applications that require irregular parallelism (e.g., recursive algorithms, graph traversals, and algorithms operating on unstructured data). Since OpenMP version 3.0, the task construct has provided a convenient way to express the concurrent execution of small units of work that are handed to a scheduler in the OpenMP runtime system.
void taskloop_example() { #pragma omp taskgroup { #pragma omp task long_running_task() // can execute concurrently #pragma omp taskloop collapse(2) grainsize(500) nogroup for (int i = 0; i < N; i++) for (int j = 0; j < M; j++) loop_body();
Figure 1 illustrates the creation of an OpenMP task to execute a long-running function and thena loop that has been parallelized using the taskloop construct. This construct appeared with OpenMP 4.5 and provides syntactic sugar to allow programmers to easily parallelize loops usingOpenMP tasks. It divides the loop iteration space into chunks and creates one task for each chunk. The construct supports several clauses to allow fine control (e.g., grainsize to control the amount of work per task and collapse to create a product loop out of the i and j loops). TR4 extends the expressiveness of OpenMP tasks by defining new clauses for the taskgroup, task, and taskloop constructs to perform reductions across the generated tasks.
Figure 2 illustrates the creation of tasks to process a linked list and find the minimum value of all elements in the list. The parallel construct creates a parallel region to have worker threads available for task execution. The single construct then restricts execution to one thread that traverses the linked list and generates one task for each list item via omp task. This is a common way to implement a producer-consumer pattern in OpenMP.
Task reductions in TR4 use the taskgroup construct that was introduced in OpenMP version4.0. It was designed to group tasks logically and to provide a way to await completion of all the tasks in the group. TR4 extends the taskgroup construct to perform reductions through the task_reduction clause, as illustrated in Figure 2. If this clause is added to the construct, all partial results gathered by the individual tasks are aggregated to form the final result at the end of the taskgroup region. Tasks that contribute to the reduction operation must have an in_reduction clause that matches the reduction clause of their taskgroup.
Starting with TR4, the taskloop construct supports the reduction and in_reduction clauses with their task reduction semantics. If a reduction clause appears on the taskloop construct, an implicit task group is created, which performs the requested reduction operation at the end of the loop. If an in_reduction clause is added, the tasks generated by the taskloop construct participate in the reduction of an outer taskgroup region.
int find_minimum(list_t * list) { int minimum = INT_MAX; list_t * ptr = list; #pragma omp parallel #pragma omp single #pragma omp taskgroup task_reduction(min:minimum) { for (ptr = list; ptr ; ptr = ptr->next) { #pragma omp task firstprivate(ptr) in_reduction(min:minimum) { int element = ptr->element; minimum = (element < minimum) ? element : minimum; } } } return minimum; }
Offloading: Making the Most of Coprocessors
The OpenMP API strives to improve the usability of offloading pragmas based on user feedback. To that end, new features have been added to TR4 and some existing features have been enhanced. One of the key new features is the ability to automatically detect functions used in offload regions and treat them as if they appeared in a declare target directive. Previously, all functions called in an offload region had to be explicitly tagged using declare target directives. This was hard work, especially if the routines were in header files not owned by the programmer (e.g., the Standard Template Library), and would require declare target directives on the header file itself, which would create a copy of every function in the header file for the device even if some functions were not used in the offload region.
#pragma omp declare target void foo() { // ... } #pragma omp end declare target void bar() { #pragma omp target { foo(); } }
void foo() { // ... } void bar() { #pragma omp target { foo(); } }
In Figure 3, the code on the left shows what was required in OpenMP version 4.5. Starting with TR4, the code on the right side is sufficient due to the implicit detection and creation of the device function.
Automatic detection also extends to variables with static storage duration in TR4. The examples in Figure 4 are equivalent.
int x; #pragma omp declare target to (x) void bar() { #pragma omp target { x = 5; } }
int x; void bar() { #pragma omp target { x = 5; } }
OpenMP version 4.5 introduced the use_device_ptr clause. The variable in use_device_ptr must be mapped before it can be used. To achieve this, the programmer would need to use a separate #pragma target data clause, as a variable can appear in only one data clause.Thus, the OpenMP directives in Figure 5 are needed.
#pragma omp target data map(buf) #pragma omp target data use_device_ptr(buf)
In TR4, an exception has been made so that the variable can appear in both the map and use_device_ptr clauses in a single construct, as shown in Figure 6.
#pragma omp target data map(buf) use_device_ptr(buf)
Static data members are now permitted in a class inside an omp declare target construct. Class objects with static members can also be used in a map clause (Figure 7).
#pragma omp declare target class C { static int x; int y; } class C myclass; #pragma omp end declare target void bar() { #pragma omp target map(myclass) { myclass.x = 10 } }
In addition, virtual member functions are allowed in classes inside an omp declare target construct or objects used in a map clause. The only caveat is that the virtual member functions can be invoked only on a device if the object is created on the same device.
In OpenMP 4.5, scalar variables used in a reduction or lastprivate clause on a combined construct for which the first construct is target are treated as firstprivate for the target construct. That results in the host value never being updated, surprisingly. To update the value on the host, the programmer had to separate the omp target directive from the combined construct and explicitly map the scalar variable. In TR4, such variables are automatically treated as if they had a map(tofrom:variable) applied to them.
If a section of a named array is mapped using omp target data, any nested omp target inside the omp target data construct that references the array would require an implicit mapping to either the same section or a subsection of the array used in the outer omp target data map clause. If the explicit mapping is omitted on the inner omp target region, the implicit mapping rule kicks in, which would imply that the entire array is mapped according to OpenMP version 4.5. This would result in a runtime error from mapping a larger-sized array when a subsection of the array is already mapped. Similarly, mapping a field of a structure variable in the outer omp target data construct and using the address of the structure variable inside a nested omp target construct would result in an attempt to map the entire structure variable when part of the structure is already mapped. TR4 has fixed these cases to give the behavior that programmers typically expect (Figure 8).
struct {int x,y,z} st; int A[100]; #pragma omp target data map(s.x A[10:50]) { #pragma omp target { A[20] = ; // error in OpenMP 4.5, Ok in TR4 foo(&st); // error in OpenMP 4.5, OK in TR4 } #pragma omp target map(s.x, A[10:50]) { A[20] = ; // Ok OpenMP 4.5 and TR4 foo(&st); // Ok OpenMP 4.5 and TR4 } }
The new features in TR4 improve the programmability of offloading using OpenMP, requiring fewer modifications to the application. The automatic detection of variables and functions used in target regions removes the need for explicit specification. Similarly, the elimination of the need to repeat map clauses inside nested regions and allowing variables to appear in both map and use_device_ptr reduces the number of OpenMP directives required. The changes to the behavior of reduction variables aligns the language with programmer expectations. Overall, the cleaner semantics make the use of offload devices within OpenMP applications simpler and more intuitive.
Efficient SIMD Programming
SIMD Loops with Cross-Iteration Dependencies
OpenMP version 4.5 extends the ordered construct by adding a new simd clause. The ordered simd construct declares that a structured block in the SIMD loop or SIMD function must be executed in iteration order or in the order of function calls, respectively. Figure 9 shows the use of the ordered simd block to preserve read-write, write-read, and write-write ordering within each iteration and among iterations, while the entire loop can be executed concurrently using SIMD instructions. In the first ordered simd block, the index ind[i] of array a may have a write-write conflict (e.g., ind[0] = 2, ind[2] = 2), so it needs to be serialized by the ordered simd to allow vectorization of the entire loop. In the second ordered simd block, the myLock(L) and myUnlock(L) operations must be in a single ordered simd block. Otherwise, as part of the loop vectorization (e.g., for a vector length of two), the calls to myLock(L) and myLock(L) will be expanded to two calls as follows: {myLock(L);myLock(L); …; myUnlock(L); myUnlock(L);}. Nesting the lock functions will typically result in a deadlock. The ordered simd construct shown in the example creates the proper sequence {myLock(L); …; myUnlock(L); …; myLock(L); myUnlock(L);}.
#pragma omp simd for (i = 0; i < N; i++) { // ... #pragma omp ordered simd { // write-write conflict a[ind[i]] += b[i]; } // ... #pragma omp ordered simd { // atomic update myLock(L) if (x > 10) x = 0; myUnlock(L) } // ... }
#pragma omp simd for (i = 0; i < N; i++) { // ... #pragma omp ordered simd { if (c[i]) > 0) q[j++] = b[i]; } // ... #pragma omp ordered simd { if (c[i] > 0) q[j++] = d[i]; } // ... }
When using the simd clause on the ordered construct, caution is required to not violate inherent dependencies between two ordered simd blocks. Figure 9 shows incorrect uses of #pragma omp ordered simd, as the order of stores is changed under SIMD execution with respect to its serial execution. Assume c[0] = true and c[1] = true. When the above loop is executed serially, the order of stores is: q[0] = b[0], q[1] = d[0], q[2] =b[1], q[3] = d[1], and so forth. However, when the loop is executed concurrently with a vector length of two, the order of stores is: q[0] = b[0], q[1] = b[1], q[2] = d[0], q[3] = d[1], … The change in store ordering is due to a violation of the write-to-read dependency on the variable j between the two ordered simd blocks in the loop. The correct use is to merge the two ordered simd blocks into a single ordered simd block.
REF/UVAL/VAL Modifier Extensions to the Linear Clause
The linear clause provides a superset of the functionality provided by the private clause.When a linear clause is specified on a construct, the value of the new list item on each iteration of the associated loop(s) corresponds to the value of the original list item before entering the construct, plus the logical number of the iterations multiplied by the linear step. The value corresponding to the sequentially last iteration of the associated loop(s) is assigned to the original list item. When a linear clause is specified on a declarative directive, all list items must be formal parameters (or, in Fortran, dummy arguments) of a function that will be invoked concurrently on each SIMD lane.
The rationale behind adding ref/uval/val modifiers to the linear clause is to provide a way for programmers to precisely specify the linear or uniform property of memory references with respect to address and data value so the compiler can leverage the information to generate efficient SIMD code using unit-stride loads/stores instead of gathers/scatters. Essentially, for implicitly referenced linear arguments, it would be better to have reference as linear. The semantics of uval/val/ref is described as:
- linear(val(var):[step]) indicates that the value is linear even if the var is passed by its reference.The vector of addresses is passed for passed by reference. In this case, the compiler must generate gathers or scatters.
- linear(uval(var):[step]) indicates that the value passed by reference is linear while the reference itself is uniform. So the reference to the first lane is passed, but other values can be constructed using step. The compiler can use general-purpose registers to pass the base address and compute its linear value.
- linear(ref(var):step) indicates that the parameter is passed by reference, the underlying reference is linear, and the memory access will be linear unit-stride or nonstrided depending on step.The compiler can use general-purpose registers to pass the base address and compute its linear address.
Figure 10 shows a function FOO with arguments X and Y, which are pass-by-reference in Fortran. The “VALUE” attribute does not change this behavior. It says only that the updated value will not be visible to the caller per the Fortran 2008 language specification. Since the references of X and Y are not annotated as linear, the compiler must generate gather instructions to load (X0, X1, X2, X3) and (Y0, Y1, Y2, Y3), assuming the vector length is four. In Figure 11, the references to X and Y are annotated as linear, so the compiler can generate unit-stride SIMD loads for much better performance.
REAL FUNCTION FOO(X, Y) !$omp declare simd(FOO) REAL, VALUE :: Y !! pass by reference REAL, VALUE :: X !! pass by reference FOO = X + Y !! gathers generated !! based on vector !! of addresses END FUNCTION FOO ! ... !omp$ simd private(X,Y) DO I= 0, N Y = B(I) X = A(I) C(I) += FOO(X, Y) ENDDO
REAL FUNCTION FOO(X, Y) !$omp declare simd(FOO) linear(ref(X), ref(Y)) REAL, VALUE :: Y !! pass by reference REAL, VALUE :: X !! pass by reference FOO = X + Y !! unit stride !! SIMD loads END FUNCTION FOO ! ... !omp$ simd private(X,Y) DO I= 0, N Y = B(I) X = A(I) C(I) += FOO(X, Y) ENDDO
In Figure 12, the function add_one is annotated as a SIMD function. It has a C++ reference argument const int &p. Assuming a vector length of four, if p is annotated as linear(ref(p)), the compiler can generate the unit-stride load instruction with the base address p in the rax register to load p[0], p[1], p[2], and p[3] to the xmm0 register. In that case, the add_one function requires only three instructions.
#pragma omp declare simd notinbranch // linear(ref(p)) __declspec(noinline) int add_one(const int& p) { return (p + 1); }
However, if p is not annotated as linear(ref(p)), the compiler has to assume that four different addresses p0, p1, p2, and p3 are passed in via two xmm registers, and the gather operation is emulated with a sequence of scalar load and packing instructions. As a result, the add_one function now requires 16 instructions rather than three.
Overall, the additional SIMD features in OpenMP version 4.5 allow the user to provide more information to the compiler, which allows vectorization of more loops and the generation of better vector code in many circumstances.
Affinity: Thread Placement Made Easy
The OpenMP version 4.0 specification gave users a standard way to control thread affinity for the first time. It introduced two new concepts to the language:
- Binding policy
- Place partition
The binding policy, specified by the bind-var Internal Control Variable (ICV), determines where the threads of a team will be bound relative to the parent thread’s place. The place partition, specified by the place-partition-var ICV, is the set of places to which thread scan be bound. Once a thread is bound to a place for a given team, it should not be moved from that place.
There are three binding policies defined by the specification: master, close, and spread. In describing these policies, we will consider a set of four places, each one a core with two threads. We will show examples of placing three threads and six threads on those places,and we assume that the parent thread will always be on the third place. In the master policy, the master thread is bound to the parent thread’s place, and then the remaining threads in the team are assigned to the same place as the master thread (Table 1).
The close policy starts by placing the master thread on the parent thread’s place, and then proceeds in a round-robin fashion with the remaining threads in the team. To place T threads on P places, the master’s place gets roughly the first T/P threads, then the next place in the place partition gets the next T/P threads, and so on, wrapping around in the place partition as needed, giving a distribution (Table 2).
With the spread policy, things get very interesting. The placement of threads will be such that they are spread out over the available places. This is accomplished by forming T roughly even subpartitions of the place partition, or P partitions if T >= P. If T <= P, each thread gets its own subpartition, starting with the master thread, which will get the subpartition containing the place to which the parent thread is bound. Each subsequent thread is bound to the first place in each subsequent subpartition, wrapping around as needed. If T > P, sets of consecutive threads get the same subpartition, which in this case will consist of a single place. Thus, all the threads in theset will be bound to the same place. We show the subpartitions formed in Table 3 in curly braces. These are important if nested parallelism is used, since they affect the available resources used by each nested parallel region.
OpenMP version 4.0 also provides a query function for the thread affinity binding policy: omp_proc_bind_t omp_get_proc_bind(). It returns the binding policy to be used in the next parallel region (assuming that no proc_bind clause is specified on that region).
What is really interesting about the spread policy is what happens with the subpartition. With the master and close policies, each implicit task inherits the place partition of the parent implicit task. But, in the spread policy, implicit tasks get their place-partition-var ICV set to the subpartition instead. This means that a nested parallel construct will have all of its threads placed within the subpartition of its parent.
The value of bind-var can be initialized via the environment variable OMP_PROC_BIND. The value of bind-var can also be overridden by the addition of a proc_bind clause to a parallel construct. Specifying the place-partition-var is accomplished via the OMP_PLACES environment variable. Places can be hardware threads, cores, sockets, or specific quantities of those. They can also be explicit processor lists. More details can be found in the OpenMP API specification.
The OpenMP 4.5 specification enhanced the language’s affinity capabilities by providing a set of functions capable of querying aspects of the place partition and binding place of the current thread. These new API functions are useful for confirming the correctness of the settings to achieve the programmer’s desired thread affinity. This is particularly important when the complexity of the code is high and nested parallelism is used in conjunction with the spread binding policy to place threads in nested parallel regions such that they share lower-level caches. These API functions are:
- int omp_get_num_places(): Returns the number of places in the place-partition-var in theexecution environment of the initial task.
- int omp_get_place_num_procs(int place_num): Returns the number of processors available to the execution environment in the place specified by place_num in the place partition.
- void omp_get_place_proc_ids(int place_num, int *ids): Gets the processors availableto the execution environment in the place specified by place_num in the place partition, allocates an array to hold them, and puts that array at ids.
- int omp_get_place_num(void): Returns the number of the place in the place partition to whichthe encountering thread is bound.
- int omp_get_partition_num_places(void): Returns the number of places in the place partition of the innermost implicit task. Note that this differs from omp_get_num_places() in that it will show the effects of the spread binding policy as the place partition gets broken into subpartitions, whereas omp_get_num_places() will always show the full original place partition.
- void omp_get_partition_place_nums(int *place_nums): Gets the list of place numbers corresponding to the place partition of the innermost implicit task and allocates an array in place_nums to store them. Note that the place numbers are the numbers of the places in the full original place partition. This function is particularly useful to see which places from the original placepartition appear in a subpartition resulting from the use of the spread binding policy.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.