Parallelism in Python* Using Numba*
It Just Takes a Bit of Practice and the Right Fundamentals
Obtaining parallelism in Python* has been a challenge for many developers. In issue 35 of The Parallel Universe, we explored the basics of the Python language and the key ways to obtain parallelism. In this article, we’ll explore how to achieve parallelism through Numba*.
There are three key ways to efficiently achieve parallelism in Python:
- Dispatch to your own native C code through Python’s ctypes or cffi (wrapping C code in Python).
- Rely on a library that uses advanced native runtimes, such as NumPy or SciPy.
- Use a framework that acts as an engine to generate native-speed code from Python or symbolic math expressions.
All three methods escape the global interpreter lock (GIL), and do so in a way that’s accepted within the Python community. The Numba framework falls under the third method, because it uses just-in-time (JIT) and low-level virtual machine (LLVM) compilation engines to create native-speed code.
The first requirement for using Numba is that your target code for JIT or LLVM compilation optimization must be enclosed inside a function. After the initial pass of the Python interpreter, which converts to bytecode, Numba will look for the decorator that targets a function for a Numba interpreter pass. Next, it will run the Numba interpreter to generate an intermediate representation (IR). Afterwards, it will generate a context for the target hardware, and then proceed to JIT or LLVM compilation. The Numba IR is changed from a stack machine representation to a register machine representation for better optimization at runtime. From there, the range of options and parallelism directives opens up.
In the following example, we’re using pure Python to give Numba the best chance to optimize without having to specify directives:
    import array
    import random
    from numba import jit

    a = array.array('l', [random.randint(0, 10) for x in range(0, 10000000)])

    @jit(nopython=True, parallel=True)
    def ssum(x):
        total = 0
        for items in x:
            total += items
        return total

    %timeit sum(a)
    111 ms ± 861 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    %timeit ssum(a)
    4.2 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    # Nearly 26X faster!
Pure CPython bytecode is easier for the Numba interpreter to deal with compared to mixed CPython and NumPy code. The @jit decorator tells Numba to create the IR, and then a compiled variant, before running the function. Note the nopython attribute on the decorator. This means that we don’t want to fall back to stock interpreter behavior if Numba fails to convert the code (more on this later). We used Python arrays instead of lists because they compile better with Numba. We also created a custom summation function because Python’s standard sum has special iterator properties that won’t compile in Numba.
The previous example works well for general Python. But what if your code requires the use of scientific or numerical packages like NumPy or SciPy? Take, for example, the following code that calculates a resistor-capacitor (RC) time constant for a circuit:
    import numpy as np

    test_voltages = np.random.rand(1, 1000) * 12
    test_constants = np.random.rand(1, 1000)

    def filter_time_constant(voltage, time_constant):
        return voltage * (1 - np.exp(1 / time_constant))

    %timeit filter_time_constant(test_voltages, test_constants)
    11.2 µs ± 145 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In this case, we’ll use the @vectorize decorator instead of @jit because of NumPy’s implementation of ufuncs:
    from numba import vectorize

    @vectorize
    def v_filter_time_constant(voltage, time_constant):
        return voltage * (1 - np.exp(1 / time_constant))

    %timeit v_filter_time_constant(test_voltages, test_constants)
    4.74 µs ± 46.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    # Over 2x faster!
When dealing with specialized frameworks such as NumPy and SciPy, Numba is not only dealing with Python, but also with a special type of primitive in the NumPy/SciPy stack called a ufunc. Normally, creating a NumPy ufunc means writing C code, which is a difficult proposition. In this case, np.exp() is a good candidate, since it’s a transcendental function and can be targeted by the Intel® Compiler’s Short Vector Math Library (SVML) in conjunction with Numba. Both @vectorize and @guvectorize can use SVML and help with NumPy ufuncs.
While Numba does have good ufunc coverage, it’s also important to understand that not every NumPy or SciPy codebase will optimize well in Numba. This is because some NumPy primitives are already highly optimized. For example, numpy.dot() uses the Basic Linear Algebra Subprograms (BLAS), an optimized C API for linear algebra. If the Numba interpreter is used, it will actually produce a slower function because it can’t optimize the BLAS function any further. To use the ufunc optimally in Numba, we’d need to look for a stacked NumPy call, in which many operations to an array or vector are compounded. For example:
    %timeit np.exp(np.arcsin(np.random.rand(1000)))
    19.6 µs ± 85.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

    @jit(nopython=True)
    def test_func(size):
        # Use the size argument rather than a hard-coded length
        return np.exp(np.arcsin(np.random.rand(size)))

    %timeit test_func(1000)
    16 µs ± 80.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The Numba @jit performance is slightly better than the straight NumPy code because this computation has not one, but three, NumPy computations. Numba can analyze the ufuncs and detect the best vectorization and alignment better than NumPy itself can.
Another way to tune Numba’s performance is through its advanced compilation options. The main options used are nopython, nogil, cache, and parallel. With the @jit decorator, Numba attempts to choose the best method to optimize the code given to it. However, if the nature of the code is better known, you can directly specify a compilation directive.
The first option is nopython, which prevents the compilation from falling back to Python object mode. If the code can’t be converted, Numba raises an error instead. The second option is nogil, which releases the GIL while the function runs code that doesn’t touch Python objects (that is, code compiled in nopython mode). This option assumes you’ve thought through multithreaded considerations such as consistency and race conditions. The cache option stores the compiled function in a file-based cache to avoid unnecessary compilation the next time Numba is invoked on the same function. The parallel option applies a CPU-tailored automatic parallelization to operations with known, reliable semantics, such as array expressions and NumPy computations. This option is a good first choice for kernels that do symbolic math.
Stricter function signatures improve the opportunities for Numba to optimize the code. Defining the expected datatype for each parameter in the signature gives the Numba interpreter the necessary information to find the best machine representation and memory alignment of the kernel. This is similar to providing static types for a C compiler. The following examples show how to provide type information to Numba:
    from numba import jit, vectorize, int32, int64, float64

    # Expecting int32 arguments and an int32 return value
    @jit(int32(int32, int32))

    # Expecting an int64 array, an int64 scalar, and an int64 array
    @jit([(int64[:], int64, int64[:])])

    # Expecting float64 arguments and a float64 return value
    @vectorize([float64(float64, float64)])
In general, accessing parallelism in Python with Numba is about knowing a few fundamentals and modifying your workflow to take these methods into account while you’re actively coding in Python. Here are the steps in the process:
- Ensure the abstraction of your core kernels is appropriate. Numba requires the optimization target to be in a function. Unnecessarily complex code can cause the Numba compilation to fall back to object mode.
- Look for places in your code where data with a known datatype is processed in some form of loop. Examples would be a for-loop iterating over a list of integers, or an arithmetic computation that processes an array in pure Python.
- If you’re using NumPy and SciPy, look at computations that can be stacked in a single statement and that are not BLAS or LAPACK functions. These are prime candidates for using the ufunc optimization capabilities of Numba.
- Experiment with Numba’s compilation options.
- Determine the intended datatype signature of the function and core code. If it’s known (such as int8 or int32), then inform Numba about which input datatype parameters it should expect.
Achieving parallelism with Numba just takes a bit of practice and the right fundamentals. Getting the performance advantages of stepping out of the GIL while keeping code maintainable is a testament to the Python community’s hard work in the scientific computing space. Numba is one of the best tools to achieve performance and exploit parallelism, so it should be in every Python developer’s toolkit.