Python is the most common programming language for data science, machine learning, and numerical computing. It continues to grow in popularity among scientists and researchers. In the Python ecosystem, NumPy is the foundational Python library for performing array-based numerical computations.
NumPy’s standard implementation operates on a single CPU core, with only a limited set of operations parallelized across cores. This single-threaded, CPU-only execution restricts both the scale of data that can be processed and the speed at which computations can be performed.
While GPU-accelerated implementations of NumPy are available, scaling NumPy-based code across multiple GPUs or nodes often requires extensive code modification, including manual data partitioning, synchronization, and data movement for distributed execution. Such changes can be complicated and time-consuming to get both functionally correct and performant.
In addition, domain scientists with little expertise in distributed programming often have to pair up with or consult computer science experts to make these changes, which further slows the cycle of experimenting and validating research.
To address this productivity issue, we built NVIDIA cuPyNumeric, an open source, distributed and accelerated implementation of the NumPy API, designed to be a drop-in replacement for NumPy. With cuPyNumeric, scientists, researchers, and those working with large-scale problems can develop and test NumPy programs on moderately sized datasets using laptops or desktops. They can then scale the same programs by themselves to massive datasets using supercomputers or the cloud with no code changes.
cuPyNumeric offers you the productivity of NumPy with the performance benefits of accelerated, distributed GPU computing. cuPyNumeric supports essential NumPy features, including in-place updates, broadcasting, and comprehensive indexing semantics.
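As a quick illustration of these features (the array values below are arbitrary), the following snippet exercises in-place updates, broadcasting, and boolean indexing through the cupynumeric module:
import cupynumeric as np
a = np.zeros((4, 3))   # 4 x 3 array of zeros
b = np.arange(3)       # 1D array [0, 1, 2]
a += b                 # in-place update; b is broadcast across the rows of a
a[a > 1] = -1          # boolean (advanced) indexing with assignment
print(a)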
cuPyNumeric: A drop-in replacement for NumPy
You can start using cuPyNumeric by importing cupynumeric instead of numpy in a NumPy program. The following code example approximates the value of pi using NumPy APIs:
# import numpy as np
import cupynumeric as np
num_points = 1000000000
# Generates random points in a square of area 4
x = np.random.uniform(-1, 1, size=num_points)
y = np.random.uniform(-1, 1, size=num_points)
# Calculates the probability of points falling in a circle of radius 1
probability_in_circle = np.sum((x ** 2 + y ** 2) < 1) / num_points
# probability_in_circle = (area of the circle) / (area of the square)
# ==> PI = probability_in_circle * (area of the square)
print(f"PI is {probability_in_circle * 4}")
The program produces the following output whether it runs with cuPyNumeric or NumPy, although the exact value may differ from run to run:
PI is 3.141572908
With only a simple change to the import statement, this code is now ready to scale to any machine:
- On laptops with no GPUs, the code is parallelized using available CPU cores.
- On a system with multiple GPUs, the same code runs significantly faster because the work is accelerated across all available GPUs.
- On a supercomputer, the code still runs with no modification, although this particular example doesn't have enough computation to make full use of such a powerful machine.
Stencil example in cuPyNumeric
Here’s a more interesting example that shows the benefits of cuPyNumeric. Stencil computations are one of the most common types of algorithms in scientific computing. Stencil programs are naturally expressed in NumPy using slices that shift and align neighbor cells with the center cell.
However, these programs are not always straightforward to parallelize for multi-GPU systems, as any changes to the center cells by one GPU must be propagated to other GPUs when the arrays are partitioned across multiple GPUs.
cuPyNumeric transparently scales stencil code written in pure NumPy to any number of GPUs or nodes without requiring the programmer to think about this distribution.
The following code example creates a 2D array grid and performs a 5-point stencil computation over each cell by referencing aliasing slices of the array:
import cupynumeric as np
N = 1000        # interior grid size per dimension (example value)
niters = 1000   # number of stencil iterations (example value)
grid = np.random.rand(N+2, N+2)
# Create multiple aliasing views of the grid array.
center = grid[1:-1, 1:-1]
north = grid[0:-2, 1:-1]
east = grid[1:-1, 2:]
west = grid[1:-1, 0:-2]
south = grid[2:, 1:-1]
for _ in range(niters):
    average = (center + north + east + west + south) * 0.2
    center[:] = average
When this program is run on multiple GPUs, arrays are automatically partitioned into tiles across the GPUs of the machine, and operations are distributed so that each GPU performs the operation on its local piece of the data.
In addition to the data and compute partitioning, communication between the aliasing views of the grid array is automatically inferred by cuPyNumeric. After the operation center[:] = average, the update to the center array must be propagated to the north, east, west, and south arrays.
cuPyNumeric performs this communication without any explicit communication code in the program, automatically discovering a halo-like communication pattern between GPUs within or across nodes, where only the data at the edges of each tile must be communicated. cuPyNumeric is the only existing distributed NumPy implementation that can support aliasing and mutation of distributed arrays in this manner.
With cuPyNumeric, this Python program can weak scale with high efficiency to large numbers of GPUs. Figure 1 shows a weak scaling plot of the stencil program out to 1,024 GPUs of the NVIDIA Eos supercomputer. NVIDIA Eos is a supercomputer announced in 2022 to advance AI research and development, and it consists of 576 NVIDIA DGX H100 nodes (4,608 NVIDIA H100 Tensor Core GPUs in total), connected by 400-Gbps NVIDIA Quantum-2 InfiniBand.
A weak-scaling experiment starts with a fixed problem size on a single GPU and then increases the number of GPUs and the problem size while keeping the problem size per GPU the same. The throughput per GPU achieved by the system is then plotted.
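As a concrete illustration of this setup (the tile size and GPU counts below are example values, not the exact configuration behind Figure 1), the global grid grows with the GPU count while the work per GPU stays fixed:
import math
tile = 10_000                                   # grid points per dimension owned by one GPU
for gpus in [1, 4, 16, 64, 256, 1024]:
    side = int(math.sqrt(gpus))                 # GPUs arranged in a square layout
    global_n = tile * side                      # global grid points per dimension
    print(gpus, global_n, global_n ** 2 // gpus)   # points per GPU stays constant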
The near-flat line in the plot indicates that the code maintains the same throughput achieved at a single GPU, meaning that cuPyNumeric can solve larger problems in the same amount of time when given more GPUs. This is an ideal weak-scaling plot for a stencil computation that exhibits a nearest-neighbor communication pattern, one with a constant amount of communication per processor.
We also measured a single GPU baseline performance to evaluate the runtime overhead introduced by cuPyNumeric for distributed execution. As shown in Figure 1, cuPyNumeric’s single GPU performance is comparable to the baseline performance with little to no overhead of distributed execution.
How cuPyNumeric parallelizes NumPy operations
By design, NumPy APIs are built around vector operations that have ample data parallelism. cuPyNumeric exploits this inherent data parallelism by partitioning arrays and performing computations in parallel on the subsets across multiple GPUs. While doing so, cuPyNumeric performs the necessary communication whenever GPUs make overlapping accesses to the same subsets of arrays.
In this section, we show how cuPyNumeric parallelizes NumPy operations, using the stencil example.
Take the expression center + north from the statement calculating the average. To perform this operation in parallel on multiple GPUs, cuPyNumeric partitions the center and north arrays across the GPUs in the following way: if a tile of center for a GPU contains center[i,j], the corresponding tile of north for the same GPU must contain north[i,j].
When the arrays are partitioned, cuPyNumeric launches tasks on the GPUs. The task on each GPU performs element-wise additions on the tiles of center and north assigned to that GPU. Figure 2 shows an example parallelization of the expression with four tasks using 2 x 2 partitions of center and north.
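The following plain-NumPy sketch mimics this 2 x 2 partitioning on a single process to make the idea concrete (the tiling loop is purely illustrative; cuPyNumeric performs the real partitioning and task launches internally):
import numpy as np
N = 8
grid = np.random.rand(N + 2, N + 2)
center = grid[1:-1, 1:-1]
north = grid[0:-2, 1:-1]
half = N // 2
# The same (row, column) tile of center and north is assigned to the same
# "GPU", so each tile-local addition needs no data from other tiles.
for ti in range(2):
    for tj in range(2):
        rows = slice(ti * half, (ti + 1) * half)
        cols = slice(tj * half, (tj + 1) * half)
        partial = center[rows, cols] + north[rows, cols]   # one task's work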
Even though Figure 2 is drawn as if center and north were different arrays, they are in fact overlapping slices of the same array grid, so mapping them to separate physical allocations would be inefficient.
Instead, cuPyNumeric performs a coalescing optimization that attempts to group data used by different slices of the same array into a single larger allocation. Figure 3 shows the result of this optimization on the example in Figure 2.
What happens if you do the same addition again after the center array is updated in the previous iteration?
Because center and north are slices of the same array, the changes made to the center array in the previous iteration must be propagated to north before the addition in the current iteration. Thanks to the allocation coalescing, the cells in the intersection between center and north on each GPU are already up to date.
For cells that are duplicated on multiple GPUs (the cells A, B, C, and D in Figure 4), cuPyNumeric automatically copies the data from center to north to guarantee coherent data access for the addition tasks.
Figure 4 shows the data transfers necessary for the example in Figure 3.
Finally, expand the scope to the aggregation of five arrays in the average calculation. The code performs four element-wise additions, each of which corresponds to a separate NumPy API call. Before the additions are done, the changes to the center array made in the previous iteration must be propagated to the north, east, west, and south arrays.
cuPyNumeric executes this code asynchronously so that it can hide communication with useful, independent computation. To find opportunities for overlapping, cuPyNumeric constructs a task graph: a DAG of tasks whose edges express dependencies between tasks. Tasks that are not connected in the graph can potentially execute simultaneously.
Figure 5 shows a task graph for the inner stencil loop, where each addition operation can be used to hide the latency of data transfers to the arrays used in the subsequent additions. For brevity, the graph has only one node for each multi-GPU operation.
Developer productivity in a real-world example
While the stencil example is a small and simple program, we’ve been able to port large scientific applications to cuPyNumeric.
One such example is TorchSWE, a GPU-accelerated shallow-water equation solver developed by Pi-Yueh Chuang and Dr. Lorena Barba. TorchSWE solves the vertically averaged Navier-Stokes equations and can simulate free-surface water flow in rivers, channels, and coastal areas, as well as model flood inundation. Given a topography, TorchSWE can predict flood-prone areas and the height of water inundation, making it a valuable tool for risk mapping.
High-resolution numerical simulations—such as those on real topographies requiring hundreds of millions of data points—demand distributed computation across multiple GPUs. To achieve this, the original TorchSWE implementation uses mpi4py to manually partition the problem and manage inter-GPU data communication and synchronization.
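To give a sense of what that manual approach involves, here is a schematic one-dimensional halo exchange written with mpi4py (an illustrative sketch with example sizes, not TorchSWE's actual code):
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank owns a slab of rows plus one halo row on each side (example sizes).
local_rows, cols = 1000, 1000
local = np.zeros((local_rows + 2, cols))
up = rank - 1 if rank > 0 else MPI.PROC_NULL
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Exchange halo rows with both neighbors before every stencil update.
comm.Sendrecv(local[1, :], dest=up, recvbuf=local[-1, :], source=down)
comm.Sendrecv(local[-2, :], dest=down, recvbuf=local[0, :], source=up)
In a cuPyNumeric port, neither the decomposition nor exchanges like these are written by hand; the library infers the required communication from the array operations themselves.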
cuPyNumeric enables a distributed implementation of TorchSWE that avoids the complexities of an MPI implementation. After TorchSWE was ported to cuPyNumeric by removing all domain decomposition logic, it scaled effortlessly across multiple GPUs and nodes without further code modifications. This scalability enabled high-fidelity simulations exceeding 1.2 billion data points on 32 GPUs, letting researchers tackle critical scientific problems in flood inundation modeling without needing specialized distributed computing expertise.
Overall, the cuPyNumeric implementation eliminated 20% of the code and, more importantly, simplified development and maintenance for domain scientists by removing the need to understand and implement the complex logic required for distributed execution.
For more information about the productivity aspects of the cuPyNumeric library in the TorchSWE example, see the TorchSWE case study in the cuPyNumeric documentation.
Get started
The easiest way to download cuPyNumeric is by using conda:
$ conda install -c conda-forge -c legate cupynumeric
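After installation, a quick way to verify that the package works is to run a small script (the array size here is arbitrary):
import cupynumeric as np
a = np.random.rand(1000, 1000)
print(a.sum())   # should print a value close to 500,000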
For more information, see the following resources:
- For more examples of cuPyNumeric, see Examples in the cuPyNumeric documentation.
- For a self-guided tutorial, see Chapter X: Distributed Computing with cuPyNumeric.
Even though cuPyNumeric allows zero-code-change scaling for NumPy programs, not every NumPy program scales effectively. For more information about achieving scalable performance with cuPyNumeric, as well as common anti-patterns that can adversely affect performance, see Best Practices in the NVIDIA cuPyNumeric documentation.
Summary
This post introduced NVIDIA cuPyNumeric, an accelerated and distributed drop-in replacement for NumPy. With cuPyNumeric, scaling NumPy programs across a range of systems—from laptops to supercomputers—is as simple as changing an import statement. We also showed how cuPyNumeric parallelizes NumPy operations for multi-GPU execution using a stencil example.
If you have any comments or questions, please contact the team through the /nv-legate/discussion GitHub repo or by email.