
Effortlessly Scale NumPy from Laptops to Supercomputers with NVIDIA cuPyNumeric


Python is the most common programming language for data science, machine learning, and numerical computing, and it continues to grow in popularity among scientists and researchers. In the Python ecosystem, NumPy is the foundational library for array-based numerical computation.

NumPy’s standard implementation operates on a single CPU core, with only a limited set of operations parallelized across cores. This single-threaded, CPU-only execution restricts both the scale of data that can be processed and the speed at which computations can be performed.  

While GPU-accelerated implementations of NumPy are available, scaling NumPy-based code across multiple GPUs or nodes often requires extensive code modification, including manual data partitioning, synchronization, and data movement for distributed execution. Such changes can be complicated and time-consuming to make both functionally correct and performant.

In addition, domain scientists with little expertise in distributed programming often pair up with or consult computer science experts to get these changes done, which further slows the process of experimenting and validating research ideas.

To address this productivity issue, we built NVIDIA cuPyNumeric, an open source, distributed and accelerated implementation of the NumPy API, designed to be a drop-in replacement for NumPy. With cuPyNumeric, scientists, researchers, and those working with large-scale problems can develop and test NumPy programs on moderately sized datasets using laptops or desktops. They can then scale the same programs by themselves to massive datasets using supercomputers or the cloud with no code changes.

cuPyNumeric offers you the productivity of NumPy with the performance benefits of accelerated, distributed GPU computing. cuPyNumeric supports essential NumPy features, including in-place updates, broadcasting, and comprehensive indexing semantics.

cuPyNumeric: A drop-in replacement for NumPy

You can start using cuPyNumeric by importing cupynumeric instead of numpy in a NumPy program. The following code example approximates PI using NumPy APIs:

# import numpy as np
import cupynumeric as np

num_points = 1000000000

# Generates random points in a square of area 4
x = np.random.uniform(-1, 1, size=num_points)
y = np.random.uniform(-1, 1, size=num_points)

# Calculates the probability of points falling in a circle of radius 1
probability_in_circle = np.sum((x ** 2 + y ** 2) < 1) / num_points

# probability_in_circle = (area of the circle) / (area of the square)

#   ==> PI = probability_in_circle * (area of the square)
print(f"PI is {probability_in_circle * 4}")

The program produces the following output whether it runs with cuPyNumeric or NumPy, although the exact value can differ from run to run:

PI is 3.141572908

With only a simple change to the import statement, this code is now ready to scale to any machine (see the import-fallback sketch after this list):

  • On laptops with no GPUs, the code is parallelized using available CPU cores. 
  • On a system with multiple GPUs, the same code runs much faster, accelerated across all of the available GPUs. 
  • This code can even run on a supercomputer with no code modification, although the code doesn’t have enough computation to make the best use of such a powerful machine.
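
Because cuPyNumeric mirrors the NumPy API, one common pattern (a minimal sketch of the drop-in property, not part of the example above) is to fall back to plain NumPy when cuPyNumeric is not installed, so the same script runs on any machine:

# Prefer cuPyNumeric when it is available; otherwise fall back to NumPy.
# The rest of the program stays identical in both cases.
try:
    import cupynumeric as np   # accelerated, distributed execution
except ImportError:
    import numpy as np         # single-process CPU execution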

Stencil example in cuPyNumeric

Here’s a more interesting example that shows the benefits of cuPyNumeric. Stencil computations are one of the most common types of algorithms in scientific computing. Stencil programs are naturally expressed in NumPy using slices that shift and align neighbor cells with the center cell. 

However, these programs are not always straightforward to parallelize for multi-GPU systems, as any changes to the center cells by one GPU must be propagated to other GPUs when the arrays are partitioned across multiple GPUs. 

cuPyNumeric transparently scales stencil code written in pure NumPy to any number of GPUs or nodes without requiring the programmer to think about this distribution.

The following code example creates a 2D array grid and performs a 5-point stencil computation over each cell by referencing aliasing slices of the array:

import cupynumeric as np

N, niters = 1000, 100  # grid size and iteration count (example values)
grid = np.random.rand(N+2, N+2)
# Create multiple aliasing views of the grid array.
center = grid[1:-1, 1:-1]
north  = grid[0:-2, 1:-1]
east   = grid[1:-1, 2:  ]
west   = grid[1:-1, 0:-2]
south  = grid[2:  , 1:-1]
for _ in range(niters):
  average = (center + north + east + west + south) * 0.2
  center[:] = average

When this program is run on multiple GPUs, arrays are automatically partitioned into tiles across the GPUs of the machine, and operations are distributed so that each GPU performs the operation on its local piece of the data. 

In addition to the data and compute partitioning, communication between the aliasing views of the grid array is automatically inferred by cuPyNumeric. After the operation center[:] = average, the update to the center array must be propagated to the north, east, west, and south arrays. 

cuPyNumeric performs this communication without any explicit communication code in the program, automatically discovering a halo-like communication pattern between GPUs within or across nodes, where only the data at the edges of each tile must be communicated. cuPyNumeric is the only existing distributed NumPy implementation that can support aliasing and mutation of distributed arrays in this manner.

With cuPyNumeric, this Python program can weak scale with high efficiency to large numbers of GPUs. Figure 1 shows a weak scaling plot of the stencil program out to 1,024 GPUs of the NVIDIA Eos supercomputer. NVIDIA Eos is a supercomputer announced in 2022 to advance AI research and development, and it consists of 576 NVIDIA DGX H100 nodes (4,608 NVIDIA H100 Tensor Core GPUs in total), connected by 400-Gbps NVIDIA Quantum-2 InfiniBand.

A weak-scaling experiment starts with a fixed problem size on a single GPU and then increases the number of GPUs and the problem size while keeping the problem size per GPU the same. The throughput per GPU achieved by the system is then plotted. 

The near-flat line in the plot indicates that the code maintains the same throughput achieved at a single GPU, meaning that cuPyNumeric can solve larger problems in the same amount of time when given more GPUs. This is an ideal weak-scaling plot for a stencil computation that exhibits a nearest-neighbor communication pattern, one with a constant amount of communication per processor. 
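
As a concrete way to read such a plot, weak-scaling efficiency is typically the per-GPU throughput at N GPUs divided by the single-GPU throughput. The following sketch uses made-up throughput numbers purely for illustration; they are not the measurements plotted in Figure 1:

# Hypothetical per-GPU throughput values (cells updated per second), for
# illustration only; they are not the measurements from Figure 1.
throughput_per_gpu = {1: 1.00e9, 16: 0.99e9, 128: 0.98e9, 1024: 0.97e9}

baseline = throughput_per_gpu[1]
for gpus, throughput in throughput_per_gpu.items():
    efficiency = throughput / baseline
    print(f"{gpus:5d} GPUs: weak-scaling efficiency = {efficiency:.1%}")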

We also measured a single GPU baseline performance to evaluate the runtime overhead introduced by cuPyNumeric for distributed execution. As shown in Figure 1, cuPyNumeric’s single GPU performance is comparable to the baseline performance with little to no overhead of distributed execution.

Figure 1. Weak scaling plot for a cuPyNumeric 2D stencil computation

How cuPyNumeric parallelizes NumPy operations

By design, NumPy APIs are built around vector operations that have ample data parallelism. cuPyNumeric uses this inherent data parallelism by partitioning arrays and performing computations in parallel on the subsets using multiple GPUs. While doing so, cuPyNumeric performs necessary communication when GPUs make overlapping accesses to the same subsets of arrays. 

In this section, we show how cuPyNumeric parallelizes NumPy operations, using the stencil example.

Take the expression center + north from the statement calculating the average. To perform this operation in parallel on multiple GPUs, cuPyNumeric partitions the center and north arrays across the GPUs in the following way: if a tile of center assigned to a GPU contains center[i,j], the corresponding tile of north on the same GPU must contain north[i,j].

When the arrays are partitioned, cuPyNumeric launches tasks on the GPUs. The task on each GPU performs element-wise additions on the tiles of center and north assigned to that GPU. Figure 2 shows an example parallelization of the expression with four tasks using 2 x 2 partitions of center and north.

A diagram shows the original grid, which is first split into center and north sections. Each section is then further divided across four GPUs.
Figure 2. An example parallelization of expression center + north on four GPUs
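
The following plain-NumPy sketch mimics that 2 x 2 tiling on a single machine purely to make the alignment concrete; it is an illustration of the partitioning concept, not cuPyNumeric's internal implementation:

import numpy as np

N = 8                                  # small grid, for illustration only
grid = np.random.rand(N + 2, N + 2)
center = grid[1:-1, 1:-1]
north  = grid[0:-2, 1:-1]

# Conceptual 2 x 2 tiling: tile (i, j) of center is added to the matching
# tile (i, j) of north, which is the work one GPU task would perform.
half = N // 2
for i in range(2):
    for j in range(2):
        rows = slice(i * half, (i + 1) * half)
        cols = slice(j * half, (j + 1) * half)
        partial = center[rows, cols] + north[rows, cols]
        print(f"task ({i}, {j}) computed a {partial.shape} tile")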

Even though Figure 2 is drawn as if center and north are different arrays, center and north are in fact overlapping slices of the same array grid, and so mapping them to separate physical allocations is inefficient. 

Instead, cuPyNumeric performs a coalescing optimization that attempts to group data used by different slices of the same array into a single larger allocation. Figure 3 shows the result of this optimization on the example in Figure 2. 

The diagram shows how the center and north partitions overlap when they are originally partitioned across four GPUs.
Figure 3. Coalescing of center and north for the previous example
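
You can observe the aliasing that makes this optimization possible with plain NumPy alone: center and north are overlapping views of the same grid buffer, not independent copies. The following small sketch is separate from the stencil program above:

import numpy as np

grid = np.random.rand(10, 10)
center = grid[1:-1, 1:-1]
north  = grid[0:-2, 1:-1]

# Both slices are views into the same underlying buffer...
print(np.shares_memory(center, north))   # True

# ...so a write through one view is visible through the other.
center[:] = 0.0
print(north[1:, :].max())                # 0.0: the rows shared with center changed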

What happens if you do the same addition again after the center array is updated from the previous iteration? 

Because center and north are slices of the same array, the changes made to the center array from the previous iteration must be propagated to north before doing the addition in the current iteration. Thanks to the allocation coalescing, the cells in the intersection between center and north on each GPU are already up to date.

For the cells that are duplicated across multiple GPUs (cells A, B, C, and D in Figure 4), cuPyNumeric automatically copies the data from center to north to guarantee coherent data access for the addition tasks. 

Figure 4 shows the data transfers necessary for the example in Figure 3.

A diagram shows how data transfer from the center array to the north array is implemented across the four partitions of the original grid.
Figure 4. Data transfers to propagate changes from center to north

Finally, expand the scope to the aggregation of five arrays in the average calculation. The code performs four element-wise additions, each of which corresponds to a separate NumPy API call. Before the additions are done, changes to the center array made in the previous iteration should be propagated to the north, east, west, and south arrays, respectively. 
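
To make that structure explicit, here is roughly how Python evaluates the averaging line as a chain of binary operations, each a separate NumPy API call. The setup and the temporary names tmp1 through tmp4 are illustrative only, not cuPyNumeric internals:

import cupynumeric as np

# Tiny setup mirroring the stencil example; sizes are illustrative only.
N = 4
grid = np.random.rand(N + 2, N + 2)
center, north = grid[1:-1, 1:-1], grid[0:-2, 1:-1]
east,  west   = grid[1:-1, 2:  ], grid[1:-1, 0:-2]
south         = grid[2:  , 1:-1]

# Each temporary below is one array operation that cuPyNumeric schedules as
# tasks; a data transfer (a gray node in Figure 5) must complete before the
# addition that consumes the corresponding view.
tmp1 = center + north
tmp2 = tmp1 + east
tmp3 = tmp2 + west
tmp4 = tmp3 + south
average = tmp4 * 0.2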

cuPyNumeric executes this code asynchronously so that it can hide communication behind useful, independent computation. To find opportunities for overlap, cuPyNumeric constructs a task graph: a DAG of tasks whose edges express dependencies between them. Tasks that are not connected in the graph can potentially be executed simultaneously. 

Figure 5 shows a task graph for the inner stencil loop, where each addition operation can be used to hide the latency of data transfers to the arrays used in the subsequent additions. For brevity, the graph has only one node for each multi-GPU operation.

A diagram includes four boxes corresponding to NumPy API calls, and four more boxes with arrows pointing to the first four, representing data transfers inserted by cuPyNumeric. The arrows between the boxes indicate dependencies between the various operations.
Figure 5. A task graph for the summation of the center, north, east, west, and south arrays (green = NumPy API calls; gray = data transfers inserted by cuPyNumeric)

Developer productivity in a real-world example

While the stencil example is a small and simple program, we’ve been able to port large scientific applications to cuPyNumeric. 

One such example is TorchSWE, a GPU-accelerated shallow-water equation solver developed by Pi-Yueh Chuang and Dr. Lorena Barba. TorchSWE solves vertically averaged Navier-Stokes equations and can simulate free-surface water flow in rivers, channels, and coastal areas, as well as model flood inundation. Given a topography, TorchSWE can predict flood-prone areas and the height of water inundation, making it a valuable tool for risk mapping.

High-resolution numerical simulations—such as those on real topographies requiring hundreds of millions of data points—demand distributed computation across multiple GPUs. To achieve this, the original TorchSWE implementation, using Mpi4Py, manually partitions the problem and manages inter-GPU data communication and synchronization.

cuPyNumeric enables a distributed implementation of TorchSWE that avoids the complexities of an MPI implementation. After the domain decomposition logic was removed during the port to cuPyNumeric, TorchSWE scaled effortlessly across multiple GPUs and nodes without further code modifications. This scalability made possible high-fidelity simulations exceeding 1.2 billion data points on 32 GPUs, letting researchers tackle critical scientific problems in flood inundation modeling without needing specialized distributed computing expertise. 

Overall, the cuPyNumeric implementation eliminated 20% of the code and, more importantly, simplified development and maintenance for domain scientists by removing the need to understand and implement the complex logic required for distributed execution. 

For more information about the productivity aspects of the cuPyNumeric library in the TorchSWE example, see the TorchSWE case study in the cuPyNumeric documentation.

Get started

The easiest way to download cuPyNumeric is by using conda:

$ conda install -c conda-forge -c legate cupynumeric
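
After installation, a quick smoke test (a minimal check of our own, not taken from the official docs) confirms that arrays and reductions work:

import cupynumeric as np

# Build a small array and reduce it.
x = np.arange(10)
print(x.sum())   # expected output: 45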

For more information, see the following resources:

  • For more examples of cuPyNumeric, see the Examples in the documentation.
  • For a self-guided tutorial on cuPyNumeric, see Chapter X: Distributed Computing with cuPyNumeric.

Even though cuPyNumeric allows zero-code-change scaling for NumPy programs, not every NumPy program scales effectively. For more information about scalable performance with cuPyNumeric and also common anti-patterns that can adversely affect performance, see Best Practices in the NVIDIA cuPyNumeric documentation.

Summary

This post introduced NVIDIA cuPyNumeric, an accelerated and distributed drop-in replacement for NumPy. With cuPyNumeric, scaling NumPy programs across a range of systems—from laptops to supercomputers—is as simple as changing an import statement. We also showed how cuPyNumeric parallelizes NumPy operations for multi-GPU execution using a stencil example.

If you have any comments or questions, please contact the team through the /nv-legate/discussion GitHub repo page or email.
