Data Science

NVIDIA RAPIDS 24.10 Introduces Accelerated NetworkX with Zero Code Change, Updates for UMAP and cuDF-Pandas

The RAPIDS v24.10 release takes another step forward in bringing accelerated computing to data scientists and developers with a seamless user experience. This blog post highlights the new features, including:

  • Zero code change accelerated NetworkX is now generally available (GA)
  • Polars GPU engine in open beta
  • Bringing UMAP to larger-than-GPU-memory datasets
  • Improved cuDF pandas compatibility with NumPy and PyArrow
  • Guidelines for incorporating GPUs into GitHub-based CI systems
  • RAPIDS-wide support for Python 3.12 and NumPy 2.x

Zero code change accelerated NetworkX

NetworkX accelerated by RAPIDS cuGraph is now GA in the 24.10 release beginning with NetworkX 3.4. This release adds GPU-accelerated graph creation, a new user experience, and expanded documentation.

Accelerated graph construction enables full end-to-end acceleration for NetworkX workflows, which is particularly valuable for workflows with large graphs, where the overhead of converting data between CPU and GPU can erode performance gains.

The full end-to-end accelerated NetworkX experience is now enabled by setting the NX_CUGRAPH_AUTOCONFIG environment variable to True.

%env NX_CUGRAPH_AUTOCONFIG=True

import pandas as pd
import networkx as nx

url = "https://data.rapids.ai/cugraph/datasets/cit-Patents.csv"
df = pd.read_csv(url, sep=" ", names=["src", "dst"], dtype="int32")
G = nx.from_pandas_edgelist(df, source="src", target="dst")

%time result = nx.betweenness_centrality(G, k=10)
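
Alternatively, if you prefer to opt in per call rather than globally through the environment variable, NetworkX's backend dispatch also accepts a backend keyword on supported algorithms. The following is a minimal sketch, assuming nx-cugraph is installed and registered under the "cugraph" backend name:

import networkx as nx

G = nx.karate_club_graph()

# Dispatch just this call to the cuGraph backend; other calls still run on the CPU.
result = nx.betweenness_centrality(G, k=10, backend="cugraph")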

End-to-end acceleration enables workflows using algorithms like betweenness centrality, PageRank, and more to see speedups of 10x, 50x, or even 500x on larger graphs, depending on the algorithm.

Figure 1. The PageRank algorithm run on a citation graph of U.S. patents (4M nodes, 16M edges) is 70x faster than NetworkX on CPU. SW: NetworkX 3.4.1, cuGraph/nx-cugraph 24.10; GPU: NVIDIA A100 80GB; CPU: Intel Xeon w9-3495X (56 cores), 250GB RAM

Figure 2. The betweenness centrality algorithm run on the Live Journal social network (5M nodes, 69M edges) is 485x faster than NetworkX on CPU with the number of samples (k) set to 100. SW: NetworkX 3.4.1, cuGraph/nx-cugraph 24.10; GPU: NVIDIA A100 80GB; CPU: Intel Xeon w9-3495X (56 cores), 250GB RAM

You can learn more about NetworkX accelerated by cuGraph in the documentation and explore the code for the benchmarks above here.

Zero code change accelerated Polars in open beta

In September, the Polars GPU engine powered by cuDF was released in open beta. With GPU support available in Polars, users can run workflows up to 13x faster than on CPUs, with zero code changes required.

Figure 3. The four best speedups across a set of 22 queries from the PDS-H benchmark. The Polars GPU engine powered by RAPIDS cuDF offers up to 13x speedup compared to CPU on queries with many complex groupby and join operations.

PDS-H benchmark scale factor 80 | GPU: NVIDIA H100 | CPU: Intel Xeon W9-3495X (Sapphire Rapids) | Storage: Local NVMe. Note: PDS-H is derived from TPC-H, but these results are not comparable to TPC-H results.

GPU support is built directly into the Polars Lazy API: users can configure Polars to use the GPU by passing the `engine` keyword to `collect` when they trigger computation.

import polars as pl

df = pl.LazyFrame({"a": [1.242, 1.535]})
q = df.select(pl.col("a").round(1))
result = q.collect(engine="gpu")
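
For finer control, the engine argument can also be a configured engine object. Below is a minimal sketch assuming the pl.GPUEngine options shown here (device selection and raise_on_fail); check the Polars GPU Support documentation for the options available in your version.

import polars as pl

df = pl.LazyFrame({"a": [1.242, 1.535]})
q = df.select(pl.col("a").round(1))

# Configure the GPU engine explicitly: pin a device and raise an error
# instead of silently falling back to the CPU engine on unsupported queries.
result = q.collect(engine=pl.GPUEngine(device=0, raise_on_fail=True))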

To learn more, read the NVIDIA and Polars announcement blogs or dive into the Polars GPU Support documentation. Or, jump right into a Google Colab notebook and take it for a test drive.

Bringing UMAP to larger-than-GPU-memory datasets

Beginning in v24.10, cuML’s UMAP algorithm now supports processing larger-than-GPU-memory datasets that would have resulted in an Out Of Memory error in earlier releases. By using a novel batched approximate nearest neighbor algorithm and optionally storing the full dataset in CPU memory, we’re able to build the approximate KNN graph while only processing subsets of the data on the GPU at any given time.

Users can tap into this new optional functionality by setting the new `nnd_n_clusters` keyword to any value greater than 1 (the default) and, if necessary, passing the `data_on_host=True` keyword to `fit` or `fit_transform`.

from cuml.manifold import UMAP
import numpy as np

# Dataset dimensions (illustrative values)
n_samples = 1_000_000
n_features = 64

# Generate synthetic data using numpy (random float32 matrix)
X = np.random.rand(n_samples, n_features).astype(np.float32)

# UMAP parameters
num_clusters = 4  # Number of clusters for NN Descent batching; 1 means no clustering
data_on_host = True  # Whether the data is stored on the host (CPU)

# UMAP model configuration
reducer = UMAP(
    n_neighbors=10,
    min_dist=0.01,
    build_algo="nn_descent",
    build_kwds={"nnd_n_clusters": num_clusters},
)

# Fit and transform the data
embeddings = reducer.fit_transform(X, data_on_host=data_on_host)

Users can start with an initial value of `nnd_n_clusters` (for example, 4) and increase it as needed to manage GPU memory usage. Setting the value too high may add performance overhead from multiple rounds of graph building, so it is worth finding a balance based on the size of the dataset and the GPU memory available.
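
As a rough sketch of that tuning loop (assuming, for illustration, that an out-of-GPU-memory failure surfaces as a MemoryError or RuntimeError; the exact exception can vary by environment), you could retry the fit with progressively more clusters:

from cuml.manifold import UMAP

# Sketch: retry with more NN Descent clusters until the batched KNN build fits in GPU memory.
for n_clusters in (4, 8, 16, 32):
    try:
        reducer = UMAP(
            n_neighbors=10,
            build_algo="nn_descent",
            build_kwds={"nnd_n_clusters": n_clusters},
        )
        embeddings = reducer.fit_transform(X, data_on_host=True)  # X from the example above
        break
    except (MemoryError, RuntimeError):
        # Assumed error types for an out-of-memory failure; adjust for your setup.
        continue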

Improved cuDF pandas ecosystem compatibility

Improved code compatibility

cuDF’s pandas accelerator mode is now fully compatible with NumPy arrays. Previously, running Python isinstance checks on NumPy arrays produced by the pandas API would return False when using cuDF pandas but True when using standard pandas. Because this is a common code design pattern, some user workflows required workarounds to run smoothly.

Starting in v24.10, when the accelerator mode is active and a user converts a DataFrame or column to an array, cudf.pandas produces a true NumPy array, eliminating this issue. For example:

%load_ext cudf.pandas
import pandas as pd
import numpy as np

arr = pd.Series([1, 2, 3]).values # now returns a true numpy array
isinstance(arr, np.ndarray) # returns True

This change also enables code relying on the NumPy C API to work smoothly with cuDF pandas.
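
For example, compiled consumers that read the array through NumPy's C-level interfaces now work as expected. Here is a small sketch using Numba purely as an illustration of such a consumer (Numba is not required by cuDF pandas):

%load_ext cudf.pandas
import pandas as pd
from numba import njit

@njit
def total(values):
    # Compiled loop that accesses the array's underlying NumPy buffer
    s = 0.0
    for v in values:
        s += v
    return s

arr = pd.Series([1.0, 2.0, 3.0]).values  # a true numpy.ndarray under cudf.pandas
print(total(arr))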

Improved Arrow compatibility

cuDF also now supports a range of PyArrow versions. Arrow compatibility has been a long-running pain point for cuDF users: until now, every release of cuDF was tied to a specific release of Arrow because of our use of the Arrow C++ API and the binary compatibility requirements that usage imposed.

With this release, we’ve rewritten those features to exclusively use the Arrow C Data Interface, which in turn has allowed us to stop using Arrow C++ entirely. With that change, cuDF Python can now support any PyArrow version since PyArrow 14.
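
As a quick illustration, a round trip between PyArrow and cuDF now behaves the same regardless of which supported PyArrow version (14 or newer) is installed. This sketch assumes cuDF's long-standing from_arrow/to_arrow interchange:

import pyarrow as pa
import cudf

tbl = pa.table({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# PyArrow -> cuDF -> PyArrow; the data crosses through the Arrow C Data Interface,
# so cuDF no longer needs to be pinned to a single PyArrow release.
gdf = cudf.DataFrame.from_arrow(tbl)
roundtrip = gdf.to_arrow()
print(roundtrip.column_names, roundtrip.num_rows)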

Guidelines for incorporating GPUs into GitHub-based CI systems

We’ve heard from the community that it can be challenging to figure out a simple and effective way to incorporate GPUs into GitHub-based CI systems. New guidelines for doing this effectively were added to the RAPIDS Deployment documentation, based on the scikit-learn team’s experience.

GitHub Actions now has support for hosted GPU runners. This means that any project on GitHub can leverage NVIDIA GPUs in their CI workloads for testing. This makes it much easier for projects to integrate with RAPIDS libraries and test that changes are compatible without needing GPU hardware locally.

GPU-hosted runners are not included in the GitHub Actions free tier. Runners with GPUs typically cost a few cents per minute, and projects can add a monthly spending cap to help keep costs under control.

To set up a GPU runner, navigate to the GitHub Actions section of your organization’s settings and add a new runner. Then select the NVIDIA Partner Image and give your runner a GPU by changing the Size to a GPU-powered VM.

Figure 4. Setup interface for creating a new GitHub Actions GPU runner, using the NVIDIA GPU-Optimized Image for AI and HPC and a GPU-powered size (1 x NVIDIA T4, 4 cores, 16 GB VRAM, 176 GB SSD).

Then you can configure your workflows to use your new runners with the runs-on option.

name: GitHub Actions GPU Demo
run-name: ${{ github.actor }} is testing out GPU GitHub Actions
on: [push]
jobs:
  gpu-workflow:
    runs-on: linux-nvidia-gpu
    steps:
      - name: Check GPU is available
        run: nvidia-smi

For more detailed information on setting up GPU-powered GitHub Actions workflows, check out the RAPIDS Deployment documentation, which also includes best practices on when to run your GPU CI to get the best bang for your buck.

The scikit-learn project recently set up GPU runners on GitHub Actions, using labels to manually trigger a GPU workflow on select PRs. Check out their blog post to learn about their experience.

RAPIDS platform updates

In 24.10, RAPIDS packages picked up some important updates that allow them to be used alongside newer versions of other scientific computing software. The packages now support Python 3.10-3.12 and both NumPy 1.x and 2.x. They also support fmt 11 and spdlog 1.14, the versions of those libraries used across most of conda-forge. As part of these enhancements, this release drops support for Python 3.9 and for NCCL versions older than 2.19.

Conclusion

The RAPIDS 24.10 release takes another step forward in our mission to make accelerated computing more accessible to data scientists and engineers. We can’t wait to see what people do with these new capabilities.

If you’re new to RAPIDS, check out these resources to get started.
