RAPIDS Brings Zero-Code-Change Acceleration, IO Performance Gains, and Out-of-Core XGBoost

Over the past two releases, RAPIDS introduced zero-code-change acceleration for Python machine learning, huge IO performance improvements, larger-than-memory XGBoost training, improved user experiences, and even more scalable ETL.

We spotlighted some of these updates and announcements at NVIDIA GTC 2025. In this post, you can catch up on some of the highlights.

NVIDIA cuML brings zero code change acceleration to scikit-learn and more

Announced as open beta, NVIDIA cuML now brings zero-code-change acceleration to workflows using scikit-learn, UMAP, and hdbscan.

Figure 1. cuML acceleration on GPUs compared to scikit-learn, UMAP, and hdbscan on CPUs

This new UX for cuML enables data scientists to continue using familiar PyData APIs while automatically using NVIDIA GPUs for significant performance gains, with speedups ranging from 5-175x depending on the algorithm and dataset, as seen in Figure 1.

To start using this new capability, just load the IPython extension before you import your standard CPU machine learning libraries.

%load_ext cuml.accel
 
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
 
X, y = make_classification(n_samples=500000, n_features=100, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
rf.fit(X, y)
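
If you're running a plain Python script rather than a notebook, cuML also provides a programmatic entry point. Here's a minimal sketch, assuming the cuml.accel.install() helper is available in your cuML version; check the cuML documentation for the exact entry points your release supports.

import cuml.accel
cuml.accel.install()  # patch the CPU libraries before importing them

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=100_000, n_features=20, random_state=0)
labels = KMeans(n_clusters=8, random_state=0).fit_predict(X)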

To learn more about these new capabilities, visit the cuML documentation.

Major IO performance improvements in cuDF

Over the past two releases, we’ve landed significant performance improvements in the NVIDIA cuDF file readers, whether you’re working in the cloud or on-prem.

Cloud object storage

For data processing workloads in the cloud, it’s common to read files stored in remote object storage, such as Amazon S3, Google Cloud Storage, or Azure Blob Storage.

By using NVIDIA KvikIO under the hood and parallelizing reading of Parquet file footers, workloads using cuDF and Dask can now read Parquet files from Amazon S3 more than 3x faster than before.

Figure 2. Benchmark of throughput using Dask-cuDF with s3fs vs. KvikIO

This benchmark reads a Parquet dataset from S3 to a g4dn.12xlarge EC2 instance, which has a published bandwidth of up to 50 Gbps. The dataset had 360 Apache Parquet files of about 128 MB each, for a total of about 46 GB. The Dask cluster had 4 workers. These results use cuDF 25.04, which includes an optimization to read Parquet footers in parallel.

This functionality is now on by default, so you should see faster performance without changing anything.
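
The code path is unchanged. As a rough sketch (the bucket and prefix below are hypothetical, with credentials resolved the usual way through your AWS environment), reading remote Parquet looks the same as before, with KvikIO doing the work under the hood:

import cudf
import dask_cudf

# Single-GPU read of one remote Parquet file
df = cudf.read_parquet("s3://my-bucket/dataset/part-0000.parquet")

# Multi-GPU read of the whole dataset with Dask-cuDF
ddf = dask_cudf.read_parquet("s3://my-bucket/dataset/")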

You can learn more about this work in the recent High-Performance Remote IO With NVIDIA KvikIO blog.

Improved decompression capabilities

The NVIDIA Blackwell architecture includes a hardware-based decompression engine designed with data processing in mind. With support for Blackwell in the 25.02 release, cuDF can now take advantage of this capability, bringing significant performance gains to IO-heavy workloads wherever they’re running.

When running the Polars Decision Support (PDS-H) benchmark at SF100 with GPU-accelerated Polars on a system with an NVIDIA B100 GPU and NVMe storage, we saw 35% faster end-to-end runtimes using the hardware decompression engine instead of standard software-based GPU kernel decompression.

Figure 3. PDS-H benchmark results at SF100

The comparison uses cuDF 25.04, nvCOMP 4.2.0.11, and Snappy-compressed Parquet files. Note that the “Kernel decompress” configuration uses a CUDA async pool memory resource, the “Hardware decompress” configuration uses an RMM pool memory resource, and the improvement in “Hardware decompress” comes from the low latency and high throughput of the Blackwell Decompression Engine.

Usability enhancements for Polars GPU engine 

The Blackwell decompression engine can supercharge performance when using the Polars GPU engine (powered by cuDF).

Performance is mission-critical, but ergonomics and developer experience are part of what people love about Polars.

We’ve worked with the Polars community to make accelerated Polars more user-friendly, with two highly requested features now available beginning with RAPIDS 25.04 and Polars 1.25.

Global configuration

Users can now select a default engine using the set_engine_affinity interface in Polars configuration. This means that, rather than selecting GPU execution in the collect call of every Polars query you run, you can configure it once globally at the top of your workflow.

import polars as pl
pl.Config.set_engine_affinity(engine="gpu")

df = pl.LazyFrame({"a": [1.242, 1.535]})
q = df.select(pl.col("a").round(1))
result = q.collect()
print(result)

shape: (2, 1)
┌─────┐
│ a   │
│ --- │
│ f64 │
╞═════╡
│ 1.2 │
│ 1.5 │
└─────┘

If the GPU engine doesn’t support a particular query, execution transparently falls back to the default Polars CPU engine.
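
If you'd rather surface unsupported operations than fall back silently, you can still pass a configured engine to an individual collect call. A minimal sketch using the Polars GPUEngine object:

import polars as pl

# raise_on_fail makes unsupported queries raise instead of falling back to CPU
gpu_engine = pl.GPUEngine(device=0, raise_on_fail=True)

q = pl.LazyFrame({"a": [1, 2, 3]}).select(pl.col("a") * 2)
result = q.collect(engine=gpu_engine)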

See global engine configuration for a complete list of options supported by Polars.

GPU-aware profiling

The Polars profiler is a great way to understand the performance of a Polars query. Now, you can use the profiler regardless of whether you’re running on CPUs or GPUs.

The .profile() method on LazyFrame now supports an engine parameter, enabling profiling on the GPU.

To use the profiler with GPU execution, just tell the profiler to use the GPU engine, as seen in this query:

from datetime import date

# Ship-date cutoff for the filter below (PDS-H query 1 uses 1998-09-02)
var1 = date(1998, 9, 2)

q = (
    pl.scan_parquet("lineitem.parquet")
    .filter(pl.col("l_shipdate") <= var1)
    .group_by("l_returnflag", "l_linestatus")
    .agg(
        pl.sum("l_quantity").alias("sum_qty"),
        pl.sum("l_extendedprice").alias("sum_base_price"),
        (pl.col("l_extendedprice") * (1.0 - pl.col("l_discount")))
        .sum()
        .alias("sum_disc_price"),
        (
            pl.col("l_extendedprice")
            * (1.0 - pl.col("l_discount"))
            * (1.0 + pl.col("l_tax"))
        )
        .sum()
        .alias("sum_charge"),
        pl.mean("l_quantity").alias("avg_qty"),
        pl.mean("l_extendedprice").alias("avg_price"),
        pl.mean("l_discount").alias("avg_disc"),
        pl.len().alias("count_order"),
    )
    .sort("l_returnflag", "l_linestatus")
)

df, profile = q.profile(engine="gpu")

Figure 4. Visual profiling timeline generated from the built-in Polars profiler running with the GPU engine

To learn more, visit the Polars LazyFrame Profile documentation.

Out-of-core XGBoost for the largest datasets

In partnership with the DMLC community, we released XGBoost 3.0 in March. This release is a major milestone, with a redesigned external memory interface that makes it possible to efficiently train models on datasets too large to fit in memory.

We’ve optimized this functionality for coherent memory systems, like NVIDIA GH200 Grace Hopper and NVIDIA GB200 Grace Blackwell. As a result, a single Grace Hopper system can comfortably train models on datasets over 1 TB when using XGBoost with the RAPIDS Memory Manager (RMM).

You can get started with larger-than-memory training by using the new ExtMemQuantileDMatrix interface and a data iterator.

import cupy as cp
import rmm
import xgboost
from rmm.allocators.cupy import rmm_cupy_allocator

mr = rmm.mr.PoolMemoryResource(rmm.mr.CudaAsyncMemoryResource())
rmm.mr.set_current_device_resource(mr)
cp.cuda.set_allocator(rmm_cupy_allocator)

with xgboost.config_context(use_rmm=True):
    # Construct the training and validation data iterators (it_train, it_valid)
    # for ExtMemQuantileDMatrix, and define n_bins, device (e.g., "cuda"),
    # and n_rounds for your workload
    # ...
    # Build the ExtMemQuantileDMatrix and start training
    Xy_train = xgboost.ExtMemQuantileDMatrix(it_train, max_bin=n_bins)
    # Use the training DMatrix as a reference
    Xy_valid = xgboost.ExtMemQuantileDMatrix(it_valid, max_bin=n_bins, ref=Xy_train)
    booster = xgboost.train(
        {
            "tree_method": "hist",
            "max_depth": 6,
            "max_bin": n_bins,
            "device": device,
        },
        Xy_train,
        num_boost_round=n_rounds,
        evals=[(Xy_train, "Train"), (Xy_valid, "Valid")]
    )
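
To fill in the elided iterator step above, here's a minimal sketch of a custom data iterator. The BatchIterator name and the in-memory batch list are illustrative; a real workload would stream each batch from disk (for example, one Parquet file per call):

import cupy as cp
import xgboost

class BatchIterator(xgboost.DataIter):
    def __init__(self, batches):
        self._batches = batches  # list of (X, y) CuPy array pairs
        self._i = 0
        super().__init__(cache_prefix="cache")  # on-disk cache for external memory

    def next(self, input_data):
        if self._i == len(self._batches):
            return False  # no more batches
        X, y = self._batches[self._i]
        input_data(data=X, label=y)  # hand the current batch to XGBoost
        self._i += 1
        return True

    def reset(self):
        self._i = 0  # rewind for the next training iteration

# Synthetic batches, for illustration only
batches = [(cp.random.rand(100_000, 50), cp.random.rand(100_000)) for _ in range(4)]
it_train = BatchIterator(batches)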

The external memory interface also supports multi-GPU and distributed training, for scaling to the absolute largest datasets.

Redesigned Forest Inference Library

With the release of cuML 25.04, the redesigned, higher-performance version of the Forest Inference Library (FIL), previously available as experimental, is now stable and ready for production.

For years, organizations have relied on FIL to get maximum inference performance for tree models like XGBoost, LightGBM, and Random Forest in production.

This fully redesigned FIL delivers large performance gains, with a median speedup of 40% over the original FIL based on tests across a broad range of model parameters. 

Figure 5. Speedups of FIL in cuML 25.04 against previous versions

Speedups vary with model characteristics, such as tree depth, the number of trees, and the batch size being inferred, but the new implementation provides significant speedups in most cases.

Based on feedback, we’ve also introduced three new features to improve the experience of deploying tree models (see the usage sketch after this list):

  • An optimize method that automatically finds the best configuration for your model; after you call it, FIL transparently uses the optimal parameters.
  • The ability to analyze each tree’s contribution to the final prediction (.predict_per_tree) and to access the leaf node each sample reaches in every tree (.apply).
  • A new CPU execution mode with seamless on-ramp to GPUs.
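
Here's a rough sketch of how these pieces fit together. The exact load arguments can vary by cuML version, and the model path and batch size are hypothetical:

import numpy as np
from cuml.fil import ForestInference

# Load a trained XGBoost model saved to disk (argument names may differ by version)
fil_model = ForestInference.load("xgboost_model.ubj", is_classifier=True)
fil_model.optimize(batch_size=10_000)  # auto-tune FIL's configuration for this batch size

X = np.random.rand(10_000, 20).astype("float32")  # synthetic batch for illustration
preds = fil_model.predict(X)              # standard predictions
per_tree = fil_model.predict_per_tree(X)  # per-tree contributions to the prediction
leaves = fil_model.apply(X)               # leaf node reached in each tree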

Platform updates: Blackwell support and Conda improvements 

In the past few months, we have made two significant improvements to where and how you can run accelerated data science workloads.

Beginning with the 25.02 release, all RAPIDS projects now support NVIDIA Blackwell-architecture GPUs, including hardware-based functionality like the decompression engine.

Also, RAPIDS libraries can now be installed with Conda using the “strict” channel priority, with CUDA 12 on x86 and ARM SBSA-based systems, for releases as far back as 24.06. This long-standing community request can speed up creating environments and installing packages. Previously, RAPIDS libraries required setting the channel priority to “flexible”.

This change has some implications for RAPIDS releases that are two or more years old. Visit the RAPIDS documentation to learn more.

Google Colab AI assistants

Google Colab is one of the most popular managed notebook platforms for data science, with simple interfaces to GPU-enabled runtimes across free and paid tiers.

Now, cuML and GPU-accelerated Polars are built into Google Colab, expanding Colab’s batteries-included accelerated data science stack beyond cuDF (for pandas) and NetworkX. With these additions, you can simply load the cuDF-, cuML-, and cuGraph-powered extensions for these libraries at the top of your notebook and tap into accelerated data science with zero code changes required.
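
For example, the pandas and scikit-learn paths can be turned on at the top of a Colab notebook with the zero-code-change extensions (NetworkX acceleration via cuGraph is configured separately):

%load_ext cudf.pandas
%load_ext cuml.accel

import pandas as pd                 # DataFrame operations now dispatch to cuDF on the GPU
from sklearn.cluster import KMeans  # scikit-learn estimators now dispatch to cuML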

If you prefer to use an AI copilot, the Colab Gemini assistant is now “RAPIDS-aware.” You can use Gemini to generate GPU-accelerated pandas code powered by cuDF’s zero-code-change UX.

Figure 6. Gemini now recommends cudf.pandas code for accelerated data processing

Just tell the Gemini assistant you’d like to use a GPU.

Conclusion

The 25.02 and 25.04 RAPIDS releases introduce major enhancements, with zero-code-change acceleration for Python ML, major IO performance gains, and expanded XGBoost training support, laying the groundwork for even more powerful data science workflows ahead.

We welcome your feedback on GitHub. Join the 3,500+ members of the RAPIDS Slack community to talk GPU-accelerated data processing. If you’re new to RAPIDS, check out these resources to get started.

At NVIDIA GTC 2025, accelerated data science was everywhere. In case you missed it, you can still explore the list of data science sessions and workshops and watch sessions on demand.
