Gradient-boosted decision trees (GBDTs) power everything from real-time fraud filters to petabyte-scale demand forecasts. The XGBoost open source library has long been the tool of choice thanks to state-of-the-art accuracy, SHAP-ready explainability, and the flexibility to run on laptops, multi-GPU nodes, or Spark clusters. XGBoost version 3.0 was developed with scalability as its north star. A single NVIDIA GH200 Grace Hopper Superchip can now process datasets from gigabyte scale all the way to 1 terabyte (TB) scale.
The coherent memory architecture allows the new external-memory engine to stream data over the 900 GB/s NVIDIA NVLink-C2C, so a 1 TB model can be trained in minutes—up to 8x faster than a 112-core (dual socket) CPU box. This reduces the need for complex multinode GPU clusters, and makes scalability simpler to achieve.
This post explains new features and enhancements in the milestone XGBoost 3.0 release, including a deep dive into external memory and how it leverages the Grace Hopper Superchip to reach 1 TB scale.
Use case for financial systems
XGBoost powers critical financial systems such as fraud detection, credit risk prediction, and algorithmic trading. RBC, one of the world’s largest banks based on market capitalization, runs a lead scoring system that demands speed, accuracy, and explainability. To modernize their ML pipeline and handle constant model tuning on hundreds of thousands of records, RBC selected XGBoost.
“We’re confident that XGBoost, powered by NVIDIA GPUs, will make our predictive lead scoring model possible for the data volumes we’re projecting,” said Christopher Ortiz, Director, Gen AI Planning and Valuation, RBC. “We’ve seen up to a 16x end-to-end speedup by leveraging GPUs, and for our pipeline testing, we’ve seen a remarkable 94% reduction in TCO for model training. This represents a transformative leap in efficiency and cost-effectiveness. We can optimize features faster to deliver better results as our data scales.”
XGBoost on a TB-sized dataset
To make GPU acceleration possible with XGBoost, the GPU histogram method was introduced. This method delivered significant speedups compared to XGBoost on CPU. Two constraints, however, remained:
- Even with data compression, GPU memory remains the main limit on the size of datasets that can be GPU accelerated.
- Before version 3.0, the experimental external memory support streamed batches that were still concatenated in GPU memory.
Two mechanisms can be used to work around these limitations:
- Quantile DMatrix pre-bins every feature into a fixed number of quantile buckets. This can compress the dataset, slashing memory use without affecting model accuracy significantly.
- Sharding of computations with distributed frameworks like Dask and Spark.
New External-Memory Quantile DMatrix
XGBoost 3.0 introduces a third mechanism, the External-Memory Quantile DMatrix, which enables scaling up to TB-scale datasets on a single GH200 Grace Hopper Superchip. This avoids the complexity of setting up a distributed framework and leverages the ultrafast C2C bandwidth of Grace Hopper superchips.
The External-Memory Quantile DMatrix is built on top of the existing data iterators that define methods for reading the dataset files. It handles all the management of the dataset memory and is passed as a parameter to your XGBoost booster object (Figure 1). You can think of it as the familiar QuantileDMatrix: it pre-bins every feature using the same quantile logic, so you can keep all your existing hyperparameters and get the same accuracy. Meanwhile, the data itself sits in host RAM and streams to the GPU at every iteration. For more details, see the XGBoost documentation.

How the NVIDIA Grace Hopper Superchip makes streaming practical
This setup is ideal for NVIDIA Grace-based superchips. A GH200 superchip packages a 72-core Grace CPU and a Hopper GPU, linked by NVLink C2C. This means there is 900 GB/s bidirectional bandwidth, which is around 7x the bandwidth of x16 PCIe Gen 5 with far lower latency.
Handling a 1 TB training job typically requires either a CPU box with roughly 2 TB of DRAM or a small GPU cluster with 8 to 16 NVIDIA H100 GPUs. The GPU cluster can be faster, but it adds the complexity of managing distributed frameworks. With XGBoost 3.0 external-memory streaming, a single GH200 superchip (80 GB HBM3 plus 480 GB LPDDR5X fed by 900 GB/s NVLink-C2C) now tackles the same dataset on its own, replacing both the RAM-monster server and the multi-GPU pod.
Benchmarking Grace Hopper on a 1 TB dataset
GPUs excel on dense (or nearly dense) tables because XGBoost compresses them sharply, cutting bus traffic and allowing histograms to sit in fast shared memory. This means that for the GPU, shape hardly matters: wide-short and narrow-tall tables finish in roughly the same time.
ExtMemQuantileDMatrix, however, is sensitive to shape. When training XGBoost on a feature matrix (x) and labels (y), only the feature matrix is paged along the number of rows, not the labels or other data. This means that, for slim datasets, the size of the labels (the number of rows) is the limiting factor for what dataset can fit in a single GH200 superchip.
Figure 2 demonstrates this, where the total data size remains constant but the number of rows and columns in the dataset vary.

Best practices for external memory
To leverage external memory with GH200 superchip systems, get started with the tips below. For more details, see the XGBoost documentation.
Set grow_policy='depthwise' to build trees layer by layer, which works better with external memory:
xgb_model = xgb.XGBRegressor(tree_method='hist',
                             device='cuda',
                             seed=42,
                             grow_policy='depthwise')
Always start inside a fresh RAPIDS Memory Manager (RMM) pool when using XGBoost in combination with RAPIDS:
import cupy as cp
import rmm
import xgboost as xgb
from rmm.allocators.cupy import rmm_cupy_allocator

mr = rmm.mr.ArenaMemoryResource(rmm.mr.CudaAsyncMemoryResource())
rmm.mr.set_current_device_resource(mr)
cp.cuda.set_allocator(rmm_cupy_allocator)

with xgb.config_context(use_rmm=True):
    dtrain = xgb.ExtMemQuantileDMatrix(it_train, max_bin=256)
    bst = xgb.train({"device": "cuda", "tree_method": "hist"}, dtrain)
Run on CUDA 12.8 or higher with an HMM-enabled driver (Grace Hopper).
What else is new in XGBoost 3.0?
In addition to the external-memory overhaul, XGBoost 3.0 delivers many performance upgrades and API clean-ups, including:
- Experimental support for distributed external memory. You can now train out-of-core models across a cluster, falling back to host memory when GPU RAM is tight.
- Reduced GPU memory use during DMatrix construction and faster initialization for batched inputs.
- GPU hist and approx methods see roughly 2x speedups and lower memory use on “mostly-dense” data.
- External memory now supports categorical features, every objective (including quantile regression), and all prediction types—SHAP included.
Get started with XGBoost 3.0
The GPU external-memory improvements in XGBoost 3.0 work toward making external memory the default approach when your data outgrows GPU memory. Highlights include:
- Same accuracy, smaller footprint: ExtMemQuantileDMatrix streams pre-binned, compressed pages from host RAM into the GPU just in time for each boosting round.
- Extended GPU RAM with Grace Hopper: NVLink turns ordinary host RAM into a fast extension of the GPU RAM, so one Grace Hopper Superchip can seamlessly finish a TB-sized job that previously required multiple GPUs (across one or many nodes).
- Drop-in upgrade: Already using RAPIDS Memory Manager? Swapping in the external-memory path is an easy change for GPU workflows. Existing pipelines continue to run untouched.
XGBoost 3.0 enables you to process TB-scale GBDT training on a single Grace Hopper chip with the same XGBoost calls you’ve always used.
To get started, download XGBoost 3.0 and check out the Installation Guide. For more information about external memory, see the XGBoost documentation.
Join our community Slack channel to post feedback or questions you have about GPU acceleration for XGBoost. If you’re new to accelerated data science, check out the Accelerated Data Science Learning Path for hands-on workshops.