How to Work with Data Exceeding VRAM in the Polars GPU Engine

In high-stakes fields such as quant finance, algorithmic trading, and fraud detection, data practitioners frequently need to process hundreds of gigabytes (GB) of data to make quick, informed decisions. Polars, one of the fastest-growing data processing libraries, meets this need with a GPU engine powered by NVIDIA cuDF that accelerates compute-bound queries that are common in these fields. 

However, a common challenge when working with GPUs is that VRAM (the dedicated memory of the GPU) is typically smaller than system RAM. This can cause problems when using the GPU engine to handle very large datasets.  

This post explores two options within the Polars GPU engine to overcome this constraint. Using these strategies, you can process data that is larger than available VRAM while still benefiting from GPU acceleration:

  1. Unified Virtual Memory (UVM): A technique that allows the GPU to spill over to system RAM.
  2. Multi-GPU streaming execution: An experimental feature that distributes workloads across multiple GPUs, ideal for workloads in the hundreds of GBs to a few TBs.

Option 1: UVM for single-GPU flexibility

When your dataset size begins to exceed the VRAM of your GPU, you can leverage NVIDIA UVM technology.

UVM creates a unified memory space between the system RAM (host memory) and the GPU VRAM (device memory). This allows the Polars GPU engine to spill data over to the system RAM when VRAM is full, preventing out-of-memory errors and enabling you to work with larger-than-VRAM datasets. When the GPU needs to access data that is currently in system RAM, that data is automatically brought into VRAM for processing.

This approach is ideal when you’re working on a single GPU and need the flexibility to handle datasets that are moderately larger than your available VRAM. While it provides a seamless experience with virtually no code changes, migrating data between system RAM and VRAM can introduce performance overhead. However, with smart configuration using the RAPIDS Memory Manager (RMM)—a library that provides fine-grained control over how GPU memory is allocated—this performance cost can be significantly reduced.
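As a rough sketch, UVM can be enabled by passing an RMM managed-memory resource to the GPU engine. This assumes a CUDA-capable machine with the `rmm` and `cudf-polars` packages installed; the Parquet file path and column names are hypothetical.

```python
import polars as pl
import rmm

# Build a managed-memory (UVM) resource wrapped in a pool allocator.
# Managed memory lets allocations spill over to system RAM when VRAM
# fills up; the pool amortizes allocation overhead, reducing the
# cost of migration between host and device.
mr = rmm.mr.PoolMemoryResource(rmm.mr.ManagedMemoryResource())

# Hand the memory resource to the Polars GPU engine.
engine = pl.GPUEngine(memory_resource=mr)

# Run a query on a larger-than-VRAM dataset (file path is hypothetical).
result = (
    pl.scan_parquet("transactions.parquet")
    .group_by("account_id")
    .agg(pl.col("amount").sum())
    .collect(engine=engine)
)
```

The rest of the query code is unchanged; only the engine passed to `collect` differs from a CPU run.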

For a deep dive into how UVM works, its performance, and how to fine-tune its configuration for your specific needs, check out Introducing UVM for Larger than VRAM Data on the Polars GPU Engine.

Option 2: Multi-GPU streaming execution for TB-scale performance 

For users tackling datasets that stretch beyond a few hundred GBs into terabytes (TB), the Polars GPU engine now offers an experimental multi-GPU streaming execution configuration. Unlike the standard in-memory execution where data is processed in a single partition, the streaming executor introduces data partitioning and parallel processing capabilities designed to distribute workloads across multiple GPUs.

At its core, this streaming executor works by taking the optimized internal representation (IR) graph produced by Polars and rewriting it for batched execution. The resulting graph is then partitioned based on the size of data and number of available workers. The streaming executor uses a task-based execution model where each partition is processed independently, enabling tasks to be performed in parallel. 

Figure 1. Comparison of the previous in-memory executor and the new streaming executor in the Polars GPU engine. The streaming executor adds two steps: rewriting the IR for partitioning and generating a task graph that distributes partitions across all available GPUs.

The streaming executor supports both single-GPU execution through the Dask synchronous scheduler and multi-GPU execution through the Dask distributed scheduler. A number of parameters are available for controlling join strategies and partition sizes.
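A minimal multi-GPU sketch is shown below, assuming the experimental API as of this writing (the `executor` and `executor_options` names may change between releases) and an environment with `cudf-polars`, `dask-cuda`, and `dask.distributed` installed. The dataset path is hypothetical.

```python
import polars as pl
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Start a Dask cluster with one worker per visible GPU.
client = Client(LocalCUDACluster())

# Experimental streaming executor; for a single GPU, swap the
# scheduler to "synchronous" and skip the cluster setup above.
engine = pl.GPUEngine(
    executor="streaming",
    executor_options={"scheduler": "distributed"},
)

# The query's IR is rewritten for batched execution, partitioned,
# and the resulting task graph runs across all workers in parallel.
result = (
    pl.scan_parquet("tb_scale/*.parquet")  # hypothetical dataset path
    .group_by("key")
    .agg(pl.col("value").mean())
    .collect(engine=engine)
)
```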

In testing, the team has seen strong performance on the PDS-H benchmark at 3 TB scale, processing all 22 queries in seconds. Check out the example notebook and try multi-GPU streaming on your datasets.

For a deep dive on how streaming execution in the Polars GPU engine works under the hood, see the NVIDIA GTC Paris session, Scaling DataFrames with Polars.

Choosing the right approach

Both UVM and multi-GPU streaming execution offer powerful ways to handle datasets larger than your GPU VRAM in the Polars GPU engine. The best choice depends on your specific needs.

  • UVM is best for datasets moderately larger than VRAM. By default, the Polars GPU engine is configured to utilize UVM for the best mix of performance and scalability at most data sizes. 
  • The multi-GPU streaming execution experimental feature is best for very large datasets (hundreds of GB to TB) where you can leverage multiple GPUs for distributed processing.

To learn more about these configurations, check out the Polars User Guide.
