
Accelerating Large-Scale Data Analytics with GPU-Native Velox and NVIDIA cuDF

As workloads scale and demand for faster data processing grows, GPU-accelerated databases and query engines have been shown to deliver significant price-performance gains over CPU-based systems. The high memory bandwidth and thread count of GPUs especially benefit compute-heavy workloads like multiple joins, complex aggregations, string processing, and more. The growing availability of GPU nodes and the broad feature coverage of GPU algorithms make GPU data processing more accessible than ever before.

With these performance bottlenecks addressed, both data and business analysts can now query massive datasets to generate real-time insights and explore new analytics scenarios.

To support this increasing demand, IBM and NVIDIA are working together to bring NVIDIA cuDF to the Velox execution engine, enabling GPU-native query execution for widely used platforms like Presto and Apache Spark. This is an open source project.

How Velox and cuDF work together to translate query plans

Velox acts as an intermediate layer, translating query plans from systems like Presto and Spark into executable GPU pipelines powered by cuDF, as shown in Figure 1. For more details, see Extending Velox – GPU Acceleration with cuDF.

In this post, we’re excited to share initial performance results of Presto and Spark using the GPU backend in Velox. We dive into:

  • End-to-end Presto acceleration
  • Scaling up Presto to support multi-GPU execution
  • Demonstrating hybrid CPU-GPU execution in Apache Spark
Figure 1. A query flows from Presto or Apache Spark through the Velox engine, where it is converted into executable GPU pipelines powered by cuDF

Moving the entire Presto query plan to GPU for faster execution

The first step of query processing is to translate incoming SQL commands into query plans with tasks for each node in the cluster. On each worker node, the cuDF backend for Velox receives a plan from the Presto coordinator, rewrites the plan using GPU operators, and then executes the plan. 

Running Presto plans using Velox with cuDF required improvements to the GPU operators for TableScan, HashJoin, HashAggregation, FilterProject, and more. 

  • TableScan: The Velox TableScan operator was extended on the CPU side to be compatible with the GPU I/O, decompression, and decoding components in cuDF.
  • HashJoin: The available join types were expanded to include left, right, and inner joins, along with support for filters and null semantics.
  • HashAggregation: A streaming interface was introduced to manage partial and final aggregations.

Overall, the operator expansion in the cuDF backend for Velox enables end-to-end GPU execution in Presto, making full use of the Presto SQL parser, optimizer, and coordinator.
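
To make the mapping concrete, the sketch below shows roughly what these operators correspond to in cuDF. This is an illustration of the underlying GPU primitives using the cuDF Python API, not the C++ code in the Velox backend, and the file paths and column names are hypothetical.

```python
# Illustrative only: the cuDF backend for Velox is implemented in C++ against
# libcudf, but the same GPU primitives are exposed through the cuDF Python API.
# File paths and column names below are hypothetical.
import cudf

# TableScan: GPU-accelerated Parquet I/O, decompression, and decoding
orders = cudf.read_parquet("orders.parquet", columns=["o_orderkey", "o_custkey"])
customer = cudf.read_parquet("customer.parquet", columns=["c_custkey", "c_mktsegment"])

# FilterProject: predicate evaluation and column projection on the GPU
customer = customer[customer["c_mktsegment"] == "BUILDING"][["c_custkey"]]

# HashJoin: inner hash join on the join key
joined = orders.merge(customer, left_on="o_custkey", right_on="c_custkey", how="inner")

# HashAggregation: group-by aggregation on the GPU
result = joined.groupby("o_custkey").agg({"o_orderkey": "count"})
print(result.head())
```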

The team collected query runtime data using benchmarks in Presto tpch (derived from TPC-H) using Parquet data sources with both the Presto C++ and Presto-on-GPU worker types. Please note that Presto C++ was not able to complete Q21 with standard configuration options, so the figure highlights the total runtime for the 21 successful queries.

As shown in Figure 2, at scale factor 1,000 we observed a runtime of 1,246 seconds for Presto C++ on an AMD Ryzen Threadripper PRO 5965WX, 133.8 seconds for Presto on an NVIDIA RTX PRO 6000 Blackwell Workstation, and 99.9 seconds for Presto on an NVIDIA GH200 Grace Hopper Superchip. We also used CUDA managed memory to complete Q21 on GH200 (see the Figure 2 asterisk), yielding a 148.9-second runtime for Presto on GPU across the full 22-query set.
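
For readers curious about the two allocation modes mentioned here and in the multi-GPU results below, the following is a minimal sketch of how managed memory or the default async allocator can be selected through RMM's Python API. It is illustrative only; the Presto-on-GPU workers configure memory in their own C++ setup, which is not shown in this post.

```python
# A minimal sketch of selecting between the two allocation modes discussed
# above via RMM's Python API. Presto-on-GPU workers do this in their own C++
# configuration; this is for illustration only.
import rmm

use_managed_memory = True  # hypothetical switch

if use_managed_memory:
    # CUDA managed (unified) memory lets a query whose working set exceeds
    # device memory (such as Q21 here) spill to host memory and still complete.
    rmm.reinitialize(managed_memory=True)
else:
    # The default CUDA async allocator is typically faster, but a query fails
    # if its working set does not fit in GPU memory.
    rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())
```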

Figure 2. Runtime results for 21 of 22 queries defined in Presto tpch, executed with single-node Presto C++ on CPU and Presto on NVIDIA GPUs at scale factor 1,000

Multi-GPU Presto for faster data exchange and lower query runtime

In distributed query execution, Exchange is a critical operator that controls data movement between workers, both within a node and across nodes. GPU-accelerated Presto uses a UCX-based Exchange operator that supports running the entire execution pipeline on the GPU. UCX (Unified Communication X) is an open source communication framework designed to achieve the highest performance for HPC applications; the UCX core leverages high-bandwidth NVLink for intra-node connectivity and RoCE or InfiniBand for inter-node connectivity.

Velox supports several Exchange types for different data movement patterns: Partitioned, Merge, and Broadcast. Partitioned Exchange uses a hash function to partition the input data and then sends the partitions to other workers for further processing. Merge Exchange receives multiple input partitions from other workers and produces a single, sorted output partition. Broadcast Exchange loads data on one worker and then copies it to all other workers. Integration of the GPU Exchange into the cuDF backend for Velox is in progress, and the implementation is available in mainline Velox.
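
As a rough illustration of what a Partitioned Exchange does to the data (leaving out the UCX transport, which the Velox operator handles in C++), the sketch below hash-partitions a GPU DataFrame with the cuDF Python API; the column names and partition count are hypothetical.

```python
# Illustration of the data layout behind a Partitioned Exchange: hash-partition
# a GPU DataFrame on the exchange keys, producing one partition per downstream
# worker. The transport itself (NVLink/RoCE/InfiniBand via UCX) is handled by
# the Velox Exchange operator and is not shown here.
import cudf

num_workers = 4
df = cudf.DataFrame({
    "key": [1, 2, 3, 4, 5, 6, 7, 8],
    "value": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
})

# partition_by_hash returns a list of DataFrames, one per partition;
# rows with the same hash of "key" always land in the same partition.
partitions = df.partition_by_hash(["key"], nparts=num_workers)
for i, part in enumerate(partitions):
    print(f"partition {i}: {len(part)} rows")
```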

As shown in Figure 3, Presto achieves efficient performance on GPU with the new UCX-based Exchange, especially when high-bandwidth intra-node connectivity is provisioned between GPUs. Results were collected for Presto on GPU with both the baseline HTTP Exchange method and the UCX-based cuDF Exchange method. An eight-GPU NVIDIA DGX A100 node delivered a more than 6x speedup when using NVLink in the Exchange operator compared to the Presto baseline HTTP Exchange. With eight GPU workers, Presto can finish all 22 queries with the default async memory allocation, without using managed memory.

Figure 3. Runtime results for the 22 queries defined in the Presto tpch benchmark, executed with Presto on GPU on an NVIDIA DGX A100 (eight A100 GPUs) at scale factor 1,000

Hybrid CPU-GPU execution in Apache Spark

While the Presto integration focuses on end-to-end GPU execution, the Apache Spark integration with Apache Gluten and cuDF currently focuses on offloading specific query stages. This allows the most compute-intensive parts of a workload to be dispatched to GPUs, a strategy that makes the best use of GPU resources in hybrid clusters containing both CPU and GPU nodes.

For example, the second stage of TPC-DS Query 95 at scale factor 100 is compute intensive and can slow down CPU-only clusters. Offloading this stage to the GPU yields significant performance gains, while CPU capacity remains available on the cluster for other queries or workloads.

As shown in Figure 4, even when the first TableScan stage runs on the CPU, efficient interoperability between CPU and GPU enables a faster total runtime when the second stage is offloaded to the GPU. The CPU-only condition uses eight vCPUs, while the hybrid condition (first stage on CPU, second stage on GPU) uses eight vCPUs and one NVIDIA T4 GPU (g4dn.2xlarge).

Figure 4. Runtime results for query 95, as defined in Gluten tpcds, executed on a single node with a single GPU at scale factor 100
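
For orientation, the sketch below shows the general shape of a PySpark session configured with the Gluten plugin and its columnar shuffle manager. It is a minimal sketch, not the configuration used for Figure 4: the plugin and shuffle manager class names correspond to recent Apache Gluten releases and may differ by version, the file path is hypothetical, and the settings that control cuDF stage offload are not covered in this post.

```python
# A minimal sketch, not the configuration behind Figure 4. It shows the general
# shape of enabling the Gluten plugin (Velox backend) in PySpark. Class names
# below are assumptions that may vary by Gluten version, and the Gluten bundle
# jar must already be on the Spark classpath (for example via --jars).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gluten-velox-offload-sketch")
    # Load the Gluten plugin so supported stages are offloaded from vanilla Spark.
    .config("spark.plugins", "org.apache.gluten.GlutenPlugin")
    # Gluten's columnar shuffle manager (assumption: name matches your Gluten build).
    .config("spark.shuffle.manager", "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
    # Gluten requires off-heap memory for its native execution.
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "4g")
    .getOrCreate()
)

# Standard Spark SQL; Gluten substitutes native operators for supported stages.
spark.read.parquet("web_sales.parquet").createOrReplaceTempView("web_sales")
spark.sql("SELECT COUNT(DISTINCT ws_order_number) FROM web_sales").show()
```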

Get involved with GPU-powered, large-scale data analytics

Driving GPU acceleration in the shared Velox execution engine unlocks performance gains for a wide array of downstream systems across the data processing ecosystem. The team is working with contributors across many companies to implement reusable GPU operators in Velox, and in turn accelerate Presto, Spark (through Gluten), and other systems. This approach reduces duplication, simplifies maintenance, and drives new innovation across the open data stack.

We’re excited to share this open source work with the community and hear your feedback. We invite you to get involved and contribute to the cuDF backend in Velox.

Acknowledgments

Many developers contributed to this work. IBM contributors include Zoltán Arnold Nagy, Deepak Majeti, Daniel Bauer, Chengcheng Jin, Luis Garcés-Erice, Sean Rooney, and Ali LeClerc. NVIDIA contributors include Greg Kimball, Karthikeyan Natarajan, Devavret Makkar, Shruti Shivakumar, and Todd Mostak.
