Data Center / Cloud

Designing GPU-Accelerated Query Engines with NVIDIA GQE

Decorative image.

GPU-accelerated query engines are often constrained by memory and I/O bandwidth. NVIDIA hardware advances—including high bandwidth memory (HBM), NVIDIA NVLink-C2C, and dedicated decompression engines featured in NVIDIA GB200 NVL4—help remove these bottlenecks by increasing effective storage capacity, accelerating data movement between CPUs and GPUs, and speeding data access without consuming streaming multiprocessor (SM) resources.

In this post, we show how databases can use these technologies to accelerate GPU query execution. You’ll learn techniques for efficient CPU-GPU data movement, compression, partition pruning, and overlapping data transfer with computation.

Architecture overview of GQE

GQE (GPU Query Engine) is a reference architecture designed to execute SQL queries at high performance over large data sets on modern NVIDIA hardware. Under the hood, GQE uses NVIDIA cuDF and other NVIDIA CUDA-X libraries, including CCCL, nvCOMP, and nvSHMEM.

GQE can help influence query engines to:

  • Move execution to GPUs.
  • Move decompression to nvCOMP.
  • Make data formats GPU-friendly.
  • Close end-to-end performance gaps when running on GPUs. 

In Figure 1, we give an overview of the system design by breaking down GQE into a query,  data, and execution layer. These manage the transition from a SQL query and input data to hardware-level execution. The layers fit together as follows.

The query layer complements the execution engine with a SQL parser and a query optimizer. The query layer natively accepts Substrait plans, an open-source query plan format, for execution in GQE. Substrait makes it possible to evaluate the benefits of GPU execution by exporting query plans from an existing database product and running the plan in GQE. In Figure 2, Apache DataFusion transforms a SQL string into a Substrait plan. GQE consumes that plan as an optimized logical query plan, adds GQE-specific refinements, and transforms the query into a physical plan.

The data layer stores and organizes user data for fast access by the executor. In GQE, storage is abstracted into pluggable, specialized readers that handle different data formats and storage mediums—it currently supports GPU memory, CPU memory, and disk. In this post, we focus on the high-performance GQE in-memory table format and assume this data is stored in CPU memory. GQE transfers data chunks to the GPU on-demand to saturate the GPU with work without storing the full dataset in GPU memory. When a chunk arrives on the GPU, the data layer hands off to the execution layer.

The execution layer executes the physical query plan against the data to produce query results. GQE generates the physical plan into a task graph, which defines the execution schedule. The task graph contains relational operators built on the open-source NVIDIA cuDF library, which implements the operators in highly optimized CUDA C++ code. Because the data layer transfers in chunks, GQE can decompose operators and execute tasks on those chunks concurrently as pipelined CUDA streams.

In summary, GQE unlocks the high throughput of the hardware through a GPU-native design.

Data layout and transfer orchestration

The GQE data layer is optimized to efficiently transfer data from host memory to device memory. We minimize data transfer latency by maximizing throughput and reducing the amount of data moved. In the following, we give an overview of our in-memory data layout and the host-to-device transfer orchestration, which are instrumental to minimizing transfer latency.

GQE design goals 

As GQE builds on cuDF, the design assumes that in-GPU data is structured as cuDF-native tables. However, the host memory layout can optimize transfers for NVIDIA NVLink C2C and PCIe. cudaMemcpy is the standard transfer method. In this approach, the CPU orchestrates GPU execution and copies data in a bulk transfer. This also forms the basis for compressed transfers.

Data layout

Figure 2 shows the table data layout, which is horizontally subdivided into row groups. Each row group consists of columns and encapsulates metadata. Within a row group, GQE stores columns as non-contiguous partitions. During a transfer, the storage layer converts a set of partitions into a cuDF column. Thus, the data layer hides the implementation details of compression and partition pruning from the execution layer.

Transfer Orchestration

In Figure 3, we show how the CPU orchestrates a transfer. Following best CUDA practice, transfers use pipeline parallelism to efficiently utilize hardware components. A pipelined transfer consists of multiple stages. In compressed, partitioned data, there are four stages.

  • In Stage 0, a host thread performs scheduling. Scheduling involves computing the memory range to transfer, allocating a destination buffer, and invoking the necessary CUDA methods.
  • In Stage 1, the GPU performs the H2D transfer.
  • Stage 2 decompresses the data.
  • Stage 3, added outside the data layer, in which the CUDA kernels compute the query.

These four stages should overlap. Ideally, the query runtime equals the longest-running stage, and all remaining stages are hidden by the pipeline.

Data transfer optimizations

Fast data access plays a significant role in the performance advantage achieved by GQE. The main data access optimizations employed are compression and partition pruning. In the following, we describe how these optimizations work.

Compression

GQE receives two main benefits from compression: query dataset capacity and query acceleration. Compression enables a query engine to expand the dataset size that can be processed using a given memory allotment by reducing the overall in-memory footprint. Data transfer of compressed buffers, combined with fast decompression by the GPU, speeds up transfers even on fast interconnects like NVLink C2C. GQE compresses the datasets with GPU-optimized formats that improve compression ratios and provide superior GPU decompression speeds compared to using legacy formats.

NVIDIA nvCOMP library 

NVIDIA nvCOMP is a library for GPU-accelerated compression and decompression. It provides a range of standard and GPU-optimized compression formats. The user can pick from the supported algorithms to balance compression ratio, compression, and decompression throughput. nvCOMP can wrap CPU libraries such as lz4hc within its high-level interface, providing additional configuration options. GQE uses nvCOMP for its compression and decompression routines.

NVIDIA Blackwell Decompression Engine

NVIDIA introduced a new Decompression Engine (DE) in the NVIDIA Blackwell architecture that enables nvCOMP to quickly decompress LZ77-based formats like LZ4, Snappy, and Deflate without using SM resources. Decompression with DE, SM kernels, and CE copies can fully overlap when using multiple CUDA streams. 

DE on a single NVIDIA Blackwell B200 GPU can reach up to 400 GB/s in database applications. For example, at a 4x compression ratio, it achieves roughly 400 GB/s effective host-to-device throughput while leaving 100 GB/s C2C host-to-device bandwidth available. The remaining bandwidth can be harnessed to transfer other data, including encoded data that is decompressed on the SMs.

NVIDIA GQE’s compression approach  

Figure 4 shows the hybrid compression approach, which uses lightweight algorithms, such as Cascaded, to use specific patterns in the structured data where possible, and the DE when LZ-based algorithms are needed to achieve good compression ratios.

When considering how to compress a given column, a query engine has a few options. It could require users to specify an algorithm for each column, but this is unwieldy for very large databases. The approach we’ve taken is to attempt both LZ4 and Cascaded. LZ4 is our choice for generic data because it achieves high ratios compared to other LZ77-only compressors, and is supported by the Decompression Engine.

To determine the compression algorithm to use, we compress the data using both LZ4 and Cascaded algorithms. Cascaded can achieve extremely fast compression rates, at approximately 500 GB/s on B200. This enables us to try the extra algorithm without significant overhead in the data loading stage. 

We balance when to use Cascaded vs LZ4 using two heuristics: 

  1. Cascaded and LZ4 have different compression ratio thresholds, which establish minimums for us to use that algorithm.
  2. Cascaded must achieve a higher compression ratio than LZ4 to be chosen over LZ4. The trigger is a configurable multiple of the LZ4 compression ratio. 

We use the choice of algorithm to help balance C2C bandwidth, DE, and SM resources.

Partition pruning

Before transferring data from the CPU to the GPU, GQE employs filter pruning to skip partitions that don’t contribute to the query result. This mechanism relies on metadata summarizing the table contents and the predicates defined in the SQL query.

Metadata and storage

GQE uses zone maps to support filter pruning. When data is loaded as in-memory tables, GQE horizontally splits the table into row groups and fixed-size partitions with a default of 10M rows. For each partition, GQE computes the minimum and maximum values for every column and stores this metadata as cuDF tables in GPU memory, so pruning can run without becoming a bottleneck. Computing the zone maps adds about 1% to the initial Parquet load time and happens only once, not during query execution.

Pruning and task orchestration

Figure 5 shows the filter pruning process. During task graph construction, GQE derives a pruning expression by transforming query predicates into comparisons against the row groups’ zone maps. Partitions that can’t contribute to the query result are pruned. In this example, partition 1 is pruned because the zone map indicates that all values stored in this partition are less than 9, and therefore also less than 15, the lower bound. The remaining partitions are transferred to GPU memory and decompressed if necessary. Even when partitions are discontiguous in CPU memory after pruning, e.g., because they are contained in multiple row groups, they are transferred and assembled into a contiguous memory block wrapped as a cuDF table on the GPU.

Filter pruning in GQE is highly effective. In the TPC-H benchmark using the 1 TB scale dataset, filter pruning skips 31% of data across all 22 queries. The impact is an end-to-end speedup of 1.43×.

The evaluation of zone maps adds minimal overhead, on average, 2.2 ms for benchmark queries on 1 TB of data.

Data transfer optimizations

In GQE, we conceive a novel batched transfer optimization for partitions.

Multiple partitions are transferred to the GPU in a single batch using cudaMemcpyBatchAsync, reducing overhead for fine-grained partitions. Batching also helps avoid delays from interleaved CUDA streams. When partitions are transferred individually, transfers from other streams can delay the next kernel launch. Moving partitions in the same batch avoids this delay.

Performance highlights

To evaluate the B200 GPU features discussed above in a full Grace Blackwell system, we benchmarked GQE on TPC-H at Scale Factor 1000 (1TB) using one of the two B200 GPUs in an NVIDIA GB200 NVL4 server, where B200 GPUs are connected to the Grace CPU with NVLink-C2C.  We used DuckDB 1.4.1 on the Turin Epyc 9755 CPU as the baseline. Each query was averaged over 5 hot-cache runs, with compression and pruning enabled on both sides. We tune the GQE parameters per query, including the degree of parallelism and physical operator planning. 

The TPC-H dataset is optimized for partition pruning and compression by clustering the lineitem table on l_shipdate and the orders table on o_orderdate, and partitioning both tables by month. Internally, each partition is sorted on l_orderkey and o_orderkey, respectively.

In Figure 6, we show the runtime of the 22 queries. GQE outperforms DuckDB on 20 of 22 queries, with the largest gains on Q11, Q14, and Q15, where partition pruning and compression sharply cut data movement across NVLink C2C. GQE showcases that even bandwidth-heavy queries like Q1 and Q6 execute quickly on the GPU with these optimizations. In sum, GQE runs all queries in 9.0 s, compared to 74.0 s and 70.6 s for DuckDB in single and dual-socket configurations, respectively.

We present the speedups in Figure 7. GQE delivers up to 25.5x over DuckDB’s best CPU socket configuration, outperforming it on 20 of 22 queries and reaching 3x or higher on 17. Aggregated across all queries, GQE on the GB200 achieves a 7.5x speedup on total execution time.

The test results in this blog post are derived from TPC-H decision support benchmark and aren’t comparable to published TPC-H results, as the test results in this blog do not comply with the TPC-H specification.

Apply GQE best practices to data platforms

Database engines can translate NVIDIA Grace Blackwell hardware features into measurable query performance gains with targeted optimizations. In GQE, partition pruning and hybrid compression minimize transfer volume while NVLink-C2C and DE hardware increase transfer throughput. These optimizations reduce transfer time and compose into sophisticated query execution using NVIDIA cuDF, NVIDIA nvCOMP, and other CUDA-X libraries.

On TPC-H SF1000, GQE achieved a 7.5x speedup on total execution time over a state-of-the-art CPU database, showing how data layout, compression strategy, and execution can be designed together for modern database engines.

Leverage the GQE open-source reference architecture and design, and performance optimizations, and explore how GQE can accelerate your data platforms. 

Acknowledgements

The authors would like to thank Tanmay Gujar for his technical contributions to GQE and his review of this post. We also extend our thanks to all GQE contributors—Hao Gao, Yadu Kiran, James Xia, Eyal Soha, Lingyan Yin, Daniel Juenger, Siyuan Lin, Bret Alfieri, Nico Iskos, Zhengru Wang, Rui Bao, Dhruv Sundararaman, Jiachun Li, and Kate Cheng—for their technical contributions. Finally, we’d like to thank Nikolay Sakharnykh and Nuttiiya Seekhao for their review.

Discuss (0)

Tags