NVIDIA CUDA-X Powers the New Sirius GPU Engine for DuckDB, Setting ClickBench Records

NVIDIA is partnering with the University of Wisconsin-Madison to bring GPU-accelerated analytics to DuckDB through the open-source Sirius engine.

DuckDB has seen rapid adoption among organizations such as DeepSeek, Microsoft, and Databricks thanks to its simplicity, speed, and versatility. Because analytics workloads are highly amenable to massive parallelism, GPUs have emerged as the natural next step, offering higher performance and throughput and better total cost of ownership (TCO) than CPU-based databases. This growing demand for GPU acceleration, however, has been held back by the challenge of building a database system from the ground up.

Sirius, jointly developed by NVIDIA and the University of Wisconsin-Madison, solves this problem: it is a composable, GPU-native execution backend for DuckDB that reuses DuckDB’s advanced subsystems while accelerating query execution on GPUs through NVIDIA CUDA-X libraries.

This blog post outlines the Sirius architecture and demonstrates how it achieved record-breaking performance on ClickBench, a widely used analytics benchmark.

Sirius: A GPU-native SQL engine

Diagram of the Sirius GPU-native SQL engine architecture, showing multiple query engines feeding a shared Substrait query plan executed on NVIDIA GPU libraries, with connections to local and cloud storage.
Figure 1. Sirius architecture

Sirius is a GPU-native SQL engine that provides drop-in acceleration for DuckDB—and, in the future, other data systems. 
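
To show how drop-in this is in practice, here is a minimal sketch of querying DuckDB from Python with a GPU extension loaded. The extension name and distribution channel are assumptions for illustration; see the Sirius repository for the actual setup steps.

```python
import duckdb

# Hypothetical sketch: how the extension is named and installed is an
# assumption here; consult the Sirius repository for real instructions.
con = duckdb.connect()
con.execute("LOAD 'sirius'")  # register the GPU backend as a DuckDB extension

# The user-facing interface stays plain DuckDB SQL.
con.execute("CREATE TABLE hits AS SELECT * FROM 'hits.parquet'")
print(con.execute("SELECT COUNT(*) FROM hits").fetchone())
```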

The team recently published an article detailing the Sirius architecture and demonstrating state-of-the-art performance on TPC-H at scale factor 100 (SF100).

Implemented as a DuckDB extension, Sirius requires no modifications to DuckDB’s codebase and only minimal changes to the user-facing interface. At the execution boundary, Sirius consumes query plans in the universal Substrait format, ensuring compatibility with other data systems (a sketch of this boundary follows the list below). To minimize engineering effort and maximize reliability, Sirius is built on well-established NVIDIA libraries: 

  • NVIDIA cuDF: High-performance, columnar-oriented relational operators (e.g., joins, aggregations, projections) natively designed for GPUs.
  • NVIDIA RAPIDS Memory Manager (RMM): An efficient GPU memory allocator, reducing fragmentation and allocation overheads.
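
As a rough illustration of the Substrait boundary, the sketch below serializes an optimized DuckDB plan to Substrait bytes, the format Sirius consumes. It assumes DuckDB’s substrait extension is available; in recent DuckDB releases it is distributed through the community extension repository.

```python
import duckdb

con = duckdb.connect()
# Assumption: recent DuckDB versions host the substrait extension in the
# community repository; older versions may install it without `repository`.
con.install_extension("substrait", repository="community")
con.load_extension("substrait")

con.execute("CREATE TABLE t AS SELECT range AS i FROM range(10)")
# Serialize an optimized query plan to Substrait, ready for a GPU backend.
plan_blob = con.get_substrait("SELECT COUNT(*) FROM t").fetchone()[0]
print(f"{len(plan_blob)} bytes of Substrait plan")
```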

Sirius builds its GPU-native execution engine and buffer management on top of cuDF and RMM, while reusing DuckDB’s advanced subsystems, including its query parser, optimizer, and, where appropriate, scan operators. This combination of mature ecosystems gave Sirius a head start, enabling it to break the ClickBench record with minimal engineering effort.
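
For a feel of what these building blocks look like, the sketch below uses the Python bindings of RMM and cuDF with made-up data; Sirius itself targets the C++ libcudf and RMM APIs.

```python
import rmm
import cudf

# Pre-allocate a 1 GiB memory pool, the role RMM plays inside Sirius:
# subsequent GPU allocations come from the pool instead of cudaMalloc.
rmm.reinitialize(pool_allocator=True, initial_pool_size=2**30)

lineitem = cudf.DataFrame({"orderkey": [1, 1, 2], "price": [10.0, 5.0, 8.0]})
orders = cudf.DataFrame({"orderkey": [1, 2], "status": ["O", "F"]})

# Relational operators execute as GPU kernels: a hash join, then a group-by.
joined = lineitem.merge(orders, on="orderkey")
print(joined.groupby("status").agg({"price": "sum"}))
```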

Diagram of a Sirius query where DuckDB scans a table, converts data to Apache Arrow, and NVIDIA cuDF executes aggregates and projections on the GPU.
Figure 2. Sirius query on CPU and GPUs

As illustrated in Figure 2, the process begins when Sirius receives an already-optimized query plan in DuckDB’s internal format, preserving DuckDB’s robust logical and physical optimizations. For table scans, Sirius invokes DuckDB’s scan functionality, which provides features such as min-max filtering, zone skipping, and on-the-fly decompression to load the relevant data efficiently into host memory.

Next, the result of the table scan is transformed from DuckDB’s native format into a Sirius data format (closely aligned with Apache Arrow), which is then transferred to GPU memory. In benchmarks like ClickBench, Sirius can cache frequently accessed tables on the GPU, accelerating repeated query execution.

The Sirius format can be mapped directly to a cudf::table for zero-copy interoperability, enabling all remaining SQL operators (aggregates, projections, and joins) to execute at GPU speed through cuDF primitives. Once computation completes, results are transferred back to the CPU, converted to DuckDB’s expected output format, and returned to the user—offering both raw speed and a seamless, familiar analytics experience. 
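
The round trip in Figure 2 can be approximated in Python as follows. This is a conceptual sketch rather than Sirius code, and the host-to-GPU step here is an explicit copy; the zero-copy part is the in-GPU mapping between the Arrow-aligned layout and a cudf table.

```python
import pyarrow as pa
import cudf

# Host-side scan output in an Arrow-aligned layout (illustrative data).
scan_result = pa.table({"user_id": [1, 2, 2, 3], "clicks": [5, 3, 7, 1]})

gpu_df = cudf.DataFrame.from_arrow(scan_result)          # move to GPU memory
agg = gpu_df.groupby("user_id").agg({"clicks": "sum"})   # cuDF GPU aggregation
print(agg.reset_index().to_arrow())                      # back to host as Arrow
```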

Hitting #1 on ClickBench

Sirius running on an NVIDIA GH200 Grace Hopper Superchip instance from Lambda Labs ($1.5/hour) was evaluated against the top five systems on ClickBench, which ran on CPU-only instances: AWS c6a.metal ($7.3/hour), AWS c8g.metal-48xl ($7.6/hour), and AWS c7a.metal-48xl ($9.8/hour). Hot-run execution time and relative runtime are reported following the ClickBench methodology, where lower values indicate better performance and 1.0 represents the best possible score. Figure 3 shows the geometric mean of the relative runtime across all benchmark queries. Sirius achieved the lowest relative runtime while running on hardware that costs roughly 4.9x less per hour than the cheapest competing instance; combining runtime and instance price, this translates to at least 7.2x higher cost-efficiency under this setup. Note that these results reflect the leaderboard at the time of evaluation and may change in the future.

Bar chart of ClickBench overall performance and cost, showing Sirius (lambda-GH200) as the fastest and lowest-cost system compared with Umbra, DuckDB, and Salesforce Hyper.
Figure 3. ClickBench cost and relative runtime

Figure 4 shows the hot-run query performance in Sirius and the top two systems in ClickBench: Umbra and DuckDB. Sirius achieved the lowest relative runtime in most queries, driven by efficient GPU computation through cuDF. For instance, in q4, q5, and q18, Sirius shows substantial performance gains on commonly used operators such as filtering, projection, and aggregation. 

A few queries, however, reveal opportunities for further improvement: q23 is bottlenecked by the “contains” operation on string columns, q24 and q26 by top-N operators, and q27 by aggregation over very large inputs. Future versions of Sirius will continue to improve these operators.

Grouped bar chart of ClickBench relative runtimes per query, comparing Umbra, DuckDB, and Sirius, with Sirius generally showing the lowest runtime across most queries.
Figure 4. Relative runtime of individual ClickBench queries

Figure 5 takes a closer look at one of the most complex ClickBench queries, the regular expression query (q28). Implemented naively, regular expression matching on GPUs can produce massive kernels with high register pressure and complex control flow, leading to severe performance degradation.

To address this, Sirius leverages cuDF’s JIT-compiled string transformation framework for user-defined functions. Figure 5 compares the performance of the JIT approach to cuDF’s precompiled API (cudf::strings::replace_with_backrefs), showing a 13x speedup. 

The JIT-transformed kernel achieves 85% warp occupancy, compared to only 32% for the precompiled version, demonstrating better GPU utilization. By decomposing the regular expression into standard string operations such as character comparisons and substring operations, the cuDF JIT framework can fuse these operations into a single kernel, improving data locality and reducing register pressure.
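
To make the contrast concrete, the sketch below shows both paths through cuDF’s Python bindings on made-up URLs. The pattern is illustrative, not the actual Q28 expression, and Sirius itself uses the C++ JIT transform framework, which fuses the decomposed steps into a single kernel.

```python
import cudf

urls = cudf.Series([
    "https://example.com/index.html",
    "http://news.example.org/story/42",
])

# Precompiled path: one monolithic regex kernel with backreferences,
# analogous to cudf::strings::replace_with_backrefs.
hosts_regex = urls.str.replace_with_backrefs(r"https?://([^/]+)/.*", r"\1")

# Decomposed path: the same extraction expressed as plain string operations,
# the kind of steps the JIT framework fuses into one lightweight kernel.
hosts_plain = urls.str.split("://").list.get(1).str.split("/").list.get(0)

print(hosts_regex.to_pandas().tolist())  # ['example.com', 'news.example.org']
print(hosts_plain.to_pandas().tolist())  # same result
```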

Horizontal bar chart of ClickBench Q28 execution time showing Sirius with JIT-compiled transform running much faster than precompiled Sirius, DuckDB, and Umbra.
Figure 5. Performance comparison of Sirius on Q28 using JIT-compiled transform vs. precompiled regular expression

What’s next for Sirius

Looking ahead, NVIDIA and the University of Wisconsin-Madison are collaborating on foundational, shareable building blocks for GPU data processing, guided by the modular, interoperable, composable, extensible (MICE) principles described in the Composable Codex. Our priority areas are:

  • Advanced GPU memory management: Developing robust strategies to manage GPU memory efficiently, including seamless spilling of data beyond physical GPU limits to maintain performance and scale.
  • GPU file readers and intelligent I/O prefetching: Plugging in GPU-native file readers with smart prefetching to accelerate data loading, minimize stalls, and reduce I/O bottlenecks.
  • Pipeline-oriented execution model: Evolving Sirius’s core to a fully composable pipeline architecture that streamlines data flows across GPUs, host, and disk, efficiently overlapping computation and communication while enabling plug-and-play interoperability with open standards.
  • Scalable multi-node, multi-GPU architecture: Expanding Sirius’s capability to scale out efficiently across multiple nodes and GPUs, unlocking petabyte-scale data processing.

By investing in these MICE-compliant components, Sirius aims to make GPU analytics engines easier to build, integrate, and extend—not just for Sirius, but for the entire open-source analytics ecosystem.

Join Sirius 

Sirius is open source under the permissive Apache 2.0 license. Led by NVIDIA and the University of Wisconsin-Madison, the project welcomes contributions from researchers and practitioners who share the mission of driving the GPU era in data analytics.

We invite you to explore the repository, try Sirius on your own workloads, and contribute.
