NVIDIA CUDA-Q (formerly NVIDIA CUDA Quantum) is an open-source programming model for building quantum accelerated supercomputing applications that take full advantage of CPU, GPU, and QPU computing capabilities. Developing these applications today is challenging and requires an easy-to-use coding environment coupled with powerful quantum simulation capabilities to efficiently evaluate and improve the performance of new algorithms.
CUDA-Q includes many new features that significantly improve performance, enabling users to push the limits of what can be simulated on classical supercomputers. This post demonstrates the performance enhancement of CUDA-Q for quantum simulation and provides a brief explanation of the improvements.
Improving performance
Computing expectation values is the primary quantum task in a Variational Quantum Eigensolver (VQE) application. You can easily compute these values in CUDA-Q using the observe function. The performance of the three most recent CUDA-Q releases was tested using 24 and 28 qubit VQE problems aimed at determining the ground state energy of two small molecules (C2H2 and C2H4). The experiments used the standard UCCSD ansatz and were written in Python.
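To make the pattern concrete, the following is a minimal sketch of how an expectation value is computed with a single observe call. It uses a toy two-qubit ansatz and Hamiltonian (the small deuteron-style example from the CUDA-Q documentation) rather than the full UCCSD-VQE setup described above:

```python
import cudaq
from cudaq import spin

# Minimal sketch (not the UCCSD ansatz from the experiments): a
# one-parameter, two-qubit ansatz built with the kernel builder API.
kernel, theta = cudaq.make_kernel(float)
qubits = kernel.qalloc(2)
kernel.x(qubits[0])
kernel.ry(theta, qubits[1])
kernel.cx(qubits[1], qubits[0])

# A small example Hamiltonian expressed as a sum of Pauli terms.
hamiltonian = 5.907 - 2.1433 * spin.x(0) * spin.x(1) \
              - 2.1433 * spin.y(0) * spin.y(1) \
              + 0.21829 * spin.z(0) - 6.125 * spin.z(1)

# observe runs the circuit on the selected backend and returns an
# ObserveResult holding <H> at the given parameter value.
result = cudaq.observe(kernel, hamiltonian, 0.59)
print(result.expectation())  # expectation_z() in older releases
```

In a VQE loop, a classical optimizer calls observe repeatedly while updating the ansatz parameters, which is why the per-call overheads discussed below matter.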
For each version (v0.6, v0.7, and v0.7.1), three state vector simulator backends were tested: nvidia (single precision), nvidia-fp64 (double precision), and nvidia-mgpu (nvidia-fp64 with gate fusion). The number following nvidia-mgpu designates the gate fusion level, which was previously hard coded to 6 but is a tunable parameter as of v0.7.1.
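Each backend is selected with cudaq.set_target before kernels are run. A minimal sketch, assuming a CUDA-Q installation with the GPU backends available:

```python
import cudaq

# Choose one of the GPU state vector simulators referenced above.
cudaq.set_target("nvidia")         # single-precision state vector
# cudaq.set_target("nvidia-fp64")  # double-precision state vector
# cudaq.set_target("nvidia-mgpu")  # double precision with gate fusion
```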
Gate fusion is an optimization technique in which consecutive quantum gates are combined into a single gate to reduce the overall computational cost and improve circuit efficiency. The number of gates combined (the gate fusion level) can significantly affect simulation performance and needs to be optimized for every application. You can now adjust the CUDAQ_MGPU_FUSE environment variable to specify custom gate fusion levels different from the v0.7.1 default of 4.
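For example, a fusion level of 6 could be requested as follows. This is a sketch that assumes the variable is read when the nvidia-mgpu target is initialized; alternatively, export it in the shell before launching the script:

```python
import os

# Request a custom gate fusion level; assumed to be read when the
# nvidia-mgpu target is initialized.
os.environ["CUDAQ_MGPU_FUSE"] = "6"  # override the v0.7.1 default of 4

import cudaq
cudaq.set_target("nvidia-mgpu")
```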

Figure 1. observe calls in 24 and 28 qubit UCCSD-VQE experiments
Figure 1 presents the runtime for each simulator and CUDA-Q version using NVIDIA H100 GPUs. The two simulators without gate fusion experienced at least a 1.7x speedup from v0.6 to v0.7.1.
The nvidia-mgpu-6 v0.7.1 simulator results were 2.4x and 2.9x faster than the v0.6 results for the 24 and 28 qubit experiments, respectively. Tuning the gate fusion level improved the performance by an additional 10x and 1.3x, respectively, indicating how important and system-dependent this parameter can be.
The nvidia-mgpu simulator will be the new default starting in v0.8 (yet to be released), offering the best overall performance and enabling immediate utilization of multiple GPUs for many-qubit simulations.
Note that the original v0.7.1 timing results were updated on July 1, 2024. An LLVM issue initially produced incorrect UCCSD results; the revised timings were collected after a bug fix that ensures correct results.
Accelerating the code
CUDA-Q v0.7 includes a number of enhancements that improve compilation and reduce the time required to make successive observe calls (Figure 2).
First, the just-in-time (JIT) compilation path was improved to compile kernels more efficiently. Previously, this procedure scaled quadratically with the number of gates in the circuit; it now scales linearly.

Figure 2. Time to make successive observe calls
Second, improvements to the hashing for JIT change-detection checks reduce the time required to check whether any code needs to be recompiled due to environment changes. This virtually eliminates the time required for these checks for each observe call.
Finally, v0.6 would perform all log processing for every call, regardless of the specified log level. This was changed in v0.7 to only perform the necessary processing for the specified log level.
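For reference, a sketch of how the log level is typically specified, assuming it is controlled by the CUDAQ_LOG_LEVEL environment variable (an assumption; check the CUDA-Q documentation for your release):

```python
import os

# Assumption: the runtime log level is read from CUDAQ_LOG_LEVEL.
# Leaving it unset keeps log processing overhead out of tight observe
# loops; set it only when diagnostics are needed.
os.environ["CUDAQ_LOG_LEVEL"] = "info"

import cudaq
```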
In addition to gate fusion, v0.7.1 introduced automatic Hamiltonian batching (Figure 3), which further reduces the runtime of observe calls by enabling batched Hamiltonian evaluations on a single GPU.
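Because the batching is automatic, no changes to the observe call itself should be required: a single call over a multi-term spin operator benefits directly. A minimal sketch, using a toy Ising-style Hamiltonian rather than the molecular Hamiltonians above:

```python
import cudaq
from cudaq import spin

cudaq.set_target("nvidia-mgpu")

# A simple many-qubit state preparation circuit.
num_qubits = 10
kernel = cudaq.make_kernel()
qubits = kernel.qalloc(num_qubits)
for i in range(num_qubits):
    kernel.h(qubits[i])

# A Hamiltonian with many Pauli terms; the terms passed in a single
# observe call are evaluated in batches on one GPU automatically.
hamiltonian = spin.z(0) * spin.z(1)
for i in range(1, num_qubits - 1):
    hamiltonian += spin.z(i) * spin.z(i + 1)

print(cudaq.observe(kernel, hamiltonian).expectation())
```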

To further improve performance, future releases will include more enhancements to state preparation, handling of Pauli operators, and unitary synthesis.
Get started with CUDA-Q
The current and anticipated CUDA-Q improvements provide developers with a more performant platform to build quantum accelerated supercomputing applications. Not only is development today accelerated, but applications constructed on CUDA-Q are also positioned to deploy in the hybrid CPU, GPU, and QPU environments necessary for practical quantum computing.
The CUDA-Q Quick Start guide will help you to quickly set up your environment, while the Basics section will guide you through writing your first CUDA-Q application. Explore the code examples and applications to get inspiration for your own quantum application development. To provide feedback and suggestions, visit the NVIDIA/cuda-quantum GitHub repo.