Simulation / Modeling / Design

New NVIDIA CUDA-Q Features Boost Quantum Application Performance

NVIDIA CUDA-Q (formerly NVIDIA CUDA Quantum) is an open-source programming model for building quantum-accelerated supercomputing applications that take full advantage of CPU, GPU, and QPU compute resources. Developing these applications today is challenging and requires an easy-to-use coding environment coupled with powerful quantum simulation capabilities to efficiently evaluate and improve the performance of new algorithms.

CUDA-Q includes many new features that significantly improve performance, enabling users to push the limits of what can be simulated on classical supercomputers. This post demonstrates the performance enhancement of CUDA-Q for quantum simulation and provides a brief explanation of the improvements.

Improving performance  

Computing expectation values is the primary quantum task in a Variational Quantum Eigensolver (VQE) application. You can easily compute these values in CUDA-Q using the observe function. The performance of the three most recent CUDA-Q releases was tested using 24 and 28 qubit VQE problems aimed at determining the ground state energy of two small molecules (C2H2 and C2H4). The experiments used the standard UCCSD ansatz and were written in Python. 
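Conceptually, each observe call evaluates the expectation value ⟨ψ(θ)|H|ψ(θ)⟩ of a Hamiltonian H against the state prepared by a parameterized circuit. The following is a minimal NumPy sketch of that computation for a toy single-qubit problem; it illustrates the quantity being computed, not the CUDA-Q API or the UCCSD ansatz used in the experiments.

```python
import numpy as np

# Toy single-qubit "ansatz": |psi(theta)> = Ry(theta)|0>
def ansatz_state(theta):
    return np.array([np.cos(theta / 2), np.sin(theta / 2)], dtype=complex)

# Toy Hamiltonian H = Z + 0.5 * X, built from Pauli matrices
Z = np.array([[1, 0], [0, -1]], dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
H = Z + 0.5 * X

def expectation(theta):
    """<psi(theta)| H |psi(theta)> -- the quantity an observe call returns."""
    psi = ansatz_state(theta)
    return float(np.real(psi.conj() @ H @ psi))

# A VQE loop would minimize expectation(theta) over theta to
# approximate the ground state energy.
print(expectation(0.0))  # <0|H|0> = 1.0
```

In a real VQE run, a classical optimizer repeatedly calls this evaluation with updated parameters, which is why the per-call cost of observe dominates the application runtime.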

For each version (v0.6, v0.7, and v0.7.1), three state vector simulator backends were tested: nvidia (single precision), nvidia-fp64 (double precision), and nvidia-mgpu (nvidia-fp64 with gate fusion). The number following nvidia-mgpu in the results designates the gate fusion level, which was previously hard-coded as 6 but is a tunable parameter as of v0.7.1.

Gate fusion is an optimization technique where consecutive quantum gates are combined or merged into a single gate to reduce the overall computational cost and improve circuit efficiency. The number of gates combined (gate fusion level) can significantly affect simulation performance and needs to be optimized for every application. You can now adjust the CUDAQ_MGPU_FUSE parameter and specify custom gate fusion levels different from the v0.7.1 default of 4. 
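The benefit of gate fusion comes from pre-multiplying the small gate matrices once, so the large state vector is traversed a single time instead of once per gate. The sketch below illustrates the equivalence in NumPy for a 2-qubit state; it is a conceptual illustration, not CUDA-Q's fusion implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unitary(n):
    # QR decomposition of a random complex matrix yields a unitary
    q, _ = np.linalg.qr(rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n)))
    return q

state = np.zeros(4, dtype=complex)
state[0] = 1.0  # |00>

# Five consecutive single-qubit gates acting on qubit 0
gates = [np.kron(random_unitary(2), np.eye(2)) for _ in range(5)]

# Unfused: one pass over the state vector per gate
unfused = state.copy()
for g in gates:
    unfused = g @ unfused

# Fused: combine the gates into one matrix first, then apply once
fused_gate = gates[0]
for g in gates[1:]:
    fused_gate = g @ fused_gate  # note the left-multiplication order
fused = fused_gate @ state

assert np.allclose(unfused, fused)
```

In CUDA-Q itself, the fusion level is controlled through the CUDAQ_MGPU_FUSE parameter described above; fusing too few gates wastes passes over the state vector, while fusing too many produces large matrices that are expensive to apply, which is why the optimal level is application- and system-dependent.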

[Bar chart comparing execution times for the nvidia, nvidia-fp64, and nvidia-mgpu simulator backends across CUDA-Q v0.6, v0.7, and v0.7.1]
Figure 1. Execution times for 10 observe calls in 24 and 28 qubit UCCSD-VQE experiments

Figure 1 presents the runtime for each simulator and CUDA-Q version using NVIDIA H100 GPUs. The two simulators without gate fusion experienced at least a 2x speedup from v0.6 to v0.7.1. 

The nvidia-mgpu-6 v0.7.1 simulator results were 3.2x and 4.7x faster than the v0.6 results for the 24 and 28 qubit experiments, respectively. Tuning the gate fusion level improved the performance by an additional 12x and 1.2x, respectively, indicating how important and system-dependent this parameter can be.

The nvidia-mgpu simulator will be the new default starting in v0.8 (yet to be released), offering the best overall performance and enabling immediate utilization of multiple GPUs for many-qubit simulations.

Accelerating the code

CUDA-Q v0.7 includes a number of enhancements that improve compilation and accelerate the time required to make successive observe calls (Figure 2). 

First, the just-in-time (JIT) compilation path was improved to compile kernels more efficiently. Previously, this procedure scaled quadratically with the number of gates in the circuit; it now scales linearly.
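A familiar analogy for this kind of scaling fix, shown below in plain Python, is accumulating a program string: rebuilding the accumulated result for every gate copies all earlier work and costs O(n²) overall, while accumulating pieces and combining once is O(n). This is only an illustration of the scaling difference, not CUDA-Q's actual compiler code.

```python
gates = [f"g{i};" for i in range(1000)]

# Quadratic: rebuild the accumulated program for every gate,
# copying everything built so far on each iteration
quadratic = ""
for g in gates:
    quadratic = quadratic + g

# Linear: accumulate the pieces and combine them once at the end
linear = "".join(gates)

assert quadratic == linear
```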

[Chart showing the reduction in compile and execution time from the JIT compiler improvements in CUDA-Q v0.7 and v0.7.1]
Figure 2. Representation of the changes included in CUDA-Q v0.7 and v0.7.1 and the runtime improvements to four observe calls

Second, improvements to the hashing for JIT change-detection checks reduce the time required to check if any code needs to be recompiled due to environment changes. This virtually eliminates the time required for these checks for each observe call. 

Finally, v0.6 performed all log processing on every call, regardless of the specified log level. In v0.7, only the processing required for the specified log level is performed.
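The pattern is the same one Python's standard logging module encourages: guard expensive message construction behind a level check so disabled levels cost nothing. This sketch illustrates the pattern, not CUDA-Q's logging code.

```python
import logging

logger = logging.getLogger("demo")
logger.setLevel(logging.WARNING)  # DEBUG messages are disabled

def expensive_format(data):
    return ", ".join(str(x) for x in data)  # costly work in a hot path

data = list(range(1000))

# Eager (v0.6-style): the formatting runs even though DEBUG is disabled
logger.debug("state: %s" % expensive_format(data))

# Gated (v0.7-style): skip the work entirely when the level is off
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("state: %s", expensive_format(data))
```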

In addition to tunable gate fusion, v0.7.1 introduced automatic Hamiltonian batching (Figure 3), which further reduces the runtime of observe calls by enabling batched Hamiltonian evaluations on a single GPU.
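The idea can be illustrated in NumPy: instead of evaluating each Hamiltonian term against the state in a separate pass, stack the terms and evaluate them all in one batched contraction. This is a conceptual sketch of batched expectation evaluation, not CUDA-Q's implementation.

```python
import numpy as np

# Pauli matrices
I = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

# Terms of a toy 2-qubit Hamiltonian H = Z0 + Z1 + 0.3 * X0 X1
terms = np.stack([np.kron(Z, I), np.kron(I, Z), 0.3 * np.kron(X, X)])

psi = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)  # Bell state

# Per-term: one matrix-vector pass for each Hamiltonian term
per_term_total = sum(float(np.real(psi.conj() @ t @ psi)) for t in terms)

# Batched: a single einsum evaluates every term against the same state
batched_terms = np.real(np.einsum("i,kij,j->k", psi.conj(), terms, psi))
batched_total = float(batched_terms.sum())

assert np.isclose(per_term_total, batched_total)
```

Batching the terms lets the simulator amortize the work of preparing and traversing the state, which is where the runtime reduction for observe calls comes from.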

[Chart showing that Hamiltonian batching greatly reduces the time spent computing Hamiltonian terms while leaving the time to execute the base circuit unchanged]
Figure 3. Representation of the speedup from Hamiltonian batching

To further improve performance, future releases will include more enhancements to state preparation, handling of Pauli operators, and unitary synthesis. 

Get started with CUDA-Q

The current and anticipated CUDA-Q improvements provide developers with a more performant platform to build quantum accelerated supercomputing applications. Not only is development today accelerated, but applications constructed on CUDA-Q are positioned to deploy in hybrid CPU, GPU, and QPU environments necessary for practical quantum computing.

The CUDA-Q Quick Start guide will help you to quickly set up your environment, while the Basics section will guide you through writing your first CUDA-Q application. Explore the code examples and applications to get inspiration for your own quantum application development. To provide feedback and suggestions, visit the NVIDIA/cuda-quantum GitHub repo.
