NVIDIA CUDA-Q 0.12 introduces new simulation tools for accelerating how researchers develop quantum applications and design performant quantum hardware.
With the new run API, users can obtain more detailed statistics on individual runs (or shots) of a simulation, rather than being restricted to aggregated statistical outputs from simulations. Access to raw shot data is important to researchers for a variety of use cases such as analyzing noise correlation between qubits, result postselection, precise circuit benchmarking, and more.
The 0.12 release also includes additional features for the CUDA-Q dynamics backend, which enables users to simulate the evolution of quantum systems. This is an important capability for modeling and improving quantum hardware. This release adds better multidiagonal sparse matrix support and batching of states and operators, both of which allow users to scale dynamics techniques. CUDA-Q Dynamics also now supports generic super-operator equations, providing researchers with more flexibility.
CUDA-Q is an open source project, and this release includes community contributions from the unitaryHACK event, as well as Python 3.13 support. This post explains some of these new features in detail. For more detailed information, see the CUDA-Q 0.12 release notes.
Enabling more expressive applications
CUDA-Q is built from the ground up to support writing hybrid quantum-classical applications, using a kernel programming model to orchestrate QPUs, GPUs, and CPUs. Logic to run on a quantum device is encapsulated in quantum kernels. There are multiple ways to execute a kernel. One way is with the sample API, which returns aggregated statistics of the measurement counts of the qubits in the kernel.
For example, for a kernel that prepares a GHZ state on three qubits, calling sample with this kernel and specifying 1,000 shots returns the aggregated statistics of the measurement outcomes observed over those 1,000 shots: { 000:492 111:508 }. As expected for a GHZ state, outcomes of 000 and 111 are observed with roughly equal probability. However, it’s not possible to learn anything more detailed about each shot. The following example calls both sample and run on the same kernel:
import cudaq


@cudaq.kernel
def simple_ghz(num_qubits: int) -> int:
    qubits = cudaq.qvector(num_qubits)
    # Create GHZ state
    h(qubits[0])
    for i in range(1, num_qubits):
        x.ctrl(qubits[0], qubits[i])
    # Count how many qubits were measured as 1 in this shot
    result = 0
    for i in range(num_qubits):
        if mz(qubits[i]):
            result += 1
    return result


shots = 20  # using small number of shots for simplicity
sample_results = cudaq.sample(simple_ghz, 3, shots_count=shots)
print(f"Sample results: {sample_results}")
run_results = cudaq.run(simple_ghz, 3, shots_count=shots)
print(f"Run results: {run_results}")
$ python3 test.py
Sample results: { 000:11 111:9 }
Run results: [0, 3, 0, 0, 0, 0, 3, 0, 0, 3, 3, 3, 3, 3, 0, 3, 0, 3, 3, 3]
Unlike the sample API, the run API preserves the individual return value from each shot, which is useful when the application needs to analyze the distribution of returned results. With run, kernels can be more expressive and can include conditional measurements of specific qubits. The return value of these kernels is explicit and can contain multiple data types, including custom data types defined with Python data classes.
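As a rough illustration of the per-shot analysis this enables, the following sketch post-processes the list printed by run in the example above, using only the Python standard library. The variable names and the postselection rule are illustrative assumptions, not part of the CUDA-Q API.

```python
from collections import Counter

# Per-shot return values copied from the example output above
# (number of qubits measured as 1 in each of the 20 shots).
run_results = [0, 3, 0, 0, 0, 0, 3, 0, 0, 3, 3, 3, 3, 3, 0, 3, 0, 3, 3, 3]
num_qubits = 3

# Histogram of return values, similar in spirit to what sample aggregates.
histogram = Counter(run_results)

# Illustrative postselection: keep only shots consistent with an ideal
# GHZ state, where all qubits agree (all 0s or all 1s).
ghz_consistent = [r for r in run_results if r in (0, num_qubits)]
fraction_kept = len(ghz_consistent) / len(run_results)

# Shot order is preserved, so sequence statistics are also available,
# for example how often two consecutive shots returned the same value.
repeats = sum(a == b for a, b in zip(run_results, run_results[1:]))

print(histogram)
print(fraction_kept, repeats)
```

On a noiseless simulator every shot passes the postselection filter; with a noise model attached, the same filter would discard shots in which the qubits disagree.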
In addition, run has an asynchronous version, run_async, useful for long-running executions. Currently, run and run_async are supported for simulation backends only. For more information and code examples, see the CUDA-Q documentation.
Achieve better performance for dynamics simulation
The CUDA-Q dynamics backend enables the design, simulation, and execution of quantum dynamics systems. The 0.12 release adds multiple enhancements to this backend.
Previously, system dynamics was limited to the Lindblad master equation, specified by the Hamiltonian operator and collapse operators. Now users can simulate an arbitrary state evolution equation by specifying the evolution as a generic super-operator. A super-operator can be constructed as a linear combination of left and/or right multiplication actions of operator instances.
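The Lindblad master equation itself already fits this form. Written out in standard textbook notation, which the post does not define (density matrix ρ, Hamiltonian H, collapse operators L_k, and ħ = 1), each term is a left multiplication, a right multiplication, or both:

```latex
\begin{aligned}
\frac{d\rho}{dt}
  &= -i\,[H, \rho]
     + \sum_k \left( L_k \rho L_k^\dagger
     - \tfrac{1}{2}\,\{ L_k^\dagger L_k, \rho \} \right) \\
  &= \underbrace{-i\,H\rho - \tfrac{1}{2}\sum_k L_k^\dagger L_k\,\rho}_{\text{left multiplication}}
   \;+\; \underbrace{i\,\rho H - \tfrac{1}{2}\sum_k \rho\,L_k^\dagger L_k}_{\text{right multiplication}}
   \;+\; \underbrace{\sum_k L_k\,\rho\,L_k^\dagger}_{\text{left and right multiplication}}
\end{aligned}
```

A generic super-operator keeps this structure but lets the user choose the operators and coefficients in each term freely.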
Support for multidiagonal sparse matrices was also improved. Depending on the sparsity of the operator matrix or the subsystem dimension, CUDA-Q automatically uses the dense or multidiagonal data format for optimal performance.
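To give a feel for why a multidiagonal format helps, here is a minimal pure-Python sketch of storing a matrix by its nonzero diagonals and multiplying it with a vector. It illustrates the general idea only and is not CUDA-Q's internal implementation; the function dia_matvec and the dictionary layout are hypothetical choices made for this example.

```python
import math


def dia_matvec(diagonals, x):
    """Multiply a square matrix stored by diagonals with a vector.

    `diagonals` maps an offset k to the list of entries A[i][i + k],
    ordered by increasing row index i (k > 0 is above the main diagonal).
    """
    n = len(x)
    y = [0.0] * n
    for k, values in diagonals.items():
        first_row = max(0, -k)  # first row where column i + k is valid
        for j, value in enumerate(values):
            i = first_row + j
            y[i] += value * x[i + k]
    return y


# Bosonic annihilation operator truncated to 4 levels: its only nonzero
# entries are A[i][i + 1] = sqrt(i + 1), so one diagonal stores everything.
dim = 4
annihilate = {1: [math.sqrt(i + 1) for i in range(dim - 1)]}

# Equivalent dense matrix, built for comparison only.
dense = [[0.0] * dim for _ in range(dim)]
for i in range(dim - 1):
    dense[i][i + 1] = math.sqrt(i + 1)

state = [0.0, 0.0, 1.0, 0.0]  # the Fock state |2>
sparse_result = dia_matvec(annihilate, state)
dense_result = [sum(row[c] * state[c] for c in range(dim)) for row in dense]

stored_sparse = sum(len(v) for v in annihilate.values())
stored_dense = dim * dim
print(sparse_result, stored_sparse, stored_dense)
```

The diagonal layout stores dim - 1 numbers instead of dim squared, a gap that widens as the subsystem dimension grows. For small or dense operators the bookkeeping is not worth it, which is consistent with the format being selected automatically.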
The CUDA-Q evolve API can evolve multiple initial states and multiple Hamiltonians over time. With the 0.12 release, both states and Hamiltonians can be batched on multiple GPUs. This can significantly improve the performance of simulating many small identical system dynamics for the purpose of parameter sweeping or tomography. Collapse operators and super-operators can be batched in a similar manner.
For example, a dynamics simulation of an electrically driven silicon spin qubit involves a parameter sweep of amplitude values and creating a Hamiltonian for each amplitude value. Without batching, this will result in multiple calls to evolve, one for each amplitude value. With batching, users can create the following Hamiltonian batch with 1,024 different parameter values:
import numpy as np
from cudaq import spin, ScalarOperator

# Note: resonance_frequency is assumed to be defined earlier in the full example

# Sweep the amplitude
amplitudes = np.linspace(0.0, 0.5, 1024)

# Construct a list of Hamiltonian operators for each amplitude so that we can
# batch them all together
batched_hamiltonian = []
for amplitude in amplitudes:
    # Electric dipole spin resonance (`EDSR`) Hamiltonian
    H = 0.5 * resonance_frequency * spin.z(0) + amplitude * ScalarOperator(
        lambda t: 0.5 * np.sin(resonance_frequency * t)) * spin.x(0)
    # Append the Hamiltonian to the batched list
    # This allows us to compute the dynamics for all amplitudes in a single
    # simulation run
    batched_hamiltonian.append(H)
The batch can then be passed to a single evolve call:
results = cudaq.evolve(
    batched_hamiltonian,
    dimensions,
    schedule,
    psi0,
    observables=[boson.number(0)],
    collapse_operators=[],
    store_intermediate_results=cudaq.IntermediateResultSave.EXPECTATION_VALUE,
    integrator=ScipyZvodeIntegrator())
Running this example on an NVIDIA H100 GPU with different batch sizes yields the results shown in Figure 1. The more Hamiltonians batched, the lower the overall runtime. Batching all 1,024 Hamiltonians in one evolve call results in an 18x speedup over no batching.
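To make the batch-size comparison concrete, the sketch below shows how the number of evolve calls shrinks as the batch grows. It is a plain-Python illustration that runs without CUDA-Q: the helper split_into_batches is hypothetical, the batch sizes are arbitrary examples rather than those used for Figure 1, and a list of amplitude values stands in for the list of Hamiltonians.

```python
import math


def split_into_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Stand-in for the 1,024 Hamiltonians built in the sweep above; plain
# amplitude values are used here so the sketch runs without CUDA-Q.
num_points = 1024
amplitudes = [0.5 * i / (num_points - 1) for i in range(num_points)]

for batch_size in (1, 32, 256, 1024):
    batches = split_into_batches(amplitudes, batch_size)
    # Each batch would correspond to one evolve call.
    assert len(batches) == math.ceil(num_points / batch_size)
    print(f"batch size {batch_size:4d} -> {len(batches):4d} evolve calls")
```

A batch size of 1 corresponds to the unbatched case of one evolve call per amplitude, while a batch size of 1,024 corresponds to the single call shown earlier.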

For more details including code examples, see the CUDA-Q documentation.
unitaryHACK community contributions to CUDA-Q 0.12
unitaryHACK is an open source quantum computing stack hackathon, organized by Unitary Foundation, a nonprofit supporting the quantum computing community with open source projects, microgrants, and community events. As a recent event sponsor, NVIDIA submitted five CUDA-Q bounties, leading to the following three community contributions in CUDA-Q 0.12:
- Gopal-Dahale added a code example using dynamics to prepare a GHZ state with trapped ions. The example is based on the paper, Multi-Particle Entanglement of Hot Trapped Ions.
- ACE07-Sev added a tutorial on Approximate State Preparation Using MPS Sequential Encoding, showing how to prepare an initial state by decomposing the initial state vector into a matrix product state. This is beneficial when preparing an arbitrary input state to run on quantum hardware. In this case, the matrix product state decomposition ensures a low-depth approximate circuit for the input state vector.
- Randl added an initial implementation of getting the matrix associated with a quantum kernel. This new API returns the matrix representing the unitary of the execution path (that is, the trace) of the provided kernel.
CUDA-Q is an open source project that accepts community contributions year-round. To learn more, visit NVIDIA/cuda-quantum on GitHub.
Get started with CUDA-Q
Visit CUDA-Q Quick Start to learn more and get started. Explore CUDA-Q applications and dynamics examples and engage with the team on the NVIDIA/cuda-quantum GitHub repo. To learn more about other tools for enabling accelerated quantum supercomputing, check out NVIDIA Quantum.