Advanced Large-Scale Quantum Simulation Techniques in cuQuantum SDK v25.11

Simulating large-scale quantum computers has become more difficult as the quality of quantum processing units (QPUs) improves. Validating the results is key to ensure that after the devices scale beyond what is classically simulable, we can still trust the outputs.

Similarly, when generating large-scale datasets for various AI models that aim to aid in the operation of quantum processors, we see the need to offer useful training data at all scales and abstractions accelerated by GPUs. Examples include AI quantum error correction decoders, AI compilers, AI agents for calibration and control, and models to generate new device designs.

cuQuantum SDK is a set of high-performance libraries and tools for accelerating quantum computing simulations at both the circuit and device levels by orders of magnitude. The latest version cuQuantum SDK, v25.11 introduces components that accelerate two new workloads: Pauli propagation and stabilizer simulations. Each of these is critical for simulating large scale quantum computers.

This post dives into how you can start running Pauli propagation simulations and accelerate sampling from your stabilizer simulations to solve these problems with GPU-accelerated supercomputers.

cuQuantum cuPauliProp

Pauli propagation is a relatively new method for efficiently simulating the observables of large-scale quantum circuits, which can include noise models of real quantum processors. By expressing states and observables as weighted sums of Pauli tensor products, circuit simulation can dynamically discard terms which contribute insignificantly to a sought expectation value. This permits estimation of experimental quantities which are otherwise intractable for exact simulation.

Many relevant quantum computing applications are centered around computation of expectation values, for example VQE and quantum simulation of physical dynamics. Various exact and approximate classical simulation techniques enable calculating such observables for large circuits, though they become prohibitively expensive in differing settings. For example, the Matrix Product State technique, a very popular approximate tensor network state method for circuit simulation, is typically ill-suited for large circuits which encode the dynamics of two or three dimensional physical systems.

Pauli propagation is a complementary and useful addition to the approximate circuit simulation toolbox, for both pure and noisy circuits. Beyond being provably efficient for simulating near-Clifford and/or very noisy circuits, Pauli propagation has shown impressive performance when simulating circuits which Trotterize the evolution of certain quantum spin systems. This includes some “utility circuits” named in reference to their use in IBM’s utility experiment involving a 127 qubit device as detailed in Evidence for the Utility of Quantum Computing Before Fault Tolerance. Characterizing which circuits can be efficiently simulated with Pauli propagation is an ongoing research effort, as significant as refinement of the algorithmic details of the method itself.

cuQuantum 25.11 offers primitives to accelerate Pauli propagation or derivative methods on NVIDIA GPUs with the release of this new cuQuantum library, enabling developers and researchers to advance the frontier of classical circuit simulation. Core functions are described in the following sections.

Library initialization

Initialize the library handle and workspace descriptor required for operations:

import cupy as cp
from cuquantum.bindings import cupauliprop
from cuquantum import cudaDataType

# Create library handle and workspace descriptor
handle = cupauliprop.create()
workspace = cupauliprop.create_workspace_descriptor(handle)

# Assign GPU memory to workspace
ws_size = 1024 * 1024 * 64  # Example: 64 MiB
d_ws = cp.cuda.alloc(ws_size)
cupauliprop.workspace_set_memory(
    handle, workspace, cupauliprop.Memspace.DEVICE,
    cupauliprop.WorkspaceKind.WORKSPACE_SCRATCH, d_ws.ptr, ws_size
)

Define an observable

To start the simulation, allocate device memory for the Pauli expansions (sums of products of Pauli operators expressed as a set of unsigned integers as well as their coefficients) and initialize the input expansion with an observable (for example, $Z_{62}$ ).

# Helper to encode Pauli string into packed integers (2 bits per qubit: X and Z masks)
def encode_pauli(num_qubits, paulis, qubits):
    num_ints = cupauliprop.get_num_packed_integers(num_qubits)
    # Packed integer format: [X_ints..., Z_ints...]
    packed = np.zeros(num_ints * 2, dtype=np.uint64)
    x_mask, z_mask = packed[:num_ints], packed[num_ints:]
    for p, q in zip(paulis, qubits):
        idx, bit = divmod(q, 64)
        if p in (cupauliprop.PauliKind.PAULI_X, cupauliprop.PauliKind.PAULI_Y):
            x_mask[idx] |= (1 << bit)
        if p in (cupauliprop.PauliKind.PAULI_Z, cupauliprop.PauliKind.PAULI_Y):
            z_mask[idx] |= (1 << bit)
    return packed

# 1. Allocate Device Buffers
# Define capacity (max number of Pauli strings) and allocate buffers
max_terms = 10000 
num_packed_ints = cupauliprop.get_num_packed_integers(num_qubits)
d_pauli = cp.zeros((max_terms, 2 * num_packed_ints), dtype=cp.uint64, order="C")
d_coef = cp.zeros(max_terms, dtype=cp.float64, order="C")

# 2. Populate Initial Observable (Z_62)
encoded_pauli = encode_pauli(num_qubits, [cupauliprop.PauliKind.PAULI_Z], [62])

# Assign the first term
d_pauli[0] = cp.array(encoded_pauli)
d_coef[0] = 1.0

# 3. Create Pauli Expansions
# Input expansion: pre-populated with our observable
expansion_in = cupauliprop.create_pauli_expansion(
    handle, num_qubits,
    d_pauli.data.ptr, d_pauli.nbytes,
    d_coef.data.ptr, d_coef.nbytes,
    cudaDataType.CUDA_R_64F,
    1, 1, 1  # num_terms=1, is_sorted=True, is_unique=True
)

# Output expansion: empty initially (num_terms=0), needs its own buffers
d_pauli_out = cp.zeros_like(d_pauli)
d_coef_out = cp.zeros_like(d_coef)

expansion_out = cupauliprop.create_pauli_expansion(
    handle, num_qubits,
    d_pauli_out.data.ptr, d_pauli_out.nbytes,
    d_coef_out.data.ptr, d_coef_out.nbytes,
    cudaDataType.CUDA_R_64F,
    0, 0, 0
)

Operator creation

Define quantum gates or operators, such as a Pauli rotation $e^{-i \frac{\theta}{2} P}$ .

# Create a Z-rotation gate on qubit 0
paulis = [cupauliprop.PauliKind.PAULI_Z]
qubits = [0]
gate = cupauliprop.create_pauli_rotation_gate_operator(
    handle, theta, 1, qubits, paulis
)

Operator application

Apply an operator (a gate or noise-channel) to the expansion, evolving the system. Note that most applications work in the so-called Heisenberg picture, which means that the gates in the circuit are applied in reverse order to the observable. This also requires passing the adjoint argument as True when applying the operator.

# Get a view of the current terms in the input expansion
num_terms = cupauliprop.pauli_expansion_get_num_terms(handle, expansion_out)
view = cupauliprop.pauli_expansion_get_contiguous_range(
    handle, expansion_in, 0, num_terms)

# Apply gate: in_expansion -> gate -> out_expansion
cupauliprop.pauli_expansion_view_compute_operator_application(
    handle, view, expansion_out, gate,
    True,         # adjoint?
    False, False,  # make_sorted?, keep_duplicates?
    0, None,       # Truncation strategies (optional)
    workspace
)

Expectation values

Compute the expectation value (trace with the zero state $\langle 0 | O | 0 \rangle$ ).

import numpy as np
result = np.zeros(1, dtype=np.float64)

# Compute trace
cupauliprop.pauli_expansion_view_compute_trace_with_zero_state(
    handle, view, result.ctypes.data, workspace
)

Combining these methods shows that NVIDIA DGX B200 GPUs offer significant speedups over CPU based codes. For small coefficient cutoffs, multiple order of magnitude speedups are observed over single-threaded Qiskit Pauli-Prop on the most recent dual-socket data center CPUs.

Bar chart showing cuQuantum cuPauliProp 55x to 177x speedup for varying coefficient cutoff values, 0.0001, 0.00005, 0.000025 respectively, for GPU simulations of pi/4 rotations of the 127 qubit IBM Utility Circuit, when leveraging NVIDIA DGX B200 GPU compared to Qiskit PauliProp on Intel Xeon Platinum 8570 CPU. — Figure 1. cuQuantum GPU simulations for pi/4 rotations of the 127 qubit IBM utility circuit show multiple orders of magnitude speedups for a range of truncation schemes on NVIDIA DGX B200 compared to Qiskit PauliProp on an Intel Xeon Platinum 8570 CPU

cuQuantum cuStabilizer

Stabilizer simulations arise from the Gottesman-Knill theorem, which states that gates within the Clifford group (normalizer of the qubit Pauli group) can be efficiently simulated classically in polynomial time. This Clifford group is made up of CNOT, Hadamard and Phase gates (S). For this reason, stabilizer simulations have been critical for resource estimation and testing quantum error correcting codes at large scales.

There are a few different approaches to building stabilizer simulators, from tableau simulators to frame simulators. cuStabilizer currently addresses improving the throughput for sampling rates in a frame simulator.

Frame simulation only focuses on effects of quantum noise on the quantum state. As the quantum devices are imperfect, it’s possible to model the imperfections in circuit execution by inserting random “noisy” gates in it. If the noise-free result is known, getting the noisy result requires only to track the difference, or how the noisy gates change the circuit output.

It turns out that this effect is much easier to compute compared to full circuit simulation. The number of possible combinations of how noisy gates can be inserted grows very fast with the size of the circuit, which means that in order to reliably model the error correcting algorithm a large number of shots is required.

For users interested in developing quantum error correcting codes, testing new decoders, or generating data for AI decoders, frame simulation is ideal. APIs are available to improve sampling and accelerate any frame simulation on NVIDIA GPUs. The cuQuantum SDK cuStabilizer library exposes C API and Python API. While the C API will provide better performance, the Python API is best for getting started, as it is more flexible and handles memory allocation for the user.

Create a circuit and apply frame simulation

cuStabilizer has two main classes involved in the simulation: Circuit and FrameSimulator. The circuit can accept a string that contains circuit instructions, similar to the format used in the Stim CPU simulator. To create a FrameSimulator you need to specify information about the circuit, to allocate enough resources.

import cuquantum.stabilizer as cust
# Circuit information
num_qubits = 5
num_shots = 10_000
num_measurements = 2

# Create a circuit on GPU
circ = cust.Circuit("""
H 0 1
X_ERROR(0.1) 1 2
DEPOLARIZE2(0.5) 2 3
CX 0 1 2 3
M 0 3
"""

sim = cust.FrameSimulator(
      num_qubits,
      num_shots,
      num_measurements
)
sim.apply(circ)

You can reuse a simulator between different circuits, as long as your simulator has enough qubits available. The following code will apply a circuit to a state modified by the first circuit circ.

circ2 = cust.Circuit("""
Z_ERROR(0.01) 1 4 
""")
sim.apply(circ2)

Read simulation results

The state of simulator consists of three bit-tables:

x_bits
z_bits
measurement_bits

The first two tables store the Pauli Frame (similar to the cuPauliProp Pauli Expansion, but in a different layout and without the weights). The third stores the difference between noise-free measurement and the noisy measurements in each shot.

The most efficient way to store the bits is to encode them in an integer value. This is referred to as “bit-packed” format, where each byte in memory stores eight significant bits. While this format is most efficient, manipulating individual bits requires extra steps in your program. The bit-packed format is not easily integrated with the common notion of “array,” as those are considered to contain values of several bytes, such as int32.

To provide an easy representation in numpy, cuStabilizer supports the bit_packed argument, which can toggle between different formats. If bit_packed=False, each bit is encoded in one uint8 value, thus using 8x more memory. When specifying input bit tables, the format is also important for performance, as described in the cuQuantum documentation.

# Get measurement flips
m_table = sim.get_measurement_bits(bit_packed=False)
print(m_table.dtype)
# uint8
print(m_table.shape)
# (2, 10000)
print(m_table)
# [[0 0 0 ... 0 0 0]
#  [1 0 0 ... 0 1 1]]

x_table, z_table = sim.get_pauli_xz_bits(bit_packed=True)
print(x_table.dtype)
# uint8
print(x_table.shape)
# (5, 1252)

For easy access to the underlying Pauli frames, cuStabilizer provides a PauliTable class, which can be indexed by the shot index:

# Get pauli table
pauli_table = sim.get_pauli_table()
num_frames_print = 5
for i in range(num_frames_print):
    print(pauli_table[i])
# ...XZ
# ZXX..
# ...Z.
# .....
# ...Z.

When leveraging the sampling API we see that we can drastically improve the throughput when compared to Google Stim, state of the art code on the latest data center CPUs.

Surface code simulation

cuStabilizer can accept Stim circuits as input, and you can use it to simulate surface code circuits:

import stim

p = 0.001
circ_stim = stim.Circuit.generated(
    "surface_code:rotated_memory_z",
    distance=5,
    rounds=5,
    after_clifford_depolarization=p,
    after_reset_flip_probability=p,
    before_measure_flip_probability=p,
    before_round_data_depolarization=p,
)
circ = cust.Circuit(circ_stim)
sim = cust.FrameSimulator(
    circ_stim.num_qubits,
    num_shots,
    circ_stim.num_measurements,
    num_detectors=circ_stim.num_detectors,
)
sim.apply(circ)

pauli_table = sim.get_pauli_table()
for i in range(num_frames_print):
    print(pauli_table[i])

Note that the most efficient simulation is achieved for a large number of samples and number of qubits. Furthermore, the best performance is achieved when the resulting bit tables are kept on GPU, as when using the cupy package.

Figure 2 demonstrates the best use of cuStabilizer and expected performance on the NVIDIA B200 GPU and Intel Xeon Platinum 8570 CPU. It shows that the optimal performance for a code distance 31 is achieved at about a million shots. Users can get a 1,060x speedup for large code distances.

When comparing stim on Intel Xeon Platinum 8570 CPU to stim plus cuStabilizer on NVIDIA DGX B200 GPU for surface code from distance 2 to 75 each with 1 million shots, we see significantly better runtime scaling and performance. Users can expect to see between 6.7x and 1060.7x faster speedups at code distance 2 and 30 respectively. — *Figure 2. Runtime performance on surface code of different distances and 1 million shots, comparing stim plus cuStabilizer on an NVIDIA DGX B200 GPU with stim on an Intel Xeon Platinum 8570 CPU*

Get started with new cuQuantum libraries

The latest functionalities in cuQuantum continue to push the bounds of what is possible with GPU based quantum computer emulations enabling two new major classes of workloads. These workloads are critical for quantum error correction, verification and validation, and algorithm engineering for intermediate to large scale quantum devices.

Get started with cuQuantum cuPauliProp using pip install cupauliprop-cu13. To learn more, review the cuPauliProp documentation.
Get started with cuQuantum cuStabilizer using pip install custabilizer-cu13. To learn more, review the cuStabilizer documentation.