Real-Time Decoding, Algorithmic GPU Decoders, and AI Inference Enhancements in NVIDIA CUDA-Q QEC

Real-time decoding is crucial to fault-tolerant quantum computers. By enabling decoders to operate with low latency concurrently with a quantum processing unit (QPU), we can apply corrections to the device within the coherence time. This prevents errors from accumulating and degrading the value of the results. Real-time decoding can run online, with a real quantum device, or offline, with a simulated quantum processor.

To support real-time decoding and enable research into better solutions, NVIDIA CUDA-Q QEC version 0.5.0 includes a range of improvements. These include support for online real-time decoding, new GPU-accelerated algorithmic decoders, infrastructure for high-performance AI decoder inference, sliding window decoder support, and more Pythonic interfaces.

We’ll cover all of these improvements in this post and dive into how you can use them to accelerate your quantum error correction research, or operationalize real-time decoding with your quantum computer.

Real-time decoding is real with CUDA-Q QEC

Users can perform real-time decoding in a four-stage workflow: detector error model (DEM) generation, decoder configuration, decoder loading and initialization, and real-time decoding.

First, we characterize how errors on the device behave during operation. Using a helper function, we can generate the detector error model (DEM) from a quantum code, noise model, and circuit parameters. The function generates a complete DEM that maps error mechanisms to syndrome patterns.

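import cudaq
import cudaq_qec as qec

# Step 1: Generate detector error model
print("Step 1: Generating DEM...")
cudaq.set_target("stim")

# Assumed for illustration: a distance-3 surface code
code = qec.get_code("surface_code", distance=3)

noise = cudaq.NoiseModel()
noise.add_all_qubit_channel("x", cudaq.Depolarization2(0.01), 1)

# Z-basis memory circuit: prepare |0>, then 3 rounds of syndrome extraction
dem = qec.z_dem_from_memory_circuit(code, qec.operation.prep0, 3, noise)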
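
To make "maps error mechanisms to syndrome patterns" concrete, here is a minimal sketch in plain Python. The matrix values and the helper `syndrome_of` are illustrative assumptions, not output from or part of CUDA-Q QEC. Each column stands for one error mechanism, each row for one detector, and the syndrome is the mod-2 sum of the columns of the mechanisms that fired.

```python
# Toy detector error matrix: rows are detectors, columns are error mechanisms.
# Illustrative values only, not generated by CUDA-Q QEC.
H = [
    [1, 0, 0, 1, 0, 1, 1],
    [0, 1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 1, 1, 1],
]


def syndrome_of(matrix, errors):
    """Return the detector pattern (mod 2) triggered by the fired mechanisms."""
    return [sum(h * e for h, e in zip(row, errors)) % 2 for row in matrix]


# Mechanism 3 fires alone: it flips detectors 0 and 1.
single = syndrome_of(H, [0, 0, 0, 1, 0, 0, 0])

# Mechanisms 3 and 4 fire together: detector 1 is flipped twice and cancels.
double = syndrome_of(H, [0, 0, 0, 1, 1, 0, 0])

print(single, double)
```

A decoder works in the opposite direction: given only the syndrome, it infers a likely set of mechanisms that could have produced it.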

The next step is to choose a decoder and configure it. We’ll discuss new decoders in greater detail in the following sections.

Using the DEM, the user configures the decoder and then saves this configuration to a YAML file. This file ensures that the decoders can correctly interpret the syndrome measurements.

# Create decoder config
config = qec.decoder_config()
config.id = 0
config.type = "nv-qldpc-decoder"
config.block_size = dem.detector_error_matrix.shape[1]
# . . .
# check out nvidia.github.io/cudaqx/examples_rst/qec/realtime_decoding.html
# . . .

Before circuit execution, the user loads the YAML file. CUDA-Q QEC interprets the information, sets up the appropriate implementation in the decoder, and registers it with the CUDA-Q runtime.

# Save decoder config
with open("config.yaml", 'w') as f:
    f.write(config.to_yaml_str(200))

Now, users can begin executing quantum circuits. Inside CUDA-Q kernels, the decoding API interacts with the decoders. As the stabilizers of the logical qubits are measured, syndromes are enqueued to the corresponding decoder, which processes them. When corrections are needed, the decoder suggests operations to apply to the logical qubits.

# Load config and run circuit
qec.configure_decoders_from_file("config.yaml")
run_result = cudaq.run(qec_circuit, shots_count=10)

GPU-accelerated RelayBP

A recently developed decoder algorithm, RelayBP, addresses pitfalls of belief propagation (BP) decoders, a popular class of algorithmic decoders for quantum low-density parity check (qLDPC) codes. A common approach, BP+OSD (belief propagation with ordered statistics decoding), relies on a GPU-accelerated BP decoder and then uses an ordered statistics post-processing algorithm on the CPU: if BP fails, OSD kicks in. This works, but it is hard to optimize and parallelize for the low latency that real-time decoding requires.

RelayBP modifies BP by giving each node of the graph a memory strength, which controls how much that node remembers or forgets past messages. This dampens or breaks the harmful symmetries that usually trap BP and prevent it from converging.
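
The memory idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the CUDA-Q QEC implementation: it assumes an update in which each node's prior for the next iteration is a blend of its original prior and its previous marginal, weighted by a per-node memory strength `gamma`. The function name `blend_prior` and all numbers are hypothetical.

```python
def blend_prior(initial_llr, previous_marginal, gamma):
    """Blend the original prior with the last marginal using memory strength gamma.

    gamma = 0 forgets the past entirely (the plain BP prior);
    gamma = 1 keeps only the previous marginal.
    """
    return (1.0 - gamma) * initial_llr + gamma * previous_marginal


# Two nodes that look identical to plain BP: same prior, same last marginal.
initial_llrs = [2.0, 2.0]
previous_marginals = [-1.0, -1.0]

# A uniform memory strength keeps the two nodes identical.
uniform = [
    blend_prior(p, m, 0.5) for p, m in zip(initial_llrs, previous_marginals)
]

# Different per-node strengths give the nodes different priors, breaking the tie.
gammas = [0.25, 0.75]
disordered = [
    blend_prior(p, m, g)
    for p, m, g in zip(initial_llrs, previous_marginals, gammas)
]

print(uniform, disordered)
```

With a single shared strength the two nodes stay indistinguishable, while per-node strengths push them apart. That is the intuition behind using memory to break symmetries.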

Figure 1. Peak decoding throughput (iterations/sec) for RelayBP FP32 on NVIDIA DGX GB200, measured for XYZ and XZ decoding of 1-Gross and 2-Gross quantum error-correction codes, with syndrome complexity held constant to isolate peak performance.

Throughput is high for both single and batched syndromes, reaching 1.6 million and 500k iterations per second for XZ 1-Gross and 2-Gross, respectively. The number of iterations required to decode a syndrome varies significantly with the physical error rates and the actual syndromes.

Users can instantiate a RelayBP decoder easily with a few lines of code, outlined below.

import numpy as np
import cudaq_qec as qec   
  
# Simple 3x7 parity check matrix for demonstration
H_list = [[1, 0, 0, 1, 0, 1, 1], [0, 1, 0, 1, 1, 0, 1],
         [0, 0, 1, 0, 1, 1, 1]]
H = np.array(H_list, dtype=np.uint8)

# Configure relay parameters
srelay_config = {
   'pre_iter': 5,  # Run 5 iterations with gamma0 before relay legs
   'num_sets': 3,  # Use 3 relay legs
   'stopping_criterion': 'FirstConv'  # Stop after first convergence
}

# Create a decoder with Relay-BP
decoder_relay = qec.get_decoder("nv-qldpc-decoder",
                               H,
                               use_sparsity=True,
                               bp_method=3,   
                               composition=1,
                               max_iterations=50,
                               gamma0=0.3,
                               gamma_dist=[0.1, 0.5],
                               srelay_config=srelay_config,
                               bp_seed=42)
print("   Created decoder with Relay-BP (gamma_dist, FirstConv stopping)")

# Decode a syndrome
syndrome = np.array([1, 0, 1], dtype=np.uint8)
decoded_result = decoder_relay.decode(syndrome)

AI decoder inference

AI decoders are becoming increasingly popular for handling specific error models, where they can offer better accuracy or latency than algorithmic decoders.

Users can develop AI decoders by generating training data, training a model, and exporting the model to ONNX. Once this is complete, use the CUDA-Q QEC NVIDIA TensorRT-based AI decoder inference engine to operate low-latency AI decoders.
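
As a rough sketch of the first step, training pairs can be produced by sampling errors and computing their syndromes. The example below uses only the Python standard library and assumes independent bit-flip noise on a toy parity check matrix. The helper name `sample_training_pairs` and the error rate are hypothetical, and a realistic workflow would more likely sample from a circuit-level simulation.

```python
import random

# Toy parity check matrix (rows: checks, columns: bits). Illustrative only.
H = [
    [1, 0, 0, 1, 0, 1, 1],
    [0, 1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 1, 1, 1],
]


def sample_training_pairs(matrix, error_rate, num_samples, seed=0):
    """Sample (syndrome, error) pairs under independent bit flips."""
    rng = random.Random(seed)  # seeded so the data set is reproducible
    num_bits = len(matrix[0])
    pairs = []
    for _ in range(num_samples):
        error = [1 if rng.random() < error_rate else 0 for _ in range(num_bits)]
        syndrome = [sum(h * e for h, e in zip(row, error)) % 2 for row in matrix]
        pairs.append((syndrome, error))
    return pairs


pairs = sample_training_pairs(H, error_rate=0.1, num_samples=1000)
print(pairs[0])
```

The syndromes serve as model inputs and the errors (or a logical observable derived from them) as training targets.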

CUDA-Q QEC recently introduced infrastructure for integrated AI decoder inference with offline decoding. This means that it’s now easy to run any AI decoder saved to an ONNX file with CUDA-Q QEC and an emulated quantum computer.

import cudaq_qec as qec
import numpy as np

# Note: The AI decoder doesn't use the parity check matrix.
# A placeholder matrix is provided here to satisfy the API.
H = np.array([[1, 0, 0, 1, 0, 1, 1],
              [0, 1, 0, 1, 1, 0, 1],
              [0, 0, 1, 0, 1, 1, 1]], dtype=np.uint8)

# Create TensorRT decoder from ONNX model
decoder = qec.get_decoder("trt_decoder", H,
                          onnx_load_path="ai_decoder.onnx")

# Decode a syndrome
syndrome = np.array([1.0, 0.0, 1.0], dtype=np.float32)
result = decoder.decode(syndrome)
print(f"Predicted error: {result}")

We also offer recommendations for reducing initialization time by creating pre-built TensorRT engines. Because ONNX files support a range of precisions (int8, fp8, fp16, bf16, and tf32), you can explore a range of model and hardware combinations to optimize AI decoder deployment.

Sliding window decoding

Sliding window decoders enable a decoder to handle circuit-level noise across multiple syndrome extraction rounds. These decoders process the syndrome before the complete measurement sequence is received, which can help reduce the overall latency. The tradeoff is that this can increase logical error rates.

How and when to use this tool depends on the noise model, the parameters of the error correcting code, and the latency budget of a given quantum processor. With the introduction of the sliding window decoder in 0.5.0, users can now run experiments using any other CUDA-Q decoder as the “inner” decoder, and vary the window size with simple parameter changes.

import cudaq
import cudaq_qec as qec
import numpy as np

cudaq.set_target('stim')
num_rounds = 5
code = qec.get_code('surface_code', distance=num_rounds)
noise = cudaq.NoiseModel()
noise.add_all_qubit_channel("x", cudaq.Depolarization2(0.001), 1)
statePrep = qec.operation.prep0
dem = qec.z_dem_from_memory_circuit(code, statePrep, num_rounds, noise)
inner_decoder_params = {'use_osd': True, 'max_iterations': 50, 'use_sparsity': True}
opts = {
    'error_rate_vec': np.array(dem.error_rates),
    'window_size': 1,
    'num_syndromes_per_round': dem.detector_error_matrix.shape[0] // num_rounds,
    'inner_decoder_name': 'nv-qldpc-decoder',
    'inner_decoder_params': inner_decoder_params,
}
swdec = qec.get_decoder('sliding_window', dem.detector_error_matrix, **opts)

Each syndrome extraction round must produce a constant number of measurements. The decoder makes no assumptions about temporal correlations or periodicity in the underlying noise, so users have maximal flexibility to investigate noise that varies from round to round.
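
Because every round contributes the same number of syndromes, a window is just a fixed-width slice of the full syndrome vector. The sketch below shows that index arithmetic in plain Python. The helper `window_slices` and its `step_size` argument are assumptions made for illustration, and the sketch does not model how the decoder commits corrections between windows.

```python
def window_slices(num_rounds, num_syndromes_per_round, window_size, step_size=1):
    """Return (start, stop) syndrome index ranges, one per window."""
    if window_size < 1 or step_size < 1 or window_size > num_rounds:
        raise ValueError("window_size and step_size must be >= 1 and fit in num_rounds")
    slices = []
    # Slide the window forward step_size rounds at a time.
    for first_round in range(0, num_rounds - window_size + 1, step_size):
        start = first_round * num_syndromes_per_round
        stop = (first_round + window_size) * num_syndromes_per_round
        slices.append((start, stop))
    return slices


# 5 rounds of 4 syndromes each, decoded 3 rounds at a time.
slices = window_slices(num_rounds=5, num_syndromes_per_round=4, window_size=3)
print(slices)
```

Smaller windows let decoding start sooner but give the inner decoder less context per call, which is the latency versus logical error rate tradeoff described above.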

Getting started with CUDA-Q QEC

CUDA-Q QEC 0.5.0 brings a wide range of tools to quantum error correction researchers and QPU operators, to accelerate research into operationalizing fault-tolerant quantum computers.

To get started with CUDA-Q QEC, run pip install cudaq-qec and see the CUDA-Q QEC documentation.