
Real-Time Decoding, Algorithmic GPU Decoders, and AI Inference Enhancements in NVIDIA CUDA-Q QEC

Tom Lubowe, Ben Howe, Melody Ren, Scott Thornton, and Kevin Mato
Published on Dec 17, 2025 • NVIDIA Developer Blog

Real-time decoding is crucial to fault-tolerant quantum computers. By running decoders with low latency, concurrently with a quantum processing unit (QPU), we can apply corrections to the device within the coherence time. This prevents errors from accumulating, which would otherwise reduce the value of the results. Decoding can be done online, with a real quantum device, or offline, with a simulated quantum processor.

To help solve these problems and enable research into better solutions, NVIDIA CUDA-Q QEC version 0.5.0 includes a range of improvements. These include support for online real-time decoding, new GPU-accelerated algorithmic decoders, infrastructure for high-performance AI decoder inference, sliding window decoder support, and more Pythonic interfaces.

Real-Time Decoding Workflow with CUDA-Q QEC

Real-time decoding in CUDA-Q QEC follows a four-stage workflow. In order, the stages are:
1. DEM generation: Characterizing device error rates.
2. Decoder configuration: Setting parameters and generating config.
3. Decoder loading and initialization: Setting up the implementation in the runtime.
4. Real-time decoding: Enqueuing stabilizers inside active kernels.

Step 1: Generate Detector Error Model (DEM)

First, we characterize how the device errors behave during operation. Using a helper function, we can generate the detector error model (DEM) from a quantum code, noise model, and circuit parameters.

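import cudaq
import cudaq_qec as qec

print("Step 1: Generating DEM...")
cudaq.set_target("stim")

# Quantum code under test; distance 3 is an assumption chosen for this example
code = qec.get_code("surface_code", distance=3)

# Two-qubit depolarizing noise on every singly controlled X gate
noise = cudaq.NoiseModel()
noise.add_all_qubit_channel("x", cudaq.Depolarization2(0.01), 1)

# Z-basis memory experiment with 3 rounds of syndrome extraction
dem = qec.z_dem_from_memory_circuit(code, qec.operation.prep0, 3, noise)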
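As a rough illustration of what a DEM encodes, the sketch below uses plain NumPy rather than CUDA-Q QEC. The matrix, the error rates, and the helper name `syndrome_of` are made up for this example. The only thing carried over from the workflow is the layout of a detector error matrix: one row per detector and one column per error mechanism.

```python
import numpy as np

# Toy detector error matrix: 3 detectors (rows) x 4 error mechanisms (columns).
# Entry [d, e] = 1 means error mechanism e flips detector d.
# These values are illustrative and do not come from a real device.
detector_error_matrix = np.array([[1, 1, 0, 0],
                                  [0, 1, 1, 0],
                                  [0, 0, 1, 1]], dtype=np.uint8)

# Hypothetical probability of each error mechanism, one per column.
error_rates = np.array([0.01, 0.02, 0.01, 0.005])


def syndrome_of(errors: np.ndarray) -> np.ndarray:
    """Return the detector pattern (mod 2) produced by a set of errors."""
    return (detector_error_matrix @ errors) % 2


# Mechanisms 1 and 2 fire together: detector 1 is flipped twice and cancels.
errors = np.array([0, 1, 1, 0], dtype=np.uint8)
print(syndrome_of(errors))

# Drawing one random shot from the model: each mechanism fires independently.
rng = np.random.default_rng(0)
sampled = (rng.random(error_rates.size) < error_rates).astype(np.uint8)
print(sampled, syndrome_of(sampled))
```

A decoder solves the inverse problem: given only the detector pattern and the error rates, it estimates which mechanisms most likely fired.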

Step 2: Choose and Configure the Decoder

Using the DEM, the user configures the decoder and then saves this configuration to a YAML file. This file ensures that the decoders can correctly interpret the syndrome measurements.

# Create decoder config
config = qec.decoder_config()
config.id = 0
config.type = "nv-qldpc-decoder"
config.block_size = dem.detector_error_matrix.shape[1]
# For complete details check out:
# nvidia.github.io/cudaqx/examples_rst/qec/realtime_decoding.html

Steps 3 and 4: Load Config and Run Circuit

Before circuit execution, the user loads the YAML file. CUDA-Q QEC interprets the information, sets up the appropriate implementation in the decoder, and registers it with the CUDA-Q runtime.

# Save decoder config
with open("config.yaml", 'w') as f:
    f.write(config.to_yaml_str(200))

# Load config and run circuit
# (qec_circuit is the user's QEC kernel; it is not defined in this excerpt)
qec.configure_decoders_from_file("config.yaml")
run_result = cudaq.run(qec_circuit, shots_count=10)

GPU-Accelerated RelayBP Decoders

A recently developed decoder algorithm, RelayBP, addresses the pitfalls of belief propagation (BP) decoders, a popular class of algorithmic decoders for quantum low-density parity-check (QLDPC) codes. Traditional BP+OSD (belief propagation with ordered statistics decoding) runs a GPU-accelerated BP decoder first and falls back to an ordered statistics post-processing step on the CPU when BP fails. This CPU fallback makes the combined decoder hard to parallelize and to optimize for the low latency that real-time decoding requires.

RelayBP modifies BP by assigning a memory strength to each node of the graph, which controls how much that node remembers or forgets past messages. This dampens or breaks the harmful symmetries that usually trap BP and prevent it from converging.
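To make the idea of a memory strength concrete, here is a minimal NumPy sketch of a single min-sum BP run in which each variable node blends its initial bias with its previous marginal. It is not the CUDA-Q QEC implementation. The function name, the uniform priors, and the update rule Lambda_j(t) = (1 - gamma_j) * Lambda_j(0) + gamma_j * M_j(t-1) are assumptions made for illustration. The sketch covers one run only and does not chain several runs with different memory strengths.

```python
import numpy as np


def min_sum_bp_with_memory(H, syndrome, priors, gammas, max_iterations=50):
    """Syndrome-based min-sum BP with a per-variable memory strength.

    Assumes every check touches at least two variables.
    Returns (hard-decision errors, iterations used, converged flag).
    """
    H = np.asarray(H, dtype=bool)
    priors = np.asarray(priors, dtype=float)
    gammas = np.asarray(gammas, dtype=float)
    num_checks, num_vars = H.shape
    check_sign = 1.0 - 2.0 * np.asarray(syndrome, dtype=float)  # (-1)^s_i

    bias0 = np.log((1.0 - priors) / priors)  # initial log-likelihood ratios
    marginals = bias0.copy()                 # M_j(0) = Lambda_j(0)
    # check_to_var[i, j]: message from check i to variable j (0 off the graph).
    check_to_var = np.zeros((num_checks, num_vars))
    errors = np.zeros(num_vars, dtype=np.uint8)

    for iteration in range(1, max_iterations + 1):
        # Memory step: mix the initial bias with the previous marginal.
        bias = (1.0 - gammas) * bias0 + gammas * marginals

        # Variable-to-check messages exclude the recipient's own message.
        totals = check_to_var.sum(axis=0)
        var_to_check = np.where(H, bias + totals - check_to_var, 0.0)

        # Check-to-variable messages (min-sum rule with syndrome sign).
        new_check_to_var = np.zeros_like(check_to_var)
        for i in range(num_checks):
            neighbors = np.flatnonzero(H[i])
            for j in neighbors:
                msgs = var_to_check[i, neighbors[neighbors != j]]
                new_check_to_var[i, j] = (check_sign[i]
                                          * np.prod(np.sign(msgs))
                                          * np.min(np.abs(msgs)))
        check_to_var = new_check_to_var

        # Marginals and hard decision; stop once the syndrome is reproduced.
        marginals = bias + check_to_var.sum(axis=0)
        errors = (marginals < 0).astype(np.uint8)
        if np.array_equal((H.astype(np.uint8) @ errors) % 2, syndrome):
            return errors, iteration, True
    return errors, max_iterations, False


H = np.array([[1, 0, 0, 1, 0, 1, 1],
              [0, 1, 0, 1, 1, 0, 1],
              [0, 0, 1, 0, 1, 1, 1]], dtype=np.uint8)
syndrome = np.array([1, 0, 1], dtype=np.uint8)
priors = np.full(7, 0.1)   # assumed uniform error probability
gammas = np.full(7, 0.3)   # uniform memory strength for every variable
errors, iterations, converged = min_sum_bp_with_memory(H, syndrome, priors, gammas)
print(errors, iterations, converged)
```

Setting every entry of `gammas` to 0 recovers ordinary min-sum BP, which makes it easy to compare the two on the same syndrome.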

RelayBP Decoding Throughput
Figure 1: Peak decoding throughput (iterations/sec) for RelayBP FP32 on NVIDIA DGX GB200, measured for XYZ and XZ decoding of 1-Gross and 2-Gross codes.

Users can instantiate a RelayBP decoder easily with a few lines of code:

import numpy as np
import cudaq_qec as qec

# Simple 3x7 parity check matrix for demonstration
H_list = [[1, 0, 0, 1, 0, 1, 1],
          [0, 1, 0, 1, 1, 0, 1],
          [0, 0, 1, 0, 1, 1, 1]]
H = np.array(H_list, dtype=np.uint8)

# Configure relay parameters
srelay_config = {
    'pre_iter': 5,  # Run 5 iterations with gamma0 before relay legs
    'num_sets': 3,  # Use 3 relay legs
    'stopping_criterion': 'FirstConv'  # Stop after first convergence
}

# Create a decoder with Relay-BP
decoder_relay = qec.get_decoder("nv-qldpc-decoder",
                                H,
                                use_sparsity=True,
                                bp_method=3,
                                composition=1,
                                max_iterations=50,
                                gamma0=0.3,
                                gamma_dist=[0.1, 0.5],
                                srelay_config=srelay_config,
                                bp_seed=42)
print("Created decoder with Relay-BP")

# Decode a syndrome
syndrome = np.array([1, 0, 1], dtype=np.uint8)
decoded_result = decoder_relay.decode(syndrome)
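A quick sanity check on any decoder output is to confirm that the proposed correction reproduces the measured syndrome. The sketch below uses NumPy only. The `soft_output` list is a hand-written stand-in for a decoder's per-bit output, not something returned by CUDA-Q QEC, and the helper name and the 0.5 threshold are assumptions for illustration.

```python
import numpy as np

H = np.array([[1, 0, 0, 1, 0, 1, 1],
              [0, 1, 0, 1, 1, 0, 1],
              [0, 0, 1, 0, 1, 1, 1]], dtype=np.uint8)
syndrome = np.array([1, 0, 1], dtype=np.uint8)


def reproduces_syndrome(H, soft_output, syndrome, threshold=0.5):
    """Threshold soft outputs to bits and check H @ e == syndrome (mod 2)."""
    hard = (np.asarray(soft_output, dtype=float) > threshold).astype(np.uint8)
    return hard, bool(np.array_equal((H @ hard) % 2, syndrome))


# Hypothetical per-bit error probabilities standing in for a decoder result.
soft_output = [0.02, 0.01, 0.03, 0.05, 0.04, 0.97, 0.06]
hard, ok = reproduces_syndrome(H, soft_output, syndrome)
print(hard, ok)
```

A correction that fails this check cannot be the right one, so the check is a cheap first filter before comparing logical error rates.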

AI Decoder Inference with TensorRT

AI decoders are becoming increasingly popular for handling specific error models, where they can offer better accuracy or lower latency than algorithmic decoders. Once a model is trained, users can export it to ONNX and run it with the NVIDIA TensorRT-based AI decoder inference engine in CUDA-Q QEC.

import cudaq_qec as qec
import numpy as np
 
# A placeholder matrix is provided here to satisfy the API
H = np.array([[1, 0, 0, 1, 0, 1, 1],
              [0, 1, 0, 1, 1, 0, 1],
              [0, 0, 1, 0, 1, 1, 1]], dtype=np.uint8)
 
# Create TensorRT decoder from ONNX model
decoder = qec.get_decoder("trt_decoder", H,
                          onnx_load_path="ai_decoder.onnx")
 
# Decode a syndrome
syndrome = np.array([1.0, 0.0, 1.0], dtype=np.float32)
result = decoder.decode(syndrome)
print(f"Predicted error: {result}")

Sliding Window Decoding

Sliding window decoders handle circuit-level noise across multiple syndrome extraction rounds. They begin processing syndromes before the complete measurement sequence has been received, which reduces latency, although it can slightly increase logical error rates.

import cudaq
import cudaq_qec as qec
import numpy as np
 
cudaq.set_target('stim')
num_rounds = 5
code = qec.get_code('surface_code', distance=num_rounds)
noise = cudaq.NoiseModel()
noise.add_all_qubit_channel("x", cudaq.Depolarization2(0.001), 1)
statePrep = qec.operation.prep0
dem = qec.z_dem_from_memory_circuit(code, statePrep, num_rounds, noise)
inner_decoder_params = {'use_osd': True, 'max_iterations': 50, 'use_sparsity': True}
opts = {
    'error_rate_vec': np.array(dem.error_rates),
    'window_size': 1,
    'num_syndromes_per_round': dem.detector_error_matrix.shape[0] // num_rounds,
    'inner_decoder_name': 'nv-qldpc-decoder',
    'inner_decoder_params': inner_decoder_params,
}
swdec = qec.get_decoder('sliding_window', dem.detector_error_matrix, **opts)
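The windowing itself can be pictured with a few lines of plain Python. The helper below is a hypothetical sketch, not part of CUDA-Q QEC. It assumes syndromes are laid out round by round, and it introduces a `step_size` parameter (how many rounds the window advances each time) purely for illustration.

```python
def window_slices(num_rounds, window_size, step_size, num_syndromes_per_round):
    """Return (start, stop) syndrome index ranges, one per window.

    This sketch ignores any trailing rounds that do not fill a whole step.
    """
    if not 1 <= step_size <= window_size <= num_rounds:
        raise ValueError("need 1 <= step_size <= window_size <= num_rounds")
    slices = []
    for first_round in range(0, num_rounds - window_size + 1, step_size):
        start = first_round * num_syndromes_per_round
        stop = (first_round + window_size) * num_syndromes_per_round
        slices.append((start, stop))
    return slices


# 5 rounds of 4 syndromes each, decoded 3 rounds at a time, advancing 1 round.
for start, stop in window_slices(5, 3, 1, 4):
    print(f"decode syndromes [{start}, {stop})")
```

Each window can be handed to the inner decoder as soon as its last round arrives, which is where the latency benefit comes from. Smaller windows start earlier but see less context.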

About the Authors

Tom Lubowe

Tom Lubowe is the Senior Product Manager for NVIDIA Ising, cuQuantum, and cuTENSOR. Prior to joining NVIDIA, he led product management teams focused on quantum computing and machine learning at GenMat, Rigetti, and Everettian Technologies.

Ben Howe

Ben Howe is a Senior CUDA-Q Software Engineer at NVIDIA. He specializes in designing hybrid classical-quantum software systems. Prior to NVIDIA, he was an Engineering Fellow at RTX developing real-time signal processing and HPC applications.

Melody Ren

Melody Ren is a Senior Quantum Software Engineer on the CUDA-QX team at NVIDIA. Her work is focused on quantum error correction. Previously, she was a developer for the Intel Quantum SDK.

Comments & Discussion

ae_pascal January 28, 2026
The plot says iterations/sec. Are those BP iterations (e.g. related to the max_iterations argument passed to the decoder), or decoding iterations, with each iteration being one syndrome decoded?
bhowe February 12, 2026
Those are BP iterations, related to the max_iterations, not full syndrome decode cycles. Also, note that the RelayBP algorithm can be configured to do multiple legs with each leg having up to max_iterations iterations.

The reason we provide BP iteration per second rather than full decodes per second is because the number of BP iterations required for each syndrome is highly dependent on the exact syndrome, so any timing for full decode cycles is only valid for a very specific error rate under a very specific noise model. But the number of BP iterations per second is relatively constant, and is hence a more reproducible timing metric.
ae_pascal February 12, 2026
Makes sense. It's easy to get back the actual iteration counts from the decoder using the "num_iter" additional option. Was the experiment a circuit level noise simulation or something simpler, like code capacity where DEMs were not used?
bhowe February 12, 2026
This was for a circuit-level noise simulation. The peak performance was measured with an optimized build of CUDA-Q QEC 0.5.0.