Accelerate Quantum Computing Research

Quantum computing has the potential to offer giant leaps in computational capabilities. The ability of scientists, developers, and researchers to simulate quantum circuits on classical computers is vital to getting us there.

The research community across academia, laboratories, and industry is using simulators to help design and verify algorithms to run on quantum computers. These simulators capture the properties of superposition and entanglement and are built on quantum circuit simulation frameworks.

NVIDIA cuQuantum is an SDK of optimized libraries and tools for accelerating quantum computing workflows. With NVIDIA Tensor Core GPUs, developers can use cuQuantum to speed up quantum circuit simulations based on state vector and tensor network methods by orders of magnitude.

Quick Links

cuQuantum Appliance

A full simulation stack based on cuQuantum in a ready-to-deploy container.


Documentation for cuQuantum and the cuQuantum Appliance.


The cuQuantum public repository, including the cuQuantum Python bindings and examples.


The NVIDIA quantum computing developer forums.

NVIDIA cuQuantum Appliance

cuQuantum Appliance helps developers get started by making simulation software available in a container optimized to run on the latest NVIDIA DGX™ and HGX systems.

The stack includes Google’s Cirq framework and qsim simulator along with NVIDIA cuQuantum.

The appliance software achieved best-in-class performance on key problems in quantum computing, including Shor’s algorithm, random quantum circuits, and the quantum Fourier transform. Recent software-only updates to our container offering have delivered a 4.4x speedup over previously reported numbers.

cuQuantum Appliance is available now in the NVIDIA® NGC™ catalog.

Multi-GPU Speedups

cuQuantum Appliance speeds up simulations of popular quantum algorithms like the quantum Fourier transform, Shor’s algorithm, and quantum supremacy circuits by 70x to 290x over CPU implementations running on a dual AMD EPYC 7742 CPU system.

Features and Benefits



Choose the best approach for your work from algorithm-agnostic accelerated quantum circuit simulation methods.

State vector method features include optimized memory management and math kernels, efficient index bit swaps, gate application kernels, and probability array calculations for qubit sets.
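To make "index bit swaps" concrete, here is a minimal pure-Python sketch of the idea, not the cuStateVec API. It assumes the usual convention that bit k of an amplitude's index is the value of qubit k; the function name swap_index_bits is hypothetical. Distributed simulators commonly use this kind of reordering so that a qubit addressed by a high-order index bit can be moved into a low-order, locally stored position before a gate is applied to it.

```python
def swap_index_bits(state, bit_a, bit_b):
    """Return a copy of the state vector with index bits bit_a and bit_b exchanged.

    Hypothetical helper for illustration only; it is not a cuStateVec call.
    """
    mask = (1 << bit_a) | (1 << bit_b)
    out = [0j] * len(state)
    for i, amp in enumerate(state):
        a = (i >> bit_a) & 1
        b = (i >> bit_b) & 1
        # If the two bits differ, flipping both exchanges them; otherwise the index is unchanged.
        j = i ^ mask if a != b else i
        out[j] = amp
    return out


# Each amplitude is set to its own index so the permutation is easy to trace.
state = [complex(i) for i in range(8)]
swapped = swap_index_bits(state, 0, 2)
restored = swap_index_bits(swapped, 0, 2)
```

Swapping the same pair of bits twice restores the original ordering, which is why a simulator can swap a qubit in, apply gates, and swap it back out.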

Tensor network method features include accelerated tensor and tensor network contraction, order optimization, approximate contractions, and multi-GPU contractions.



Leverage the power of multi-node, multi-GPU clusters using the latest GPUs on premises or in the cloud.

Low-level C++ APIs provide increased control and flexibility for a single GPU and single-node multi-GPU clusters.

The high-level Python API supports drop-in multi-node execution.



Simulate bigger problems faster and get more work done sooner.

Using an NVIDIA A100 Tensor Core GPU over CPU implementations delivers orders-of-magnitude speedups on key quantum problems, including random quantum circuits, Shor’s algorithm, and the Variational Quantum Eigensolver.

Leveraging the NVIDIA Selene supercomputer, cuQuantum generated a sample from a full-circuit simulation of the Google Sycamore processor in less than 10 minutes.

Framework Integrations

cuQuantum is integrated with leading quantum circuit simulation frameworks.

Download cuQuantum to dramatically accelerate performance using your framework of choice, with zero code changes.



State Vector Method

Quantum Machine Learning

CPU vs Single GPU (1 thread and 32 thread comparisons)

Evaluation of the Jacobian of a strongly entangling layered circuit using adjoint backpropagation. lightning.gpu was run on an NVIDIA DGX A100 and compared to lightning.qubit on an EPYC 7742 CPU. Results are averaged across three runs.

State vector simulation tracks the entire state of the system over time, through each gate operation. It’s an excellent tool for simulating deep or highly entangled quantum circuits, and for simulating noisy qubits.
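As a rough illustration of what "tracking the entire state through each gate" means, the sketch below stores all 2^n amplitudes in a plain Python list and updates them gate by gate. It is a toy CPU version written for clarity, not how cuStateVec is implemented; the function names are hypothetical, and it assumes bit k of an amplitude's index holds the value of qubit k.

```python
import math


def apply_single_qubit_gate(state, gate, target):
    """Apply a 2x2 gate to qubit `target`, updating the state vector in place."""
    stride = 1 << target
    for base in range(0, len(state), 2 * stride):
        for offset in range(stride):
            i0 = base + offset   # index where the target bit is 0
            i1 = i0 + stride     # same index with the target bit set to 1
            a0, a1 = state[i0], state[i1]
            state[i0] = gate[0][0] * a0 + gate[0][1] * a1
            state[i1] = gate[1][0] * a0 + gate[1][1] * a1


def apply_cnot(state, control, target):
    """Apply a CNOT by exchanging amplitude pairs whose control bit is 1."""
    cmask, tmask = 1 << control, 1 << target
    for i in range(len(state)):
        if (i & cmask) and not (i & tmask):
            j = i | tmask
            state[i], state[j] = state[j], state[i]


def marginal_probabilities(state, qubits):
    """Probability of each bitstring on the chosen qubits, summing over all others."""
    probs = [0.0] * (1 << len(qubits))
    for i, amp in enumerate(state):
        key = 0
        for pos, q in enumerate(qubits):
            key |= ((i >> q) & 1) << pos
        probs[key] += abs(amp) ** 2
    return probs


s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]

n = 3
state = [0j] * (1 << n)
state[0] = 1 + 0j                        # start in |000>
apply_single_qubit_gate(state, H, 0)     # superposition on qubit 0
apply_cnot(state, 0, 1)                  # entangle qubit 1
apply_cnot(state, 1, 2)                  # entangle qubit 2 (GHZ state)
ghz_probs = marginal_probabilities(state, [0, 2])
```

Every gate touches the whole amplitude array, so the work per gate grows as 2^n. That is the cost a GPU implementation parallelizes, and it is also why circuit depth and entanglement do not change the memory footprint.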

An NVIDIA DGX™ A100 system with eight NVIDIA A100 80GB Tensor Core GPUs can simulate up to 36 qubits, delivering an orders-of-magnitude speedup on leading state vector simulations over a dual-socket CPU server.
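The 36-qubit figure is consistent with a back-of-the-envelope memory estimate. The sketch below assumes single-precision complex amplitudes (8 bytes each), counts 80 GB as 80 x 10^9 bytes, and ignores workspace and communication buffers. None of these assumptions is stated in the text above, so treat the result as a sanity check rather than a specification.

```python
def state_vector_bytes(num_qubits, bytes_per_amplitude=8):
    """Memory needed to hold all 2**num_qubits amplitudes."""
    return (1 << num_qubits) * bytes_per_amplitude


def max_qubits(total_bytes, bytes_per_amplitude=8):
    """Largest qubit count whose full state vector fits in total_bytes."""
    n = 0
    while state_vector_bytes(n + 1, bytes_per_amplitude) <= total_bytes:
        n += 1
    return n


# Assumption: eight GPUs with 80 * 10**9 bytes each, all usable for the state vector.
DGX_A100_BYTES = 8 * 80 * 10**9

bytes_36 = state_vector_bytes(36)            # 549_755_813_888 bytes, about 512 GiB
single_precision = max_qubits(DGX_A100_BYTES, bytes_per_amplitude=8)
double_precision = max_qubits(DGX_A100_BYTES, bytes_per_amplitude=16)
```

Each added qubit doubles the memory requirement, so under these assumptions moving to double precision costs exactly one qubit.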

cuStateVec has been adopted by leading publicly available simulators, including integrations into AWS Braket, Google Cirq's qsim simulator, the IBM Qiskit Aer simulator, and Xanadu’s PennyLane Lightning simulator. Users of lightning.gpu on AWS Braket saw 900x speedups and saved 3.5x on costs. cuStateVec will soon support an even wider range of frameworks and simulators. Visit the NVIDIA Technical Blog for more details.

Tensor Network Method

Tensor network methods are rapidly gaining popularity as a way to simulate hundreds or thousands of qubits for near-term quantum algorithms. Tensor networks scale with the number of quantum gates rather than the number of qubits. This makes it possible to simulate very large qubit counts with smaller gate counts on large supercomputers.

Tensor contractions dramatically reduce the memory requirement for running a circuit on a tensor network simulator. The research community is investing heavily in improving pathfinding methods for quickly finding near-optimal tensor contractions before running a simulation.
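A small example shows why pathfinding matters. The sketch below is not cuTensorNet's pathfinder: it only scores two hand-written contraction orders for a three-tensor chain, using hypothetical helper names and made-up index sizes. It assumes every index shared by two tensors is summed over, and it counts one multiplication per combination of index values.

```python
from math import prod


def contract_pair(a, b, sizes):
    """Return (result_indices, multiply_count) for contracting two tensors.

    Tensors are given as tuples of index labels; shared labels are summed over.
    """
    shared = set(a) & set(b)
    result = tuple(ix for ix in a + b if ix not in shared)
    cost = prod(sizes[ix] for ix in set(a) | set(b))
    return result, cost


def path_cost(tensors, path, sizes):
    """Total multiplications and largest tensor produced along a contraction path.

    Each path step names two positions in the current tensor list; the result
    of the step is appended to the end of the list.
    """
    tensors = list(tensors)
    total, peak = 0, 0
    for i, j in path:
        b = tensors.pop(max(i, j))   # pop the higher position first
        a = tensors.pop(min(i, j))
        result, cost = contract_pair(a, b, sizes)
        total += cost
        peak = max(peak, prod(sizes[ix] for ix in result))
        tensors.append(result)
    return total, peak


# Illustrative sizes only: a chain A[i,j] B[j,k] C[k,l].
sizes = {"i": 10, "j": 1000, "k": 10, "l": 1000}
chain = [("i", "j"), ("j", "k"), ("k", "l")]

good = path_cost(chain, [(0, 1), (0, 1)], sizes)   # contract A with B first
bad = path_cost(chain, [(1, 2), (0, 1)], sizes)    # contract B with C first
```

With these sizes, contracting A with B first needs 100 times fewer multiplications and a 100 times smaller largest tensor than contracting B with C first. Real circuits have far more tensors and far more possible orders, which is why the search for a good path is a stage in its own right.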

cuTensorNet provides state-of-the-art performance for both the pathfinding and contraction stages of tensor network simulation. See the NVIDIA Technical Blog for more details.

Using cuQuantum, NVIDIA researchers were able to simulate a variational quantum algorithm for solving the MaxCut optimization problem using 1,688 qubits to encode 3,375 vertices on an NVIDIA DGX SuperPOD™ system. This is a 16x improvement over the previous largest simulation and multiple orders of magnitude larger than the largest problem run on quantum hardware to date.

Pathfinding and Contraction Performance

State-of-the-Art Performance for Pathfinding


Performance for cuTensorNet pathfinding compared to Cotengra, in seconds per sample. Both runs used a single core of an EPYC 7742 CPU.

Sycamore refers to 53-qubit random quantum circuits of depth 10, 12, 14, and 20 from Arute et al., Quantum Supremacy Using a Programmable Superconducting Processor.

Cotengra: Gray & Kourtis, Hyper-optimized Tensor Network Contraction, 2021.

State-of-the-Art Performance for Contraction Time

Contraction performance for cuTensorNet compared to Torch, CuPy, and NumPy. All runs use the same best contraction path. cuTensorNet, CuPy, and Torch each ran on one NVIDIA A100 GPU; NumPy ran on a single-socket EPYC 7742. CuPy and NumPy cannot execute the Sycamore depth-12 and depth-14 circuits: both libraries limit tensors to a maximum rank of 32, and both circuits contain tensors that exceed this limit.

BQSKit (Berkeley Quantum Synthesis Toolkit): circuits with 48 and 64 qubits
QAOA: 36 qubits with 4 parameters
PEPS: tensor network with dimensions of 3x3 and operator depth 30.

Get started with NVIDIA cuQuantum.

Download Now