cuQuantum
Accelerate quantum computing research.
Quantum computing has the potential to offer giant leaps in computational capabilities. The ability of scientists, developers, and researchers to simulate quantum circuits on classical computers is vital to getting us there.
NVIDIA cuQuantum is an SDK of optimized libraries and tools for accelerating quantum computing workflows. With NVIDIA Tensor Core GPUs, developers can use cuQuantum to accelerate quantum circuit simulations based on state vector and tensor network methods by orders of magnitude.
Quick Links
cuQuantum Appliance
A full simulation stack based on cuQuantum in a ready-to-deploy container.
Documentation
Documentation for cuQuantum and the cuQuantum Appliance.
GitHub
The cuQuantum public repository, including the cuQuantum Python bindings and examples.
Latest Notes
The cuQuantum release notes, including the latest features.
NVIDIA cuQuantum Appliance
cuQuantum Appliance helps developers get started by making simulation software available in a container optimized to run on the latest NVIDIA DGX™ and HGX™ systems.
The stack includes Google’s Cirq framework and qsim simulator along with NVIDIA cuQuantum.
The appliance software achieved best-in-class performance on key problems in quantum computing, including Shor’s algorithm, random quantum circuits, and quantum Fourier transform. Recent software updates to our container offering have enabled a 4.4X speedup over previously reported numbers. Combined with ~2x speedups offered by Hopper GPUs, users see even greater speedups over CPU implementations despite CPU hardware and software improvements.
cuQuantum Appliance is available now in the NVIDIA® NGC™ catalog and as a machine image on each major cloud marketplace.
Multi-GPU Speedups
cuQuantum Appliance speeds up simulations of popular quantum algorithms like quantum Fourier transform, Shor’s algorithm, and quantum supremacy circuits by 90-369x on NVIDIA H100 80GB Tensor Core GPUs over CPU implementations on dual Intel Xeon Platinum 8480C CPUs.
Multi-Node Speedups
Performance is benchmarked using Quantum Volume at depths of 10 and 30, along with QAOA and a small Quantum Phase Estimation, run on NVIDIA H100 80GB GPUs. On average, cuQuantum on H100 GPUs is ~2x faster than on A100 GPUs.
Our latest multi-node update introduces support for IBM’s Qiskit Aer, which enables users to scale their Qiskit code with no code changes to the largest NVIDIA machines.
This new capability enables users of the NVIDIA Quantum platform to achieve the most performant quantum circuit simulations at supercomputer scales. On key problems like Quantum Phase Estimation, QAOA, Quantum Volume, and more, the newest cuQuantum Appliance is over two orders of magnitude faster than previous implementations, and seamlessly scales from a single GPU to a supercomputer.
cuQuantum Appliance users are only restricted by the number of GPUs they have access to.
Features and Benefits
Flexible
Choose the best approach for your work from algorithm-agnostic accelerated quantum circuit simulation methods.
State vector method features include optimized memory management and math kernels, efficient index bit swaps, gate application kernels, and probability array calculations for qubit sets.
Tensor network method features include accelerated tensor and tensor network contraction, order optimization, approximate contractions, and multi-GPU contractions.
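To make "probability array calculations for qubit sets" concrete, here is a minimal CPU-only sketch in plain NumPy. It does not use the cuQuantum API. The function name `subset_probabilities` and the convention that qubit 0 is the most significant bit are assumptions made for this illustration.

```python
import numpy as np

def subset_probabilities(state, n_qubits, keep):
    """Marginal measurement probabilities over the qubits listed in `keep`.

    Illustrative helper (hypothetical name); qubit 0 is the first tensor axis.
    """
    # View the 2**n amplitudes as an n-axis tensor and square the magnitudes.
    probs = np.abs(state.reshape([2] * n_qubits)) ** 2
    # Sum out every qubit that is not in the requested set.
    drop = tuple(q for q in range(n_qubits) if q not in keep)
    marginal = probs.sum(axis=drop)
    # Remaining axes are in ascending qubit order; reorder them to match `keep`.
    kept_sorted = sorted(keep)
    marginal = np.transpose(marginal, [kept_sorted.index(q) for q in keep])
    return marginal.reshape(-1)

# Example: qubits 0 and 1 in a Bell pair, qubit 2 left in |0>.
state = np.zeros(8, dtype=np.complex128)
state[0b000] = state[0b110] = 1 / np.sqrt(2)

p01 = subset_probabilities(state, 3, [0, 1])
p2 = subset_probabilities(state, 3, [2])
```

The full probability array is never enumerated outcome by outcome. The reduction is a single sum over the axes that are dropped, and this kind of operation maps well to a GPU.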
Scalable
Leverage the power of multi-node, multi-GPU clusters using the latest GPUs on premises or in the cloud.
Low-level C++ APIs provide increased control and flexibility for a single GPU and single-node multi-GPU clusters.
The high-level Python API supports drop-in multi-node execution.
Fast
Simulate bigger problems faster and get more work done sooner.
Using an NVIDIA H100 Tensor Core GPU over CPU implementations delivers orders-of-magnitude speedups on key quantum problems, including random quantum circuits, Shor’s algorithm, and the Variational Quantum Eigensolver.
Leveraging the NVIDIA Selene supercomputer, cuQuantum generated a sample from a full-circuit simulation of the Google Sycamore processor in less than 10 minutes.
Framework Integrations
cuQuantum is integrated with leading quantum circuit simulation frameworks.
Download cuQuantum to dramatically accelerate performance using your framework of choice, with zero code changes.
Performance
State Vector Method
Quantum Machine Learning
CPU vs Single GPU (1 thread and 32 thread comparisons)
Evaluation of the Jacobian of a strongly entangling layered circuit using adjoint backpropagation. lightning.gpu was run on an NVIDIA DGX A100 and compared to lightning.qubit on an EPYC 7742 CPU. Results are averaged across three runs.
State vector simulation tracks the entire state of the system over time, through each gate operation. It’s an excellent tool for simulating deep or highly entangled quantum circuits, and for simulating noisy qubits.
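As a rough sketch of what "tracking the entire state through each gate operation" means, the plain NumPy example below applies a Hadamard and a CNOT to a two-qubit state vector. It is a CPU illustration only, not the cuStateVec API. The helper names and the qubit ordering (qubit 0 as the most significant bit) are assumptions.

```python
import numpy as np

def apply_one_qubit_gate(state, gate, target, n_qubits):
    """Apply a 2x2 gate to `target` by contracting it with that tensor axis."""
    psi = state.reshape([2] * n_qubits)
    # Contract the gate's input index with the target qubit's axis.
    psi = np.tensordot(gate, psi, axes=([1], [target]))
    # tensordot puts the gate's output index first; move it back into place.
    psi = np.moveaxis(psi, 0, target)
    return psi.reshape(-1)

def apply_cnot(state, control, target, n_qubits):
    """Flip `target` on the half of the amplitudes where `control` is 1."""
    psi = state.reshape([2] * n_qubits).copy()
    index = [slice(None)] * n_qubits
    index[control] = 1
    sub = psi[tuple(index)]
    # Indexing removed the control axis, so later axes shift down by one.
    axis = target - 1 if target > control else target
    psi[tuple(index)] = np.flip(sub, axis=axis).copy()
    return psi.reshape(-1)

hadamard = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

state = np.zeros(4, dtype=np.complex128)
state[0] = 1.0  # start in |00>
state = apply_one_qubit_gate(state, hadamard, target=0, n_qubits=2)
bell = apply_cnot(state, control=0, target=1, n_qubits=2)
```

Every gate touches all 2^n amplitudes, which is why the cost of this method is set by qubit count rather than circuit depth.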
An NVIDIA DGX™ A100 system with eight NVIDIA A100 80GB Tensor Core GPUs can simulate up to 36 qubits, delivering an orders-of-magnitude speedup on leading state vector simulations over a dual-socket CPU server.
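A back-of-the-envelope check shows where a 36-qubit ceiling could come from. The sketch assumes single-precision complex amplitudes (8 bytes each), treats 80 GB as 80 × 10^9 bytes, and ignores workspace and communication buffers. None of these assumptions are stated in the text above.

```python
def state_vector_bytes(n_qubits, bytes_per_amplitude=8):
    """Memory for a full state vector: 2**n amplitudes of the given size."""
    return (2 ** n_qubits) * bytes_per_amplitude

# Assumption: eight GPUs with 80 * 10**9 bytes each, all usable for the state.
total_gpu_bytes = 8 * 80 * 10**9

single_36 = state_vector_bytes(36, 8)    # complex64: 2**39 bytes = 512 GiB
double_36 = state_vector_bytes(36, 16)   # complex128: 2**40 bytes = 1 TiB
single_37 = state_vector_bytes(37, 8)    # one more qubit doubles the footprint

fits_single_36 = single_36 <= total_gpu_bytes
fits_double_36 = double_36 <= total_gpu_bytes
fits_single_37 = single_37 <= total_gpu_bytes
```

Under these assumptions, 36 qubits in single precision fit in the aggregate GPU memory. Neither 36 qubits in double precision nor 37 qubits in single precision fit.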
cuStateVec has been adopted by leading publicly available simulators, including integrations into AWS Braket, Google Cirq's qsim simulator, the IBM Qiskit Aer simulator, and Xanadu’s PennyLane Lightning simulator. Users leveraging lightning.gpu on AWS Braket experienced 900X speedups and saved 3.5X on costs. cuStateVec will soon support an even wider range of frameworks and simulators. Read the NVIDIA Technical Blog for more details.
Tensor Network Method
Tensor network methods are rapidly gaining popularity to simulate hundreds or thousands of qubits for near-term quantum algorithms. Tensor networks scale with the number of quantum gates rather than the number of qubits. This makes it possible to simulate very large qubit counts with smaller gate counts on large supercomputers.
Tensor contractions dramatically reduce the memory requirement for running a circuit on a tensor network simulator. The research community is investing heavily in improving pathfinding methods for quickly finding near-optimal tensor contractions before running a simulation.
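The effect of contraction order on memory can be seen even in a three-tensor chain. The sketch below uses NumPy's `einsum_path` as a stand-in pathfinder. It is not cuTensorNet, and the tensor shapes are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 2))
b = rng.standard_normal((2, 64))
c = rng.standard_normal((64, 2))

expr = "ij,jk,kl->il"

# Two valid pairwise orders produce very different intermediate tensors.
left_first = a @ b    # 64 x 64 intermediate
right_first = b @ c   # 2 x 2 intermediate

# Let NumPy search for a contraction order, then reuse that path.
path, report = np.einsum_path(expr, a, b, c, optimize="greedy")
result = np.einsum(expr, a, b, c, optimize=path)
```

Both orders give the same final tensor, but one needs a 4,096-element intermediate and the other a 4-element one. In circuit-sized networks the gap between a good and a bad path grows rapidly, which is why pathfinding is treated as its own stage.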
cuTensorNet provides state-of-the-art performance for both the pathfinding and contraction stages of tensor network simulation. See the NVIDIA Technical Blog for more details.
Using cuQuantum, NVIDIA researchers were able to simulate a variational quantum algorithm for solving the MaxCut optimization problem using 1,688 qubits to encode 3,375 vertices on an NVIDIA DGX SuperPOD™ system, a 16X improvement over the previous largest simulation — and multiple orders of magnitude larger than the largest problem run on quantum hardware to date.
Pathfinding and Contraction Performance
State-of-the-Art Performance for Pathfinding
Performance for cuTensorNet pathfinding compared to Cotengra, in seconds per sample. Both runs use a single core of an EPYC 7742 CPU.
Sycamore refers to 53-qubit random quantum circuits of depth 10, 12, 14, and 20 from Arute et al., Quantum Supremacy Using a Programmable Superconducting Processor.
www.nature.com/articles/s41586-019-1666-5
Cotengra: Gray & Kourtis, Hyper-optimized Tensor Network Contraction, 2021.
quantum-journal.org/papers/q-2021-03-15-410
State-of-the-Art Performance for Contraction Time
Contraction performance for cuTensorNet compared to Torch, CuPy, and NumPy. All runs use the same best contraction path. cuTensorNet, CuPy, and Torch all ran on one NVIDIA A100 GPU; NumPy ran on a single-socket EPYC 7742. CuPy and NumPy cannot execute Sycamore depth 12 and 14 because they limit tensor rank to a maximum of 32, and both circuits contain tensors that exceed this limit.
BQSKit: circuits with 48 and 64 qubits: Berkeley Quantum Synthesis Toolkit https://github.com/BQSKit/bqskit
QAOA: 36 qubits with 4 parameters
PEPS: tensor network with dimensions of 3x3 and operator depth 30.
Approximate Tensor Network Methods
MPS gate split performance is measured in execution time as a function of bond dimension. We execute this on an NVIDIA A100 80GB GPU and compare it to NumPy running on an EPYC 7742 data center CPU.
As the quantum problems of interest can greatly vary in both size and complexity, researchers have developed highly customized approximate tensor network algorithms to address the gamut of possibilities. To enable easy integration with these frameworks and libraries, cuTensorNet provides a set of APIs to cover the following common use cases: Tensor QR, Tensor SVD, and Gate Split.
These primitives enable users to accelerate and scale different types of quantum circuit simulators. A common approach to simulating quantum computers which takes advantage of these methods is matrix product states (MPS, also known as tensor train). Users can leverage these new cuTensorNet APIs to accelerate MPS-based quantum circuit simulators.
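For readers unfamiliar with the gate split operation in an MPS simulator, the following NumPy sketch shows the idea on the CPU: contract two neighboring site tensors with a two-qubit gate, then split the result with a truncated SVD. The function name, index layout, and truncation rule are assumptions for this illustration and do not reflect the cuTensorNet API.

```python
import numpy as np

def gate_split(a, b, gate, max_bond):
    """Apply a 4x4 two-qubit gate to adjacent MPS sites and re-split them.

    a: (left, 2, mid) site tensor, b: (mid, 2, right) site tensor.
    Returns the new site tensors and the retained singular values.
    """
    left, right = a.shape[0], b.shape[2]
    g = gate.reshape(2, 2, 2, 2)  # (out_a, out_b, in_a, in_b)
    # Merge both sites and the gate into one four-index tensor.
    theta = np.einsum("lpm,mqr,abpq->labr", a, b, g)
    u, s, vh = np.linalg.svd(theta.reshape(left * 2, 2 * right),
                             full_matrices=False)
    # Keep at most `max_bond` non-negligible singular values.
    k = max(1, min(max_bond, int(np.count_nonzero(s > 1e-12))))
    u, s, vh = u[:, :k], s[:k], vh[:k, :]
    new_a = u.reshape(left, 2, k)
    new_b = (s[:, None] * vh).reshape(k, 2, right)
    return new_a, new_b, s

# Two sites in |00>, then a Hadamard on the first site followed by a CNOT.
site = np.zeros((1, 2, 1))
site[0, 0, 0] = 1.0
hadamard = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
cnot = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]])
gate = cnot @ np.kron(hadamard, np.eye(2))

new_a, new_b, s = gate_split(site, site, gate, max_bond=4)
amplitudes = np.einsum("lpm,mqr->lpqr", new_a, new_b).reshape(-1)
```

The bond dimension `k` controls the trade-off: a larger bond keeps more entanglement at higher cost, while a smaller one yields an approximate state. The SVD dominates the runtime of this step.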
The Gate Split and Tensor SVD APIs enable nearly an order-of-magnitude speedup over state-of-the-art CPU implementations. Tensor QR is the most efficient, with a speedup of nearly two orders of magnitude over the same EPYC 7742 data center CPU.
Resources
- Watch GTC sessions
- Read NVIDIA blog posts
- NVIDIA, Rolls-Royce and Classiq Announce Quantum Computing Breakthrough for Computational Fluid Dynamics in Jet Engines
- Enabling Matrix Product State–Based Quantum Circuit Simulation with NVIDIA cuQuantum
- Best-in-Class Quantum Circuit Simulation at Scale with NVIDIA cuQuantum Appliance
- Achieving Supercomputing-Scale Quantum Circuit Simulation with the cuQuantum Appliance
- Growing Range of Researchers, Scientists Adopt NVIDIA cuQuantum
- NVIDIA Teams With Google Quantum AI, IBM and Other Leaders to Speed Research in Quantum Computing
- NVIDIA Sets World Record for Quantum Computing Simulation With cuQuantum Running on DGX SuperPOD
- What Is Quantum Computing?
- Accelerating Quantum Circuit Simulation with NVIDIA cuStateVec
- Scaling Quantum Circuit Simulation with NVIDIA cuTensorNet
- What is a QPU?