# cuQuantum

## Accelerate Quantum Computing Research

Quantum computing has the potential to offer giant leaps in computational capabilities. The ability of scientists, developers, and researchers to simulate quantum circuits on classical computers is vital to getting us there.

The research communities across academia, laboratories, and industries are using simulators to help design and verify algorithms to run on quantum computers. These simulators capture the properties of superposition and entanglement and are built on quantum circuit simulation frameworks.

**NVIDIA cuQuantum** is an SDK of optimized libraries and tools for accelerating quantum computing workflows. With NVIDIA Tensor Core GPUs, developers can use cuQuantum to speed up quantum circuit simulations based on state vector and tensor network methods by orders of magnitude.

## Quick Links

### cuQuantum Appliance

A full simulation stack based on cuQuantum in a ready-to-deploy container.

### Documentation

Documentation for cuQuantum and the cuQuantum Appliance.

### GitHub

The cuQuantum public repository, including the cuQuantum Python bindings and examples.

### Forum

The NVIDIA quantum computing developer forums.

## NVIDIA cuQuantum Appliance

cuQuantum Appliance helps developers get started by making simulation software available in a container optimized to run on the latest NVIDIA DGX™ and HGX systems.

The stack includes Google’s Cirq framework and qsim simulator along with NVIDIA cuQuantum.

The appliance software achieved best-in-class performance on key problems in quantum computing, including Shor’s algorithm, random quantum circuits, and the quantum Fourier transform. Recent software-only updates to our container offering have enabled a 4.4x speedup over previously reported numbers.

cuQuantum Appliance is available now in the NVIDIA® NGC™ catalog.

### Multi-GPU Speedups

cuQuantum Appliance speeds up simulations of popular quantum algorithms like the quantum Fourier transform, Shor’s algorithm, and quantum supremacy circuits by 70x-290x over CPU implementations on a dual AMD EPYC 7742 CPU system.

### Multi-Node Speedups

Performance was benchmarked using Quantum Volume circuits with a depth of 10, run on ABCI 2.0 Compute Node (A), which is based on the NVIDIA A100 40GB GPU. We compared against the published results for Qiskit Aer multi-node on the same GPUs (deployed elsewhere) and the mpiQulacs results run on Todoroki, a supercomputer cluster based on the Fugaku reference architecture and the A64FX CPU.

Our latest multi-node update introduces a solver backend for IBM’s Qiskit Aer, which enables users to scale their Qiskit code to the largest NVIDIA machines with no code changes.

The Appliance achieved weak and strong scaling on key problems like Quantum Volume, QAOA, and Quantum Phase Estimation. Of the published data available for comparison, Quantum Volume at a depth of 10 was the most complete. Users may gain almost two orders of magnitude of speedup over previous Qiskit implementations on the same hardware.

This new capability enables users of the NVIDIA platform and Qiskit to achieve the most performant quantum circuit simulations at supercomputer scales.

cuQuantum Appliance users are restricted only by the number of GPUs they have access to.

cuQuantum Appliance is available now in the NVIDIA® NGC™ catalog.

## Features and Benefits

### Flexible

**Choose the best approach for your work from algorithm-agnostic accelerated quantum circuit simulation methods.**

**State vector method** features include optimized memory management and math kernels, efficient index bit swaps, gate application kernels, and probability array calculations for qubit sets.

**Tensor network method** features include accelerated tensor and tensor network contraction, order optimization, approximate contractions, and multi-GPU contractions.

### Scalable

**Leverage the power of multi-node, multi-GPU clusters using the latest GPUs on premises or in the cloud.**

**Low-level C++ APIs** provide increased control and flexibility for a single GPU and single-node multi-GPU clusters.

The **high-level Python API** supports drop-in multi-node execution.

### Fast

**Simulate bigger problems faster and get more work done sooner**.

Using an NVIDIA A100 Tensor Core GPU over CPU implementations delivers orders-of-magnitude speedups on key quantum problems, including **random quantum circuits, Shor’s algorithm,** and the **Variational Quantum Eigensolver**.

Leveraging the NVIDIA Selene supercomputer, cuQuantum generated a sample from a **full-circuit simulation** of the **Google Sycamore processor** in less than 10 minutes.

## Framework Integrations

cuQuantum is integrated with leading quantum circuit simulation frameworks.

Download cuQuantum to dramatically accelerate performance using your framework of choice, with zero code changes.

## Performance

### State Vector Method

**Quantum Machine Learning: CPU vs. single GPU (1-thread and 32-thread comparisons)**

Evaluation of the Jacobian of a strongly entangling layered circuit using adjoint backpropagation. lightning.gpu was run on an NVIDIA DGX A100 and compared to lightning.qubit on an EPYC 7742 CPU. Results are averaged across 3 runs.

State vector simulation tracks the entire state of the system over time, through each gate operation. It’s an excellent tool for simulating deep or highly entangled quantum circuits, and for simulating noisy qubits.
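To make "tracking the entire state through each gate operation" concrete, here is a minimal pure-Python sketch of a state vector simulator. It is an illustration only and does not use the cuStateVec API. The function names (`apply_single_qubit_gate`, `apply_cnot`, `probabilities`) and the convention that qubit 0 is the least significant bit of the basis-state index are assumptions made for this example.

```python
import math

def apply_single_qubit_gate(state, gate, target):
    """Apply a 2x2 gate to qubit `target` of a state vector (list of complex).

    Assumed convention: qubit 0 is the least significant bit of the index.
    """
    new_state = list(state)
    stride = 1 << target
    for i in range(len(state)):
        if (i & stride) == 0:        # i has target bit 0, j has target bit 1
            j = i | stride
            a0, a1 = state[i], state[j]
            new_state[i] = gate[0][0] * a0 + gate[0][1] * a1
            new_state[j] = gate[1][0] * a0 + gate[1][1] * a1
    return new_state

def apply_cnot(state, control, target):
    """Swap amplitude pairs that differ in `target` wherever `control` is 1."""
    new_state = list(state)
    for i in range(len(state)):
        if ((i >> control) & 1) == 1 and ((i >> target) & 1) == 0:
            j = i | (1 << target)
            new_state[i], new_state[j] = state[j], state[i]
    return new_state

def probabilities(state):
    """Measurement probability of each computational basis state."""
    return [abs(amplitude) ** 2 for amplitude in state]

# Hadamard gate
s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]

# Prepare a two-qubit Bell state: start in |00>, apply H to qubit 0, then CNOT.
state = [1 + 0j, 0j, 0j, 0j]
state = apply_single_qubit_gate(state, H, target=0)
state = apply_cnot(state, control=0, target=1)
probs = probabilities(state)
print(probs)
```

Every gate touches all 2^n amplitudes, which is why optimized gate application kernels and memory management matter so much for this method.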

An NVIDIA DGX™ A100 system with eight NVIDIA A100 80GB Tensor Core GPUs can simulate up to 36 qubits, delivering an orders-of-magnitude speedup on leading state vector simulations over a dual-socket CPU server.
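The 36-qubit figure can be sanity-checked with simple arithmetic. The sketch below rests on assumptions that are not stated in the text above: each amplitude is stored as a single-precision complex number (8 bytes), "80GB" means 80 × 10^9 bytes, and no memory is reserved for workspace or communication buffers. The helper names are hypothetical.

```python
def state_vector_bytes(num_qubits, bytes_per_amplitude=8):
    """Memory needed to hold all 2**num_qubits amplitudes."""
    return (2 ** num_qubits) * bytes_per_amplitude

def max_qubits(total_bytes, bytes_per_amplitude):
    """Largest qubit count whose full state vector fits in total_bytes."""
    n = 0
    while state_vector_bytes(n + 1, bytes_per_amplitude) <= total_bytes:
        n += 1
    return n

# Assumed hardware budget: eight GPUs with 80 * 10**9 bytes each.
TOTAL_GPU_BYTES = 8 * 80 * 10**9

single_precision_qubits = max_qubits(TOTAL_GPU_BYTES, bytes_per_amplitude=8)
double_precision_qubits = max_qubits(TOTAL_GPU_BYTES, bytes_per_amplitude=16)

print(state_vector_bytes(36, 8))    # bytes for 36 qubits at 8 bytes per amplitude
print(single_precision_qubits, double_precision_qubits)
```

Under these assumptions, each additional qubit doubles the memory requirement, so a 37th qubit would need roughly 1.1 TB and no longer fit in the combined GPU memory.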

cuStateVec has been adopted by leading publicly available simulators, including integrations into AWS Braket, Google Cirq's qsim simulator, the IBM Qiskit Aer simulator, and Xanadu’s PennyLane Lightning simulator. Users leveraging lightning.gpu on AWS Braket experienced 900x speedups and saved 3.5x on costs. cuStateVec will soon support an even wider range of frameworks and simulators. Visit the NVIDIA Technical Blog for more details.

## Tensor Network Method

Tensor network methods are rapidly gaining popularity as a way to simulate hundreds or thousands of qubits for near-term quantum algorithms. Tensor networks scale with the number of quantum gates rather than the number of qubits. This makes it possible to simulate very large qubit counts with smaller gate counts on large supercomputers.

Tensor contractions dramatically reduce the memory requirement for running a circuit on a tensor network simulator. The research community is investing heavily in improving pathfinding methods for quickly finding near-optimal tensor contractions before running a simulation.
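A toy example shows why the contraction order matters. The sketch below is not cuTensorNet's pathfinder. It only counts multiply-add operations for the two possible ways to contract a chain of three matrices, with shapes chosen purely for illustration.

```python
def matmul_cost(left_shape, right_shape):
    """Return (multiply-add count, result shape) for a matrix product."""
    m, k = left_shape
    k2, n = right_shape
    if k != k2:
        raise ValueError("inner dimensions do not match")
    return m * k * n, (m, n)

# Illustrative shapes for a chain A @ B @ C.
a, b, c = (10, 1000), (1000, 10), (10, 1000)

# Order 1: contract A with B first, then the result with C.
cost_ab, ab_shape = matmul_cost(a, b)
cost_ab_c, _ = matmul_cost(ab_shape, c)
left_first_cost = cost_ab + cost_ab_c

# Order 2: contract B with C first, then A with the result.
cost_bc, bc_shape = matmul_cost(b, c)
cost_a_bc, _ = matmul_cost(a, bc_shape)
right_first_cost = cost_bc + cost_a_bc

print(left_first_cost, ab_shape)     # cost and intermediate shape, order 1
print(right_first_cost, bc_shape)    # cost and intermediate shape, order 2
```

Both orders give the same final result, but the second needs 100 times more arithmetic and a 1000 × 1000 intermediate instead of a 10 × 10 one. Real circuits involve many more tensors, so the number of possible orders grows very quickly, which is why fast pathfinding is valuable.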

cuTensorNet provides state-of-the-art performance for both the pathfinding and contraction stages of tensor network simulation. See the NVIDIA Technical Blog for more details.

Using cuQuantum, NVIDIA researchers were able to simulate a variational quantum algorithm for solving the MaxCut optimization problem using 1,688 qubits to encode 3,375 vertices on an NVIDIA DGX SuperPOD™ system. This is a 16x improvement over the previous largest simulation and multiple orders of magnitude larger than the largest problem run on quantum hardware to date.

### Pathfinding and Contraction Performance

State-of-the-Art Performance for Pathfinding

Performance for cuTensorNet pathfinding compared to Cotengra in terms of seconds per sample. Both runs use a single core of an EPYC 7742 CPU.

Sycamore refers to the 53-qubit random quantum circuits of depth 10, 12, 14, and 20 from Arute et al., Quantum Supremacy Using a Programmable Superconducting Processor.

www.nature.com/articles/s41586-019-1666-5

Cotengra: Gray & Kourtis, Hyper-optimized Tensor Network Contraction, 2021.

quantum-journal.org/papers/q-2021-03-15-410

State-of-the-Art Performance for Contraction Time

Contraction performance for cuTensorNet compared to Torch, CuPy, and NumPy. All runs use the same best contraction path. cuTensorNet, CuPy, and Torch each ran on one NVIDIA A100 GPU; NumPy ran on a single-socket EPYC 7742. CuPy and NumPy cannot execute Sycamore depth 12 and 14 because they restrict the maximum tensor rank to 32, and both circuits contain tensors that exceed this limit.

BQSKit: circuits with 48 and 64 qubits: Berkeley Quantum Synthesis Toolkit https://github.com/BQSKit/bqskit

QAOA: 36 qubits with 4 parameters

PEPS: tensor network with dimensions of 3x3 and operator depth 30.

## Approximate Tensor Network Methods

MPS gate split performance is measured in execution time as a function of bond dimension. We execute this on an NVIDIA A100 80GB GPU and compare it to NumPy running on an EPYC 7742 data center CPU.

As the quantum problems of interest can greatly vary in both size and complexity, researchers have developed highly customized approximate tensor network algorithms to address the gamut of possibilities. To enable easy integration with these frameworks and libraries, cuTensorNet provides a set of APIs to cover the following common use cases: Tensor QR, Tensor SVD, and Gate Split.

These primitives enable users to accelerate and scale different types of quantum circuit simulators. A common approach to simulating quantum computers which takes advantage of these methods is matrix product states (MPS, also known as tensor train). Users can leverage these new cuTensorNet APIs to accelerate MPS-based quantum circuit simulators.
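To illustrate why the bond dimension controls the cost of an MPS simulation, the sketch below counts the numbers an MPS stores and compares that with a full state vector. It is a back-of-the-envelope model, not a cuTensorNet call: the helper name `mps_num_entries` is hypothetical, each site is assumed to hold a tensor of shape (left bond, 2, right bond), and every bond is assumed to be capped at `max_bond_dim`, which is the truncation a gate split with SVD would enforce.

```python
def mps_num_entries(num_qubits, max_bond_dim):
    """Count tensor entries in an open-boundary MPS with capped bonds.

    The bond to the left of site s can never exceed 2**s or
    2**(num_qubits - s), so it is the minimum of those and the cap.
    """
    total = 0
    for site in range(num_qubits):
        left = min(2 ** site, 2 ** (num_qubits - site), max_bond_dim)
        right = min(2 ** (site + 1), 2 ** (num_qubits - site - 1), max_bond_dim)
        total += left * 2 * right
    return total

num_qubits = 50
full_state_entries = 2 ** num_qubits

for bond_dim in (16, 64, 256):
    entries = mps_num_entries(num_qubits, bond_dim)
    print(bond_dim, entries, full_state_entries // entries)
```

Storage grows roughly with the square of the bond dimension and only linearly with the qubit count, so the approximation is cheap when entanglement is limited. The price is accuracy: truncating bonds discards small singular values.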

The Gate Split and Tensor SVD APIs enable nearly an order-of-magnitude speedup over state-of-the-art CPU implementations. Tensor QR is the most efficient, with a speedup of nearly two orders of magnitude over the same EPYC 7742 data center CPU.

## Resources

- Watch GTC sessions
  - NVIDIA GTC23 Quantum Computing Sessions
  - Introducing cuQuantum: Accelerating State Vector and Tensor Network-Based Quantum Circuit Simulation
  - A Deep Dive on the Latest HPC Software
  - Benchmarking GPU Clusters with Universal Quantum Computing Simulations
- Read NVIDIA blog posts
  - Enabling Matrix Product State–Based Quantum Circuit Simulation with NVIDIA cuQuantum
  - Best-in-Class Quantum Circuit Simulation at Scale with NVIDIA cuQuantum Appliance
  - Achieving Supercomputing-Scale Quantum Circuit Simulation with the cuQuantum Appliance
  - Growing Range of Researchers, Scientists Adopt NVIDIA cuQuantum
  - NVIDIA Teams With Google Quantum AI, IBM and Other Leaders to Speed Research in Quantum Computing
  - NVIDIA Sets World Record for Quantum Computing Simulation With cuQuantum Running on DGX SuperPOD
  - What Is Quantum Computing?
  - Accelerating Quantum Circuit Simulation with NVIDIA cuStateVec
  - Scaling Quantum Circuit Simulation with NVIDIA cuTensorNet
  - What is a QPU?