CUDA-Q Enabled Resource Reduction for Quantum Clustering Algorithms

Quantum computers can use the quantum properties of superposition, entanglement, and interference to extract and generalize insights from data. Such quantum machine learning (QML) techniques will eventually run on quantum-accelerated supercomputers that combine the processing powers of CPUs, GPUs, and QPUs to solve some of the world’s most complex problems.

Many QML algorithms offer theoretical speedups by assuming that classical data can be efficiently loaded in superposition using so-called quantum random access memory (QRAM). Because there is no efficient means of implementing QRAM, early quantum computers will likely excel at compute-intensive, rather than data-intensive, tasks.

In practice, QML algorithms that are effective on near- and mid-term hardware must focus on compute-intensive heuristics that can analyze data in the absence of QRAM.

This post highlights recent research by Associate Professor Petros Wallden and his team at the Quantum Software Lab, part of the School of Informatics at the University of Edinburgh. Petros is an expert in quantum informatics, with research ranging from quantum algorithms and quantum cryptography to the foundations of quantum informatics.

Petros’ team used the NVIDIA CUDA-Q (formerly CUDA Quantum) platform to develop and accelerate the simulation of new QML methods to significantly reduce the qubit count necessary to study large data sets.

Petros’ team extended Aram Harrow’s work, which uses the concept of coresets to provide a novel way of building realistic oracles for QML applications without the need for QRAM.

What are coresets?

A coreset is formed by taking a full data set and optimally mapping it to a smaller weighted data set (Figure 1). The coreset can then be analyzed to approximate traits of the full data set without the need to process the full data set directly.

Figure 1. A size-10 coreset representation of a 1,000-point dataset; coreset points are drawn as stars sized according to their weights

Coresets are the result of a classical data-reduction method that preprocesses a dataset before a clustering application. By employing coresets, data-intensive QML tasks can be approximated with orders of magnitude fewer qubits, becoming more realistic applications for near-term quantum computing.

Standard classical coreset construction techniques usually begin with a data set and a target error and then determine the optimal size of the coreset to satisfy this error requirement. Due to experimental constraints, Petros’ team instead chose the coreset size based on the number of qubits that were available. They then assessed the error resulting from this choice following the quantum computations. 
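
As a rough illustration, the following Python sketch builds a fixed-size weighted coreset with a sensitivity-style sampling scheme. The function name, sampling rule, and one-qubit-per-point mapping are illustrative assumptions for this post, not the team’s exact construction:

```python
import numpy as np

def build_coreset(data: np.ndarray, m: int, rng=np.random.default_rng(0)):
    """Sample an m-point weighted coreset from an (n x d) dataset.

    Illustrative sensitivity-style sampling: points that contribute more
    to the clustering cost are sampled with higher probability and given
    smaller weights, so the weighted coreset remains an unbiased summary.
    """
    n = len(data)
    dists = np.linalg.norm(data - data.mean(axis=0), axis=1) ** 2
    # Mix uniform and cost-proportional mass so no point has zero probability.
    prob = 0.5 / n + 0.5 * dists / dists.sum()
    idx = rng.choice(n, size=m, replace=False, p=prob)
    # Inverse-probability weights compensate for the biased sampling.
    weights = 1.0 / (m * prob[idx])
    return data[idx], weights

# Fix the coreset size by the qubit budget (one qubit per coreset point),
# then assess the resulting approximation error after the computation.
coreset_points, coreset_weights = build_coreset(np.random.rand(1000, 2), m=10)
```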

Quantum approaches for clustering with coresets

With input data reduced to the manageably sized coreset, Petros’ team could explore three quantum clustering algorithms. 

Clustering is an unsupervised learning technique describing a family of methods that group similar data points in a meaningful way. These groups, or clusters, can then be used for informed decision making in real-world applications, such as determining whether a tumor is malignant or benign.

Petros’s team used CUDA-Q to implement the following clustering techniques:

  • Divisive clustering: An approach where coreset points begin in one cluster and are successively bipartitioned until each data point is in its own cluster. In this approach, the process can be stopped at the Kth iteration to see how data is divided into K clusters (Figure 2).
  • 3-means clustering: The data points are partitioned into K clusters (three, in this case) based on each point’s relationship to K evolving centers-of-mass (centroids). The process ends when the three clusters converge and no longer change with new iterations.
  • Gaussian mixture model (GMM) clustering: The distribution of potential coreset point locations is represented as a mixture of K Gaussian distributions. The data is sorted into K sets based on which Gaussian each coreset point most likely came from.

Each of these clustering techniques outputs a set of clusters and a mapping from each point in the original dataset to one of them, yielding an approximate clustering of the initial, large dataset.
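
To make the control flow concrete, here is a schematic of how the divisive approach could drive a quantum subroutine. The quantum_bipartition function is a hypothetical stand-in for the VQA routine described in the next section:

```python
import numpy as np

def divisive_clustering(points, weights, quantum_bipartition, k: int):
    """Split a weighted coreset into k clusters by repeated bipartition.

    `quantum_bipartition` is a hypothetical stand-in for the VQA
    subroutine: given weighted coreset points, it returns a 0/1 label
    per point (the two sides of the cut).
    """
    clusters = [list(range(len(points)))]
    while len(clusters) < k:
        # Split the largest cluster that still has at least two points.
        cluster = max(clusters, key=len)
        if len(cluster) < 2:
            break
        clusters.remove(cluster)
        labels = quantum_bipartition(points[cluster], weights[cluster])
        clusters.append([i for i, s in zip(cluster, labels) if s == 0])
        clusters.append([i for i, s in zip(cluster, labels) if s == 1])
    return clusters
```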

Figure 2. Outcome of an N=25 coreset QML divisive-clustering simulation: the resulting dendrogram can be cut after any number of iterations to obtain the corresponding number of clusters (five, in this case)

Using a variational quantum algorithm (VQA) framework, the team expressed each technique in a form that can run on a QPU. They made this possible by deriving a weighted qubit Hamiltonian (inspired by the max-cut problem) that encodes the cost function of each clustering method described earlier. Having such a Hamiltonian enabled the iterative VQA process to repeatedly call a QPU, real or simulated, to efficiently perform the cost minimization required by each clustering routine.
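
The following is a minimal VQE-style sketch of that pattern using the CUDA-Q Python API. The two-layer hardware-efficient ansatz and the toy edge weights are illustrative assumptions; in the actual work, the Hamiltonian weights are derived from each clustering method’s cost function:

```python
import numpy as np
import cudaq
from cudaq import spin
from scipy.optimize import minimize

@cudaq.kernel
def ansatz(thetas: list[float], n: int):
    # Hardware-efficient ansatz: RY layer, CNOT entangling chain, RY layer.
    q = cudaq.qvector(n)
    for i in range(n):
        ry(thetas[i], q[i])
    for i in range(n - 1):
        x.ctrl(q[i], q[i + 1])
    for i in range(n):
        ry(thetas[n + i], q[i])

def weighted_maxcut_hamiltonian(w: np.ndarray) -> cudaq.SpinOperator:
    # H = sum_{i<j} (w_ij / 2) * (Z_i Z_j - I): placing points i and j in
    # different clusters (anti-aligned qubits) lowers the energy by w_ij.
    n = w.shape[0]
    h = 0.0 * spin.i(0)
    for i in range(n):
        for j in range(i + 1, n):
            h += 0.5 * float(w[i, j]) * (spin.z(i) * spin.z(j) - spin.i(i))
    return h

# Toy 4-point weight matrix; in the paper, edge weights come from the
# clustering cost function evaluated on the weighted coreset points.
w = np.array([[0, 1, 2, 0],
              [1, 0, 1, 2],
              [2, 1, 0, 1],
              [0, 2, 1, 0]], dtype=float)
n = w.shape[0]
hamiltonian = weighted_maxcut_hamiltonian(w)

def cost(thetas):
    # Each evaluation calls the (simulated) QPU to estimate <H>.
    return cudaq.observe(ansatz, hamiltonian,
                         [float(t) for t in thetas], n).expectation()

x0 = np.random.default_rng(0).uniform(0, np.pi, 2 * n)
result = minimize(cost, x0, method="COBYLA")
print("Minimum cut energy:", result.fun)
```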

Using CUDA-Q to overcome scalability issues 

Exploring the effectiveness of these QML-clustering approaches required the ability to run simulations of how each algorithm would perform. 

The NVIDIA CUDA-Q simulation toolkits enabled comprehensive simulations of each clustering method on problem sizes of up to 25 qubits. CUDA-Q sped up these simulations by providing easy access to GPU hardware. It also provided out-of-the-box primitives, such as hardware-efficient ansatz kernels used to parametrize the Hamiltonian-based optimization process and spin Hamiltonians that were easily adapted to the clustering algorithms’ cost functions.

In fact, performing simulations at the scales presented in the team’s Big data applications on small quantum computers paper was only possible with the GPU acceleration provided by CUDA-Q.

Initial experiments simulated only 10 qubits on CPU hardware, but memory constraints made this impossible for the 25-qubit experiments of interest. Implementation in CUDA-Q meant that the initial 10-qubit simulation code was also instantly compatible, running without modification when Petros’ team needed to switch out CPU hardware for an NVIDIA DGX H100 GPU system.

Figure 3. The CUDA-Q mgpu backend pools the memory of multiple GPUs, here dividing a 320-GB state vector across four 80-GB GPUs, to perform large state-vector simulations

This kind of code scalability is a huge advantage. By pooling the memory of multiple GPUs with the NVIDIA mgpu backend (Figure 3), Petros’ team has since scaled the simulations even further by changing the backend target, again without significantly altering the original simulation code.
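
As an illustration, retargeting a CUDA-Q Python simulation is a one-line change (target and option names follow the CUDA-Q documentation and may vary slightly across releases):

```python
import cudaq

# CPU state-vector simulation (the initial 10-qubit experiments):
cudaq.set_target("qpp-cpu")

# Single NVIDIA GPU (the 25-qubit experiments):
cudaq.set_target("nvidia")

# Pool the memory of several GPUs for even larger state vectors.
# (Recent releases expose this as an option on the nvidia target;
# older releases ship a separate "nvidia-mgpu" target instead.)
cudaq.set_target("nvidia", option="mgpu")
```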

According to Boniface Yogendran, the lead developer on this research, “CUDA-Q enabled us to not worry about qubit scalability limitations and be HPC-ready from day one.” 

Yogendran’s code can take the work beyond simulation, as CUDA-Q is inherently also QPU-ready, providing support for deployment on all major QPU modalities.

Value of CUDA-Q simulations

Being able to easily simulate all three clustering algorithms enabled Petros and his team to benchmark each of them against a brute-force method that finds the globally optimal solution and against a classical heuristic approach known as Lloyd’s algorithm. The results indicated that the quantum algorithms performed best for the GMM (K=2) and divisive clustering approaches, which were comparable to Lloyd’s algorithm.
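
For context, Lloyd’s algorithm is the standard alternating-update heuristic for k-means; on a weighted coreset, both update steps simply become weight-aware. A minimal sketch, not the paper’s exact baseline:

```python
import numpy as np

def weighted_lloyd(points, weights, k, iters=100,
                   rng=np.random.default_rng(0)):
    """Lloyd's k-means heuristic adapted to weighted coreset points."""
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = np.argmin(
            np.linalg.norm(points[:, None] - centroids[None], axis=2), axis=1)
        # Update step: centroids move to the weighted mean of their cluster.
        for c in range(k):
            mask = labels == c
            if mask.any():
                centroids[c] = np.average(points[mask], axis=0,
                                          weights=weights[mask])
    return labels, centroids
```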

Based on the success of this work, Petros’ team plans to keep working with NVIDIA to develop and scale new quantum-accelerated supercomputing applications with CUDA-Q.

Explore CUDA-Q

CUDA-Q enabled Petros and his team to easily develop novel QML implementations and simulate them with accelerated computing. Implementation in CUDA-Q also makes the code portable for further large-scale simulations or for deployment on physical QPUs.

For more information about CUDA-Q, or to get started right away, see the Divisive clustering Jupyter notebook, which explores the coreset-enabled divisive clustering method described in this post. The tutorial demonstrates how easy it is to scale the code and run a 34-qubit instance using GPUs.

Download CUDA-Q using any of the methods provided in the CUDA-Q Quick Start guide and find other tutorials on the CUDA-Q Tutorials page.
