Data Science

GPU-Accelerated Single-Cell RNA Analysis with RAPIDS-singlecell

Single-cell sequencing has become one of the most prominent technologies used in biomedical research. Its ability to decipher changes in the transcriptome and epigenome on a cell level has enabled researchers to gain valuable new insights. As a result, single-cell experiments have grown in size and complexity by a factor of over 100, with experiments involving more than 1 million cells becoming increasingly common. 

However, the resulting data must be analyzed in a highly iterative process. It is crucial that fast algorithms are used for these iterative steps to enable quick turnaround times.

For more consistent single-cell analysis using Python, developers at scverse have worked toward building an entire ecosystem to help researchers perform analyses. At the core of this ecosystem lies a data structure that maintains annotations of various transformations throughout the data processing pipeline during single-cell analysis. 

AnnData, a Python package for handling annotated data matrices in memory and on disk, is the core structure used in the Scanpy library, which is the main single-cell analysis suite within the scverse ecosystem. Scanpy builds on top of other libraries common to the PyData ecosystem—such as NumPy, SciPy, Numba, and Scikit-learn—for just about all of the typical analysis steps. 

However, Scanpy algorithms are mostly CPU-based and slow down significantly with larger experiments. The highly iterative nature of the single-cell analysis process only compounds this problem. 

GPUs for single-cell analysis

General feasibility of running downstream single-cell RNA sequencing (scRNA-seq) analysis on the GPU with RAPIDS and Scanpy has been shown in Accelerating Single-Cell Genomic Analysis with GPUs. This work resulted in the rapids-single-cell-examples GitHub repo, which contains a series of example notebooks written by the RAPIDS and NVIDIA Parabricks teams. RAPIDS is an open-source suite of libraries for GPU-accelerated data science with Python. Parabricks is a free suite of GPU-accelerated, industry-standard genomic analysis tools based on deep learning.

While these example notebooks demonstrate a few typical single-cell RNA workflows on the GPU, they were never intended for daily use nor as a GPU-accelerated replacement for libraries like Scanpy. 

Drawing inspiration from prior work on rapids-single-cell-examples is an emerging library called rapids-singlecell, a GPU-accelerated tool for scRNA analysis. This tool aims to be a daily drivable single-cell analysis suite that is compatible with the scverse ecosystem. It uses RAPIDS and CuPy to provide GPU-accelerated functions that are near-drop-in replacements for the corresponding functions in Scanpy.

On average, users can expect performance to increase between 10x and 20x by using rapids-singlecell. For more details, see Accelerating Single Cell Genomic Analysis Using RAPIDS.

Faster single-cell analysis using RAPIDS

rapids-singlecell follows a similar usability model as the scverse Python libraries. It is also written in Python but puts many of the performance-critical pieces on the GPU, hiding all of the complexities that are normally associated with writing CUDA applications (the language typically used to write accelerated algorithms for NVIDIA GPUs). 

rapids-singlecell consists of five categories, which are described in the sections that follow. Each category accelerates a different piece of the typical single cell analysis workflow.

For more information, including the various APIs offered by rapids-singlecell, see the rapids-singlecell documentation


The AnnData, or annotated data object, is a widely used data structure for handling single-cell RNA sequencing data. In contrast, cunnData is a minimized and lightweight version of the AnnData object for the GPU that replaces the scverse standard for preprocessing (Figure 1). Instead of storing the count matrix .X on the CPU, cunnData stores it on the GPU as CuPy sparse matrices. This makes it faster and more efficient to perform computations on the count matrix. It also contains annotation dataframes for cells (.obs attribute) and genes (.var attribute), which store additional information such as cell type and gene names.

Diagram illustrating the structure of the cunnData class, similar to the annData Python package, but with data stored in GPU memory. The main components include: .X matrix and its layers stored in GPU memory, .obs DataFrame, .var DataFrame, .uns dictionary, .obsm Mapping, and .varm Mapping.
Figure 1. Schematic representation of the cunnData class structure

cunnData also includes additional features, such as the ability to store different versions of the count matrix, such as raw integer counts, in .Layers. Unlike AnnData, which stores .Layers in Host (CPU) memory, cunnData also stores .Layers on the GPU, reducing the need to copy data from the host to GPU memory and enabling accelerated computations.

cunnData supports unstructured annotations in the .uns attribute, as well as multidimensional annotations of cells and genes in the .obsm and .varm attributes, which are stored in the host memory. These annotations enable users to include additional information about their data, such as spatial coordinates or principal component analysis (PCA) embeddings.

Similarly, cunnData supports slicing like AnnData. These slices, however, are always full copies of the original data, as opposed to views. Overall, cunnData enables a faster approach to preprocessing scRNA-seq data compared to the more feature-rich CPU-bound AnnData object.

The Python snippet below demonstrates the conversion of an AnnData object (a standard data structure for handling single-cell RNA sequencing data) into a cunnData object.

import scanpy as sc
import rapids_singlecell as rsc
cudata = rsc.cunnData.cunnData(adata=adata) 


The preprocessing functions are stored in cunnData_funcs, which provides accelerated alternatives to the Scanpy preprocessing functions. These functions work on the cunnData object and use RAPIDS cuML and CuPy to dramatically accelerate Scanpy functions based on Scikit-learn, Numpy and SciPy.

Filtering cells and genes can be accomplished with filter_cells and filter_genes functions, respectively. Quality control is handled with the calculate_qc_metrics function.

# Basic QC rapids-singlecell
rsc.pp.flag_gene_family(cudata,gene_family_name="MT", gene_family_prefix="mt-")
cudata = cudata[cudata.obs["n_genes_by_counts"] > 500]
cudata = cudata[cudata.obs["pct_counts_MT"] < 20]

To normalize your data, cunnData_funcs provides GPU alternatives to the normalize_total, log1p, and the recently introduced normalize_pearson_residuals functions from Scanpy. Annotating highly variable genes is accelerated for all flavors supported in Scanpy (including seurat, cellranger, seurat_v3, pearson_residuals), as well as poisson_gene_selection, which is adapted from scvi-tools.

# log normalization and highly variable gene selection
cudata.layers["counts"] = cudata.X.copy()
rsc.pp.highly_variable_genes(cudata,n_top_genes=5000,flavor="seurat_v3",layer = "counts")
cudata = cudata[:,cudata.var["highly_variable"]==True]

The regress_out function, used to remove unwanted sources of variation, is accelerated with the cuML linear regression estimator. It also supports multitarget regression, which was introduced in cuML in version 22.12, while staying backwards compatible with prior versions. 

Principal component analysis wraps cuML PCA, Truncated SVD, and Incremental PCA to give you the same options offered by Scanpy. With the PCA version in cunnData_funcs, you can choose which layer you want to use for the analysis, an additional feature not currently supported by the scanpy PCA function.

# Regression, scaling and PCA
rsc.pp.regress_out(cudata,keys=["total_counts", "pct_counts_MT"])
rsc.pp.pca(cudata, n_comps = 100), log=True,n_pcs=100)

cunndata_funcs can accelerate preprocessing by a factor of 10 to 20x (Tables 1-3). After preprocessing, the cunnData object is transformed into an AnnData object.

adata_preprocessed = cudata.to_AnnData()


scanpy_gpu provides functions that work on the AnnData object, with the goal of providing accelerated functions. To keep the syntax as close as possible between Scanpy and rapids-singlecell, metadata is also written to the .uns attribute. This attribute is useful for storing trained parameters such as the variance ratio, which is computed during the PCA computation. scanpy_gpu provides a PCA function for the AnnData object equivalent to cunnData_funcs.

Scanpy already includes support for computing UMAP and nearest neighbors on the GPU using cuML. scanpy_gpu extends Scanpy GPU support by adding more algorithms, such as accelerated graph-based clustering using Leiden and Louvain from cuGraph, as well as the Force Atlas 2 algorithm for visually laying out graph data. scanpy_gpu also uses PCA and kernel density estimation (KDE)  from cuML and diffusion maps are computed using the CuPy library in a similar manner to how Scanpy uses SciPy and Numpy for scientific computing. 

For batch correction, scanpy_gpu provides a GPU port of Harmony Integration, called harmony_gpu. PyMDE (minimum distortion embedding), a function that enables embedding single-cell data while jointly learning the graph and the low-dimensional representation in a probabilistic manner, has also been adapted from scvi-tools.

The near-drop-in replacement nature of rapids-singlecell relies on Scanpy for visualization. It is intuitive to use Scanpy for plotting directly within the scverse framework.


The decoupler tool uses a unified framework to implement several different statistical methods with a focus on biological activity. (Cellular, molecular, and physiological processes in living organisms, for example, gene sets and transcription factors activity.) decoupler_gpu re-implements and accelerates the weighted sum (run_wsum) and multivariate linear model (run_mlm) methods. The GPU port in rapids-singlecell uses the same nets/models as decoupler. Table 1 shows a performance increase of up to 37x for wsum.

Squidpy developments

rapids-singlecell is continually expanding with new accelerated functions for the scverse ecosystem. Comprehensive tests have been added to the library to ensure the correctness and reliability of the code. Squidpy enables detailed analysis and visualization of spatial molecular data. It facilitates understanding of complex cell interactions and spatial patterns, greatly contributing to the expansion of the scverse ecosystem. 

Some functions have been accelerated with rapids-singlecell. Spatial auto-correlation with Moran’s I and Geary’s C promises a performance increase of up to 100x. The ligand-receptor (ligrec) interaction analysis in Squidpy has also been optimized and accelerated, resulting in a performance increase of more than 10x.


Our benchmark results show that using GPU acceleration with the rapids-singlecell package and the decoupler functions can significantly improve the performance of scRNA-seq analysis. 

For example, running a sample rapids-singlecell notebook with about 90 K cells end-to-end on a server node with two AMD Epyc Milan 7543, 500 GB memory, and an NVIDIA A100 80 GB GPU was completed in just 51 seconds using the rapids-singlecell package, compared to 1,106 seconds with the traditional scanpy CPU workflow. 

Similarly, the decoupler functions also demonstrated significant speed improvements, with the mlm function running in just 12 seconds on the GPU compared to 83 seconds on the CPU, and the wsum method completing in just 26 seconds on the GPU compared to 16 minutes and 10 seconds on the CPU. 

Overall, these results demonstrate the potential for GPU acceleration to make scRNA-seq analysis faster and more efficient. These benchmark results are summarized in Table 1.

FunctionCPU GPUSpeedup
Whole notebook(excluding decoupler functions)1,106 s  (18.5 min)51 s21x
Preprocessing74 s 8 s9x
Regress out27 s1.6 s16x
PCA35 s0.7 s50x
HVG (Seurat v3)3.2 s0.4 s8x
Harmony417 s18 s23x
Neighbors22 s5.1 s4.3x
UMAP36 s0.4 s90x
TSNE133 s2.4 s55x
Louvain17 s0.6 s28x
Leiden14 s0.2 s70x
Logistic regression58 s3.7 s15x
Draw graph (FA2)256 s0.3 s850x
run_mlm (DoRothEA)83 s12 s7x
Run_wsum (PROGENy)970 s (16 min)26 s37x
Table 1. Server node benchmark for a dataset of 90,000 cells

In addition to the previous benchmark results, running a sample rapids-singlecell notebook with 500 K cells on the server node takes about 2 minutes when using rapids-singlecell. The same analysis takes about 41 minutes on the CPU.

Furthermore, using pearson_residuals for highly variable gene selection and normalization can also be accelerated using GPUs, providing additional speed improvements in scRNA-seq analysis. These benchmark results are summarized in Table 2.

rapids-singlecell is not only capable of accelerating single-cell data analysis on high-end server nodes, but also on consumer-grade hardware. Running the same notebook end-to-end with 50,000 cells on a desktop system with an AMD 5950x CPU, 64 GB memory, and an NVIDIA RTX 3090 GPU, takes around 5 minutes using rapids-singlecell. Although the system was using the RAPIDS Memory Manager (RMM) and unified memory to oversubscribe the GPUs memory, it saw a significant speedup compared to the CPU server. These benchmark results are summarized in Table 2.

FunctionCPUGPU (A100)GPU (3090)Speedup
Whole notebook(excluding PR functions)2,460 s (41 min)110 s290 s22x
Preprocessing 305 s 28 s169 s10x
HVG (Seurat v3)48 s1.5 s13 s32x
Regress out104 s5.1 s16 s20x
scale8.4 s1.3 s5 s6.4x
PCA86 s3.7 s35 s23x
Neighbors74 s17.1 s18.3 s4.3x
UMAP281 s (4.6 min)6.7 s7.6 s60x
TSNE786 s (13 min)10 s12.9 s105x
Louvain283 s (4.5 min)4.5 s5.7 s62x
Leiden282 s (4.5 min)0.6 s0.9 s470x
Logistic regression452 s (7.5 min)33 s63 s13x
Diffusion map30 s0.75 s1.3 s40x
HVG (PR)104 s 2.1 s15.6 s50x
Normalize (PR)22 s0.3 s1 s73x
Table 2. Server node and consumer system benchmark for a dataset of 500,000 cells

Running the same sample notebook (Table 1) with about 90 K cells end-to-end on the desktop system takes only 48 seconds when using rapids-singlecell. In comparison, the traditional scanpy CPU workflow takes 774 seconds. The accelerated decoupler functions also demonstrate significant speed improvements on consumer-grade hardware. These benchmark results are summarized in Table 3.

FunctionCPU GPUSpeedup
Whole notebook(excluding decoupler functions)774 s (13 min)48 s16x
Preprocessing114 s 6 s19x
Regress out62 s1.6 s39x
PCA42 s0.7 s60x
HVG (Seurat v3)2.7s0.4 s6.7x
Harmony175 s21.7 s8x
Neighbors14.9 s4.6 s3.2x
UMAP31 s0.3 s103x
TSNE95 s1.4 s68x
Louvain9.3 s0.5 s18x
Leiden13.2 s0.1 s130x
Logistic regression76 s3.75 s20x
Draw graph (FA2)191 s0.23 s830x
run_mlm (DoRothEA)55 s12 s4.5x
Run_wsum (PROGENy)690 s (11.5 min)28 s26x
Table 3. Consumer system benchmark for a dataset of 90,000 cells


There are multiple methods for installing rapids-singlecell. The easiest method is to use one of the provided yaml files provided within the GitHub repository. These set up the entire environment with everything needed for running the example notebooks.

conda create -f conda/rsc_rapids_23.02.yml

You can also install rapids-singlecell from PyPI into a Conda environment and install RAPIDS from Conda. The default installer does not include RAPIDS or CuPy. Scanpy is also excluded because it is technically not necessary.

pip install rapids-singlecell

Finally, you can install the entire library, including the RAPDIS dependencies, from PyPI using the new experimental PyPI packages from RAPIDS. However, this method of installation requires the user to properly set up CUDA so that it can be found by RAPIDS and CuPy. 

To do this, you can use the following command:

pip install 'rapids-singlecell[rapids]’ --extra-index-url=


With the rapids-singlecell library, it is possible to run the complete analysis of 500 K cells in less time than it takes a CPU to compute only its UMAP embedding. Therefore, it enables a much faster iterative process in single-cell data analysis stages. 

rapids-singlecell also enables the bioinformatician to analyze the data with a physician or biologist in real time, leading to better collaboration and interpretation of the data. In our experience, it is possible to analyze 200 K cells without any issues, even on a consumer-class 3090 series graphics card. Even better, RMM enables the GPU memory to be oversubscribed and spilled to the main memory, enabling scales well over 500 K cells.

With the datacenter-class NVIDIA A100 80 GB GPU, you can analyze matrices containing as many as 231-1 (approximately 2.15 billion) non-zero counts. (Note that this is the current limit imposed by the CuPy 32-bit integer-based indexing for sparse matrix calculations.) This powerful capability enables users to analyze datasets with over 1 million cells. 

The upwards of 20x speedup that rapids-singlecell provides enables researchers to focus more on analyzing and interpreting their single-cell data, rather than waiting for lengthy computational processes. In the true spirit of RAPIDS, this ultimately enhances productivity and fosters new insights into cellular biology that were not possible before.

Discuss (0)