Driving Toward Billion-Cell Analysis and Biological Breakthroughs with RAPIDS-singlecell

Jun 12, 2025

By TJ Chen, Severin Dicks, Taurean Dyer and Gary Burnett

AI-Generated Summary

RAPIDS-singlecell is an open-source tool that addresses data size and analysis speed challenges in single-cell data processing and analysis by leveraging GPU acceleration through CuPy and NVIDIA RAPIDS.
Noetik's virtual cell experiments have generated over 5.5 billion virtual cells using NVIDIA RAPIDS and RAPIDS-singlecell, achieving significant speedups in analysis, including 470x faster UMAP and 1958x faster Leiden clustering on a 1.1M cell dataset.
RAPIDS-singlecell has also integrated an optimized implementation of Harmony, a tool for batch integration, which can complete batch effect removal up to more than 350x faster than CPU for 11M cells.

AI-generated content may summarize information incompletely. Verify important information.

The future of cell biology and virtual cell models depends on measuring and analyzing data at scale. Single-cell experiments have grown at an incredible rate over the last 10 years, from hundreds of cells to new data generation efforts targeting billions of cells.

With virtual cell models, billions of virtual cells are also being generated. This deluge of data and new development of models will help scientists discover novel biology, develop new therapeutics, and investigate and elucidate the progression of disease and aging.

Data processing and analysis are key to downstream biological interpretation as well as model building. With this extreme growth of data, two key data processing challenges have emerged, greatly limiting scientific understanding and interpretation of these large-scale data sets:

Data size: Inability to analyze large data sets (millions to billions of cells)
Analysis speed: Hours to days of wait time for important, expert-informed analysis steps

RAPIDS-singlecell solves major bottlenecks in single-cell data processing, analysis, and integration

Analysis steps, including normalization, dimensionality reduction, clustering, and batch integration, are essential to single-cell data analysis, interpretation, and model development. RAPIDS-singlecell is an open-source, MIT-licensed tool developed by scverse that addresses both data size and analysis speed challenges. Leveraging GPU acceleration through CuPy and NVIDIA RAPIDS, it operates directly on the AnnData data structure, a community standard.

RAPIDS-singlecell is primarily powered by the CuPy library that acts as a near drop-in replacement for NumPy and selected SciPy functions, enabling users to write Python code that closely mirrors standard NumPy syntax while using the parallel computing capabilities of NVIDIA GPUs. Additional tools used include:

RAPIDS and NVIDIA CUDA libraries	Example tasks for single-cell analysis
NVIDIA cuML	Dimensionality reduction including PCA, UMAP, and t-SNE
NVIDIA cuGraph	Clustering of cells leveraging graph-based computations including Leiden and Louvain
Dask	Scale to 100+ M cells through parallel execution across multiple GPUs and nodes
RAPIDS Memory Manager	Automatic spilling of data to host memory, enabling large-scale single-cell analysis across multiple GPU architectures
Dimensionality reduction, including PCA, UMAP, and t-SNE	Just-in-time compiled CUDA kernels written in Python for gene selection/iterative graph refinement

Table 1. A list of tools used for specific tasks in single-cell analysis

Scaling for the future of cell science with millions of cells on a single node

Data size is a growing challenge in single-cell analysis, for both real data and virtual cell experimentation. It will only intensify as real data sets expand and more virtual cell models are developed.

Noetik, an AI-native biotechnology company, has developed the foundation model OCTO-vc based on petabytes of spatial data from human tumor and healthy control tissues. Using a proprietary dataset of 193M cells, Noetik is building multimodal foundation models to simulate virtual cells and cellular systems.

“Without accelerated computing, analyzing datasets of this magnitude was not previously possible. With NVIDIA, our virtual cell experiments have generated over 5.5 billion virtual cells,” says Jacob Rinaldi, chief science officer at Noetik. “Not only are we now able to support datasets of this size, but we’re also able to accelerate analysis with NVIDIA RAPIDS and RAPIDS-singlecell across different algorithms and dataset scales.”

Rinaldi’s team has changed their analysis from hours or days on CPU to near-real-time, with 470x faster UMAP (12.85 minutes to 1.64 seconds) and 1958x faster Leiden clustering (7.83 hours to 14.4 seconds) on a 1.1M cell dataset, leveraging RAPIDS-singlecell.
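The speedup factors quoted above follow directly from the reported wall-clock times. As a quick sanity check, the short Python sketch below recomputes them from the numbers in the preceding paragraph; the helper name `speedup` is ours and exists only for illustration.

```python
def speedup(cpu_seconds: float, gpu_seconds: float) -> float:
    """Return how many times faster the GPU run is than the CPU run."""
    return cpu_seconds / gpu_seconds


# Timings reported by Noetik for the 1.1M-cell dataset.
umap_x = speedup(12.85 * 60, 1.64)      # 12.85 minutes vs. 1.64 seconds
leiden_x = speedup(7.83 * 3600, 14.4)   # 7.83 hours vs. 14.4 seconds

print(f"UMAP speedup:   {umap_x:.1f}x")
print(f"Leiden speedup: {leiden_x:.1f}x")
```

The results land at roughly 470x and 1,958x, matching the rounded figures in the text.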

RAPIDS-singlecell can accommodate hundreds of millions of cells in seconds, due to the efficiencies described in Table 1. It can also analyze millions of cells on an individual GPU. Table 2 outlines the latest benchmarks using the newest version of RAPIDS-singlecell available in the NVIDIA AI Blueprint for single-cell analysis. We recommend using the Zarr format when processing large datasets using RAPIDS-singlecell.

Unless noted, these benchmarks are from NVIDIA and are on a single GPU. Speeds can vary depending on the data set, GPU instance, and memory availability.

Single-GPU benchmarks for 1M cells
Workload	CPU baseline	NVIDIA L40S GPU	NVIDIA RTX PRO 6000 Server Edition	NVIDIA DGX B200
QC	13.6	0.5	0.2	0.2
Highly variable genes	27.0	8.7	0.4	0.3
Regress out	8.2	2.7	0.2	0.2
Scale	15.4	0.3	0.2	0.1
PCA	141.0	18.1	2.0	1.2
All preprocessing	313.0	40.0	4.1	2.9
Neighbors	219.0	4.0	1.9	1.7
UMAP	574.0	2.4	1.7	1.2
Louvain clustering	422.0	4.4	1.8	1.5
Leiden clustering	1521.0	3.2	1.7	1.5
tSNE	2010.0	33.2	15.9	14.6
Diffusion map	77.0	4.4	1.3	1.2
Total processing time	5176.0	92.0	28.4	24.6

Table 2. RAPIDS-singlecell (v0.12.6) can accommodate a million cells across various GPU architectures on a single GPU

Times are in seconds. For the L40S instance, we set RMM managed_memory=True for the steps before PCA. The CPU baseline is an AMD EPYC 7413 24-core processor (48 threads) running scanpy v1.11.1.
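For readers who want to reproduce the managed-memory setting mentioned in the note above, the sketch below shows one common way to enable it with RAPIDS Memory Manager and CuPy. It is an illustration, not the exact benchmark script: the function name `enable_managed_memory` is hypothetical, and `pool_allocator=False` is our assumption. The imports are deferred so the snippet can be loaded on a machine without a GPU stack.

```python
def enable_managed_memory() -> None:
    """Route CuPy allocations through RMM using CUDA managed (unified) memory."""
    # Imported lazily: these packages require an NVIDIA GPU environment.
    import cupy as cp
    import rmm
    from rmm.allocators.cupy import rmm_cupy_allocator

    # managed_memory=True allows allocations to exceed GPU memory,
    # with pages migrating to host memory when the device is full.
    rmm.reinitialize(managed_memory=True, pool_allocator=False)

    # Make CuPy (and therefore RAPIDS-singlecell) allocate through RMM.
    cp.cuda.set_allocator(rmm_cupy_allocator)
```

Call `enable_managed_memory()` once, before moving data to the GPU. Managed memory trades some speed for capacity, which is why it is useful on GPUs with less memory and less useful where the data already fits on the device.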

Near-real-time single-cell analysis leveraging NVIDIA RAPIDS and NVIDIA Blackwell GPUs

With the newest RAPIDS-singlecell support for NVIDIA Blackwell GPUs, analysis time is dramatically reduced, converging upon near real-time analysis of single-cell data.

This type of workflow is essential for scientists aiming to explore cell populations and delve deeper into subclusters or rare cell subsets. By iteratively running dimensionality reduction and other methods, they can uncover novel biological insights from their data.

Additional GPUs and new architectures reduce the time of analysis. PCA run on a 95M cell dataset from Tahoe Bio with 7,000 features can be completed in under 10 seconds on Blackwell GPUs. Table 3 shows benchmarks for multi-GPU on 11M cells.

Step	NVIDIA RTX PRO 6000 Server Edition (8 GPUs)	NVIDIA DGX B200 (8 GPUs)
Log normalize	0.33	0.27
Highly variable genes	0.42	0.44
Scale	0.59	0.53
PCA	1.62	1.73
Neighbors	23.7	20.9
UMAP	10.5	11.7
Leiden clustering	18	17.6

Table 3. Multi-GPU benchmarks on 11M cells. Times are in seconds

5,000 highly variable genes were selected. Neighbors, UMAP, and Leiden clustering were performed on a single GPU. With additional GPUs and newer architectures, analysis completes in seconds, not hours. For Neighbors, the algorithm used was ivfpq from cuVS.
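The steps benchmarked in Table 3 map onto a Scanpy-style sequence of RAPIDS-singlecell calls. The following is a minimal single-GPU sketch, not the multi-GPU Dask configuration used for the benchmark. The function name `run_pipeline` and the parameter values (`target_sum=1e4`, `max_value=10`, `n_comps=50`, `n_neighbors=15`) are illustrative assumptions; only the 5,000 highly variable genes and the `ivfpq` neighbor algorithm come from the text above.

```python
def run_pipeline(adata, n_top_genes=5000, n_comps=50):
    """Run normalization through clustering on the GPU and return the result."""
    # Imported lazily: requires an NVIDIA GPU with RAPIDS installed.
    import rapids_singlecell as rsc

    rsc.get.anndata_to_GPU(adata)                    # move adata.X to GPU memory
    rsc.pp.normalize_total(adata, target_sum=1e4)    # library-size normalization
    rsc.pp.log1p(adata)                              # log transform
    rsc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes)
    adata = adata[:, adata.var["highly_variable"]].copy()
    rsc.pp.scale(adata, max_value=10)
    rsc.pp.pca(adata, n_comps=n_comps)
    # Approximate nearest neighbors via the ivfpq algorithm noted above.
    rsc.pp.neighbors(adata, n_neighbors=15, algorithm="ivfpq")
    rsc.tl.umap(adata)
    rsc.tl.leiden(adata)
    return adata
```

Because the functions mirror the Scanpy namespace (`pp`, `tl`), an existing CPU workflow can usually be ported by swapping the module prefix and adding the transfer to the GPU.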

Introducing accelerated, open source integrative analysis using Harmony

As large single-cell data corpora, including CZI cellxgene and Arc’s Virtual Cell Atlas, grow in size and complexity, there is an increasing need for tools that integrate data sets across experiments. Integration is a useful step in analyzing the data and using it for model development.

RAPIDS-singlecell now ships an updated, optimized implementation of Harmony, a tool for batch integration that removes batch effects to help uncover biological insights. The RAPIDS-singlecell version is now MIT-licensed and uses a label-vector encoding instead of the commonly used one-hot encoding matrix.
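To see why a label vector can be cheaper than a one-hot matrix, consider a per-batch sum, a typical building block when aggregating cells by batch. The plain-Python sketch below is a conceptual illustration only, assuming the encoding refers to each cell's batch membership; it is not the RAPIDS-singlecell implementation, and all names in it are hypothetical.

```python
def batch_sums_one_hot(values, one_hot):
    """Per-batch sums using a dense one-hot matrix (n_cells x n_batches)."""
    n_batches = len(one_hot[0])
    sums = [0.0] * n_batches
    for value, row in zip(values, one_hot):
        for b in range(n_batches):
            sums[b] += row[b] * value    # most terms are multiplications by zero
    return sums


def batch_sums_labels(values, labels, n_batches):
    """Per-batch sums using a label vector: one integer per cell."""
    sums = [0.0] * n_batches
    for value, b in zip(values, labels):
        sums[b] += value                 # touch only the cell's own batch
    return sums


labels = [0, 2, 1, 0, 2]                 # batch index of each cell
values = [1.0, 2.0, 3.0, 4.0, 5.0]       # e.g. one embedding coordinate per cell
n_batches = 3

# The equivalent one-hot matrix stores n_cells * n_batches entries.
one_hot = [[1 if b == label else 0 for b in range(n_batches)] for label in labels]

# Both encodings yield the same per-batch sums.
assert batch_sums_one_hot(values, one_hot) == batch_sums_labels(values, labels, n_batches)
```

The label vector stores one integer per cell, whereas the one-hot matrix stores one entry per cell per batch, so the memory and work saved grow with the number of batches.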

In an example using a data set from the CZI cellxgene repository, initial UMAP analysis shows that many of the cells cluster by assay version. After Harmony batch integration, however, many of these batch effects are removed, and cell types begin to emerge.
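A before-and-after comparison like this one can be sketched in a few calls. The example below assumes PCA has already been computed (so `adata.obsm["X_pca"]` exists) and that the batch label lives in an `adata.obs` column; the column name `"assay"` and the function name `integrate_and_embed` are hypothetical placeholders, not names from the original analysis.

```python
def integrate_and_embed(adata, batch_key="assay"):
    """Correct the PCA embedding for batch effects, then recompute the UMAP."""
    # Imported lazily: requires an NVIDIA GPU with RAPIDS installed.
    import rapids_singlecell as rsc

    # By default the corrected embedding is stored in adata.obsm["X_pca_harmony"].
    rsc.pp.harmony_integrate(adata, key=batch_key)

    # Build the neighbor graph on the corrected embedding rather than raw PCA.
    rsc.pp.neighbors(adata, use_rep="X_pca_harmony")
    rsc.tl.umap(adata)
    return adata
```

Plotting the UMAP colored by the batch column before and after this step is a simple way to check whether cells still separate by assay version.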

Harmony on RAPIDS-singlecell can complete this step more than 350x faster than on CPU for 11M cells, reducing analysis time from hours to seconds, as outlined in Table 4.

Number of cells	CPU baseline	NVIDIA A10 Tensor Core GPU	NVIDIA L40S GPU	NVIDIA RTX PRO 6000 Server Edition	NVIDIA DGX B200
90,000	120	3.3	2.6	1.6	1.6
200,000	182	3.2	2.8	1.9	1.6
2,000,000	1172	8	5.9	4.3	3.8
11,000,000	>7150	46.4	42.7	19.7	21.7

Table 4. Harmony speeds with increasing numbers of cells on GPU versus CPU

The CPU baseline is an AMD EPYC 7413 24-core processor (48 threads). All times are in seconds. Data sets are those used in the RAPIDS-singlecell blueprint.
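The "more than 350x" figure can be recovered from the 11M-cell row of Table 4. Because the CPU baseline is reported as ">7150" seconds, the ratios computed in the sketch below are lower bounds; the variable names are ours.

```python
# 11,000,000-cell row of Table 4, in seconds.
cpu_baseline_lower_bound = 7150.0   # reported as ">7150"
gpu_seconds = {
    "NVIDIA A10 Tensor Core GPU": 46.4,
    "NVIDIA L40S GPU": 42.7,
    "NVIDIA RTX PRO 6000 Server Edition": 19.7,
    "NVIDIA DGX B200": 21.7,
}

# Lower-bound speedup of each GPU over the CPU baseline.
speedups = {gpu: cpu_baseline_lower_bound / t for gpu, t in gpu_seconds.items()}

for gpu, factor in speedups.items():
    print(f"{gpu}: at least {factor:.0f}x")
```

The fastest 11M-cell run in the table, on the RTX PRO 6000 Server Edition, is the one that clears 350x.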

Get started

The following hands-on resources can help you to get started with RAPIDS-singlecell.

RAPIDS-singlecell documentation from scverse
Single-Cell Analysis Blueprint: A set of launchable Jupyter notebooks that walk a user through the features of RAPIDS-singlecell. These can be deployed manually or on pre-configured cloud instances through NVIDIA Brev.
Accelerate Data Science and Leverage Foundation Models in Digital Biology training course shows how to use RAPIDS-singlecell to clean a dataset and use the data to retrain Geneformer. It has more Jupyter notebooks as well as accompanying slides and a recorded presentation.
The NVIDIA Genomics overview page covers NVIDIA tools that support Genomics.

Acknowledgments

We want to thank the scverse core team, especially Philipp Angerer, Ilan Gold, Lukas Heumos, and Isaac Virshup, for providing advice, insights, and collaboration. We also thank Corey Nolet and Avantika Lal for prior iterations of, and contributions to, RAPIDS for single-cell data.

Additional thanks for significant contributions to the single-cell blueprint and feedback on Harmony go to, in alphabetical order, Alice Hsiung, Chelsea Gomatam, Daniel Burkhardt, Deven Yue, Eric Phan, Michelle Gill, Narges Masoudi, and Seth Poulos, as well as the Brev team, including Alec Fong, Anish Maddipoti, Carter Abdallah, and Tyler Fong.

About the Authors

About TJ Chen
T.J. leads product management for genomics and biology foundation model research at NVIDIA, including NVIDIA Parabricks, for accelerated genomics analysis on GPU. She is responsible for leveraging NVIDIA expertise in AI, HPC, and data analytics stacks to address genomics workflows with accelerated high-accuracy solutions. Before NVIDIA, T.J. led a team at Chan-Zuckerberg Initiative with contributions to the single-cell, imaging, rare disease, and infectious disease communities. Her interests and experience focus on cloud platforms, machine learning, data analysis and interpretation, supporting R&D through to Clinical applications. She received a Ph.D. from Stanford University in Biomedical Informatics and a B.S. in Computer Science/Biology from Duke University.

About Severin Dicks
Severin Dicks is a Solution Architect for Genomics at NVIDIA and an scverse core member. He helps researchers accelerate single-cell analysis by developing GPU-accelerated workflows and is the primary developer behind rapids-singlecell. Before joining NVIDIA, Severin worked as a software developer in Fabian Theis’ lab at Helmholtz Munich. He holds a Master’s degree in Biology from RWTH Aachen.

About Taurean Dyer
Taurean leads the NVIDIA data science community for GPU accelerated data science. Using his diverse domain experience, he helps people understand and apply the “whys” and “hows” of GPU-accelerated computing. Before NVIDIA, Taurean was a researcher and innovation lead for a Fortune 100 technology lab, focused on developing new IP and capabilities for the future of work. His patented work has been featured at the White House, World Economic Forum, and shown at numerous conferences and publications. Taurean completed grad school in Mechanical Engineering from Stony Brook University, after receiving his double B.Eng in Mechanical Engineering and Applied Math and Statistics.

About Gary Burnett
Gary is a Solution Architect at NVIDIA on the Professional Visualizations team working in Media and Entertainment. He joined NVIDIA in 2017 after graduating from MIT with degrees in computer science and neuroscience. Gary’s role involves working directly with customers in order to create applications that leverage deep learning for visual effects including image processing, character locomotion, and pose estimation.
