Driving Toward Billion-Cell Analysis and Biological Breakthroughs with RAPIDS-singlecell

The future of cell biology and virtual cell models is dependent on measuring and analyzing data at scale. Single-cell experiments have been growing at an incredible rate over the last 10 years, beginning with hundreds of cells and now moving towards new data generation efforts with billions of cells. 

With virtual cell models, billions of virtual cells are also being generated. This deluge of data and new development of models will help scientists discover novel biology, develop new therapeutics, and investigate and elucidate the progression of disease and aging.

Data processing and analysis are key to downstream biological interpretation as well as model building. With this extreme growth of data, two key data processing challenges have emerged, greatly limiting scientific understanding and interpretation of these large-scale data sets:

  1. Data size: Inability to analyze large data sets (millions to billions of cells)
  2. Analysis speed: Hours to days of wait time for important, expert-informed analysis steps

RAPIDS-singlecell solves major bottlenecks in single-cell data processing, analysis, and integration

Analysis steps, including normalization, dimensionality reduction, clustering, and batch integration, are essential to single-cell data analysis, interpretation, and model development. RAPIDS-singlecell is an open-source, MIT-licensed tool developed by scverse that addresses both data size and analysis speed challenges. Leveraging GPU acceleration through CuPy and NVIDIA RAPIDS, it operates directly on the AnnData data structure, a community standard.
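As a rough illustration of how these steps chain together on an AnnData object, the sketch below runs them through a `backend` module. It assumes the `rapids_singlecell` function names (`pp.normalize_total`, `pp.log1p`, `pp.highly_variable_genes`, `pp.scale`, `pp.pca`, `pp.neighbors`, `tl.umap`, `tl.leiden`, and `get.anndata_to_GPU`) follow the scanpy-style layout, so either module could be passed in. The helper name `run_pipeline` and every parameter value are illustrative assumptions, not recommended settings.

```python
def run_pipeline(adata, backend, n_top_genes=5000, n_comps=50):
    """Run a basic single-cell workflow in place on an AnnData object.

    `backend` is a module exposing scanpy-style `pp` and `tl` namespaces,
    for example `import rapids_singlecell as backend` (GPU) or
    `import scanpy as backend` (CPU). All defaults here are assumptions.
    """
    # Move the data to GPU memory only if the backend offers that helper.
    to_gpu = getattr(getattr(backend, "get", None), "anndata_to_GPU", None)
    if to_gpu is not None:
        to_gpu(adata)

    # Normalization and feature selection.
    backend.pp.normalize_total(adata, target_sum=1e4)
    backend.pp.log1p(adata)
    backend.pp.highly_variable_genes(adata, n_top_genes=n_top_genes)
    backend.pp.scale(adata, max_value=10)

    # Dimensionality reduction, neighbor graph, embedding, and clustering.
    backend.pp.pca(adata, n_comps=n_comps)
    backend.pp.neighbors(adata, n_neighbors=15)
    backend.tl.umap(adata)
    backend.tl.leiden(adata)
    return adata
```

With a GPU environment in place, a call would look like `run_pipeline(adata, rapids_singlecell)`; batch integration is left out of this sketch.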

RAPIDS-singlecell is primarily powered by the CuPy library, which acts as a near drop-in replacement for NumPy and selected SciPy functions. This lets users write Python code that closely mirrors standard NumPy syntax while using the parallel computing capabilities of NVIDIA GPUs. Additional tools used include:

| RAPIDS and NVIDIA CUDA libraries | Example tasks for single-cell analysis |
| --- | --- |
| NVIDIA cuML | Dimensionality reduction, including PCA, UMAP, and t-SNE |
| NVIDIA cuGraph | Clustering of cells leveraging graph-based computations, including Leiden and Louvain |
| Dask | Scale to 100M+ cells through parallel execution across multiple GPUs and nodes |
| RAPIDS Memory Manager | Automatic spilling of data to host memory, enabling large-scale single-cell analysis across multiple GPU architectures |
| | Just-in-time compiled CUDA kernels written in Python for gene selection/iterative graph refinement |

Table 1. A list of tools used for specific tasks in single-cell analysis
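The near drop-in relationship between CuPy and NumPy can be sketched with a small log-normalization routine written once against an array module `xp`. This is a minimal illustration, not code from RAPIDS-singlecell: the function name `log_normalize`, the toy counts, and the CPU fallback are all assumptions made so the snippet also runs on machines without a GPU.

```python
import numpy as np

try:
    import cupy as cp
    cp.zeros(1)  # fails if CuPy is installed but no usable GPU is present
    xp = cp
except Exception:
    xp = np  # CPU fallback so the sketch still runs without a GPU


def log_normalize(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log1p."""
    counts = xp.asarray(counts, dtype=xp.float32)
    totals = counts.sum(axis=1, keepdims=True)
    totals = xp.where(totals == 0, 1.0, totals)  # avoid dividing empty cells by zero
    return xp.log1p(counts / totals * target_sum)


toy_counts = [[1, 0, 3], [0, 2, 2], [5, 5, 0]]  # 3 cells x 3 genes, made-up values
normalized = log_normalize(toy_counts)
print(type(normalized).__module__, normalized.shape)
```

The body of `log_normalize` is identical for both libraries; only the module bound to `xp` decides whether the arithmetic runs on the CPU or the GPU.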

Scaling for the future of cell science with millions of cells on a single node

The data size challenge is a growing problem in single-cell analysis, for both real data and virtual cell experimentation, and it intensifies as real data sets expand and more virtual cell models are developed.

Noetik, an AI-native biotechnology company, has developed the foundation model OCTO-vc based on petabytes of spatial data from human tumor and healthy control tissues. Using a proprietary dataset of 193M cells, Noetik is building multimodal foundation models to simulate virtual cells and cellular systems. 

“Without accelerated computing, analyzing datasets of this magnitude was not previously possible. With NVIDIA, our virtual cell experiments have generated over 5.5 billion virtual cells,” says Jacob Rinaldi, chief science officer at Noetik. “Not only are we now able to support datasets of this size, but we’re also able to accelerate analysis with NVIDIA RAPIDS and RAPIDS-singlecell across different algorithms and dataset scales.”

Rinaldi’s team has changed their analysis from hours or days on CPU to near-real-time, with 470x faster UMAP (12.85 minutes to 1.64 seconds) and 1958x faster Leiden clustering (7.83 hours to 14.4 seconds) on a 1.1M cell dataset, leveraging RAPIDS-singlecell.
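The speedup factors quoted above follow from converting both timings to seconds and dividing. The short check below reproduces that arithmetic; the helper name `speedup` is an assumption introduced for illustration.

```python
def speedup(cpu_seconds, gpu_seconds):
    """Return how many times faster the GPU run is than the CPU run."""
    return cpu_seconds / gpu_seconds


umap_speedup = speedup(12.85 * 60, 1.64)     # 12.85 minutes converted to seconds
leiden_speedup = speedup(7.83 * 3600, 14.4)  # 7.83 hours converted to seconds

print(f"UMAP:   {umap_speedup:.1f}x")
print(f"Leiden: {leiden_speedup:.1f}x")
```

Both ratios round to the quoted figures of roughly 470x and 1958x.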

RAPIDS-singlecell can accommodate hundreds of millions of cells in seconds, due to the efficiencies described in Table 1. It can also analyze millions of cells on an individual GPU. Table 2 outlines the latest benchmarks using the newest version of RAPIDS-singlecell available in the NVIDIA AI Blueprint for single-cell analysis. We recommend using the Zarr format when processing large datasets using RAPIDS-singlecell.

Unless noted, these benchmarks are from NVIDIA and are on a single GPU. Speeds can vary depending on the data set, GPU instance, and memory availability.

Single GPU benchmarks for 1M cells
| Workload | Baseline (CPU) | NVIDIA L40S GPU | NVIDIA RTX PRO 6000 Server Edition | NVIDIA DGX B200 |
| --- | --- | --- | --- | --- |
| QC | 13.6 | 0.5 | 0.2 | 0.2 |
| Highly variable genes | 27.0 | 8.7 | 0.4 | 0.3 |
| Regress out | 8.2 | 2.7 | 0.2 | 0.2 |
| Scale | 15.4 | 0.3 | 0.2 | 0.1 |
| PCA | 141.0 | 18.1 | 2.0 | 1.2 |
| All preprocessing | 313.0 | 40.0 | 4.1 | 2.9 |
| Neighbors | 219.0 | 4.0 | 1.9 | 1.7 |
| UMAP | 574.0 | 2.4 | 1.7 | 1.2 |
| Louvain clustering | 422.0 | 4.4 | 1.8 | 1.5 |
| Leiden clustering | 1521.0 | 3.2 | 1.7 | 1.5 |
| tSNE | 2010.0 | 33.2 | 15.9 | 14.6 |
| Diffusion map | 77.0 | 4.4 | 1.3 | 1.2 |
| Total processing time | 5176.0 | 92.0 | 28.4 | 24.6 |

Table 2. RAPIDS-singlecell (v0.12.6) can accommodate a million cells across various GPU architectures on a single GPU

Time is reported in seconds. For the L40S instance, we set RMM managed_memory=True for the steps before PCA. The baseline is a CPU (AMD EPYC 7413 24-Core Processor, 48 threads) running scanpy v1.11.1.
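The managed-memory setting mentioned above can be sketched as follows. This assumes the `rmm.reinitialize` call and the `rmm_cupy_allocator` from `rmm.allocators.cupy` found in recent RMM releases; the wrapper name `enable_managed_memory` and the choice of `pool_allocator=False` are assumptions, and the function simply reports `False` when the GPU libraries are not installed.

```python
def enable_managed_memory():
    """Route CuPy allocations through RMM managed (unified) memory.

    Returns True if the allocator was configured, False if CuPy or RMM
    is not installed in the current environment.
    """
    try:
        import cupy as cp
        import rmm
        from rmm.allocators.cupy import rmm_cupy_allocator
    except ImportError:
        return False

    # Managed memory lets allocations exceed GPU memory by spilling to host memory.
    rmm.reinitialize(managed_memory=True, pool_allocator=False)
    cp.cuda.set_allocator(rmm_cupy_allocator)
    return True


print("managed memory enabled:", enable_managed_memory())
```

Calling this once before loading data is the usual pattern, since it changes the allocator for subsequent CuPy arrays.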

Near-real-time single-cell analysis leveraging NVIDIA RAPIDS and NVIDIA Blackwell GPUs

With the newest RAPIDS-singlecell support for NVIDIA Blackwell GPUs, analysis time is dramatically reduced, converging upon near real-time analysis of single-cell data.

This type of workflow is essential for scientists aiming to explore cell populations and delve deeper into subclusters or rare cell subsets. By iteratively running dimensionality reduction and other methods, they can uncover novel biological insights from their data.

Additional GPUs and new architectures reduce the time of analysis. PCA run on a 95M cell dataset from Tahoe Bio with 7,000 features can be completed in under 10 seconds on Blackwell GPUs. Table 3 shows benchmarks for multi-GPU on 11M cells.

| Step | NVIDIA RTX PRO 6000 Server Edition (8 GPUs) | NVIDIA DGX B200 (8 GPUs) |
| --- | --- | --- |
| Log normalize | 0.33 | 0.27 |
| Highly variable genes | 0.42 | 0.44 |
| Scale | 0.59 | 0.53 |
| PCA | 1.62 | 1.73 |
| Neighbors | 23.7 | 20.9 |
| UMAP | 10.5 | 11.7 |
| Leiden clustering | 18 | 17.6 |

Table 3. 11M cells run on multiple GPUs; time is in seconds

5,000 highly variable genes were selected. Neighbors, UMAP, and Leiden clustering were performed on a single GPU. With additional GPUs and newer architectures, analysis completes in seconds, not hours. For Neighbors, the algorithm used was ivfpq from cuVS.

Introducing accelerated, open source integrative analysis using Harmony

Particularly now, when large single-cell data corpora, including CZI cellxgene and Arc’s Virtual Cell Atlas, are growing in size and complexity, there’s a growing need for tools to help integrate data sets across experiments. This is a useful step for the analysis and utilization of data for model development.

RAPIDS-singlecell now includes an updated, optimized implementation of Harmony, a tool for batch integration that removes batch effects to help uncover biological insights. The RAPIDS-singlecell version is now MIT-licensed and uses a label-vector encoding instead of the commonly used one-hot-encoding matrix.
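To see why a label vector can stand in for a one-hot matrix, consider summing an embedding per batch. The NumPy sketch below is not the RAPIDS-singlecell Harmony code; the array names, sizes, and random data are assumptions chosen only to show that both encodings give the same per-batch sums while the label vector stores one integer per cell.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_dims, n_batches = 6, 2, 3
embedding = rng.normal(size=(n_cells, n_dims))  # stand-in for PCA coordinates
batch_labels = np.array([0, 2, 1, 0, 2, 2])     # one integer batch id per cell

# One-hot encoding: an (n_cells x n_batches) matrix that is mostly zeros.
one_hot = np.eye(n_batches)[batch_labels]
sums_one_hot = one_hot.T @ embedding

# Label-vector encoding: scatter-add each cell's row into its batch's slot.
sums_labels = np.zeros((n_batches, n_dims))
np.add.at(sums_labels, batch_labels, embedding)

print(np.allclose(sums_one_hot, sums_labels))
print(one_hot.nbytes, "bytes one-hot vs", batch_labels.nbytes, "bytes labels")
```

The one-hot matrix grows with cells times batches, whereas the label vector grows only with the number of cells, which matters at the scale of millions of cells.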

In the example in Figure 1, which uses a data set from the CZI cellxgene repository, the initial UMAP shows that many of the cells cluster by assay version. After Harmony batch integration, however, many of these batch effects are removed, and cell types begin to emerge.

Figure 1. UMAP of the data before and after batch integration with Harmony is applied

Harmony on RAPIDS-singlecell can complete this integration more than 350x faster than CPU for 11M cells, reducing analysis time from hours to seconds, as outlined in Table 4.

| Number of cells | Baseline (CPU) | NVIDIA A10 Tensor Core GPU | NVIDIA L40S GPU | NVIDIA RTX PRO 6000 Server Edition | NVIDIA DGX B200 |
| --- | --- | --- | --- | --- | --- |
| 90,000 | 120 | 3.3 | 2.6 | 1.6 | 1.6 |
| 200,000 | 182 | 3.2 | 2.8 | 1.9 | 1.6 |
| 2,000,000 | 1172 | 8 | 5.9 | 4.3 | 3.8 |
| 11,000,000 | >7150 | 46.4 | 42.7 | 19.7 | 21.7 |

Table 4. Harmony speeds with increasing numbers of cells on GPU versus CPU

The CPU baseline is an AMD EPYC 7413 24-Core Processor with 48 threads. All numbers are in seconds. Data sets are those used in the RAPIDS-singlecell blueprint.
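The "more than 350x" figure can be checked against the 11M-cell row of Table 4. Because the CPU baseline is reported as ">7150" seconds, each ratio below is a lower bound rather than an exact speedup; the variable names are illustrative.

```python
# Harmony runtimes in seconds for 11M cells, copied from Table 4.
# The CPU baseline is reported as ">7150", so each speedup is a lower bound.
cpu_lower_bound = 7150
gpu_seconds = {
    "NVIDIA A10 Tensor Core GPU": 46.4,
    "NVIDIA L40S GPU": 42.7,
    "NVIDIA RTX PRO 6000 Server Edition": 19.7,
    "NVIDIA DGX B200": 21.7,
}

min_speedup = {gpu: cpu_lower_bound / secs for gpu, secs in gpu_seconds.items()}
for gpu, factor in min_speedup.items():
    print(f"{gpu}: at least {factor:.0f}x faster than CPU")

best_gpu = max(min_speedup, key=min_speedup.get)
print("Largest lower bound:", best_gpu)
```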

Get started

The following hands-on resources can help you to get started with RAPIDS-singlecell.

Acknowledgments

We want to thank the scverse core team, especially Philipp Angerer, Ilan Gold, Lukas Heumos, and Isaac Virshup, for providing advice, insights, and collaboration, as well as Corey Nolet and Avantika Lal for prior iterations and contributions to RAPIDS+single cell data.

Additional thanks for significant contributions to the single cell blueprint and feedback on Harmony go to, in alphabetical order: Alice Hsiung, Chelsea Gomatam, Daniel Burkhardt, Deven Yue, Eric Phan, Michelle Gill, Narges Masoudi, and Seth Poulos, as well as the Brev team, including Alec Fong, Anish Maddipoti, Carter Abdallah, and Tyler Fong.
