Data Science

Unlock Gene Networks Using Limited Data with AI Model Geneformer

Geneformer is a recently introduced and powerful AI model that learns gene network dynamics and interactions using transfer learning from vast single-cell transcriptome data. This tool enables researchers to make accurate predictions about gene behavior and disease mechanisms even with limited data, accelerating drug target discovery and advancing understanding of complex genetic networks in various biological contexts. 

Developed by researchers at the Broad Institute of MIT and Harvard and their collaborators, the AI model Geneformer uses the highest-expressed genes in single-cell RNA sequencing (scRNA-seq) data to generate a dense representation of each cell, which can be used as features for various downstream predictive tasks. What makes Geneformer unique, however, are the capabilities its architecture enables, even when trained on very little data.
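The underlying encoding, as described in the Geneformer publication, ranks each cell's detected genes by their expression normalized against each gene's median expression across the training corpus, so ubiquitously high genes are de-emphasized. Here is a toy NumPy sketch of that idea, with made-up counts and medians; it illustrates the ranking only, not the actual tokenizer:

```python
import numpy as np

def rank_value_encode(expression, gene_medians, max_len=2048):
    """Order one cell's detected genes by expression normalized by each
    gene's corpus-wide median, keeping up to max_len gene indices."""
    normalized = expression / gene_medians        # de-emphasize ubiquitous genes
    expressed = np.flatnonzero(expression > 0)    # only genes detected in this cell
    order = expressed[np.argsort(-normalized[expressed])]  # highest first
    return order[:max_len]

# Toy cell with 5 genes: gene 2 is modestly expressed but rare corpus-wide,
# so it outranks the highly expressed housekeeping-like gene 0.
expr = np.array([10.0, 0.0, 4.0, 4.0, 1.0])
medians = np.array([10.0, 1.0, 1.0, 2.0, 0.4])
encoded = rank_value_encode(expr, medians)        # gene order: 2, 4, 3, 0
```

In the actual model, these rank positions become the token sequence fed to the transformer.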

A BERT-like reference model for single-cell data

Geneformer has a BERT-like transformer architecture and was pre-trained on data from about 30M single-cell transcriptomes across various human tissues. Its attention mechanism enables it to focus on the most relevant parts of the input data. With this context-aware approach, the model can make predictions by considering relationships and dependencies between genes.

During the pretraining phase, the model employs a masked language modeling technique. In this technique, a portion of the gene expression data is masked, and the model learns to predict the masked genes based on the surrounding context. This approach doesn’t require labeled data and enables the model to understand complex gene interactions and regulatory mechanisms. 
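As a rough illustration of the masking step (a minimal sketch; the real pipeline operates on tokenized gene ranks with a proper vocabulary and special tokens):

```python
import numpy as np

MASK_ID = 0  # hypothetical token id reserved for [MASK]

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace a random ~15% of gene tokens with MASK_ID; the model is
    trained to recover the originals at the masked positions."""
    rng = rng or np.random.default_rng(0)
    tokens = np.asarray(tokens)
    to_mask = rng.random(tokens.shape) < mask_prob
    return np.where(to_mask, MASK_ID, tokens), to_mask

tokens = np.arange(1, 21)              # toy rank-ordered gene token sequence
masked, target = mask_tokens(tokens)   # loss is computed only where target is True
```

Because the targets come from the data itself, no manual labels are needed, which is what makes pretraining on tens of millions of unannotated cells feasible.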

This architecture and training enable the model to consistently enhance predictive accuracy across various tasks relevant to chromatin and gene network dynamics, even with limited data. For example, Geneformer can reconstruct an important gene network in heart endothelial cells from only 5,000 cells of data as accurately as previous state-of-the-art methods trained on more than 30,000 cells.

It can also achieve >90% accuracy on specific cell type classification tasks, one of the most common use cases for gene expression foundation models. We used a Crohn's disease small intestine dataset for the NVIDIA BioNeMo model evaluation on this task and found performance improvements over baseline models for accuracy (Figure 1) and F1 score (Figure 2).
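For reference, the two reported metrics are straightforward to compute. The hand-rolled version below (with made-up cell-type labels) shows why per-class F1, averaged across classes, complements raw accuracy by weighting rare cell types equally; whether the evaluation used macro or weighted averaging is not stated here, so macro averaging is assumed:

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Per-class F1, averaged so rare cell types count as much as common ones."""
    f1s = []
    for c in set(y_true) | set(y_pred):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical annotations: one T cell misclassified as a B cell
y_true = ["T", "T", "B", "B", "NK"]
y_pred = ["T", "B", "B", "B", "NK"]
```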

Two Geneformer models in the BioNeMo platform show improved performance in cell annotation accuracy over baseline controls.
Figure 1. The 10M and the 106M parameter Geneformer models showed improved cell annotation accuracy over baseline models
Two Geneformer models in the BioNeMo platform show improved cell annotation F1 score performance over baseline controls.
Figure 2. The 10M and 106M parameter Geneformer models show improved cell annotation F1 scores over baseline models

The comparisons in Figure 1 and Figure 2 used a Logp1 PCA+RF baseline: PCA with 10 components and a random forest model trained on normalized, log-transformed expression counts. The random-weights baseline model was trained for only about 100 steps, leaving its weights approximately random. The 10M-parameter model is the 6-layer model and the 106M-parameter model has 12 layers; both are described in the BioNeMo documentation.
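That baseline is easy to reproduce in spirit with scikit-learn. The sketch below uses random toy counts in place of the real dataset, so only the pipeline shape (log1p transform, then 10-component PCA, then a random forest) reflects the description above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(42)
counts = rng.poisson(2.0, size=(200, 50)).astype(float)  # toy cells x genes
labels = rng.integers(0, 3, size=200)                    # toy cell-type labels

baseline = make_pipeline(
    PCA(n_components=10),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
baseline.fit(np.log1p(counts), labels)    # log-transformed normalized counts
preds = baseline.predict(np.log1p(counts))
```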

Our experiments and the data in the original Geneformer publication suggest that there is value in scaling Geneformer past the 106M parameter 12-layer models that have been generated so far.

To enable the next generation of Geneformer-based models, we've made two new features available within the BioNeMo Framework. First, the BioNeMo model version has a data loader that loads data about 4x faster than the published method while maintaining compatibility with the data types used in the original publication. Second, Geneformer now allows for both tensor and pipeline parallelism with a simple change to the training configuration. This helps manage memory constraints and reduces training time, making it feasible to train models with billions of parameters by leveraging the combined computational power of multiple GPUs.
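The exact configuration keys are in the BioNeMo documentation; conceptually, tensor parallelism splits each layer's weights across devices so that every GPU holds and multiplies only a shard. A NumPy sketch of the column-split case (illustrative only, not BioNeMo's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # activations for a small batch
W = rng.standard_normal((8, 6))    # full weight matrix of one layer

# Tensor parallelism: split W column-wise across two "devices"; each
# computes a partial output, and an all-gather concatenates the shards.
W0, W1 = np.split(W, 2, axis=1)
y = np.concatenate([x @ W0, x @ W1], axis=1)

assert np.allclose(y, x @ W)       # identical to the single-device result
```

Because no single device ever stores the full weight matrix, the same trick is what lets model size grow past what one GPU's memory can hold.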

NVIDIA Clara tools combine for drug discovery

Geneformer can be accessed within the BioNeMo Framework and is part of a growing compendium of accelerated single-cell and spatial omics analysis tools in the NVIDIA Clara suite (Figure 3). These tools can be implemented in complementary research workflows for drug discovery, exemplified by research at The Translational Genomics Research Institute (TGen).

scverse has developed RAPIDS-SINGLECELL, a library that complements the popular Python single-cell analysis framework Scanpy. Powered by NVIDIA RAPIDS and CuPy, it provides GPU-accelerated functions for preprocessing, visualization, clustering, and trajectory inference of single-cell omics data.
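Its API mirrors Scanpy's, so typical preprocessing (for example, total-count normalization followed by log1p) runs on the GPU under familiar function names. The plain-NumPy sketch below illustrates the math of those two steps only, not the library's actual kernels:

```python
import numpy as np

def normalize_and_log(counts, target_sum=1e4):
    """Scale each cell to the same total count, then log1p-transform --
    a CPU equivalent of GPU-accelerated normalize_total + log1p."""
    per_cell = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / per_cell * target_sum)

counts = np.array([[5.0, 5.0], [1.0, 3.0]])    # toy cells x genes matrix
out = normalize_and_log(counts, target_sum=10)
```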

For spatially resolved approaches, the VISTA-2D model in MONAI is designed for processing and analyzing cell images. It provides high-quality segmentation masks for identifying and quantifying cell morphologies and spatial organization within tissues. Segmentation masks generated by VISTA-2D can be used to derive expression data that can be fed into foundation models such as Geneformer.

AI models like VISTA 2D, Geneformer, and RAPIDS-SINGLECELL can use cell images and expression data to provide complementary downstream analyses, such as cell type annotation and predicting the effects of cell perturbation. 
Figure 3. Geneformer complements other single-cell resources within the NVIDIA Clara suite outside BioNeMo for accelerated insights

A foundation AI model for disease modeling

As its various applications demonstrate (Figure 4), Geneformer can serve as a biological foundation model. These use cases span molecular to organismal-scale problems, making it a widely practical tool for biological research. 

Many of these use cases are described in the model paper. The model is now open source and available for research. In Figure 4, the use cases Geneformer can handle with zero-shot learning are underlined. Zero-shot learning means that Geneformer can predict a data class it hasn't seen before or been explicitly trained for.
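One way zero-shot use works in practice is to take frozen model embeddings and compare new cells against labeled reference cells without any fine-tuning. The nearest-centroid sketch below uses made-up 2-D embeddings and labels purely for illustration:

```python
import numpy as np

def zero_shot_label(cell_emb, ref_embs, ref_labels):
    """Assign the label whose reference centroid is most cosine-similar
    to the query embedding -- no task-specific training involved."""
    labels = sorted(set(ref_labels))
    cents = np.stack([ref_embs[np.array(ref_labels) == c].mean(axis=0)
                      for c in labels])
    cents /= np.linalg.norm(cents, axis=1, keepdims=True)
    query = cell_emb / np.linalg.norm(cell_emb)
    return labels[int(np.argmax(cents @ query))]

ref_embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
ref_labels = ["A", "A", "B", "B"]
predicted = zero_shot_label(np.array([0.95, 0.05]), ref_embs, ref_labels)
```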

This image illustrates the problems that Geneformer can solve in three categories, including gene regulation, cell type and cell state annotation, and predictive biological modeling for therapeutics. 
Figure 4. Geneformer use cases span multiple levels of biological complexity, from gene regulation to therapeutic disease modeling 

In gene regulation research, Geneformer can be fine-tuned on datasets that measure gene expression changes in response to varying levels of transcription factors. This enables accurate predictions of how different dosages of transcription factors influence gene expression and cellular phenotypes, aiding in understanding gene regulation and potential therapeutic interventions. 

Fine-tuning Geneformer on datasets capturing cell state transitions during differentiation can also enable precise classification of cell states, assisting in understanding differentiation processes and development. The model can even be used for one-shot identification of cooperative interactions between transcription factors. This can enhance understanding of complex regulatory mechanisms and how transcription factors work together to regulate gene expression.

Get started

The 6-layer (10M parameter) and 12-layer (106M parameter) models, along with fully accelerated example code for training and deployment, are available through the NVIDIA BioNeMo Framework on NVIDIA NGC.
