Understanding the Language of Life’s Biomolecules Across Evolution at a New Scale with Evo 2

AI has evolved from an experimental curiosity to a driving force within biological research. The convergence of deep learning algorithms, massive omics datasets, and automated laboratory workflows has allowed scientists to tackle problems once thought intractable, from rapid protein structure prediction to generative drug design, and has heightened the need for AI literacy among scientists. With this momentum, we find ourselves on the cusp of the next paradigm shift: the emergence of powerful AI foundation models purpose-built for biology.

These new models promise to unify disparate data sources—genomic sequences, RNA and proteomic profiles, and, in some cases, scientific literature—into a single, coherent understanding of life at the molecular, cellular, and systems levels. Learning biology’s language and structure opens doors to transformative applications, such as smarter drug discovery, rational enzyme design, and disease mechanism elucidation. 

As we set the stage for this next wave of AI-driven breakthroughs, it is clear that these foundation models will not merely accelerate progress; they stand poised to redefine what is possible in biological research.

A leap forward in sequence modeling and design from molecular to genome-scale

The first Evo model from November 2024 represented a groundbreaking milestone in genomic research, introducing a foundation model capable of analyzing and generating biological sequences across DNA, RNA, and proteins. 

Published at a time when most models were restricted to single modalities or short contexts, Evo is known for its ability to operate across scales—ranging from molecular to genomic—using a unified approach. Trained on 2.7M prokaryotic and phage genomes, encompassing 300B nucleotide tokens, Evo delivered single-nucleotide resolution across many biological evolution and function tasks.

At the core of Evo’s success is its innovative StripedHyena architecture (Figure 1), a hybrid model that interleaves 29 Hyena layers with a small number of attention layers. Hyena is a deep learning operator designed to handle long sequences without relying on the traditional attention mechanism common to Transformer architectures; instead, it uses a combination of long convolutional filters and gates.

This design overcame the limitations of traditional Transformer models, enabling Evo to handle long contexts of up to 131,072 tokens efficiently. The result was a model capable of connecting small sequence changes to system-wide and organism-level impacts, bridging the gap between molecular biology and evolutionary genomics.
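The core mechanism behind this efficiency can be sketched in a few lines. The toy below, which is an illustration of the general idea and not the actual StripedHyena implementation, applies a long causal convolution via FFT in O(n log n) time and then modulates the result with elementwise gates, in contrast to the O(n²) pairwise comparisons of standard attention:

```python
# Toy sketch of a Hyena-style operator: a long causal convolution
# (computed via FFT) followed by elementwise, data-dependent gating.
import numpy as np

def gated_long_conv(x: np.ndarray, h: np.ndarray, gate: np.ndarray) -> np.ndarray:
    """Causal convolution of x with filter h, then elementwise gating."""
    n = x.shape[0]
    # Zero-pad to 2n so the circular FFT convolution becomes a linear one
    X = np.fft.rfft(x, n=2 * n)
    H = np.fft.rfft(h, n=2 * n)
    y = np.fft.irfft(X * H, n=2 * n)[:n]  # keep the causal prefix
    return gate * y  # gating injects data-dependent control

n = 8
x = np.ones(n)
h = np.zeros(n); h[0] = 1.0  # identity filter: convolution returns x unchanged
gate = np.full(n, 0.5)       # gates halve every position
y = gated_long_conv(x, h, gate)
print(np.round(y, 6))        # -> [0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]
```

Because the FFT cost grows near-linearly with sequence length, operators of this family scale to contexts where quadratic attention becomes impractical.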

Figure 1. Evo and Evo 2 AI model architecture. The diagram contrasts the StripedHyena layers of the two models, categorized as Short Explicit (SE), Medium Regularized (MR), and Long Implicit (LI).

Evo’s predictive capabilities set new standards for biological modeling. It achieved competitive performance in several zero-shot tasks, including predicting the fitness effects of mutations on proteins, non-coding RNAs, and regulatory DNA, providing invaluable insights for synthetic biology and precision medicine. 

Evo also demonstrated remarkable generative capabilities, designing functional CRISPR-Cas systems and transposons. These outputs were validated experimentally, proving that Evo could predict and design novel biological systems with real-world utility.

Evo represents a notable advancement in integrating multimodal and multiscale biological understanding into a single model. Its ability to generate genome-scale sequences and predict gene essentiality across entire genomes marked a leap forward in our capacity to analyze and engineer life. 

Evo’s milestones were not just its technical achievements but also its vision. This unified framework combined biology’s vast complexity with cutting-edge AI to accelerate discovery and innovation in life sciences.

Learning the language of life across evolution

Evo 2 is the next generation of this line of research in genomic modeling, building on the success of Evo with expanded data, enhanced architecture, and superior performance. 

Evo 2 can provide insights into three essential biomolecules (DNA, RNA, and protein) and all three domains of life: Bacteria, Archaea, and Eukarya. Its training dataset of 8.85T nucleotides spans 15,032 eukaryotic genomes and 113,379 prokaryotic genomes, covering diverse species, enabling unprecedented cross-species generalization, and significantly broadening its scope compared to Evo, which focused solely on prokaryotic genomes.

Evo 2 uses a new and improved StripedHyena 2 architecture, extended up to 40B parameters, enhancing the model’s training efficiency and ability to capture long-range dependencies with context lengths of 1M tokens. StripedHyena 2, thanks to its multihybrid design based on convolutions, trains significantly faster than Transformers and other hybrid models using linear attention or state-space models. 

The largest Evo 2 model was trained on 2,048 NVIDIA H100 GPUs using NVIDIA DGX Cloud on AWS. Through NVIDIA’s partnership with the Arc Institute, the team gained access to this high-performance, fully managed AI platform optimized for large-scale, distributed training with NVIDIA AI software and expertise.

These advances mark a significant increase from Evo’s 7B parameters and 131,072-token context length, positioning Evo 2 as a leader in multimodal and multiscale biological modeling (Table 1). 

| Feature | Evo | Evo 2 |
| --- | --- | --- |
| Genomic training data | Bacterial + bacteriophage (300B nucleotides) | All domains of life + bacteriophage (9T nucleotides) |
| Model parameters | 7B | 7B + 40B |
| Context length | 131,072 tokens | Up to 1,048,576 tokens |
| Modalities | DNA, RNA, protein | DNA, RNA, protein |
| Safety | Viruses of eukaryotes excluded | Viruses of eukaryotes excluded |
| Applications | Limited cross-species tasks | Broad cross-species applications |

Table 1. Key features of Evo and Evo 2

Evo 2’s expanded training data and refined architecture empower it to excel across various biological applications. Its multimodal design integrates DNA, RNA, and protein data, enabling zero-shot performance on tasks like mutation impact prediction and genome annotation. Evo 2 also fundamentally improves Evo by including eukaryotic genomes, enabling deeper insights into human diseases, agriculture, and environmental science.

Evo 2’s predictive capabilities outperform specialized models across diverse tasks:

  • Variant impact analysis: Achieves state-of-the-art accuracy in predicting the functional effects of mutations across species zero-shot, including human and non-coding variants.
  • Gene essentiality: Identifies essential genes in prokaryotic and eukaryotic genomes, validated against experimental datasets, bridging the gap between molecular and systems biology tasks.
  • Generative capabilities: Designs complex biological systems, such as genome-scale prokaryotic and eukaryotic sequences, and the controllable design of chromatin accessibility, demonstrating new capabilities for biological design with real-world applicability.
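The zero-shot variant-scoring recipe behind the first two bullets is simple: score a mutation as the difference in sequence log-likelihood under the model, delta = log P(mutant) − log P(wild type), where a strongly negative delta suggests a deleterious change. The sketch below illustrates the recipe with a toy dinucleotide Markov model standing in for Evo 2, which in practice supplies the log-likelihoods; the model, sequences, and smoothing here are illustrative assumptions only:

```python
# Zero-shot variant effect scoring: delta log-likelihood between
# mutant and wild-type sequences. A toy 2-mer Markov model stands
# in for Evo 2's learned sequence likelihoods.
import math
from collections import defaultdict

def train_markov(seqs):
    counts = defaultdict(lambda: defaultdict(lambda: 1))  # add-one smoothing
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

def log_likelihood(model, seq):
    # Unseen transitions get a tiny floor probability
    return sum(math.log(model[a].get(b, 1e-9)) for a, b in zip(seq, seq[1:]))

def variant_score(model, wt, pos, alt):
    mut = wt[:pos] + alt + wt[pos + 1:]
    return log_likelihood(model, mut) - log_likelihood(model, wt)

model = train_markov(["ACGTACGTACGT" * 4])  # toy "evolutionary" corpus
wt = "ACGTACGT"
print(variant_score(model, wt, 4, "A"))  # -> 0.0 (alt equals the wild-type base)
print(variant_score(model, wt, 4, "G"))  # large negative: unseen transitions
```

With a genomic foundation model, the same subtraction works across coding and non-coding regions alike, because the likelihood is defined over raw nucleotides rather than protein-specific features.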

Using the NVIDIA Evo 2 NIM microservice

The NVIDIA Evo 2 NIM microservice is useful for generating a variety of biological sequences, with an API that provides settings to adjust tokenization, sampling, and temperature parameters:

import os

import requests

# Define an example request: a human L1 retrotransposable element sequence
example = {

        # Nucleotide sequence to be analyzed
        "sequence": "GAATAGGAACAGCTCCGGTCTACAGCTCCCAGCGTGAGCGACGCAGAAGACGGTGATTTCTGCATTTCCATCTGAGGTACCGGGTTCATCTCACTAGGGAGTGCCAGACAGTGGGCGCAGGCCAGTGTGTGTGCGCACCGTGCGCGAGCCGAAGCAGGGCGAGGCATTGCCTCACCTGGGAAGCGCAAGGGGTCAGGGAGTTCCCTTTCCGAGTCAAAGAAAGGGGTGATGGACGCACCTGGAAAATCGGGTCACTCCCACCCGAATATTGCGCTTTTCAGACCGGCTTAAGAAACGGCGCACCACGAGACTATATCCCACACCTGGCTCAGAGGGTCCTACGCCCACGGAATC",
        "num_tokens": 102,  # number of tokens to generate
        "top_k": 4,  # sample only from the 4 most likely tokens
        "top_p": 1.0,  # include 100% of cumulative probability in sampling
        "temperature": 0.7,  # add variability (creativity) to predictions
        "enable_sampled_probs": True,  # return probabilities of sampled tokens
        "enable_logits": False,  # disable raw model output (logits)
}

# Retrieve the API key from the environment
key = os.getenv("NVCF_RUN_KEY")

# Send the example sequence and parameters to the Evo 2 API
r = requests.post(

        # Example URL for the Evo 2 model API
        url=os.getenv("URL", "https://health.api.nvidia.com/v1/biology/arc/evo2-40b/generate"),

        # Authorization headers to authenticate with the API
        headers={"Authorization": f"Bearer {key}"},

        # The data payload (sequence and parameters) sent as JSON
        json=example,
)
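Once the response comes back, the generated sequence can be pulled out of the JSON payload. The exact response schema is described in the BioNeMo documentation; the "sequence" key in this helper is an assumption for illustration, so adjust it to the documented field names:

```python
# Minimal post-processing of a generation response payload.
# The "sequence" key is an assumed field name for illustration.
def summarize_generation(payload: dict, key: str = "sequence") -> str:
    seq = payload.get(key, "")
    if not seq:
        raise ValueError(f"no generated sequence under key {key!r}")
    bases = {b: seq.count(b) for b in "ACGT"}  # simple composition summary
    return f"{len(seq)} nt, composition {bases}"

# Example with a mock payload; in practice: summarize_generation(r.json())
print(summarize_generation({"sequence": "GATTACA"}))
# -> 7 nt, composition {'A': 3, 'C': 1, 'G': 1, 'T': 2}
```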

For more information about the API output for various prompts, see the NVIDIA BioNeMo Framework documentation.

Evo 2 can also be fine-tuned using the open-source NVIDIA BioNeMo Framework, which offers robust tools for adapting pretrained models such as Evo 2 to specialized tasks in BioPharma:

# Prepare raw sequence data for training based on a YAML config file
preprocess_evo2 -c data_preproc_config.yaml

# Train the Evo 2 model with preprocessed data and parallelism across multiple GPUs
torchrun --nproc-per-node=8 --no-python train_evo2 -d data_train_config.yaml --num-nodes=1 --devices=8 --max-steps=100 --val-check-interval=25 --experiment-dir=/workspace/bionemo2/model/checkpoints/example --seq-length=8192 --tensor-parallel-size=4 --pipeline-model-parallel-size=1 --context-parallel-size=2 --sequence-parallel --global-batch-size=8 --micro-batch-size=1 --model-size=7b --fp8 --tflops-callback

# Optional fine-tuning: add this argument to start from a pretrained model
# --ckpt-dir=/path/to/pretrained_checkpoint

Evo 2 and the future of AI in biology

AI is poised to rapidly transform biological research, enabling breakthroughs previously thought to be decades away. Evo 2 represents a significant leap forward in this evolution, introducing a genomic foundation model capable of analyzing and generating DNA, RNA, and protein sequences at unprecedented scales. 

While Evo excelled in predicting mutation effects and gene expression in prokaryotes, the capabilities of Evo 2 are much broader, with enhanced cross-species generalization, making it a valuable tool for studying eukaryotic biology, human diseases, and evolutionary relationships. 

Evo 2’s ability to perform zero-shot tasks, from identifying genes that drive cancer risk to designing complex biomolecular systems, showcases its versatility. Its long-context modeling enables it to uncover patterns across entire genomes, providing multimodal and multiscale insights that are pivotal for advancements in precision medicine, agriculture, and synthetic biology.

As the field moves forward, models like Evo 2 set the stage for a future where AI deciphers life’s complexity and is also used to design new useful biological systems. These advancements align with broader trends in AI-driven science, where foundational models are tailored to domain-specific challenges, unlocking previously unattainable capabilities. Evo 2’s contributions signal a future where AI becomes an indispensable partner in decoding, designing, and reshaping the living world.

For more information about Evo 2, see the technical report published by the Arc Institute. Evo 2 is also available within the NVIDIA BioNeMo platform.

Acknowledgments

We’d like to thank the following contributors to the described research for their notable contributions to the ideation, writing, and figure design for this post:

  • Garyk Brixi, genetics Ph.D. student at Stanford
  • Jerome Ku, machine learning engineer working with the Arc Institute
  • Michael Poli, founding scientist at Liquid AI and computer science Ph.D. student at Stanford
  • Greg Brockman, co-founder and president of OpenAI
  • Eric Nguyen, bioengineering Ph.D. student at Stanford
  • Brandon Yang, co-founder of Cartesia AI and computer science Ph.D. student at Stanford (on leave)
  • Dave Burke, chief technology officer at the Arc Institute
  • Hani Goodarzi, core investigator at the Arc Institute and associate professor of biophysics and biochemistry at the University of California, San Francisco
  • Patrick Hsu, co-founder of the Arc Institute, assistant professor of bioengineering, and Deb Faculty Fellow at the University of California, Berkeley
  • Brian Hie, assistant professor of chemical engineering at Stanford University, Dieter Schwarz Foundation Stanford Data Science Faculty Fellow, innovation investigator at the Arc Institute, and leader of the Laboratory of Evolutionary Design at Stanford