Understanding the Language of Life’s Biomolecules Across Evolution at a New Scale with Evo 2

AI has evolved from an experimental curiosity to a driving force within biological research. The convergence of deep learning algorithms, massive omics datasets, and automated laboratory workflows has allowed scientists to tackle problems once thought intractable, from rapid protein structure prediction to generative drug design, and has heightened the need for AI literacy among scientists. With this momentum, we find ourselves on the cusp of the next paradigm shift: the emergence of powerful AI foundation models purpose-built for biology.

These new models promise to unify disparate data sources—genomic sequences, RNA and proteomic profiles, and, in some cases, scientific literature—into a single, coherent understanding of life at the molecular, cellular, and systems levels. Learning biology’s language and structure opens doors to transformative applications, such as smarter drug discovery, rational enzyme design, and disease mechanism elucidation. 

As we set the stage for this next wave of AI-driven breakthroughs, it is clear that these foundation models will not merely accelerate progress; they stand poised to redefine what is possible in biological research.

A leap forward in sequence modeling and design from molecular to genome-scale

The first Evo model from November 2024 represented a groundbreaking milestone in genomic research, introducing a foundation model capable of analyzing and generating biological sequences across DNA, RNA, and proteins. 

Published at a time when most models were restricted to single modalities or short contexts, Evo is known for its ability to operate across scales—ranging from molecular to genomic—using a unified approach. Trained on 2.7M prokaryotic and phage genomes, encompassing 300B nucleotide tokens, Evo delivered single-nucleotide resolution across many biological evolution and function tasks.

At the core of Evo’s success is its innovative StripedHyena architecture (Figure 1), a hybrid model that interleaves 29 Hyena layers with a small number of attention layers. Hyena is a deep learning operator designed to handle long sequences without relying on the traditional attention mechanism common to Transformer architectures; instead, it uses a combination of long convolutional filters and gates.

This design overcame the limitations of traditional Transformer models, enabling Evo to handle long contexts of up to 131,072 tokens efficiently. The result was a model capable of connecting small sequence changes to system-wide and organism-level impacts, bridging the gap between molecular biology and evolutionary genomics.
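The core mechanism behind this efficiency can be sketched in a few lines. The toy below, which is an illustration of the general idea and not the actual StripedHyena implementation, applies a long causal convolution via FFT in O(n log n) time and then modulates the result with elementwise gates, in contrast to the O(n²) pairwise comparisons of standard attention:

```python
# Toy sketch of a Hyena-style operator: a long causal convolution
# (computed via FFT) followed by elementwise, data-dependent gating.
import numpy as np

def gated_long_conv(x: np.ndarray, h: np.ndarray, gate: np.ndarray) -> np.ndarray:
    """Causal convolution of x with filter h, then elementwise gating."""
    n = x.shape[0]
    # Zero-pad to 2n so the circular FFT convolution becomes a linear one
    X = np.fft.rfft(x, n=2 * n)
    H = np.fft.rfft(h, n=2 * n)
    y = np.fft.irfft(X * H, n=2 * n)[:n]  # keep the causal prefix
    return gate * y  # gating injects data-dependent control

n = 8
x = np.ones(n)
h = np.zeros(n); h[0] = 1.0  # identity filter: convolution returns x unchanged
gate = np.full(n, 0.5)       # gates halve every position
y = gated_long_conv(x, h, gate)
print(np.round(y, 6))        # -> [0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5]
```

Because the FFT cost grows near-linearly with sequence length, operators of this family scale to contexts where quadratic attention becomes impractical.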

Figure 1. Evo and Evo 2 AI model architecture. The diagram contrasts the StripedHyena layers of the two models, categorized as Short Explicit (SE), Medium Regularized (MR), and Long Implicit (LI).

Evo’s predictive capabilities set new standards for biological modeling. It achieved competitive performance in several zero-shot tasks, including predicting the fitness effects of mutations on proteins, non-coding RNAs, and regulatory DNA, providing invaluable insights for synthetic biology and precision medicine. 

Evo also demonstrated remarkable generative capabilities, designing functional CRISPR-Cas systems and transposons. These outputs were validated experimentally, proving that Evo could predict and design novel biological systems with real-world utility.

Evo represents a notable advancement in integrating multimodal and multiscale biological understanding into a single model. Its ability to generate genome-scale sequences and predict gene essentiality across entire genomes marked a leap forward in our capacity to analyze and engineer life. 

Evo’s milestones were not just its technical achievements but also its vision. This unified framework combined biology’s vast complexity with cutting-edge AI to accelerate discovery and innovation in life sciences.

Learning the language of life across evolution

Evo 2 is the next generation of this line of research in genomic modeling, building on the success of Evo with expanded data, enhanced architecture, and superior performance. 

Evo 2 can provide insights into three essential biomolecules (DNA, RNA, and protein) and all three domains of life: Bacteria, Archaea, and Eukarya. Its training dataset of 8.85T nucleotides spans 15,032 eukaryotic genomes and 113,379 prokaryotic genomes, covering diverse species, enabling unprecedented cross-species generalization, and significantly broadening its scope compared to Evo, which focused solely on prokaryotic genomes.

Evo 2 uses a new and improved StripedHyena 2 architecture, extended up to 40B parameters, enhancing the model’s training efficiency and ability to capture long-range dependencies with context lengths of 1M tokens. StripedHyena 2, thanks to its multihybrid design based on convolutions, trains significantly faster than Transformers and other hybrid models using linear attention or state-space models. 

The largest Evo 2 model was trained on 2,048 NVIDIA H100 GPUs using NVIDIA DGX Cloud on AWS. Through NVIDIA’s partnership with the Arc Institute, the team gained access to this high-performance, fully managed AI platform optimized for large-scale, distributed training with NVIDIA AI software and expertise.

These advances mark a significant increase from Evo’s 7B parameters and 131,072-token context length, positioning Evo 2 as a leader in multimodal and multiscale biological modeling (Table 1). 

| Feature | Evo | Evo 2 |
| --- | --- | --- |
| Genomic training data | Bacterial + bacteriophage (300B nucleotides) | All domains of life + bacteriophage (9T nucleotides) |
| Model parameters | 7B | 7B + 40B |
| Context length | 131,072 tokens | Up to 1,048,576 tokens |
| Modalities | DNA, RNA, protein | DNA, RNA, protein |
| Safety | Viruses of eukaryotes excluded | Viruses of eukaryotes excluded |
| Applications | Limited cross-species tasks | Broad cross-species applications |

Table 1. Key features of Evo and Evo 2

Evo 2’s expanded training data and refined architecture empower it to excel across various biological applications. Its multimodal design integrates DNA, RNA, and protein data, enabling zero-shot performance on tasks like mutation impact prediction and genome annotation. Evo 2 also fundamentally improves Evo by including eukaryotic genomes, enabling deeper insights into human diseases, agriculture, and environmental science.

Evo 2’s predictive capabilities outperform specialized models across diverse tasks:

  • Variant impact analysis: Achieves state-of-the-art accuracy in predicting the functional effects of mutations across species zero-shot, including human and non-coding variants.
  • Gene essentiality: Identifies essential genes in prokaryotic and eukaryotic genomes, validated against experimental datasets, bridging the gap between molecular and systems biology tasks.
  • Generative capabilities: Designs complex biological systems, such as genome-scale prokaryotic and eukaryotic sequences, and the controllable design of chromatin accessibility, demonstrating new capabilities for biological design with real-world applicability.
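The zero-shot variant-scoring recipe behind the first two bullets is simple: score a mutation as the difference in sequence log-likelihood under the model, delta = log P(mutant) − log P(wild type), where a strongly negative delta suggests a deleterious change. The sketch below illustrates the recipe with a toy dinucleotide Markov model standing in for Evo 2, which in practice supplies the log-likelihoods; the model, sequences, and smoothing here are illustrative assumptions only:

```python
# Zero-shot variant effect scoring: delta log-likelihood between
# mutant and wild-type sequences. A toy 2-mer Markov model stands
# in for Evo 2's learned sequence likelihoods.
import math
from collections import defaultdict

def train_markov(seqs):
    counts = defaultdict(lambda: defaultdict(lambda: 1))  # add-one smoothing
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

def log_likelihood(model, seq):
    # Unseen transitions get a tiny floor probability
    return sum(math.log(model[a].get(b, 1e-9)) for a, b in zip(seq, seq[1:]))

def variant_score(model, wt, pos, alt):
    mut = wt[:pos] + alt + wt[pos + 1:]
    return log_likelihood(model, mut) - log_likelihood(model, wt)

model = train_markov(["ACGTACGTACGT" * 4])  # toy "evolutionary" corpus
wt = "ACGTACGT"
print(variant_score(model, wt, 4, "A"))  # -> 0.0 (alt equals the wild-type base)
print(variant_score(model, wt, 4, "G"))  # large negative: unseen transitions
```

With a genomic foundation model, the same subtraction works across coding and non-coding regions alike, because the likelihood is defined over raw nucleotides rather than protein-specific features.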

Using the NVIDIA Evo 2 NIM microservice

The NVIDIA Evo 2 NIM microservice is useful for generating a variety of biological sequences, with an API that provides settings to adjust tokenization, sampling, and temperature parameters:

import os

import requests

# Define an example request: a human L1 retrotransposable element sequence
example = {

        # Nucleotide sequence to be analyzed
        "sequence": "GAATAGGAACAGCTCCGGTCTACAGCTCCCAGCGTGAGCGACGCAGAAGACGGTGATTTCTGCATTTCCATCTGAGGTACCGGGTTCATCTCACTAGGGAGTGCCAGACAGTGGGCGCAGGCCAGTGTGTGTGCGCACCGTGCGCGAGCCGAAGCAGGGCGAGGCATTGCCTCACCTGGGAAGCGCAAGGGGTCAGGGAGTTCCCTTTCCGAGTCAAAGAAAGGGGTGATGGACGCACCTGGAAAATCGGGTCACTCCCACCCGAATATTGCGCTTTTCAGACCGGCTTAAGAAACGGCGCACCACGAGACTATATCCCACACCTGGCTCAGAGGGTCCTACGCCCACGGAATC",
        "num_tokens": 102,  # number of tokens to generate
        "top_k": 4,  # sample only from the 4 most likely tokens
        "top_p": 1.0,  # include 100% of cumulative probability in sampling
        "temperature": 0.7,  # add variability (creativity) to predictions
        "enable_sampled_probs": True,  # return probabilities of sampled tokens
        "enable_logits": False,  # disable raw model output (logits)
}

# Retrieve the API key from the environment
key = os.getenv("NVCF_RUN_KEY")

# Send the example sequence and parameters to the Evo 2 API
r = requests.post(

        # Example URL for the Evo 2 model API
        url=os.getenv("URL", "https://health.api.nvidia.com/v1/biology/arc/evo2-40b/generate"),

        # Authorization headers to authenticate with the API
        headers={"Authorization": f"Bearer {key}"},

        # The data payload (sequence and parameters) sent as JSON
        json=example,
)
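Once the response comes back, the generated sequence can be pulled out of the JSON payload. The exact response schema is described in the BioNeMo documentation; the "sequence" key in this helper is an assumption for illustration, so adjust it to the documented field names:

```python
# Minimal post-processing of a generation response payload.
# The "sequence" key is an assumed field name for illustration.
def summarize_generation(payload: dict, key: str = "sequence") -> str:
    seq = payload.get(key, "")
    if not seq:
        raise ValueError(f"no generated sequence under key {key!r}")
    bases = {b: seq.count(b) for b in "ACGT"}  # simple composition summary
    return f"{len(seq)} nt, composition {bases}"

# Example with a mock payload; in practice: summarize_generation(r.json())
print(summarize_generation({"sequence": "GATTACA"}))
# -> 7 nt, composition {'A': 3, 'C': 1, 'G': 1, 'T': 2}
```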

For more information about the API output for various prompts, see the NVIDIA BioNeMo Framework documentation.

Evo 2 can also be fine-tuned using the open-source NVIDIA BioNeMo Framework, which offers robust tools for adapting pretrained models such as Evo 2 to specialized tasks in BioPharma:

# Prepare raw sequence data for training based on a YAML config file
preprocess_evo2 -c data_preproc_config.yaml

# Train the Evo 2 model with preprocessed data and parallelism across multiple GPUs
torchrun --nproc-per-node=8 --no-python train_evo2 -d data_train_config.yaml --num-nodes=1 --devices=8 --max-steps=100 --val-check-interval=25 --experiment-dir=/workspace/bionemo2/model/checkpoints/example --seq-length=8192 --tensor-parallel-size=4 --pipeline-model-parallel-size=1 --context-parallel-size=2 --sequence-parallel --global-batch-size=8 --micro-batch-size=1 --model-size=7b --fp8 --tflops-callback

# Optional fine-tuning: add this argument to start from a pretrained model
# --ckpt-dir=/path/to/pretrained_checkpoint

Evo 2 and the future of AI in biology

AI is poised to rapidly transform biological research, enabling breakthroughs previously thought to be decades away. Evo 2 represents a significant leap forward in this evolution, introducing a genomic foundation model capable of analyzing and generating DNA, RNA, and protein sequences at unprecedented scales. 

While Evo excelled in predicting mutation effects and gene expression in prokaryotes, the capabilities of Evo 2 are much broader, with enhanced cross-species generalization, making it a valuable tool for studying eukaryotic biology, human diseases, and evolutionary relationships. 

Evo 2’s ability to perform zero-shot tasks, from identifying genes that drive cancer risk to designing complex biomolecular systems, showcases its versatility. Its long-context modeling enables it to uncover patterns across entire genomes, providing multimodal and multiscale insights that are pivotal for advancements in precision medicine, agriculture, and synthetic biology.

As the field moves forward, models like Evo 2 set the stage for a future where AI deciphers life’s complexity and is also used to design new useful biological systems. These advancements align with broader trends in AI-driven science, where foundational models are tailored to domain-specific challenges, unlocking previously unattainable capabilities. Evo 2’s contributions signal a future where AI becomes an indispensable partner in decoding, designing, and reshaping the living world.

For more information about Evo 2, see the technical report published by the Arc Institute. Evo 2 is also available within the NVIDIA BioNeMo platform.

Acknowledgments

We’d like to thank the following contributors to the described research for their notable contributions to the ideation, writing, and figure design for this post:

  • Garyk Brixi, genetics Ph.D. student at Stanford
  • Jerome Ku, machine learning engineer working with the Arc Institute
  • Michael Poli, founding scientist at Liquid AI and computer science Ph.D. student at Stanford
  • Greg Brockman, co-founder and president of OpenAI
  • Eric Nguyen, bioengineering Ph.D. student at Stanford
  • Brandon Yang, co-founder of Cartesia AI and computer science Ph.D. student at Stanford (on leave)
  • Dave Burke, chief technology officer at the Arc Institute
  • Hani Goodarzi, core investigator at the Arc Institute and associate professor of biophysics and biochemistry at the University of California, San Francisco
  • Patrick Hsu, co-founder of the Arc Institute, assistant professor of bioengineering, and Deb Faculty Fellow at the University of California, Berkeley
  • Brian Hie, assistant professor of chemical engineering at Stanford University, Dieter Schwarz Foundation Stanford Data Science Faculty Fellow, innovation investigator at the Arc Institute, and leader of the Laboratory of Evolutionary Design at Stanford