Open research is critical for driving innovation, and many breakthroughs in AI and science are achieved through open collaboration. In the field of digital biology research, NVIDIA Clara supports this open collaboration.
Clara is an open source family of models, tools, and recipes for biology, chemistry, and human health. It includes models for use cases such as small-molecule generative design, synthetic pathway prediction, ADMET property prediction, and protein structure-sequence co-design. To learn more about NVIDIA Clara models and tools for biology and chemistry, visit the NVIDIA-Digital-Bio GitHub repo.
This post introduces CodonFM, a new addition to the Clara open model family. CodonFM is a language model for biology focused on RNA. We describe how the model was designed and how it can be used for various tasks, like variant effect prediction or mRNA design.
CodonFM: An open foundation model for RNA
Today, NVIDIA is announcing CodonFM, a new state-of-the-art RNA foundation model joining the Clara open model family. CodonFM processes RNA by reading it in codons, each comprising three nucleotides. This treats RNA triplets like words in a sentence rather than as independent nucleotide letters, and analyzing RNA sequences in their natural syntax allows the model to learn the complex “grammar” of the genetic code. The result is a model that understands the context-dependent patterns of codon usage bias across organisms.
Some of the most common language models for biology are protein language models, which model protein sequences at the level of individual amino acid residues. These models overlook that the same amino acid can be encoded by different codons (synonymous variants) and that, during cellular protein synthesis, different synonymous codon choices can lead to different amounts of protein being produced.
By accounting for synonymous variants, CodonFM understands how these different RNA sequences that all encode the same amino acids can impact biological function. This enables predicting properties such as mRNA stability, translation efficiency, and protein yield. It also enhances the performance of language models in predicting disease risk associated with genetic mutations.
CodonFM is built on a BERT-style bidirectional encoder architecture, enabling the model to understand the entire input RNA sequence. With a large context window of 2,046 codon tokens (6,138 ribonucleotides), the model identifies complex, long-range sequence patterns that have been refined over billions of years of evolution.
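To make the codon-as-word idea concrete, the following minimal sketch (not the repository's actual tokenizer) splits an in-frame coding sequence into codon tokens and checks its length against the 2,046-codon context window; the helper name is hypothetical.
# Minimal illustration of codon-level tokenization (not CodonFM's tokenizer).
# Assumes an in-frame coding sequence whose length is divisible by 3.
def split_into_codons(cds: str) -> list[str]:
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be divisible by 3 (in-frame)")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]
MAX_CODON_TOKENS = 2046  # CodonFM context window, in codon tokens
codons = split_into_codons("ATGGCTGCCAAGTAA")
print(codons)                              # ['ATG', 'GCT', 'GCC', 'AAG', 'TAA']
print(len(codons) <= MAX_CODON_TOKENS)     # True; longer sequences are truncated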
To learn this biological language, CodonFM was trained on a curated set of 131 million protein-coding sequences from 22,000 species, encompassing hundreds of billions of codon tokens drawn from the National Institutes of Health – National Center for Biotechnology Information RefSeq database.
CodonFM is available in a series of different model sizes (80M, 600M, and 1B parameters) and two pretraining methods. As the models increase in scale, they more accurately distinguish between synonymous codons that encode the same amino acid. This reduction in codon confusion (the frequency with which the model mispredicts one synonymous codon for another) reflects a deeper understanding of codon usage patterns and translation-relevant sequence context (Figure 1).
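The codon-confusion idea can be illustrated in a few lines of Python; the predictions below are toy values, not actual model outputs, and the metric is a simplified reading of the description above.
# Illustrative sketch of the codon-confusion metric: the fraction of positions
# where a synonymous codon is predicted in place of the true one.
# Toy data only; not actual CodonFM predictions.
SYNONYMOUS_ARG = {"CGT", "CGC", "CGA", "CGG", "AGA", "AGG"}  # the six arginine codons
true_codons = ["CGC", "CGC", "CGA", "CGG"]
pred_codons = ["CGC", "CGA", "CGA", "AGA"]   # toy model predictions
confused = sum(
    1 for t, p in zip(true_codons, pred_codons)
    if p != t and t in SYNONYMOUS_ARG and p in SYNONYMOUS_ARG
)
print(f"codon confusion: {confused / len(true_codons):.2f}")  # 0.50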

Furthermore, each pretraining method offers unique advantages for the resulting models:
- Random codon masking: Randomly masks codons within a sequence, regardless of how often they appear in the pretraining corpus. This trains CodonFM to predict missing codons from their surrounding context, helping the model to learn the underlying grammar of the genetic code across a wide range of coding regions.
- Codon-weighted masking: Builds on random masking by selectively masking codons according to their usage bias, focusing on rare codon usage in certain sub-contexts of sequences. This allows the model to better capture patterns related to species-specific functional codon selection, rather than treating all codons equally. (Both masking strategies are sketched after this list.)
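The sketch below illustrates the difference between the two masking schemes under simplified assumptions: the codon frequencies and masking rate are made up for the example, and the real training code in the CodonFM repository is more involved.
# Illustrative sketch of random vs. codon-weighted masking (not the training code).
# `codon_freq` is a made-up corpus frequency table; rarer codons get a higher
# masking probability under the weighted scheme.
import numpy as np
rng = np.random.default_rng(0)
def random_mask(codons, mask_prob=0.15):
    # Every codon has the same chance of being masked.
    return rng.random(len(codons)) < mask_prob
def codon_weighted_mask(codons, codon_freq, mask_prob=0.15):
    # Bias masking toward rare codons via inverse corpus frequency,
    # keeping the average masking rate near mask_prob.
    weights = np.array([1.0 / codon_freq[c] for c in codons])
    probs = mask_prob * weights * len(weights) / weights.sum()
    return rng.random(len(codons)) < np.clip(probs, 0.0, 1.0)
codons = ["ATG", "CGC", "CGA", "AAG", "GAG"]
codon_freq = {"ATG": 0.022, "CGC": 0.010, "CGA": 0.006, "AAG": 0.032, "GAG": 0.040}
print(random_mask(codons))                       # boolean mask, uniform
print(codon_weighted_mask(codons, codon_freq))   # boolean mask, biased toward rare codons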
As the benchmarks in this post show, CodonFM demonstrates clear scaling laws. As model and dataset size increase, model accuracy improves across use cases such as synonymous and missense variant classification, mRNA translation efficiency, and protein abundance prediction.
Using CodonFM across biological tasks
CodonFM demonstrates broad applicability across both zero-shot and fine-tuned settings, enabling diverse molecular and clinical use cases. This section highlights CodonFM performance across various life sciences applications and provides code snippets for implementing this model for each task.
Mutation effect size prediction
CodonFM models the coding sequence itself—capturing codon context, redundancy, and regulatory patterns—without explicitly relying on protein structure. This enables the fine-tuned 1B-parameter Encodon model, collaboratively developed by NVIDIA and Arc Institute, to achieve robust performance in detecting pathogenic missense mutations. It demonstrates high accuracy in distinguishing disease-associated amino acid substitutions from benign variants.

More importantly, CodonFM extends this capability to the much harder problem of interpreting synonymous variants. Synonymous mutations leave the protein sequence unchanged and have historically eluded prediction models. Encodon detects subtle shifts in codon usage and translation-level effects. It achieves best-in-class discrimination of pathogenic versus benign synonymous variants in ClinVar, demonstrating its unique ability to interpret even silent mutations.

The following code snippet demonstrates how to perform mutation scoring tasks using the pretrained Encodon models:
# Task: score the effect of a single synonymous/missense mutation at a given codon
# Output: log-likelihoods for the ref/alt codons plus an LLR per variant
# For more detail, see the CodonFM source code:
# src/data/preprocess/mutation_pred.py and src/data/mutation_dataset.py
import numpy as np
import torch
# EncodonInference, TaskTypes, and MetadataFields are provided by the CodonFM
# repository; import them from your local checkout.
# 1) Configure the model checkpoint
CKPT_PATH = "/path/to/NV-CodonFM-Encodon-1B-v1.ckpt"  # change to your .ckpt
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
enc = EncodonInference(
    model_path=CKPT_PATH,
    task_type=TaskTypes.MUTATION_PREDICTION,
    # ... additional constructor arguments elided ...
)  # <- routes to predict_mutation
enc.configure_model() # loads Tokenizer + EncodonPL + weights
enc.to(DEVICE)
# 2) Prepare a mutation example and build a model batch
# Example CDS (coding DNA sequence) length must be divisible by 3 and in-frame.
cds = ("ATGCCGGCGGTCAAGAAGGAGTTCCCGGGCCGCGAGGACCTGGCCCTGGCTCTGGCCACGTTCCACCCGACC") # <--- replace with your full coding sequence (no introns, 5'->3')
# Choose a 0-based CODON index (not nucleotide index).
codon_idx = 10 #codon 10 in the CDS
# Define the ref codon present at that position and the alternate codon to test
ref_codon = "CGC"
alt_codon = "CGA" # e.g., a synonymous change
# --- Tokenize the full CDS into codon tokens using the tokenizer ---
tok = enc.tokenizer
context_length = 2048
cds = cds[:(context_length - 2) * 3] # truncate to 2,046 codons, leaving room for <cls> and <sep>
cds = '<cls>' + cds + '<sep>'
# Encode the full sequence to input IDs. The tokenizer works at the codon resolution.
input_ids = np.array(tok.encode(cds), dtype=np.int32)
attention_mask = np.ones_like(input_ids, dtype=bool)
# Pad input_ids and attention_mask out to the full context length
input_ids = np.pad(input_ids, (0, context_length - len(input_ids)), ...)
attention_mask = np.pad(attention_mask, (0, context_length - len(attention_mask)), ...)
mutation_token_idx = codon_idx + 1 # the position in the tokenized sequence is shifted by 1 because of the <cls> token
With input_ids and the attention mask prepared, look up the token IDs for the ref and alt codons, mask the mutated position, and assemble a batch. Then run inference with the predict_mutation function.
ref_tok = tok.convert_tokens_to_ids(ref_codon)
alt_tok = tok.convert_tokens_to_ids(alt_codon)
input_ids[mutation_token_idx] = tok.mask_token_id # replace the mutated codon with mask token
batch = {
MetadataFields.INPUT_IDS: torch.tensor(input_ids, dtype=torch.int32, device=DEVICE).unsqueeze(0),
MetadataFields.ATTENTION_MASK: torch.tensor(attention_mask, dtype=torch.bool, device=DEVICE).unsqueeze(0),
# ... additional batch fields elided ...
}
# 3) Run inference and interpret LLRs
out = enc.predict_mutation(batch, ids=["example_variant"])
print("IDs: ", out.ids[0])
print("log P(ref codon): ", out.ref_likelihoods[0])
print("log P(alt codon): ", out.alt_likelihoods[0])
print("LLR (ref - alt): ", out.likelihood_ratios[0])
Under the hood, this script uses the Encodon masked-language model to assess how a mutation alters codon probability within its sequence context. It masks the target codon, predicts likelihoods for the reference and alternate versions, and computes their log-likelihood ratio (LLR). A positive LLR indicates the mutation is less natural or potentially disruptive, while a negative LLR suggests it is tolerated or contextually favored.
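Continuing from the snippet above, the following lines show one simple way to turn the LLR into a coarse label; the zero threshold is an illustrative default rather than a calibrated cutoff from the CodonFM benchmarks.
# Coarse interpretation of the LLR returned by predict_mutation above.
# The 0.0 threshold is illustrative only, not a calibrated decision boundary.
llr = float(out.likelihood_ratios[0])  # log P(ref codon) - log P(alt codon)
if llr > 0.0:
    print(f"LLR={llr:.3f}: alt codon is less likely in context (potentially disruptive)")
else:
    print(f"LLR={llr:.3f}: alt codon is tolerated or contextually favored")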
mRNA therapeutic design
mRNA design is rapidly emerging as a major modality in modern therapeutics, enabling gene replacement, protein restoration, and the development of programmable biologics. A key challenge in this area is sequence optimization—even small peptides or proteins can be encoded by an enormous number of synonymous mRNA sequences, each influencing expression, stability, and immunogenicity in different ways.
Subtle choices in codon usage and sequence context can significantly affect translational outcomes. CodonFM delivers a best-in-class predictive framework for these applications, achieving state-of-the-art performance across diverse mRNA stability and expression benchmarks. This includes zero-shot prediction of protein abundance and translation efficiency, providing a foundation for optimized mRNA design.
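To make the combinatorics concrete, the following sketch enumerates the synonymous encodings of a short peptide; the codon table is truncated to a few amino acids for brevity, and the scoring step is left as a placeholder rather than an actual CodonFM API call.
# Illustrative sketch: count and enumerate synonymous mRNA encodings of a short
# peptide. The codon table is truncated for brevity; scoring each candidate with
# CodonFM is left as a placeholder.
from itertools import product
SYNONYMOUS = {                       # standard genetic code, truncated for this example
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "R": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
}
peptide = "MKLR"
choices = [SYNONYMOUS[aa] for aa in peptide]
n_variants = 1
for c in choices:
    n_variants *= len(c)
print(f"{peptide} can be encoded by {n_variants} synonymous coding sequences")  # 1*2*6*6 = 72
for codons in product(*choices):
    cds = "".join(codons)
    # score = model.score(cds)  # placeholder: rank candidates with a CodonFM-based predictor
    ...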

Fine-tuning with CodonFM
The CodonFM repository includes implementations of multiple fine-tuning strategies that let users adapt the pretrained models to their own use cases. These strategies include:
- Low-Rank Adaptation (LoRA): Adds low-rank adapters to each transformer layer of the pretrained model and fine-tunes only those adapters, reducing training cost and memory usage.
- Head-Only Random: Trains a randomly initialized output head while the rest of the model is kept frozen (this pattern is sketched after the list).
- Head-Only Pretrained: Trains a pretrained output head while the rest of the model is kept frozen.
- Full: Fine-tunes all parameters of the model end-to-end.
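As a minimal sketch of the head-only pattern, the example below freezes a backbone and trains only a regression head on toy data. It assumes a generic PyTorch setup, and the tiny dummy module stands in for a pretrained CodonFM encoder; it is not the repository's actual fine-tuning code.
# Minimal sketch of head-only fine-tuning: freeze the pretrained backbone and
# train only a new task head. DummyBackbone is a stand-in, not a CodonFM class.
import torch
import torch.nn as nn
HIDDEN = 64  # stand-in embedding size; the real encoders are much larger
class DummyBackbone(nn.Module):
    """Stand-in for a pretrained CodonFM encoder (illustration only)."""
    def __init__(self, vocab_size=70, hidden=HIDDEN):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
    def forward(self, input_ids, attention_mask):
        h = self.embed(input_ids)                        # (batch, seq, hidden)
        mask = attention_mask.unsqueeze(-1).float()
        return (h * mask).sum(dim=1) / mask.sum(dim=1)   # mean-pooled embedding
backbone = DummyBackbone()
for p in backbone.parameters():
    p.requires_grad = False                              # freeze the foundation model
head = nn.Sequential(nn.Linear(HIDDEN, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()
# One toy training step on random data, standing in for a real dataloader.
input_ids = torch.randint(0, 70, (8, 128))
attention_mask = torch.ones(8, 128, dtype=torch.bool)
target = torch.rand(8)
with torch.no_grad():
    emb = backbone(input_ids, attention_mask)            # frozen features
pred = head(emb).squeeze(-1)
loss = loss_fn(pred, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"toy loss: {loss.item():.4f}")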
Toward programmable biology
Just as language models have learned to reason and protein models to fold, CodonFM learns the rules that connect RNA codon usage to RNA behavior and protein expression. It transforms RNA from a passive carrier of genetic information into a programmable language, one that can be interpreted, optimized, and redesigned.
This capability to read and write the language of life is a cornerstone of the NVIDIA Virtual Cell initiative. Releasing open, powerful, and scalable models like CodonFM enables researchers and developers to build AI systems that not only understand biology but also actively shape it.
We invite you to join our collaborators from Arc Institute, Therna Biosciences, Greenstone Biosciences, Moonwalk Biosciences, and the Stanford RNA Medicine Program to explore and test CodonFM as part of this shared effort to advance biological intelligence.
Get started with CodonFM
CodonFM was trained using the same core infrastructure that powers other Clara open models, leveraging GPU-native acceleration through the NVIDIA cuDNN and NVIDIA cuBLAS libraries for optimized matrix operations. Tokenized genomic input datasets were converted to memory-mapped files to enable fast, efficient data streaming, while NVIDIA NeMo Run served as the central training configuration and orchestration framework.
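The memory-mapped layout can be illustrated with a short NumPy sketch; the file name, shape, and dtype below are assumptions for illustration, not CodonFM's actual on-disk format.
# Illustrative sketch of memory-mapped token storage for fast data streaming.
# The file name, shape, and dtype are assumptions, not CodonFM's actual format.
import numpy as np
n_sequences, context_length = 1_000, 2048
path = "codon_tokens.memmap"  # hypothetical file
# Write tokenized codon IDs once...
tokens = np.memmap(path, dtype=np.int32, mode="w+", shape=(n_sequences, context_length))
tokens[:] = np.random.randint(0, 70, size=tokens.shape)
tokens.flush()
# ...then stream batches later without loading the whole file into RAM.
tokens_ro = np.memmap(path, dtype=np.int32, mode="r", shape=(n_sequences, context_length))
batch = np.asarray(tokens_ro[0:32])   # only this slice is paged in
print(batch.shape)                    # (32, 2048)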
Optionally, NVIDIA Transformer Engine, available through NVIDIA BioNeMo Framework recipes, can be used to accelerate model training and fine-tuning by up to 3x with negligible accuracy loss, ensuring both scalability and computational efficiency.
Ready to get started with CodonFM?
- Find the full code on the NVIDIA-Digital-Bio/CodonFM GitHub repository
- See the model checkpoints on Hugging Face and NGC
- Read more about CodonFM
- Use built-in recipes for accelerated training and inference with BioNeMo Framework recipes
Acknowledgments
We would like to acknowledge the following people for their support and contributions to this post: Sajad Darabi, Fan Cao, Mohsen Naghipourfar, Hani Goodarzi, Sara Rabhi, Yingfei Wang, William Greenleaf, Yang Zhang, Cory Ye, Jonathan Mitchell, Timur Rvachov, T.J. Chen, Daniel Burkhardt, and Neha Tadimeti.