Predicting where proteins are located inside a cell is critical in biology and drug discovery. This process is known as subcellular localization. The location of a protein is tightly linked to its function. Knowing whether a protein resides in the nucleus, cytoplasm, or cell membrane can unlock new insights into cellular processes and potential therapeutic targets.
This post explains how researchers can collaboratively train AI models to predict protein properties such as subcellular location—without moving sensitive data across institutions—using NVIDIA FLARE and NVIDIA BioNeMo Framework.
How to fine-tune a model for subcellular localization
A new NVIDIA FLARE tutorial demonstrates how to fine-tune an ESM-2nv model to classify proteins by their subcellular localization. The ESM-2nv model learns from embeddings of protein sequences, leveraging datasets introduced in the paper "Light Attention Predicts Protein Location from the Language of Life."
The tutorial focuses on subcellular localization prediction. The data is formatted as FASTA files following the biotrainer standard, and each record includes the sequence, the training/validation split, and the location class (one of 10, for example Nucleus or Cell_membrane).

A data sample in this FASTA format looks like this:
>Sequence1 TARGET=Cell_membrane SET=train VALIDATION=False
MMKTLSSGNCTLNVPAKNSYRMVVLGASRVGKSSIVSRFLNGRFEDQYTPTIEDFHRKVYNIHGDMYQLDILDTSGNHPFPAMRRLSILT
GDVFILVFSLDSRESFDEVKRLQKQILEVKSCLKNKTKEAAELPMVICGNKNDHSELCRQVPAMEAELLVSGDENCAYFEVSAKKNTNVNE
MFYVLFSMAKLPHEMSPALHHKISVQYGDAFHPRPFCMRRTKVAGAYGMVSPFARRPSVNSDLKYIKAKVLREGQARERDKCSIQ
Where:
- TARGET = subcellular location class
- SET = training versus test data
- VALIDATION = marks validation sequences
The dataset spans 10 location classes, making it an excellent real-world classification challenge.
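To make the header layout concrete, here is a minimal sketch of a parser for records in this format. It is not the biotrainer or BioNeMo loader; the names `ProteinRecord`, `parse_header`, and `parse_fasta` are hypothetical, and the sketch assumes every header carries the `TARGET`, `SET`, and `VALIDATION` fields shown above.

```python
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    seq_id: str
    target: str          # subcellular location class, e.g. "Cell_membrane"
    split: str           # value of SET, e.g. "train" or "test"
    is_validation: bool  # True when VALIDATION=True
    sequence: str


def parse_header(header):
    """Split '>ID KEY=VALUE ...' into the ID and a dict of fields."""
    parts = header.lstrip(">").split()
    fields = dict(part.split("=", 1) for part in parts[1:])
    return parts[0], fields


def _build(header, chunks):
    seq_id, fields = parse_header(header)
    return ProteinRecord(
        seq_id=seq_id,
        target=fields["TARGET"],
        split=fields["SET"],
        is_validation=fields["VALIDATION"] == "True",
        sequence="".join(chunks),  # sequences may wrap over several lines
    )


def parse_fasta(text):
    """Parse FASTA text into a list of ProteinRecord objects."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_build(header, chunks))
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        records.append(_build(header, chunks))
    return records


# Sequence truncated to its first 20 residues for brevity.
sample = """>Sequence1 TARGET=Cell_membrane SET=train VALIDATION=False
MMKTLSSGNC
TLNVPAKNSY
"""
for record in parse_fasta(sample):
    print(record.seq_id, record.target, record.split, len(record.sequence))
```

Keeping the split and validation flags on each record makes it straightforward to filter training, validation, and test sequences before tokenization.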
How to use federated learning with BioNeMo protein language models
Running this example is refreshingly simple. With BioNeMo Framework v2.5 in Docker, you can spin up a Jupyter Lab environment directly and run the Federated Protein Property Prediction with BioNeMo tutorial notebook in your browser.
On top of the BioNeMo framework, NVIDIA FLARE is used to bring in federated training. Instead of pooling datasets from multiple sites, each participant trains locally and contributes only model updates. With FedAvg, those updates are aggregated centrally to form a shared global model—privacy preserved, collaboration enabled.
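The aggregation idea can be illustrated with a small, library-free sketch. This is not the NVIDIA FLARE API; `fedavg` is a hypothetical helper, model weights are represented as plain lists of floats, and weighting each client by its sample count is an assumption (it is the usual FedAvg formulation, but the post does not state how the tutorial weights clients).

```python
def fedavg(client_updates):
    """Average client weights, weighted by each client's sample count.

    client_updates: list of (weights, num_samples) pairs, where weights
    maps a parameter name to a list of floats of identical length on
    every client.
    """
    total_samples = sum(num_samples for _, num_samples in client_updates)
    if total_samples <= 0:
        raise ValueError("total number of samples must be positive")

    reference_weights = client_updates[0][0]
    averaged = {}
    for name, reference in reference_weights.items():
        accumulator = [0.0] * len(reference)
        for weights, num_samples in client_updates:
            scale = num_samples / total_samples
            for i, value in enumerate(weights[name]):
                accumulator[i] += scale * value
        averaged[name] = accumulator
    return averaged


# Two toy clients: the second holds three times as much data,
# so its weights count three times as much in the average.
updates = [
    ({"layer": [0.0, 4.0]}, 1),
    ({"layer": [4.0, 0.0]}, 3),
]
print(fedavg(updates))
```

Only the weight dictionaries and sample counts cross the network in this scheme; the protein sequences themselves never leave the client.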
Training and visualization
For this demonstration, the team fine-tuned the 650-million-parameter ESM-2nv model, pretrained in BioNeMo. At this size, the model offers a strong balance between predictive accuracy and computational efficiency, making it well-suited for federated training scenarios.
Key steps in the workflow include:
- Data splitting: Heterogeneous sampling is applied to mimic the variability one would expect across real-world institutions. This ensures the federated setup more closely reflects practical deployment conditions.
- Federated averaging (FedAvg): Local client updates are aggregated into a shared global model, enabling collaboration without exposing raw protein sequence data.
- Visualization with TensorBoard: Researchers can monitor both local and federated training runs in real time. Continuous server-side metrics provide insight into how the global model evolves with each communication round.
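The heterogeneous data-splitting step above can be sketched as follows. This is an illustration under stated assumptions, not the tutorial's splitting code: it assumes heterogeneity is controlled by the concentration parameter of a Dirichlet distribution (called `alpha` here), which is a common way to simulate non-identical sites, and `dirichlet_partition` is a hypothetical helper. Smaller `alpha` values give more skewed class mixes per site; larger values approach a uniform split.

```python
import random
from collections import defaultdict


def dirichlet_partition(labels, num_sites, alpha, seed=0):
    """Assign sample indices to sites with class proportions drawn
    from a symmetric Dirichlet(alpha) distribution, one draw per class."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    sites = [[] for _ in range(num_sites)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)

        # A Dirichlet sample is a vector of Gamma draws, normalized to sum to 1.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_sites)]
        total = sum(draws)
        proportions = [draw / total for draw in draws]

        # Turn proportions into contiguous slices of this class's indices.
        start, cumulative = 0, 0.0
        for site in range(num_sites):
            cumulative += proportions[site]
            if site == num_sites - 1:
                end = len(indices)  # last site takes whatever remains
            else:
                end = round(cumulative * len(indices))
            sites[site].extend(indices[start:end])
            start = end
    return sites


# Toy label list; class names follow the dataset, counts are made up.
labels = ["Nucleus"] * 50 + ["Cell_membrane"] * 30 + ["Cytoplasm"] * 20
site_indices = dirichlet_partition(labels, num_sites=3, alpha=1.0)
print([len(indices) for indices in site_indices])
```

Every sample lands on exactly one site, so the union of the site datasets equals the original dataset while each site sees a different class mix.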

Results
The team compared local training at each site against federated training (FedAvg) under heterogeneous data conditions (alpha = 1.0).
Client | # Samples | Local accuracy | FedAvg accuracy |
Site-1 | 1,844 | 78.2 | 81.8 |
Site-2 | 2,921 | 78.9 | 81.3 |
Site-3 | 2,151 | 79.2 | 82.1 |
Average | — | 78.8 | 81.7 |
These results highlight how federated learning leverages knowledge across institutions to build a stronger model than any site could achieve alone.
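As a quick check on the comparison, the per-site gains and the averages can be recomputed from the reported values. The numbers are copied from the results above; the `summarize` helper is hypothetical, and the averages are unweighted means across sites.

```python
# (num_samples, local_accuracy, fedavg_accuracy) as reported per site.
results = {
    "Site-1": (1844, 78.2, 81.8),
    "Site-2": (2921, 78.9, 81.3),
    "Site-3": (2151, 79.2, 82.1),
}


def summarize(results):
    """Return per-site gains and the unweighted mean accuracies,
    each rounded to one decimal place."""
    gains = {
        site: round(fedavg - local, 1)
        for site, (_, local, fedavg) in results.items()
    }
    num_sites = len(results)
    mean_local = round(sum(local for _, local, _ in results.values()) / num_sites, 1)
    mean_fedavg = round(sum(fedavg for _, _, fedavg in results.values()) / num_sites, 1)
    return gains, mean_local, mean_fedavg


gains, mean_local, mean_fedavg = summarize(results)
print(gains)
print(mean_local, mean_fedavg)
```

Every site improves under FedAvg, with gains between 2.4 and 3.6 accuracy points, and the recomputed means match the Average row.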

Benefits of using BioNeMo and FLARE for protein prediction
The benefits of using BioNeMo and FLARE extend beyond predicting where proteins localize in a cell. This approach helps the community build AI for science together. With BioNeMo plus FLARE:
- Federated learning strengthens protein property prediction: Pool collective intelligence without sharing raw data.
- Collaboration benefits everyone: Each site contributes to a stronger model while keeping sensitive data local.
- BioNeMo Framework accelerates discovery: Access state-of-the-art tools for biological sequence analysis.
Get started with federated protein prediction
Federated protein property prediction with NVIDIA BioNeMo and NVIDIA FLARE is part of a powerful new paradigm. Combining the language of life (protein sequences) with federated AI workflows can accelerate discoveries in drug development, healthcare, and biotech—all while respecting data privacy.
The future of life sciences AI isn't siloed; it's collaborative. And with FLARE and BioNeMo, that future is already here. Visit the NVIDIA/NVFlare GitHub repo to get started with Federated Protein Property Prediction with BioNeMo and to see more advanced examples.