Agentic AI / Generative AI

Breakthrough in Functional Annotation with HiFi-NN

A triptych of digital graphics showcasing biomolecular structures resembling biomes, DNA, cells, and data points.

Dec 19, 2023

By Bruno Trentini and Phil Lorenz

Discuss (0)

AI-Generated Summary

Dislike

Basecamp Research, a London-based Bio-AI company and NVIDIA Inception member, developed a Hierarchically Fine-tuned Nearest Neighbor method (HiFi-NN) using NVIDIA GPUs, which significantly improved enzyme annotation by over 15% compared to state-of-the-art models.
HiFi-NN was trained on a large knowledge graph called BaseGraph, containing over 5.5B relationships and a genomic context exceeding 70 kilobases per protein, which was built from data collected during global expeditions to diverse biomes across five continents and 23 countries.
By leveraging proprietary sequences from BaseGraph and utilizing the hierarchical nature of the enzyme commission annotation system, HiFi-NN outperformed existing models in precision, recall, and F1 scores, and was able to annotate the entire human proteome within 24 minutes on a single NVIDIA A100 GPU.

AI-generated content may summarize information incompletely. Verify important information. Learn more

Enzymes are vital biological catalysts for a multitude of processes, from cellular metabolism to industrial manufacturing. The applications of artificial intelligence for enzyme generation is an exciting field of research with direct applications in the life sciences. Advances in these scientific challenges are a critical necessity to further advance drug discovery, environmental science, and bioengineering.

Currently, only a tiny fraction of Earth’s vast array of life forms has been sequenced, hindering the broader application and generalization of machine learning algorithms within the complex realm of sequence design. Improved methods for functional annotation are a vital component in enzyme research, enabling the identification and characterization of the functions of newly discovered enzymes. This is key to understanding complex biological processes and enhancing the data used for generative workflows.

Basecamp Research, a London-based Bio-AI company and NVIDIA Inception member, recently used NVIDIA GPUs to train a Hierarchically Fine-tuned Nearest Neighbor method (HiFi-NN). This approach has shown significant improvements over existing models in recall, precision, and F1 scores, surpassing the state-of-the-art (SoTA) in enzyme annotation by over 15%.

Unique data collected in global expeditions

Basecamp Research tackles the most complex biological design challenges across the biotech industry. Existing datasets have significant shortcomings:

Small representativity, covering only 0.001% of life on earth
No consistent metadata
Lack of stakeholder consent and engagement before data collection

Basecamp opted to develop its proprietary biological data resource through biodiversity partnerships with nature parks across five continents and 23 countries. They sent their scientists on worldwide expeditions to discover new genomes, enzymes, and biological relationships from the most extreme and extraordinary biomes.

In under two years, they created BaseGraph, the largest knowledge graph of natural biodiversity, containing over 5.5B relationships with a genomic context exceeding 70 kilobases per protein. Their extensive long-read sequencing is complemented by comprehensive metadata collection, enabling them to link proteins of interest to specific reactions and desired process conditions.

Basecamp Research’s AI strategy is data-centric for two reasons: their proprietary data enhances model performance in a rapidly commoditizing algorithmic landscape and significantly compensates for the lack of diversity in publicly available data.

Their knowledge graph, built from the ground up, captures and re-creates the complexities of four billion years of protein evolution in nature. This data advantage enables their AI and product teams to outperform SoTA annotation and design models, addressing complex design challenges in biotech industries, from gene-writing therapeutics to plastic degradation.

In silico functional annotation

To tackle the challenge of in silico functional annotation of proteins and enzymes, Basecamp Research’s deep learning (DL) team developed HiFi-NN search. HiFi-NN annotates protein sequences with enzyme commission (EC) numbers beating the current bioinformatics tool of choice (blastp), as well as other SoTA DL models, including CLEAN, in precision and recall (Table 1).

Method	Recall	Precision	F1-score
ECPred	0.0197	0.0197	0.0197
DEEPpre	0.0403	0.0415	0.0386
DeepEC	0.0724	0.1184	0.0846
ProteInfer	0.1382	0.2434	0.1662
ProteinVec	0.2961	0.4901	0.3378
BLASTp	0.3750	0.5083	0.3852
DeepECtransformer	0.3026	0.5263	0.3511
CLEAN	0.4671	0.5844	0.4947
HiFi-NN (Swissprot)	0.5724	0.5505	0.5304
HiFi-NN (Swissprot + 3M curated sequences)	0.5921	0.6657	0.6015

Table 1. HiFi-NN results compared to existing tools and other SoTA DL models

BCR 3M refers to the version of the model retrained with 3M selected, environmentally diverse sequences from Basecamp Research’s BaseGraph.

Developed in collaboration with advisors Noelia Ferruz and Kevin Yang, HiFi-NN was accepted by NeurIPS, a prestigious AI conference, for presentation at their Machine Learning for Structural Biology workshop in December 2023.

The model uses contrastive learning of EC numbers and the inherent hierarchy of the enzyme commission annotation system for natural augmentation. Trained on eight NVIDIA A100 GPUs on a Lambda Labs instance with CUDA version 11.8 and NCCL version 2.14.3, it employs PyTorch Lightning for distributed-data parallel training. Experiment management and tracking are done through the Hydra framework and Weights and Biases. The model boasts over 3M parameters.

HiFi-NN’s superior performance in SoTA annotation methods is attributed to using the hierarchical nature of enzyme function represented in the EC numbering system and supplementing the training set with proprietary sequences from Basecamp’s knowledge graph.

Proprietary sequences

The proprietary Basecamp sequences that HiFi-NN is supplemented with originate from environments spanning five continents, and a 110^o C temperature range to ensure as much sequence and environmental diversity in the training set as possible. As a result, HiFi-NN outperforms all SoTA models on benchmarking datasets and performs particularly well on protein sequences from functional dark matter, that is, those sequences with low similarity to any known enzymes.

In fact, the Basecamp team used HiFi-NN to annotate a large representative portion of the MGnify microbial protein database that was previously not annotated.

In addition to outperforming all previous annotation models, HiFi-NN is also particularly easy to use and generates annotation labels at great speed. For example, it annotates the entire human proteome within 24 minutes on a single NVIDIA A100 GPU.

Bolstering AI development in life sciences

Functional annotation plays a pivotal role for researchers, especially in practical scenarios:

In drug discovery, it aids in the creation of targeted treatments by elucidating enzyme interactions within the body.
In the realm of industrial biotechnology, it enables the bespoke design of enzymes, tailored for specific industrial applications, promoting more environmentally friendly production methods.
It also offers crucial insights into evolutionary biology by revealing the developmental trajectory of enzymes across different species.

In essence, functional annotation, driven by machine learning, transcends its role as a mere scientific instrument. it acts as a catalyst for innovation and exploration across diverse sectors, including healthcare and environmental science.

Basecamp Research, along with other life sciences entities, is integrating its workflow with NVIDIA BioNeMo, a generative AI platform geared towards drug discovery. This platform streamlines and expedites the training of models. With BioNeMo, organizations can tailor and deploy AI models for various purposes, including 3D protein structure prediction, de novo protein and small molecule generation, property prediction, and molecular docking.

If you’re interested in exploring BioNeMo, opportunities are available through the BioNeMo Framework beta release or the API early access program., For more information, see the NVIDIA BioNeMo product page.

Discuss (0)

About the Authors

About Bruno Trentini
Bruno Trentini is a developer relations manager at NVIDIA, where he leads Global Research Alliances to develop high-performance machine learning for life sciences. Bruno holds a master’s degree in Data Science & Machine Learning from University College London and has over a decade of technical experience working with startups and large enterprises.

View all posts by Bruno Trentini

About Phil Lorenz
Phil Lorenz is the CTO at Basecamp Research where he leads the genomics, deep learning, and product development functions. With previous experience in the pharmaceutical industry, Phil holds a Ph.D. in genomics and machine learning as well as a master’s degree in biochemistry from the University of Oxford.

View all posts by Phil Lorenz