Predict Protein Structures and Properties with Biomolecular Large Language Models

Discuss (0)
Biomolecular structure

The NVIDIA BioNeMo service is now available for early access. At GTC Fall 2022, NVIDIA unveiled BioNeMo, a domain-specific framework and service for training and serving biomolecular large language models (LLMs) for chemistry and biology at supercomputing scale across billions of parameters. 

The BioNeMo service is domain-optimized for chemical, proteomic, and genomic applications, designed to support molecular data represented in the SMILES notation for chemical structures, and FASTA for amino acid and nucleic acid sequences for proteins, DNA, and RNA.

With the BioNeMo service, scientists and researchers now have access to pretrained biomolecular LLMs through a cloud API, enabling them to predict protein structures, develop workflows, and fit downstream task models from LLM embeddings.

The BioNeMo service is a turnkey cloud solution for AI drug discovery pipelines that can be used in your browser or through API endpoints. The service API endpoints offer scientists the ability to get started quickly with AI drug discovery workflows based on large language model architectures. It also provides a UI Playground to easily and quickly try these models through an API, which can be integrated into your applications.

The BioNeMo service contains the following features:

  • Fully managed, browser-based service with API endpoints for protein LLMs
  • Accelerated OpenFold model for fast 3D protein structure predictions
  • ESM-1nv LLM for protein embeddings for downstream tasks
  • Interactive inference and visualization of protein structures through a graphic user interface (GUI)
  • Programmatic access to pretrained models through the API

About the models

ESM-1nv, based on Meta AI’s state-of-the-art ESM-1b, is a large language model for the evolutionary-scale modeling of proteins. It is based on the BERT architecture and trained on millions of protein sequences with a masked language modeling objective. ESM-1nv learns the patterns and dependencies between amino acids that ultimately give rise to protein structure and function.

Embeddings from ESM-1nv can be used to fit downstream task models for protein properties of interest such as subcellular location, thermostability, and protein structure. This is accomplished by training a typically much smaller model with a supervised learning objective to infer a property from ESM-1nv embeddings of protein sequences. Using embeddings from ESM-1nv typically results in far superior accuracy in the final model.

OpenFold is a faithful reproduction of DeepMind’s AlphaFold-2 model for 3D protein structure prediction from a primary amino acid sequence. This long-standing grand challenge in structural biology reached a significant milestone at CASP14, where AlphaFold-2 achieved nearly experimental accuracy for predicted structures. While AlphaFold was developed for a JAX workflow, OpenFold bases its code on PyTorch. 

OpenFold in BioNeMo is also trainable, meaning variants may be created for specialized research. OpenFold achieves similar accuracy to the original model and predicts the median backbone at an accuracy of 0.96 Å RMSD95 and is up to 6x faster due to changes made in the MSA generation step. This means that drug discovery researchers get 3D protein structure predictions very quickly. 

Get early access to the BioNeMo service

Apply for early access to the BioNeMo service. You’ll be asked to join the NVIDIA Developer Program and fill out a short questionnaire to gain your early access.