Proteins rarely function in isolation as individual monomers. Most biological processes are governed by proteins interacting with other proteins, forming protein complexes whose structures are described in the hierarchy of protein structure as the quaternary representation.
This is one level of complexity above the tertiary representation: the 3D structure of individual monomers, which has become widely available through the Protein Data Bank and, more recently, AlphaFold2.
Structural information for the vast majority of complexes remains unavailable. While the AlphaFold Protein Structure Database (AFDB), jointly developed by Google DeepMind and EMBL’s European Bioinformatics Institute (EMBL-EBI), transformed access to monomeric protein structures, interaction-aware structural biology at the proteome scale has remained a bottleneck with unique challenges:
- Massive combinatorial interaction space
- High computational cost for multiple sequence alignment (MSA) generation and protein folding
- Inference scaling across millions of complexes
- Confidence calibration and benchmarking
- Dataset consistency and biological interpretability
In recent work, we extended the AFDB with large-scale predictions of homomeric protein complexes generated by a high-throughput pipeline based on AlphaFold-Multimer—made possible by NVIDIA accelerated computing. Additionally, we predicted heteromeric complexes to compare the accuracy of different complex prediction modalities.
In particular, for the predictions of these datasets, we leveraged kernel-level accelerations from MMseqs2-GPU for MSA generation, and NVIDIA TensorRT and NVIDIA cuEquivariance for deep-learning-based protein folding. We then mapped the workload to HPC-scale inference by maximizing the utilization of all available GPUs, including scale-out to multiple clusters.
This blog describes the major principles we applied to increase protein folding throughput, from adopting accelerated libraries and SDKs to optimizations that reduce the computational complexity of the workload. You can borrow from these techniques to set up a similar pipeline yourself.
So, if you are a:
- Computational biologist scaling structure prediction pipelines
- AI researcher training generative protein models
- HPC engineer optimizing GPU workloads
- Bioinformatics team building structural resources
You will learn how to:
- Design a proteome-scale complex prediction strategy
- Separate MSA generation from structure inference for efficiency
- Scale AlphaFold-Multimer workflows across GPU clusters
Prerequisites
Technical knowledge
- Python and shell scripting
- SLURM as HPC workload scheduler
- Basic structural biology
- Familiarity with AlphaFold/ColabFold/OpenFold or similar pipelines
Infrastructure
- We describe scaling on a multi-GPU and multi-node NVIDIA DGX H100 SuperPOD cluster
- This cluster includes high-speed storage to store MSAs and intermediate outputs
Software
- Access to MMseqs2-GPU
- Familiarity with TensorRT
- If not using a model with integrated cuEquivariance, knowledge of triangular attention and triangular multiplication operations
Procedure/Steps
1. Define the dataset you’d like to compute
Begin by defining the scope of prediction. Because predicting protein complexes quickly becomes a combinatorial problem, it’s useful to decide upfront which complexes are most interesting. If your proteomes are small enough, an all-against-all (dimeric) complex prediction might be tractable; this changes quickly once you want to predict complexes across large sets of proteomes.
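To make the combinatorics concrete, here is a quick back-of-the-envelope sketch; the proteome sizes are illustrative, not taken from our dataset:

```python
def dimer_count(n: int) -> int:
    # All unordered pairs of n proteins, including homodimers (i == j):
    # n * (n + 1) / 2
    return n * (n + 1) // 2

# A small bacterial proteome (~4,000 proteins) already yields millions
# of candidate dimers; a human-sized proteome (~20,000) is far worse.
print(dimer_count(4_000))   # 8002000
print(dimer_count(20_000))  # 200010000
```

This is why evidence-driven selection (the next paragraphs) matters: pruning the pair list before folding is the single biggest lever on total compute.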
Here’s how we decided to go about it:
Homomeric complexes: We selected all proteomes represented in the AFDB and sorted them by perceived importance (e.g., proteomes of human concern or commonly accessed). This allowed us to rank proteomes for computation in a particular order, making execution more manageable.
Heteromeric complexes: This is where things can get complicated, fast. For our heteromeric runs, we focused on complexes originating from several reference proteomes and proteomes included in the WHO list of important proteomes. Because the number of complexes that can be derived from these proteomes is intractable, we restricted our runs to dimers (complexes of two proteins) within the same proteome (no inter-proteome complexes) that had “physical” interaction evidence in STRING. Because we sought coverage, we considered all interactions reported in STRING for these proteomes rather than filtering further. Evidence in the literature suggests that filtering for STRING scores >700 can further reduce the number of inputs while increasing the likelihood of well-predicted complexes.
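The STRING-based selection above can be sketched as a small filter. This is a minimal illustration, assuming STRING’s protein.physical.links flat-file layout (a header row, then space-separated protein1, protein2, combined_score, with IDs prefixed by an NCBI taxon ID); the helper name and toy IDs are hypothetical:

```python
def select_dimer_candidates(lines, min_score=0):
    """Yield unique intra-proteome dimer pairs from STRING physical links.

    `lines` iterates over a STRING protein.physical.links file:
    a header row, then 'protein1 protein2 combined_score' per line.
    STRING IDs carry a taxon prefix (e.g. '9606.ENSP...'), so requiring
    equal prefixes keeps only same-proteome pairs.
    """
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or fields[2] == "combined_score":
            continue  # skip header / malformed rows
        a, b, score = fields[0], fields[1], int(fields[2])
        if score < min_score:
            continue  # optional confidence cutoff (e.g. 700)
        if a.split(".", 1)[0] != b.split(".", 1)[0]:
            continue  # drop inter-proteome pairs
        pair = tuple(sorted((a, b)))
        if pair not in seen:  # deduplicate A-B vs B-A
            seen.add(pair)
            yield pair

# Toy snippet of STRING-style rows:
rows = [
    "protein1 protein2 combined_score",
    "9606.P1 9606.P2 810",
    "9606.P2 9606.P1 810",   # reverse duplicate, removed
    "9606.P1 9606.P3 250",
]
print(list(select_dimer_candidates(rows, min_score=700)))
# [('9606.P1', '9606.P2')]
```

With `min_score=0` the low-confidence pair is kept as well, matching the coverage-first choice described above.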
2. Decouple MSA generation from structure prediction
MSA generation and structure inference are both compute-intensive but scale differently, as we recently presented in a white paper. We thus approached these computations as separate steps and implemented them as separate SLURM pipelines, which allowed us to tune each step for optimal node utilization.
MSA generation
We generated MSAs using colabfold_search with the MMseqs2-GPU backend. While MMseqs2-GPU scales across the GPUs on a node natively, we chose to spawn one MMseqs2-GPU server process per GPU for easier process management. In colabfold_search, the GPUs are only used for the ungapped prefilter stages and not the subsequent alignment stages, which are multithreaded CPU processes.
Therefore, we can stack colabfold_search calls: by monitoring the colabfold_search output, we start the next call as soon as the previous one no longer needs the GPU, reducing GPU idle time.
Although this approach oversubscribes CPU resources, in practice, we found that on a DGX H100 node, an overall throughput increase of up to 25% can be achieved with three staggered colabfold_search processes, at the expense of slower processing of individual input chunks.
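The staggering logic can be sketched as a small launcher. Everything here is illustrative: the marker string that signals the end of the GPU-bound stage is an assumption (inspect your actual colabfold_search logs for the real string), and the toy processes stand in for real colabfold_search invocations:

```python
import subprocess, sys

# Hypothetical marker for the start of the CPU-bound alignment stage;
# match on whatever your colabfold_search version actually prints.
GPU_DONE_MARKER = "align"

def run_staggered(commands, marker=GPU_DONE_MARKER):
    """Start each command once the previous one has printed the marker
    signaling it left its GPU-bound stage, shrinking GPU idle time."""
    procs = []
    for cmd in commands:
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
        procs.append(proc)
        for line in proc.stdout:  # block until the GPU stage is done
            if marker in line:
                break
    for proc in procs:
        proc.communicate()  # drain remaining output and wait for exit
    return [p.returncode for p in procs]

# Toy demonstration with stand-in processes:
codes = run_staggered([
    [sys.executable, "-c", "print('prefilter done, align start')"],
    [sys.executable, "-c", "print('align start')"],
])
print(codes)  # [0, 0]
```

A production version would also cap the number of concurrent processes (three worked for us) and log per-chunk timings.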
When determining reasonable input chunk sizes, there are two factors to consider. Smaller chunks mean more chunks, and thus more per-process overhead such as database loading, which can take a couple of minutes per process even on fast storage. (Pre-staging the databases on the fastest storage available, such as the on-node SSD, also helps throughput.) On the other hand, larger chunks take longer to finish; on a SLURM cluster with a job time limit, this results in more unfinished chunks.
The sweet spot will depend on the cluster configuration, but for our DGX H100 nodes with a 4-hour wall time limit, a chunk size of 300 sequences worked well with the staggered colabfold_search approach.
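Chunking the input can be as simple as splitting the query FASTA before submission. A minimal sketch (the tiny parser and helper names are ours, not part of colabfold_search):

```python
def parse_fasta(text):
    """Tiny FASTA parser: yields (header, sequence) tuples."""
    header, seq = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:].strip(), []
        elif line.strip():
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)

def chunk_fasta(records, chunk_size=300):
    """Split (header, sequence) records into fixed-size chunks, one
    chunk per colabfold_search job. 300 fit our 4-hour wall time."""
    chunk = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly smaller chunk

records = list(parse_fasta(">a\nMKV\n>b\nMAL\n>c\nMGG"))
chunks = list(chunk_fasta(records, chunk_size=2))
print([len(c) for c in chunks])  # [2, 1]
```

Each chunk would then be written back out as a FASTA file and submitted as its own SLURM job.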
Structure prediction
To increase structure prediction throughput, we leveraged both optimizations in data handling for JAX-based folding through ColabFold and accelerated tooling developed at NVIDIA, including TensorRT and cuEquivariance for OpenFold-based folding.
Deep learning inference parameters
First, we selected inference parameters that struck a good balance between accuracy and speed. The inference setup for all deep learning pipelines (ColabFold and OpenFold) thus used:
- Weights: 1x weights from AlphaFold-Multimer (model_1_multimer_v3)
- Four recycles (with early stopping)
- No relaxation
- MSAs: frozen MSAs generated through colabfold_search (using MMseqs2-GPU), as described above
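As a sketch of how these settings map onto a ColabFold invocation, the helper below assembles a colabfold_batch command. Flag names follow ColabFold’s CLI but may differ across versions, and the early-stop tolerance value is a placeholder assumption, so treat this as illustrative rather than our exact configuration:

```python
def build_colabfold_cmd(msa_dir, out_dir):
    # Mirrors the inference settings above; verify flag names against
    # your installed ColabFold version before use.
    return [
        "colabfold_batch",
        "--model-type", "alphafold2_multimer_v3",
        "--num-models", "1",          # single model (model 1 first)
        "--num-recycle", "4",         # four recycles...
        "--recycle-early-stop-tolerance", "0.5",  # ...with early stopping
        # no --amber flag: relaxation stays disabled
        msa_dir,                      # precomputed .a3m MSAs as input
        out_dir,
    ]

cmd = build_colabfold_cmd("msas/", "predictions/")
print(" ".join(cmd))
```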
Accuracy validation
Homodimer PDB set (125 proteins)

| Model | High (DockQ >0.8) | Medium (DockQ >0.6) | Acceptable (DockQ >0.3) | Incorrect | Usable (High + Medium) | Mean DockQ |
| --- | --- | --- | --- | --- | --- | --- |
| ColabFold | 52 | 37 | 12 | 21 | 89 (72.95%) | 0.637 |
| OpenFold with TensorRT and cuEquivariance | 53 | 39 | 10 | 20 | 92 (75.41%) | 0.647 |
As we used different inference pipelines, we performed accuracy validation using a curated benchmark set of 125 X-ray resolved PDB homodimers released after AlphaFold2 was introduced, thus minimizing the potential for information leakage.
Predicted complexes for each deep learning implementation were compared against experimental reference structures using DockQ, which evaluates interface accuracy via the fraction of native contacts (Fnat), fraction of non-native contacts (Fnonnat), interface RMSD (iRMS), and ligand RMSD after receptor alignment (LRMS), and assigns standard CAPRI classifications of high, medium, acceptable, or incorrect.
Across the PDB homodimer benchmark, OpenFold accelerated through TensorRT and cuEquivariance reproduces ColabFold interface accuracy, achieving a similar fraction of “high” scoring predictions and comparable mean DockQ scores. This indicates that the accelerated implementations preserve interface-level structural accuracy relative to the ColabFold baseline.
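For readers reproducing the comparison, the classification in the table can be expressed directly from the thresholds shown there (note these are the table’s cutoffs, not the DockQ tool’s default CAPRI boundaries):

```python
def capri_class(dockq: float) -> str:
    """Map a DockQ score to the classes used in the benchmark table:
    High > 0.8, Medium > 0.6, Acceptable > 0.3, else Incorrect."""
    if dockq > 0.8:
        return "High"
    if dockq > 0.6:
        return "Medium"
    if dockq > 0.3:
        return "Acceptable"
    return "Incorrect"

def usable_fraction(scores):
    # "Usable" in the table counts High + Medium predictions.
    usable = sum(1 for s in scores if s > 0.6)
    return usable / len(scores)

scores = [0.85, 0.65, 0.4, 0.1]
print([capri_class(s) for s in scores])
# ['High', 'Medium', 'Acceptable', 'Incorrect']
print(usable_fraction(scores))  # 0.5
```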
MSA preparation and sequence packing
For ColabFold-based homodimer inferences, higher throughput can be achieved by packing homodimers of equal length into a batch for processing, sorted by their MSA depth in descending order. This reduces the number of JAX recompilations, thereby increasing end-to-end throughput. This trick, however, does not work when processing heterodimers, because the lengths of the individual chains differ.
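A minimal sketch of this packing strategy; the query dictionary layout is hypothetical:

```python
def pack_homodimer_batches(queries, batch_size=8):
    """Group homodimer queries of identical chain length into batches,
    sorted by MSA depth (descending) within each length bucket, so a
    JAX model compiled for one padded shape is reused across a batch.

    `queries` is a list of dicts with (hypothetical) keys:
    'id', 'length' (chain length), 'msa_depth' (rows in the MSA).
    """
    buckets = {}
    for q in queries:
        buckets.setdefault(q["length"], []).append(q)
    batches = []
    for length in sorted(buckets):
        group = sorted(buckets[length], key=lambda q: -q["msa_depth"])
        for i in range(0, len(group), batch_size):
            batches.append(group[i:i + batch_size])
    return batches

queries = [
    {"id": "a", "length": 120, "msa_depth": 512},
    {"id": "b", "length": 120, "msa_depth": 2048},
    {"id": "c", "length": 350, "msa_depth": 64},
]
batches = pack_homodimer_batches(queries, batch_size=8)
print([[q["id"] for q in b] for b in batches])  # [['b', 'a'], ['c']]
```

Sorting by depth first means the deepest (largest) MSA in a bucket sets the padded shape once, so shallower queries reuse the same compiled function.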
For OpenFold, whether for homodimers or heterodimers, this packing strategy is not needed, as the method doesn’t require re-compilation. However, given a dependency between sequence length and execution time, reserving longer sequences for individual jobs may be beneficial if operating with specific SLURM runtimes. To further optimize the process, input featurizations (CPU-bound) were performed for the next input query alongside the inference step for the current query (GPU-bound).
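The featurization/inference overlap can be sketched as a simple producer-consumer pipeline. The stand-in functions below replace real featurization and folding calls:

```python
import queue, threading

def pipelined_inference(inputs, featurize, infer, prefetch=2):
    """Overlap CPU-bound featurization of the next query with GPU-bound
    inference on the current one, via a small bounded prefetch queue."""
    q = queue.Queue(maxsize=prefetch)
    SENTINEL = object()

    def producer():
        for item in inputs:
            q.put(featurize(item))  # CPU work on this thread
        q.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while True:
        feats = q.get()
        if feats is SENTINEL:
            break
        results.append(infer(feats))  # GPU work on the main thread
    return results

# Toy stand-ins for featurization and folding:
out = pipelined_inference(
    [1, 2, 3],
    featurize=lambda x: x * 10,
    infer=lambda f: f + 1,
)
print(out)  # [11, 21, 31]
```

The bounded queue keeps featurization at most a couple of queries ahead, capping host memory while hiding CPU latency behind GPU work.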
Additionally, OpenFold’s throughput was enhanced through the integration of the NVIDIA cuEquivariance library and NVIDIA TensorRT SDK. These modular libraries and SDKs can be leveraged to accelerate operations common in protein structure AI and general inference AI workloads, respectively. We previously described how TensorRT can be leveraged to accelerate OpenFold inference.
3. Optimize GPU utilization with SLURM
As alluded to in the previous section, depending on the available hardware, you can increase throughput by “packing” GPUs and nodes. SLURM is a great orchestrator, and we divided the inference workflows into SLURM scripts to:
- Pack multiple predictions per node
- Match GPU memory to sequence length
- Reduce idle time between jobs
- Separate short vs long sequence queues
Our workload was mapped to an NVIDIA DGX H100 SuperPOD HPC system. We could thus deploy inference across NVIDIA H100 GPUs on multi-node clusters, running exclusively on each node and packing each GPU with as many processes as needed to saturate GPU utilization for both MSA processing and deep learning inference.
Helpful tips:
- Group jobs by total residue length
- Monitor GPU memory fragmentation
- Use asynchronous I/O to avoid disk bottlenecks
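Grouping by total residue length can be a one-pass bucketing step before submission; the cutoff below is illustrative, not a recommendation:

```python
def split_by_length(jobs, cutoff=1500):
    """Route jobs to 'short' and 'long' SLURM queues by total residue
    count, so long sequences don't hit short-queue wall-time limits.
    The 1500-residue cutoff is a placeholder; tune it per cluster."""
    short, long_ = [], []
    for job in jobs:
        total = sum(job["chain_lengths"])
        (short if total <= cutoff else long_).append(job["id"])
    return {"short": short, "long": long_}

jobs = [
    {"id": "dimer_a", "chain_lengths": [300, 300]},
    {"id": "dimer_b", "chain_lengths": [900, 900]},
]
print(split_by_length(jobs))  # {'short': ['dimer_a'], 'long': ['dimer_b']}
```

Each bucket then maps to its own sbatch script with an appropriate time limit and GPU memory request.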
4. Make quality predictions accessible to the world
In partnership with EMBL-EBI, the Steinegger Lab at Seoul National University, and Google DeepMind, we explored complex structure prediction analysis. We highlight that predicting these biological systems remains challenging. Unlike protein monomer prediction, where the predicted Local Distance Difference Test (pLDDT) score can inform overall prediction quality and yields a balanced set of plausible predictions, assessing interface plausibility in the complex scenario is much harder. This is because assessing complexes involves global and per-chain confidence metrics, as well as local confidence metrics at the interface.
Simply put, is the interface between two monomers plausible, and is it predicted in the right pocket? These questions are much harder to answer than more “local” questions about monomer likelihood, given the very limited data available. Therefore, we make available a set of high-confidence structures through the AlphaFold Database, thereby enabling, for the first time, exploration of protein complexes. We intend to refine our approach further and expand the universe of available protein complexes in the AlphaFold Database.
Getting started
Proteome-scale quaternary structure prediction requires more than just running AlphaFold-Multimer at scale. Success depends on:
- Evidence-driven interaction selection
- Decoupled and optimized compute workflows
- GPU-aware job orchestration
- Confidence calibration and validation
- Dataset health monitoring
By combining STRING-guided selection, MMseqs2-GPU acceleration, and NVIDIA H100-powered multimer inference, this work extends AFDB into a unified, interaction-aware structural resource.
This infrastructure enables:
- Variant interpretation at interfaces
- Systems-level structural biology
- Drug target validation
- Generative protein design benchmarking
Resources
- Read more about the project here: https://research.nvidia.com/labs/dbr/assets/data/manuscripts/afdb.pdf
- Accelerated libraries and SDKs are available here:
- MMseqs2-GPU
- NVIDIA cuEquivariance
- NVIDIA TensorRT
- If you wish to deploy MSA search and protein folding easily, you can get accelerated inference pipelines through NVIDIA Inference Microservices (NIMs).
- The predictions from this effort are available through https://alphafold.com