The ability to compare the sequences of multiple related proteins is a foundational task for many life science researchers. This is often done in the form of a multiple sequence alignment (MSA), and the evolutionary information retrieved from these alignments can yield insights into protein structure, function, and evolutionary history.
Now, with MMseqs2-GPU, an updated GPU-accelerated library for evolutionary information retrieval, getting insights from protein sequences is faster than ever.
In simple terms, an MSA is a big matrix containing letters representing residues (or amino acids) in protein sequences. The first row of the matrix contains the “query” sequence—the sequence of interest for the analysis—with each residue placed in one column from left to right.
Subsequent rows hold similar sequences, ordered from most to least similar, with each residue aligned to the corresponding query column. When a sequence has no residue aligned to a given query position, a placeholder gap is introduced in the alignment, usually represented by a “-” (Figure 1).
Evolutionary information encoded in MSAs retrieved from protein databases spanning thousands of species provides insights into protein domains and highlights function conserved across species. A simple analysis of residue conservation in the MSA (that is, how often the same amino acid appears in a column) can quickly point to key residues that, if altered, could cause the protein to malfunction.
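As a toy illustration (not part of MMseqs2 itself), the following sketch computes, for a small hand-made alignment, the fraction of sequences that share the most common residue in each column. The function name and data are hypothetical, and real tools typically weight sequences and treat gaps more carefully.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Fraction of sequences sharing the most common symbol in each MSA column.
// Gaps ('-') are counted like any other symbol here for simplicity.
std::vector<double> columnConservation(const std::vector<std::string>& msa) {
    std::vector<double> conservation(msa[0].size(), 0.0);
    for (std::size_t col = 0; col < msa[0].size(); ++col) {
        std::array<int, 256> counts{};                    // histogram over symbols
        for (const std::string& row : msa) counts[(unsigned char)row[col]]++;
        int best = *std::max_element(counts.begin(), counts.end());
        conservation[col] = double(best) / msa.size();
    }
    return conservation;
}

int main() {
    // Toy alignment: query first, then aligned homologs (made-up sequences).
    std::vector<std::string> msa = {"MKTAYIAK", "MKTAHIAK", "MKSA-IAK", "MRTAYLAK"};
    for (double c : columnConservation(msa)) std::printf("%.2f ", c);
    std::printf("\n");   // columns printing 1.00 are fully conserved
}
```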
Beyond such quick checks, MSAs have been used since 1992 as inputs to sophisticated machine learning algorithms that predict complex protein traits like structure and function.
AlphaFold2, which revolutionized computational and structural biology, leverages MSAs to produce highly accurate 3D protein structure predictions, just one of many uses of MSAs in drug discovery research (Figure 2).
However, computing MSAs is challenging, especially since general-purpose CPUs are not built for highly parallel workloads like sifting through vast databases of protein sequences. The problem keeps getting harder, as protein sequence databases grow daily thanks to large-scale metagenomic experiments and cheap next-generation sequencing. New algorithms that can quickly search these ever-larger databases are therefore needed to build informative MSAs for protein analysis.
Overcoming computationally expensive MSA construction with NVIDIA CUDA
Traditional MSA tools rely on CPU-based implementations, which, while effective for sequential processing, can’t match the parallel processing capabilities of GPUs.
The joint research team that developed MMseqs2-GPU was led by researchers at Seoul National University, Johannes Gutenberg University Mainz, and NVIDIA. Inspired by their previous work on CUDASW++4.0, they approached the problem by developing a novel, gapless prefiltering algorithm tailored to NVIDIA CUDA that enables efficient, high-sensitivity sequence comparisons at unparalleled speeds.
This GPU-accelerated prefilter replaces the k-mer prefiltering in MMseqs2 with a gapless scoring approach. Instead of using k-mer searches, which simplify sequence comparisons through a coarse representation, the gapless prefilter analyzes the full sequences directly. It employs a modified version of the classic Smith-Waterman-Gotoh algorithm that considers only diagonal dependencies, avoiding gaps in the alignment, and the computation runs efficiently across thousands of GPU cores.
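In code, the gapless recurrence is compact: each cell depends only on its upper-left (diagonal) neighbor, H[i][j] = max(0, H[i-1][j-1] + S(t_i, q_j)). The sketch below is a plain, sequential illustration of that recurrence with a toy match/mismatch score standing in for a real substitution matrix; it is not the MMseqs2-GPU implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Plain, sequential sketch of the gapless, diagonal-only score between a query
// and one database sequence: H[i][j] = max(0, H[i-1][j-1] + S(t_i, q_j)).
// A toy match/mismatch score stands in for a real substitution matrix.
int gaplessScore(const std::string& q, const std::string& t,
                 int match = 2, int mismatch = -1) {
    std::vector<int> H(q.size() + 1, 0);   // rolling previous row of the DP matrix
    int best = 0;
    for (char tc : t) {                    // one target residue per row
        int diag = 0;                      // H[i-1][j-1]; column 0 is fixed at 0
        for (std::size_t j = 1; j <= q.size(); ++j) {
            int up = H[j];                 // old H[i-1][j] becomes the next diagonal
            H[j] = std::max(0, diag + (q[j - 1] == tc ? match : mismatch));
            best = std::max(best, H[j]);
            diag = up;
        }
    }
    return best;                           // used to rank database sequences
}
```

Because a row has no left-to-right dependency, every column of a row can in principle be computed at the same time, which is what makes this formulation such a natural fit for GPUs.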
Running this prefilter between the query and every sequence in the reference database produces a ranked list of the database sequences most similar to the query. That list can be cut down to the top candidates, on which an accelerated affine-gap Smith-Waterman-Gotoh alignment is then performed. These algorithms, built into the MMseqs2 library, also reduce memory requirements and are natively compatible with multi-GPU systems, overcoming the memory limits of a single GPU and offering additional speedups.
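For contrast with the gapless pass, the affine-gap Smith-Waterman-Gotoh stage adds two gap states, E and F, with separate gap-open and gap-extend penalties. The sequential sketch below shows only the recurrence; the penalties and scores are placeholders rather than MMseqs2 defaults, and the library’s actual gapped kernel runs on the GPU.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Sequential sketch of the affine-gap Smith-Waterman-Gotoh local score.
// E and F track alignments ending in a gap; gapOpen/gapExtend are penalties,
// and match/mismatch stand in for a real substitution matrix.
int gotohScore(const std::string& q, const std::string& t,
               int match = 2, int mismatch = -1,
               int gapOpen = 11, int gapExtend = 1) {
    const int NEG = -(1 << 28);               // acts as minus infinity
    int best = 0;
    std::vector<int> H(t.size() + 1, 0), E(t.size() + 1, NEG);
    for (std::size_t i = 1; i <= q.size(); ++i) {
        int diag = 0, F = NEG;                // diag holds H[i-1][j-1]
        for (std::size_t j = 1; j <= t.size(); ++j) {
            E[j] = std::max(E[j] - gapExtend, H[j] - gapOpen);      // gap in target
            F    = std::max(F - gapExtend, H[j - 1] - gapOpen);     // gap in query
            int s = diag + (q[i - 1] == t[j - 1] ? match : mismatch);
            diag = H[j];                      // save old H[i-1][j] for the next j
            H[j] = std::max({0, s, E[j], F}); // local alignment clamps at zero
            best = std::max(best, H[j]);
        }
    }
    return best;
}
```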
The gapless prefiltering step is ideal for GPUs, as it enables sequence-to-sequence comparisons with minimal data transfer, reducing latency and maximizing GPU utilization. With this approach, MMseqs2 on a single NVIDIA L40S showed a 177x speedup over standard JackHMMER implementations on a 128-core CPU (Figure 3). Using eight NVIDIA L40S GPUs improves this speedup to 720x (0.117 seconds per sequence).
These numbers were obtained by averaging runtimes across 6370 protein sequences aligned against a reference database of 30M sequences. The algorithms were run on a system featuring a CPU with 128 cores, complemented by one terabyte of RAM, two terabytes of NVMe storage, and a single NVIDIA L40S GPU.
For context, it takes about the same time to compute the alignment of a sequence using MMseqs2-GPU (0.475 seconds) as it does for humans to form a conscious thought (~0.3 to 0.5 seconds), to blink (~0.3 to 0.4 seconds), or for lightning to strike (~0.2 to 0.5 seconds).
How CUDA powers optimization and acceleration for MMseqs2-GPU
At the heart of this acceleration is CUDA, which enables MMseqs2-GPU to execute optimized compute kernels for gapless and gapped alignments. These kernels leverage multi-threading and memory-sharing features to align many reference sequences in parallel at greater speed.
MMseqs2-GPU is optimized for the latest NVIDIA GPUs, such as the NVIDIA L40S. Its GPU-accelerated kernels for gapless prefiltering and gapped alignment exploit the massive parallelism of these devices. The gapless prefilter processes each matrix row in parallel, packing two 16-bit values into a single 32-bit word (using the half2 or s16x2 data type) to maximize throughput.
It handles the dynamic programming dependencies at warp level with cross-thread warp shuffles, and the necessary memory lookups are accelerated with fast CUDA shared memory. Combined, these techniques make the problem effectively compute-bound and minimize the overhead of memory accesses.
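The fragment below sketches how these ingredients could fit together for the gapless case: one warp scores one database sequence, each lane owns two query columns packed as the 16-bit halves of a 32-bit register (the s16x2 layout), the substitution matrix is staged in shared memory, and the diagonal dependency is passed between lanes with warp shuffles. It is a simplified illustration under stated assumptions (uppercase A-Z residues, query padded to 64 characters, one warp per block), not the actual MMseqs2-GPU kernel, and all names and data layouts are hypothetical.

```cpp
#include <cuda_runtime.h>

// Warp-parallel gapless scoring sketch. Each of the 32 lanes handles two query
// columns packed into one 32-bit register; __vaddss2/__vmaxs2 update both
// 16-bit halves at once (scores saturate at the 16-bit limit), and the
// diagonal H[i-1][j-1] travels between lanes via warp shuffles.
__global__ void gaplessWarpKernel(const char* query64,   // query padded to 64 chars
                                  const char* targets,   // all targets, concatenated
                                  const int* offsets,    // target b = targets[offsets[b]..offsets[b+1])
                                  const int* subst,      // flattened 26x26 substitution matrix
                                  int* bestOut) {        // one best score per target
    __shared__ int sSubst[26 * 26];
    int lane = threadIdx.x;                               // 0..31, one warp per block
    for (int k = lane; k < 26 * 26; k += 32) sSubst[k] = subst[k];
    __syncwarp();

    int q0 = query64[lane] - 'A';                         // query column lane+1
    int q1 = query64[lane + 32] - 'A';                    // query column lane+33
    unsigned H = 0, best = 0;                             // two packed s16x2 scores

    for (int t = offsets[blockIdx.x]; t < offsets[blockIdx.x + 1]; ++t) {
        // Both packed diagonals come from the left neighbor's previous H;
        // lane 0 gets the boundary 0 (low half) and lane 31's low half (high half).
        unsigned diag = __shfl_up_sync(0xffffffffu, H, 1);
        unsigned wrap = __shfl_sync(0xffffffffu, H, 31);
        if (lane == 0) diag = (wrap & 0xffffu) << 16;

        int row = (targets[t] - 'A') * 26;                // shared-memory lookup
        unsigned sc = (unsigned)(sSubst[row + q0] & 0xffff)
                    | ((unsigned)(sSubst[row + q1] & 0xffff) << 16);

        H = __vmaxs2(__vaddss2(diag, sc), 0u);            // max(0, diag + S), twice
        best = __vmaxs2(best, H);
    }

    // Reduce the per-lane packed maxima to a single score for this target.
    int m = max((int)(best & 0xffffu), (int)(best >> 16));
    for (int off = 16; off > 0; off >>= 1)
        m = max(m, __shfl_down_sync(0xffffffffu, m, off));
    if (lane == 0) bestOut[blockIdx.x] = m;
}
```

A launch such as gaplessWarpKernel<<<numTargets, 32>>>(...) would assign one warp per database sequence; the production kernels batch and tile far more aggressively, but the data flow follows the same idea.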
The tool also supports multi-GPU setups to ensure scalability, enabling researchers to process larger datasets by distributing the computational load across several GPUs. This architecture is highly adaptable to cloud-based environments, making MMseqs2-GPU an attractive option for researchers in academia and industry looking to reduce computational costs without compromising accuracy.
“We have been waiting for something like this for quite some time. Protein structure prediction inference has long been known to be limited by the MSA computation step. This is an incredible achievement; reducing the MSA step to less than 20% of the execution time completely changes how we will approach structure prediction workflows in the future,” said Luca Naef, CTO at VantAI.
MMseqs2-GPU accelerates protein structure prediction
The success of MMseqs2-GPU is rooted in redesigning gapless prefiltering and gapped alignment algorithms, leveraging CUDA to deliver rapid, affordable, and scalable sequence alignment that meets today’s bioinformatics research demands.
Because MMseqs2 is integrated into many GPU-enabled computational pipelines, including structure prediction with ColabFold, users can expect an easy-to-swap-in performance boost:
Speed improvement
ColabFold using MMseqs2-GPU is 22x faster than AlphaFold2 with JackHMMER and HHblits for protein folding (Figure 4). In practice, this means that instead of waiting about 40 minutes to predict a protein structure with HHblits, JackHMMER, and AlphaFold2, you can get a prediction of the same quality in about one and a half minutes with ColabFold and MMseqs2-GPU.
The graph is based on predictions for 20 CASP14 queries, and accuracy (LDDT) is the same (~0.76) for each method. The prediction methods were run on a system featuring a CPU with 128 cores, complemented by one terabyte of RAM, two terabytes of NVMe storage, and a single NVIDIA L40S GPU.
Memory requirements
With an ungapped GPU prefilter, MMseqs2-GPU avoids the large k-mer hash table indexes required by the CPU implementation. This makes the solution a better fit for GPUs and reduces overall memory requirements by an order of magnitude (see the Figure 3 and 4 descriptions).
Cost efficiency
ColabFold using MMseqs2-GPU is 70x cheaper in cloud cost estimates than AlphaFold2 with JackHMMER and HHblits. This massive cost reduction helps labs, especially those with limited budgets, access powerful bioinformatics tools without breaking the bank. Lower compute costs can also enable ongoing, large-scale analyses that would otherwise be financially prohibitive.
High throughput and scalability
The newly developed gapless prefilter can reach 102 Tera Cell Updates Per Second (TCUPS) across eight GPUs, quickly prefiltering large datasets. The tool’s support for multi-GPU execution enables users to scale up further, processing larger datasets while increasing total execution speed, which is crucial for large genomic or proteomic studies.
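As a back-of-the-envelope reading of that metric: a cell update is one dynamic-programming cell, so throughput is simply the number of cells (query length times total database residues) divided by runtime. The figures in this sketch are made-up placeholders, not the benchmarked configuration.

```cpp
#include <cstdio>

// Cell updates per second (CUPS) = query length x total database residues / runtime.
// All inputs below are assumed example values, not the published benchmark.
int main() {
    double queryLength = 350;        // residues in one query (assumed)
    double dbResidues  = 9.5e9;      // total residues in the database (assumed)
    double seconds     = 0.05;       // prefilter runtime for that query (assumed)
    double tcups = queryLength * dbResidues / seconds / 1e12;
    std::printf("%.1f TCUPS\n", tcups);   // 350 * 9.5e9 / 0.05 = 66.5 TCUPS
}
```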
Accuracy
MMseqs2-GPU achieves these speed and cost benefits without compromising accuracy. It maintains comparable sensitivity and protein folding accuracy, ensuring researchers gain rapid insights without losing reliability.
“My lab at Columbia has developed OpenFold, a faithful reproduction of AlphaFold2, to enable the community to train their protein structure prediction models. Especially interesting for our applications is the ability to perform iterative profile searches, which have proven to provide more informative MSAs for structure prediction. We are very excited to see that MMseqs2-GPU supports profile searches at a greater speed than previous methods,” said Mohammed AlQuraishi, professor at Columbia University.
Accelerated MMseqs2 means faster discoveries
Looking ahead, the joint research team is focused on further refining the algorithms and the MMseqs2 integration, expanding its applications to protein clustering and cascaded database searches. The availability of MMseqs2-GPU means faster inputs to protein structure prediction that can accelerate drug discovery, as illustrated here, along with a host of other applications (Figure 2).
For example, it can mean faster inputs to protein variant predictors like GEMME, which can deepen our understanding of disease variants, and faster real-time retrieval for protein LLMs like PoET. It can mean faster antimicrobial drug-resistance profiling. It can even mean faster vaccine design.
For those interested in diving deeper or contributing to this field-shifting work, MMseqs2-GPU is open source and available online, providing an invaluable resource for researchers globally.
For more information, visit the MMseqs2 GitHub or read the team's detailed analysis and benchmarks. There's also an AlphaFold2 NVIDIA NIM that uses MMseqs2 for the MSA step, which you can test.