The ability to compare the sequences of multiple related proteins is a foundational task for many life science researchers. This is often done in the form of a multiple sequence alignment (MSA), and the evolutionary information retrieved from these alignments can yield insights into protein structure, function, and evolutionary history.
Now, with MMseqs2-GPU, an updated GPU-accelerated library for evolutionary information retrieval, getting insights from protein sequences is faster than ever.
In simple terms, an MSA is a big matrix containing letters representing residues (or amino acids) in protein sequences. The first row of the matrix contains the “query” sequence—the sequence of interest for the analysis—with each residue placed in one column from left to right.
Subsequent rows hold similar sequences, ordered from most to least similar, with each residue aligned to the corresponding query column. When a sequence has no residue aligned to a given query position, a placeholder gap is introduced in the alignment, usually represented by a “-” (Figure 1).
Evolutionary information encoded in MSAs retrieved from protein databases spanning thousands of species provides insights into protein domains and highlights function conserved across species. A simple analysis of residue conservation in the MSA (that is, how often the same amino acid appears in a column) can quickly point to key residues that, if altered, could cause the protein to malfunction.
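As a toy illustration (not part of MMseqs2 itself), the following sketch computes, for a small hand-made alignment, the fraction of sequences that share the most common residue in each column. The function name and data are hypothetical, and real tools typically weight sequences and treat gaps more carefully.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Fraction of sequences sharing the most common symbol in each MSA column.
// Gaps ('-') are counted like any other symbol here for simplicity.
std::vector<double> columnConservation(const std::vector<std::string>& msa) {
    std::vector<double> conservation(msa[0].size(), 0.0);
    for (std::size_t col = 0; col < msa[0].size(); ++col) {
        std::array<int, 256> counts{};                    // histogram over symbols
        for (const std::string& row : msa) counts[(unsigned char)row[col]]++;
        int best = *std::max_element(counts.begin(), counts.end());
        conservation[col] = double(best) / msa.size();
    }
    return conservation;
}

int main() {
    // Toy alignment: query first, then aligned homologs (made-up sequences).
    std::vector<std::string> msa = {"MKTAYIAK", "MKTAHIAK", "MKSA-IAK", "MRTAYLAK"};
    for (double c : columnConservation(msa)) std::printf("%.2f ", c);
    std::printf("\n");   // columns printing 1.00 are fully conserved
}
```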
Beyond such quick checks, MSAs have been used since 1992 as inputs to sophisticated machine learning algorithms that predict complex protein traits like structure and function.
AlphaFold2, which revolutionized computational and structural biology, leverages MSAs to produce highly accurate 3D protein structure predictions, just one of many uses of MSAs in drug discovery research (Figure 2).
However, computing MSAs is challenging, especially since general-purpose CPUs are not built for highly parallel workloads like sifting through vast databases of protein sequences. The problem keeps getting harder, as protein sequence databases grow daily thanks to large-scale metagenomic experiments and cheap next-generation sequencing. New algorithms that can quickly search these ever-larger databases are therefore needed to build informative MSAs for protein analysis.
Overcoming computationally expensive MSA construction with NVIDIA CUDA
Traditional MSA tools rely on CPU-based implementations, which, while effective for sequential processing, can’t match the parallel processing capabilities of GPUs.
The joint research team that developed MMseqs2-GPU was led by researchers at Seoul National University, Johannes Gutenberg University Mainz, and NVIDIA. Inspired by their previous work on CUDASW++4.0, they approached the problem by developing a novel, gapless prefiltering algorithm tailored to NVIDIA CUDA that enables efficient, high-sensitivity sequence comparisons at unparalleled speeds.
This GPU-accelerated prefilter replaces the k-mer prefiltering in MMseqs2 with a gapless scoring approach. Instead of using k-mer searches, which simplify sequence comparisons through a coarse representation, the gapless prefilter analyzes the full sequences directly. It employs a modified version of the classic Smith-Waterman-Gotoh algorithm that considers only diagonal dependencies, avoiding gaps in the alignment, and the computation runs efficiently across thousands of GPU cores.
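In code, the gapless recurrence is compact: each cell depends only on its upper-left (diagonal) neighbor, H[i][j] = max(0, H[i-1][j-1] + S(t_i, q_j)). The sketch below is a plain, sequential illustration of that recurrence with a toy match/mismatch score standing in for a real substitution matrix; it is not the MMseqs2-GPU implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Plain, sequential sketch of the gapless, diagonal-only score between a query
// and one database sequence: H[i][j] = max(0, H[i-1][j-1] + S(t_i, q_j)).
// A toy match/mismatch score stands in for a real substitution matrix.
int gaplessScore(const std::string& q, const std::string& t,
                 int match = 2, int mismatch = -1) {
    std::vector<int> H(q.size() + 1, 0);   // rolling previous row of the DP matrix
    int best = 0;
    for (char tc : t) {                    // one target residue per row
        int diag = 0;                      // H[i-1][j-1]; column 0 is fixed at 0
        for (std::size_t j = 1; j <= q.size(); ++j) {
            int up = H[j];                 // old H[i-1][j] becomes the next diagonal
            H[j] = std::max(0, diag + (q[j - 1] == tc ? match : mismatch));
            best = std::max(best, H[j]);
            diag = up;
        }
    }
    return best;                           // used to rank database sequences
}
```

Because a row has no left-to-right dependency, every column of a row can in principle be computed at the same time, which is what makes this formulation such a natural fit for GPUs.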
Running this prefilter between the query and every sequence in the reference database produces a ranked list of the database sequences most similar to the query. That list can be cut down to the top candidates, on which an accelerated affine-gap Smith-Waterman-Gotoh alignment is then performed. These algorithms, built into the MMseqs2 library, also reduce memory requirements and are natively compatible with multi-GPU systems, overcoming the memory limits of a single GPU and offering additional speedups.
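For contrast with the gapless pass, the affine-gap Smith-Waterman-Gotoh stage adds two gap states, E and F, with separate gap-open and gap-extend penalties. The sequential sketch below shows only the recurrence; the penalties and scores are placeholders rather than MMseqs2 defaults, and the library’s actual gapped kernel runs on the GPU.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Sequential sketch of the affine-gap Smith-Waterman-Gotoh local score.
// E and F track alignments ending in a gap; gapOpen/gapExtend are penalties,
// and match/mismatch stand in for a real substitution matrix.
int gotohScore(const std::string& q, const std::string& t,
               int match = 2, int mismatch = -1,
               int gapOpen = 11, int gapExtend = 1) {
    const int NEG = -(1 << 28);               // acts as minus infinity
    int best = 0;
    std::vector<int> H(t.size() + 1, 0), E(t.size() + 1, NEG);
    for (std::size_t i = 1; i <= q.size(); ++i) {
        int diag = 0, F = NEG;                // diag holds H[i-1][j-1]
        for (std::size_t j = 1; j <= t.size(); ++j) {
            E[j] = std::max(E[j] - gapExtend, H[j] - gapOpen);      // gap in target
            F    = std::max(F - gapExtend, H[j - 1] - gapOpen);     // gap in query
            int s = diag + (q[i - 1] == t[j - 1] ? match : mismatch);
            diag = H[j];                      // save old H[i-1][j] for the next j
            H[j] = std::max({0, s, E[j], F}); // local alignment clamps at zero
            best = std::max(best, H[j]);
        }
    }
    return best;
}
```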
The gapless prefiltering step is ideal for GPUs, as it enables sequence-to-sequence comparisons with minimal data transfer, reducing latency and maximizing GPU utilization. With this approach, MMseqs2 on a single NVIDIA L40S showed a 177x speedup over standard JackHMMER implementations on a 128-core CPU (Figure 3). Using eight NVIDIA L40S GPUs improves this speedup to 720x (0.117 seconds per sequence).
These numbers were obtained by averaging runtimes across 6370 protein sequences aligned against a reference database of 30M sequences. The algorithms were run on a system featuring a CPU with 128 cores, complemented by one terabyte of RAM, two terabytes of NVMe storage, and a single NVIDIA L40S GPU.
For context, it takes about the same time to compute the alignment of a sequence using MMseqs2-GPU (0.475 seconds) as it does for humans to form a conscious thought (~0.3 to 0.5 seconds), to blink (~0.3 to 0.4 seconds), or for lightning to strike (~0.2 to 0.5 seconds).
How CUDA powers optimization and acceleration for MMseqs2-GPU
At the heart of this acceleration is CUDA, which enables MMseqs2-GPU to execute optimized compute kernels for gapless and gapped alignments. These kernels leverage multi-threading and memory-sharing features to align many reference sequences in parallel at greater speed.
MMseqs2-GPU is optimized for the latest NVIDIA GPUs, such as the NVIDIA L40S. Its GPU-accelerated kernels for gapless prefiltering and gapped alignment exploit the massive parallelism of these devices. The gapless prefilter processes each matrix row in parallel, packing two 16-bit values into a single 32-bit word (using the half2 or s16x2 data type) to maximize throughput.
It handles the dynamic programming dependencies at warp level with cross-thread warp shuffles, and the necessary memory lookups are accelerated with fast CUDA shared memory. Combined, these techniques make the problem effectively compute-bound and minimize the overhead of memory accesses.
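The fragment below sketches how these ingredients could fit together for the gapless case: one warp scores one database sequence, each lane owns two query columns packed as the 16-bit halves of a 32-bit register (the s16x2 layout), the substitution matrix is staged in shared memory, and the diagonal dependency is passed between lanes with warp shuffles. It is a simplified illustration under stated assumptions (uppercase A-Z residues, query padded to 64 characters, one warp per block), not the actual MMseqs2-GPU kernel, and all names and data layouts are hypothetical.

```cpp
#include <cuda_runtime.h>

// Warp-parallel gapless scoring sketch. Each of the 32 lanes handles two query
// columns packed into one 32-bit register; __vaddss2/__vmaxs2 update both
// 16-bit halves at once (scores saturate at the 16-bit limit), and the
// diagonal H[i-1][j-1] travels between lanes via warp shuffles.
__global__ void gaplessWarpKernel(const char* query64,   // query padded to 64 chars
                                  const char* targets,   // all targets, concatenated
                                  const int* offsets,    // target b = targets[offsets[b]..offsets[b+1])
                                  const int* subst,      // flattened 26x26 substitution matrix
                                  int* bestOut) {        // one best score per target
    __shared__ int sSubst[26 * 26];
    int lane = threadIdx.x;                               // 0..31, one warp per block
    for (int k = lane; k < 26 * 26; k += 32) sSubst[k] = subst[k];
    __syncwarp();

    int q0 = query64[lane] - 'A';                         // query column lane+1
    int q1 = query64[lane + 32] - 'A';                    // query column lane+33
    unsigned H = 0, best = 0;                             // two packed s16x2 scores

    for (int t = offsets[blockIdx.x]; t < offsets[blockIdx.x + 1]; ++t) {
        // Both packed diagonals come from the left neighbor's previous H;
        // lane 0 gets the boundary 0 (low half) and lane 31's low half (high half).
        unsigned diag = __shfl_up_sync(0xffffffffu, H, 1);
        unsigned wrap = __shfl_sync(0xffffffffu, H, 31);
        if (lane == 0) diag = (wrap & 0xffffu) << 16;

        int row = (targets[t] - 'A') * 26;                // shared-memory lookup
        unsigned sc = (unsigned)(sSubst[row + q0] & 0xffff)
                    | ((unsigned)(sSubst[row + q1] & 0xffff) << 16);

        H = __vmaxs2(__vaddss2(diag, sc), 0u);            // max(0, diag + S), twice
        best = __vmaxs2(best, H);
    }

    // Reduce the per-lane packed maxima to a single score for this target.
    int m = max((int)(best & 0xffffu), (int)(best >> 16));
    for (int off = 16; off > 0; off >>= 1)
        m = max(m, __shfl_down_sync(0xffffffffu, m, off));
    if (lane == 0) bestOut[blockIdx.x] = m;
}
```

A launch such as gaplessWarpKernel<<<numTargets, 32>>>(...) would assign one warp per database sequence; the production kernels batch and tile far more aggressively, but the data flow follows the same idea.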
The tool also supports multi-GPU setups to ensure scalability, enabling researchers to process larger datasets by distributing the computational load across several GPUs. This architecture is highly adaptable to cloud-based environments, making MMseqs2-GPU an attractive option for researchers in academia and industry looking to reduce computational costs without compromising accuracy.
“We have been waiting for something like this for quite some time. Protein structure prediction inference has long been known to be limited by the MSA computation step. This is an incredible achievement; reducing the MSA step to less than 20% of the execution time completely changes how we will approach structure prediction workflows in the future,” said Luca Naef, CTO at VantAI.
MMseqs2-GPU accelerates protein structure prediction
The success of MMseqs2-GPU is rooted in redesigning gapless prefiltering and gapped alignment algorithms, leveraging CUDA to deliver rapid, affordable, and scalable sequence alignment that meets today’s bioinformatics research demands.
Because MMseqs2 is integrated into many GPU-enabled computational pipelines, including structure prediction with ColabFold, users can expect an easy-to-swap-in performance boost:
Speed improvement
ColabFold using MMseqs2-GPU is 22x faster than AlphaFold2 with JackHMMER and HHblits for protein folding (Figure 4). In practice, this means that instead of waiting about 40 minutes to predict a protein structure with HHblits, JackHMMER, and AlphaFold2, you can get a prediction of the same quality in about one and a half minutes with ColabFold and MMseqs2-GPU.
The graph is based on predictions for 20 CASP14 queries, and accuracy (LDDT) is the same (~0.76) for each method. The prediction methods were run on a system featuring a CPU with 128 cores, complemented by one terabyte of RAM, two terabytes of NVMe storage, and a single NVIDIA L40S GPU.
Memory requirements
With an ungapped GPU prefilter, MMseqs2-GPU avoids the large k-mer hash table indexes required by the CPU implementation. This makes the solution a better fit for GPUs and reduces overall memory requirements by an order of magnitude (see the Figure 3 and 4 descriptions).
Cost efficiency
ColabFold using MMseqs2-GPU is 70x cheaper in cloud cost estimates than AlphaFold2 with JackHMMER and HHblits. This massive cost reduction helps labs, especially those with limited budgets, access powerful bioinformatics tools without breaking the bank. Lower compute costs can also enable ongoing, large-scale analyses that would otherwise be financially prohibitive.
High throughput and scalability
The newly developed gapless prefilter can reach 102 Tera Cell Updates Per Second (TCUPS) across eight GPUs, quickly prefiltering large datasets. The tool’s support for multi-GPU execution enables users to scale up further, processing larger datasets while increasing total execution speed, which is crucial for large genomic or proteomic studies.
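As a back-of-the-envelope reading of that metric: a cell update is one dynamic-programming cell, so throughput is simply the number of cells (query length times total database residues) divided by runtime. The figures in this sketch are made-up placeholders, not the benchmarked configuration.

```cpp
#include <cstdio>

// Cell updates per second (CUPS) = query length x total database residues / runtime.
// All inputs below are assumed example values, not the published benchmark.
int main() {
    double queryLength = 350;        // residues in one query (assumed)
    double dbResidues  = 9.5e9;      // total residues in the database (assumed)
    double seconds     = 0.05;       // prefilter runtime for that query (assumed)
    double tcups = queryLength * dbResidues / seconds / 1e12;
    std::printf("%.1f TCUPS\n", tcups);   // 350 * 9.5e9 / 0.05 = 66.5 TCUPS
}
```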
Accuracy
MMseqs2-GPU achieves these speed and cost benefits without compromising accuracy. It maintains comparable sensitivity and protein folding accuracy, ensuring researchers gain rapid insights without losing reliability.
“My lab at Columbia has developed OpenFold, a faithful reproduction of AlphaFold2, to enable the community to train their protein structure prediction models. Especially interesting for our applications is the ability to perform iterative profile searches, which have proven to provide more informative MSAs for structure prediction. We are very excited to see that MMseqs2-GPU supports profile searches at a greater speed than previous methods,” said Mohammed AlQuraishi, professor at Columbia University.
Accelerated MMseqs2 means faster discoveries
Looking ahead, the joint research team is focused on further refining the algorithms and the MMseqs2 integration, expanding its applications to protein clustering and cascaded database searches. The availability of MMseqs2-GPU means faster inputs to protein structure prediction that can accelerate drug discovery, as illustrated here, along with a host of other applications (Figure 2).
For example, it can mean faster inputs to protein variant predictors like GEMME, which can deepen our understanding of disease variants, and faster real-time retrieval for protein LLMs like PoET. It can mean faster antimicrobial drug-resistance profiling. It can even mean faster vaccine design.
For those interested in diving deeper or contributing to this field-shifting work, MMseqs2-GPU is open source and available online, providing an invaluable resource for researchers globally.
For more information, visit the MMseqs2 GitHub or read the team's detailed analysis and benchmarks. There's also an AlphaFold2 NVIDIA NIM that uses MMseqs2 for the MSA step, which you can test.