
Discover New Biological Insights with Accelerated Pangenome Alignment in NVIDIA Parabricks

NVIDIA Parabricks is a scalable genomics analysis software suite that solves omics challenges with accelerated computing and deep learning to unlock new scientific breakthroughs. NVIDIA Parabricks v4.4 introduces new features and functionality including accelerated pangenome graph alignment, as announced at the American Society of Human Genetics (ASHG) national meeting. 

The core new feature of the Parabricks v4.4 release is single-end and paired-end support for Giraffe for accelerated pangenome graph alignment. The release also includes additional functionality for Minimap2 and GATK HaplotypeCaller, as well as tool performance improvements. It also expands collaborations to support genomic sequencing and software platforms. 

Release highlights include the following:

New features

  • GPU-accelerated Giraffe, with single-end and paired-end support 
  • Pbmm2 wrapper for native PacBio input and output of Minimap2
  • Allele option support in GATK HaplotypeCaller 
  • Support for unaligned BAMs: FQ2BAM (BWA-MEM) and Minimap2

Improved features

  • Faster Minimap2 for PacBio and Oxford Nanopore (ONT) data
  • DeepVariant acceleration for ONT data
  • Faster CRAM file writer (2x acceleration over CPU-only)
  • <30-minute end-to-end 30x whole genome sequencing (WGS) germline on a single-GPU system (NVIDIA Grace Hopper)

New collaborations and benchmarks

  • Complete Genomics data supported on Parabricks
  • Parabricks now available on Basepair platform
  • Updated benchmarks, including DeepSomatic and Giraffe

The latest release of Parabricks v4.4 enables scientists and researchers to use Giraffe for pangenome alignment. By understanding genetic diversity from pangenomes and using the accelerated version of Giraffe available in Parabricks v4.4, scientists can discover new biological insights even faster.

Understanding genetic diversity from pangenomes

To understand the underlying cause of disease, individuals’ genomes have historically been compared to a linear reference genome. A linear reference genome is not the DNA sequence of any one individual. Instead, it is an average genome constructed from the DNA of a few individuals, and it serves as an accepted representation of a single consensus haplotype.

Genome Reference Consortium Human Build 38 (GRCh38) is the current human reference genome and the most widely used point of comparison across genetic studies. Relying on it inherently introduces biases and errors in variant calling, especially in repetitive or highly polymorphic regions. Additionally, it may inadequately represent genetic variation from minority populations, thereby limiting understanding of the complete spectrum of genetic diversity.

In contrast, graph-based pangenomes offer a robust solution to this issue by integrating multiple reference genomes into a unified structure. This approach effectively captures the genetic diversity within a species, enabling more accurate detection and analysis of variations across different genomes. By representing genomic data as graphs, pangenome graphs enable comprehensive and unbiased genetic variation analysis, overcoming the limitations imposed by reliance on a single reference genome.

The reference genome as a linear haploid sequence is limited in how well it can represent genetic diversity of populations, including single nucleotide polymorphisms (SNPs), indels and structural variants that are more common amongst specific subpopulations.
Aligning to a pangenome graph reference enables high accuracy genomic analysis by providing representation for many diverse subpopulations.
Figure 1. A linear reference genome compared to a pangenome graph

Graph genomes

To represent pangenome data, graph genomes provide a unified framework for representing the genetic variation of multiple genomes. The graph structure of the data provides easier understanding of structural changes, including insertions, deletions, and rearrangements. 
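As a rough illustration of the idea, the following Python sketch builds a toy sequence graph in which one SNP and one insertion are encoded as alternative paths. The node IDs, sequences, and the example read are invented for illustration and are not taken from any real pangenome. The brute-force path enumeration is only practical at toy scale and is not how Giraffe maps reads.

```python
from typing import Dict, List

# Toy sequence graph. Each node holds a DNA segment; edges list the allowed
# successor nodes. All IDs and sequences here are hypothetical.
nodes: Dict[int, str] = {
    1: "ACGT",  # shared prefix
    2: "A",     # SNP, allele carried by the linear reference
    3: "G",     # SNP, alternate allele
    4: "TTC",   # shared middle segment
    5: "GGGA",  # insertion present only in some haplotypes
    6: "CAT",   # shared suffix
}
edges: Dict[int, List[int]] = {
    1: [2, 3],
    2: [4],
    3: [4],
    4: [5, 6],  # the edge 4 -> 6 skips the insertion
    5: [6],
    6: [],
}


def haplotypes(start: int, end: int) -> List[str]:
    """Spell out the sequence of every path from start to end (depth-first)."""
    if start == end:
        return [nodes[start]]
    return [
        nodes[start] + rest
        for nxt in edges[start]
        for rest in haplotypes(nxt, end)
    ]


all_haps = haplotypes(1, 6)

# The linear reference corresponds to a single path: 1 -> 2 -> 4 -> 6.
linear_reference = nodes[1] + nodes[2] + nodes[4] + nodes[6]

# A read carrying both the alternate SNP allele and the insertion.
read = "GTTCGGGA"
matches_linear = read in linear_reference
matches_graph = any(read in hap for hap in all_haps)

print(f"{len(all_haps)} haplotype paths in the graph")
print(f"exact match on linear reference: {matches_linear}")
print(f"exact match on some graph path:  {matches_graph}")
```

In this toy case the read has no exact match on the single linear path but does match a path through the graph, which is the kind of reference bias that graph alignment is meant to reduce.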

Graph genomes are particularly beneficial to improving accuracy in variant calling since they can help increase detection of genetic variants. However, the analysis becomes more challenging, particularly in alignment, since graph-based representations introduce more complexities than the linear sequences of single references. Additionally, as graph genomes grow in size and complexity, computational requirements and processing can become prohibitive. 

Accelerating pangenome alignment with Giraffe

Giraffe is a software tool to support pangenome graph alignment. Built by the University of California, Santa Cruz (UCSC), it is used particularly in the context of large-scale genomic sequencing projects and helps with alignment, assembly, and variant calling. Giraffe enables new genomic sequences to be compared to a pangenome—not just a single reference genome. 

With the latest v4.4 release, Parabricks now supports Giraffe for single-end and paired-end data, providing GPU acceleration for pangenome alignment. Results are fully equivalent to those of the open-source version of Giraffe, so researchers can use Parabricks v4.4 to reproduce the output of the open-source tool. As a result, scientists and researchers can increase accuracy and improve variant calling, particularly across genetic variations and diverse populations.

“The current human reference genome has been the cornerstone of human genetics research for over twenty years,” explains Dr. Benedict Paten, professor and associate director at the University of California, Santa Cruz Genomics Institute. “However, it contains just a single representative sequence for each chromosome and so can’t by definition capture the rich variation present in our population. To understand the common genetic diversity of our population a human pangenome is necessary.” 

“Pangenomes encode hundreds or, in the future, even thousands of individual genomes in a reference structure,” Dr. Paten adds. “They better represent us, ensuring research and future precision therapeutics account for our individual diversity. At UCSC, we have a research team dedicated to building tools to use the pangenome. This includes Giraffe, a tool for mapping a new sample to the pangenome. We are excited to be working with the NVIDIA team to accelerate Giraffe and make it a workhorse tool for future projects. This has potential to have a huge downstream impact.” 

New collaborations

In addition to the latest features of Parabricks v4.4, NVIDIA is expanding collaborations with genomic sequencing and software platforms, including Complete Genomics and Basepair.

Complete Genomics

Complete Genomics is committed to driving genomics forward with complete sequencing solutions that improve lives. Offering a wide range of applications, including WGS, single-cell analysis, spatial transcriptomics, and microbiology, Complete Genomics leverages its proprietary DNBSEQ (DNA Nanoball Sequencing) technology. This technology produces deep sequencing coverage while ensuring high accuracy and low error rates. Parabricks germline workflows can now use data from Complete Genomics sequencers, including the DNBSEQ-T7 and DNBSEQ-G400.

Combining DNBSEQ data with Parabricks provides an accelerated and cost-effective solution for secondary genomic analysis. For example, processing a 30x WGS sample from the DNBSEQ-T7 sequencer with the fq2bam and haplotypecaller workflows can be optimized for speed or cost depending on the GPU instance.

  • Speed: 16-minute runtime on four NVIDIA L40 GPUs
  • Cost: $2.67 on four NVIDIA L4 GPUs
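The speed-versus-cost trade-off comes down to simple arithmetic: cost per sample is runtime multiplied by the hourly price of the instance. The sketch below shows one way to compare configurations. Only the 16-minute runtime comes from the figures above. The instance labels, the hourly rates, and the 45-minute runtime are hypothetical placeholders, so substitute your own measured runtimes and cloud pricing.

```python
def cost_per_sample(runtime_minutes: float, hourly_rate_usd: float) -> float:
    """Cost of one sample: runtime in hours times the hourly instance price."""
    return runtime_minutes / 60.0 * hourly_rate_usd


# Hypothetical configurations. The hourly rates and the 45-minute runtime are
# placeholders for illustration, not published prices or benchmarks.
configs = {
    "fast_instance": {"runtime_minutes": 16.0, "hourly_rate_usd": 9.00},
    "cheap_instance": {"runtime_minutes": 45.0, "hourly_rate_usd": 3.00},
}

costs = {
    name: cost_per_sample(cfg["runtime_minutes"], cfg["hourly_rate_usd"])
    for name, cfg in configs.items()
}

fastest = min(configs, key=lambda name: configs[name]["runtime_minutes"])
cheapest = min(costs, key=costs.get)

for name, cost in costs.items():
    minutes = configs[name]["runtime_minutes"]
    print(f"{name}: {minutes:.0f} min, ${cost:.2f} per sample")
print(f"fastest: {fastest}, cheapest: {cheapest}")
```

With these placeholder numbers the faster configuration is not the cheaper one, which is why the choice depends on whether turnaround time or cost per sample matters more.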

“Our integration of NVIDIA Parabricks allows us to harness the full potential of our DNBSEQ-T7 sequencing platform,” says Rob Tarbox, VP of Product and Marketing at Complete Genomics. “By combining our high-quality sequencing data with Parabricks’ speed and accuracy, we’re enabling researchers to uncover variants more efficiently and cost-effectively, ultimately advancing precision medicine and improving patient outcomes.”

Explore the quick start guide to learn more about benchmarking Parabricks germline workflows with Complete Genomics data.

Figure 2. The Complete Genomics DNBSEQ-T7 sequencer. Image credit: Complete Genomics

Basepair

Basepair is a next-generation sequencing (NGS) data analysis platform. Their point-and-click user interface helps make genomic data analysis and visualization more accessible to a broader range of scientists. 

Now, users can supercharge their genomic data analysis by using Parabricks on Basepair, powered by HealthOmics from AWS. Parabricks on Basepair gives users an intuitive graphical user interface (GUI) with interactive visualizations entirely provisioned within their own AWS account for compute and storage. 

“We are excited to support Parabricks on Basepair, bringing accelerated tools alongside a more comprehensive and visual way to analyze their genomic data,” says Simon Valentine, chief commercial officer at Basepair. “Parabricks provides access to some of the most effective bioinformatics tools available today. By making them available through Basepair’s intuitive point-and-click interface we can work together to make them accessible to an even broader range of scientists.” 

Screenshot of NVIDIA Parabricks running on the Basepair platform, with fields for pipeline, samples, analysis name, and omics.
Figure 3. NVIDIA Parabricks running on the Basepair platform. Image credit: Basepair

Latest Parabricks benchmarks

In addition to new features and upgrades for each release, NVIDIA continuously works to improve benchmark performance across instruments, tools, and GPUs.  

Table 1 outlines the latest benchmarks on the most popular NVIDIA GPUs for the fastest speed (NVIDIA H100) and lowest cost per sample (NVIDIA L4), including Giraffe from Parabricks v4.4 and DeepSomatic from v4.3.1.

                      NVIDIA H100 GPU          NVIDIA L4 GPU
                      (fastest speed)          (lowest cost per sample)
Tool                  2 GPUs      4 GPUs       2 GPUs      4 GPUs
Giraffe               65.8        42.1         84.9        44.7
DeepSomatic           56.28       35.13        215.53      108.55
FQ2BAM (BWA-MEM)      13.8        9.15         48.15       27.88
BWA-Meth              27.43       15.12        77.35       39.77
DeepVariant           9.6         5.82         23.48       13.10
HaplotypeCaller       10.57       4.90         12.00       7.73
Mutect2               25.80       13.60        55.8        32.50
Table 1. The latest benchmarks on the most popular NVIDIA GPUs for the fastest speed and lowest cost per sample with performance time in minutes

FQ2BAM (BWA-MEM), BWA-Meth, DeepVariant, and HaplotypeCaller were benchmarked on 30x whole genome sequencing with Illumina data.
DeepSomatic and Mutect2 were benchmarked on 50x tumor-normal whole genome sequencing with Illumina data.
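To see how well each tool scales from two to four GPUs, the runtimes in Table 1 can be turned into a speedup and a scaling efficiency. The sketch below copies the runtimes from Table 1. The efficiency definition (speedup divided by the ideal 2x from doubling the GPU count) is a common convention assumed here and is not something the benchmarks themselves report.

```python
# Runtimes in minutes, copied from Table 1.
# Column order: (H100 x2, H100 x4, L4 x2, L4 x4).
RUNTIMES_MIN = {
    "Giraffe": (65.8, 42.1, 84.9, 44.7),
    "DeepSomatic": (56.28, 35.13, 215.53, 108.55),
    "FQ2BAM (BWA-MEM)": (13.8, 9.15, 48.15, 27.88),
    "BWA-Meth": (27.43, 15.12, 77.35, 39.77),
    "DeepVariant": (9.6, 5.82, 23.48, 13.10),
    "HaplotypeCaller": (10.57, 4.90, 12.00, 7.73),
    "Mutect2": (25.80, 13.60, 55.8, 32.50),
}


def scaling(t_2gpu: float, t_4gpu: float) -> tuple:
    """Return (speedup, efficiency) when going from 2 GPUs to 4 GPUs."""
    speedup = t_2gpu / t_4gpu
    efficiency = speedup / 2.0  # doubling the GPUs would ideally give 2x
    return speedup, efficiency


report = {}
for tool, (h2, h4, l2, l4) in RUNTIMES_MIN.items():
    report[tool] = {"H100": scaling(h2, h4), "L4": scaling(l2, l4)}
    (hs, he), (ls, le) = report[tool]["H100"], report[tool]["L4"]
    print(f"{tool:<18} H100 {hs:.2f}x ({he:.0%})   L4 {ls:.2f}x ({le:.0%})")
```

An efficiency near 100% means the tool makes nearly full use of the extra GPUs. Values are derived from single published runtimes, so small differences should not be over-interpreted.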

Get started

With the NVIDIA Parabricks v4.4 release, scientists and researchers using graph genomes can now access Giraffe for pangenome alignment. Parabricks v4.4 supports the groundbreaking tool from UCSC by powering an accelerated version of Giraffe to help discover new biological insights—now even faster. 

Download NVIDIA Parabricks to get started with GPU-accelerated genomics analysis and join the conversation on the NVIDIA Parabricks Developer Forum.
