Improve Variant Calling Accuracy with NVIDIA Parabricks

Built for data scientists and bioinformaticians, NVIDIA Parabricks is a scalable genomics software suite for secondary analysis. It provides GPU-accelerated versions of open-source tools for increased speed and accuracy, so researchers can uncover biological insights faster.

The latest release, Parabricks v4.6, offers improvements to multiple features, most notably support for Google’s DeepVariant and DeepSomatic 1.9. This includes a pangenome-aware mode for DeepVariant, which improves accuracy across genetic variations and diverse populations.

New features:

DeepVariant and DeepSomatic 1.9, including pangenome-aware DeepVariant.
DeepSomatic long read and whole exome sequencing (WES) support.
STAR quantMode including GeneCounts.

Improved features:

STAR speedups: Almost 8x faster on two NVIDIA RTX PRO 6000 GPUs compared to CPU-only solutions.
Additional arguments for Mutectcaller, including mitochondrial mode.

Improve variant calling with DeepVariant and DeepSomatic 1.9

Variant calling is a critical step in genomic analysis. It identifies differences between a sample genome (for example, from an individual or a population) and a reference genome. Characterizing these differences gives scientists a better understanding of diseases and potential treatments.

There is a wide variety of tools built to address variant calling, including HaplotypeCaller and Mutect2 in the Genome Analysis Toolkit (GATK) from the Broad Institute. In addition to the industry standards from GATK, deep-learning-based variant callers have become widely used.

Developed by Google, DeepVariant and DeepSomatic use deep learning to identify variants. DeepVariant calls germline (inherited) variants, whereas DeepSomatic calls somatic variants: non-inherited mutations, including those found in tumor cells.

Enhancing variant calling accuracy is critical, particularly when considering genetic diversity. According to a recent paper, pangenome-aware DeepVariant reduced errors by up to 25.5% across all settings when compared to linear-reference-based DeepVariant.

“Taking genetic diversity into account is critical to accurate genome analysis, especially across diverse populations. New pangenome methods allow more comprehensive maps of genetic variation to inform analysis,” says Andrew Carroll, product lead at Google Research. “I’m excited by Parabricks v4.6 support for pangenome-aware DeepVariant v1.9, which combines the incredible speed of Parabricks with the new DeepVariant ability to directly use pangenome information during variant calling.”

Improve accuracy even more with Giraffe and DeepVariant v1.9

Traditional linear references, including the Genome Reference Consortium Human Build 38 (GRCh38), are built from the DNA of only a few individuals, providing a universal coordinate system for genomic research. However, these references don’t capture the full spectrum of genetic variation present across the broader human population. As a result, important subpopulation diversity is often underrepresented. This can introduce bias into subsequent analyses, such as read mapping and variant detection, which may miss or inaccurately interpret important genetic differences tied to ancestry or disease.

Unlike linear references, pangenomes are built by integrating multiple high-quality genomes from diverse individuals, capturing a much broader range of genetic variation present in human populations. This comprehensive approach reduces reference bias, improves variant detection across populations, and supports more accurate and equitable genomic analyses. Giraffe, a software tool developed by researchers at the University of California, Santa Cruz, enables efficient read alignment to pangenome graphs.

Giraffe maps genomic sequences to a reference pangenome rather than a traditional linear reference, improving variant-calling accuracy across diverse populations. Combining Giraffe with pangenome-aware mode in DeepVariant, which is now available in Parabricks v4.6, improves the accuracy of identified variants and provides the speed of Parabricks GPU acceleration.

Accuracy: In the pangenome-aware DeepVariant paper, open-source pangenome-aware DeepVariant was more accurate than a BWA-based workflow, with the following F1 scores:
- Pangenome-aware DeepVariant: SNP: 0.9981 | Indel: 0.9971
- BWA: SNP: 0.9973 | Indel: 0.9968
Speed: With GPU acceleration in Parabricks, the combined Giraffe and pangenome-aware DeepVariant runtime on four NVIDIA RTX PRO 6000 GPUs was over 14x faster than CPU-only Giraffe and DeepVariant with pangenome-aware mode.
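F1 scores this close to 1 can hide how much the remaining error shrinks. The sketch below treats 1 - F1 as a rough error proxy and computes the relative reduction between the two rows above. This proxy is an assumption made here for illustration; it is not necessarily the error measure behind the 25.5% figure quoted from the paper, and `f1_error_reduction` is a hypothetical helper name.

```shell
# Relative reduction (in percent) of the error proxy (1 - F1)
# when moving from a baseline F1 score to a new F1 score.
# Hypothetical helper for illustration only.
f1_error_reduction() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", 100 * ((1 - base) - (1 - new)) / (1 - base) }'
}

f1_error_reduction 0.9973 0.9981   # SNP: BWA vs. pangenome-aware DeepVariant
f1_error_reduction 0.9968 0.9971   # Indel: BWA vs. pangenome-aware DeepVariant
```

With the scores listed above, this prints 29.6 for SNPs and 9.4 for indels, which suggests the SNP gain is the larger of the two under this proxy.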

*Figure 1. Using four NVIDIA RTX PRO 6000 GPUs, the total runtime for pangenome-aware DeepVariant 1.9 and Giraffe was reduced from more than 9 hours on a CPU-only solution to under 40 minutes*

“Roche’s SBX technology enables sequencing at unparalleled data rates and flexible data processing workflows for different sequencing applications,” says John Mannion, VP Computational Sciences at Roche. “Through our collaboration with NVIDIA, we plan to leverage GPU-accelerated versions of multiple aligners, including Giraffe, to provide users with an integrated solution allowing for faster and more accurate analysis.”

Get started with Giraffe and DeepVariant

Existing users of Parabricks can run pangenome-aware DeepVariant after providing:

the appropriate FASTA reference file from the Giraffe index files,
the BAM file output from running Giraffe,
the graph GBZ file.

Instructions on obtaining these files are available in the Parabricks Giraffe documentation focused on Using Giraffe in Variant Calling workflows. The following steps also guide you through the process.

Step 1

Run baseline VG to generate a FASTA file from the graph.

Note that step 1 with baseline VG is a one-time run. Once you have the FASTA file from the graph, you don’t need to repeat step 1; run only steps 2 and 3 for each additional FASTQ sample.

# Extract the sequences corresponding to the list of paths to a FASTA file
docker run --rm --volume $(pwd):/workdir \
    --workdir /workdir \
    quay.io/vgteam/vg:v1.59.0 \
    vg paths -x hprc-v1.1-mc-grch38.gbz -p hprc-v1.1-mc-grch38.paths.sub -F > hprc-v1.1-mc-grch38.fa

# Index the fasta file
samtools faidx hprc-v1.1-mc-grch38.fa
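Because step 1 runs only once and every later sample depends on its output, a quick sanity check can be worthwhile before moving on. The sketch below is a hypothetical helper (`check_graph_fasta` is not part of Parabricks, VG, or samtools). It relies only on two facts: `samtools faidx` writes its index next to the FASTA as `<fasta>.fai`, and that index has one line per sequence.

```shell
# Hypothetical helper: confirm the FASTA from step 1 exists, is non-empty,
# and has an index whose entry count matches the number of FASTA records.
check_graph_fasta() {
    fasta="$1"
    if [ ! -s "$fasta" ] || [ ! -s "${fasta}.fai" ]; then
        echo "missing or empty: ${fasta} or ${fasta}.fai" >&2
        return 1
    fi
    # Each FASTA record starts with '>'; each .fai line describes one record.
    n_headers=$(grep -c '^>' "$fasta")
    n_indexed=$(wc -l < "${fasta}.fai" | tr -d ' ')
    if [ "$n_headers" -ne "$n_indexed" ]; then
        echo "index is stale: ${n_headers} FASTA records, ${n_indexed} index entries" >&2
        return 1
    fi
    echo "OK: ${n_indexed} sequences indexed"
}
```

For the files used here, the call would be `check_graph_fasta hprc-v1.1-mc-grch38.fa`. A non-zero exit status suggests rerunning step 1 or `samtools faidx`.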

Step 2

Next, run Giraffe normally.

# This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun giraffe --read-group "sample_rg1" \
    --sample "sample-name" --read-group-library "library" \
    --read-group-platform "platform" --read-group-pu "pu" \
    --dist-name /workdir/hprc-v1.1-mc-grch38.dist \
    --minimizer-name /workdir/hprc-v1.1-mc-grch38.min \
    --gbz-name /workdir/hprc-v1.1-mc-grch38.gbz \
    --ref-paths /workdir/hprc-v1.1-mc-grch38.paths.sub \
    --in-fq /workdir/${INPUT_FASTQ_1} /workdir/${INPUT_FASTQ_2} \
    --out-bam /outputdir/${OUTPUT_BAM}

Step 3

Finally, run pangenome_aware_deepvariant with three inputs: the BAM from step 2, the FASTA from step 1, and the graph GBZ file.

# Pangenome_aware_deepvariant
# This command assumes all the inputs are in the current working directory and all the outputs go to the same place.
docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun pangenome_aware_deepvariant \
    --ref /workdir/hprc-v1.1-mc-grch38.fa \
    --pangenome /workdir/hprc-v1.1-mc-grch38.gbz \
    --in-bam /workdir/${INPUT_BAM} \
    --out-variants /outputdir/${OUTPUT_VCF}
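Since step 1 is a one-time run, processing more samples comes down to repeating steps 2 and 3. The sketch below wraps the two commands above in a per-sample function, reusing the same flags and file names. Several details are assumptions introduced for illustration and are not taken from the Parabricks documentation: the FASTQ naming pattern (`<sample>_1.fastq.gz`, `<sample>_2.fastq.gz`), the output names, the `run_sample` helper, and the `DOCKER` override, which lets you set `DOCKER=echo` to print the commands instead of running them.

```shell
# Sketch only: run steps 2 and 3 for one sample.
# Assumptions (hypothetical): FASTQ naming, output naming, and the DOCKER override.
DOCKER="${DOCKER:-docker}"
IMAGE="nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1"
GRAPH="hprc-v1.1-mc-grch38"

run_sample() {
    sample="$1"

    # Step 2: align the reads to the pangenome graph with Giraffe.
    "$DOCKER" run --rm --gpus all --volume "$PWD":/workdir --volume "$PWD":/outputdir \
        --workdir /workdir "$IMAGE" \
        pbrun giraffe --read-group "${sample}_rg1" \
        --sample "$sample" --read-group-library "library" \
        --read-group-platform "platform" --read-group-pu "pu" \
        --dist-name "/workdir/${GRAPH}.dist" \
        --minimizer-name "/workdir/${GRAPH}.min" \
        --gbz-name "/workdir/${GRAPH}.gbz" \
        --ref-paths "/workdir/${GRAPH}.paths.sub" \
        --in-fq "/workdir/${sample}_1.fastq.gz" "/workdir/${sample}_2.fastq.gz" \
        --out-bam "/outputdir/${sample}.bam" || return 1

    # Step 3: call variants from that BAM using the FASTA from step 1.
    "$DOCKER" run --rm --gpus all --volume "$PWD":/workdir --volume "$PWD":/outputdir \
        --workdir /workdir "$IMAGE" \
        pbrun pangenome_aware_deepvariant \
        --ref "/workdir/${GRAPH}.fa" \
        --pangenome "/workdir/${GRAPH}.gbz" \
        --in-bam "/workdir/${sample}.bam" \
        --out-variants "/outputdir/${sample}.vcf"
}
```

A loop such as `for s in sampleA sampleB; do run_sample "$s" || break; done` would then process the samples one after another, stopping at the first failure. The sample names are placeholders.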

STAR improvements, including quantMode GeneCounts

In addition to pangenome-aware mode for DeepVariant, the latest release of Parabricks also includes improvements to STAR, a widely used RNA-sequencing aligner. STAR is particularly useful for its speed and accuracy on RNA-seq data across sequencing platforms and for its scalability to large datasets. Already available in Parabricks, STAR now benefits from further GPU acceleration, running nearly 8x faster on two NVIDIA RTX PRO 6000 GPUs than CPU-only solutions.

In the latest release of Parabricks, quantMode GeneCounts is a new option available for STAR, which is valuable for a variety of applications relevant to gene expression, QC, normalization, and data integration. During the mapping step of alignment, quantMode GeneCounts enables fast generation of gene-level read counts.

*Figure 2. Compared to CPU-only solutions that took over 105 minutes, STAR runtimes were reduced to under 14 minutes on two NVIDIA RTX PRO 6000 GPUs*

Get started with STAR

To enable gene-level counting, add the --quantMode GeneCounts argument to the STAR command. An example command is below.

docker run --rm --gpus all --volume $(pwd):/workdir --volume $(pwd):/outputdir \
    --workdir /workdir \
    nvcr.io/nvidia/clara/clara-parabricks:4.6.0-1 \
    pbrun rna_fq2bam \
    --genome-lib-dir ${GENOME_DIR} \
    --in-fq ${FASTQ1} ${FASTQ2} \
    --output-dir ${OUT_DIR} \
    --ref ${GENOME} \
    --out-bam ${OUT_BAM} \
    --num-gpus ${GPU_NUM} \
    --quantMode GeneCounts
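Once the run finishes, the gene-level counts usually feed into downstream expression analysis. The sketch below assumes Parabricks writes the same table as open-source STAR, a tab-separated ReadsPerGene.out.tab. In that format, column 1 is the gene ID, column 2 holds unstranded counts, columns 3 and 4 hold counts for the two strand orientations, and the first four rows are summary counters (N_unmapped, N_multimapping, N_noFeature, N_ambiguous). The file name and its location under the output directory are assumptions to verify against your own run, and `gene_counts` is a hypothetical helper.

```shell
# Hypothetical helper: print "gene<TAB>count" for one count column of a
# STAR-style ReadsPerGene.out.tab, skipping the four summary rows at the top.
gene_counts() {
    table="$1"
    column="${2:-2}"   # 2 = unstranded; 3 or 4 for stranded libraries
    awk -F '\t' -v col="$column" 'NR > 4 { print $1 "\t" $col }' "$table"
}
```

For example, `gene_counts "${OUT_DIR}/ReadsPerGene.out.tab" 2 > counts.unstranded.tsv` would save the unstranded counts. Which column to use depends on the library preparation protocol.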

Download Parabricks today

Download NVIDIA Parabricks v4.6 to get started with GPU-accelerated genomic analysis and join the conversation on the NVIDIA Parabricks Developer Forum.