Reproducible Performance

Reproduce these results on your own systems by following the instructions in the Measuring Training and Inferencing Performance on NVIDIA AI Platforms Reviewer's Guide.

Related Resources

HPC Performance

Review the latest GPU-acceleration factors of popular HPC applications.


Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology to test whether AI systems are ready to be deployed in the field to deliver meaningful results.

Related Resources

Read our blog on convergence for more details.

Get up and running quickly with NVIDIA’s complete solution stack:

  • Pull software containers from NVIDIA NGC.

  • Learn how NVIDIA A100 Tensor Core GPUs provide unprecedented acceleration at every scale, setting records in MLPerf.


NVIDIA Performance on MLPerf 1.1 Training Benchmarks

BERT Time to Train on A100

PyTorch | Precision: Mixed | Dataset: Wikipedia 2020/01/01 | Convergence criteria - refer to MLPerf requirements

MLPerf Training Performance

NVIDIA A100 Performance on MLPerf 1.1 AI Benchmarks - Closed Division

Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
MXNet | ResNet-50 v1.5 | 27.568 | 75.90% classification | 8x A100 | Inspur: NF5488A5 | 1.1-2050 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | ResNet-50 v1.5 | 4.534 | 75.90% classification | 64x A100 | DGX A100 | 1.1-2070 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | ResNet-50 v1.5 | 0.575 | 75.90% classification | 1,024x A100 | DGX A100 | 1.1-2078 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | ResNet-50 v1.5 | 0.347 | 75.90% classification | 4,320x A100 | DGX A100 | 1.1-2082 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | SSD | 7.979 | 23.0% mAP | 8x A100 | Inspur: NF5488A5 | 1.1-2050 | Mixed | COCO2017 | A100-SXM4-80GB
MXNet | SSD | 1.517 | 23.0% mAP | 64x A100 | Azure: 8x Standard_ND96amsr_v4 | 1.1-2005 | Mixed | COCO2017 | A100-SXM4-80GB
MXNet | SSD | 0.454 | 23.0% mAP | 1,024x A100 | DGX A100 | 1.1-2078 | Mixed | COCO2017 | A100-SXM4-80GB
MXNet | 3D U-Net | 23.464 | 0.908 Mean DICE score | 8x A100 | Inspur: NF5688M6 | 1.1-2051 | Mixed | KiTS19 | A100-SXM4-80GB
MXNet | 3D U-Net | 3.757 | 0.908 Mean DICE score | 72x A100 | DGX A100 | 1.1-2072 | Mixed | KiTS19 | A100-SXM4-80GB
MXNet | 3D U-Net | 1.262 | 0.908 Mean DICE score | 768x A100 | Azure: 96x Standard_ND96amsr_v4 | 1.1-2012 | Mixed | KiTS19 | A100-SXM4-80GB
PyTorch | BERT | 19.389 | 0.712 Mask-LM accuracy | 8x A100 | Inspur: NF5688M6 | 1.1-2053 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | BERT | 3.038 | 0.712 Mask-LM accuracy | 64x A100 | DGX A100 | 1.1-2071 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | BERT | 0.558 | 0.712 Mask-LM accuracy | 1,024x A100 | DGX A100 | 1.1-2079 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | BERT | 0.226 | 0.712 Mask-LM accuracy | 4,320x A100 | DGX A100 | 1.1-2083 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | Mask R-CNN | 45.667 | 0.377 Box min AP and 0.339 Mask min AP | 8x A100 | Inspur: NF5688M6 | 1.1-2053 | Mixed | COCO2017 | A100-SXM4-80GB
PyTorch | Mask R-CNN | 14.499 | 0.377 Box min AP and 0.339 Mask min AP | 32x A100 | DGX A100 | 1.1-2068 | Mixed | COCO2017 | A100-SXM4-80GB
PyTorch | Mask R-CNN | 3.242 | 0.377 Box min AP and 0.339 Mask min AP | 408x A100 | DGX A100 | 1.1-2076 | Mixed | COCO2017 | A100-SXM4-80GB
PyTorch | RNN-T | 33.377 | 0.058 Word Error Rate | 8x A100 | Inspur: NF5488A5 | 1.1-2052 | Mixed | LibriSpeech | A100-SXM4-80GB
PyTorch | RNN-T | 4.408 | 0.058 Word Error Rate | 128x A100 | DGX A100 | 1.1-2074 | Mixed | LibriSpeech | A100-SXM4-80GB
PyTorch | RNN-T | 2.375 | 0.058 Word Error Rate | 1,536x A100 | DGX A100 | 1.1-2080 | Mixed | LibriSpeech | A100-SXM4-80GB
TensorFlow | MiniGo | 264.868 | 50% win rate vs. checkpoint | 8x A100 | DGX A100 | 1.1-2067 | Mixed | Go | A100-SXM4-80GB
TensorFlow | MiniGo | 29.093 | 50% win rate vs. checkpoint | 256x A100 | DGX A100 | 1.1-2075 | Mixed | Go | A100-SXM4-80GB
TensorFlow | MiniGo | 15.465 | 50% win rate vs. checkpoint | 1,792x A100 | DGX A100 | 1.1-2081 | Mixed | Go | A100-SXM4-80GB
NVIDIA Merlin HugeCTR | DLRM | 1.698 | 0.8025 AUC | 8x A100 | Inspur: NF5688M6 | 1.1-2049 | Mixed | Criteo AI Lab's Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB
NVIDIA Merlin HugeCTR | DLRM | 0.685 | 0.8025 AUC | 64x A100 | DGX A100 | 1.1-2069 | Mixed | Criteo AI Lab's Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB
NVIDIA Merlin HugeCTR | DLRM | 0.633 | 0.8025 AUC | 112x A100 | DGX A100 | 1.1-2073 | Mixed | Criteo AI Lab's Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB

MLPerf™ v1.1 Training Closed: MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
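Time-to-train entries at different GPU counts can be converted into scaling efficiency with simple arithmetic. A minimal sketch, using the 8x and 1,024x A100 ResNet-50 v1.5 times from the table above (the helper name is illustrative, not part of any MLPerf tooling):

```python
def scaling_efficiency(t_small, n_small, t_large, n_large):
    """Fraction of ideal (linear) speedup retained when scaling GPU count."""
    speedup = t_small / t_large       # measured speedup from the two runs
    ideal = n_large / n_small         # linear speedup if scaling were perfect
    return speedup / ideal

# ResNet-50 v1.5 times from the MLPerf v1.1 table above (minutes)
eff = scaling_efficiency(27.568, 8, 0.575, 1024)
print(f"{eff:.1%}")  # roughly 37% of linear scaling at 1,024 GPUs
```

Sub-linear efficiency at the largest scales is expected: per-GPU batch sizes shrink and communication overhead grows, which is why the at-scale rows trade efficiency for the shortest absolute time to train.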


NVIDIA A100 Performance on MLPerf 1.0 Training HPC Benchmarks: Strong Scaling - Closed Division

Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
MXNet | CosmoFlow | 8.04 | Mean average error 0.124 | 1,024x A100 | DGX A100 | 1.0-1120 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB
MXNet | CosmoFlow | 25.78 | Mean average error 0.124 | 128x A100 | DGX A100 | 1.0-1121 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB
PyTorch | DeepCAM | 1.67 | IOU 0.82 | 2,048x A100 | DGX A100 | 1.0-1122 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB
PyTorch | DeepCAM | 2.65 | IOU 0.82 | 512x A100 | DGX A100 | 1.0-1123 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB

NVIDIA A100 Performance on MLPerf 1.0 Training HPC Benchmarks: Weak Scaling - Closed Division

Framework | Network | Throughput | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
MXNet | CosmoFlow | 0.73 models/min | Mean average error 0.124 | 4,096x A100 | DGX A100 | 1.0-1131 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB
PyTorch | DeepCAM | 5.27 models/min | IOU 0.82 | 4,096x A100 | DGX A100 | 1.0-1132 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB

MLPerf™ v1.0 Training HPC Closed: MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For MLPerf™ v1.0 Training HPC rules and guidelines, visit https://mlcommons.org/
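The weak-scaling metric above reports aggregate model instances completed per minute across the whole system; the average interval between completed models is simply its reciprocal. A small illustration using the two throughputs from the table:

```python
# Weak-scaling throughputs from the table above (models trained per minute
# across 4,096 GPUs)
throughput = {"CosmoFlow": 0.73, "DeepCAM": 5.27}

for net, models_per_min in throughput.items():
    minutes_per_model = 1.0 / models_per_min
    print(f"{net}: one trained model completes every {minutes_per_model:.2f} min")
```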


Converged Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.9.0 | ResNet-50 v1.5 | 85 | 77.14 Top 1 | 23,184 images/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 192 | ImageNet2012 | A100-SXM4-80GB
PyTorch | 1.8.0a0 | Mask R-CNN | 176 | 0.34 AP Segm | 167 images/sec | 8x A100 | DGX A100 | 20.12-py3 | TF32 | 8 | COCO 2014 | A100-SXM-80GB
PyTorch | 1.6.0a0 | SSD v1.1 | 43 | 0.25 mAP | 3,092 images/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 128 | COCO 2017 | A100-SXM-80GB
PyTorch | 1.10.0a0 | Tacotron2 | 99 | 0.56 Training Loss | 306,044 total output mels/sec | 8x A100 | DGX A100 | 21.08-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | WaveGlow | 285 | -5.84 Training Loss | 1,486,357 output samples/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-80GB
PyTorch | 1.6.0a0 | Jasper | 3,600 | 3.53 dev-clean WER | 603 sequences/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 64 | LibriSpeech | A100 SXM4-40GB
PyTorch | 1.10.0a0 | Transformer | 214 | 27.78 BLEU Score | 470,914 tokens/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 10240 | wmt14-en-de | A100-SXM4-80GB
PyTorch | 1.6.0a0 | FastPitch | 216 | 0.18 Training Loss | 1,040,206 frames/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 32 | LJSpeech 1.1 | A100 SXM4-40GB
PyTorch | 1.10.0a0 | GNMT V2 | 17 | 24.3 BLEU Score | 916,105 total tokens/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-80GB
PyTorch | 1.10.0a0 | NCF | 0.38 | 0.96 Hit Rate at 10 | 152,159,129 samples/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 131072 | MovieLens 20M | A100-SXM4-80GB
PyTorch | 1.10.0a0 | BERT-LARGE | 3 | 91.05 F1 | 922 sequences/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB
PyTorch | 1.9.0a0 | Transformer-XL Large | 408 | 14.03 Perplexity | 202,130 total tokens/sec | 8x A100 | DGX A100 | 21.06-py3 | Mixed | 16 | WikiText-103 | A100-SXM-80GB
PyTorch | 1.10.0a0 | Transformer-XL Base | 210 | 22.53 Perplexity | 628,500 total tokens/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 128 | WikiText-103 | A100-SXM4-80GB
PyTorch | 1.6.0a0 | BERT-Large Pre-Training P1 | 2,379 | - | 3,231 sequences/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | - | Wikipedia+BookCorpus | A100-SXM4-40GB
PyTorch | 1.6.0a0 | BERT-Large Pre-Training P2 | 1,377 | 1.34 Final Loss | 630 sequences/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | - | Wikipedia+BookCorpus | A100-SXM4-40GB
PyTorch | 1.6.0a0 | BERT-Large Pre-Training E2E | 3,756 | 1.34 Final Loss | - | 8x A100 | DGX A100 | 20.06-py3 | Mixed | - | Wikipedia+BookCorpus | A100-SXM4-40GB
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 95 | 76.97 Top 1 | 20,388 images/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB
TensorFlow | 1.15.5 | ResNext101 | 192 | 79.34 Top 1 | 10,078 images/sec | 8x A100 | DGX A100 | 21.08-py3 | Mixed | 256 | ImageNet2012 | A100-SXM-80GB
TensorFlow | 1.15.5 | SE-ResNext101 | 226 | 79.71 Top 1 | 8,578 images/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB
TensorFlow | 1.15.5 | U-Net Industrial | 1 | 0.99 IoU Threshold 0.99 | 1,051 images/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 2 | DAGM2007 | A100-SXM4-80GB
TensorFlow | 1.15.5 | U-Net Medical | 6 | 0.9 Dice Score | 952 images/sec | 8x A100 | DGX A100 | 21.03-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM-80GB
TensorFlow | 1.15.5 | VAE-CF | 1 | 0.43 NDCG@100 | 1,642,346 users processed/sec | 8x A100 | DGX A100 | 21.09-py3 | TF32 | 3072 | MovieLens 20M | A100-SXM4-80GB
TensorFlow | 2.6.0 | Wide and Deep | 7 | 0.66 MAP at 12 | 3,662,293 samples/sec | 8x A100 | DGX A100 | 21.09-py3 | TF32 | 16384 | Kaggle Outbrain Click Prediction | A100-SXM4-80GB
TensorFlow | 1.15.5 | BERT-LARGE | 11 | 91.48 F1 | 859 sequences/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 24 | SQuAD v1.1 | A100-SXM4-80GB
TensorFlow | 2.6.0 | Electra Base Fine Tuning | 3 | 92.58 F1 | 2,843 sequences/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB
TensorFlow | 2.2.0 | EfficientNet-B4 | 4,231 | 82.81 Top 1 | 2,535 images/sec | 8x A100 | DGX A100 | 20.08-py3 | Mixed | 160 | ImageNet2012 | A100-SXM-80GB
TensorFlow | 1.15.5 | V-Net Medical | 2 | 0.84 Anterior DICE | 1,227 images/sec | 8x A100 | DGX A100 | 21.09-py3 | TF32 | 2 | Hippocampus head and body from Medical Segmentation Decathlon | A100-SXM4-40GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
BERT-Large Pre-Training Sequence Length for Phase 1 = 128 and Phase 2 = 512 | Batch Size for Phase 1 = 65,536 and Phase 2 = 32,768
EfficientNet-B4: Mixup = 0.2 | Auto-Augmentation | cuDNN Version = 8.0.5.39 | NCCL Version = 2.7.8
Starting from 21.09-py3, ECC is enabled
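Given the per-phase sequence lengths noted above, the pre-training sequences/sec figures in the table convert directly to tokens/sec. A small sketch using the A100 BERT-Large pre-training throughputs (assuming fixed-length batches at the stated sequence lengths):

```python
# BERT-Large pre-training throughput from the table above (sequences/sec)
# paired with the per-phase sequence lengths from the note (128 and 512)
phases = {
    "Phase 1": (3231, 128),  # (sequences/sec, sequence length)
    "Phase 2": (630, 512),
}

for name, (seq_per_sec, seq_len) in phases.items():
    # tokens/sec = sequences/sec x tokens per sequence
    print(f"{name}: {seq_per_sec * seq_len:,} tokens/sec")
```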

A40 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.9.0 | ResNet-50 v1.5 | 203 | 77.2 Top 1 | 9,650 images/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 192 | ImageNet2012 | A40
PyTorch | 1.9.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 59,667,265 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 131072 | MovieLens 20M | A40
PyTorch | 1.10.0a0 | BERT-LARGE | 8 | 91.03 F1 | 387 sequences/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 32 | SQuAD v1.1 | A40
PyTorch | 1.10.0a0 | Tacotron2 | 131 | 0.55 Training Loss | 234,625 total output mels/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 128 | LJSpeech 1.1 | A40
PyTorch | 1.10.0a0 | WaveGlow | 562 | -5.8 Training Loss | 747,941 output samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 10 | LJSpeech 1.1 | A40
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 224 | 76.88 Top 1 | 8,626 images/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 256 | ImageNet2012 | A40
TensorFlow | 1.15.5 | U-Net Industrial | 1 | 0.99 IoU Threshold 0.95 | 626 images/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 2 | DAGM2007 | A40
TensorFlow | 1.15.5 | SE-ResNext101 | 542 | 79.52 Top 1 | 3,553 images/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 256 | ImageNet2012 | A40
TensorFlow | 2.6.0 | Electra Base Fine Tuning | 4 | 92.52 F1 | 1,090 sequences/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 32 | SQuAD v1.1 | A40

Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled

A30 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.8.0 | ResNet-50 v1.5 | 182 | 77.34 Top 1 | 10,739 images/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 192 | ImageNet2012 | A30
PyTorch | 1.9.0a0 | Tacotron2 | 215 | 0.54 Training Loss | 144,326 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 104 | LJSpeech 1.1 | A30
PyTorch | 1.9.0a0 | WaveGlow | 533 | -5.82 Training Loss | 794,511 output samples/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 10 | LJSpeech 1.1 | A30
PyTorch | 1.9.0a0 | Transformer | 1,108 | 27.58 BLEU Score | 87,584 words/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 2560 | wmt14-en-de | A30
PyTorch | 1.9.0a0 | GNMT V2 | 81 | 24.65 BLEU Score | 219,582 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | TF32 | 128 | wmt16-en-de | A30
PyTorch | 1.10.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 55,275,748 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 131072 | MovieLens 20M | A30
PyTorch | 1.10.0a0 | BERT-LARGE | 11 | 91.11 F1 | 278 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 10 | SQuAD v1.1 | A30
PyTorch | 1.9.0a0 | Transformer-XL Base | 151 | 22.16 Perplexity | 219,994 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 32 | WikiText-103 | A30
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 198 | 76.78 Top 1 | 9,798 images/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 256 | ImageNet2012 | A30
TensorFlow | 1.15.5 | U-Net Industrial | 1 | 0.99 IoU Threshold 0.99 | 576 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 2 | DAGM2007 | A30
TensorFlow | 1.15.5 | U-Net Medical | 9 | 0.9 DICE Score | 461 images/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 8 | EM segmentation challenge | A30
TensorFlow | 1.15.5 | VAE-CF | 1 | 0.43 NDCG@100 | 828,913 users processed/sec | 8x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | TF32 | 3072 | MovieLens 20M | A30
TensorFlow | 1.15.5 | SE-ResNext101 | 573 | 79.83 Top 1 | 3,399 images/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 96 | ImageNet2012 | A30
TensorFlow | 2.4.0 | Electra Base Fine Tuning | 6 | 92.65 F1 | 904 sequences/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 16 | SQuAD v1.1 | A30
TensorFlow | 1.15.5 | V-Net Medical | 3 | 0.84 Anterior DICE | 509 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 2 | Hippocampus head and body from Medical Segmentation Decathlon | A30

Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled

A10 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.8.0 | ResNet-50 v1.5 | 242 | 77.25 Top 1 | 8,117 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 192 | ImageNet2012 | A10
PyTorch | 1.9.0a0 | SE-ResNeXt101 | 996 | 80.24 Top 1 | 1,953 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 112 | ImageNet2012 | A10
PyTorch | 1.9.0a0 | Tacotron2 | 204 | 0.5 Training Loss | 151,946 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 104 | LJSpeech 1.1 | A10
PyTorch | 1.9.0a0 | WaveGlow | 637 | -5.84 Training Loss | 664,022 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 10 | LJSpeech 1.1 | A10
PyTorch | 1.9.0a0 | Transformer | 1,365 | 27.8 BLEU Score | 70,844 words/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 2560 | wmt14-en-de | A10
PyTorch | 1.9.0a0 | FastPitch | 177 | 0.25 Training Loss | 467,464 frames/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 32 | LJSpeech 1.1 | A10
PyTorch | 1.9.0a0 | GNMT V2 | 61 | 24.49 BLEU Score | 292,052 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 128 | wmt16-en-de | A10
PyTorch | 1.10.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 44,205,872 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 131072 | MovieLens 20M | A10
PyTorch | 1.10.0a0 | BERT-LARGE | 13 | 91.54 F1 | 224 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 10 | SQuAD v1.1 | A10
PyTorch | 1.9.0a0 | Transformer-XL Base | 176 | 22.16 Perplexity | 187,731 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 32 | WikiText-103 | A10
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 266 | 76.74 Top 1 | 7,283 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 256 | ImageNet2012 | A10
TensorFlow | 1.15.5 | U-Net Industrial | 1 | 0.99 IoU Threshold 0.99 | 526 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 2 | DAGM2007 | A10
TensorFlow | 1.15.5 | U-Net Medical | 14 | 0.9 DICE Score | 324 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 8 | EM segmentation challenge | A10
TensorFlow | 1.15.5 | VAE-CF | 1 | 0.43 NDCG@100 | 614,709 users processed/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | TF32 | 3072 | MovieLens 20M | A10
TensorFlow | 1.15.5 | SE-ResNext101 | 866 | 79.65 Top 1 | 2,240 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 96 | ImageNet2012 | A10
TensorFlow | 2.4.0 | Electra Base Fine Tuning | 6 | 92.62 F1 | 745 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 16 | SQuAD v1.1 | A10
TensorFlow | 1.15.5 | V-Net Medical | 3 | 0.84 Anterior DICE | 576 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 2 | Hippocampus head and body from Medical Segmentation Decathlon | A10

Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled

T4 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.9.0 | ResNet-50 v1.5 | 507 | 77.28 Top 1 | 3,860 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4
PyTorch | 1.9.0a0 | SE-ResNeXt101 | 1,770 | 79.94 Top 1 | 1,102 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 112 | ImageNet2012 | NVIDIA T4
PyTorch | 1.10.0a0 | Tacotron2 | 241 | 0.53 Training Loss | 125,992 total output mels/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.10.0a0 | WaveGlow | 1,041 | -5.82 Training Loss | 400,494 output samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.10.0a0 | Transformer | 2,234 | 27.56 BLEU Score | 42,963 tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.09-py3 | Mixed | 2560 | wmt14-en-de | NVIDIA T4
PyTorch | 1.7.0a0 | FastPitch | 319 | 0.21 Training Loss | 281,406 frames/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 32 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.10.0a0 | GNMT V2 | 92 | 24.12 BLEU Score | 157,601 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.09-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4
PyTorch | 1.10.0a0 | NCF | 2 | 0.96 Hit Rate at 10 | 28,643,324 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA T4
PyTorch | 1.10.0a0 | BERT-LARGE | 24 | 91.34 F1 | 125 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.09-py3 | Mixed | 10 | SQuAD v1.1 | NVIDIA T4
PyTorch | 1.9.0a0 | Transformer-XL Base | 318 | 22.12 Perplexity | 103,740 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.06-py3 | Mixed | 32 | WikiText-103 | NVIDIA T4
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 550 | 76.81 Top 1 | 3,496 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4
TensorFlow | 1.15.5 | U-Net Industrial | 2 | 0.99 IoU Threshold | 284 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.09-py3 | Mixed | 2 | DAGM2007 | NVIDIA T4
TensorFlow | 1.15.5 | U-Net Medical | 31 | 0.9 DICE Score | 155 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4
TensorFlow | 1.15.5 | VAE-CF | 2 | 0.43 NDCG@100 | 354,626 users processed/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.09-py3 | Mixed | 3072 | MovieLens 20M | NVIDIA T4
TensorFlow | 1.15.4 | SSD | 112 | 0.28 mAP | 549 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.12-py3 | Mixed | 32 | COCO 2017 | NVIDIA T4
TensorFlow | 1.15.5 | Mask R-CNN | 492 | 0.34 AP Segm | 53 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.03-py3 | Mixed | 4 | COCO 2014 | NVIDIA T4
TensorFlow | 1.15.5 | ResNext101 | 1,222 | 79.22 Top 1 | 1,577 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 128 | ImageNet2012 | NVIDIA T4
TensorFlow | 1.15.5 | SE-ResNext101 | 1,648 | 79.86 Top 1 | 1,169 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.09-py3 | Mixed | 96 | ImageNet2012 | NVIDIA T4
TensorFlow | 2.6.0 | Electra Base Fine Tuning | 9 | 92.48 F1 | 396 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | NVIDIA T4
TensorFlow | 2.6.0 | Wide and Deep | 2 | 0.66 MAP at 12 | 764,329 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.09-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | NVIDIA T4

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled

V100 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.9.0 | ResNet-50 v1.5 | 171 | 77.3 Top 1 | 11,633 images/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
PyTorch | 1.10.0a0 | Mask R-CNN | 265 | 0.16 AP Segm | 111 images/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 8 | COCO 2014 | V100-SXM3-32GB
PyTorch | 1.10.0a0 | Tacotron2 | 191 | 0.53 Training Loss | 152,506 total output mels/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB
PyTorch | 1.10.0a0 | WaveGlow | 456 | -5.74 Training Loss | 931,487 output samples/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB
PyTorch | 1.6.0a0 | Jasper | 6,300 | 3.49 dev-clean WER | 312 sequences/sec | 8x V100 | DGX-2 | 20.06-py3 | Mixed | 64 | LibriSpeech | V100 SXM2-32GB
PyTorch | 1.10.0a0 | Transformer | 474 | 27.58 BLEU Score | 208,238 tokens/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 5120 | wmt14-en-de | V100-SXM3-32GB
PyTorch | 1.6.0a0 | FastPitch | 354 | 0.18 Training Loss | 570,968 frames/sec | 8x V100 | DGX-1 | 20.06-py3 | Mixed | 32 | LJSpeech 1.1 | V100 SXM2-16GB
PyTorch | 1.10.0a0 | GNMT V2 | 34 | 24.35 BLEU Score | 438,681 total tokens/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB
PyTorch | 1.10.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 97,664,714 samples/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 131072 | MovieLens 20M | V100-SXM3-32GB
PyTorch | 1.10.0a0 | BERT-LARGE | 8 | 91.31 F1 | 370 sequences/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB
PyTorch | 1.9.0a0 | Transformer-XL Base | 116 | 22.05 Perplexity | 284,635 total tokens/sec | 8x V100 | DGX-2 | 21.06-py3 | Mixed | 32 | WikiText-103 | V100-SXM3-32GB
PyTorch | 1.10.0a0 | ResNeXt101 | 500 | 79.5 Top 1 | 3,915 images/sec | 8x V100 | DGX-2 | 21.08-py3 | Mixed | 112 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 189 | 77.07 Top 1 | 10,230 images/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | 1.15.5 | ResNext101 | 425 | 79.3 Top 1 | 4,558 images/sec | 8x V100 | DGX-2 | 21.08-py3 | Mixed | 128 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | 1.15.5 | SE-ResNext101 | 534 | 79.8 Top 1 | 3,633 images/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 96 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | 1.15.5 | U-Net Industrial | 1 | 0.99 IoU Threshold 0.99 | 665 images/sec | 8x V100 | DGX-2 | 21.06-py3 | Mixed | 2 | DAGM2007 | V100-SXM3-32GB
TensorFlow | 1.15.5 | U-Net Medical | 12 | 0.89 DICE Score | 465 images/sec | 8x V100 | DGX-2 | 21.06-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB
TensorFlow | 1.15.5 | VAE-CF | 1 | 0.43 NDCG@100 | 907,352 users processed/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 3072 | MovieLens 20M | V100-SXM3-32GB
TensorFlow | 2.6.0 | Wide and Deep | 12 | 0.66 MAP at 12 | 2,132,710 samples/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | V100-SXM3-32GB
TensorFlow | 1.15.5 | BERT-LARGE | 18 | 91.54 F1 | 333 sequences/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB
TensorFlow | 2.6.0 | Electra Base Fine Tuning | 4 | 92.49 F1 | 1,459 sequences/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 32 | SQuAD v1.1 | V100-SXM3-32GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled


Converged Training Performance of NVIDIA GPU on Cloud

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Training Performance on Cloud

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.9.0a0 | BERT-LARGE | 3 | 91.31 F1 | 876 sequences/sec | 8x A100 | AWS EC2 p4d.24xlarge | 21.06-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-40GB
PyTorch | 1.10.0a0 | BERT-LARGE | 3 | 91.05 F1 | 877 sequences/sec | 8x A100 | GCP A2-HIGHGPU-8G | 21.09-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-40GB
TensorFlow | 1.15.5 | BERT-LARGE | 13 | 91.4 F1 | 759 sequences/sec | 8x A100 | GCP A2-HIGHGPU-8G | 21.09-py3 | Mixed | 24 | SQuAD v1.1 | A100-SXM4-40GB

BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled

V100 Training Performance on Cloud

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.10.0a0 | BERT-LARGE | 8 | 91.25 F1 | 355 sequences/sec | 8x V100 | GCP N1-HIGHMEM-64 | 21.09-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM2-16GB

BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled


Converged Multi-Node Training Performance of NVIDIA GPU

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Multi-Node Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Nodes | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P1 | 322 | 1.54 Training Loss | 25,927 sequences/sec | 8x A100 | 8 | Selene | 21.09-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P2 | 168 | 1.36 Training Loss | 5,146 sequences/sec | 8x A100 | 8 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training E2E | 271 | 1.36 Training Loss | - | 8x A100 | 8 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P1 | 165 | 1.52 Training Loss | 50,548 sequences/sec | 8x A100 | 16 | Selene | 21.09-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P2 | 86 | 1.35 Training Loss | 10,101 sequences/sec | 8x A100 | 16 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training E2E | 138 | 1.35 Training Loss | - | 8x A100 | 16 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P1 | 83 | 1.49 Training Loss | 93,888 sequences/sec | 8x A100 | 32 | Selene | 21.09-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P2 | 45 | 1.34 Training Loss | 19,597 sequences/sec | 8x A100 | 32 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training E2E | 70 | 1.34 Training Loss | - | 8x A100 | 32 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P1 | 46 | 1.5 Training Loss | 167,820 sequences/sec | 8x A100 | 64 | Selene | 21.09-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P2 | 25 | 1.33 Training Loss | 37,847 sequences/sec | 8x A100 | 64 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training E2E | 39 | 1.33 Training Loss | - | 8x A100 | 64 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P1 | 26 | 1.5 Training Loss | 300,769 sequences/sec | 8x A100 | 128 | Selene | 21.09-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P2 | 13 | 1.35 Training Loss | 74,498 sequences/sec | 8x A100 | 128 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training E2E | 22 | 1.35 Training Loss | - | 8x A100 | 128 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | Transformer | 190 | 18.35 Perplexity | 444,469 total tokens/sec | 8x A100 | 2 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | Transformer | 106 | 18.31 Perplexity | 799,988 total tokens/sec | 8x A100 | 4 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | Transformer | 65 | 18.26 Perplexity | 1,333,045 total tokens/sec | 8x A100 | 8 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB

BERT-Large Pre-Training Phase 1 Sequence Length = 128
BERT-Large Pre-Training Phase 2 Sequence Length = 512
Starting from 21.09-py3, ECC is enabled
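The Phase-1 throughput column makes it easy to check how close multi-node pre-training stays to linear scaling as node count grows. A minimal sketch over the table's numbers:

```python
# BERT-Large Phase-1 pre-training throughput (sequences/sec) by node count,
# taken from the multi-node table above
p1_throughput = {8: 25927, 16: 50548, 32: 93888, 64: 167820, 128: 300769}

base_nodes = 8
base_tp = p1_throughput[base_nodes]
for nodes, tp in sorted(p1_throughput.items()):
    # efficiency = measured speedup over the 8-node run / ideal linear speedup
    eff = (tp / base_tp) / (nodes / base_nodes)
    print(f"{nodes:>3} nodes: {eff:.1%} of linear scaling")
```

At 128 nodes the run retains roughly 72% of linear scaling relative to the 8-node baseline, which is why end-to-end time keeps dropping, but more slowly than the node count grows.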

Single-GPU Training

Some scenarios aren’t used in real-world training, such as single-GPU throughput. The table below provides an indication of a platform’s single-chip throughput.

Related Resources

Achieve unprecedented acceleration at every scale with NVIDIA’s complete solution stack.


NVIDIA’s complete solution stack, from hardware to software, allows data scientists to deliver unprecedented acceleration at every scale. Visit the NVIDIA NGC catalog to pull containers and quickly get up and running with deep learning.
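Single-chip throughput figures of the form "images/sec" are, at bottom, processed items divided by wall-clock time, measured after a warm-up period. A framework-agnostic sketch of that measurement pattern (the `train_step` callable here is a stand-in for a real training iteration, not NVIDIA's benchmark code):

```python
import time

def measure_throughput(train_step, batch_size, warmup=3, iters=10):
    """Items/sec over timed iterations, excluding warm-up runs
    (throughput tables conventionally discard warm-up)."""
    for _ in range(warmup):
        train_step()  # let caches, allocators, and autotuners settle
    start = time.perf_counter()
    for _ in range(iters):
        train_step()
    elapsed = time.perf_counter() - start
    return iters * batch_size / elapsed

# Stand-in workload: any callable that processes one batch
dummy_step = lambda: sum(i * i for i in range(10_000))
print(f"{measure_throughput(dummy_step, batch_size=256):,.0f} items/sec")
```

With a real framework, `train_step` would run one forward/backward/update pass on a fixed batch; everything else stays the same.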


Single GPU Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.9.0 | ResNet-50 v1.5 | 2,963 images/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 192 | ImageNet2012 | A100-SXM-80GB
PyTorch | 1.10.0a0 | Mask R-CNN | 29 images/sec | 1x A100 | DGX A100 | 21.09-py3 | TF32 | 8 | COCO 2014 | A100-SXM-80GB
PyTorch | 1.9.0a0 | SSD v1.1 | 447 images/sec | 1x A100 | DGX A100 | 21.06-py3 | Mixed | 128 | COCO 2017 | A100-SXM-80GB
PyTorch | 1.10.0a0 | Tacotron2 | 40,553 total output mels/sec | 1x A100 | DGX A100 | 21.09-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | WaveGlow | 202,578 output samples/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM-80GB
PyTorch | 1.10.0a0 | Jasper | 80 sequences/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 64 | LibriSpeech | A100-SXM-80GB
PyTorch | 1.6.0a0 | Transformer | 82,618 words/sec | 1x A100 | DGX A100 | 20.06-py3 | Mixed | 10240 | wmt14-en-de | A100 SXM4-40GB
PyTorch | 1.10.0a0 | FastPitch | 180,663 frames/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 128 | LJSpeech 1.1 | A100-SXM4-80GB
PyTorch | 1.10.0a0 | GNMT V2 | 158,825 total tokens/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 128 | wmt16-en-de | A100-SXM-80GB
PyTorch | 1.10.0a0 | NCF | 37,073,053 samples/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM-80GB
PyTorch | 1.10.0a0 | BERT-LARGE | 122 sequences/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.9.0a0 | Transformer-XL Large | 28,503 total tokens/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 16 | WikiText-103 | A100-SXM-80GB
PyTorch | 1.10.0a0 | Transformer-XL Base | 76,960 total tokens/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 128 | WikiText-103 | A100-SXM-80GB
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 2,649 images/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 256 | ImageNet2012 | A100-SXM-80GB
TensorFlow | 1.15.5 | ResNext101 | 1,297 images/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 256 | ImageNet2012 | A100-SXM-80GB
TensorFlow | 1.15.5 | SE-ResNext101 | 1,120 images/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 256 | ImageNet2012 | A100-SXM-80GB
TensorFlow | 1.15.5 | U-Net Industrial | 346 images/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 16 | DAGM2007 | A100-SXM4-40GB
TensorFlow | 2.6.0 | U-Net Medical | 150 images/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM-80GB
TensorFlow | 1.15.5 | VAE-CF | 403,011 users processed/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 24576 | MovieLens 20M | A100-SXM-80GB
TensorFlow | 2.6.0 | Wide and Deep | 2,417,876 samples/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A100-SXM4-40GB
TensorFlow | 1.15.5 | BERT-LARGE | 116 sequences/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 24 | SQuAD v1.1 | A100-SXM-80GB
TensorFlow | 2.6.0 | Electra Base Fine Tuning | 380 sequences/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM-80GB
TensorFlow | - | EfficientNet-B4 | 332 images/sec | 1x A100 | DGX A100 | - | Mixed | 160 | ImageNet2012 | A100-SXM-80GB
TensorFlow | 1.15.5 | NCF | 42,093,685 samples/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-40GB
TensorFlow | 2.4.0 | Mask R-CNN | 30 samples/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 4 | COCO 2014 | A100-SXM4-40GB
TensorFlow | 1.15.5 | V-Net Medical | 1,723 images/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 32 | Hippocampus head and body from Medical Segmentation Decathlon | A100-SXM4-40GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
EfficientNet-B4: Basic Augmentation | cuDNN Version = 8.0.5.32 | NCCL Version = 2.7.8 | Installation Source = NGC catalog
Starting from 21.09-py3, ECC is enabled

A40 Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.9.0 | ResNet-50 v1.5 | 1,192 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 192 | ImageNet2012 | A40
PyTorch | 1.10.0a0 | Mask R-CNN | 14 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | TF32 | 8 | COCO 2014 | A40
PyTorch | 1.9.0a0 | SSD v1.1 | 222 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.06-py3 | Mixed | 128 | COCO 2017 | A40
PyTorch | 1.10.0a0 | Tacotron2 | 24,495 total output mels/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 128 | LJSpeech 1.1 | A40
PyTorch | 1.10.0a0 | WaveGlow | 120,308 output samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 10 | LJSpeech 1.1 | A40
PyTorch | 1.10.0a0 | GNMT V2 | 82,036 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 128 | wmt16-en-de | A40
PyTorch | 1.10.0a0 | NCF | 20,388,435 samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 1048576 | MovieLens 20M | A40
PyTorch | 1.9.0a0 | Transformer-XL Large | 15,301 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 16 | WikiText-103 | A40
PyTorch | 1.10.0a0 | BERT-LARGE | 58 sequences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 32 | SQuAD v1.1 | A40
PyTorch | 1.10.0a0 | FastPitch | 122,507 frames/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 128 | LJSpeech 1.1 | A40
PyTorch | 1.10.0a0 | Transformer-XL Base | 44,361 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 128 | WikiText-103 | A40
PyTorch | 1.10.0a0 | Jasper | 43 sequences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 64 | LibriSpeech | A40
PyTorch | 1.10.0a0 | Transformer | 30,906 tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 5120 | wmt14-en-de | A40
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 1,325 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 256 | ImageNet2012 | A40
TensorFlow | 1.15.5 | SSD | 214 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.06-py3 | Mixed | 32 | COCO 2017 | A40
TensorFlow | 1.15.5 | U-Net Industrial | 112 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 16 | DAGM2007 | A40
TensorFlow | 1.15.5 | BERT-LARGE | 55 sentences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 24 | SQuAD v1.1 | A40
TensorFlow | 1.15.5 | VAE-CF | 214,146 users processed/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 24576 | MovieLens 20M | A40
TensorFlow | 2.6.0 | U-Net Medical | 68 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 8 | EM segmentation challenge | A40
TensorFlow | 2.6.0 | Wide and Deep | 879,070 samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A40
TensorFlow | 1.15.5 | ResNext101 | 571 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 256 | ImageNet2012 | A40
TensorFlow | 1.15.5 | SE-ResNext101 | 525 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 256 | ImageNet2012 | A40
TensorFlow | 2.5.0 | Electra Base Fine Tuning | 180 sequences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 32 | SQuAD v1.1 | A40
TensorFlow | 1.15.5 | V-Net Medical | 878 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 32 | Hippocampus head and body from Medical Segmentation Decathlon | A40

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled

A30 Training Performance

FrameworkFramework VersionNetworkThroughputGPUServerContainerPrecisionBatch SizeDatasetGPU Version
MXNet1.9.0ResNet-50 v1.51,452 images/sec1x A30GIGABYTE G482-Z52-0021.09-py3Mixed192ImageNet2012A30
PyTorch1.9.0a0SSD v1.1226 images/sec1x A30GIGABYTE G482-Z52-SW-QZ-00121.06-py3Mixed64COCO 2017A30
1.10.0a0Tacotron219,716 total output mels/sec1x A30GIGABYTE G482-Z52-0021.09-py3Mixed104LJSpeech 1.1A30
1.10.0a0WaveGlow117,277 output samples/sec1x A30GIGABYTE G482-Z52-0021.09-py3Mixed10LJSpeech 1.1A30
1.10.0a0Transformer24,302 tokens/sec1x A30GIGABYTE G482-Z52-0021.09-py3Mixed2560wmt14-en-deA30
1.10.0a0FastPitch98,275 frames/sec1x A30GIGABYTE G482-Z52-0021.08-py3Mixed64LJSpeech 1.1A30
1.10.0a0NCF19,512,965 samples/sec1x A30GIGABYTE G482-Z52-0021.09-py3Mixed1048576MovieLens 20MA30
1.10.0a0GNMT V284,768 total tokens/sec1x A30GIGABYTE G482-Z52-0021.09-py3Mixed128wmt16-en-deA30
1.9.0a0Transformer-XL Base41,124 total tokens/sec1x A30GIGABYTE G482-Z52-0021.04-py3Mixed32WikiText-103A30
1.10.0a0ResNeXt101539 images/sec1x A30GIGABYTE G482-Z52-0021.09-py3Mixed112Imagenet2012A30
1.10.0a0Jasper34 sequences/sec1x A30GIGABYTE G482-Z52-0021.09-py3Mixed16LibriSpeechA30
1.9.0a0Transformer-XL Large12,617 total tokens/sec1x A30GIGABYTE G482-Z52-SW-QZ-00121.05-py3Mixed4WikiText-103A30
1.10.0a0BERT-LARGE51 sequences/sec1x A30GIGABYTE G482-Z52-0021.09-py3Mixed10SQuaD v1.1A30
TensorFlow1.15.5ResNet-50 v1.51,336 images/sec1x A30GIGABYTE G482-Z52-0021.09-py3Mixed256ImageNet2012A30
1.15.5ResNext101592 images/sec1x A30GIGABYTE G482-Z52-0021.09-py3Mixed128Imagenet2012A30
1.15.5SE-ResNext101489 images/sec1x A30GIGABYTE G482-Z52-0021.09-py3Mixed96Imagenet2012A30
1.15.5U-Net Industrial109 images/sec1x A30GIGABYTE G482-Z52-0021.09-py3Mixed16DAGM2007A30
2.6.0U-Net Medical71 images/sec1x A30GIGABYTE G482-Z52-0021.09-py3Mixed8EM segmentation challengeA30
1.15.5VAE-CF200,126 users processed/sec1x A30GIGABYTE G482-Z52-0021.09-py3Mixed24576MovieLens 20MA30
2.6.0Wide and Deep851,130 samples/sec1x A30GIGABYTE G482-Z52-0021.09-py3Mixed131072Kaggle Outbrain Click PredictionA30
2.4.0Mask R-CNN21 samples/sec1x A30GIGABYTE G482-Z52-SW-QZ-00121.05-py3Mixed4COCO 2014A30
2.6.0Electra Base Fine Tuning153 sequences/sec1x A30GIGABYTE G482-Z52-0021.09-py3Mixed16SQuaD v1.1A30
1.15.5SSD201 images/sec1x A30GIGABYTE G482-Z52-0021.06-py3Mixed32COCO 2017A30
1.15.5V-Net Medical959 images/sec1x A30GIGABYTE G482-Z52-0021.09-py3Mixed32Hippocampus head and body from Medical Segmentation DecathlonA30

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled

A10 Training Performance

FrameworkFramework VersionNetworkThroughputGPUServerContainerPrecisionBatch SizeDatasetGPU Version
MXNet1.9.0ResNet-50 v1.51,017 images/sec1x A10GIGABYTE G482-Z52-0021.06-py3Mixed192ImageNet2012A10
PyTorch1.9.0a0SSD v1.1173 images/sec1x A10GIGABYTE G482-Z52-0021.06-py3Mixed64COCO 2017A10
1.10.0a0Tacotron219,636 total output mels/sec1x A10GIGABYTE G482-Z52-0021.09-py3Mixed104LJSpeech 1.1A10
1.10.0a0WaveGlow96,531 output samples/sec1x A10GIGABYTE G482-Z52-0021.08-py3Mixed10LJSpeech 1.1A10
1.10.0a0Transformer20,756 tokens/sec1x A10GIGABYTE G482-Z52-0021.09-py3Mixed2560wmt14-en-deA10
1.10.0a0FastPitch94,160 frames/sec1x A10GIGABYTE G482-Z52-0021.08-py3Mixed64LJSpeech 1.1A10
1.9.0a0Transformer-XL Base34,901 total tokens/sec1x A10GIGABYTE G482-Z52-0021.04-py3Mixed32WikiText-103A10
1.10.0a0GNMT V261,526 total tokens/sec1x A10GIGABYTE G482-Z52-0021.09-py3Mixed128wmt16-en-deA10
1.10.0a0ResNeXt101298 images/sec1x A10GIGABYTE G482-Z52-0021.09-py3Mixed112Imagenet2012A10
1.10.0a0SE-ResNeXt101246 images/sec1x A10GIGABYTE G482-Z52-0021.09-py3Mixed112Imagenet2012A10
1.10.0a0NCF16,806,585 samples/sec1x A10GIGABYTE G482-Z52-0021.08-py3Mixed1048576MovieLens 20MA10
1.10.0a0Jasper25 sequences/sec1x A10GIGABYTE G482-Z52-0021.09-py3Mixed16LibriSpeechA10
1.9.0a0Transformer-XL Large10,699 total tokens/sec1x A10GIGABYTE G482-Z52-0021.05-py3Mixed4WikiText-103A10
1.10.0a0BERT-LARGE40 sequences/sec1x A10GIGABYTE G482-Z52-0021.08-py3Mixed10SQuaD v1.1A10
TensorFlow1.15.5ResNet-50 v1.5995 images/sec1x A10GIGABYTE G482-Z52-0021.08-py3Mixed256ImageNet2012A10
1.15.5ResNext101412 images/sec1x A10GIGABYTE G482-Z52-0021.08-py3Mixed128Imagenet2012A10
1.15.5SE-ResNext101293 images/sec1x A10GIGABYTE G482-Z52-0021.09-py3Mixed96Imagenet2012A10
1.15.5U-Net Industrial90 images/sec1x A10GIGABYTE G482-Z52-0021.09-py3Mixed16DAGM2007A10
2.6.0U-Net Medical47 images/sec1x A10GIGABYTE G482-Z52-0021.09-py3Mixed8EM segmentation challengeA10
1.15.5VAE-CF175,029 users processed/sec1x A10GIGABYTE G482-Z52-0021.08-py3Mixed24576MovieLens 20MA10
2.6.0Wide and Deep721,854 samples/sec1x A10GIGABYTE G482-Z52-0021.09-py3Mixed131072Kaggle Outbrain Click PredictionA10
2.6.0Electra Base Fine Tuning119 sequences/sec1x A10GIGABYTE G482-Z52-0021.09-py3Mixed16SQuaD v1.1A10
2.4.0Mask R-CNN18 samples/sec1x A10GIGABYTE G482-Z52-0021.05-py3Mixed4COCO 2014A10
1.15.5SSD180 images/sec1x A10GIGABYTE G482-Z52-0021.06-py3Mixed32COCO 2017A10

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled

T4 Training Performance

FrameworkFramework VersionNetworkThroughputGPUServerContainerPrecisionBatch SizeDatasetGPU Version
MXNet1.9.0ResNet-50 v1.5444 images/sec1x T4Supermicro SYS-1029GQ-TRT21.09-py3Mixed192ImageNet2012NVIDIA T4
PyTorch1.10.0a0ResNeXt101180 images/sec1x T4Supermicro SYS-1029GQ-TRT21.08-py3Mixed112Imagenet2012NVIDIA T4
1.10.0a0Tacotron217,331 total output mels/sec1x T4Supermicro SYS-1029GQ-TRT21.08-py3Mixed104LJSpeech 1.1NVIDIA T4
1.10.0a0WaveGlow53,856 output samples/sec1x T4Supermicro SYS-1029GQ-TRT21.08-py3Mixed10LJSpeech 1.1NVIDIA T4
1.10.0a0Transformer9,816 tokens/sec1x T4Supermicro SYS-1029GQ-TRT21.09-py3Mixed2560wmt14-en-deNVIDIA T4
1.10.0a0FastPitch40,379 frames/sec1x T4Supermicro SYS-1029GQ-TRT21.08-py3Mixed64LJSpeech 1.1NVIDIA T4
1.10.0a0GNMT V230,244 total tokens/sec1x T4Supermicro SYS-1029GQ-TRT21.09-py3Mixed128wmt16-en-deNVIDIA T4
1.10.0a0NCF8,091,013 samples/sec1x T4Supermicro SYS-1029GQ-TRT21.08-py3Mixed1048576MovieLens 20MNVIDIA T4
1.10.0a0BERT-LARGE19 sequences/sec1x T4Supermicro SYS-1029GQ-TRT21.08-py3Mixed10SQuaD v1.1NVIDIA T4
1.9.0a0Transformer-XL Base17,182 total tokens/sec1x T4Supermicro SYS-4029GP-TRT21.04-py3Mixed32WikiText-103NVIDIA T4
1.10.0a0Jasper11 sequences/sec1x T4Supermicro SYS-1029GQ-TRT21.09-py3Mixed16LibriSpeechNVIDIA T4
1.10.0a0SE-ResNeXt101146 images/sec1x T4Supermicro SYS-1029GQ-TRT21.08-py3Mixed112Imagenet2012NVIDIA T4
1.9.0a0Transformer-XL Large5,231 total tokens/sec1x T4Supermicro SYS-4029GP-TRT21.05-py3Mixed4WikiText-103NVIDIA T4
TensorFlow1.15.5ResNet-50 v1.5412 images/sec1x T4Supermicro SYS-1029GQ-TRT21.09-py3Mixed256ImageNet2012NVIDIA T4
1.15.5U-Net Industrial46 images/sec1x T4Supermicro SYS-1029GQ-TRT21.08-py3Mixed16DAGM2007NVIDIA T4
2.6.0U-Net Medical21 images/sec1x T4Supermicro SYS-1029GQ-TRT21.09-py3Mixed8EM segmentation challengeNVIDIA T4
1.15.5VAE-CF81,989 users processed/sec1x T4Supermicro SYS-1029GQ-TRT21.08-py3Mixed24576MovieLens 20MNVIDIA T4
1.15.5SSD98 images/sec1x T4Supermicro SYS-4029GP-TRT21.06-py3Mixed32COCO 2017NVIDIA T4
2.4.0Mask R-CNN9 samples/sec1x T4Supermicro SYS-4029GP-TRT21.05-py3Mixed4COCO 2014NVIDIA T4
2.6.0Wide and Deep330,004 samples/sec1x T4Supermicro SYS-1029GQ-TRT21.09-py3Mixed131072Kaggle Outbrain Click PredictionNVIDIA T4
1.15.5SE-ResNext101155 images/sec1x T4Supermicro SYS-1029GQ-TRT21.09-py3Mixed96Imagenet2012NVIDIA T4
1.15.5ResNext101188 images/sec1x T4Supermicro SYS-1029GQ-TRT21.09-py3Mixed128Imagenet2012NVIDIA T4
2.6.0Electra Base Fine Tuning59 sequences/sec1x T4Supermicro SYS-1029GQ-TRT21.09-py3Mixed16SQuaD v1.1NVIDIA T4
1.15.5V-Net Medical427 images/sec1x T4Supermicro SYS-1029GQ-TRT21.09-py3Mixed32Hippocampus head and body from Medical Segmentation DecathlonNVIDIA T4

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled

V100 Training Performance

FrameworkFramework VersionNetworkThroughputGPUServerContainerPrecisionBatch SizeDatasetGPU Version
MXNet1.9.0ResNet-50 v1.51,490 images/sec1x V100DGX-221.09-py3Mixed256ImageNet2012V100-SXM3-32GB
PyTorch1.10.0a0ResNeXt101548 images/sec1x V100DGX-221.09-py3Mixed112Imagenet2012V100-SXM3-32GB
1.9.0a0SSD v1.1233 images/sec1x V100DGX-221.06-py3Mixed64COCO 2017V100-SXM3-32GB
1.10.0a0Tacotron222,417 total output mels/sec1x V100DGX-221.09-py3Mixed104LJSpeech 1.1V100-SXM3-32GB
1.10.0a0WaveGlow129,673 output samples/sec1x V100DGX-221.09-py3Mixed10LJSpeech 1.1V100-SXM3-32GB
1.10.0a0Jasper42 sequences/sec1x V100DGX-221.09-py3Mixed64LibriSpeechV100-SXM3-32GB
1.10.0a0Transformer31,644 tokens/sec1x V100DGX-221.09-py3Mixed5120wmt14-en-deV100-SXM3-32GB
1.10.0a0FastPitch121,642 frames/sec1x V100DGX-221.08-py3Mixed64LJSpeech 1.1V100-SXM3-32GB
1.10.0a0GNMT V276,246 total tokens/sec1x V100DGX-221.09-py3Mixed128wmt16-en-deV100-SXM3-32GB
1.10.0a0NCF22,004,456 samples/sec1x V100DGX-221.09-py3Mixed1048576MovieLens 20MV100-SXM3-32GB
1.10.0a0BERT-LARGE53 sequences/sec1x V100DGX-221.09-py3Mixed10SQuaD v1.1V100-SXM3-32GB
1.9.0a0Transformer-XL Base44,072 total tokens/sec1x V100DGX-221.05-py3Mixed32WikiText-103V100-SXM3-32GB
1.9.0a0Transformer-XL Large15,360 total tokens/sec1x V100DGX-221.05-py3Mixed8WikiText-103V100-SXM3-32GB
TensorFlow1.15.5ResNet-50 v1.51,369 images/sec1x V100DGX-221.09-py3Mixed256ImageNet2012V100-SXM3-32GB
1.15.5ResNext101617 images/sec1x V100DGX-221.09-py3Mixed128Imagenet2012V100-SXM3-32GB
1.15.5SE-ResNext101516 images/sec1x V100DGX-221.09-py3Mixed96Imagenet2012V100-SXM3-32GB
1.15.5U-Net Industrial115 images/sec1x V100DGX-221.09-py3Mixed16DAGM2007V100-SXM3-32GB
2.6.0U-Net Medical67 images/sec1x V100DGX-221.09-py3Mixed8EM segmentation challengeV100-SXM3-32GB
1.15.5VAE-CF221,975 users processed/sec1x V100DGX-221.09-py3Mixed24576MovieLens 20MV100-SXM3-32GB
2.6.0Wide and Deep968,852 samples/sec1x V100DGX-221.09-py3Mixed131072Kaggle Outbrain Click PredictionV100-SXM3-32GB
1.15.5BERT-LARGE48 sequences/sec1x V100DGX-221.09-py3Mixed10SQuaD v1.1V100-SXM3-32GB
2.6.0Electra Base Fine Tuning195 sequences/sec1x V100DGX-221.09-py3Mixed32SQuaD v1.1V100-SXM3-32GB
2.4.0Mask R-CNN22 samples/sec1x V100DGX-221.05-py3Mixed4COCO 2014V100-SXM3-32GB
1.15.5SSD222 images/sec1x V100DGX-221.06-py3Mixed32COCO 2017V100-SXM3-32GB
1.15.5V-Net Medical1,070 images/sec1x V100DGX-221.09-py3Mixed32Hippocampus head and body from Medical Segmentation DecathlonV100-SXM3-32GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled


Single GPU Training Performance of NVIDIA GPU on Cloud

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Training Performance on Cloud

FrameworkFramework VersionNetworkThroughputGPUServerContainerPrecisionBatch SizeDatasetGPU Version
MXNet1.9.0ResNet-50 v1.52,724 images/sec1x A100GCP A2-HIGHGPU-1G21.09-py3Mixed192ImageNet2012A100-SXM4-40GB

BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled

T4 Training Performance on Cloud

FrameworkFramework VersionNetworkThroughputGPUServerContainerPrecisionBatch SizeDatasetGPU Version
MXNet1.9.0ResNet-50 v1.5425 images/sec1x T4AWS EC2 g4dn.4xlarge21.06-py3Mixed192ImageNet2012NVIDIA T4
1.9.0ResNet-50 v1.5389 images/sec1x T4GCP N1-HIGHMEM-821.09-py3Mixed192ImageNet2012NVIDIA T4
PyTorch1.9.0a0BERT-LARGE16 sequences/sec1x T4AWS EC2 g4dn.4xlarge21.06-py3Mixed10SQuaD v1.1NVIDIA T4

BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled

V100 Training Performance on Cloud

FrameworkFramework VersionNetworkThroughputGPUServerContainerPrecisionBatch SizeDatasetGPU Version
MXNet1.9.0ResNet-50 v1.51,397 images/sec1x V100GCP N1-HIGHMEM-821.09-py3Mixed192ImageNet2012V100-SXM2-16GB

BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled

AI Inference

Real-world inference demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into production with the highest performance from data center to edge.

Related Resources

Power high-throughput, low-latency inference with NVIDIA’s complete solution stack:


MLPerf Inference v1.1 Performance Benchmarks

Offline Scenario - Closed Division

NetworkThroughputGPUServerGPU VersionDatasetTarget Accuracy
ResNet-50 v1.5313,516 samples/sec8x A100DGX A100A100 SXM-80GBImageNet76.46% Top1
283,469 samples/sec8x (7x1g.10gb A100)DGX A100A100 SXM-80GB
145,742 samples/sec4x A100Gigabyte G242-P31A100-PCIe-80GB
149,178 samples/sec8x A30Gigabyte G482-Z54A30
150,315 samples/sec8x (4x1g.6gb A30)Gigabyte G482-Z54A30
110,197 samples/sec8x A10Supermicro 4029GP-TRT-OTO-28A10
SSD ResNet-347,851 samples/sec8x A100DGX A100A100 SXM-80GBCOCO0.2 mAP
7,316 samples/sec8x (7x1g.10gb A100)DGX A100A100 SXM-80GB
3,606 samples/sec4x A100Gigabyte G242-P31A100-PCIe-80GB
3,788 samples/sec8x A30Gigabyte G482-Z54A30
3,727 samples/sec8x (4x1g.6gb A30)Gigabyte G482-Z54A30
2,473 samples/sec8x A10Supermicro 4029GP-TRT-OTO-28A10
3D-UNet487 samples/sec8x A100DGX A100A100 SXM-80GBBraTS 20190.853 DICE mean
421 samples/sec8x (7x1g.10gb A100)DGX A100A100 SXM-80GB
227 samples/sec4x A100Gigabyte G242-P31A100-PCIe-80GB
241 samples/sec8x A30Gigabyte G482-Z54A30
225 samples/sec8x (4x1g.6gb A30)Gigabyte G482-Z54A30
173 samples/sec8x A10Supermicro 4029GP-TRT-OTO-28A10
RNN-T106,918 samples/sec8x A100DGX A100A100 SXM-80GBLibriSpeech7.45% WER
50,561 samples/sec8x A100Gigabyte G242-P31A100-PCIe-80GB
52,596 samples/sec8x A30Gigabyte G482-Z54A30
36,461 samples/sec8x A10Supermicro 4029GP-TRT-OTO-28A10
BERT28,302 samples/sec8x A100DGX A100A100 SXM-80GBSQuAD v1.190.07% f1
25,677 samples/sec8x (7x1g.10gb A100)DGX A100A100 SXM-80GB
12,606 samples/sec4x A100Gigabyte G242-P31A100-PCIe-80GB
13,385 samples/sec8x A30Gigabyte G482-Z54A30
12,867 samples/sec8x (4x1g.6gb A30)Gigabyte G482-Z54A30
8,757 samples/sec8x A10Supermicro 4029GP-TRT-OTO-28A10
DLRM2,421,440 samples/sec8x A100DGX A100A100 SXM-80GBCriteo 1TB Click Logs80.25% AUC
1,097,730 samples/sec4x A100Gigabyte G242-P31A100-PCIe-80GB
1,083,600 samples/sec8x A30Gigabyte G482-Z54A30
772,521 samples/sec8x A10Supermicro 4029GP-TRT-OTO-28A10

Server Scenario - Closed Division

NetworkThroughputGPUServerGPU VersionTarget AccuracyMLPerf Server Latency Constraints (ms)Dataset
ResNet-50 v1.5260,042 queries/sec8x A100DGX A100A100 SXM-80GB76.46% Top115ImageNet
70,007 queries/sec8x (7x1g.10gb A100)DGX A100A100 SXM-80GB
104,012 queries/sec4x A100Gigabyte G242-P31A100-PCIe-80GB
116,014 queries/sec8x A30Gigabyte G482-Z54A30
65,004 queries/sec8x (4x1g.6gb A30)Gigabyte G482-Z54A30
88,014 queries/sec8x A10Supermicro 4029GP-TRT-OTO-28A10
SSD ResNet-347,581 queries/sec8x A100DGX A100A100 SXM-80GB0.2 mAP100COCO
5,802 queries/sec8x (7x1g.10gb A100)DGX A100A100 SXM-80GB
3,083 queries/sec4x A100Gigabyte G242-P31A100-PCIe-80GB
3,575 queries/sec8x A30Gigabyte G482-Z54A30
3,002 queries/sec8x (4x1g.6gb A30)Gigabyte G482-Z54A30
2,000 queries/sec8x A10Supermicro 4029GP-TRT-OTO-28A10
RNN-T104,012 queries/sec8x A100DGX A100A100 SXM-80GB7.45% WER1,000LibriSpeech
43,005 queries/sec4x A100Gigabyte G242-P31A100-PCIe-80GB
36,999 queries/sec8x A30Gigabyte G482-Z54A30
22,600 queries/sec8x A10Supermicro 4029GP-TRT-OTO-28A10
BERT25,795 queries/sec8x A100DGX A100A100 SXM-80GB90.07% f1130SQuAD v1.1
20,497 queries/sec8x (7x1g.10gb A100)DGX A100A100 SXM-80GB
10,402 queries/sec4x A100Gigabyte G242-P31A100-PCIe-80GB
11,501 queries/sec8x A30Gigabyte G482-Z54A30
8,301 queries/sec8x (4x1g.6gb A30)Gigabyte G482-Z54A30
7,202 queries/sec8x A10Supermicro 4029GP-TRT-OTO-28A10
DLRM2,302,660 queries/sec8x A100DGX A100A100 SXM-80GB80.25% AUC30Criteo 1TB Click Logs
600,198 queries/sec4x A100Gigabyte G242-P31A100-PCIe-80GB
1,000,530 queries/sec8x A30Gigabyte G482-Z54A30
680,257 queries/sec8x A10Supermicro 4029GP-TRT-OTO-28A10

Power Efficiency Offline Scenario - Closed Division

NetworkThroughputThroughput per WattGPUServerGPU VersionDataset
ResNet-50 v1.5244,537 samples/sec83 samples/sec/watt8x A100DGX A100A100 SXM-80GBImageNet
125,232 samples/sec110.9 samples/sec/watt4x A100DGX-Station-A100A100 SXM-80GB
211,436 samples/sec112.03 samples/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-40GB
SSD ResNet-346,482 samples/sec2.04 samples/sec/watt8x A100DGX A100A100 SXM-80GBCOCO
3,295 samples/sec2.65 samples/sec/watt4x A100DGX-Station-A100A100 SXM-80GB
5,866 samples/sec2.71 samples/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-40GB
3D-UNet399 samples/sec0.13 samples/sec/watt8x A100DGX A100A100 SXM-80GBBraTS 2019
203 samples/sec0.18 samples/sec/watt4x A100DGX-Station-A100A100 SXM-80GB
345 samples/sec0.18 samples/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-40GB
RNN-T90,243 samples/sec27.73 samples/sec/watt8x A100DGX A100A100 SXM-80GBLibriSpeech
44,495 samples/sec37.7 samples/sec/watt4x A100DGX-Station-A100A100 SXM-80GB
84,727 samples/sec38.44 samples/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-40GB
BERT24,667 samples/sec6.95 samples/sec/watt8x A100DGX A100A100 SXM-80GBSQuAD v1.1
10,573 samples/sec8.5 samples/sec/watt4x A100DGX-Station-A100A100 SXM-80GB
20,401 samples/sec8.19 samples/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-40GB
DLRM2,091,060 samples/sec629.03 samples/sec/watt8x A100DGX A100A100 SXM-80GBCriteo 1TB Click Logs
987,260 samples/sec786.67 samples/sec/watt4x A100DGX-Station-A100A100 SXM-80GB
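The power-efficiency rows above report throughput and throughput per watt, so the implied average system power is simply their ratio. A minimal sketch, using figures copied from the ResNet-50 v1.5 offline row (the helper name is illustrative, not an NVIDIA tool):

```python
# Derive implied average system power from an offline power-efficiency row:
# power (W) = throughput / (throughput per watt).

def implied_power_watts(throughput: float, per_watt: float) -> float:
    """Average power draw implied by throughput divided by efficiency."""
    return throughput / per_watt

# ResNet-50 v1.5 offline, 8x A100 DGX A100:
# 244,537 samples/sec at 83 samples/sec/watt
total_w = implied_power_watts(244_537, 83)
print(f"system: {total_w:,.0f} W, per GPU (8x): {total_w / 8:,.0f} W")
```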

Power Efficiency Server Scenario - Closed Division

NetworkThroughputThroughput per WattGPUServerGPU VersionDataset
ResNet-50 v1.5232,036 queries/sec79.14 queries/sec/watt8x A100DGX A100A100 SXM-80GBImageNet
107,013 queries/sec94.74 queries/sec/watt4x A100DGX-Station-A100A100 SXM-80GB
185,034 queries/sec87.76 queries/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-40GB
SSD ResNet-346,301 queries/sec1.99 queries/sec/watt8x A100DGX A100A100 SXM-80GBCOCO
3,083 queries/sec2.49 queries/sec/watt4x A100DGX-Station-A100A100 SXM-80GB
5,703 queries/sec2.62 queries/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-40GB
RNN-T88,014 queries/sec25.46 queries/sec/watt8x A100DGX A100A100 SXM-80GBLibriSpeech
43,406 queries/sec33.55 queries/sec/watt4x A100DGX-Station-A100A100 SXM-80GB
75,012 queries/sec33.11 queries/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-40GB
BERT21,497 queries/sec6.22 queries/sec/watt8x A100DGX A100A100 SXM-80GBSQuAD v1.1
10,203 queries/sec8.01 queries/sec/watt4x A100DGX-Station-A100A100 SXM-80GB
17,496 queries/sec7.99 queries/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-40GB
DLRM2,002,040 queries/sec591.77 queries/sec/watt8x A100DGX A100A100 SXM-80GBCriteo 1TB Click Logs
890,424 queries/sec672.18 queries/sec/watt4x A100DGX-Station-A100A100 SXM-80GB

MLPerf™ v1.1 Inference Closed: ResNet-50 v1.5, SSD ResNet-34, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net, DLRM 99% of FP32 accuracy target: 1.1-033, 1.1-037, 1.1-039, 1.1-042, 1.1-043, 1.1-046, 1.1-047, 1.1-048, 1.1-051. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
BERT-Large sequence length = 384.
DLRM throughput is measured in samples; each sample contains 270 pairs on average
4x1g.6gb and 7x1g.10gb denote MIG configurations: the workload runs on 4 or 7 single-GPC slices, each with 6 GB or 10 GB of memory, on a single A30 or A100 respectively.
For data on the various MLPerf™ scenarios, click here
For MLPerf™ latency constraints, click here
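The MIG shorthand used in the tables above (e.g. "4x1g.6gb") encodes a slice count, GPCs per slice, and memory per slice. As a minimal sketch, the following parser splits that notation into its parts; the function and field names are illustrative, not an official NVIDIA format specification.

```python
import re

# Parse MIG shorthand such as "4x1g.6gb": 4 MIG slices, each with
# 1 GPU compute slice (GPC) and 6 GB of memory.
MIG_RE = re.compile(r"(?P<count>\d+)x(?P<gpcs>\d+)g\.(?P<mem_gb>\d+)gb")

def parse_mig(notation: str) -> dict:
    """Return the slice count, GPCs per slice, and memory per slice (GB)."""
    m = MIG_RE.fullmatch(notation)
    if m is None:
        raise ValueError(f"not a MIG notation: {notation!r}")
    return {k: int(v) for k, v in m.groupdict().items()}

print(parse_mig("4x1g.6gb"))   # A30 split into 4 slices of 6 GB each
print(parse_mig("7x1g.10gb"))  # A100 80GB split into 7 slices of 10 GB each
```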

NVIDIA Triton Inference Server Delivered Comparable Performance to Custom Harness in MLPerf v1.1


NVIDIA landed top performance spots on all MLPerf™ Inference 1.1 tests, the AI industry's leading benchmark competition. For inference submissions, we have typically used a custom A100 inference serving harness, designed and optimized specifically to deliver the highest possible inference performance for MLPerf™ workloads, which require running inference on bare metal.

MLPerf™ v1.1 A100 Inference Closed: ResNet-50 v1.5, SSD ResNet-34, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net, DLRM 99% of FP32 accuracy target: 1.1-047, 1.1-049. MLPerf name and logo are trademarks. See www.mlcommons.org for more information.​

The chart compares the performance of Triton to the custom MLPerf™ serving harness across five different TensorRT networks on A100 SXM-80GB on bare metal. The results show that Triton is highly efficient, delivering performance nearly identical to that of the highly optimized MLPerf™ harness.


NVIDIA Client Batch Size 1 Performance with Triton Inference Server

A100 Triton Inference Server Performance

NetworkAcceleratorTraining FrameworkFramework BackendPrecisionModel Instances on TritonClient Batch SizeDynamic Batch Size (Triton)Number of Concurrent Client RequestsLatency (ms)ThroughputSequence/Input LengthTriton Container Version
ResNet-50 V1.5 InferenceA100-SXM4-40GBPyTorchTensorRTTF32216425648.355,294 inf/sec-21.03-py3
ResNet-50 V1.5 InferenceA100-PCIE-40GBPyTorchTensorRTMixed216425661.024,197 inf/sec-20.07-py3
BERT Large InferenceA100-SXM4-40GBTensorFlowTensorRTINT82186456.341,136 inf/sec38420.09-py3
BERT Large InferenceA100-PCIE-40GBTensorFlowTensorRTMixed1181617.48915 inf/sec38420.09-py3
DLRM InferenceA100-SXM4-40GBPyTorchTorchscriptMixed2165,536302.7111,076 inf/sec-21.03-py3
DLRM InferenceA100-PCIE-40GBPyTorchTorchscriptMixed2165,536242.529,521 inf/sec-21.05-py3
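The Triton rows above can be sanity-checked with Little's law: in a closed-loop benchmark with client batch size 1, the number of in-flight requests is approximately throughput × latency. A small sketch using figures copied from the A100 table (small rounding drift is expected):

```python
# Little's-law sanity check on closed-loop serving benchmarks:
# concurrent requests ≈ throughput (inf/sec) × latency (sec).

def inflight(throughput_inf_per_sec: float, latency_ms: float) -> float:
    """Approximate number of requests in flight under Little's law."""
    return throughput_inf_per_sec * latency_ms / 1000.0

# ResNet-50 v1.5 on A100-SXM4-40GB: 5,294 inf/sec at 48.35 ms,
# measured with 256 concurrent client requests
print(round(inflight(5294, 48.35)))   # ~256
# BERT Large on A100-SXM4-40GB: 1,136 inf/sec at 56.34 ms,
# measured with 64 concurrent client requests
print(round(inflight(1136, 56.34)))   # ~64
```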

T4 Triton Inference Server Performance

NetworkAcceleratorTraining FrameworkFramework BackendPrecisionModel Instances on TritonClient Batch SizeDynamic Batch Size (Triton)Number of Concurrent Client RequestsLatency (ms)ThroughputSequence/Input LengthTriton Container Version
ResNet-50 V1.5 InferenceNVIDIA T4PyTorchTensorRTMixed1164256257.91992 inf/sec-20.07-py3
BERT Large InferenceNVIDIA T4TensorFlowTensorRTMixed1181681.14197 inf/sec38420.09-py3

V100 Triton Inference Server Performance

NetworkAcceleratorTraining FrameworkFramework BackendPrecisionModel Instances on TritonClient Batch SizeDynamic Batch Size (Triton)Number of Concurrent Client RequestsLatency (ms)ThroughputSequence/Input LengthTriton Container Version
ResNet-50 V1.5 InferenceV100 SXM2-32GBPyTorchTensorRTFP324164384215.791,781 inf/sec-21.03-py3
DLRM InferenceV100-SXM2-32GBPyTorchTorchscriptMixed2165,536263.677,083 inf/sec-21.06-py3

Inference Performance of NVIDIA A100, A40, A30, A10, A2, T4 and V100

Benchmarks are reproducible by following links to the NGC catalog scripts

Inference Natural Language Processing

BERT Inference Throughput

DGX A100 server w/ 1x NVIDIA A100 with 7 MIG instances of 1g.5gb | Batch Size = 94 | Precision: INT8 | Sequence Length = 128
DGX-1 server w/ 1x NVIDIA V100 | TensorRT 7.1 | Batch Size = 256 | Precision: Mixed | Sequence Length = 128


NVIDIA A100 BERT Inference Benchmarks

NetworkNetwork TypeBatch SizeThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
BERT-Large with SparsityAttention946,188 sequences/sec--1x A100DGX A100-INT8SQuaD v1.1-A100 SXM4-40GB

A100 with 7 MIG instances of 1g.5gb | Sequence length=128 | Efficiency based on board power
Containers with a hyphen indicate pre-release containers
Starting from 21.09-py3, ECC is enabled

Inference Image Classification on CNNs with TensorRT

ResNet-50 v1.5 Throughput

DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: Mixed | Dataset: Synthetic


ResNet-50 v1.5 Power Efficiency

DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: Mixed | Dataset: Synthetic


A100 Full Chip Inference Performance

NetworkBatch SizeFull Chip ThroughputLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-501For Batch Size 1, please refer to Triton Inference Server tab
2For Batch Size 2, please refer to Triton Inference Server tab
811,468 images/sec0.71x A100DGX A10021.09-py3INT8SyntheticTensorRT 8.0.3A100 SXM-80GB
12830,671 images/sec4.171x A100DGX A10021.09-py3INT8SyntheticTensorRT 8.0.3A100 SXM-80GB
22332,204 images/sec6.921x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0.1A100 SXM-80GB
ResNet-50v1.51For Batch Size 1, please refer to Triton Inference Server tab
2For Batch Size 2, please refer to Triton Inference Server tab
811,203 images/sec0.711x A100DGX A10021.09-py3INT8SyntheticTensorRT 8.0.3A100-SXM4-40GB
12829,855 images/sec4.291x A100DGX A10021.09-py3INT8SyntheticTensorRT 8.0.3A100-SXM-80GB
21431,042 images/sec6.891x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0.1A100 SXM-80GB
ResNext101327,674 samples/sec4.171x A100--INT8SyntheticTensorRT 7.2A100-SXM4-40GB
EfficientNet-B012822,346 images/sec5.731x A100--INT8SyntheticTensorRT 7.2A100-SXM4-40GB
BERT-BASE1For Batch Size 1, please refer to Triton Inference Server tab
2For Batch Size 2, please refer to Triton Inference Server tab
86,895 sequences/sec1.161x A100DGX A10021.06-py3INT8Sample TextTensorRT 7.2A100-SXM-80GB
12813,554 sequences/sec9.441x A100DGX A10021.06-py3INT8Sample TextTensorRT 7.2A100-SXM4-40GB
BERT-LARGE1For Batch Size 1, please refer to Triton Inference Server tab
2For Batch Size 2, please refer to Triton Inference Server tab
82,333 sequences/sec3.431x A100DGX A10021.06-py3INT8Sample TextTensorRT 7.2A100-SXM-80GB
1284,485 sequences/sec28.541x A100DGX A10021.06-py3INT8Sample TextTensorRT 7.2A100-SXM4-40GB

Containers with a hyphen indicate pre-release containers | Servers with a hyphen indicate pre-production servers
BERT-Large: Sequence Length = 128
For BS=1 inference refer to the Triton Inference Server section
Starting from 21.09-py3, ECC is enabled
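For the fixed-batch TensorRT rows above, per-batch latency follows directly from batch size and throughput: latency ≈ batch / throughput. A minimal sketch, using figures copied from the A100 full-chip ResNet-50 row (the helper name is illustrative):

```python
# Per-batch latency implied by batch size and throughput:
# latency_ms = batch / throughput × 1000.

def batch_latency_ms(batch: int, images_per_sec: float) -> float:
    """Time to process one batch, in milliseconds."""
    return batch / images_per_sec * 1000.0

# ResNet-50, BS 128 on A100 SXM-80GB: 30,671 images/sec, reported at 4.17 ms
print(round(batch_latency_ms(128, 30_671), 2))   # ~4.17
```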

A100 1/7 MIG Inference Performance

NetworkBatch Size1/7 MIG ThroughputLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-501For Batch Size 1, please refer to Triton Inference Server tab
2For Batch Size 2, please refer to Triton Inference Server tab
83,725 images/sec2.151x A100DGX A10021.09-py3INT8SyntheticTensorRT 8.0.3A100 SXM-80GB
294,277 images/sec6.781x A100DGX A10021.09-py3INT8SyntheticTensorRT 8.0.3A100 SXM-80GB
1284,642 images/sec27.581x A100DGX A10021.09-py3INT8SyntheticTensorRT 8.0.3A100 SXM-80GB
ResNet-50v1.51For Batch Size 1, please refer to Triton Inference Server tab
2For Batch Size 2, please refer to Triton Inference Server tab
83,623 images/sec2.211x A100DGX A10021.09-py3INT8SyntheticTensorRT 8.0.3A100 SXM-80GB
284,107 images/sec6.821x A100DGX A10021.09-py3INT8SyntheticTensorRT 8.0.3A100 SXM-80GB
1284,501 images/sec28.441x A100DGX A10021.09-py3INT8SyntheticTensorRT 8.0.3A100 SXM-80GB
BERT-BASE1For Batch Size 1, please refer to Triton Inference Server tab
2For Batch Size 2, please refer to Triton Inference Server tab
81,676 sequences/sec4.771x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
1282,151 sequences/sec59.521x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
BERT-LARGE1For Batch Size 1, please refer to Triton Inference Server tab
2For Batch Size 2, please refer to Triton Inference Server tab
8553 sequences/sec14.481x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
128671 sequences/sec190.631x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB

Containers with a hyphen indicate pre-release containers | Servers with a hyphen indicate pre-production servers
BERT-Large: Sequence Length = 128
Starting from 21.09-py3, ECC is enabled

A100 7 MIG Inference Performance

NetworkBatch Size7 MIG ThroughputLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-501For Batch Size 1, please refer to Triton Inference Server tab
2For Batch Size 2, please refer to Triton Inference Server tab
826,033 images/sec2.151x A100DGX A10021.09-py3INT8SyntheticTensorRT 8.0.3A100 SXM-80GB
2929,905 images/sec6.791x A100DGX A10021.09-py3INT8SyntheticTensorRT 8.0.3A100 SXM-80GB
12832,470 images/sec27.591x A100DGX A10021.09-py3INT8SyntheticTensorRT 8.0.3A100 SXM-80GB
ResNet-50v1.51For Batch Size 1, please refer to Triton Inference Server tab
2For Batch Size 2, please refer to Triton Inference Server tab
825,299 images/sec2.211x A100DGX A10021.09-py3INT8SyntheticTensorRT 8.0.3A100 SXM-80GB
2828,868 images/sec6.791x A100DGX A10021.09-py3INT8SyntheticTensorRT 8.0.3A100 SXM-80GB
12831,510 images/sec28.441x A100DGX A10021.09-py3INT8SyntheticTensorRT 8.0.3A100 SXM-80GB
BERT-BASE1For Batch Size 1, please refer to Triton Inference Server tab
2For Batch Size 2, please refer to Triton Inference Server tab
811,771 sequences/sec4.761x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
12815,052 sequences/sec59.531x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
BERT-LARGE1For Batch Size 1, please refer to Triton Inference Server tab
2For Batch Size 2, please refer to Triton Inference Server tab
83,865 sequences/sec14.491x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
1284,702 sequences/sec190.541x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB

Containers with a hyphen indicate pre-release containers | Servers with a hyphen indicate pre-production servers
BERT-Large: Sequence Length = 128
Starting from 21.09-py3, ECC is enabled
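Comparing the 1/7-MIG and 7-MIG tables above shows that aggregate throughput scales nearly linearly with the number of MIG slices. A short sketch using the ResNet-50 BS 128 figures copied from both tables:

```python
# MIG scaling check: 7 MIG slices should deliver close to 7x the
# throughput of a single 1/7 slice. Figures from the ResNet-50 BS 128 rows.

one_slice = 4_642      # images/sec, 1/7 MIG (A100 SXM-80GB)
seven_slices = 32_470  # images/sec, 7 MIG (A100 SXM-80GB)

scaling = seven_slices / one_slice
print(f"7-slice scaling factor: {scaling:.2f}x (ideal 7.00x)")
```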


A40 Inference Performance

NetworkBatch SizeThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-501For Batch Size 1, please refer to Triton Inference Server tab
2For Batch Size 2, please refer to Triton Inference Server tab
89,637 images/sec40 images/sec/watt0.831x A40GIGABYTE G482-Z52-0021.09-py3INT8SyntheticTensorRT 8.0.3A40
11616,775 images/sec-6.921x A40GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0.1A40
12815,943 images/sec53 images/sec/watt8.031x A40GIGABYTE G482-Z52-0021.09-py3INT8SyntheticTensorRT 8.0.3A40
ResNet-50v1.51For Batch Size 1, please refer to Triton Inference Server tab
2For Batch Size 2, please refer to Triton Inference Server tab
89,447 images/sec39 images/sec/watt0.851x A40GIGABYTE G482-Z52-0021.09-py3INT8SyntheticTensorRT 8.0.3A40
10916,537 images/sec-6.591x A40GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0.1A40
12815,174 images/sec51 images/sec/watt8.441x A40GIGABYTE G482-Z52-0021.09-py3INT8SyntheticTensorRT 8.0.3A40
BERT-BASE1For Batch Size 1, please refer to Triton Inference Server tab
2For Batch Size 2, please refer to Triton Inference Server tab
84,494 sequences/sec19 sequences/sec/watt1.781x A40GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A40
1287,180 sequences/sec27 sequences/sec/watt17.831x A40GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A40
BERT-LARGE1For Batch Size 1, please refer to Triton Inference Server tab
2For Batch Size 2, please refer to Triton Inference Server tab
81,666 sequences/sec6 sequences/sec/watt4.81x A40GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A40
1282,216 sequences/sec9 sequences/sec/watt57.761x A40GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A40

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate pre-release containers
Starting from 21.09-py3, ECC is enabled
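The Throughput, Latency, and Efficiency columns are related: for the fixed-batch rows, throughput is approximately batch size divided by latency (the max-throughput rows run batches concurrently, so the identity is only approximate there), and efficiency is throughput divided by board power. A minimal sketch of that arithmetic; the 300 W figure is an assumption based on the A40's rated board power, not a value from these measurements:

```python
def throughput(batch_size: int, latency_ms: float) -> float:
    """Inferences per second when one batch of this size completes per latency window."""
    return batch_size / (latency_ms / 1000.0)

def efficiency(inferences_per_sec: float, board_power_w: float) -> float:
    """Inferences per second per watt of board power."""
    return inferences_per_sec / board_power_w

# Worked check against the A40 ResNet-50 row (BS=128, 8.03 ms, 15,943 images/sec):
ips = throughput(128, 8.03)   # ~15,940 images/sec, close to the reported 15,943
eff = efficiency(ips, 300.0)  # ~53 images/sec/watt, assuming ~300 W board power
print(round(ips), round(eff))
```

The small gap between the computed and reported throughput reflects run-to-run variance and pipelining; the published efficiency figures use measured board power rather than the rated maximum assumed here.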

 

A30 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server tab
ResNet-50 | 2 | For Batch Size 2, please refer to Triton Inference Server tab
ResNet-50 | 8 | 8,497 images/sec | 70 images/sec/watt | 0.94 | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A30
ResNet-50 | 109 | 16,083 images/sec | - | 6.78 | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | A30
ResNet-50 | 128 | 15,385 images/sec | 94 images/sec/watt | 8.32 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30
ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server tab
ResNet-50v1.5 | 2 | For Batch Size 2, please refer to Triton Inference Server tab
ResNet-50v1.5 | 8 | 8,330 images/sec | 68 images/sec/watt | 0.96 | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A30
ResNet-50v1.5 | 106 | 15,495 images/sec | - | 6.84 | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | A30
ResNet-50v1.5 | 128 | 15,411 images/sec | 93 images/sec/watt | 8.31 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30
BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server tab
BERT-BASE | 2 | For Batch Size 2, please refer to Triton Inference Server tab
BERT-BASE | 8 | 4,417 sequences/sec | 33 sequences/sec/watt | 1.81 | 1x A30 | GIGABYTE G482-Z52-00 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A30
BERT-BASE | 128 | 6,815 sequences/sec | 50 sequences/sec/watt | 18.78 | 1x A30 | GIGABYTE G482-Z52-00 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A30
BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server tab
BERT-LARGE | 2 | For Batch Size 2, please refer to Triton Inference Server tab
BERT-LARGE | 8 | 1,492 sequences/sec | 11 sequences/sec/watt | 5.36 | 1x A30 | GIGABYTE G482-Z52-00 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A30
BERT-LARGE | 128 | 2,207 sequences/sec | 15 sequences/sec/watt | 58 | 1x A30 | GIGABYTE G482-Z52-00 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A30

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
Starting from 21.09-py3, ECC is enabled

 

A30 1/4 MIG Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server tab
ResNet-50 | 2 | For Batch Size 2, please refer to Triton Inference Server tab
ResNet-50 | 8 | 3,636 images/sec | 45 images/sec/watt | 2.2 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30
ResNet-50 | 28 | 4,212 images/sec | - | 6.65 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30
ResNet-50 | 128 | 4,593 images/sec | 51 images/sec/watt | 27.87 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30
ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server tab
ResNet-50v1.5 | 2 | For Batch Size 2, please refer to Triton Inference Server tab
ResNet-50v1.5 | 8 | 3,551 images/sec | 43 images/sec/watt | 2.25 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30
ResNet-50v1.5 | 28 | 4,090 images/sec | - | 6.85 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30
ResNet-50v1.5 | 128 | 4,445 images/sec | 49 images/sec/watt | 28.8 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30

Efficiency based on board power
Containers with a hyphen indicate a pre-release container
Starting from 21.09-py3, ECC is enabled

 

A30 4 MIG Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server tab
ResNet-50 | 2 | For Batch Size 2, please refer to Triton Inference Server tab
ResNet-50 | 8 | 14,553 images/sec | 43 images/sec/watt | 2.2 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30
ResNet-50 | 29 | 16,975 images/sec | - | 6.83 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30
ResNet-50 | 128 | 18,316 images/sec | 51 images/sec/watt | 27.95 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30
ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server tab
ResNet-50v1.5 | 2 | For Batch Size 2, please refer to Triton Inference Server tab
ResNet-50v1.5 | 8 | 14,178 images/sec | 42 images/sec/watt | 2.26 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30
ResNet-50v1.5 | 28 | 16,376 images/sec | - | 6.84 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30
ResNet-50v1.5 | 128 | 17,750 images/sec | 50 images/sec/watt | 28.85 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30

Efficiency based on board power
Containers with a hyphen indicate a pre-release container
Starting from 21.09-py3, ECC is enabled

 

A10 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server tab
ResNet-50 | 2 | For Batch Size 2, please refer to Triton Inference Server tab
ResNet-50 | 8 | 7,954 images/sec | 53 images/sec/watt | 1.01 | 1x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A10
ResNet-50 | 75 | 11,769 images/sec | - | 6.8 | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | A10
ResNet-50 | 128 | 11,264 images/sec | 75 images/sec/watt | 11.36 | 1x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A10
ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server tab
ResNet-50v1.5 | 2 | For Batch Size 2, please refer to Triton Inference Server tab
ResNet-50v1.5 | 8 | 7,683 images/sec | 51 images/sec/watt | 1.04 | 1x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A10
ResNet-50v1.5 | 75 | 11,044 images/sec | - | 6.79 | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | A10
ResNet-50v1.5 | 128 | 10,676 images/sec | 71 images/sec/watt | 11.99 | 1x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A10
BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server tab
BERT-BASE | 2 | For Batch Size 2, please refer to Triton Inference Server tab
BERT-BASE | 8 | 3,598 sequences/sec | 27 sequences/sec/watt | 2.22 | 1x A10 | GIGABYTE G482-Z52-00 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A10
BERT-BASE | 128 | 4,766 sequences/sec | 35 sequences/sec/watt | 26.86 | 1x A10 | GIGABYTE G482-Z52-00 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A10
BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server tab
BERT-LARGE | 2 | For Batch Size 2, please refer to Triton Inference Server tab
BERT-LARGE | 8 | 1,257 sequences/sec | 10 sequences/sec/watt | 6.36 | 1x A10 | GIGABYTE G482-Z52-00 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A10
BERT-LARGE | 128 | 1,462 sequences/sec | 11 sequences/sec/watt | 88 | 1x A10 | GIGABYTE G482-Z52-00 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A10

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
Starting from 21.09-py3, ECC is enabled

 

A2 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 v1.5 | 1 | 1,377 images/sec | 31 images/sec/watt | 0.73 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | A2
ResNet-50 v1.5 | 2 | 1,823 images/sec | 37 images/sec/watt | 1.10 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | A2
ResNet-50 v1.5 | 8 | 2,475 images/sec | 43 images/sec/watt | 3.23 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | A2
BERT Base | 1 | 619 sequences/sec | 13 sequences/sec/watt | 1.62 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | A2
BERT Base | 2 | 789 sequences/sec | 15 sequences/sec/watt | 2.53 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | A2
BERT Base | 8 | 1,018 sequences/sec | 19 sequences/sec/watt | 7.86 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | A2
BERT Large | 1 | 241 sequences/sec | 4 sequences/sec/watt | 4.14 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | A2
BERT Large | 2 | 261 sequences/sec | 5 sequences/sec/watt | 7.66 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | A2
BERT Large | 8 | 317 sequences/sec | 6 sequences/sec/watt | 25.23 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | A2
EfficientDet-D0 | 1 | 280 images/sec | - | 3.57 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | INT8 | Synthetic | TensorRT 8.2.06 | A2
QuartzNet | 1 | 323 images/sec | - | 3.10 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | INT8 | Synthetic | TensorRT 8.2.06 | A2

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power

 

T4 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server tab
ResNet-50 | 2 | For Batch Size 2, please refer to Triton Inference Server tab
ResNet-50 | 8 | 4,008 images/sec | 56 images/sec/watt | 2.04 | 1x T4 | Supermicro SYS-4029GP-TRT | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | NVIDIA T4
ResNet-50 | 32 | 4,771 images/sec | - | 6.71 | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | NVIDIA T4
ResNet-50 | 128 | 4,879 images/sec | 70 images/sec/watt | 26.23 | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | NVIDIA T4
ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server tab
ResNet-50v1.5 | 2 | For Batch Size 2, please refer to Triton Inference Server tab
ResNet-50v1.5 | 8 | 3,745 images/sec | 54 images/sec/watt | 2.14 | 1x T4 | Supermicro SYS-4029GP-TRT | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | NVIDIA T4
ResNet-50v1.5 | 29 | 4,501 images/sec | - | 6.44 | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0 | NVIDIA T4
ResNet-50v1.5 | 128 | 4,596 images/sec | 66 images/sec/watt | 27.85 | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | NVIDIA T4
BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server tab
BERT-BASE | 2 | For Batch Size 2, please refer to Triton Inference Server tab
BERT-BASE | 8 | 1,766 sequences/sec | 27 sequences/sec/watt | 4.53 | 1x T4 | Supermicro SYS-4029GP-TRT | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | NVIDIA T4
BERT-BASE | 128 | 1,872 sequences/sec | 28 sequences/sec/watt | 68 | 1x T4 | Supermicro SYS-4029GP-TRT | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | NVIDIA T4
BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server tab
BERT-LARGE | 2 | For Batch Size 2, please refer to Triton Inference Server tab
BERT-LARGE | 8 | 573 sequences/sec | 9 sequences/sec/watt | 13.97 | 1x T4 | Supermicro SYS-4029GP-TRT | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | NVIDIA T4
BERT-LARGE | 128 | 565 sequences/sec | 8 sequences/sec/watt | 227 | 1x T4 | Supermicro SYS-4029GP-TRT | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | NVIDIA T4

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
For BS=1 inference refer to the Triton Inference Server section
Starting from 21.09-py3, ECC is enabled

 

V100 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server tab
ResNet-50 | 2 | For Batch Size 2, please refer to Triton Inference Server tab
ResNet-50 | 8 | 4,362 images/sec | 16 images/sec/watt | 1.83 | 1x V100 | DGX-2 | 21.09-py3 | Mixed | Synthetic | TensorRT 8.0.3 | V100-SXM3-32GB
ResNet-50 | 52 | 7,917 images/sec | - | 6.57 | 1x V100 | DGX-2 | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | V100-SXM3-32GB
ResNet-50 | 128 | 8,165 images/sec | 23 images/sec/watt | 15.68 | 1x V100 | DGX-2 | 21.09-py3 | Mixed | Synthetic | TensorRT 8.0.3 | V100-SXM3-32GB
ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server tab
ResNet-50v1.5 | 2 | For Batch Size 2, please refer to Triton Inference Server tab
ResNet-50v1.5 | 8 | 4,285 images/sec | 15 images/sec/watt | 1.87 | 1x V100 | DGX-2 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | V100-SXM3-32GB
ResNet-50v1.5 | 52 | 7,508 images/sec | - | 6.93 | 1x V100 | DGX-2 | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | V100-SXM3-32GB
ResNet-50v1.5 | 128 | 7,773 images/sec | 22 images/sec/watt | 16.47 | 1x V100 | DGX-2 | 21.09-py3 | Mixed | Synthetic | TensorRT 8.0.3 | V100-SXM3-32GB
BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server tab
BERT-BASE | 2 | For Batch Size 2, please refer to Triton Inference Server tab
BERT-BASE | 8 | 2,201 sequences/sec | 8 sequences/sec/watt | 3.64 | 1x V100 | DGX-2 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | V100-SXM3-32GB
BERT-BASE | 128 | 3,174 sequences/sec | 10 sequences/sec/watt | 40.33 | 1x V100 | DGX-2 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | V100-SXM3-32GB
BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server tab
BERT-LARGE | 2 | For Batch Size 2, please refer to Triton Inference Server tab
BERT-LARGE | 8 | 790 sequences/sec | 3 sequences/sec/watt | 10.12 | 1x V100 | DGX-2 | 21.06-py3 | Mixed | Sample Text | TensorRT 7.2 | V100-SXM3-32GB
BERT-LARGE | 128 | 971 sequences/sec | 3 sequences/sec/watt | 132 | 1x V100 | DGX-2 | 21.06-py3 | Mixed | Sample Text | TensorRT 7.2 | V100-SXM3-32GB

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
For BS=1 inference refer to the Triton Inference Server section
Starting from 21.09-py3, ECC is enabled

Inference Performance of NVIDIA GPU on Cloud

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Inference Performance on Cloud

Network | Batch Size | Throughput | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 10,961 images/sec | 0.73 | 1x A100 | GCP A2-HIGHGPU-1G | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100-SXM4-40GB
ResNet-50v1.5 | 128 | 27,779 images/sec | 4.61 | 1x A100 | GCP A2-HIGHGPU-1G | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100-SXM4-40GB

Starting from 21.09-py3, ECC is enabled

T4 Inference Performance on Cloud

Network | Batch Size | Throughput | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 3,516 images/sec | 2.28 | 1x T4 | GCP N1-HIGHMEM-8 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | NVIDIA T4
ResNet-50v1.5 | 128 | 4,090 images/sec | 31.3 | 1x T4 | GCP N1-HIGHMEM-8 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | NVIDIA T4
ResNet-50v1.5 | 8 | 3,533 images/sec | 2.26 | 1x T4 | AWS EC2 g4dn.4xlarge | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2 | NVIDIA T4
ResNet-50v1.5 | 128 | 4,555 images/sec | 28.1 | 1x T4 | AWS EC2 g4dn.4xlarge | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2 | NVIDIA T4
BERT-LARGE | 8 | 551 sequences/sec | 14.52 | 1x T4 | AWS EC2 g4dn.4xlarge | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | NVIDIA T4
BERT-LARGE | 128 | 540 sequences/sec | 237.25 | 1x T4 | AWS EC2 g4dn.4xlarge | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | NVIDIA T4

BERT-Large: Sequence Length = 128
Starting from 21.09-py3, ECC is enabled

V100 Inference Performance on Cloud

Network | Batch Size | Throughput | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 4,217 images/sec | 1.9 | 1x V100 | GCP N1-HIGHMEM-8 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | V100-SXM2-16GB
ResNet-50v1.5 | 128 | 7,436 images/sec | 17.21 | 1x V100 | GCP N1-HIGHMEM-8 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | V100-SXM2-16GB

Starting from 21.09-py3, ECC is enabled

Conversational AI

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.

Related Resources

Download and get started with NVIDIA Riva.


Riva Benchmarks

Automatic Speech Recognition

A100 Best Streaming Throughput Mode (800 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 14.4 | 1 | A100 SXM4-40GB
Quartznet | 256 | 254.3 | 64 | A100 SXM4-40GB
Quartznet | 512 | 351.2 | 506 | A100 SXM4-40GB
Quartznet | 1,024 | 630.8 | 1,005 | A100 SXM4-40GB
Jasper | 1 | 17.6 | 1 | A100 SXM4-40GB
Jasper | 256 | 244.9 | 254 | A100 SXM4-40GB
Jasper | 512 | 381 | 507 | A100 SXM4-40GB
Jasper | 1,024 | 749.3 | 1,004 | A100 SXM4-40GB

A100 Best Streaming Latency Mode (100 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 9.6 | 1 | A100 SXM4-40GB
Quartznet | 16 | 25.9 | 16 | A100 SXM4-40GB
Quartznet | 128 | 132.4 | 128 | A100 SXM4-40GB
Jasper | 1 | 13.4 | 1 | A100 SXM4-40GB
Jasper | 16 | 26.3 | 16 | A100 SXM4-40GB
Jasper | 128 | 258.9 | 128 | A100 SXM4-40GB

A100 Offline Mode (3200 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 28.1 | 1 | A100 SXM4-40GB
Quartznet | 512 | 566.5 | 505 | A100 SXM4-40GB
Quartznet | 1,024 | 899.3 | 1,000 | A100 SXM4-40GB
Quartznet | 1,512 | 1,303.8 | 1,460 | A100 SXM4-40GB
Jasper | 1 | 31 | 1 | A100 SXM4-40GB
Jasper | 512 | 667.5 | 504 | A100 SXM4-40GB
Jasper | 1,024 | 1,089 | 997 | A100 SXM4-40GB
Jasper | 1,512 | 1,753.8 | 1,449 | A100 SXM4-40GB

V100 Best Streaming Throughput Mode (800 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 14.4 | 1 | V100 SXM2-16GB
Quartznet | 256 | 222.2 | 254 | V100 SXM2-16GB
Quartznet | 512 | 385.2 | 505 | V100 SXM2-16GB
Quartznet | 768 | 574.5 | 752 | V100 SXM2-16GB
Jasper | 1 | 26.8 | 1 | V100 SXM2-16GB
Jasper | 128 | 239.4 | 127 | V100 SXM2-16GB
Jasper | 256 | 416 | 253 | V100 SXM2-16GB
Jasper | 512 | 969.7 | 500 | V100 SXM2-16GB

V100 Best Streaming Latency Mode (100 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 8.8 | 1 | V100 SXM2-16GB
Quartznet | 16 | 22.4 | 16 | V100 SXM2-16GB
Quartznet | 128 | 114.7 | 127 | V100 SXM2-16GB
Jasper | 1 | 21.5 | 1 | V100 SXM2-16GB
Jasper | 16 | 36.9 | 16 | V100 SXM2-16GB
Jasper | 64 | 406.4 | 64 | V100 SXM2-16GB
Jasper | 512 | 969.7 | 500 | V100 SXM2-16GB

V100 Offline Mode (3200 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 32.933 | 1 | V100 SXM2-16GB
Quartznet | 256 | 461.44 | 253 | V100 SXM2-16GB
Quartznet | 512 | 784.73 | 502 | V100 SXM2-16GB
Quartznet | 768 | 1,121.6 | 747 | V100 SXM2-16GB
Quartznet | 1,024 | 1,551.5 | 986 | V100 SXM2-16GB
Jasper | 1 | 48.351 | 1 | V100 SXM2-16GB
Jasper | 256 | 734.99 | 252 | V100 SXM2-16GB
Jasper | 512 | 1,423.3 | 498 | V100 SXM2-16GB
Jasper | 768 | 2,190.2 | 730 | V100 SXM2-16GB

T4 Best Streaming Throughput Mode (800 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 33.183 | 1 | NVIDIA T4
Quartznet | 64 | 162.63 | 64 | NVIDIA T4
Quartznet | 128 | 263.6 | 127 | NVIDIA T4
Quartznet | 256 | 449.28 | 253 | NVIDIA T4
Quartznet | 384 | 732.75 | 376 | NVIDIA T4
Jasper | 1 | 72.377 | 1 | NVIDIA T4
Jasper | 64 | 259.64 | 64 | NVIDIA T4
Jasper | 128 | 450.81 | 127 | NVIDIA T4
Jasper | 256 | 1,200.8 | 249 | NVIDIA T4

T4 Best Streaming Latency Mode (100 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 19.2 | 1 | NVIDIA T4
Quartznet | 16 | 56.4 | 16 | NVIDIA T4
Quartznet | 64 | 242.4 | 64 | NVIDIA T4
Jasper | 1 | 46.9 | 1 | NVIDIA T4
Jasper | 8 | 51.1 | 8 | NVIDIA T4
Jasper | 16 | 84.4 | 16 | NVIDIA T4

T4 Offline Mode (3200 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 157.62 | 1 | NVIDIA T4
Quartznet | 256 | 906.17 | 251 | NVIDIA T4
Quartznet | 512 | 1,515.2 | 495 | NVIDIA T4
Jasper | 1 | 96.201 | 1 | NVIDIA T4
Jasper | 256 | 1,758.4 | 247 | NVIDIA T4

ASR Throughput (RTFX) - Number of seconds of audio processed per second | Audio Chunk Size - Server-side configuration indicating the amount of new data to be considered by the acoustic model | ASR Dataset: Librispeech | The latency numbers were measured using the streaming recognition mode, with the BERT-Base punctuation model enabled, a 4-gram language model, a decoder beam width of 128, and timestamps enabled. The client and the server used audio chunks of the same duration (100 ms, 800 ms, or 3200 ms, depending on the server configuration). The Riva streaming client Riva_streaming_asr_client, provided in the Riva client image, was used with the --simulate_realtime flag to simulate transcription from a microphone; each stream performed 5 iterations over a sample audio file from the Librispeech dataset (1272-135031-0000.wav) | Riva version: v1.0.0-b1 | Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
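RTFX, as defined above, is simply the total seconds of audio processed divided by the wall-clock seconds it took. A minimal sketch with hypothetical numbers; because the streaming tests use --simulate_realtime, each client feeds audio at real-time speed, so aggregate RTFX approaches the number of concurrent streams the server can keep up with:

```python
def rtfx(total_audio_seconds: float, wall_clock_seconds: float) -> float:
    """Aggregate real-time factor: seconds of audio processed per wall-clock second."""
    return total_audio_seconds / wall_clock_seconds

# 256 streams, each feeding real-time audio, running for 60 s of wall time:
print(rtfx(256 * 60.0, 60.0))  # 256.0, near the ~254 RTFX the tables report at 256 streams
```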

Natural Language Processing

A100 Benchmarks
Task | # of streams | Avg Latency (ms) | Throughput (seq/sec) | GPU Version
NER | 1 | 3.19 | 311 | A100 SXM4-40GB
NER | 256 | 95.5 | 2,549 | A100 SXM4-40GB
Q&A | 1 | 4.95 | 201 | A100 SXM4-40GB
Q&A | 128 | 279 | 453 | A100 SXM4-40GB

V100 Benchmarks
Task | # of streams | Avg Latency (ms) | Throughput (seq/sec) | GPU Version
NER | 1 | 4.87 | 204 | V100 SXM2-16GB
NER | 256 | 135 | 1,797 | V100 SXM2-16GB
Q&A | 1 | 7.47 | 134 | V100 SXM2-16GB
Q&A | 128 | 521 | 244 | V100 SXM2-16GB

T4 Benchmarks
Task | # of streams | Avg Latency (ms) | Throughput (seq/sec) | GPU Version
NER | 1 | 9.31 | 107 | NVIDIA T4
NER | 256 | 255 | 960 | NVIDIA T4
Q&A | 1 | 11.5 | 87 | NVIDIA T4
Q&A | 128 | 571 | 223 | NVIDIA T4

Named Entity Recognition (NER): 128 seq len, BERT-base | Question Answering (QA): 384 seq len, BERT-large | NLP Throughput (seq/s) - Number of sequences processed per second | Performance of the Riva named entity recognition (NER) service (using a BERT-base model, sequence length of 128) and the Riva question answering (QA) service (using a BERT-large model, sequence length of 384) was measured in Riva. Batch size 1 latency and maximum throughput were measured. Riva version: v1.0.0-b1 | Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
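The footnote above describes the two measurements taken per NLP service: batch-size-1 latency and maximum throughput. A generic timing harness for that pattern might look like the following sketch; `infer` is a hypothetical stand-in for a client call, not part of the Riva API:

```python
import time

def benchmark(infer, batch, iters=100):
    """Return (avg latency in ms, throughput in sequences/sec) over repeated calls."""
    start = time.perf_counter()
    for _ in range(iters):
        infer(batch)
    elapsed = time.perf_counter() - start
    avg_latency_ms = elapsed / iters * 1000.0
    seq_per_sec = len(batch) * iters / elapsed
    return avg_latency_ms, seq_per_sec

# Batch-1 latency with a dummy "model"; a max-throughput run would instead
# submit a large batch (or many concurrent streams) and read the second value.
lat_ms, tput = benchmark(lambda batch: None, batch=["one sequence"], iters=50)
```

Note that the two metrics trade off: the tables show latency rising with stream count even as aggregate throughput climbs.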

Text to Speech

A100 Benchmarks
# of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version
1 | 0.06 | 0.04 | 20 | A100 SXM4-40GB
4 | 0.48 | 0.03 | 37 | A100 SXM4-40GB
6 | 0.69 | 0.03 | 42 | A100 SXM4-40GB
8 | 0.88 | 0.03 | 46 | A100 SXM4-40GB
10 | 1.06 | 0.03 | 49 | A100 SXM4-40GB

V100 Benchmarks
# of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version
1 | 0.08 | 0.05 | 14 | V100 SXM2-16GB
4 | 0.77 | 0.05 | 23 | V100 SXM2-16GB
6 | 1.11 | 0.05 | 26 | V100 SXM2-16GB
8 | 1.4 | 0.06 | 28 | V100 SXM2-16GB
10 | 1.74 | 0.07 | 28 | V100 SXM2-16GB

T4 Benchmarks
# of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version
1 | 0.12 | 0.07 | 11 | NVIDIA T4
4 | 1.02 | 0.07 | 17 | NVIDIA T4
6 | 1.59 | 0.07 | 18 | NVIDIA T4
8 | 2.13 | 0.08 | 19 | NVIDIA T4
10 | 2.55 | 0.1 | 18 | NVIDIA T4

TTS Throughput (RTFX) - Number of seconds of audio generated per second | Dataset: LJSpeech | Performance of the Riva text-to-speech (TTS) service was measured for different number of parallel streams. Each parallel stream performed 10 iterations over 10 input strings from the LJSpeech dataset. Latency to first audio chunk and latency between successive audio chunks and throughput were measured. Riva version: v1.0.0-b1 | Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz

 

Last updated: December 1st, 2021