
Please refer to the "Measuring Training and Inferencing Performance on NVIDIA AI Platforms" Reviewer's Guide for instructions on how to reproduce these performance claims.


Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology for testing whether AI systems are ready to be deployed in the field, because converged networks can deliver meaningful results (for example, correctly performing image recognition on video streams). Read our blog on convergence for more details. Training that does not converge measures the hardware's throughput on the specified AI network, but it is not representative of real-world applications.

NVIDIA’s complete solution stack, from GPUs to libraries and containers on NVIDIA GPU Cloud (NGC), allows data scientists to quickly get up and running with deep learning. NVIDIA A100 Tensor Core GPUs provide unprecedented acceleration at every scale, setting records in MLPerf™, the AI industry’s leading benchmark and a testament to our accelerated platform approach.

NVIDIA Performance on MLPerf 0.7 AI Benchmarks

BERT Time to Train on A100

PyTorch | Precision: Mixed | Dataset: Wikipedia 2020/01/01 | Convergence criteria - refer to MLPerf requirements

MLPerf Training Performance

NVIDIA A100 Performance on MLPerf 0.7 AI Benchmarks - Closed Division

Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
MXNet | ResNet-50 v1.5 | 39.78 | 75.90% classification | 8x A100 | DGX A100 | 0.7-18 | Mixed | ImageNet2012 | A100-SXM4-40GB
MXNet | ResNet-50 v1.5 | 23.75 | 75.90% classification | 16x A100 | DGX A100 | 0.7-21 | Mixed | ImageNet2012 | A100-SXM4-40GB
MXNet | ResNet-50 v1.5 | 1.06 | 75.90% classification | 768x A100 | DGX A100 | 0.7-32 | Mixed | ImageNet2012 | A100-SXM4-40GB
MXNet | ResNet-50 v1.5 | 0.83 | 75.90% classification | 1536x A100 | DGX A100 | 0.7-35 | Mixed | ImageNet2012 | A100-SXM4-40GB
MXNet | ResNet-50 v1.5 | 0.76 | 75.90% classification | 1840x A100 | DGX A100 | 0.7-37 | Mixed | ImageNet2012 | A100-SXM4-40GB
MXNet | SSD | 2.25 | 23.0% mAP | 64x A100 | DGX A100 | 0.7-25 | Mixed | COCO2017 | A100-SXM4-40GB
MXNet | SSD | 0.89 | 23.0% mAP | 512x A100 | DGX A100 | 0.7-31 | Mixed | COCO2017 | A100-SXM4-40GB
MXNet | SSD | 0.82 | 23.0% mAP | 1024x A100 | DGX A100 | 0.7-33 | Mixed | COCO2017 | A100-SXM4-40GB
PyTorch | BERT | 49.01 | 0.712 Mask-LM accuracy | 8x A100 | DGX A100 | 0.7-19 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-40GB
PyTorch | BERT | 30.63 | 0.712 Mask-LM accuracy | 16x A100 | DGX A100 | 0.7-22 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-40GB
PyTorch | BERT | 3.36 | 0.712 Mask-LM accuracy | 256x A100 | DGX A100 | 0.7-28 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-40GB
PyTorch | BERT | 1.48 | 0.712 Mask-LM accuracy | 1024x A100 | DGX A100 | 0.7-34 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-40GB
PyTorch | BERT | 0.81 | 0.712 Mask-LM accuracy | 2048x A100 | DGX A100 | 0.7-38 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-40GB
PyTorch | DLRM | 4.43 | 0.8025 AUC | 8x A100 | DGX A100 | 0.7-19 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-40GB
PyTorch | GNMT | 7.81 | 24.0 Sacre BLEU | 8x A100 | DGX A100 | 0.7-19 | Mixed | WMT16 English-German | A100-SXM4-40GB
PyTorch | GNMT | 4.94 | 24.0 Sacre BLEU | 16x A100 | DGX A100 | 0.7-22 | Mixed | WMT16 English-German | A100-SXM4-40GB
PyTorch | GNMT | 0.98 | 24.0 Sacre BLEU | 256x A100 | DGX A100 | 0.7-28 | Mixed | WMT16 English-German | A100-SXM4-40GB
PyTorch | GNMT | 0.71 | 24.0 Sacre BLEU | 1024x A100 | DGX A100 | 0.7-34 | Mixed | WMT16 English-German | A100-SXM4-40GB
PyTorch | Mask R-CNN | 82.16 | 0.377 Box min AP and 0.339 Mask min AP | 8x A100 | DGX A100 | 0.7-19 | Mixed | COCO2017 | A100-SXM4-40GB
PyTorch | Mask R-CNN | 44.21 | 0.377 Box min AP and 0.339 Mask min AP | 16x A100 | DGX A100 | 0.7-22 | Mixed | COCO2017 | A100-SXM4-40GB
PyTorch | Mask R-CNN | 28.46 | 0.377 Box min AP and 0.339 Mask min AP | 32x A100 | DGX A100 | 0.7-24 | Mixed | COCO2017 | A100-SXM4-40GB
PyTorch | Mask R-CNN | 10.46 | 0.377 Box min AP and 0.339 Mask min AP | 256x A100 | DGX A100 | 0.7-28 | Mixed | COCO2017 | A100-SXM4-40GB
PyTorch | SSD | 10.21 | 23.0% mAP | 8x A100 | DGX A100 | 0.7-19 | Mixed | COCO2017 | A100-SXM4-40GB
PyTorch | SSD | 5.68 | 23.0% mAP | 16x A100 | DGX A100 | 0.7-22 | Mixed | COCO2017 | A100-SXM4-40GB
PyTorch | Transformer | 7.84 | 25.00 BLEU | 8x A100 | DGX A100 | 0.7-19 | Mixed | WMT17 English-German | A100-SXM4-40GB
PyTorch | Transformer | 4.35 | 25.00 BLEU | 16x A100 | DGX A100 | 0.7-22 | Mixed | WMT17 English-German | A100-SXM4-40GB
PyTorch | Transformer | 1.8 | 25.00 BLEU | 80x A100 | DGX A100 | 0.7-26 | Mixed | WMT17 English-German | A100-SXM4-40GB
PyTorch | Transformer | 1.02 | 25.00 BLEU | 160x A100 | DGX A100 | 0.7-27 | Mixed | WMT17 English-German | A100-SXM4-40GB
PyTorch | Transformer | 0.62 | 25.00 BLEU | 480x A100 | DGX A100 | 0.7-30 | Mixed | WMT17 English-German | A100-SXM4-40GB
TensorFlow | MiniGo | 299.73 | 50% win rate vs. checkpoint | 8x A100 | DGX A100 | 0.7-20 | Mixed | N/A | A100-SXM4-40GB
TensorFlow | MiniGo | 165.72 | 50% win rate vs. checkpoint | 16x A100 | DGX A100 | 0.7-23 | Mixed | N/A | A100-SXM4-40GB
TensorFlow | MiniGo | 29.7 | 50% win rate vs. checkpoint | 256x A100 | DGX A100 | 0.7-29 | Mixed | N/A | A100-SXM4-40GB
TensorFlow | MiniGo | 17.07 | 50% win rate vs. checkpoint | 1792x A100 | DGX A100 | 0.7-36 | Mixed | N/A | A100-SXM4-40GB
NVIDIA Merlin HugeCTR | DLRM | 3.33 | 0.8025 AUC | 8x A100 | DGX A100 | 0.7-17 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-40GB

Converged Training Performance of NVIDIA A100, V100 and T4

Benchmarks are reproducible by following the links to the NGC scripts.
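Time to train and throughput in the tables below are related by simple arithmetic. As a rough sketch (the 50-epoch budget here is an illustrative assumption, not a value from the tables), converged training time can be estimated from sustained throughput:

```python
# Estimate time-to-train from sustained throughput.
# The epoch budget below is an illustrative assumption, not a value taken
# from the tables; real runs also include validation and I/O overhead.

IMAGENET_TRAIN_IMAGES = 1_281_167  # ImageNet2012 training-set size

def estimated_ttt_minutes(images_per_epoch: int, epochs: int,
                          images_per_sec: float) -> float:
    """Total images processed divided by sustained throughput, in minutes."""
    return images_per_epoch * epochs / images_per_sec / 60.0

# A hypothetical 50-epoch schedule at the 8x A100 ResNet-50 rate from the
# table (22,008 images/sec) lands in the same ballpark as the listed 40 min.
estimate = estimated_ttt_minutes(IMAGENET_TRAIN_IMAGES, 50, 22_008)
print(f"{estimate:.1f} minutes")  # ~48.5 minutes before overheads
```

The gap between such an estimate and the listed number comes from the actual epoch schedule, warmup, and evaluation overhead of the NGC recipe.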

A100 Training Performance

Framework | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | ResNet-50 v1.5 | 40 | 75.9 Top 1 Accuracy | 22,008 images/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 408 | ImageNet2012 | A100-SXM4-40GB
PyTorch | Mask R-CNN | 176 | .34 AP Segm | 167 images/sec | 8x A100 | DGX A100 | 20.12-py3 | TF32 | 8 | COCO 2014 | A100-SXM-80GB
PyTorch | ResNeXt101 | 300 | 79.37 Top 1 | 6,888 images/sec | 8x A100 | DGX A100 | - | Mixed | 128 | ImageNet2012 | A100-SXM4-40GB
PyTorch | SE-ResNeXt101 | 341 | 78.82 Top 1 | 5,737 images/sec | 8x A100 | DGX A100 | 21.03-py3 | Mixed | 128 | Imagenet2012 | A100-SXM-80GB
PyTorch | SSD v1.1 | 43 | .25 mAP | 3,092 images/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 128 | COCO 2017 | A100-SXM-80GB
PyTorch | Tacotron2 | 103 | .57 Training Loss | 299,005 total output mels/sec | 8x A100 | DGX A100 | 21.02-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM-80GB
PyTorch | WaveGlow | 293 | -5.85 Training Loss | 1,434,954 output samples/sec | 8x A100 | DGX A100 | 21.03-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM-80GB
PyTorch | Jasper | 3,600 | 3.53 dev-clean WER | 603 sequences/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 64 | LibriSpeech | A100 SXM4-40GB
PyTorch | Transformer | 167 | 27.76 | 582,721 words/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 10240 | wmt14-en-de | A100 SXM4-40GB
PyTorch | FastPitch | 216 | .18 Training Loss | 1,040,206 frames/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 32 | LJSpeech 1.1 | A100 SXM4-40GB
PyTorch | GNMT V2 | 17 | 24.02 BLEU Score | 882,331 total tokens/sec | 8x A100 | DGX A100 | 20.12-py3 | Mixed | 128 | wmt16-en-de | A100-SXM-80GB
PyTorch | NCF | 0.37 | .96 Hit Rate at 10 | 153,085,874 samples/sec | 8x A100 | DGX A100 | 21.03-py3 | Mixed | 131072 | MovieLens 20M | A100-SXM-80GB
PyTorch | BERT-LARGE | 3 | 91.03 F1 | 938 sequences/sec | 8x A100 | DGX A100 | 21.03-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM-80GB
PyTorch | Transformer-XL Large | 429 | 14.07 Perplexity | 192,038 total tokens/sec | 8x A100 | DGX A100 | 20.12-py3 | Mixed | 16 | WikiText-103 | A100-SXM-80GB
PyTorch | Transformer-XL Base | 214 | 16.92 Perplexity | 614,404 total tokens/sec | 8x A100 | DGX A100 | 20.12-py3 | Mixed | 128 | WikiText-103 | A100-SXM-80GB
PyTorch | BERT-Large Pre-Training P1 | 2,379 | - | 3,231 sequences/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | - | Wikipedia+BookCorpus | A100-SXM4-40GB
PyTorch | BERT-Large Pre-Training P2 | 1,377 | 1.34 Final Loss | 630 sequences/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | - | Wikipedia+BookCorpus | A100-SXM4-40GB
PyTorch | BERT-Large Pre-Training E2E | 3,756 | 1.34 Final Loss | - | 8x A100 | DGX A100 | 20.06-py3 | Mixed | - | Wikipedia+BookCorpus | A100-SXM4-40GB
Tensorflow | ResNext101 | 195 | 79.19 Top 1 | 9,939 images/sec | 8x A100 | DGX A100 | 21.02-py3 | Mixed | 256 | Imagenet2012 | A100-SXM-80GB
Tensorflow | Mask R-CNN | 193 | .34 AP Segm | 137 samples/sec | 8x A100 | DGX A100 | 21.03-py3 | Mixed | 4 | COCO 2014 | A100-SXM-80GB
Tensorflow | U-Net Industrial | 1 | .99 IoU Threshold | 1,027 images/sec | 8x A100 | DGX A100 | 21.03-py3 | Mixed | 2 | DAGM2007 | A100-SXM-80GB
Tensorflow | U-Net Medical | 6 | .9 DICE Score | 952 images/sec | 8x A100 | DGX A100 | 21.03-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM-80GB
Tensorflow | VAE-CF | 1 | .43 NDCG@100 | 1,534,868 users processed/sec | 8x A100 | DGX A100 | 21.03-py3 | TF32 | 3072 | MovieLens 20M | A100-SXM-80GB
Tensorflow | Wide and Deep | 107 | .68 MAP at 12 | 1,111,976 samples/sec | 8x A100 | DGX A100 | 20.10-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | A100 SXM4-40GB
Tensorflow | BERT-LARGE | 11 | 91.2 F1 | 847 sequences/sec | 8x A100 | DGX A100 | 21.03-py3 | Mixed | 24 | SQuaD v1.1 | A100-SXM-80GB
Tensorflow | Electra Fine Tuning | 3 | 92.56 F1 | 2,461 sequences/sec | 8x A100 | DGX A100 | 21.03-py3 | Mixed | 32 | SQuaD v1.1 | A100 SXM 80GB
Tensorflow | EfficientNet-B4 | 4,231 | 82.81 Top 1 | 2,535 images/sec | 8x A100 | DGX A100 | 20.08-py3 | Mixed | 160 | ImageNet2012 | A100-SXM-80GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine Tuning (Squadv1.1) with Sequence Length of 384
BERT-Large Pre-Training Sequence Length for Phase 1 = 128 and Phase 2 = 512 | Batch Size for Phase 1 = 65,536 and Phase 2 = 32,768
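The two BERT-Large pre-training phases in the table compose end to end, and time multiplied by throughput gives the implied sequence counts. A quick consistency check, using only arithmetic on the table's values:

```python
# BERT-Large pre-training figures from the A100 converged-training table.
p1_minutes, p1_seq_per_sec = 2_379, 3_231   # Phase 1: sequence length 128
p2_minutes, p2_seq_per_sec = 1_377, 630     # Phase 2: sequence length 512
e2e_minutes = 3_756                          # reported end-to-end time

# The reported end-to-end time is the sum of the two phases.
assert p1_minutes + p2_minutes == e2e_minutes

# Sequences processed per phase, implied by time x throughput.
p1_sequences = p1_minutes * 60 * p1_seq_per_sec
p2_sequences = p2_minutes * 60 * p2_seq_per_sec
print(p1_sequences, p2_sequences)
```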
EfficientNet-B4: Mixup = 0.2 | Auto-Augmentation | cuDNN Version = 8.0.5.39 | NCCL Version = 2.7.8

A30 Training Performance

Framework | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | ResNet-50 v1.5 | 662 | 76.78 Top 1 | 2,963 images/sec | 4x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.02-py3 | Mixed | 256 | ImageNet2012 | A30

A10 Training Performance

Framework | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | NCF | 1 | .96 Hit Rate at 10 | 47,713,385 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 131072 | MovieLens 20M | A10
PyTorch | BERT-LARGE | 13 | 91.15 F1 | 235 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 10 | SQuaD v1.1 | A10
Tensorflow | U-Net Industrial | 1 | .99 IoU Threshold 0.95 | 541 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 2 | DAGM2007 | A10
Tensorflow | VAE-CF | 1 | .43 NDCG@100 | 637,683 users processed/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 3072 | MovieLens 20M | A10

Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine Tuning (Squadv1.1) with Sequence Length of 384

V100 Training Performance

Framework | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | ResNet-50 v1.5 | 168 | 77.03 Top 1 | 11,802 images/sec | 8x V100 | DGX-2 | 21.03-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
PyTorch | Mask R-CNN | 269 | .34 AP Segm | 109 images/sec | 8x V100 | DGX-2 | 20.12-py3 | Mixed | 8 | COCO 2014 | V100-SXM3-32GB
PyTorch | Tacotron2 | 192 | .53 Training Loss | 163,818 total output mels/sec | 8x V100 | DGX-2 | 21.03-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB
PyTorch | WaveGlow | 472 | -5.75 Training Loss | 904,322 output samples/sec | 8x V100 | DGX-2 | 21.03-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB
PyTorch | Jasper | 6,300 | 3.49 dev-clean WER | 312 sequences/sec | 8x V100 | DGX-2 | 20.06-py3 | Mixed | 64 | LibriSpeech | V100 SXM2-32GB
PyTorch | Transformer | 276 | 27.82 BLEU Score | 223,245 words/sec | 8x V100 | DGX-1 | 20.06-py3 | Mixed | 5120 | wmt14-en-de | V100 SXM2-16GB
PyTorch | FastPitch | 354 | .18 Training Loss | 570,968 frames/sec | 8x V100 | DGX-1 | 20.06-py3 | Mixed | 32 | LJSpeech 1.1 | V100 SXM2-16GB
PyTorch | GNMT V2 | 31 | 24.03 BLEU Score | 479,039 total tokens/sec | 8x V100 | DGX-2 | 20.10-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB
PyTorch | NCF | 0.55 | .96 Hit Rate at 10 | 99,061,003 samples/sec | 8x V100 | DGX-2 | 21.03-py3 | Mixed | 131072 | MovieLens 20M | V100-SXM3-32GB
PyTorch | BERT-LARGE | 8 | 91.23 F1 | 367 sequences/sec | 8x V100 | DGX-2 | 21.03-py3 | Mixed | 10 | SQuaD v1.1 | V100-SXM3-32GB
PyTorch | Transformer-XL Base | 118 | 22.15 Perplexity | 279,880 total tokens/sec | 8x V100 | DGX-2 | 20.12-py3 | Mixed | 32 | WikiText-103 | V100-SXM3-32GB
Tensorflow | ResNet-50 v1.5 | 184 | 76.99 Top 1 | 10,491 images/sec | 8x V100 | DGX-2 | 21.03-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
Tensorflow | ResNext101 | 413 | 79.43 Top 1 | 4,697 images/sec | 8x V100 | DGX-2 | 21.02-py3 | Mixed | 128 | Imagenet2012 | V100-SXM3-32GB
Tensorflow | SE-ResNext101 | 498 | 79.96 Top 1 | 3,915 images/sec | 8x V100 | DGX-2 | 21.03-py3 | Mixed | 96 | Imagenet2012 | V100-SXM3-32GB
Tensorflow | U-Net Industrial | 0.97 | .99 IoU Threshold | 661 images/sec | 8x V100 | DGX-2 | 21.03-py3 | Mixed | 2 | DAGM2007 | V100-SXM3-32GB
Tensorflow | U-Net Medical | 14 | .9 DICE Score | 470 images/sec | 8x V100 | DGX-2 | 21.03-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB
Tensorflow | VAE-CF | 0.70 | .43 NDCG@100 | 1,001,210 users processed/sec | 8x V100 | DGX-1 | 20.06-py3 | Mixed | 3072 | MovieLens 20M | V100 SXM2-32GB
Tensorflow | Wide and Deep | 185 | .68 MAP at 12 | 643,334 samples/sec | 8x V100 | DGX-1 | 20.10-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | V100 SXM2-16GB
Tensorflow | BERT-LARGE | 19 | 91.13 F1 | 331 sequences/sec | 8x V100 | DGX-2 | 21.03-py3 | Mixed | 10 | SQuaD v1.1 | V100-SXM3-32GB
Tensorflow | Electra Fine-Tuning | 6 | 92.72 F1 | 1,051 images/sec | 8x V100 | DGX-1 | 20.07-py3 | Mixed | 16 | SQuaD v1.1 | V100 SXM2-16GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (Squadv1.1) with Sequence Length of 384

T4 Training Performance

Framework | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | ResNet-50 v1.5 | 478 | 77.26 Top 1 | 4,085 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.03-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4
PyTorch | ResNeXt101 | 1,738 | 78.75 Top 1 | 1,124 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4
PyTorch | Tacotron2 | 247 | .53 Training Loss | 123,614 total output mels/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.03-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4
PyTorch | WaveGlow | 1,034 | -5.74 Training Loss | 406,920 output samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.03-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4
PyTorch | Transformer | 2,288 | 27.65 BLEU Score | 42,030 words/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.03-py3 | Mixed | 2560 | wmt14-en-de | NVIDIA T4
PyTorch | FastPitch | 319 | .21 Training Loss | 281,406 frames/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 32 | LJSpeech 1.1 | NVIDIA T4
PyTorch | GNMT V2 | 90 | 24.27 BLEU Score | 160,009 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4
PyTorch | NCF | 2 | .96 Hit Rate at 10 | 26,354,810 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.03-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA T4
PyTorch | BERT-LARGE | 23 | 91.25 F1 | 127 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.03-py3 | Mixed | 10 | SQuaD v1.1 | NVIDIA T4
PyTorch | Transformer-XL Base | 333 | 22.21 Perplexity | 98,917 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.12-py3 | Mixed | 32 | WikiText-103 | NVIDIA T4
Tensorflow | U-Net Industrial | 2 | .99 IoU Threshold 0.95 | 292 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.03-py3 | Mixed | 2 | DAGM2007 | NVIDIA T4
Tensorflow | U-Net Medical | 13 | .89 DICE Score | 151 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.02-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4
Tensorflow | VAE-CF | 2 | .43 NDCG@100 | 373,317 users processed/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.03-py3 | Mixed | 3072 | MovieLens 20M | NVIDIA T4
Tensorflow | SSD | 112 | .28 mAP | 549 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.12-py3 | Mixed | 32 | COCO 2017 | NVIDIA T4
Tensorflow | Mask R-CNN | 492 | .34 AP Segm | 52 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.03-py3 | Mixed | 4 | COCO 2014 | NVIDIA T4

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (Squadv1.1) with Sequence Length of 384


Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology for testing AI systems, and it is typically done on multi-accelerator systems (see the ‘Training-Convergence’ tab or read our blog on convergence for more details) to shorten training-to-convergence times, especially for recurring monthly container builds.

Scenarios that are not typically used in real-world training, such as single-GPU throughput, are illustrated in the table below and provided for reference as an indication of the platform's single-chip throughput.

NVIDIA’s complete solution stack, from hardware to software, allows data scientists to deliver unprecedented acceleration at every scale. Visit NVIDIA GPU Cloud (NGC) to pull containers and quickly get up and running with deep learning.

Single GPU Training Performance of NVIDIA A100, V100 and T4

Benchmarks are reproducible by following the links to the NGC scripts.
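Comparing these single-GPU numbers with the 8-GPU rates in the converged-training tables gives a rough multi-GPU scaling efficiency. The sketch below uses the GNMT V2 (PyTorch) figures; the two tables use different containers and settings, so the ratio is only indicative:

```python
# Rough scaling efficiency: 8-GPU throughput vs. 8x the 1-GPU throughput.
# Values are the GNMT V2 (PyTorch) rates from the A100 tables; containers
# differ between the two tables, so treat the result as indicative only.
single_gpu = 148_754   # total tokens/sec, 1x A100
eight_gpu = 882_331    # total tokens/sec, 8x A100 (converged-training table)

efficiency = eight_gpu / (8 * single_gpu)
print(f"{efficiency:.0%}")  # roughly 74%
```

Anything below 100% reflects communication and input-pipeline overhead, plus the configuration differences between the two measurements.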

A100 Training Performance

Framework | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | ResNet-50 v1.5 | 2,751 images/sec | 1x A100 | - | - | Mixed | 408 | ImageNet2012 | A100-SXM-80GB
PyTorch | Mask R-CNN | 29 images/sec | 1x A100 | DGX A100 | 21.03-py3 | TF32 | 8 | COCO 2014 | A100-SXM-80GB
PyTorch | SSD v1.1 | 442 images/sec | 1x A100 | DGX A100 | 21.03-py3 | Mixed | 128 | COCO 2017 | A100-SXM-80GB
PyTorch | Tacotron2 | 39,288 total output mels/sec | 1x A100 | DGX A100 | 21.03-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM-80GB
PyTorch | WaveGlow | 206,623 output samples/sec | 1x A100 | DGX A100 | 21.03-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM-80GB
PyTorch | Jasper | 83 sequences/sec | 1x A100 | DGX A100 | 20.12-py3 | Mixed | 64 | LibriSpeech | A100-SXM4-40GB
PyTorch | Transformer | 82,618 words/sec | 1x A100 | DGX A100 | 20.06-py3 | Mixed | 10240 | wmt14-en-de | A100 SXM4-40GB
PyTorch | FastPitch | 169,838 frames/sec | 1x A100 | DGX A100 | 21.03-py3 | Mixed | 128 | LJSpeech 1.1 | A100-SXM-80GB
PyTorch | GNMT V2 | 148,754 total tokens/sec | 1x A100 | DGX A100 | 20.12-py3 | Mixed | 128 | wmt16-en-de | A100-SXM-80GB
PyTorch | NCF | 36,090,563 samples/sec | 1x A100 | DGX A100 | 21.03-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM-80GB
PyTorch | BERT-LARGE | 123 sequences/sec | 1x A100 | DGX A100 | 21.03-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM-80GB
PyTorch | Transformer-XL Large | 27,364 total tokens/sec | 1x A100 | DGX A100 | 20.12-py3 | Mixed | 16 | WikiText-103 | A100-SXM-80GB
PyTorch | Transformer-XL Base | 80,692 total tokens/sec | 1x A100 | DGX A100 | 20.12-py3 | Mixed | 128 | WikiText-103 | A100-SXM-80GB
Tensorflow | ResNet-50 v1.5 | 2,678 images/sec | 1x A100 | DGX A100 | 21.03-py3 | Mixed | 256 | ImageNet2012 | A100-SXM-80GB
Tensorflow | ResNext101 | 1,325 images/sec | 1x A100 | DGX A100 | 21.03-py3 | Mixed | 256 | Imagenet2012 | A100-SXM-80GB
Tensorflow | SE-ResNext101 | 1,132 images/sec | 1x A100 | DGX A100 | 21.03-py3 | Mixed | 256 | Imagenet2012 | A100-SXM-80GB
Tensorflow | U-Net Industrial | 319 images/sec | 1x A100 | DGX A100 | 20.11-py3 | Mixed | 16 | DAGM2007 | A100-SXM4-40GB
Tensorflow | U-Net Medical | 146 images/sec | 1x A100 | DGX A100 | 21.03-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM-80GB
Tensorflow | VAE-CF | 395,263 users processed/sec | 1x A100 | DGX A100 | 21.03-py3 | Mixed | 24576 | MovieLens 20M | A100-SXM-80GB
Tensorflow | Wide and Deep | 333,751 samples/sec | 1x A100 | DGX A100 | 21.03-py3 | TF32 | 131072 | Kaggle Outbrain Click Prediction | A100-SXM-80GB
Tensorflow | BERT-LARGE | 117 sequences/sec | 1x A100 | DGX A100 | 21.03-py3 | Mixed | 24 | SQuaD v1.1 | A100-SXM-80GB
Tensorflow | Electra Fine Tuning | 343 sequences/sec | 1x A100 | DGX A100 | 21.03-py3 | Mixed | 32 | SQuaD v1.1 | A100 SXM 80GB
Tensorflow | EfficientNet-B4 | 332 images/sec | 1x A100 | DGX A100 | - | Mixed | 160 | ImageNet2012 | A100-SXM-80GB
Tensorflow | NCF | 40,477,425 samples/sec | 1x A100 | DGX A100 | 20.11-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-40GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
BERT-Large = BERT-Large Fine Tuning (Squadv1.1) with Sequence Length of 384
EfficientNet-B4: Basic Augmentation | cuDNN Version = 8.0.5.32 | NCCL Version = 2.7.8 | Installation Source = NGC

A30 Training Performance

Framework | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | ResNet-50 v1.5 | 1,334 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | Mixed | 64 | ImageNet2012 | A30
PyTorch | Mask R-CNN | 16 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | Mixed | 8 | COCO 2014 | A30
PyTorch | SSD v1.1 | 224 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | Mixed | 64 | COCO 2017 | A30
PyTorch | Tacotron2 | 18,980 total output mels/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | Mixed | 104 | LJSpeech 1.1 | A30
PyTorch | WaveGlow | 119,513 output samples/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | Mixed | 10 | LJSpeech 1.1 | A30
PyTorch | Transformer | 25,314 words/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | Mixed | 2560 | wmt14-en-de | A30
PyTorch | FastPitch | 95,100 frames/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | Mixed | 64 | LJSpeech 1.1 | A30
PyTorch | NCF | 19,126,726 samples/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | Mixed | 1048576 | MovieLens 20M | A30
PyTorch | ResNeXt101 | 561 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | Mixed | 128 | Imagenet2012 | A30
Tensorflow | ResNet-50 v1.5 | 1,341 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | Mixed | 256 | ImageNet2012 | A30
Tensorflow | ResNext101 | 597 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | Mixed | 128 | Imagenet2012 | A30
Tensorflow | SE-ResNext101 | 488 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | Mixed | 96 | Imagenet2012 | A30
Tensorflow | U-Net Industrial | 102 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | Mixed | 16 | DAGM2007 | A30
Tensorflow | U-Net Medical | 70 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | Mixed | 8 | EM segmentation challenge | A30
Tensorflow | VAE-CF | 201,658 users processed/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | Mixed | 24576 | MovieLens 20M | A30
Tensorflow | Wide and Deep | 229,387 samples/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | TF32 | 131072 | Kaggle Outbrain Click Prediction | A30
Tensorflow | Electra Fine Tuning | 152 sequences/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | Mixed | 16 | SQuaD v1.1 | A30
Tensorflow | NCF | 13,574,785 samples/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | Mixed | 1048576 | MovieLens 20M | A30
Tensorflow | Transformer XL Base | 15,976 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | Mixed | 16 | WikiText-103 | A30

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server

A10 Training Performance

Framework | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | ResNet-50 v1.5 | 1,075 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 192 | ImageNet2012 | A10
PyTorch | Mask R-CNN | 13 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 8 | COCO 2014 | A10
PyTorch | SSD v1.1 | 174 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 64 | COCO 2017 | A10
PyTorch | Tacotron2 | 19,520 total output mels/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 104 | LJSpeech 1.1 | A10
PyTorch | WaveGlow | 99,358 output samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 10 | LJSpeech 1.1 | A10
PyTorch | Transformer | 22,365 words/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 2560 | wmt14-en-de | A10
PyTorch | FastPitch | 89,154 frames/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 64 | LJSpeech 1.1 | A10
PyTorch | NCF | 16,478,978 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 1048576 | MovieLens 20M | A10
Tensorflow | ResNet-50 v1.5 | 1,006 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 256 | ImageNet2012 | A10
Tensorflow | ResNext101 | 460 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 128 | Imagenet2012 | A10
Tensorflow | SE-ResNext101 | 342 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 96 | Imagenet2012 | A10
Tensorflow | U-Net Industrial | 94 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 16 | DAGM2007 | A10
Tensorflow | U-Net Medical | 50 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 8 | EM segmentation challenge | A10
Tensorflow | VAE-CF | 174,443 users processed/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 24576 | MovieLens 20M | A10
Tensorflow | Wide and Deep | 249,905 samples/sec | 1x A10 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A10
Tensorflow | Electra Fine Tuning | 133 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 16 | SQuaD v1.1 | A10
Tensorflow | NCF | 11,962,242 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 1048576 | MovieLens 20M | A10
Tensorflow | Transformer XL Base | 13,220 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 16 | WikiText-103 | A10

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server

V100 Training Performance

Framework | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | ResNet-50 v1.5 | 1,510 images/sec | 1x V100 | DGX-2 | 21.03-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
PyTorch | Mask R-CNN | 18 images/sec | 1x V100 | DGX-2 | 21.03-py3 | Mixed | 8 | COCO 2014 | V100-SXM3-32GB
PyTorch | ResNeXt101 | 575 images/sec | 1x V100 | DGX-2 | 21.03-py3 | Mixed | 128 | Imagenet2012 | V100-SXM3-32GB
PyTorch | SSD v1.1 | 234 images/sec | 1x V100 | DGX-2 | 21.03-py3 | Mixed | 64 | COCO 2017 | V100-SXM3-32GB
PyTorch | Tacotron2 | 22,545 total output mels/sec | 1x V100 | DGX-2 | 21.02-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB
PyTorch | WaveGlow | 134,178 output samples/sec | 1x V100 | DGX-2 | 21.03-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB
PyTorch | Jasper | 47 sequences/sec | 1x V100 | DGX-2 | 20.12-py3 | Mixed | 64 | LibriSpeech | V100-SXM3-32GB
PyTorch | Transformer | 33,468 words/sec | 1x V100 | DGX-1 | 20.06-py3 | Mixed | 5120 | wmt14-en-de | V100 SXM2-16GB
PyTorch | FastPitch | 110,370 frames/sec | 1x V100 | DGX-1 | 20.06-py3 | Mixed | 64 | LJSpeech 1.1 | V100 SXM2-16GB
PyTorch | GNMT V2 | 83,200 total tokens/sec | 1x V100 | DGX-2 | 20.09-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB
PyTorch | NCF | 22,123,689 samples/sec | 1x V100 | DGX-2 | 21.03-py3 | Mixed | 1048576 | MovieLens 20M | V100-SXM3-32GB
PyTorch | BERT-LARGE | 53 sequences/sec | 1x V100 | DGX-2 | 21.03-py3 | Mixed | 10 | SQuaD v1.1 | V100-SXM3-32GB
PyTorch | Transformer-XL Base | 42,292 total tokens/sec | 1x V100 | DGX-2 | 20.12-py3 | Mixed | 32 | WikiText-103 | V100-SXM3-32GB
PyTorch | Transformer-XL Large | 14,499 total tokens/sec | 1x V100 | DGX-2 | 20.12-py3 | Mixed | 8 | WikiText-103 | V100-SXM3-32GB
Tensorflow | ResNet-50 v1.5 | 1,396 images/sec | 1x V100 | DGX-2 | 21.03-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
Tensorflow | ResNext101 | 636 images/sec | 1x V100 | DGX-2 | 21.03-py3 | Mixed | 128 | Imagenet2012 | V100-SXM3-32GB
Tensorflow | SE-ResNext101 | 550 images/sec | 1x V100 | DGX-2 | 21.03-py3 | Mixed | 96 | Imagenet2012 | V100-SXM3-32GB
Tensorflow | U-Net Industrial | 169 images/sec | 1x V100 | DGX-1 | 20.06-py3 | Mixed | 16 | DAGM2007 | V100 SXM2-16GB
Tensorflow | U-Net Medical | 68 images/sec | 1x V100 | DGX-2 | 21.03-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB
Tensorflow | VAE-CF | 223,120 users processed/sec | 1x V100 | DGX-2 | 21.03-py3 | Mixed | 24576 | MovieLens 20M | V100-SXM3-32GB
Tensorflow | Wide and Deep | 276,971 samples/sec | 1x V100 | DGX-2 | 21.03-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | V100-SXM3-32GB
Tensorflow | BERT-LARGE | 48 sequences/sec | 1x V100 | DGX-2 | 21.03-py3 | Mixed | 10 | SQuaD v1.1 | V100-SXM3-32GB
Tensorflow | Electra Fine-Tuning | 194 sequences/sec | 1x V100 | DGX-2 | 20.12-py3 | Mixed | 32 | SQuaD v1.1 | V100-SXM3-32GB
Tensorflow | SSD | 233 images/sec | 1x V100 | DGX-2 | 21.03-py3 | Mixed | 32 | COCO 2017 | V100-SXM3-32GB
Tensorflow | Mask R-CNN | 15 samples/sec | 1x V100 | DGX-2 | 21.03-py3 | Mixed | 4 | COCO 2014 | V100-SXM3-32GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (Squadv1.1) with Sequence Length of 384

T4 Training Performance

Framework | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | ResNet-50 v1.5 | 514 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 64 | ImageNet2012 | NVIDIA T4
PyTorch | ResNeXt101 | 208 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4
PyTorch | Tacotron2 | 16,844 total output mels/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4
PyTorch | WaveGlow | 57,673 output samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4
PyTorch | Transformer | 10,602 words/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 2560 | wmt14-en-de | NVIDIA T4
PyTorch | FastPitch | 38,985 frames/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 64 | LJSpeech 1.1 | NVIDIA T4
PyTorch | GNMT V2 | 31,592 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 20.12-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4
PyTorch | NCF | 8,075,883 samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 1048576 | MovieLens 20M | NVIDIA T4
PyTorch | BERT-LARGE | 18 sequences/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 10 | SQuaD v1.1 | NVIDIA T4
PyTorch | Transformer-XL Base | 17,062 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 20.12-py3 | Mixed | 32 | WikiText-103 | NVIDIA T4
PyTorch | Mask R-CNN | 7 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 4 | COCO 2014 | NVIDIA T4
PyTorch | Jasper | 14 sequences/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 20.12-py3 | Mixed | 32 | LibriSpeech | NVIDIA T4
PyTorch | SE-ResNeXt101 | 162 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4
PyTorch | Transformer-XL Large | 5,064 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 20.12-py3 | Mixed | 4 | WikiText-103 | NVIDIA T4
Tensorflow | U-Net Industrial | 42 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 16 | DAGM2007 | NVIDIA T4
Tensorflow | U-Net Medical | 22 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4
Tensorflow | VAE-CF | 82,118 users processed/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 24576 | MovieLens 20M | NVIDIA T4
Tensorflow | SSD | 98 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 32 | COCO 2017 | NVIDIA T4
Tensorflow | Mask R-CNN | 8 samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 4 | COCO 2014 | NVIDIA T4
Tensorflow | Wide and Deep | 203,867 samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | NVIDIA T4
Tensorflow | SE-ResNext101 | 172 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 96 | Imagenet2012 | NVIDIA T4
Tensorflow | Electra Fine-Tuning | 62 sequences/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 16 | SQuaD v1.1 | NVIDIA T4

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (Squadv1.1) with Sequence Length of 384

Real-world AI inferencing demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance, from data centers to the edge.

NVIDIA landed top performance spots on all MLPerf™ Inference 1.0 tests, the AI industry’s leading benchmark. NVIDIA TensorRT™ running on NVIDIA Tensor Core GPUs enables the most efficient deep learning inference performance across multiple application areas and models. This versatility gives data scientists wide latitude to create the optimal low-latency solution. Visit NVIDIA® GPU Cloud (NGC) to download any of these containers and immediately race into production. The inference whitepaper provides an overview of inference platforms.

Measuring inference performance involves balancing a lot of variables. PLASTER is an acronym that describes the key elements for measuring deep learning performance. Each letter identifies a factor (Programmability, Latency, Accuracy, Size of Model, Throughput, Energy Efficiency, Rate of Learning) that must be considered to arrive at the right set of tradeoffs and to produce a successful deep learning implementation. Refer to the PLASTER whitepaper for more details.


MLPerf Inference v1.0 Performance Benchmarks

Offline Scenario - Closed Division

Network | Throughput | Target Accuracy | GPU | Server | Dataset | GPU Version
ResNet-50 v1.5 | 105,677 samples/sec | 76.46% Top1 | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | ImageNet | A10
ResNet-50 v1.5 | 141,518 samples/sec | 76.46% Top1 | 8x A30 | Gigabyte G482-Z54 | ImageNet | A30
ResNet-50 v1.5 | 5,108 samples/sec | 76.46% Top1 | 1x 1g.10gb A100 | DGX A100 | ImageNet | A100 SXM-80GB
ResNet-50 v1.5 | 38,010 samples/sec | 76.46% Top1 | 1x A100 | DGX A100 | ImageNet | A100 SXM-80GB
ResNet-50 v1.5 | 304,876 samples/sec | 76.46% Top1 | 8x A100 | DGX A100 | ImageNet | A100 SXM-80GB
ResNet-50 v1.5 | 248,179 samples/sec | 76.46% Top1 | 8x A100 | Gigabyte G482-Z54 | ImageNet | A100-PCIe-40GB
ResNet-50 v1.5 | 132,926 samples/sec | 76.46% Top1 | 4x A100 | DGX-Station-A100 | ImageNet | A100 SXM-80GB
SSD ResNet-34 | 2,496 samples/sec | 0.2 mAP | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | COCO | A10
SSD ResNet-34 | 3,756 samples/sec | 0.2 mAP | 8x A30 | Gigabyte G482-Z54 | COCO | A30
SSD ResNet-34 | 134 samples/sec | 0.2 mAP | 1x 1g.10gb A100 | DGX A100 | COCO | A100 SXM-80GB
SSD ResNet-34 | 989 samples/sec | 0.2 mAP | 1x A100 | DGX A100 | COCO | A100 SXM-80GB
SSD ResNet-34 | 7,879 samples/sec | 0.2 mAP | 8x A100 | DGX A100 | COCO | A100 SXM-80GB
SSD ResNet-34 | 6,586 samples/sec | 0.2 mAP | 8x A100 | Gigabyte G482-Z54 | COCO | A100-PCIe-40GB
SSD ResNet-34 | 3,370 samples/sec | 0.2 mAP | 4x A100 | DGX-Station-A100 | COCO | A100 SXM-80GB
3D-UNet | 172 samples/sec | 0.853 DICE mean | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | BraTS 2019 | A10
3D-UNet | 237 samples/sec | 0.853 DICE mean | 8x A30 | Gigabyte G482-Z54 | BraTS 2019 | A30
3D-UNet | 7 samples/sec | 0.853 DICE mean | 1x 1g.10gb A100 | DGX A100 | BraTS 2019 | A100 SXM-80GB
3D-UNet | 61 samples/sec | 0.853 DICE mean | 1x A100 | DGX A100 | BraTS 2019 | A100 SXM-80GB
3D-UNet | 480 samples/sec | 0.853 DICE mean | 8x A100 | DGX A100 | BraTS 2019 | A100 SXM-80GB
3D-UNet | 412 samples/sec | 0.853 DICE mean | 8x A100 | Gigabyte G482-Z54 | BraTS 2019 | A100-PCIe-40GB
3D-UNet | 214 samples/sec | 0.853 DICE mean | 4x A100 | DGX-Station-A100 | BraTS 2019 | A100 SXM-80GB
RNN-T | 36,116 samples/sec | 7.45% WER | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | LibriSpeech | A10
RNN-T | 51,690 samples/sec | 7.45% WER | 8x A30 | Gigabyte G482-Z54 | LibriSpeech | A30
RNN-T | 1,553 samples/sec | 7.45% WER | 1x 1g.10gb A100 | DGX A100 | LibriSpeech | A100 SXM-80GB
RNN-T | 14,008 samples/sec | 7.45% WER | 1x A100 | DGX A100 | LibriSpeech | A100 SXM-80GB
RNN-T | 105,677 samples/sec | 7.45% WER | 8x A100 | DGX A100 | LibriSpeech | A100 SXM-80GB
RNN-T | 90,853 samples/sec | 7.45% WER | 8x A100 | Gigabyte G482-Z54 | LibriSpeech | A100-PCIe-40GB
RNN-T | 48,886 samples/sec | 7.45% WER | 4x A100 | DGX-Station-A100 | LibriSpeech | A100 SXM-80GB
BERT | 8,454 samples/sec | 90.07% f1 | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | SQuAD v1.1 | A10
BERT | 13,260 samples/sec | 90.07% f1 | 8x A30 | Gigabyte G482-Z54 | SQuAD v1.1 | A30
BERT | 492 samples/sec | 90.07% f1 | 1x 1g.10gb A100 | DGX A100 | SQuAD v1.1 | A100 SXM-80GB
BERT | 3,602 samples/sec | 90.07% f1 | 1x A100 | DGX A100 | SQuAD v1.1 | A100 SXM-80GB
BERT | 28,347 samples/sec | 90.07% f1 | 8x A100 | DGX A100 | SQuAD v1.1 | A100 SXM-80GB
BERT | 22,847 samples/sec | 90.07% f1 | 8x A100 | Gigabyte G482-Z54 | SQuAD v1.1 | A100-PCIe-40GB
BERT | 11,305 samples/sec | 90.07% f1 | 4x A100 | DGX-Station-A100 | SQuAD v1.1 | A100 SXM-80GB
DLRM | 772,378 samples/sec | 80.25% AUC | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | Criteo 1TB Click Logs | A10
DLRM | 1,067,510 samples/sec | 80.25% AUC | 8x A30 | Gigabyte G482-Z54 | Criteo 1TB Click Logs | A30
DLRM | 36,473 samples/sec | 80.25% AUC | 1x 1g.10gb A100 | DGX A100 | Criteo 1TB Click Logs | A100 SXM-80GB
DLRM | 311,826 samples/sec | 80.25% AUC | 1x A100 | DGX A100 | Criteo 1TB Click Logs | A100 SXM-80GB
DLRM | 2,462,300 samples/sec | 80.25% AUC | 8x A100 | DGX A100 | Criteo 1TB Click Logs | A100 SXM-80GB
DLRM | 1,057,550 samples/sec | 80.25% AUC | 4x A100 | DGX-Station-A100 | Criteo 1TB Click Logs | A100 SXM-80GB

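The 1g.10gb rows above report a single MIG (Multi-Instance GPU) slice of an A100; an A100 80GB can be partitioned into up to seven such slices. As an illustrative check (the seven-slice count is an assumption from the MIG configuration, not a value in the table), scaling the slice throughput back up comes close to the full-GPU number:

```python
# ResNet-50 v1.5 offline throughput from the table above.
full_gpu = 38_010       # samples/sec, 1x A100 SXM-80GB
one_mig_slice = 5_108   # samples/sec, 1x 1g.10gb MIG slice

# An A100 80GB supports up to seven 1g.10gb MIG instances (assumption
# stated here for illustration; see the NVIDIA MIG documentation).
aggregate = 7 * one_mig_slice
print(aggregate, f"{aggregate / full_gpu:.0%} of the full GPU")
```

The small gap reflects the memory bandwidth and cache partitioning overhead of MIG isolation.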
Server Scenario - Closed Division

Network | Throughput | Target Accuracy | MLPerf Server Latency Constraint (ms) | GPU | Server | Dataset | GPU Version
ResNet-50 v1.5 | 87,984 queries/sec | 76.46% Top1 | 15 | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | ImageNet | A10
ResNet-50 v1.5 | 115,987 queries/sec | 76.46% Top1 | 15 | 8x A30 | Gigabyte G482-Z54 | ImageNet | A30
ResNet-50 v1.5 | 3,602 queries/sec | 76.46% Top1 | 15 | 1x 1g.10gb A100 | DGX A100 | ImageNet | A100 SXM-80GB
ResNet-50 v1.5 | 30,794 queries/sec | 76.46% Top1 | 15 | 1x A100 | DGX A100 | ImageNet | A100 SXM-80GB
ResNet-50 v1.5 | 259,994 queries/sec | 76.46% Top1 | 15 | 8x A100 | DGX A100 | ImageNet | A100 SXM-80GB
ResNet-50 v1.5 | 207,976 queries/sec | 76.46% Top1 | 15 | 8x A100 | Gigabyte G482-Z54 | ImageNet | A100-PCIe-40GB
ResNet-50 v1.5 | 106,988 queries/sec | 76.46% Top1 | 15 | 4x A100 | DGX-Station-A100 | ImageNet | A100 SXM-80GB
SSD ResNet-34 | 2,000 queries/sec | 0.2 mAP | 100 | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | COCO | A10
SSD ResNet-34 | 3,575 queries/sec | 0.2 mAP | 100 | 8x A30 | Gigabyte G482-Z54 | COCO | A30
SSD ResNet-34 | 100 queries/sec | 0.2 mAP | 100 | 1x 1g.10gb A100 | DGX A100 | COCO | A100 SXM-80GB
SSD ResNet-34 | 926 queries/sec | 0.2 mAP | 100 | 1x A100 | DGX A100 | COCO | A100 SXM-80GB
SSD ResNet-34 | 7,654 queries/sec | 0.2 mAP | 100 | 8x A100 | DGX A100 | COCO | A100 SXM-80GB
SSD ResNet-34 | 6,162 queries/sec | 0.2 mAP | 100 | 8x A100 | Gigabyte G482-Z54 | COCO | A100-PCIe-40GB
SSD ResNet-34 | 3,081 queries/sec | 0.2 mAP | 100 | 4x A100 | DGX-Station-A100 | COCO | A100 SXM-80GB
RNN-T | 22,597 queries/sec | 7.45% WER | 1,000 | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | LibriSpeech | A10
RNN-T | 36,991 queries/sec | 7.45% WER | 1,000 | 8x A30 | Gigabyte G482-Z54 | LibriSpeech | A30
RNN-T | 1,303 queries/sec | 7.45% WER | 1,000 | 1x 1g.10gb A100 | DGX A100 | LibriSpeech | A100 SXM-80GB
RNN-T | 12,751 queries/sec | 7.45% WER | 1,000 | 1x A100 | DGX A100 | LibriSpeech | A100 SXM-80GB
RNN-T | 103,986 queries/sec | 7.45% WER | 1,000 | 8x A100 | DGX A100 | LibriSpeech | A100 SXM-80GB
RNN-T | 85,985 queries/sec | 7.45% WER | 1,000 | 8x A100 | Gigabyte G482-Z54 | LibriSpeech | A100-PCIe-40GB
RNN-T | 43,389 queries/sec | 7.45% WER | 1,000 | 4x A100 | DGX-Station-A100 | LibriSpeech | A100 SXM-80GB
BERT | 7,204 queries/sec | 90.07% f1 | 130 | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | SQuAD v1.1 | A10
BERT | 11,500 queries/sec | 90.07% f1 | 130 | 8x A30 | Gigabyte G482-Z54 | SQuAD v1.1 | A30
BERT | 381 queries/sec | 90.07% f1 | 130 | 1x 1g.10gb A100 | DGX A100 | SQuAD v1.1 | A100 SXM-80GB
BERT | 3,202 queries/sec | 90.07% f1 | 130 | 1x A100 | DGX A100 | SQuAD v1.1 | A100 SXM-80GB
BERT | 25,792 queries/sec | 90.07% f1 | 130 | 8x A100 | DGX A100 | SQuAD v1.1 | A100 SXM-80GB
BERT | 20,792 queries/sec | 90.07% f1 | 130 | 8x A100 | Gigabyte G482-Z54 | SQuAD v1.1 | A100-PCIe-40GB
BERT | 10,203 queries/sec | 90.07% f1 | 130 | 4x A100 | DGX-Station-A100 | SQuAD v1.1 | A100 SXM-80GB
DLRM | 680,147 queries/sec | 80.25% AUC | 30 | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | Criteo 1TB Click Logs | A10
DLRM | 750,204 queries/sec | 80.25% AUC | 30 | 8x A30 | Gigabyte G482-Z54 | Criteo 1TB Click Logs | A30
DLRM | 35,991 queries/sec | 80.25% AUC | 30 | 1x 1g.10gb A100 | DGX A100 | Criteo 1TB Click Logs | A100 SXM-80GB
DLRM | 286,002 queries/sec | 80.25% AUC | 30 | 1x A100 | DGX A100 | Criteo 1TB Click Logs | A100 SXM-80GB
DLRM | 2,302,570 queries/sec | 80.25% AUC | 30 | 8x A100 | DGX A100 | Criteo 1TB Click Logs | A100 SXM-80GB
DLRM | 942,395 queries/sec | 80.25% AUC | 30 | 4x A100 | DGX-Station-A100 | Criteo 1TB Click Logs | A100 SXM-80GB

Power Efficiency Offline Scenario - Closed Division

Network | Throughput | Throughput per Watt | GPU | Server | Dataset | GPU Version
ResNet-50 v1.5 | 213,599 samples/sec | 97.31 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | ImageNet | A100-PCIe-40GB
ResNet-50 v1.5 | 270,706 samples/sec | 78.27 samples/sec/watt | 8x A100 | DGX A100 | ImageNet | A100 SXM-80GB
ResNet-50 v1.5 | 124,529 samples/sec | 98.14 samples/sec/watt | 4x A100 | DGX-Station-A100 | ImageNet | A100 SXM-80GB
SSD ResNet-34 | 5,824 samples/sec | 2.6 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | COCO | A100-PCIe-40GB
SSD ResNet-34 | 6,875 samples/sec | 1.96 samples/sec/watt | 8x A100 | DGX A100 | COCO | A100 SXM-80GB
SSD ResNet-34 | 3,110 samples/sec | 2.44 samples/sec/watt | 4x A100 | DGX-Station-A100 | COCO | A100 SXM-80GB
3D-UNet | 372 samples/sec | 0.16 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | BraTS 2019 | A100-PCIe-40GB
3D-UNet | 433 samples/sec | 0.12 samples/sec/watt | 8x A100 | DGX A100 | BraTS 2019 | A100 SXM-80GB
3D-UNet | 202 samples/sec | 0.16 samples/sec/watt | 4x A100 | DGX-Station-A100 | BraTS 2019 | A100 SXM-80GB
RNN-T | 82,540 samples/sec | 36.23 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | LibriSpeech | A100-PCIe-40GB
RNN-T | 93,803 samples/sec | 26.39 samples/sec/watt | 8x A100 | DGX A100 | LibriSpeech | A100 SXM-80GB
RNN-T | 47,255 samples/sec | 36.16 samples/sec/watt | 4x A100 | DGX-Station-A100 | LibriSpeech | A100 SXM-80GB
BERT | 17,697 samples/sec | 7.73 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | SQuAD v1.1 | A100-PCIe-40GB
BERT | 23,406 samples/sec | 6.77 samples/sec/watt | 8x A100 | DGX A100 | SQuAD v1.1 | A100 SXM-80GB
BERT | 9,865 samples/sec | 7.76 samples/sec/watt | 4x A100 | DGX-Station-A100 | SQuAD v1.1 | A100 SXM-80GB
DLRM | 1,577,960 samples/sec | 730.76 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | Criteo 1TB Click Logs | A100-PCIe-40GB
DLRM | 2,115,950 samples/sec | 619.54 samples/sec/watt | 8x A100 | DGX A100 | Criteo 1TB Click Logs | A100 SXM-80GB
DLRM | 974,571 samples/sec | 762.64 samples/sec/watt | 4x A100 | DGX-Station-A100 | Criteo 1TB Click Logs | A100 SXM-80GB

Power Efficiency Server Scenario - Closed Division

Network | Throughput | Throughput per Watt | GPU | Server | Dataset | GPU Version
ResNet-50 v1.5 | 184,984 queries/sec | 82.39 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | ImageNet | A100-PCIe-40GB
ResNet-50 v1.5 | 239,991 queries/sec | 69.53 queries/sec/watt | 8x A100 | DGX A100 | ImageNet | A100 SXM-80GB
ResNet-50 v1.5 | 106,988 queries/sec | 84.51 queries/sec/watt | 4x A100 | DGX-Station-A100 | ImageNet | A100 SXM-80GB
SSD ResNet-34 | 5,702 queries/sec | 2.52 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | COCO | A100-PCIe-40GB
SSD ResNet-34 | 6,301 queries/sec | 1.82 queries/sec/watt | 8x A100 | DGX A100 | COCO | A100 SXM-80GB
SSD ResNet-34 | 3,081 queries/sec | 2.43 queries/sec/watt | 4x A100 | DGX-Station-A100 | COCO | A100 SXM-80GB
RNN-T | 74,974 queries/sec | 32.25 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | LibriSpeech | A100-PCIe-40GB
RNN-T | 87,984 queries/sec | 24.78 queries/sec/watt | 8x A100 | DGX A100 | LibriSpeech | A100 SXM-80GB
RNN-T | 43,389 queries/sec | 33.03 queries/sec/watt | 4x A100 | DGX-Station-A100 | LibriSpeech | A100 SXM-80GB
BERT | 17,499 queries/sec | 7.58 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | SQuAD v1.1 | A100-PCIe-40GB
BERT | 21,492 queries/sec | 6.03 queries/sec/watt | 8x A100 | DGX A100 | SQuAD v1.1 | A100 SXM-80GB
BERT | 10,203 queries/sec | 7.84 queries/sec/watt | 4x A100 | DGX-Station-A100 | SQuAD v1.1 | A100 SXM-80GB
DLRM | 2,001,940 queries/sec | 575.72 queries/sec/watt | 8x A100 | DGX A100 | Criteo 1TB Click Logs | A100 SXM-80GB
DLRM | 890,334 queries/sec | 663.62 queries/sec/watt | 4x A100 | DGX-Station-A100 | Criteo 1TB Click Logs | A100 SXM-80GB

MLPerf™ v1.0 Inference Closed: ResNet-50 v1.5, SSD ResNet-34, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net, DLRM 99.9% of FP32 accuracy target: 1.0-25, 1.0-26, 1.0-29, 1.0-30, 1.0-32, 1.0-55, 1.0-57. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
BERT-Large sequence length = 384.
DLRM samples refer to an average of 270 pairs per sample
A10 and A30 results are previews
For MLPerf™ various scenario data, click here
For MLPerf™ latency constraints, click here


Inference Performance of NVIDIA A100, V100 and T4

Benchmarks are reproducible by following links to NGC scripts

Inference Natural Language Processing

BERT Inference Throughput

DGX A100 server w/ 1x NVIDIA A100 with 7 MIG instances of 1g.5gb | Batch Size = 94 | Precision: INT8 | Sequence Length = 128
DGX-1 server w/ 1x NVIDIA V100 | TensorRT 7.1 | Batch Size = 256 | Precision: Mixed | Sequence Length = 128


NVIDIA A100 BERT Inference Benchmarks

Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
BERT-Large with Sparsity | Attention | 94 | 6,188 sequences/sec | - | - | 1x A100 | DGX A100 | - | INT8 | SQuAD v1.1 | - | A100 SXM4-40GB

A100 with 7 MIG instances of 1g.5gb | Sequence length=128 | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container

Inference Image Classification on CNNs with TensorRT

ResNet-50 v1.5 Throughput

DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 7.2 | Batch Size = 128 | 21.03-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-SW-QZ-001: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 7.2 | Batch Size = 128 | 21.03-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC@2.25GHz w/ 1x NVIDIA A10 | TensorRT 7.2 | Batch Size = 128 | 21.03-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 7.2 | Batch Size = 128 | 21.03-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 7.2 | Batch Size = 128 | 21.03-py3 | Precision: INT8 | Dataset: Synthetic


ResNet-50 v1.5 Power Efficiency

DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 7.2 | Batch Size = 128 | 21.03-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-SW-QZ-001: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 7.2 | Batch Size = 128 | 21.03-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC@2.25GHz w/ 1x NVIDIA A10 | TensorRT 7.2 | Batch Size = 128 | 21.03-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 7.2 | Batch Size = 128 | 21.03-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 7.2 | Batch Size = 128 | 21.03-py3 | Precision: INT8 | Dataset: Synthetic


A100 1/7 MIG Inference Performance

Network | Batch Size | 1/7 MIG Throughput | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 1 | 1,411 images/sec | 0.71 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 | 2 | 2,221 images/sec | 0.9 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 | 8 | 3,418 images/sec | 2.34 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 | 27 | 4,052 images/sec | 6.66 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 | 128 | 4,480 images/sec | 28.57 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 v1.5 | 1 | 1,384 images/sec | 0.72 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 v1.5 | 2 | 2,150 images/sec | 0.93 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 v1.5 | 8 | 3,316 images/sec | 2.41 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 v1.5 | 27 | 3,915 images/sec | 6.9 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 v1.5 | 128 | 4,323 images/sec | 29.61 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
BERT-BASE | 1 | 724 sequences/sec | 1.38 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB
BERT-BASE | 2 | 1,095 sequences/sec | 1.83 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB
BERT-BASE | 8 | 1,657 sequences/sec | 4.83 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB
BERT-BASE | 19 | 1,917 sequences/sec | 9.91 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB
BERT-BASE | 128 | 2,163 sequences/sec | 59.17 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB
BERT-LARGE | 1 | 265 sequences/sec | 3.78 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB
BERT-LARGE | 2 | 376 sequences/sec | 5.31 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB
BERT-LARGE | 5 | 517 sequences/sec | 9.67 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB
BERT-LARGE | 8 | 550 sequences/sec | 14.56 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB
BERT-LARGE | 128 | 674 sequences/sec | 189.89 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB

A hyphen in the Container column indicates a pre-release container | A hyphen in the Server column indicates a pre-production server
BERT-Large: Sequence Length = 128

A100 7 MIG Inference Performance

Network | Batch Size | 7 MIG Throughput | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 1 | 9,607 images/sec | 0.71 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 | 2 | 14,947 images/sec | 0.9 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 | 8 | 23,533 images/sec | 2.34 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 | 27 | 28,042 images/sec | 6.66 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 | 128 | 31,225 images/sec | 28.57 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 v1.5 | 1 | 9,350 images/sec | 0.72 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 v1.5 | 2 | 14,734 images/sec | 0.93 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 v1.5 | 8 | 22,822 images/sec | 2.41 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 v1.5 | 26 | 27,254 images/sec | 6.9 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 v1.5 | 128 | 30,120 images/sec | 29.61 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
BERT-BASE | 1 | 5,144 sequences/sec | 1.37 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB
BERT-BASE | 2 | 8,265 sequences/sec | 1.71 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB
BERT-BASE | 8 | 11,665 sequences/sec | 4.82 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB
BERT-BASE | 18 | 13,250 sequences/sec | 9.53 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB
BERT-BASE | 128 | 14,420 sequences/sec | 62.18 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB
BERT-LARGE | 1 | 1,787 sequences/sec | 3.93 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB
BERT-LARGE | 2 | 2,632 sequences/sec | 5.35 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB
BERT-LARGE | 5 | 3,575 sequences/sec | 9.84 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB
BERT-LARGE | 8 | 3,810 sequences/sec | 14.78 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB
BERT-LARGE | 128 | 4,465 sequences/sec | 200.87 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM4-40GB

A hyphen in the Container column indicates a pre-release container | A hyphen in the Server column indicates a pre-production server
BERT-Large: Sequence Length = 128

A100 Full Chip Inference Performance

Network | Batch Size | Full Chip Throughput | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 2 | 4,049 images/sec | 0.49 | 1x A100 | DGX A100 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 | 8 | 11,304 images/sec | 0.71 | 1x A100 | DGX A100 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 | 128 | 29,128 images/sec | 4.39 | 1x A100 | DGX A100 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 | 211 | 30,899 images/sec | 6.8 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 v1.5 | 2 | 3,898 images/sec | 0.51 | 1x A100 | DGX A100 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 v1.5 | 8 | 10,985 images/sec | 0.73 | 1x A100 | DGX A100 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 v1.5 | 128 | 28,247 images/sec | 4.53 | 1x A100 | DGX A100 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNet-50 v1.5 | 208 | 29,896 images/sec | 7 | 1x A100 | DGX A100 | 20.12-py3 | INT8 | Synthetic | TensorRT 7.2 | A100-SXM-80GB
ResNext101 | 32 | 7,674 samples/sec | 4.17 | 1x A100 | - | - | INT8 | Synthetic | TensorRT 7.2 | A100-SXM4-40GB
EfficientNet-B0 | 128 | 22,346 images/sec | 5.73 | 1x A100 | - | - | INT8 | Synthetic | TensorRT 7.2 | A100-SXM4-40GB
BERT-BASE | 2 | 2,425 sequences/sec | 0.82 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM4-40GB
BERT-BASE | 8 | 6,837 sequences/sec | 1.2 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM4-40GB
BERT-BASE | 128 | 13,700 sequences/sec | 9.3 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM4-40GB
BERT-BASE | 256 | 14,490 sequences/sec | 17.67 | 1x A100 | DGX A100 | - | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM4-40GB
BERT-LARGE | 2 | 1,087 sequences/sec | 1.8 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM4-40GB
BERT-LARGE | 8 | 2,232 sequences/sec | 3.6 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM4-40GB
BERT-LARGE | 128 | 4,509 sequences/sec | 28 | 1x A100 | DGX A100 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM4-40GB
BERT-LARGE | 256 | 4,679 sequences/sec | 54.71 | 1x A100 | - | - | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | A100-SXM-80GB

A hyphen in the Container column indicates a pre-release container | A hyphen in the Server column indicates a pre-production server
BERT-Large: Sequence Length = 128
For BS=1 inference refer to the Triton Inference Server tab


A30 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 1 | 2,049 images/sec | 24 images/sec/watt | 0.49 | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A30
ResNet-50 | 2 | 3,487 images/sec | 38 images/sec/watt | 0.57 | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A30
ResNet-50 | 8 | 8,497 images/sec | 70 images/sec/watt | 0.94 | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A30
ResNet-50 | 128 | 15,299 images/sec | 93 images/sec/watt | 8.37 | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A30
ResNet-50 v1.5 | 1 | 2,049 images/sec | 24 images/sec/watt | 0.49 | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A30
ResNet-50 v1.5 | 2 | 3,498 images/sec | 37 images/sec/watt | 0.57 | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A30
ResNet-50 v1.5 | 8 | 8,330 images/sec | 68 images/sec/watt | 0.96 | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A30
ResNet-50 v1.5 | 128 | 14,778 images/sec | 90 images/sec/watt | 8.66 | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A30

NGC: TensorRT Container
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container


A10 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 1 | 2,510 images/sec | 20 images/sec/watt | 0.4 | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A10
ResNet-50 | 2 | 4,362 images/sec | 30 images/sec/watt | 0.46 | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A10
ResNet-50 | 8 | 7,803 images/sec | 52 images/sec/watt | 1.03 | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A10
ResNet-50 | 128 | 11,779 images/sec | 79 images/sec/watt | 10.87 | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A10
ResNet-50 v1.5 | 1 | 2,509 images/sec | 20 images/sec/watt | 0.4 | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A10
ResNet-50 v1.5 | 2 | 4,256 images/sec | 29 images/sec/watt | 0.47 | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A10
ResNet-50 v1.5 | 8 | 7,517 images/sec | 50 images/sec/watt | 1.06 | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A10
ResNet-50 v1.5 | 128 | 11,136 images/sec | 74 images/sec/watt | 11.49 | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A10
BERT-BASE | 1 | 1,415 sequences/sec | - | 0.71 | 1x A10 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2 | A10
BERT-BASE | 2 | 2,109 sequences/sec | 15 sequences/sec/watt | 0.95 | 1x A10 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2 | A10
BERT-BASE | 8 | 3,634 sequences/sec | 27 sequences/sec/watt | 2.73 | 1x A10 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2 | A10
BERT-BASE | 128 | 4,496 sequences/sec | 34 sequences/sec/watt | 28.47 | 1x A10 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2 | A10
BERT-LARGE | 1 | 576 sequences/sec | 5 sequences/sec/watt | 1.74 | 1x A10 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2 | A10
BERT-LARGE | 2 | 788 sequences/sec | 6 sequences/sec/watt | 2.54 | 1x A10 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2 | A10
BERT-LARGE | 8 | 1,281 sequences/sec | 10 sequences/sec/watt | 6.25 | 1x A10 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2 | A10
BERT-LARGE | 128 | 1,368 sequences/sec | 10 sequences/sec/watt | 93.57 | 1x A10 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Sample Text | TensorRT 7.2 | A10

NGC: TensorRT Container
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container


V100 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 2 | 1,976 images/sec | 11 images/sec/watt | 1 | 1x V100 | DGX-2 | 21.03-py3 | Mixed | Synthetic | TensorRT 7.2 | V100-SXM3-32GB
ResNet-50 | 8 | 4,304 images/sec | 16 images/sec/watt | 1.86 | 1x V100 | DGX-2 | 21.03-py3 | Mixed | Synthetic | TensorRT 7.2 | V100-SXM3-32GB
ResNet-50 | 128 | 8,168 images/sec | 24 images/sec/watt | 15.67 | 1x V100 | DGX-2 | 21.03-py3 | Mixed | Synthetic | TensorRT 7.2 | V100-SXM3-32GB
ResNet-50 v1.5 | 2 | 1,979 images/sec | 11 images/sec/watt | 1.01 | 1x V100 | DGX-2 | 21.03-py3 | Mixed | Synthetic | TensorRT 7.2 | V100-SXM3-32GB
ResNet-50 v1.5 | 8 | 4,219 images/sec | 16 images/sec/watt | 1.9 | 1x V100 | DGX-2 | 21.03-py3 | Mixed | Synthetic | TensorRT 7.2 | V100-SXM3-32GB
ResNet-50 v1.5 | 128 | 7,815 images/sec | 23 images/sec/watt | 16.38 | 1x V100 | DGX-2 | 21.03-py3 | Mixed | Synthetic | TensorRT 7.2 | V100-SXM3-32GB
BERT-BASE | 8 | 2,315 sequences/sec | 8 sequences/sec/watt | 3.46 | 1x V100 | DGX-2 | 20.11-py3 | Mixed | Real (Q&A provided as text input) | TensorRT 7.2 | V100-SXM3-32GB
BERT-BASE | 128 | 3,194 sequences/sec | 11 sequences/sec/watt | 40.08 | 1x V100 | DGX-2 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | V100-SXM3-32GB
BERT-LARGE | 2 | 523 sequences/sec | 2 sequences/sec/watt | 3.82 | 1x V100 | DGX-2 | 20.11-py3 | Mixed | Real (Q&A provided as text input) | TensorRT 7.2 | V100-SXM3-32GB
BERT-LARGE | 8 | 792 sequences/sec | 3 sequences/sec/watt | 10.1 | 1x V100 | DGX-2 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | V100-SXM3-32GB
BERT-LARGE | 128 | 978 sequences/sec | 3 sequences/sec/watt | 130.92 | 1x V100 | DGX-2 | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | V100-SXM3-32GB

NGC: TensorRT Container
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container
For BS=1 inference refer to the Triton Inference Server tab


T4 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 2 | 2,105 images/sec | 30 images/sec/watt | 0.95 | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | NVIDIA T4
ResNet-50 | 8 | 3,916 images/sec | 56 images/sec/watt | 2.04 | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | NVIDIA T4
ResNet-50 | 128 | 5,073 images/sec | 75 images/sec/watt | 25.23 | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | NVIDIA T4
ResNet-50 v1.5 | 2 | 2,092 images/sec | 30 images/sec/watt | 0.96 | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | NVIDIA T4
ResNet-50 v1.5 | 8 | 3,715 images/sec | 53 images/sec/watt | 2.15 | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | NVIDIA T4
ResNet-50 v1.5 | 128 | 4,824 images/sec | 69 images/sec/watt | 26.53 | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | NVIDIA T4
BERT-BASE | 2 | 1,079 sequences/sec | 17 sequences/sec/watt | 1.85 | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | NVIDIA T4
BERT-BASE | 8 | 1,720 sequences/sec | 28 sequences/sec/watt | 4.65 | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | NVIDIA T4
BERT-BASE | 128 | 1,818 sequences/sec | 28 sequences/sec/watt | 70.4 | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | NVIDIA T4
BERT-LARGE | 2 | 390 sequences/sec | 6 sequences/sec/watt | 5.12 | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | NVIDIA T4
BERT-LARGE | 8 | 555 sequences/sec | 9 sequences/sec/watt | 14.41 | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | NVIDIA T4
BERT-LARGE | 128 | 561 sequences/sec | 8 sequences/sec/watt | 227.98 | 1x T4 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | INT8 | Real (Q&A provided as text input) | TensorRT 7.2 | NVIDIA T4

NGC: TensorRT Container
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container
For BS=1 inference refer to the Triton Inference Server tab

The Triton Inference Server is open source inference serving software that maximizes performance and simplifies the deployment of AI models at scale in production. Triton lets teams deploy trained AI models from multiple frameworks (TensorFlow, TensorRT, PyTorch, ONNX Runtime, OpenVINO, or custom backends). They can deploy from local storage, Google Cloud Platform, or Amazon S3 on any GPU- or CPU-based infrastructure (in the cloud, in the data center, or on embedded devices). Triton is open source on GitHub and available as a Docker container on NGC.
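As an illustrative sketch of how a model is exposed through Triton, each model in the model repository carries a `config.pbtxt`; the model name, tensor names, and batch settings below are hypothetical, not taken from the measured configurations on this page:

```
# model_repository/resnet50_trt/config.pbtxt  (hypothetical model and tensor names)
name: "resnet50_trt"
platform: "tensorrt_plan"        # serve a prebuilt TensorRT engine
max_batch_size: 128
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "prob"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
# Let Triton batch concurrent single-image requests on the server side
dynamic_batching {
  preferred_batch_size: [ 8, 32 ]
  max_queue_delay_microseconds: 100
}
instance_group [
  {
    count: 2      # run two copies of the model per GPU
    kind: KIND_GPU
  }
]
```

Launching `tritonserver --model-repository=model_repository` then serves the model over HTTP/gRPC; `instance_group.count` and `dynamic_batching` correspond to the "Model Instances on Triton" and "Dynamic Batch Size (Triton)" columns reported further down this page.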

NVIDIA landed top performance spots on all MLPerf™ Inference 1.0 tests, the AI industry's leading benchmark competition. For inference submissions, we have typically used a custom A100 inference serving harness. This custom harness was designed and optimized specifically to deliver the highest possible inference performance for MLPerf™ workloads, which require running inference on bare metal.


NVIDIA TRITON Inference Delivered Performance vs. MLPerf v1.0

MLPerf™ v1.0 A100 Inference Closed: ResNet-50 v1.5, SSD ResNet-34, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net, DLRM 99% of FP32 accuracy target: 1.0-30, 1.0-31. MLPerf name and logo are trademarks. See www.mlcommons.org for more information.​

Starting with the previous MLPerf™ round (v0.7), Triton Inference Server has been used to submit GPU inference results. The chart above compares the performance of Triton to the custom MLPerf™ serving harness across five different TensorRT networks on bare metal. The results show that Triton is highly efficient and delivers performance nearly identical to the highly optimized MLPerf™ harness.

To deliver this performance, the team brought many optimizations to Triton, such as new lightweight data structures for low-latency communication with applications, support for variable-sequence-length inputs to avoid padding, and CUDA Graphs in the TensorRT backend for higher inference performance. These enhancements are available in every Triton release starting with 20.09. In the latest MLPerf™ v1.0 submission, we used Triton for both the GPU and CPU inference submissions, adding a new OpenVINO backend in Triton for high-performance inference on CPUs.


NVIDIA Client BS=1 Performance with Triton Inference Server

Network | Accelerator | Training Framework | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version
ResNet-50 V1.5 Inference | A100-PCIE-40GB | PyTorch | TensorRT | FP16 | 2 | 1 | 64 | 256 | 61.02 | 4,197 inf/sec | - | 20.07-py3
ResNet-50 V1.5 Inference | A100-SXM4-40GB | PyTorch | TensorRT | TF32 | 2 | 1 | 64 | 256 | 48.35 | 5,294 inf/sec | - | 21.03-py3
ResNet-50 V1.5 Inference | NVIDIA T4 | PyTorch | TensorRT | FP16 | 1 | 1 | 64 | 256 | 257.91 | 992 inf/sec | - | 20.07-py3
ResNet-50 V1.5 Inference | V100 SXM2-32GB | PyTorch | TensorRT | FP32 | 4 | 1 | 64 | 384 | 215.79 | 1,781 inf/sec | - | 21.03-py3
BERT Large Inference | A100-PCIE-40GB | TensorFlow | TensorRT | FP16 | 1 | 1 | 8 | 16 | 17.48 | 915 inf/sec | 384 | 20.09-py3
BERT Large Inference | A100-SXM4-40GB | TensorFlow | TensorRT | FP16 | 1 | 1 | 8 | 16 | 16.1 | 994 inf/sec | 384 | 20.09-py3
BERT Large Inference | NVIDIA T4 | TensorFlow | TensorRT | FP16 | 1 | 1 | 8 | 16 | 81.14 | 197 inf/sec | 384 | 20.09-py3
DLRM Inference | A100-PCIE-40GB | PyTorch | Torchscript | FP16 | 1 | 1 | 65,536 | 9 | 7.51 | 1,197 inf/sec | - | 20.08-py3
DLRM Inference | A100-SXM4-40GB | PyTorch | Torchscript | FP16 | 1 | 1 | 65,536 | 20 | 16.07 | 1,245 inf/sec | - | 20.08-py3
DLRM Inference | V100-SXM2-32GB | PyTorch | Torchscript | FP16 | 1 | 1 | 65,536 | 22 | 25.04 | 879 inf/sec | - | 20.08-py3
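A quick way to sanity-check measurements like these is Little's law: at steady state with a closed-loop client, throughput ≈ concurrent requests / average latency. A minimal sketch, using the DLRM A100-SXM4-40GB figures reported above (20 concurrent requests at 16.07 ms average latency):

```python
# Little's law: steady-state throughput = requests in flight / average latency.
def expected_throughput(concurrency: int, latency_ms: float) -> float:
    """Approximate inferences/sec for a closed-loop client keeping
    `concurrency` requests in flight at `latency_ms` average latency."""
    return concurrency / (latency_ms / 1000.0)

# DLRM on A100-SXM4-40GB: 20 concurrent requests at 16.07 ms
print(round(expected_throughput(20, 16.07)))  # ~1245, matching the 1,245 inf/sec measured
```

The same check holds for the other rows (e.g. 256 concurrent ResNet-50 requests at 48.35 ms give ≈ 5,294 inf/sec), which is a useful way to confirm a serving setup is not leaking time outside the request loop.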

NVIDIA Jarvis is an application framework for multimodal conversational AI services that delivers real-time performance on GPUs. Jarvis 1.0 Beta includes fully optimized pipelines for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) that can be used to deploy real-time conversational AI apps such as transcription, virtual assistants, and chatbots. Please visit Jarvis – Getting Started to download and get started with Jarvis.


Jarvis Benchmarks

Automatic Speech Recognition

A100 Best Streaming Throughput Mode (800 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 14.4 | 1 | A100 SXM4-40GB
Quartznet | 256 | 254.3 | 64 | A100 SXM4-40GB
Quartznet | 512 | 351.2 | 506 | A100 SXM4-40GB
Quartznet | 1,024 | 630.8 | 1,005 | A100 SXM4-40GB
Jasper | 1 | 17.6 | 1 | A100 SXM4-40GB
Jasper | 256 | 244.9 | 254 | A100 SXM4-40GB
Jasper | 512 | 381 | 507 | A100 SXM4-40GB
Jasper | 1,024 | 749.3 | 1,004 | A100 SXM4-40GB

A100 Best Streaming Latency Mode (100 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 9.6 | 1 | A100 SXM4-40GB
Quartznet | 16 | 25.9 | 16 | A100 SXM4-40GB
Quartznet | 128 | 132.4 | 128 | A100 SXM4-40GB
Jasper | 1 | 13.4 | 1 | A100 SXM4-40GB
Jasper | 16 | 26.3 | 16 | A100 SXM4-40GB
Jasper | 128 | 258.9 | 128 | A100 SXM4-40GB

A100 Offline Mode (3200 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 28.1 | 1 | A100 SXM4-40GB
Quartznet | 512 | 566.5 | 505 | A100 SXM4-40GB
Quartznet | 1,024 | 899.3 | 1,000 | A100 SXM4-40GB
Quartznet | 1,512 | 1,303.8 | 1,460 | A100 SXM4-40GB
Jasper | 1 | 31 | 1 | A100 SXM4-40GB
Jasper | 512 | 667.5 | 504 | A100 SXM4-40GB
Jasper | 1,024 | 1,089 | 997 | A100 SXM4-40GB
Jasper | 1,512 | 1,753.8 | 1,449 | A100 SXM4-40GB

V100 Best Streaming Throughput Mode (800 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 14.4 | 1 | V100 SXM2-16GB
Quartznet | 256 | 222.2 | 254 | V100 SXM2-16GB
Quartznet | 512 | 385.2 | 505 | V100 SXM2-16GB
Quartznet | 768 | 574.5 | 752 | V100 SXM2-16GB
Jasper | 1 | 26.8 | 1 | V100 SXM2-16GB
Jasper | 128 | 239.4 | 127 | V100 SXM2-16GB
Jasper | 256 | 416 | 253 | V100 SXM2-16GB
Jasper | 512 | 969.7 | 500 | V100 SXM2-16GB

V100 Best Streaming Latency Mode (100 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 8.8 | 1 | V100 SXM2-16GB
Quartznet | 16 | 22.4 | 16 | V100 SXM2-16GB
Quartznet | 128 | 114.7 | 127 | V100 SXM2-16GB
Jasper | 1 | 21.5 | 1 | V100 SXM2-16GB
Jasper | 16 | 36.9 | 16 | V100 SXM2-16GB
Jasper | 64 | 406.4 | 64 | V100 SXM2-16GB
Jasper | 512 | 969.7 | 500 | V100 SXM2-16GB

V100 Offline Mode (3200 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 32.93 | 1 | V100 SXM2-16GB
Quartznet | 256 | 461.44 | 253 | V100 SXM2-16GB
Quartznet | 512 | 784.73 | 502 | V100 SXM2-16GB
Quartznet | 768 | 1,121.6 | 747 | V100 SXM2-16GB
Quartznet | 1,024 | 1,551.5 | 986 | V100 SXM2-16GB
Jasper | 1 | 48.35 | 1 | V100 SXM2-16GB
Jasper | 256 | 734.99 | 252 | V100 SXM2-16GB
Jasper | 512 | 1,423.3 | 498 | V100 SXM2-16GB
Jasper | 768 | 2,190.2 | 730 | V100 SXM2-16GB

T4 Best Streaming Throughput Mode (800 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 33.18 | 1 | NVIDIA T4
Quartznet | 64 | 162.63 | 64 | NVIDIA T4
Quartznet | 128 | 263.6 | 127 | NVIDIA T4
Quartznet | 256 | 449.28 | 253 | NVIDIA T4
Quartznet | 384 | 732.75 | 376 | NVIDIA T4
Jasper | 1 | 72.37 | 1 | NVIDIA T4
Jasper | 64 | 259.64 | 64 | NVIDIA T4
Jasper | 128 | 450.81 | 127 | NVIDIA T4
Jasper | 256 | 1,200.8 | 249 | NVIDIA T4

T4 Best Streaming Latency Mode (100 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 19.2 | 1 | NVIDIA T4
Quartznet | 16 | 56.4 | 16 | NVIDIA T4
Quartznet | 64 | 242.4 | 64 | NVIDIA T4
Jasper | 1 | 46.9 | 1 | NVIDIA T4
Jasper | 8 | 51.1 | 8 | NVIDIA T4
Jasper | 16 | 84.4 | 16 | NVIDIA T4

T4 Offline Mode (3200 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 157.62 | 1 | NVIDIA T4
Quartznet | 256 | 906.17 | 251 | NVIDIA T4
Quartznet | 512 | 1,515.2 | 495 | NVIDIA T4
Jasper | 1 | 96.2 | 1 | NVIDIA T4
Jasper | 256 | 1,758.4 | 247 | NVIDIA T4

ASR Throughput (RTFX) - Number of seconds of audio processed per second | Audio Chunk Size – Server side configuration indicating the amount of new data to be considered by the acoustic model | ASR Dataset: Librispeech | The latency numbers were measured using the streaming recognition mode, with the BERT-Base punctuation model enabled, a 4-gram language model, a decoder beam width of 128 and timestamps enabled. The client and the server were using audio chunks of the same duration (100ms, 800ms, 3200ms depending on the server configuration). The Jarvis streaming client jarvis_streaming_asr_client, provided in the Jarvis client image was used with the --simulate_realtime flag to simulate transcription from a microphone, where each stream was doing 5 iterations over a sample audio file from the Librispeech dataset (1272-135031-0000.wav) | Jarvis version: v1.0.0-b1 | Hardware: NVIDIA DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
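The RTFX metric above can be computed directly from its definition; a minimal sketch, where the run parameters (512 streams, 60 s of audio per stream, 60.7 s of wall-clock time) are made-up illustrative values rather than measurements from this page:

```python
# RTFX = seconds of audio processed per second of wall-clock time.
# When N real-time streams are fully kept up with, RTFX approaches N.
def rtfx(audio_seconds: float, wall_seconds: float) -> float:
    return audio_seconds / wall_seconds

# Hypothetical run: 512 streams, each feeding 60 s of audio, done in 60.7 s
print(round(rtfx(512 * 60, 60.7)))  # ~506, i.e. about 512 streams sustained in real time
```

This is why the streaming-mode RTFX columns track the stream counts so closely: each stream supplies audio at exactly real-time rate, so an unsaturated server processes almost N seconds of audio per second.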

Natural Language Processing

A100 Benchmarks
Task | # of streams | Avg Latency (ms) | Throughput (seq/sec) | GPU Version
NER | 1 | 3.19 | 311 | A100 SXM4-40GB
NER | 256 | 95.5 | 2,549 | A100 SXM4-40GB
Q&A | 1 | 4.95 | 201 | A100 SXM4-40GB
Q&A | 128 | 279 | 453 | A100 SXM4-40GB

V100 Benchmarks
Task | # of streams | Avg Latency (ms) | Throughput (seq/sec) | GPU Version
NER | 1 | 4.87 | 204 | V100 SXM2-16GB
NER | 256 | 135 | 1,797 | V100 SXM2-16GB
Q&A | 1 | 7.47 | 134 | V100 SXM2-16GB
Q&A | 128 | 521 | 244 | V100 SXM2-16GB

T4 Benchmarks
Task | # of streams | Avg Latency (ms) | Throughput (seq/sec) | GPU Version
NER | 1 | 9.31 | 107 | NVIDIA T4
NER | 256 | 255 | 960 | NVIDIA T4
Q&A | 1 | 11.5 | 87 | NVIDIA T4
Q&A | 128 | 571 | 223 | NVIDIA T4

Named Entity Recognition (NER): 128 seq len, BERT-base | Question Answering (QA): 384 seq len, BERT-large | NLP Throughput (seq/s) - Number of sequences processed per second | Performance of the Jarvis named entity recognition (NER) service (using a BERT-base model, sequence length of 128) and the Jarvis question answering (QA) service (using a BERT-large model, sequence length of 384) was measured in Jarvis. Batch size 1 latency and maximum throughput were measured. Jarvis version: v1.0.0-b1 | Hardware: NVIDIA DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
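For the single-stream rows, throughput is simply the reciprocal of the average latency, which is a quick consistency check on the batch-size-1 numbers above:

```python
# For one stream processed serially, throughput (seq/sec) = 1 / latency.
def single_stream_throughput(latency_ms: float) -> float:
    return 1000.0 / latency_ms

# NER on A100 reports 3.19 ms at batch size 1
print(round(single_stream_throughput(3.19)))  # ~313 seq/sec, close to the 311 reported
```

The small gap between the reciprocal and the reported figure reflects client-side and transport overhead outside the model itself.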

Text to Speech

A100 Benchmarks
# of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version
1 | 0.06 | 0.04 | 20 | A100 SXM4-40GB
4 | 0.48 | 0.03 | 37 | A100 SXM4-40GB
6 | 0.69 | 0.03 | 42 | A100 SXM4-40GB
8 | 0.88 | 0.03 | 46 | A100 SXM4-40GB
10 | 1.06 | 0.03 | 49 | A100 SXM4-40GB

V100 Benchmarks
# of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version
1 | 0.08 | 0.05 | 14 | V100 SXM2-16GB
4 | 0.77 | 0.05 | 23 | V100 SXM2-16GB
6 | 1.11 | 0.05 | 26 | V100 SXM2-16GB
8 | 1.4 | 0.06 | 28 | V100 SXM2-16GB
10 | 1.74 | 0.07 | 28 | V100 SXM2-16GB

T4 Benchmarks
# of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version
1 | 0.12 | 0.07 | 11 | NVIDIA T4
4 | 1.02 | 0.07 | 17 | NVIDIA T4
6 | 1.59 | 0.07 | 18 | NVIDIA T4
8 | 2.13 | 0.08 | 19 | NVIDIA T4
10 | 2.55 | 0.1 | 18 | NVIDIA T4

TTS Throughput (RTFX) - Number of seconds of audio generated per second | Dataset: LJSpeech | Performance of the Jarvis text-to-speech (TTS) service was measured for different number of parallel streams. Each parallel stream performed 10 iterations over 10 input strings from the LJSpeech dataset. Latency to first audio chunk and latency between successive audio chunks and throughput were measured. Jarvis version: v1.0.0-b1 | Hardware: NVIDIA DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz


Last updated: April 21, 2021