Review the latest GPU acceleration factors of popular HPC applications.

Please refer to the Measuring Training and Inferencing Performance on NVIDIA AI Platforms Reviewer’s Guide for instructions on how to reproduce these performance claims.


Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology for testing AI systems: it shows whether they are ready to be deployed in the field, since converged networks deliver meaningful results (for example, correctly performing image recognition on video streams). Read our blog on convergence for more details. Training that does not converge measures the hardware’s throughput on the specified AI network, but it is not representative of real-world applications.

NVIDIA’s complete solution stack, from GPUs to libraries and containers on NVIDIA GPU Cloud (NGC), allows data scientists to get up and running with deep learning quickly. NVIDIA A100 Tensor Core GPUs provide unprecedented acceleration at every scale, setting records in MLPerf™, the AI industry’s leading benchmark, a testament to our accelerated platform approach.

NVIDIA Performance on MLPerf 1.0 AI Benchmarks

BERT Time to Train on A100

PyTorch | Precision: Mixed | Dataset: Wikipedia 2020/01/01 | Convergence criteria - refer to MLPerf requirements

MLPerf Training Performance

NVIDIA A100 Performance on MLPerf 1.0 AI Benchmarks - Closed Division

Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
MXNet | ResNet-50 v1.5 | 28.77 | 75.90% classification | 8x A100 | DGX A100 | 1.0-1059 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | ResNet-50 v1.5 | 4.91 | 75.90% classification | 64x A100 | DGX A100 | 1.0-1064 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | ResNet-50 v1.5 | 0.58 | 75.90% classification | 1024x A100 | DGX A100 | 1.0-1072 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | ResNet-50 v1.5 | 0.4 | 75.90% classification | 2480x A100 | DGX A100 | 1.0-1076 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | SSD | 8.52 | 23.0% mAP | 8x A100 | DGX A100 | 1.0-1059 | Mixed | COCO2017 | A100-SXM4-80GB
MXNet | SSD | 1.9 | 23.0% mAP | 64x A100 | DGX A100 | 1.0-1064 | Mixed | COCO2017 | A100-SXM4-80GB
MXNet | SSD | 0.48 | 23.0% mAP | 1024x A100 | DGX A100 | 1.0-1072 | Mixed | COCO2017 | A100-SXM4-80GB
MXNet | UNet-3D | 29.16 | 0.908 Mean DICE score | 8x A100 | DGX A100 | 1.0-1059 | Mixed | KiTS19 | A100-SXM4-80GB
MXNet | UNet-3D | 4.68 | 0.908 Mean DICE score | 104x A100 | DGX A100 | 1.0-1066 | Mixed | KiTS19 | A100-SXM4-80GB
MXNet | UNet-3D | 3 | 0.908 Mean DICE score | 800x A100 | DGX A100 | 1.0-1071 | Mixed | KiTS19 | A100-SXM4-80GB
PyTorch | BERT | 21.69 | 0.712 Mask-LM accuracy | 8x A100 | DGX A100 | 1.0-1060 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | BERT | 3.37 | 0.712 Mask-LM accuracy | 64x A100 | DGX A100 | 1.0-1065 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | BERT | 0.73 | 0.712 Mask-LM accuracy | 1024x A100 | DGX A100 | 1.0-1073 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | BERT | 0.32 | 0.712 Mask-LM accuracy | 4096x A100 | DGX A100 | 1.0-1077 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | Mask R-CNN | 50.39 | 0.377 Box min AP and 0.339 Mask min AP | 8x A100 | DGX A100 | 1.0-1060 | Mixed | COCO2017 | A100-SXM4-80GB
PyTorch | Mask R-CNN | 15.75 | 0.377 Box min AP and 0.339 Mask min AP | 32x A100 | DGX A100 | 1.0-1062 | Mixed | COCO2017 | A100-SXM4-80GB
PyTorch | Mask R-CNN | 3.95 | 0.377 Box min AP and 0.339 Mask min AP | 272x A100 | DGX A100 | 1.0-1070 | Mixed | COCO2017 | A100-SXM4-80GB
PyTorch | RNN-T | 38.7 | 0.058 Word Error Rate | 8x A100 | DGX A100 | 1.0-1060 | Mixed | LibriSpeech | A100-SXM4-80GB
PyTorch | RNN-T | 4.41 | 0.058 Word Error Rate | 128x A100 | DGX A100 | 1.0-1068 | Mixed | LibriSpeech | A100-SXM4-80GB
PyTorch | RNN-T | 2.75 | 0.058 Word Error Rate | 1536x A100 | DGX A100 | 1.0-1074 | Mixed | LibriSpeech | A100-SXM4-80GB
TensorFlow | MiniGo | 269.54 | 50% win rate vs. checkpoint | 8x A100 | DGX A100 | 1.0-1061 | Mixed | Go | A100-SXM4-80GB
TensorFlow | MiniGo | 29.32 | 50% win rate vs. checkpoint | 256x A100 | DGX A100 | 1.0-1069 | Mixed | Go | A100-SXM4-80GB
TensorFlow | MiniGo | 15.53 | 50% win rate vs. checkpoint | 1792x A100 | DGX A100 | 1.0-1075 | Mixed | Go | A100-SXM4-80GB
NVIDIA Merlin HugeCTR | DLRM | 1.96 | 0.8025 AUC | 8x A100 | DGX A100 | 1.0-1058 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB
NVIDIA Merlin HugeCTR | DLRM | 1.05 | 0.8025 AUC | 64x A100 | DGX A100 | 1.0-1063 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB
NVIDIA Merlin HugeCTR | DLRM | 0.99 | 0.8025 AUC | 112x A100 | DGX A100 | 1.0-1067 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB

Converged Training Performance of NVIDIA A100, A40, A30, A10, V100 and T4

Benchmarks are reproducible by following links to NGC scripts

A100 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.6.0 | ResNet-50 v1.5 | 40 | 75.9 Top 1 Accuracy | 22,008 images/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 408 | ImageNet2012 | A100-SXM4-40GB
PyTorch | 1.8.0a0 | Mask R-CNN | 176 | .34 AP Segm | 167 images/sec | 8x A100 | DGX A100 | 21.12-py3 | TF32 | 8 | COCO 2014 | A100-SXM-80GB
PyTorch | 1.6.0a0 | SSD v1.1 | 43 | .25 mAP | 3,092 images/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 128 | COCO 2017 | A100-SXM-80GB
PyTorch | 1.9.0a0 | Tacotron2 | 100 | .56 Training Loss | 308,404 total output mels/sec | 8x A100 | DGX A100 | 21.05-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM-80GB
PyTorch | 1.9.0a0 | WaveGlow | 295 | -5.86 Training Loss | 1,453,539 output samples/sec | 8x A100 | DGX A100 | 21.05-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM-80GB
PyTorch | 1.6.0a0 | Jasper | 3,600 | 3.53 dev-clean WER | 603 sequences/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 64 | LibriSpeech | A100 SXM4-40GB
PyTorch | 1.6.0a0 | Transformer | 167 | 27.76 BLEU Score | 582,721 words/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 10240 | wmt14-en-de | A100 SXM4-40GB
PyTorch | 1.6.0a0 | FastPitch | 216 | .18 Training Loss | 1,040,206 frames/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 32 | LJSpeech 1.1 | A100 SXM4-40GB
PyTorch | 1.9.0a0 | GNMT V2 | 20 | 24.51 BLEU Score | 913,199 total tokens/sec | 8x A100 | DGX A100 | 21.05-py3 | Mixed | 128 | wmt16-en-de | A100-SXM-80GB
PyTorch | 1.9.0a0 | NCF | 0.37 | .96 Hit Rate at 10 | 154,525,495 samples/sec | 8x A100 | DGX A100 | 21.05-py3 | Mixed | 131072 | MovieLens 20M | A100-SXM-80GB
PyTorch | 1.9.0a0 | BERT-LARGE | 3 | 91.31 F1 | 926 sequences/sec | 8x A100 | DGX A100 | 21.05-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.9.0a0 | Transformer-XL Large | 405 | 14.03 Perplexity | 203,314 total tokens/sec | 8x A100 | DGX A100 | 21.05-py3 | Mixed | 16 | WikiText-103 | A100-SXM-80GB
PyTorch | 1.9.0a0 | Transformer-XL Base | 206 | 16.94 Perplexity | 637,779 total tokens/sec | 8x A100 | DGX A100 | 21.05-py3 | Mixed | 128 | WikiText-103 | A100-SXM-80GB
PyTorch | 1.6.0a0 | BERT-Large Pre-Training P1 | 2,379 | - | 3,231 sequences/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | - | Wikipedia+BookCorpus | A100-SXM4-40GB
PyTorch | 1.6.0a0 | BERT-Large Pre-Training P2 | 1,377 | 1.34 Final Loss | 630 sequences/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | - | Wikipedia+BookCorpus | A100-SXM4-40GB
PyTorch | 1.6.0a0 | BERT-Large Pre-Training E2E | 3,756 | 1.34 Final Loss | - | 8x A100 | DGX A100 | 20.06-py3 | Mixed | - | Wikipedia+BookCorpus | A100-SXM4-40GB
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 95 | 77.01 Top1 | 20,478 images/sec | 8x A100 | DGX A100 | 21.05-py3 | Mixed | 256 | ImageNet2012 | A100-SXM-80GB
TensorFlow | 1.15.5 | ResNext101 | 192 | 79.18 Top1 | 10,129 images/sec | 8x A100 | DGX A100 | 21.04-py3 | Mixed | 256 | ImageNet2012 | A100-SXM-80GB
TensorFlow | 1.15.5 | SE-ResNext101 | 222 | 79.48 Top1 | 8,743 images/sec | 8x A100 | DGX A100 | 21.05-py3 | Mixed | 256 | ImageNet2012 | A100-SXM-80GB
TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold | 1,027 images/sec | 8x A100 | DGX A100 | 21.03-py3 | Mixed | 2 | DAGM2007 | A100-SXM-80GB
TensorFlow | 1.15.5 | U-Net Medical | 6 | .9 Dice Score | 952 images/sec | 8x A100 | DGX A100 | 21.03-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM-80GB
TensorFlow | 1.15.5 | VAE-CF | 1 | .43 NDCG@100 | 1,555,512 users processed/sec | 8x A100 | DGX A100 | 21.05-py3 | TF32 | 3072 | MovieLens 20M | A100-SXM-80GB
TensorFlow | 1.15.4 | Wide and Deep | 107 | .68 MAP at 12 | 1,111,976 samples/sec | 8x A100 | DGX A100 | 20.10-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | A100 SXM4-40GB
TensorFlow | 1.15.5 | BERT-LARGE | 11 | 91.18 F1 | 841 sequences/sec | 8x A100 | DGX A100 | 21.05-py3 | Mixed | 24 | SQuAD v1.1 | A100-SXM-80GB
TensorFlow | 2.4.0 | Electra Fine Tuning | 3 | 92.71 F1 | 2,457 sequences/sec | 8x A100 | DGX A100 | 21.05-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM-80GB
TensorFlow | 2.2.0 | EfficientNet-B4 | 4,231 | 82.81 Top 1 | 2,535 images/sec | 8x A100 | DGX A100 | 20.08-py3 | Mixed | 160 | ImageNet2012 | A100-SXM-80GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
BERT-Large Pre-Training Sequence Length for Phase 1 = 128 and Phase 2 = 512 | Batch Size for Phase 1 = 65,536 and Phase 2 = 32,768
EfficientNet-B4: Mixup = 0.2 | Auto-Augmentation | cuDNN Version = 8.0.5.39 | NCCL Version = 2.7.8
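The throughput and time-to-train columns can be cross-checked with simple arithmetic. As a sketch (assuming ImageNet2012's roughly 1.28M training images; the exact epoch count depends on the training recipe):

```python
# Back-of-the-envelope check on the MXNet ResNet-50 v1.5 row above:
# 22,008 images/sec on 8x A100, 40 minutes to 75.9% top-1.
IMAGENET_TRAIN_IMAGES = 1_281_167  # ILSVRC2012 training-set size (assumption)
throughput = 22_008                # images/sec, from the table
time_to_train_min = 40             # minutes, from the table

seconds_per_epoch = IMAGENET_TRAIN_IMAGES / throughput
epochs = time_to_train_min * 60 / seconds_per_epoch
print(f"~{seconds_per_epoch:.0f} s/epoch, ~{epochs:.0f} epochs to convergence")
```

The estimate (around 58 seconds per epoch, on the order of 40 epochs) is consistent with a converged training run rather than a throughput-only measurement.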

A40 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.9.0a0 | NCF | 1 | .96 Hit Rate at 10 | 59,667,265 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 131072 | MovieLens 20M | A40
PyTorch | 1.9.0a0 | BERT-LARGE | 7 | 91.18 F1 | 429 sequences/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 32 | SQuAD v1.1 | A40

Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384

A30 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.8.0 | ResNet-50 v1.5 | 182 | 77.34 Top1 | 10,739 images/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 192 | ImageNet2012 | A30
PyTorch | 1.9.0a0 | Tacotron2 | 215 | .54 Training Loss | 144,326 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 104 | LJSpeech 1.1 | A30
PyTorch | 1.9.0a0 | WaveGlow | 533 | -5.82 Training Loss | 794,511 output samples/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 10 | LJSpeech 1.1 | A30
PyTorch | 1.9.0a0 | Transformer | 1,108 | 27.58 BLEU Score | 87,584 words/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 2560 | wmt14-en-de | A30
PyTorch | 1.9.0a0 | GNMT V2 | 81 | 24.65 BLEU Score | 219,582 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | TF32 | 128 | wmt16-en-de | A30
PyTorch | 1.9.0a0 | NCF | 1 | .96 Hit Rate at 10 | 56,399,904 samples/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 131072 | MovieLens 20M | A30
PyTorch | 1.9.0a0 | BERT-LARGE | 10 | 91.2 F1 | 294 sequences/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 10 | SQuAD v1.1 | A30
PyTorch | 1.9.0a0 | Transformer-XL Base | 151 | 22.16 Perplexity | 219,994 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 32 | WikiText-103 | A30
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 198 | 76.78 Top1 | 9,798 images/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 256 | ImageNet2012 | A30
TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.95 | 580 images/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 2 | DAGM2007 | A30
TensorFlow | 1.15.5 | U-Net Medical | 9 | .9 DICE Score | 461 images/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 8 | EM segmentation challenge | A30
TensorFlow | 1.15.5 | VAE-CF | 1 | .43 NDCG@100 | 861,797 users processed/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | TF32 | 3072 | MovieLens 20M | A30
TensorFlow | 1.15.5 | SE-ResNext101 | 573 | 79.83 Top1 | 3,399 images/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 96 | ImageNet2012 | A30
TensorFlow | 2.4.0 | Electra Fine-Tuning | 6 | 92.65 F1 | 904 sequences/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 16 | SQuAD v1.1 | A30

Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384

A10 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.8.0 | ResNet-50 v1.5 | 242 | 77.25 Top1 | 8,117 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 192 | ImageNet2012 | A10
PyTorch | 1.9.0a0 | SE-ResNeXt101 | 996 | 80.24 Top1 | 1,953 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 112 | ImageNet2012 | A10
PyTorch | 1.9.0a0 | Tacotron2 | 204 | .5 Training Loss | 151,946 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 104 | LJSpeech 1.1 | A10
PyTorch | 1.9.0a0 | WaveGlow | 637 | -5.84 Training Loss | 664,022 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 10 | LJSpeech 1.1 | A10
PyTorch | 1.9.0a0 | Transformer | 1,365 | 27.8 BLEU Score | 70,844 words/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 2560 | wmt14-en-de | A10
PyTorch | 1.9.0a0 | FastPitch | 177 | .25 Training Loss | 467,464 frames/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 32 | LJSpeech 1.1 | A10
PyTorch | 1.9.0a0 | GNMT V2 | 61 | 24.49 BLEU Score | 292,052 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 128 | wmt16-en-de | A10
PyTorch | 1.9.0a0 | NCF | 1 | .96 Hit Rate at 10 | 47,605,092 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 131072 | MovieLens 20M | A10
PyTorch | 1.9.0a0 | BERT-LARGE | 13 | 91.3 F1 | 236 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 10 | SQuAD v1.1 | A10
PyTorch | 1.9.0a0 | Transformer-XL Base | 176 | 22.16 Perplexity | 187,731 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 32 | WikiText-103 | A10
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 266 | 76.74 Top1 | 7,283 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 256 | ImageNet2012 | A10
TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.95 | 550 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 2 | DAGM2007 | A10
TensorFlow | 1.15.5 | U-Net Medical | 14 | .9 DICE Score | 324 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 8 | EM segmentation challenge | A10
TensorFlow | 1.15.5 | VAE-CF | 1 | .43 NDCG@100 | 664,902 users processed/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | TF32 | 3072 | MovieLens 20M | A10
TensorFlow | 1.15.5 | SE-ResNext101 | 866 | 79.65 Top1 | 2,240 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 96 | ImageNet2012 | A10
TensorFlow | 2.4.0 | Electra Fine-Tuning | 6 | 92.62 F1 | 745 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 16 | SQuAD v1.1 | A10

Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384

V100 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.8.0 | ResNet-50 v1.5 | 169 | 77.37 Top1 | 11,740 images/sec | 8x V100 | DGX-2 | 21.05-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
PyTorch | 1.8.0a0 | Mask R-CNN | 269 | .34 AP Segm | 109 images/sec | 8x V100 | DGX-2 | 20.12-py3 | Mixed | 8 | COCO 2014 | V100-SXM3-32GB
PyTorch | 1.9.0a0 | Tacotron2 | 185 | .5 Training Loss | 165,599 total output mels/sec | 8x V100 | DGX-2 | 21.05-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB
PyTorch | 1.9.0a0 | WaveGlow | 458 | -5.65 Training Loss | 922,820 output samples/sec | 8x V100 | DGX-2 | 21.05-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB
PyTorch | 1.6.0a0 | Jasper | 6,300 | 3.49 dev-clean WER | 312 sequences/sec | 8x V100 | DGX-2 | 20.06-py3 | Mixed | 64 | LibriSpeech | V100 SXM2-32GB
PyTorch | 1.9.0a0 | Transformer | 470 | 27.6 BLEU Score | 210,671 words/sec | 8x V100 | DGX-2 | 21.05-py3 | Mixed | 5120 | wmt14-en-de | V100-SXM3-32GB
PyTorch | 1.6.0a0 | FastPitch | 354 | .18 Training Loss | 570,968 frames/sec | 8x V100 | DGX-1 | 20.06-py3 | Mixed | 32 | LJSpeech 1.1 | V100 SXM2-16GB
PyTorch | 1.6.0a0 | GNMT V2 | 39 | 24.38 BLEU Score | 447,832 total tokens/sec | 8x V100 | DGX-1 | 20.06-py3 | Mixed | 128 | wmt16-en-de | V100 SXM2-16GB
PyTorch | 1.9.0a0 | NCF | 1 | .96 Hit Rate at 10 | 98,983,601 samples/sec | 8x V100 | DGX-2 | 21.05-py3 | Mixed | 131072 | MovieLens 20M | V100-SXM3-32GB
PyTorch | 1.9.0a0 | BERT-LARGE | 8 | 91.31 F1 | 368 sequences/sec | 8x V100 | DGX-2 | 21.05-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB
PyTorch | 1.9.0a0 | Transformer-XL Base | 115 | 22.05 Perplexity | 286,404 total tokens/sec | 8x V100 | DGX-2 | 21.05-py3 | Mixed | 32 | WikiText-103 | V100-SXM3-32GB
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 185 | 76.92 Top1 | 10,476 images/sec | 8x V100 | DGX-2 | 21.05-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | 1.15.5 | ResNext101 | 416 | 79.36 Top1 | 4,669 images/sec | 8x V100 | DGX-2 | 21.04-py3 | Mixed | 128 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | 1.15.5 | SE-ResNext101 | 504 | 79.81 Top1 | 3,867 images/sec | 8x V100 | DGX-2 | 21.05-py3 | Mixed | 96 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.95 | 668 images/sec | 8x V100 | DGX-2 | 21.05-py3 | Mixed | 2 | DAGM2007 | V100-SXM3-32GB
TensorFlow | 2.4.0 | U-Net Medical | 12 | .84 DICE Score | 473 images/sec | 8x V100 | DGX-2 | 21.05-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB
TensorFlow | 1.15.5 | VAE-CF | 1 | .43 NDCG@100 | 913,611 users processed/sec | 8x V100 | DGX-2 | 21.05-py3 | Mixed | 3072 | MovieLens 20M | V100-SXM3-32GB
TensorFlow | 1.15.4 | Wide and Deep | 185 | .68 MAP at 12 | 643,334 samples/sec | 8x V100 | DGX-1 | 20.10-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | V100 SXM2-16GB
TensorFlow | 1.15.5 | BERT-LARGE | 18 | 91.39 F1 | 325 sequences/sec | 8x V100 | DGX-2 | 21.05-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB
TensorFlow | 2.2.0 | Electra Fine-Tuning | 6 | 92.72 F1 | 1,051 sequences/sec | 8x V100 | DGX-1 | 20.07-py3 | Mixed | 16 | SQuAD v1.1 | V100 SXM2-16GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384

T4 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.8.0 | ResNet-50 v1.5 | 492 | 77.3 Top1 | 3,967 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4
PyTorch | 1.7.0a0 | ResNeXt101 | 1,738 | 78.75 Top1 | 1,124 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 128 | ImageNet2012 | NVIDIA T4
PyTorch | 1.9.0a0 | SE-ResNeXt101 | 1,770 | 79.94 Top1 | 1,102 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 112 | ImageNet2012 | NVIDIA T4
PyTorch | 1.9.0a0 | Tacotron2 | 245 | .52 Training Loss | 126,056 total output mels/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.9.0a0 | WaveGlow | 1,016 | -5.87 Training Loss | 411,083 output samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.9.0a0 | Transformer | 2,288 | 27.65 BLEU Score | 42,030 words/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 2560 | wmt14-en-de | NVIDIA T4
PyTorch | 1.7.0a0 | FastPitch | 319 | .21 Training Loss | 281,406 frames/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 32 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.9.0a0 | GNMT V2 | 103 | 24.45 BLEU Score | 168,938 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4
PyTorch | 1.9.0a0 | NCF | 2 | .96 Hit Rate at 10 | 28,185,539 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA T4
PyTorch | 1.9.0a0 | BERT-LARGE | 22 | 91.34 F1 | 136 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 10 | SQuAD v1.1 | NVIDIA T4
PyTorch | 1.9.0a0 | Transformer-XL Base | 320 | 22.12 Perplexity | 102,981 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 32 | WikiText-103 | NVIDIA T4
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 563 | 76.83 Top1 | 3,416 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4
TensorFlow | 1.15.5 | U-Net Industrial | 2 | .99 IoU Threshold 0.95 | 312 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 2 | DAGM2007 | NVIDIA T4
TensorFlow | 1.15.5 | U-Net Medical | 31 | .89 DICE Score | 158 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4
TensorFlow | 1.15.5 | VAE-CF | 2 | .43 NDCG@100 | 389,406 users processed/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 3072 | MovieLens 20M | NVIDIA T4
TensorFlow | 1.15.4 | SSD | 112 | .28 mAP | 549 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.12-py3 | Mixed | 32 | COCO 2017 | NVIDIA T4
TensorFlow | 1.15.5 | Mask R-CNN | 492 | .34 AP Segm | 53 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.03-py3 | Mixed | 4 | COCO 2014 | NVIDIA T4
TensorFlow | 1.15.5 | ResNext101 | 1,224 | 79.38 Top1 | 1,575 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.02-py3 | Mixed | 128 | ImageNet2012 | NVIDIA T4
TensorFlow | 1.15.5 | SE-ResNext101 | 1,552 | 79.53 Top1 | 1,243 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 96 | ImageNet2012 | NVIDIA T4
TensorFlow | 2.4.0 | Electra Fine-Tuning | 9 | 92.72 F1 | 412 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 16 | SQuAD v1.1 | NVIDIA T4

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384


Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology for testing AI systems, and it is typically done on multi-accelerator systems (see the ‘Training-Convergence’ tab or read our blog on convergence for more details) to shorten time to convergence, especially for recurring monthly container builds.

Scenarios not typically used in real-world training, such as single-GPU throughput, are illustrated in the tables below and are provided for reference as an indication of the platform’s single-chip throughput.

NVIDIA’s complete solution stack, from hardware to software, allows data scientists to deliver unprecedented acceleration at every scale. Visit NVIDIA GPU Cloud (NGC) to pull containers and quickly get up and running with deep learning.

Single GPU Training Performance of NVIDIA A100, A40, A30, A10, V100 and T4

Benchmarks are reproducible by following links to NGC scripts

A100 Single-GPU Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | - | ResNet-50 v1.5 | 2,751 images/sec | 1x A100 | - | - | Mixed | 408 | ImageNet2012 | A100-SXM-80GB
PyTorch | 1.9.0a0 | Mask R-CNN | 29 images/sec | 1x A100 | DGX A100 | 21.05-py3 | TF32 | 8 | COCO 2014 | A100-SXM-80GB
PyTorch | 1.9.0a0 | SSD v1.1 | 445 images/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 128 | COCO 2017 | A100-SXM-80GB
PyTorch | 1.9.0a0 | Tacotron2 | 38,901 total output mels/sec | 1x A100 | DGX A100 | 21.05-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM-80GB
PyTorch | 1.9.0a0 | WaveGlow | 202,226 output samples/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM-80GB
PyTorch | 1.9.0a0 | Jasper | 83 sequences/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 64 | LibriSpeech | A100-SXM-80GB
PyTorch | 1.6.0a0 | Transformer | 82,618 words/sec | 1x A100 | DGX A100 | 20.06-py3 | Mixed | 10240 | wmt14-en-de | A100 SXM4-40GB
PyTorch | 1.9.0a0 | FastPitch | 180,170 frames/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 128 | LJSpeech 1.1 | A100-SXM-80GB
PyTorch | 1.9.0a0 | GNMT V2 | 157,886 total tokens/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 128 | wmt16-en-de | A100-SXM-80GB
PyTorch | 1.9.0a0 | NCF | 37,371,854 samples/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM-80GB
PyTorch | 1.9.0a0 | BERT-LARGE | 122 sequences/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.9.0a0 | Transformer-XL Large | 28,503 total tokens/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 16 | WikiText-103 | A100-SXM-80GB
PyTorch | 1.9.0a0 | Transformer-XL Base | 83,345 total tokens/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 128 | WikiText-103 | A100-SXM-80GB
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 2,662 images/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 256 | ImageNet2012 | A100-SXM-80GB
TensorFlow | 1.15.5 | ResNext101 | 1,312 images/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 256 | ImageNet2012 | A100-SXM-80GB
TensorFlow | 1.15.5 | SE-ResNext101 | 1,135 images/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 256 | ImageNet2012 | A100-SXM-80GB
TensorFlow | 1.15.5 | U-Net Industrial | 365 images/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 16 | DAGM2007 | A100-SXM4-40GB
TensorFlow | 2.4.0 | U-Net Medical | 149 images/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM-80GB
TensorFlow | 1.15.5 | VAE-CF | 393,529 users processed/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 24576 | MovieLens 20M | A100-SXM-80GB
TensorFlow | 1.15.5 | Wide and Deep | 321,808 samples/sec | 1x A100 | DGX A100 | 21.05-py3 | TF32 | 131072 | Kaggle Outbrain Click Prediction | A100-SXM4-40GB
TensorFlow | 1.15.5 | BERT-LARGE | 117 sequences/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 24 | SQuAD v1.1 | A100-SXM-80GB
TensorFlow | 2.4.0 | Electra Fine Tuning | 348 sequences/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM-80GB
TensorFlow | - | EfficientNet-B4 | 332 images/sec | 1x A100 | DGX A100 | - | Mixed | 160 | ImageNet2012 | A100-SXM-80GB
TensorFlow | 1.15.5 | NCF | 40,189,461 samples/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-40GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
EfficientNet-B4: Basic Augmentation | cuDNN Version = 8.0.5.32 | NCCL Version = 2.7.8 | Installation Source = NGC
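Comparing these single-GPU numbers against the 8-GPU converged-training tables gives a rough scaling factor. For example, using the TensorFlow ResNet-50 v1.5 rows on A100 (simple arithmetic on the published throughputs, not an NVIDIA-published metric):

```python
# TensorFlow ResNet-50 v1.5 on A100, throughput taken from the tables in this document
single_gpu = 2_662    # images/sec, 1x A100 (single-GPU table)
eight_gpu = 20_478    # images/sec, 8x A100 (converged-training table)

speedup = eight_gpu / single_gpu
print(f"8-GPU speedup over 1 GPU: {speedup:.2f}x")  # near-linear scaling
```

Note that container versions and batch sizes can differ between the two tables, so treat such ratios as indicative rather than exact.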

A40 Single-GPU Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.9.0a0 | NCF | 20,386,212 samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 1048576 | MovieLens 20M | A40
PyTorch | 1.9.0a0 | BERT-LARGE | 62 sequences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 32 | SQuAD v1.1 | A40
TensorFlow | 1.15.5 | BERT-LARGE | 55 sequences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 24 | SQuAD v1.1 | A40

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server

A30 Single-GPU Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.8.0 | ResNet-50 v1.5 | 1,254 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 192 | ImageNet2012 | A30
PyTorch | 1.9.0a0 | SSD v1.1 | 226 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 64 | COCO 2017 | A30
PyTorch | 1.9.0a0 | Tacotron2 | 18,541 total output mels/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 104 | LJSpeech 1.1 | A30
PyTorch | 1.9.0a0 | WaveGlow | 119,858 output samples/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 10 | LJSpeech 1.1 | A30
PyTorch | 1.9.0a0 | Transformer | 24,662 words/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 2560 | wmt14-en-de | A30
PyTorch | 1.9.0a0 | FastPitch | 100,243 frames/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 64 | LJSpeech 1.1 | A30
PyTorch | 1.9.0a0 | NCF | 19,634,381 samples/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 1048576 | MovieLens 20M | A30
PyTorch | 1.9.0a0 | GNMT V2 | 55,112 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | TF32 | 128 | wmt16-en-de | A30
PyTorch | 1.9.0a0 | Transformer-XL Base | 40,981 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 32 | WikiText-103 | A30
PyTorch | 1.9.0a0 | ResNeXt101 | 544 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 112 | ImageNet2012 | A30
PyTorch | 1.9.0a0 | Jasper | 34 sequences/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 16 | LibriSpeech | A30
PyTorch | 1.9.0a0 | Transformer-XL Large | 12,617 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 4 | WikiText-103 | A30
PyTorch | 1.9.0a0 | BERT-LARGE | 50 sequences/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 10 | SQuAD v1.1 | A30
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 1,339 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 256 | ImageNet2012 | A30
TensorFlow | 1.15.5 | ResNext101 | 595 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 128 | ImageNet2012 | A30
TensorFlow | 1.15.5 | SE-ResNext101 | 495 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 96 | ImageNet2012 | A30
TensorFlow | 1.15.5 | U-Net Industrial | 109 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 16 | DAGM2007 | A30
TensorFlow | 2.4.0 | U-Net Medical | 71 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 8 | EM segmentation challenge | A30
TensorFlow | 1.15.5 | VAE-CF | 198,703 users processed/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 24576 | MovieLens 20M | A30
TensorFlow | 1.15.5 | Wide and Deep | 232,356 samples/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A30
TensorFlow | 2.4.0 | Mask R-CNN | 21 samples/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 4 | COCO 2014 | A30
TensorFlow | 2.4.0 | Electra Fine Tuning | 143 sequences/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 16 | SQuAD v1.1 | A30
TensorFlow | 1.15.5 | SSD | 201 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 32 | COCO 2017 | A30

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server

A10 Single-GPU Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.8.0 | ResNet-50 v1.5 | 1,019 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 192 | ImageNet2012 | A10
PyTorch | 1.9.0a0 | SSD v1.1 | 175 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 64 | COCO 2017 | A10
PyTorch | 1.9.0a0 | Tacotron2 | 19,741 total output mels/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 104 | LJSpeech 1.1 | A10
PyTorch | 1.9.0a0 | WaveGlow | 100,432 output samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 10 | LJSpeech 1.1 | A10
PyTorch | 1.9.0a0 | Transformer | 22,248 words/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 2560 | wmt14-en-de | A10
PyTorch | 1.9.0a0 | FastPitch | 92,906 frames/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 64 | LJSpeech 1.1 | A10
PyTorch | 1.9.0a0 | Transformer-XL Base | 34,753 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 32 | WikiText-103 | A10
PyTorch | 1.9.0a0 | GNMT V2 | 66,783 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 128 | wmt16-en-de | A10
PyTorch | 1.9.0a0 | ResNeXt101 | 421 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 128 | ImageNet2012 | A10
PyTorch | 1.9.0a0 | SE-ResNeXt101 | 327 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 128 | ImageNet2012 | A10
PyTorch | 1.9.0a0 | NCF | 16,862,317 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 1048576 | MovieLens 20M | A10
PyTorch | 1.8.0a0 | Jasper | 29 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.02-py3 | Mixed | 32 | LibriSpeech | A10
PyTorch | 1.9.0a0 | Transformer-XL Large | 10,699 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 4 | WikiText-103 | A10
PyTorch | 1.9.0a0 | BERT-LARGE | 41 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 10 | SQuAD v1.1 | A10
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 996 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 256 | ImageNet2012 | A10
TensorFlow | 1.15.5 | ResNext101 | 460 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 128 | ImageNet2012 | A10
TensorFlow | 1.15.5 | SE-ResNext101 | 342 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 96 | ImageNet2012 | A10
TensorFlow | 1.15.5 | U-Net Industrial | 96 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 16 | DAGM2007 | A10
TensorFlow | 2.4.0 | U-Net Medical | 49 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 8 | EM segmentation challenge | A10
TensorFlow | 1.15.5 | VAE-CF | 174,947 users processed/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 24576 | MovieLens 20M | A10
TensorFlow | 1.15.4 | Wide and Deep | 249,905 samples/sec | 1x A10 | Supermicro SYS-1029GQ-TRT | 20.11-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A10
TensorFlow | 2.4.0 | Electra Fine Tuning | 128 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 16 | SQuAD v1.1 | A10
TensorFlow | 2.4.0 | Mask R-CNN | 18 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 4 | COCO 2014 | A10
TensorFlow | 1.15.5 | SSD | 181 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 32 | COCO 2017 | A10

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server

V100 Single-GPU Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.8.0 | ResNet-50 v1.5 | 1,473 images/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
PyTorch | 1.9.0a0 | ResNeXt101 | 554 images/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 112 | ImageNet2012 | V100-SXM3-32GB
PyTorch | 1.9.0a0 | SSD v1.1 | 233 images/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 64 | COCO 2017 | V100-SXM3-32GB
PyTorch | 1.9.0a0 | Tacotron2 | 24,540 total output mels/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB
PyTorch | 1.9.0a0 | WaveGlow | 136,702 output samples/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB
PyTorch | 1.9.0a0 | Jasper | 44 sequences/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 64 | LibriSpeech | V100-SXM3-32GB
PyTorch | 1.9.0a0 | Transformer | 32,218 words/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 5120 | wmt14-en-de | V100-SXM3-32GB
PyTorch | 1.9.0a0 | FastPitch | 123,294 frames/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 64 | LJSpeech 1.1 | V100-SXM3-32GB
PyTorch | 1.7.0a0 | GNMT V2 | 83,200 total tokens/sec | 1x V100 | DGX-2 | 20.09-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB
PyTorch | 1.9.0a0 | NCF | 22,193,821 samples/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 1048576 | MovieLens 20M | V100-SXM3-32GB
PyTorch | 1.9.0a0 | BERT-LARGE | 53 sequences/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB
PyTorch | 1.9.0a0 | Transformer-XL Base | 44,072 total tokens/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 32 | WikiText-103 | V100-SXM3-32GB
PyTorch | 1.9.0a0 | Transformer-XL Large | 15,360 total tokens/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 8 | WikiText-103 | V100-SXM3-32GB
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 1,393 images/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | 1.15.5 | ResNext101 | 632 images/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 128 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | 1.15.5 | SE-ResNext101 | 545 images/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 96 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | 1.15.2 | U-Net Industrial | 169 images/sec | 1x V100 | DGX-1 | 20.06-py3 | Mixed | 16 | DAGM2007 | V100 SXM2-16GB
TensorFlow | 1.15.5 | U-Net Medical | 68 images/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB
TensorFlow | 1.15.5 | VAE-CF | 222,573 users processed/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 24576 | MovieLens 20M | V100-SXM3-32GB
TensorFlow | 1.15.5 | Wide and Deep | 301,631 samples/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | V100-SXM3-32GB
TensorFlow | 1.15.5 | BERT-LARGE | 48 sequences/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB
TensorFlow | 2.4.0 | Electra Fine-Tuning | 192 sequences/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 32 | SQuAD v1.1 | V100-SXM3-32GB
TensorFlow | 2.4.0 | Mask R-CNN | 22 samples/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 4 | COCO 2014 | V100-SXM3-32GB
TensorFlow | 1.15.5 | SSD | 222 images/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 32 | COCO 2017 | V100-SXM3-32GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384

T4 Single-GPU Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.8.0 | ResNet-50 v1.5 | 514 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 64 | ImageNet2012 | NVIDIA T4
PyTorch | 1.8.0 | ResNeXt101 | 209 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.02-py3 | Mixed | 128 | ImageNet2012 | NVIDIA T4
PyTorch | 1.9.0a0 | Tacotron2 | 16,884 total output mels/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.9.0a0 | WaveGlow | 55,618 output samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.9.0a0 | Transformer | 10,512 words/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 2560 | wmt14-en-de | NVIDIA T4
PyTorch | 1.9.0a0 | FastPitch | 41,078 frames/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 64 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.9.0a0 | GNMT V2 | 31,038 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4
PyTorch | 1.9.0a0 | NCF | 8,104,090 samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 1048576 | MovieLens 20M | NVIDIA T4
PyTorch | 1.9.0a0 | BERT-LARGE | 19 sequences/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 10 | SQuAD v1.1 | NVIDIA T4
PyTorch | 1.9.0a0 | Transformer-XL Base | 17,119 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 32 | WikiText-103 | NVIDIA T4
PyTorch | 1.8.0a0 | Jasper | 14 sequences/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 20.12-py3 | Mixed | 32 | LibriSpeech | NVIDIA T4
PyTorch | 1.8.0a0 | SE-ResNeXt101 | 163 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.02-py3 | Mixed | 128 | ImageNet2012 | NVIDIA T4
PyTorch | 1.9.0a0 | Transformer-XL Large | 5,231 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 4 | WikiText-103 | NVIDIA T4
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 452 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4
TensorFlow | 1.15.5 | U-Net Industrial | 47 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 16 | DAGM2007 | NVIDIA T4
TensorFlow | 1.15.5 | U-Net Medical | 22 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4
TensorFlow | 1.15.5 | VAE-CF | 83,310 users processed/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 24576 | MovieLens 20M | NVIDIA T4
TensorFlow | 1.15.5 | SSD | 97 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 32 | COCO 2017 | NVIDIA T4
TensorFlow | 2.4.0 | Mask R-CNN | 9 samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 4 | COCO 2014 | NVIDIA T4
TensorFlow | 1.15.5 | Wide and Deep | 203,867 samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.03-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | NVIDIA T4
TensorFlow | 1.15.5 | SE-ResNext101 | 170 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 96 | ImageNet2012 | NVIDIA T4
TensorFlow | 1.15.5 | ResNext101 | 207 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 128 | ImageNet2012 | NVIDIA T4
TensorFlow | 2.4.0 | Electra Fine-Tuning | 62 sequences/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 16 | SQuAD v1.1 | NVIDIA T4

The FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384

Real-world AI inferencing demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution enables customers to quickly deploy AI models into real-world production with the highest performance from data centers to the edge.

NVIDIA landed top performance spots on all MLPerf™ Inference 1.0 tests, the AI industry's leading benchmark. NVIDIA TensorRT™ running on NVIDIA Tensor Core GPUs enables the most efficient deep learning inference performance across multiple application areas and models. This versatility gives data scientists wide latitude to create the optimal low-latency solution. Visit NVIDIA® GPU Cloud (NGC) to download any of these containers and immediately race into production. The inference whitepaper provides an overview of inference platforms.

The Triton Inference Server is open-source inference serving software that maximizes performance and simplifies the deployment of AI models at scale in production. Triton lets teams deploy trained AI models from multiple frameworks (TensorFlow, TensorRT, PyTorch, ONNX Runtime, OpenVINO, or custom backends). They can deploy from local storage, Google Cloud Platform, Azure Storage, or Amazon S3 onto any GPU- or CPU-based infrastructure (cloud, data center, or embedded devices). Triton is open source on GitHub and available as a Docker container on NGC.
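As a sketch of how a model is exposed through Triton, a minimal model configuration (config.pbtxt) might look like the following; the model name, tensor names, and batching settings here are illustrative assumptions, not values from the benchmarks on this page:

```protobuf
# Hypothetical config.pbtxt for a TensorRT ResNet-50 engine served by Triton
name: "resnet50_trt"            # illustrative model name
platform: "tensorrt_plan"       # backend: a serialized TensorRT engine
max_batch_size: 64
input [
  {
    name: "input"               # assumed input tensor name
    data_type: TYPE_FP16
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"              # assumed output tensor name
    data_type: TYPE_FP16
    dims: [ 1000 ]
  }
]
# Dynamic batching groups individual client requests into larger server-side batches
dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 100
}
# Run two copies of the model per GPU
instance_group [ { count: 2, kind: KIND_GPU } ]
```

Dynamic batching and multiple model instances are the main levers Triton offers for trading a small amount of latency for substantially higher throughput.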


MLPerf Inference v1.0 Performance Benchmarks

Offline Scenario - Closed Division

Network | Throughput | Target Accuracy | GPU | Server | Dataset | GPU Version
ResNet-50 v1.5105,677 samples/sec76.46% Top18x A10Supermicro 4029GP-TRT-OTO-28ImageNetA10
141,518 samples/sec76.46% Top18x A30Gigabyte G482-Z54ImageNetA30
5,108 samples/sec76.46% Top11x1g.10gb A100DGX A100ImageNetA100 SXM-80GB
38,010 samples/sec76.46% Top11x A100DGX A100ImageNetA100 SXM-80GB
304,876 samples/sec76.46% Top18x A100DGX A100ImageNetA100 SXM-80GB
248,179 samples/sec76.46% Top18x A100Gigabyte G482-Z54ImageNetA100-PCIe-40GB
132,926 samples/sec76.46% Top14x A100DGX-Station-A100ImageNetA100 SXM-80GB
SSD ResNet-342,496 samples/sec0.2 mAP8x A10Supermicro 4029GP-TRT-OTO-28COCOA10
3,756 samples/sec0.2 mAP8x A30Gigabyte G482-Z54COCOA30
134 samples/sec0.2 mAP1x1g.10gb A100DGX A100COCOA100 SXM-80GB
989 samples/sec0.2 mAP1x A100DGX A100COCOA100 SXM-80GB
7,879 samples/sec0.2 mAP8x A100DGX A100COCOA100 SXM-80GB
6,586 samples/sec0.2 mAP8x A100Gigabyte G482-Z54COCOA100-PCIe-40GB
3,370 samples/sec0.2 mAP4x A100DGX-Station-A100COCOA100 SXM-80GB
3D-UNet172 samples/sec0.853 DICE mean8x A10Supermicro 4029GP-TRT-OTO-28BraTS 2019A10
237 samples/sec0.853 DICE mean8x A30Gigabyte G482-Z54BraTS 2019A30
7 samples/sec0.853 DICE mean1x1g.10gb A100DGX A100BraTS 2019A100 SXM-80GB
61 samples/sec0.853 DICE mean1x A100DGX A100BraTS 2019A100 SXM-80GB
480 samples/sec0.853 DICE mean8x A100DGX A100BraTS 2019A100 SXM-80GB
412 samples/sec0.853 DICE mean8x A100Gigabyte G482-Z54BraTS 2019A100-PCIe-40GB
214 samples/sec0.853 DICE mean4x A100DGX-Station-A100BraTS 2019A100 SXM-80GB
RNN-T36,116 samples/sec7.45% WER8x A10Supermicro 4029GP-TRT-OTO-28LibriSpeechA10
51,690 samples/sec7.45% WER8x A30Gigabyte G482-Z54LibriSpeechA30
1,553 samples/sec7.45% WER1x1g.10gb A100DGX A100LibriSpeechA100 SXM-80GB
14,008 samples/sec7.45% WER1x A100DGX A100LibriSpeechA100 SXM-80GB
105,677 samples/sec7.45% WER8x A100DGX A100LibriSpeechA100 SXM-80GB
90,853 samples/sec7.45% WER8x A100Gigabyte G482-Z54LibriSpeechA100-PCIe-40GB
48,886 samples/sec7.45% WER4x A100DGX-Station-A100LibriSpeechA100 SXM-80GB
BERT8,454 samples/sec90.07% f18x A10Supermicro 4029GP-TRT-OTO-28SQuAD v1.1A10
13,260 samples/sec90.07% f18x A30Gigabyte G482-Z54SQuAD v1.1A30
492 samples/sec90.07% f11x1g.10gb A100DGX A100SQuAD v1.1A100 SXM-80GB
3,602 samples/sec90.07% f11x A100DGX A100SQuAD v1.1A100 SXM-80GB
28,347 samples/sec90.07% f18x A100DGX A100SQuAD v1.1A100 SXM-80GB
22,847 samples/sec90.07% f18x A100Gigabyte G482-Z54SQuAD v1.1A100-PCIe-40GB
11,305 samples/sec90.07% f14x A100DGX-Station-A100SQuAD v1.1A100 SXM-80GB
DLRM772,378 samples/sec80.25% AUC8x A10Supermicro 4029GP-TRT-OTO-28Criteo 1TB Click LogsA10
1,067,510 samples/sec80.25% AUC8x A30Gigabyte G482-Z54Criteo 1TB Click LogsA30
36,473 samples/sec80.25% AUC1x1g.10gb A100DGX A100Criteo 1TB Click LogsA100 SXM-80GB
311,826 samples/sec80.25% AUC1x A100DGX A100Criteo 1TB Click LogsA100 SXM-80GB
2,462,300 samples/sec80.25% AUC8x A100DGX A100Criteo 1TB Click LogsA100 SXM-80GB
1,057,550 samples/sec80.25% AUC4x A100DGX-Station-A100Criteo 1TB Click LogsA100 SXM-80GB

Server Scenario - Closed Division

Network | Throughput | Target Accuracy | MLPerf Server Latency Constraint (ms) | GPU | Server | Dataset | GPU Version
ResNet-50 v1.587,984 queries/sec76.46% Top1158x A10Supermicro 4029GP-TRT-OTO-28ImageNetA10
115,987 queries/sec76.46% Top1158x A30Gigabyte G482-Z54ImageNetA30
3,602 queries/sec76.46% Top1151x1g.10gb A100DGX A100ImageNetA100 SXM-80GB
30,794 queries/sec76.46% Top1151x A100DGX A100ImageNetA100 SXM-80GB
259,994 queries/sec76.46% Top1158x A100DGX A100ImageNetA100 SXM-80GB
207,976 queries/sec76.46% Top1158x A100Gigabyte G482-Z54ImageNetA100-PCIe-40GB
106,988 queries/sec76.46% Top1154x A100DGX-Station-A100ImageNetA100 SXM-80GB
SSD ResNet-342,000 queries/sec0.2 mAP1008x A10Supermicro 4029GP-TRT-OTO-28COCOA10
3,575 queries/sec0.2 mAP1008x A30Gigabyte G482-Z54COCOA30
100 queries/sec0.2 mAP1001x1g.10gb A100DGX A100COCOA100 SXM-80GB
926 queries/sec0.2 mAP1001x A100DGX A100COCOA100 SXM-80GB
7,654 queries/sec0.2 mAP1008x A100DGX A100COCOA100 SXM-80GB
6,162 queries/sec0.2 mAP1008x A100Gigabyte G482-Z54COCOA100-PCIe-40GB
3,081 queries/sec0.2 mAP1004x A100DGX-Station-A100COCOA100 SXM-80GB
RNN-T22,597 queries/sec7.45% WER1,0008x A10Supermicro 4029GP-TRT-OTO-28LibriSpeechA10
36,991 queries/sec7.45% WER1,0008x A30Gigabyte G482-Z54LibriSpeechA30
1,303 queries/sec7.45% WER1,0001x1g.10gb A100DGX A100LibriSpeechA100 SXM-80GB
12,751 queries/sec7.45% WER1,0001x A100DGX A100LibriSpeechA100 SXM-80GB
103,986 queries/sec7.45% WER1,0008x A100DGX A100LibriSpeechA100 SXM-80GB
85,985 queries/sec7.45% WER1,0008x A100Gigabyte G482-Z54LibriSpeechA100-PCIe-40GB
43,389 queries/sec7.45% WER1,0004x A100DGX-Station-A100LibriSpeechA100 SXM-80GB
BERT7,204 queries/sec90.07% f11308x A10Supermicro 4029GP-TRT-OTO-28SQuAD v1.1A10
11,500 queries/sec90.07% f11308x A30Gigabyte G482-Z54SQuAD v1.1A30
381 queries/sec90.07% f11301x1g.10gb A100DGX A100SQuAD v1.1A100 SXM-80GB
3,202 queries/sec90.07% f11301x A100DGX A100SQuAD v1.1A100 SXM-80GB
25,792 queries/sec90.07% f11308x A100DGX A100SQuAD v1.1A100 SXM-80GB
20,792 queries/sec90.07% f11308x A100Gigabyte G482-Z54SQuAD v1.1A100-PCIe-40GB
10,203 queries/sec90.07% f11304x A100DGX-Station-A100SQuAD v1.1A100 SXM-80GB
DLRM680,147 queries/sec80.25% AUC308x A10Supermicro 4029GP-TRT-OTO-28Criteo 1TB Click LogsA10
750,204 queries/sec80.25% AUC308x A30Gigabyte G482-Z54Criteo 1TB Click LogsA30
35,991 queries/sec80.25% AUC301x1g.10gb A100DGX A100Criteo 1TB Click LogsA100 SXM-80GB
286,002 queries/sec80.25% AUC301x A100DGX A100Criteo 1TB Click LogsA100 SXM-80GB
2,302,570 queries/sec80.25% AUC308x A100DGX A100Criteo 1TB Click LogsA100 SXM-80GB
942,395 queries/sec80.25% AUC304x A100DGX-Station-A100Criteo 1TB Click LogsA100 SXM-80GB

Power Efficiency Offline Scenario - Closed Division

Network | Throughput | Throughput per Watt | GPU | Server | Dataset | GPU Version
ResNet-50 v1.5213,599 samples/sec97.31 samples/sec/watt8x A100Gigabyte G482-Z54ImageNetA100 PCIe-40GB
270,706 samples/sec78.27 samples/sec/watt8x A100DGX A100ImageNetA100 SXM-80GB
124,529 samples/sec98.14 samples/sec/watt4x A100DGX-Station-A100ImageNetA100 SXM-80GB
SSD ResNet-345,824 samples/sec2.6 samples/sec/watt8x A100Gigabyte G482-Z54COCOA100 PCIe-40GB
6,875 samples/sec1.96 samples/sec/watt8x A100DGX A100COCOA100 SXM-80GB
3,110 samples/sec2.44 samples/sec/watt4x A100DGX-Station-A100COCOA100 SXM-80GB
3D-UNet372 samples/sec0.16 samples/sec/watt8x A100Gigabyte G482-Z54BraTS 2019A100 PCIe-40GB
433 samples/sec0.12 samples/sec/watt8x A100DGX A100BraTS 2019A100 SXM-80GB
202 samples/sec0.16 samples/sec/watt4x A100DGX-Station-A100BraTS 2019A100 SXM-80GB
RNN-T82,540 samples/sec36.23 samples/sec/watt8x A100Gigabyte G482-Z54LibriSpeechA100 PCIe-40GB
93,803 samples/sec26.39 samples/sec/watt8x A100DGX A100LibriSpeechA100 SXM-80GB
47,255 samples/sec36.16 samples/sec/watt4x A100DGX-Station-A100LibriSpeechA100 SXM-80GB
BERT17,697 samples/sec7.73 samples/sec/watt8x A100Gigabyte G482-Z54SQuAD v1.1A100 PCIe-40GB
23,406 samples/sec6.77 samples/sec/watt8x A100DGX A100SQuAD v1.1A100 SXM-80GB
9,865 samples/sec7.76 samples/sec/watt4x A100DGX-Station-A100SQuAD v1.1A100 SXM-80GB
DLRM1,577,960 samples/sec730.76 samples/sec/watt8x A100Gigabyte G482-Z54Criteo 1TB Click LogsA100 PCIe-40GB
2,115,950 samples/sec619.54 samples/sec/watt8x A100DGX A100Criteo 1TB Click LogsA100 SXM-80GB
974,571 samples/sec762.64 samples/sec/watt4x A100DGX-Station-A100Criteo 1TB Click LogsA100 SXM-80GB

Power Efficiency Server Scenario - Closed Division

Network | Throughput | Throughput per Watt | GPU | Server | Dataset | GPU Version
ResNet-50 v1.5184,984 queries/sec82.39 queries/sec/watt8x A100Gigabyte G482-Z54ImageNetA100 PCIe-40GB
239,991 queries/sec69.53 queries/sec/watt8x A100DGX A100ImageNetA100 SXM-80GB
106,988 queries/sec84.51 queries/sec/watt4x A100DGX-Station-A100ImageNetA100 SXM-80GB
SSD ResNet-345,702 queries/sec2.52 queries/sec/watt8x A100Gigabyte G482-Z54COCOA100 PCIe-40GB
6,301 queries/sec1.82 queries/sec/watt8x A100DGX A100COCOA100 SXM-80GB
3,081 queries/sec2.43 queries/sec/watt4x A100DGX-Station-A100COCOA100 SXM-80GB
RNN-T74,974 queries/sec32.25 queries/sec/watt8x A100Gigabyte G482-Z54LibriSpeechA100 PCIe-40GB
87,984 queries/sec24.78 queries/sec/watt8x A100DGX A100LibriSpeechA100 SXM-80GB
43,389 queries/sec33.03 queries/sec/watt4x A100DGX-Station-A100LibriSpeechA100 SXM-80GB
BERT17,499 queries/sec7.58 queries/sec/watt8x A100Gigabyte G482-Z54SQuAD v1.1A100 PCIe-40GB
21,492 queries/sec6.03 queries/sec/watt8x A100DGX A100SQuAD v1.1A100 SXM-80GB
10,203 queries/sec7.84 queries/sec/watt4x A100DGX-Station-A100SQuAD v1.1A100 SXM-80GB
DLRM2,001,940 queries/sec575.72 queries/sec/watt8x A100DGX A100Criteo 1TB Click LogsA100 SXM-80GB
890,334 queries/sec663.62 queries/sec/watt4x A100DGX-Station-A100Criteo 1TB Click LogsA100 SXM-80GB

MLPerf™ v1.0 Inference Closed: ResNet-50 v1.5, SSD ResNet-34, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net, DLRM 99.9% of FP32 accuracy target: 1.0-25, 1.0-26, 1.0-29, 1.0-30, 1.0-32, 1.0-55, 1.0-57. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
BERT-Large sequence length = 384.
DLRM samples refer to an average of 270 user-item pairs per sample
A10 and A30 results are preview submissions
For MLPerf™ data across various scenarios, click here
For MLPerf™ latency constraints, click here
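Because the power-efficiency tables report both raw throughput and throughput per watt, average board power can be backed out by simple division. A quick sketch (the function name is illustrative):

```python
# Back out average total board power from a power-efficiency row:
# watts = throughput (samples/sec) / efficiency (samples/sec/watt).
def board_power(throughput, per_watt, num_gpus):
    total_w = throughput / per_watt
    return total_w, total_w / num_gpus

# ResNet-50 v1.5, 8x A100 PCIe-40GB offline: 213,599 samples/sec at 97.31 samples/sec/watt
total_w, per_gpu_w = board_power(213_599, 97.31, 8)
print(f"~{total_w:.0f} W total, ~{per_gpu_w:.0f} W per GPU")  # ~2195 W total, ~274 W per GPU
```

This puts measured per-GPU board power in the vicinity of the A100 PCIe power rating, which is what the efficiency metric is normalizing against.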


NVIDIA Triton Inference Server Delivered Comparable Performance to the Custom Harness in MLPerf™ v1.0


NVIDIA landed top performance spots on all MLPerf™ Inference 1.0 tests, the AI industry's leading benchmark. For inference submissions, NVIDIA has typically used a custom A100 inference serving harness, designed and optimized specifically to deliver the highest possible inference performance on MLPerf™ workloads, which require running inference on bare metal.

MLPerf™ v1.0 A100 Inference Closed: ResNet-50 v1.5, SSD ResNet-34, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net, DLRM 99% of FP32 accuracy target: 1.0-30, 1.0-31. MLPerf name and logo are trademarks. See www.mlcommons.org for more information.​

The chart compares the performance of Triton to the custom MLPerf™ serving harness across five TensorRT networks on A100 SXM-80GB on bare metal. The results show that Triton is highly efficient, delivering performance nearly identical to that of the highly optimized MLPerf™ harness.


NVIDIA Client Batch Size=1 Performance with Triton Inference Server

Network | Accelerator | Training Framework | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version
ResNet-50 V1.5 InferenceA100-PCIE-40GBPyTorchTensorRTFP16216425661.024,197 inf/sec-20.07-py3
ResNet-50 V1.5 InferenceA100-SXM4-40GBPyTorchTensorRTTF32216425648.355,294 inf/sec-21.03-py3
ResNet-50 V1.5 InferenceNVIDIA T4PyTorchTensorRTFP161164256257.91992 inf/sec-20.07-py3
ResNet-50 V1.5 InferenceV100 SXM2-32GBPyTorchTensorRTFP324164384215.791,781 inf/sec-21.03-py3
BERT Large InferenceA100-PCIE-40GBTensorFlowTensorRTFP161181617.48915 inf/sec38420.09-py3
BERT Large InferenceA100-SXM4-40GBTensorFlowTensorRTINT82186456.341,136 inf/sec38420.09-py3
BERT Large InferenceNVIDIA T4TensorFlowTensorRTFP161181681.14197 inf/sec38420.09-py3
DLRM InferenceA100-PCIE-40GBPyTorchTorchscriptMixed2165,536242.529,521 inf/sec-21.05-py3
DLRM InferenceA100-SXM4-40GBPyTorchTorchscriptFP162165,536302.7111,076 inf/sec-21.03-py3
DLRM InferenceV100-SXM2-32GBPyTorchTorchscriptMixed2165,536304.247,068 inf/sec-21.05-py3
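The throughput and latency columns are mutually consistent via Little's law (steady-state throughput ≈ in-flight requests / average latency). Reading the first ResNet-50 row as 256 concurrent client requests at 61.02 ms latency, a rough check:

```python
# Little's law: steady-state throughput ≈ concurrent in-flight requests / average latency.
def expected_throughput(concurrency, latency_ms):
    return concurrency / (latency_ms / 1000.0)

# ResNet-50 v1.5 on A100-PCIE-40GB: 256 concurrent requests, 61.02 ms latency
print(round(expected_throughput(256, 61.02)))  # ≈ 4195, close to the 4,197 inf/sec reported
```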


Inference Performance of NVIDIA A100, A40, A30, A10, V100 and T4

Benchmarks are reproducible by following links to NGC scripts

Inference Natural Language Processing

BERT Inference Throughput

DGX A100 server w/ 1x NVIDIA A100 with 7 MIG instances of 1g.5gb | Batch Size = 94 | Precision: INT8 | Sequence Length = 128
DGX-1 server w/ 1x NVIDIA V100 | TensorRT 7.1 | Batch Size = 256 | Precision: Mixed | Sequence Length = 128


NVIDIA A100 BERT Inference Benchmarks

Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
BERT-Large with Sparsity | Attention | 94 | 6,188 sequences/sec | - | - | 1x A100 | DGX A100 | - | INT8 | SQuAD v1.1 | - | A100 SXM4-40GB

A100 with 7 MIG instances of 1g.5gb | Sequence length=128 | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container

Inference Image Classification on CNNs with TensorRT

ResNet-50 v1.5 Throughput

DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 7.2 | Batch Size = 128 | 21.05-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC@2.25GHz w/ 1x NVIDIA A40 | TensorRT 7.2 | Batch Size = 128 | 21.05-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-SW-QZ-001: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 7.2 | Batch Size = 128 | 21.05-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC@2.25GHz w/ 1x NVIDIA A10 | TensorRT 7.2 | Batch Size = 128 | 21.05-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 7.2 | Batch Size = 128 | 21.05-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 7.2 | Batch Size = 128 | 21.05-py3 | Precision: INT8 | Dataset: Synthetic


ResNet-50 v1.5 Power Efficiency

DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 7.2 | Batch Size = 128 | 21.05-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC@2.25GHz w/ 1x NVIDIA A40 | TensorRT 7.2 | Batch Size = 128 | 21.05-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-SW-QZ-001: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 7.2 | Batch Size = 128 | 21.05-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC@2.25GHz w/ 1x NVIDIA A10 | TensorRT 7.2 | Batch Size = 128 | 21.05-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 7.2 | Batch Size = 128 | 21.05-py3 | Precision: Mixed | Dataset: Synthetic
Supermicro SYS-4029GP-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 7.2 | Batch Size = 128 | 21.05-py3 | Precision: INT8 | Dataset: Synthetic


A100 1/7 MIG Inference Performance

Network | Batch Size | 1/7 MIG Throughput | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-5011,445 images/sec0.691x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
22,250 images/sec0.891x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
83,508 images/sec2.281x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
284,035 images/sec6.941x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
1284,492 images/sec28.51x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
ResNet-50v1.511,406 images/sec0.711x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
22,170 images/sec0.921x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
83,375 images/sec2.371x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
273,956 images/sec6.831x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
1284,333 images/sec29.541x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
BERT-BASE1800 sequences/sec1.251x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
21,192 sequences/sec1.681x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
4474 sequences/sec8.551x A100DGX A10020.11-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
81,706 sequences/sec4.691x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
1282,175 sequences/sec58.851x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
BERT-LARGE1268 sequences/sec3.731x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
2386 sequences/sec5.181x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
5517 sequences/sec9.841x A100DGX A10020.11-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
8558 sequences/sec14.331x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
128677 sequences/sec189.21x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB

A hyphen in the Container column indicates a pre-release container | A hyphen in the Server column indicates a pre-production server
BERT-Large: Sequence Length = 128

A100 7 MIG Inference Performance

Network | Batch Size | 7 MIG Throughput | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50110,017 images/sec0.71x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
215,345 images/sec0.911x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
824,035 images/sec2.331x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
2728,369 images/sec6.671x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
12831,301 images/sec28.631x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
ResNet-50v1.519,875 images/sec0.711x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
215,159 images/sec0.941x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
823,306 images/sec2.41x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
2727,498 images/sec6.871x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
12830,179 images/sec29.731x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
BERT-BASE15,595 sequences/sec1.271x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
28,298 sequences/sec1.691x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
43,284 sequences/sec8.551x A100DGX A10020.11-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
811,878 sequences/sec4.731x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
12814,393 sequences/sec62.371x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
BERT-LARGE11,880 sequences/sec3.761x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
22,694 sequences/sec5.211x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
53,575 sequences/sec9.841x A100DGX A10020.11-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
83,838 sequences/sec14.631x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
1284,467 sequences/sec200.761x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100 SXM4-40GB

A hyphen in the Container column indicates a pre-release container | A hyphen in the Server column indicates a pre-production server
BERT-Large: Sequence Length = 128
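Comparing this table with the 1/7 MIG table above gives a rough measure of how close MIG partitioning comes to linear scaling; using the ResNet-50 batch-128 rows:

```python
# MIG scaling efficiency: aggregate 7-instance throughput vs. 7x a single
# 1g.5gb instance, ResNet-50 at batch size 128 (values from the tables above).
single_mig_ips = 4_492    # images/sec, one 1g.5gb MIG instance
seven_mig_ips = 31_301    # images/sec, all seven instances together
efficiency = seven_mig_ips / (7 * single_mig_ips)
print(f"{efficiency:.1%}")  # ≈ 99.5% of linear scaling
```

Near-linear scaling is expected here because each MIG instance gets a dedicated slice of SMs and memory bandwidth, so the seven instances run with essentially no interference.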

A100 Full Chip Inference Performance

Network | Batch Size | Full Chip Throughput | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-5023,995 images/sec0.51x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100-SXM4-40GB
811,418 images/sec0.71x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100-SXM4-40GB
12829,109 images/sec4.41x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100-SXM-80GB
21131,257 images/sec6.751x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100-SXM-80GB
ResNet-50v1.524,020 images/sec0.51x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100-SXM-80GB
811,095 images/sec0.721x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100-SXM-80GB
12828,222 images/sec4.541x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100-SXM-80GB
20830,039 images/sec6.921x A100DGX A10021.05-py3INT8SyntheticTensorRT 7.2A100-SXM-80GB
ResNext101327,674 samples/sec4.171x A100--INT8SyntheticTensorRT 7.2A100-SXM4-40GB
EfficientNet-B012822,346 images/sec5.731x A100--INT8SyntheticTensorRT 7.2A100-SXM4-40GB
BERT-BASE22,599 sequences/sec0.771x A100DGX A10021.05-py3INT8Sample TextTensorRT 7.2A100-SXM4-40GB
86,889 sequences/sec1.161x A100DGX A10021.05-py3INT8Sample TextTensorRT 7.2A100-SXM4-40GB
12813,661 sequences/sec9.371x A100DGX A10021.05-py3INT8Sample TextTensorRT 7.2A100-SXM4-40GB
25614,490 sequences/sec17.671x A100DGX A100-INT8Sample TextTensorRT 7.2A100-SXM4-40GB
BERT-LARGE21,093 sequences/sec1.831x A100DGX A10021.05-py3INT8Sample TextTensorRT 7.2A100-SXM4-40GB
82,311 sequences/sec3.461x A100DGX A10021.05-py3INT8Sample TextTensorRT 7.2A100-SXM-80GB
1284,515 sequences/sec28.351x A100DGX A10021.05-py3INT8Sample TextTensorRT 7.2A100-SXM-80GB
2564,679 sequences/sec54.711x A100--INT8Sample TextTensorRT 7.2A100-SXM4-80GB

A hyphen in the Container column indicates a pre-release container | A hyphen in the Server column indicates a pre-production server
BERT-Large: Sequence Length = 128
For BS=1 inference refer to the Triton Inference Server section


A40 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-5024,523 images/sec24 images/sec/watt0.441x A40GIGABYTE G482-Z52-0021.05-py3INT8SyntheticTensorRT 7.2A40
89,279 images/sec40 images/sec/watt0.861x A40GIGABYTE G482-Z52-0021.05-py3INT8SyntheticTensorRT 7.2A40
10916,738 images/sec- images/sec/watt6.511x A40GIGABYTE G482-Z52-0021.05-py3INT8SyntheticTensorRT 7.2A40
12816,286 images/sec54 images/sec/watt7.861x A40GIGABYTE G482-Z52-0021.05-py3INT8SyntheticTensorRT 7.2A40
ResNet-50v1.524,451 images/sec24 images/sec/watt0.451x A40GIGABYTE G482-Z52-0021.05-py3INT8SyntheticTensorRT 7.2A40
88,903 images/sec38 images/sec/watt0.91x A40GIGABYTE G482-Z52-0021.05-py3INT8SyntheticTensorRT 7.2A40
10915,892 images/sec- images/sec/watt6.861x A40GIGABYTE G482-Z52-0021.05-py3INT8SyntheticTensorRT 7.2A40
12815,510 images/sec52 images/sec/watt8.251x A40GIGABYTE G482-Z52-0021.05-py3INT8SyntheticTensorRT 7.2A40
BERT-BASE22,351 sequences/sec13 sequences/sec/watt0.851x A40GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A40
84,494 sequences/sec19 sequences/sec/watt1.781x A40GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A40
1287,180 sequences/sec27 sequences/sec/watt17.831x A40GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A40
BERT-LARGE2893 sequences/sec4 sequences/sec/watt2.241x A40GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A40
81,666 sequences/sec6 sequences/sec/watt4.81x A40GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A40
1282,216 sequences/sec9 sequences/sec/watt57.761x A40GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A40

NGC: TensorRT Container
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container


A30 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-5023,487 images/sec38 images/sec/watt0.571x A30GIGABYTE G482-Z52-SW-QZ-00121.03-py3INT8SyntheticTensorRT 7.2A30
88,497 images/sec70 images/sec/watt0.941x A30GIGABYTE G482-Z52-SW-QZ-00121.03-py3INT8SyntheticTensorRT 7.2A30
10014,963 images/sec- images/sec/watt6.681x A30GIGABYTE G482-Z52-SW-QZ-00121.05-py3INT8SyntheticTensorRT 7.2A30
12815,246 images/sec93 images/sec/watt8.41x A30GIGABYTE G482-Z52-SW-QZ-00121.05-py3INT8SyntheticTensorRT 7.2A30
ResNet-50v1.523,498 images/sec37 images/sec/watt0.571x A30GIGABYTE G482-Z52-SW-QZ-00121.03-py3INT8SyntheticTensorRT 7.2A30
88,330 images/sec68 images/sec/watt0.961x A30GIGABYTE G482-Z52-SW-QZ-00121.03-py3INT8SyntheticTensorRT 7.2A30
9514,191 images/sec- images/sec/watt6.691x A30GIGABYTE G482-Z52-SW-QZ-00121.05-py3INT8SyntheticTensorRT 7.2A30
12814,568 images/sec88 images/sec/watt8.791x A30GIGABYTE G482-Z52-SW-QZ-00121.05-py3INT8SyntheticTensorRT 7.2A30
BERT-BASE22,132 sequences/sec22 sequences/sec/watt0.941x A30GIGABYTE G482-Z52-SW-QZ-00121.05-py3INT8Sample TextTensorRT 7.2A30
84,515 sequences/sec34 sequences/sec/watt1.771x A30GIGABYTE G482-Z52-SW-QZ-00121.05-py3INT8Sample TextTensorRT 7.2A30
1286,940 sequences/sec48 sequences/sec/watt18.441x A30GIGABYTE G482-Z52-SW-QZ-00121.05-py3INT8Sample TextTensorRT 7.2A30
BERT-LARGE2837 sequences/sec7 sequences/sec/watt2.391x A30GIGABYTE G482-Z52-SW-QZ-00121.05-py3INT8Sample TextTensorRT 7.2A30
81,441 sequences/sec12 sequences/sec/watt5.551x A30GIGABYTE G482-Z52-SW-QZ-00121.05-py3INT8Sample TextTensorRT 7.2A30
1282,211 sequences/sec15 sequences/sec/watt57.91x A30GIGABYTE G482-Z52-SW-QZ-00121.05-py3INT8Sample TextTensorRT 7.2A30

NGC: TensorRT Container
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container


A10 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-5024,282 images/sec29 images/sec/watt0.471x A10GIGABYTE G482-Z52-0021.05-py3INT8SyntheticTensorRT 7.2A10
87,778 images/sec52 images/sec/watt1.031x A10GIGABYTE G482-Z52-0021.05-py3INT8SyntheticTensorRT 7.2A10
7510,846 images/sec- images/sec/watt6.821x A10GIGABYTE G482-Z52-0021.05-py3INT8SyntheticTensorRT 7.2A10
12811,520 images/sec77 images/sec/watt11.111x A10GIGABYTE G482-Z52-0021.05-py3INT8SyntheticTensorRT 7.2A10
ResNet-50v1.524,187 images/sec28 images/sec/watt0.481x A10GIGABYTE G482-Z52-0021.05-py3INT8SyntheticTensorRT 7.2A10
87,434 images/sec50 images/sec/watt1.081x A10GIGABYTE G482-Z52-0021.05-py3INT8SyntheticTensorRT 7.2A10
7010,785 images/sec- images/sec/watt6.491x A10GIGABYTE G482-Z52-0021.05-py3INT8SyntheticTensorRT 7.2A10
12810,928 images/sec73 images/sec/watt11.711x A10GIGABYTE G482-Z52-0021.05-py3INT8SyntheticTensorRT 7.2A10
BERT-BASE22,095 sequences/sec17 sequences/sec/watt0.951x A10GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A10
83,598 sequences/sec26 sequences/sec/watt2.221x A10GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A10
1284,764 sequences/sec35 sequences/sec/watt26.871x A10GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A10
BERT-LARGE2779 sequences/sec6 sequences/sec/watt2.571x A10GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A10
81,288 sequences/sec10 sequences/sec/watt6.211x A10GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A10
1281,368 sequences/sec10 sequences/sec/watt93.571x A10Supermicro SYS-1029GQ-TRT20.11-py3INT8Sample TextTensorRT 7.2A10

NGC: TensorRT Container
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container


V100 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-5021,995 images/sec11 images/sec/watt11x V100DGX-221.05-py3MixedSyntheticTensorRT 7.2.3V100-SXM3-32GB
84,305 images/sec17 images/sec/watt1.861x V100DGX-221.05-py3MixedSyntheticTensorRT 7.2.3V100-SXM3-32GB
527,909 images/sec- images/sec/watt6.571x V100DGX-221.05-py3INT8SyntheticTensorRT 7.2.3V100-SXM3-32GB
1288,202 images/sec24 images/sec/watt15.611x V100DGX-221.05-py3MixedSyntheticTensorRT 7.2.3V100-SXM3-32GB
ResNet-50v1.522,005 images/sec11 images/sec/watt11x V100DGX-221.05-py3MixedSyntheticTensorRT 7.2V100-SXM3-32GB
84,221 images/sec16 images/sec/watt1.91x V100DGX-221.05-py3INT8SyntheticTensorRT 7.2V100-SXM3-32GB
527,545 images/sec- images/sec/watt6.891x V100DGX-221.05-py3MixedSyntheticTensorRT 7.2V100-SXM3-32GB
1287,857 images/sec23 images/sec/watt16.291x V100DGX-221.05-py3INT8SyntheticTensorRT 7.2V100-SXM3-32GB
BERT-BASE21,219 sequences/sec6 sequences/sec/watt1.641x V100DGX-221.05-py3MixedSample TextTensorRT 7.2V100-SXM3-32GB
82,300 sequences/sec8 sequences/sec/watt3.481x V100DGX-221.05-py3MixedSample TextTensorRT 7.2V100-SXM3-32GB
1283,206 sequences/sec10 sequences/sec/watt39.931x V100DGX-221.05-py3MixedSample TextTensorRT 7.2V100-SXM3-32GB
BERT-LARGE2530 sequences/sec2 sequences/sec/watt3.781x V100DGX-221.05-py3MixedSample TextTensorRT 7.2V100-SXM3-32GB
8793 sequences/sec3 sequences/sec/watt10.091x V100DGX-221.05-py3INT8Sample TextTensorRT 7.2V100-SXM3-32GB
128978 sequences/sec3 sequences/sec/watt130.921x V100DGX-220.11-py3INT8Sample TextTensorRT 7.2V100-SXM3-32GB

NGC: TensorRT Container
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container
For BS=1 inference refer to the Triton Inference Server section


T4 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-5022,105 images/sec30 images/sec/watt0.951x T4Supermicro SYS-1029GQ-TRT21.03-py3INT8SyntheticTensorRT 7.2NVIDIA T4
83,846 images/sec55 images/sec/watt2.081x T4Supermicro SYS-4029GP-TRT21.05-py3INT8SyntheticTensorRT 7.2NVIDIA T4
324,798 images/sec- images/sec/watt6.671x T4Supermicro SYS-4029GP-TRT21.05-py3INT8SyntheticTensorRT 7.2NVIDIA T4
1285,358 images/sec77 images/sec/watt23.891x T4Supermicro SYS-4029GP-TRT21.05-py3INT8SyntheticTensorRT 7.2NVIDIA T4
ResNet-50v1.522,092 images/sec30 images/sec/watt0.961x T4Supermicro SYS-1029GQ-TRT21.03-py3INT8SyntheticTensorRT 7.2NVIDIA T4
83,698 images/sec53 images/sec/watt2.161x T4Supermicro SYS-4029GP-TRT21.05-py3INT8SyntheticTensorRT 7.2NVIDIA T4
294,540 images/sec- images/sec/watt6.391x T4Supermicro SYS-4029GP-TRT21.05-py3INT8SyntheticTensorRT 7.2NVIDIA T4
1284,960 images/sec71 images/sec/watt25.811x T4Supermicro SYS-4029GP-TRT21.05-py3INT8SyntheticTensorRT 7.2NVIDIA T4
BERT-BASE21,089 sequences/sec18 sequences/sec/watt1.841x T4Supermicro SYS-4029GP-TRT21.05-py3INT8Sample TextTensorRT 7.2NVIDIA T4
81,677 sequences/sec25 sequences/sec/watt4.771x T4Supermicro SYS-4029GP-TRT21.05-py3INT8Sample TextTensorRT 7.2NVIDIA T4
1281,818 sequences/sec28 sequences/sec/watt70.41x T4Supermicro SYS-1029GQ-TRT20.11-py3INT8Sample TextTensorRT 7.2NVIDIA T4
BERT-LARGE2386 sequences/sec6 sequences/sec/watt5.181x T4Supermicro SYS-4029GP-TRT21.05-py3INT8Sample TextTensorRT 7.2NVIDIA T4
8551 sequences/sec9 sequences/sec/watt14.521x T4Supermicro SYS-4029GP-TRT21.05-py3INT8Sample TextTensorRT 7.2NVIDIA T4
128561 sequences/sec8 sequences/sec/watt227.981x T4Supermicro SYS-1029GQ-TRT20.11-py3INT8Sample TextTensorRT 7.2NVIDIA T4

NGC: TensorRT Container
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container
For BS=1 inference refer to the Triton Inference Server section

NVIDIA Jarvis is an application framework for multimodal conversational AI services that delivers real-time performance on GPUs. Jarvis 1.0 Beta includes fully optimized pipelines for Automatic Speech Recognition (ASR), Natural Language Processing (NLP), and Text-to-Speech (TTS) that can be used to deploy real-time conversational AI apps such as transcription, virtual assistants, and chatbots. Please visit Jarvis – Getting Started to download and get started with Jarvis.
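Throughput in the ASR tables below is reported as RTFX, commonly defined as the total duration of audio processed per unit of wall-clock time, so N real-time streams served without backlog give RTFX ≈ N. A sketch of the metric:

```python
# RTFX: seconds of audio processed per second of wall-clock time,
# aggregated across all concurrent streams.
def rtfx(num_streams, audio_seconds_per_stream, wall_clock_seconds):
    return num_streams * audio_seconds_per_stream / wall_clock_seconds

# 512 real-time streams with no backlog: one second of audio per stream
# is consumed every wall-clock second, so RTFX equals the stream count.
print(rtfx(512, 60.0, 60.0))  # → 512.0
```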


Jarvis Benchmarks

Automatic Speech Recognition

A100 Best Streaming Throughput Mode (800 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet114.41A100 SXM4-40GB
Quartznet256254.364A100 SXM4-40GB
Quartznet512351.2506A100 SXM4-40GB
Quartznet1024630.81005A100 SXM4-40GB
Jasper117.61A100 SXM4-40GB
Jasper256244.9254A100 SXM4-40GB
Jasper512381507A100 SXM4-40GB
Jasper1024749.31,004A100 SXM4-40GB

A100 Best Streaming Latency Mode (100 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 9.6 | 1 | A100 SXM4-40GB
Quartznet | 16 | 25.9 | 16 | A100 SXM4-40GB
Quartznet | 128 | 132.4 | 128 | A100 SXM4-40GB
Jasper | 1 | 13.4 | 1 | A100 SXM4-40GB
Jasper | 16 | 26.3 | 16 | A100 SXM4-40GB
Jasper | 128 | 258.9 | 128 | A100 SXM4-40GB

A100 Offline Mode (3200 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 28.1 | 1 | A100 SXM4-40GB
Quartznet | 512 | 566.5 | 505 | A100 SXM4-40GB
Quartznet | 1,024 | 899.3 | 1,000 | A100 SXM4-40GB
Quartznet | 1,512 | 1,303.8 | 1,460 | A100 SXM4-40GB
Jasper | 1 | 31 | 1 | A100 SXM4-40GB
Jasper | 512 | 667.5 | 504 | A100 SXM4-40GB
Jasper | 1,024 | 1,089 | 997 | A100 SXM4-40GB
Jasper | 1,512 | 1,753.8 | 1,449 | A100 SXM4-40GB

V100 Best Streaming Throughput Mode (800 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 14.4 | 1 | V100 SXM2-16GB
Quartznet | 256 | 222.2 | 254 | V100 SXM2-16GB
Quartznet | 512 | 385.2 | 505 | V100 SXM2-16GB
Quartznet | 768 | 574.5 | 752 | V100 SXM2-16GB
Jasper | 1 | 26.8 | 1 | V100 SXM2-16GB
Jasper | 128 | 239.4 | 127 | V100 SXM2-16GB
Jasper | 256 | 416 | 253 | V100 SXM2-16GB
Jasper | 512 | 969.7 | 500 | V100 SXM2-16GB

V100 Best Streaming Latency Mode (100 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 8.8 | 1 | V100 SXM2-16GB
Quartznet | 16 | 22.4 | 16 | V100 SXM2-16GB
Quartznet | 128 | 114.7 | 127 | V100 SXM2-16GB
Jasper | 1 | 21.5 | 1 | V100 SXM2-16GB
Jasper | 16 | 36.9 | 16 | V100 SXM2-16GB
Jasper | 64 | 406.4 | 64 | V100 SXM2-16GB
Jasper | 512 | 969.7 | 500 | V100 SXM2-16GB

V100 Offline Mode (3200 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 32.93 | 1 | V100 SXM2-16GB
Quartznet | 256 | 461.44 | 253 | V100 SXM2-16GB
Quartznet | 512 | 784.73 | 502 | V100 SXM2-16GB
Quartznet | 768 | 1,121.6 | 747 | V100 SXM2-16GB
Quartznet | 1,024 | 1,551.5 | 986 | V100 SXM2-16GB
Jasper | 1 | 48.35 | 1 | V100 SXM2-16GB
Jasper | 256 | 734.99 | 252 | V100 SXM2-16GB
Jasper | 512 | 1,423.3 | 498 | V100 SXM2-16GB
Jasper | 768 | 2,190.2 | 730 | V100 SXM2-16GB

T4 Best Streaming Throughput Mode (800 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 33.18 | 1 | NVIDIA T4
Quartznet | 64 | 162.63 | 64 | NVIDIA T4
Quartznet | 128 | 263.6 | 127 | NVIDIA T4
Quartznet | 256 | 449.28 | 253 | NVIDIA T4
Quartznet | 384 | 732.75 | 376 | NVIDIA T4
Jasper | 1 | 72.37 | 1 | NVIDIA T4
Jasper | 64 | 259.64 | 64 | NVIDIA T4
Jasper | 128 | 450.81 | 127 | NVIDIA T4
Jasper | 256 | 1,200.8 | 249 | NVIDIA T4

T4 Best Streaming Latency Mode (100 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 19.2 | 1 | NVIDIA T4
Quartznet | 16 | 56.4 | 16 | NVIDIA T4
Quartznet | 64 | 242.4 | 64 | NVIDIA T4
Jasper | 1 | 46.9 | 1 | NVIDIA T4
Jasper | 8 | 51.1 | 8 | NVIDIA T4
Jasper | 16 | 84.4 | 16 | NVIDIA T4

T4 Offline Mode (3200 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 157.62 | 1 | NVIDIA T4
Quartznet | 256 | 906.17 | 251 | NVIDIA T4
Quartznet | 512 | 1,515.2 | 495 | NVIDIA T4
Jasper | 1 | 96.20 | 1 | NVIDIA T4
Jasper | 256 | 1,758.4 | 247 | NVIDIA T4

ASR Throughput (RTFX) - Number of seconds of audio processed per second | Audio Chunk Size – Server side configuration indicating the amount of new data to be considered by the acoustic model | ASR Dataset: Librispeech | The latency numbers were measured using the streaming recognition mode, with the BERT-Base punctuation model enabled, a 4-gram language model, a decoder beam width of 128 and timestamps enabled. The client and the server were using audio chunks of the same duration (100ms, 800ms, 3200ms depending on the server configuration). The Jarvis streaming client jarvis_streaming_asr_client, provided in the Jarvis client image was used with the --simulate_realtime flag to simulate transcription from a microphone, where each stream was doing 5 iterations over a sample audio file from the Librispeech dataset (1272-135031-0000.wav) | Jarvis version: v1.0.0-b1 | Hardware: NVIDIA DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
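RTFX, as defined above, is the total seconds of audio processed per wall-clock second. Because --simulate_realtime paces every stream at real time, aggregate throughput is expected to track the stream count. A minimal illustration (the function is our own sketch, not part of the Jarvis client):

```python
def rtfx(audio_seconds_processed, wall_clock_seconds):
    """Real-time factor: seconds of audio processed per second of wall time."""
    return audio_seconds_processed / wall_clock_seconds

# 512 real-time streams each feed 60 s of audio over roughly 60 s of wall time,
# so throughput approaches 512 RTFX (the streaming tables above report ~505-507).
print(rtfx(512 * 60.0, 60.0))  # 512.0
```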

Natural Language Processing

A100 Benchmarks
Task | # of streams | Avg Latency (ms) | Throughput (seq/sec) | GPU Version
NER | 1 | 3.19 | 311 | A100 SXM4-40GB
NER | 256 | 95.5 | 2,549 | A100 SXM4-40GB
Q&A | 1 | 4.95 | 201 | A100 SXM4-40GB
Q&A | 128 | 279 | 453 | A100 SXM4-40GB

V100 Benchmarks
Task | # of streams | Avg Latency (ms) | Throughput (seq/sec) | GPU Version
NER | 1 | 4.87 | 204 | V100 SXM2-16GB
NER | 256 | 135 | 1,797 | V100 SXM2-16GB
Q&A | 1 | 7.47 | 134 | V100 SXM2-16GB
Q&A | 128 | 521 | 244 | V100 SXM2-16GB

T4 Benchmarks
Task | # of streams | Avg Latency (ms) | Throughput (seq/sec) | GPU Version
NER | 1 | 9.31 | 107 | NVIDIA T4
NER | 256 | 255 | 960 | NVIDIA T4
Q&A | 1 | 11.5 | 87 | NVIDIA T4
Q&A | 128 | 571 | 223 | NVIDIA T4

Named Entity Recognition (NER): 128 seq len, BERT-base | Question Answering (QA): 384 seq len, BERT-large | NLP Throughput (seq/s) - Number of sequences processed per second | Performance of the Jarvis named entity recognition (NER) service (using a BERT-base model, sequence length of 128) and the Jarvis question answering (QA) service (using a BERT-large model, sequence length of 384) was measured in Jarvis. Batch size 1 latency and maximum throughput were measured. Jarvis version: v1.0.0-b1 | Hardware: NVIDIA DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
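Since maximum throughput is measured at a fixed concurrency, Little's law (throughput ≈ concurrent requests / average latency) offers a quick consistency check on the rows above. The helper below is our own sketch, not a Jarvis API:

```python
def ideal_throughput_seq_per_sec(streams, avg_latency_ms):
    """Little's law upper bound: concurrent requests divided by average latency."""
    return streams / (avg_latency_ms / 1000.0)

# V100 NER row above: 256 streams at 135 ms average latency
print(round(ideal_throughput_seq_per_sec(256, 135)))  # ~1896; 1,797 seq/sec was measured
```

The measured figure sits a few percent below the bound, as expected once queuing and batching overheads are included.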

Text to Speech

A100 Benchmarks
# of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version
1 | 0.06 | 0.04 | 20 | A100 SXM4-40GB
4 | 0.48 | 0.03 | 37 | A100 SXM4-40GB
6 | 0.69 | 0.03 | 42 | A100 SXM4-40GB
8 | 0.88 | 0.03 | 46 | A100 SXM4-40GB
10 | 1.06 | 0.03 | 49 | A100 SXM4-40GB

V100 Benchmarks
# of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version
1 | 0.08 | 0.05 | 14 | V100 SXM2-16GB
4 | 0.77 | 0.05 | 23 | V100 SXM2-16GB
6 | 1.11 | 0.05 | 26 | V100 SXM2-16GB
8 | 1.4 | 0.06 | 28 | V100 SXM2-16GB
10 | 1.74 | 0.07 | 28 | V100 SXM2-16GB

T4 Benchmarks
# of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version
1 | 0.12 | 0.07 | 11 | NVIDIA T4
4 | 1.02 | 0.07 | 17 | NVIDIA T4
6 | 1.59 | 0.07 | 18 | NVIDIA T4
8 | 2.13 | 0.08 | 19 | NVIDIA T4
10 | 2.55 | 0.1 | 18 | NVIDIA T4

TTS Throughput (RTFX) - Number of seconds of audio generated per second | Dataset: LJSpeech | Performance of the Jarvis text-to-speech (TTS) service was measured for different numbers of parallel streams. Each parallel stream performed 10 iterations over 10 input strings from the LJSpeech dataset. Latency to first audio chunk, latency between successive audio chunks, and throughput were measured. Jarvis version: v1.0.0-b1 | Hardware: NVIDIA DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
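Because TTS RTFX counts seconds of audio generated per second across all streams, dividing by the stream count gives the average synthesis speed of a single stream. A small sketch (function name is our own choosing):

```python
def per_stream_rtf(rtfx, streams):
    """Average real-time factor of one stream: audio seconds generated per wall-clock second."""
    return rtfx / streams

# A100 row above: 10 parallel streams at 49 RTFX overall
print(per_stream_rtf(49, 10))  # 4.9, i.e. each stream synthesizes ~4.9x faster than real time
```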

 

Last updated: June 30th, 2021