Reproducible Performance

Reproduce these results on your own systems by following the instructions in the Measuring Training and Inferencing Performance on NVIDIA AI Platforms Reviewer’s Guide.

Related Resources

HPC Performance

Review the latest GPU-acceleration factors of popular HPC applications.


Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology to test whether AI systems are ready to be deployed in the field to deliver meaningful results.
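The measurement behind a time-to-train number can be sketched in a few lines: run training until a held-out metric reaches the specified target, then report the elapsed wall-clock time. The sketch below is a minimal illustration, not NVIDIA's benchmarking harness; the training and evaluation callbacks are stand-ins that simply improve a mock accuracy each epoch.

```python
import time

def train_to_convergence(train_step, evaluate, target_accuracy, max_epochs=100):
    """Run epochs until the evaluation metric reaches the target.

    Returns (epochs_run, minutes_elapsed, final_accuracy); raises if the
    target is never reached within max_epochs.
    """
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_step(epoch)
        accuracy = evaluate()
        if accuracy >= target_accuracy:
            minutes = (time.perf_counter() - start) / 60.0
            return epoch, minutes, accuracy
    raise RuntimeError(f"did not reach {target_accuracy} in {max_epochs} epochs")

# Stand-in "model" whose accuracy improves each epoch; a real run would
# invoke the framework's training loop and validation pass here.
state = {"acc": 0.70}
epochs, minutes, acc = train_to_convergence(
    train_step=lambda e: state.update(acc=state["acc"] + 0.01),
    evaluate=lambda: state["acc"],
    target_accuracy=0.759,  # e.g. MLPerf's 75.90% target for ResNet-50
)
```

The key point is that time to train is only meaningful together with the convergence criterion: a run that stops earlier at a lower accuracy is not comparable.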

Related Resources

Read our blog on convergence for more details.

Get up and running quickly with NVIDIA’s complete solution stack:

  • Pull software containers from NVIDIA NGC.

  • Learn how NVIDIA A100 Tensor Core GPUs provide unprecedented acceleration at every scale, setting records in MLPerf.
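As a concrete starting point, the framework containers referenced in the tables below (for example the `21.08-py3` tag) can be pulled directly from the NGC registry at `nvcr.io`; the image name varies by framework, and running with GPU access assumes the NVIDIA Container Toolkit is installed:

```shell
# Pull a framework container from the NGC registry;
# the 21.08-py3 tag matches the container column in the tables below.
docker pull nvcr.io/nvidia/pytorch:21.08-py3

# Launch it with GPU access (--ipc=host helps PyTorch data-loader workers).
docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:21.08-py3
```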


NVIDIA Performance on MLPerf 1.0 AI Benchmarks

BERT Time to Train on A100

PyTorch | Precision: Mixed | Dataset: Wikipedia 2020/01/01 | Convergence criteria - refer to MLPerf requirements

MLPerf Training Performance

NVIDIA A100 Performance on MLPerf 1.0 AI Benchmarks - Closed Division

Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
MXNet | ResNet-50 v1.5 | 28.77 | 75.90% classification | 8x A100 | DGX A100 | 1.0-1059 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | ResNet-50 v1.5 | 4.91 | 75.90% classification | 64x A100 | DGX A100 | 1.0-1064 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | ResNet-50 v1.5 | 0.58 | 75.90% classification | 1024x A100 | DGX A100 | 1.0-1072 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | ResNet-50 v1.5 | 0.4 | 75.90% classification | 2480x A100 | DGX A100 | 1.0-1076 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | SSD | 8.52 | 23.0% mAP | 8x A100 | DGX A100 | 1.0-1059 | Mixed | COCO2017 | A100-SXM4-80GB
MXNet | SSD | 1.9 | 23.0% mAP | 64x A100 | DGX A100 | 1.0-1064 | Mixed | COCO2017 | A100-SXM4-80GB
MXNet | SSD | 0.48 | 23.0% mAP | 1024x A100 | DGX A100 | 1.0-1072 | Mixed | COCO2017 | A100-SXM4-80GB
MXNet | UNet-3D | 29.16 | 0.908 Mean DICE score | 8x A100 | DGX A100 | 1.0-1059 | Mixed | KiTS19 | A100-SXM4-80GB
MXNet | UNet-3D | 4.68 | 0.908 Mean DICE score | 104x A100 | DGX A100 | 1.0-1066 | Mixed | KiTS19 | A100-SXM4-80GB
MXNet | UNet-3D | 3 | 0.908 Mean DICE score | 800x A100 | DGX A100 | 1.0-1071 | Mixed | KiTS19 | A100-SXM4-80GB
PyTorch | BERT | 21.69 | 0.712 Mask-LM accuracy | 8x A100 | DGX A100 | 1.0-1060 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | BERT | 3.37 | 0.712 Mask-LM accuracy | 64x A100 | DGX A100 | 1.0-1065 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | BERT | 0.73 | 0.712 Mask-LM accuracy | 1024x A100 | DGX A100 | 1.0-1073 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | BERT | 0.32 | 0.712 Mask-LM accuracy | 4096x A100 | DGX A100 | 1.0-1077 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | Mask R-CNN | 50.39 | 0.377 Box min AP and 0.339 Mask min AP | 8x A100 | DGX A100 | 1.0-1060 | Mixed | COCO2017 | A100-SXM4-80GB
PyTorch | Mask R-CNN | 15.75 | 0.377 Box min AP and 0.339 Mask min AP | 32x A100 | DGX A100 | 1.0-1062 | Mixed | COCO2017 | A100-SXM4-80GB
PyTorch | Mask R-CNN | 3.95 | 0.377 Box min AP and 0.339 Mask min AP | 272x A100 | DGX A100 | 1.0-1070 | Mixed | COCO2017 | A100-SXM4-80GB
PyTorch | RNN-T | 38.7 | 0.058 Word Error Rate | 8x A100 | DGX A100 | 1.0-1060 | Mixed | LibriSpeech | A100-SXM4-80GB
PyTorch | RNN-T | 4.41 | 0.058 Word Error Rate | 128x A100 | DGX A100 | 1.0-1068 | Mixed | LibriSpeech | A100-SXM4-80GB
PyTorch | RNN-T | 2.75 | 0.058 Word Error Rate | 1536x A100 | DGX A100 | 1.0-1074 | Mixed | LibriSpeech | A100-SXM4-80GB
TensorFlow | MiniGo | 269.54 | 50% win rate vs. checkpoint | 8x A100 | DGX A100 | 1.0-1061 | Mixed | Go | A100-SXM4-80GB
TensorFlow | MiniGo | 29.32 | 50% win rate vs. checkpoint | 256x A100 | DGX A100 | 1.0-1069 | Mixed | Go | A100-SXM4-80GB
TensorFlow | MiniGo | 15.53 | 50% win rate vs. checkpoint | 1792x A100 | DGX A100 | 1.0-1075 | Mixed | Go | A100-SXM4-80GB
NVIDIA Merlin HugeCTR | DLRM | 1.96 | 0.8025 AUC | 8x A100 | DGX A100 | 1.0-1058 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB
NVIDIA Merlin HugeCTR | DLRM | 1.05 | 0.8025 AUC | 64x A100 | DGX A100 | 1.0-1063 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB
NVIDIA Merlin HugeCTR | DLRM | 0.99 | 0.8025 AUC | 112x A100 | DGX A100 | 1.0-1067 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB
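The table shows that time to train drops as GPUs are added, but not perfectly linearly. A short sketch that derives scaling efficiency from the BERT rows above (speedup over the 8-GPU baseline divided by the ideal speedup):

```python
# Time-to-train figures for BERT taken from the table above: (GPUs, minutes).
bert_runs = [(8, 21.69), (64, 3.37), (1024, 0.73), (4096, 0.32)]

base_gpus, base_minutes = bert_runs[0]

def scaling_efficiency(gpus, minutes):
    """Measured speedup over the 8-GPU run divided by the ideal speedup."""
    speedup = base_minutes / minutes
    ideal = gpus / base_gpus
    return speedup / ideal

# Efficiency at each scale; values below 1.0 reflect communication and
# synchronization overhead that grows with the GPU count.
efficiencies = {gpus: scaling_efficiency(gpus, mins) for gpus, mins in bert_runs[1:]}
```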

Converged Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100

Benchmarks are reproducible by following the links to the NGC catalog scripts.

A100 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.9.0 | ResNet-50 v1.5 | 86 | 77.31 Top 1 Accuracy | 22,892 images/sec | 8x A100 | DGX A100 | 21.08-py3 | Mixed | 192 | ImageNet2012 | A100-SXM-80GB
PyTorch | 1.8.0a0 | Mask R-CNN | 176 | .34 AP Segm | 167 images/sec | 8x A100 | DGX A100 | 21.12-py3 | TF32 | 8 | COCO 2014 | A100-SXM-80GB
PyTorch | 1.6.0a0 | SSD v1.1 | 43 | .25 mAP | 3,092 images/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 128 | COCO 2017 | A100-SXM-80GB
PyTorch | 1.10.0a0 | Tacotron2 | 99 | .56 Training Loss | 306,044 total output mels/sec | 8x A100 | DGX A100 | 21.08-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM-80GB
PyTorch | 1.9.0a0 | WaveGlow | 288 | -5.81 Training Loss | 1,489,519 output samples/sec | 8x A100 | DGX A100 | 21.06-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM-80GB
PyTorch | 1.6.0a0 | Jasper | 3,600 | 3.53 dev-clean WER | 603 sequences/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 64 | LibriSpeech | A100 SXM4-40GB
PyTorch | 1.6.0a0 | Transformer | 167 | 27.76 BLEU Score | 582,721 words/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 10240 | wmt14-en-de | A100 SXM4-40GB
PyTorch | 1.6.0a0 | FastPitch | 216 | .18 Training Loss | 1,040,206 frames/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 32 | LJSpeech 1.1 | A100 SXM4-40GB
PyTorch | 1.10.0a0 | GNMT V2 | 16 | 24.45 BLEU Score | 919,980 total tokens/sec | 8x A100 | DGX A100 | 21.08-py3 | Mixed | 128 | wmt16-en-de | A100-SXM-80GB
PyTorch | 1.10.0a0 | NCF | 0.37 | .96 Hit Rate at 10 | 154,575,327 samples/sec | 8x A100 | DGX A100 | 21.08-py3 | Mixed | 131072 | MovieLens 20M | A100-SXM-80GB
PyTorch | 1.10.0a0 | BERT-LARGE | 3 | 91.05 F1 | 931 sequences/sec | 8x A100 | DGX A100 | 21.08-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM-80GB
PyTorch | 1.9.0a0 | Transformer-XL Large | 408 | 14.03 Perplexity | 202,130 total tokens/sec | 8x A100 | DGX A100 | 21.06-py3 | Mixed | 16 | WikiText-103 | A100-SXM-80GB
PyTorch | 1.10.0a0 | Transformer-XL Base | 209 | 22.53 Perplexity | 629,664 total tokens/sec | 8x A100 | DGX A100 | 21.08-py3 | Mixed | 128 | WikiText-103 | A100-SXM-80GB
PyTorch | 1.6.0a0 | BERT-Large Pre-Training P1 | 2,379 | - | 3,231 sequences/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | - | Wikipedia+BookCorpus | A100-SXM4-40GB
PyTorch | 1.6.0a0 | BERT-Large Pre-Training P2 | 1,377 | 1.34 Final Loss | 630 sequences/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | - | Wikipedia+BookCorpus | A100-SXM4-40GB
PyTorch | 1.6.0a0 | BERT-Large Pre-Training E2E | 3,756 | 1.34 Final Loss | - | 8x A100 | DGX A100 | 20.06-py3 | Mixed | - | Wikipedia+BookCorpus | A100-SXM4-40GB
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 95 | 76.84 Top 1 | 20,366 images/sec | 8x A100 | DGX A100 | 21.08-py3 | Mixed | 256 | ImageNet2012 | A100-SXM-80GB
TensorFlow | 1.15.5 | ResNext101 | 192 | 79.34 Top 1 | 10,078 images/sec | 8x A100 | DGX A100 | 21.08-py3 | Mixed | 256 | ImageNet2012 | A100-SXM-80GB
TensorFlow | 1.15.5 | SE-ResNext101 | 222 | 79.48 Top1 | 8,743 images/sec | 8x A100 | DGX A100 | 21.05-py3 | Mixed | 256 | ImageNet2012 | A100-SXM-80GB
TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 1,038 images/sec | 8x A100 | DGX A100 | 21.08-py3 | Mixed | 2 | DAGM2007 | A100-SXM-80GB
TensorFlow | 1.15.5 | U-Net Medical | 6 | .9 Dice Score | 952 images/sec | 8x A100 | DGX A100 | 21.03-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM-80GB
TensorFlow | 1.15.5 | VAE-CF | 1 | .43 NDCG@100 | 1,599,893 users processed/sec | 8x A100 | DGX A100 | 21.08-py3 | TF32 | 3072 | MovieLens 20M | A100-SXM4-40GB
TensorFlow | 2.5.0 | Wide and Deep | 8 | .66 MAP at 12 | 3,349,230 samples/sec | 8x A100 | DGX A100 | 21.08-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | A100-SXM-80GB
TensorFlow | 1.15.5 | BERT-LARGE | 10 | 91.3 F1 | 865 sequences/sec | 8x A100 | DGX A100 | 21.08-py3 | Mixed | 24 | SQuAD v1.1 | A100-SXM-80GB
TensorFlow | 2.5.0 | Electra Base Fine Tuning | 3 | 92.67 F1 | 2,578 sequences/sec | 8x A100 | DGX A100 | 21.08-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM-80GB
TensorFlow | 2.2.0 | EfficientNet-B4 | 4,231 | 82.81 Top 1 | 2,535 images/sec | 8x A100 | DGX A100 | 20.08-py3 | Mixed | 160 | ImageNet2012 | A100-SXM-80GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
BERT-Large Pre-Training Sequence Length for Phase 1 = 128 and Phase 2 = 512 | Batch Size for Phase 1 = 65,536 and Phase 2 = 32,768
EfficientNet-B4: Mixup = 0.2 | Auto-Augmentation | cuDNN Version = 8.0.5.39 | NCCL Version = 2.7.8

A40 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.9.0a0 | NCF | 1 | .96 Hit Rate at 10 | 59,667,265 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 131072 | MovieLens 20M | A40
PyTorch | 1.10.0a0 | BERT-LARGE | 7 | 91.03 F1 | 414 sequences/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 32 | SQuAD v1.1 | A40

Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384

A30 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.8.0 | ResNet-50 v1.5 | 182 | 77.34 Top1 | 10,739 images/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 192 | ImageNet2012 | A30
PyTorch | 1.9.0a0 | Tacotron2 | 215 | .54 Training Loss | 144,326 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 104 | LJSpeech 1.1 | A30
PyTorch | 1.9.0a0 | WaveGlow | 533 | -5.82 Training Loss | 794,511 output samples/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 10 | LJSpeech 1.1 | A30
PyTorch | 1.9.0a0 | Transformer | 1,108 | 27.58 BLEU Score | 87,584 words/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 2560 | wmt14-en-de | A30
PyTorch | 1.9.0a0 | GNMT V2 | 81 | 24.65 BLEU Score | 219,582 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | TF32 | 128 | wmt16-en-de | A30
PyTorch | 1.9.0a0 | NCF | 1 | .96 Hit Rate at 10 | 56,273,829 samples/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.06-py3 | Mixed | 131072 | MovieLens 20M | A30
PyTorch | 1.10.0a0 | BERT-LARGE | 10 | 91.11 F1 | 282 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 10 | SQuAD v1.1 | A30
PyTorch | 1.9.0a0 | Transformer-XL Base | 151 | 22.16 Perplexity | 219,994 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 32 | WikiText-103 | A30
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 198 | 76.78 Top1 | 9,798 images/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 256 | ImageNet2012 | A30
TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 565 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 2 | DAGM2007 | A30
TensorFlow | 1.15.5 | U-Net Medical | 9 | .9 DICE Score | 461 images/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 8 | EM segmentation challenge | A30
TensorFlow | 1.15.5 | VAE-CF | 1 | .43 NDCG@100 | 854,405 users processed/sec | 8x A30 | GIGABYTE G482-Z52-00 | 21.06-py3 | TF32 | 3072 | MovieLens 20M | A30
TensorFlow | 1.15.5 | SE-ResNext101 | 573 | 79.83 Top1 | 3,399 images/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 96 | ImageNet2012 | A30
TensorFlow | 2.4.0 | Electra Base Fine Tuning | 6 | 92.65 F1 | 904 sequences/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 16 | SQuAD v1.1 | A30

Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384

A10 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.8.0 | ResNet-50 v1.5 | 242 | 77.25 Top1 | 8,117 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 192 | ImageNet2012 | A10
PyTorch | 1.9.0a0 | SE-ResNeXt101 | 996 | 80.24 Top1 | 1,953 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 112 | ImageNet2012 | A10
PyTorch | 1.9.0a0 | Tacotron2 | 204 | .5 Training Loss | 151,946 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 104 | LJSpeech 1.1 | A10
PyTorch | 1.9.0a0 | WaveGlow | 637 | -5.84 Training Loss | 664,022 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 10 | LJSpeech 1.1 | A10
PyTorch | 1.9.0a0 | Transformer | 1,365 | 27.8 BLEU Score | 70,844 words/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 2560 | wmt14-en-de | A10
PyTorch | 1.9.0a0 | FastPitch | 177 | .25 Training Loss | 467,464 frames/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 32 | LJSpeech 1.1 | A10
PyTorch | 1.9.0a0 | GNMT V2 | 61 | 24.49 BLEU Score | 292,052 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 128 | wmt16-en-de | A10
PyTorch | 1.10.0a0 | NCF | 1 | .96 Hit Rate at 10 | 47,145,497 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 131072 | MovieLens 20M | A10
PyTorch | 1.10.0a0 | BERT-LARGE | 13 | 91.54 F1 | 224 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 10 | SQuAD v1.1 | A10
PyTorch | 1.9.0a0 | Transformer-XL Base | 176 | 22.16 Perplexity | 187,731 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 32 | WikiText-103 | A10
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 266 | 76.74 Top1 | 7,283 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 256 | ImageNet2012 | A10
TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 542 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 2 | DAGM2007 | A10
TensorFlow | 1.15.5 | U-Net Medical | 14 | .9 DICE Score | 324 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 8 | EM segmentation challenge | A10
TensorFlow | 1.15.5 | VAE-CF | 1 | .43 NDCG@100 | 644,608 users processed/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | TF32 | 3072 | MovieLens 20M | A10
TensorFlow | 1.15.5 | SE-ResNext101 | 866 | 79.65 Top1 | 2,240 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 96 | ImageNet2012 | A10
TensorFlow | 2.4.0 | Electra Base Fine Tuning | 6 | 92.62 F1 | 745 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 16 | SQuAD v1.1 | A10

Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384

T4 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.9.0 | ResNet-50 v1.5 | 507 | 77.28 Top 1 | 3,860 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4
PyTorch | 1.9.0a0 | SE-ResNeXt101 | 1,770 | 79.94 Top1 | 1,102 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 112 | ImageNet2012 | NVIDIA T4
PyTorch | 1.10.0a0 | Tacotron2 | 241 | .53 Training Loss | 125,992 total output mels/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.10.0a0 | WaveGlow | 1,041 | -5.82 Training Loss | 400,494 output samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.10.0a0 | Transformer | 2,092 | 27.56 BLEU Score | 45,948 words/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 2560 | wmt14-en-de | NVIDIA T4
PyTorch | 1.7.0a0 | FastPitch | 319 | .21 Training Loss | 281,406 frames/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 32 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.10.0a0 | GNMT V2 | 85 | 24.46 BLEU Score | 173,312 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4
PyTorch | 1.10.0a0 | NCF | 2 | .96 Hit Rate at 10 | 28,643,324 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA T4
PyTorch | 1.9.0a0 | BERT-LARGE | 23 | 91.34 F1 | 129 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.06-py3 | Mixed | 10 | SQuAD v1.1 | NVIDIA T4
PyTorch | 1.9.0a0 | Transformer-XL Base | 318 | 22.12 Perplexity | 103,740 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.06-py3 | Mixed | 32 | WikiText-103 | NVIDIA T4
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 550 | 76.81 Top 1 | 3,496 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4
TensorFlow | 1.15.5 | U-Net Industrial | 2 | .99 IoU Threshold 0.99 | 312 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.06-py3 | Mixed | 2 | DAGM2007 | NVIDIA T4
TensorFlow | 1.15.5 | U-Net Medical | 31 | .9 DICE Score | 155 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4
TensorFlow | 1.15.5 | VAE-CF | 2 | .43 NDCG@100 | 383,665 users processed/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 3072 | MovieLens 20M | NVIDIA T4
TensorFlow | 1.15.4 | SSD | 112 | .28 mAP | 549 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.12-py3 | Mixed | 32 | COCO 2017 | NVIDIA T4
TensorFlow | 1.15.5 | Mask R-CNN | 492 | .34 AP Segm | 53 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.03-py3 | Mixed | 4 | COCO 2014 | NVIDIA T4
TensorFlow | 1.15.5 | ResNext101 | 1,222 | 79.22 Top 1 | 1,577 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 128 | ImageNet2012 | NVIDIA T4
TensorFlow | 1.15.5 | SE-ResNext101 | 1,552 | 79.53 Top1 | 1,243 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 96 | ImageNet2012 | NVIDIA T4
TensorFlow | 2.5.0 | Electra Base Fine Tuning | 9 | 92.69 F1 | 419 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 16 | SQuAD v1.1 | NVIDIA T4
TensorFlow | 2.5.0 | Wide and Deep | 27 | .66 MAP at 12 | 827,819 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | NVIDIA T4

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384


V100 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.9.0 | ResNet-50 v1.5 | 170 | 77.02 Top 1 | 11,683 images/sec | 8x V100 | DGX-2 | 21.08-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
PyTorch | 1.8.0a0 | Mask R-CNN | 269 | .34 AP Segm | 109 images/sec | 8x V100 | DGX-2 | 20.12-py3 | Mixed | 8 | COCO 2014 | V100-SXM3-32GB
PyTorch | 1.10.0a0 | Tacotron2 | 187 | .53 Training Loss | 165,755 total output mels/sec | 8x V100 | DGX-2 | 21.08-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB
PyTorch | 1.10.0a0 | WaveGlow | 453 | -5.82 Training Loss | 924,330 output samples/sec | 8x V100 | DGX-2 | 21.08-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB
PyTorch | 1.6.0a0 | Jasper | 6,300 | 3.49 dev-clean WER | 312 sequences/sec | 8x V100 | DGX-2 | 20.06-py3 | Mixed | 64 | LibriSpeech | V100 SXM2-32GB
PyTorch | 1.10.0a0 | Transformer | 472 | 27.64 BLEU Score | 209,534 words/sec | 8x V100 | DGX-2 | 21.08-py3 | Mixed | 5120 | wmt14-en-de | V100-SXM3-32GB
PyTorch | 1.6.0a0 | FastPitch | 354 | .18 Training Loss | 570,968 frames/sec | 8x V100 | DGX-1 | 20.06-py3 | Mixed | 32 | LJSpeech 1.1 | V100 SXM2-16GB
PyTorch | 1.10.0a0 | GNMT V2 | 34 | 24.33 BLEU Score | 445,675 total tokens/sec | 8x V100 | DGX-2 | 21.08-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB
PyTorch | 1.10.0a0 | NCF | 1 | .96 Hit Rate at 10 | 98,934,300 samples/sec | 8x V100 | DGX-2 | 21.08-py3 | Mixed | 131072 | MovieLens 20M | V100-SXM3-32GB
PyTorch | 1.10.0a0 | BERT-LARGE | 8 | 91.31 F1 | 367 sequences/sec | 8x V100 | DGX-2 | 21.08-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB
PyTorch | 1.9.0a0 | Transformer-XL Base | 116 | 22.05 Perplexity | 284,635 total tokens/sec | 8x V100 | DGX-2 | 21.06-py3 | Mixed | 32 | WikiText-103 | V100-SXM3-32GB
PyTorch | 1.10.0a0 | ResNeXt101 | 500 | 79.5 Top 1 | 3,915 images/sec | 8x V100 | DGX-2 | 21.08-py3 | Mixed | 112 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 187 | 76.9 Top 1 | 10,320 images/sec | 8x V100 | DGX-2 | 21.08-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | 1.15.5 | ResNext101 | 425 | 79.3 Top 1 | 4,558 images/sec | 8x V100 | DGX-2 | 21.08-py3 | Mixed | 128 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | 1.15.5 | SE-ResNext101 | 504 | 79.81 Top1 | 3,867 images/sec | 8x V100 | DGX-2 | 21.05-py3 | Mixed | 96 | ImageNet2012 | V100-SXM3-32GB
TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 665 images/sec | 8x V100 | DGX-2 | 21.06-py3 | Mixed | 2 | DAGM2007 | V100-SXM3-32GB
TensorFlow | 1.15.5 | U-Net Medical | 12 | .89 DICE Score | 465 images/sec | 8x V100 | DGX-2 | 21.06-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB
TensorFlow | 1.15.5 | VAE-CF | 1 | .43 NDCG@100 | 907,709 users processed/sec | 8x V100 | DGX-2 | 21.08-py3 | Mixed | 3072 | MovieLens 20M | V100-SXM3-32GB
TensorFlow | 2.5.0 | Wide and Deep | 11 | .66 MAP at 12 | 2,124,603 samples/sec | 8x V100 | DGX-2 | 21.08-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | V100-SXM3-32GB
TensorFlow | 1.15.5 | BERT-LARGE | 18 | 91.55 F1 | 331 sequences/sec | 8x V100 | DGX-2 | 21.08-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB
TensorFlow | 2.5.0 | Electra Base Fine Tuning | 4 | 92.47 F1 | 1,390 sequences/sec | 8x V100 | DGX-2 | 21.08-py3 | Mixed | 32 | SQuAD v1.1 | V100-SXM3-32GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384


Converged Training Performance of NVIDIA GPU on Cloud

Benchmarks are reproducible by following the links to the NGC catalog scripts.

A100 Training Performance on Cloud

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.9.0a0 | BERT-LARGE | 3 | 91.31 F1 | 876 sequences/sec | 8x A100 | AWS EC2 p4d.24xlarge | 21.06-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-40GB

BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384

Single-GPU Training

Some scenarios, such as single-GPU throughput, aren’t representative of real-world training. The tables below are provided for reference as an indication of a platform’s single-chip throughput.
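The relationship between a throughput figure and wall-clock training time is simple arithmetic; the sketch below makes it explicit. The step time used here is an illustrative assumption (chosen to land near the images/sec figures reported below), not a measured value; the ImageNet2012 training-set size is approximately 1.28 M images.

```python
def throughput(batch_size, num_gpus, step_seconds):
    """Samples processed per second across all GPUs for one training step."""
    return batch_size * num_gpus / step_seconds

# Hypothetical single-GPU step: a batch of 256 images consumed in 0.097 s.
per_gpu = throughput(batch_size=256, num_gpus=1, step_seconds=0.097)

def epoch_minutes(dataset_size, samples_per_sec):
    """Wall-clock minutes for one pass over the dataset at a given rate."""
    return dataset_size / samples_per_sec / 60.0

# One ImageNet2012 epoch (~1.28 M training images) at that rate.
minutes = epoch_minutes(1_281_167, per_gpu)
```

Note that throughput alone says nothing about convergence; it only bounds how fast data can flow through the model.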

Related Resources

Achieve unprecedented acceleration at every scale with NVIDIA’s complete solution stack.


NVIDIA’s complete solution stack, from hardware to software, allows data scientists to deliver unprecedented acceleration at every scale. Visit the NVIDIA NGC catalog to pull containers and quickly get up and running with deep learning.


Single GPU Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100

Benchmarks are reproducible by following the links to the NGC catalog scripts.

A100 Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.9.0 | ResNet-50 v1.5 | 2,986 images/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 192 | ImageNet2012 | A100-SXM4-80GB
PyTorch | 1.9.0a0 | Mask R-CNN | 30 images/sec | 1x A100 | DGX A100 | 21.06-py3 | TF32 | 8 | COCO 2014 | A100-SXM-80GB
PyTorch | 1.9.0a0 | SSD v1.1 | 447 images/sec | 1x A100 | DGX A100 | 21.06-py3 | Mixed | 128 | COCO 2017 | A100-SXM-80GB
PyTorch | 1.10.0a0 | Tacotron2 | 40,849 total output mels/sec | 1x A100 | DGX A100 | 21.08-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM4-80GB
PyTorch | 1.10.0a0 | WaveGlow | 201,377 output samples/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-80GB
PyTorch | 1.9.0a0 | Jasper | 83 sequences/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 64 | LibriSpeech | A100-SXM-80GB
PyTorch | 1.6.0a0 | Transformer | 82,618 words/sec | 1x A100 | DGX A100 | 20.06-py3 | Mixed | 10240 | wmt14-en-de | A100 SXM4-40GB
PyTorch | 1.10.0a0 | FastPitch | 180,663 frames/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 128 | LJSpeech 1.1 | A100-SXM4-80GB
PyTorch | 1.10.0a0 | GNMT V2 | 159,485 total tokens/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-80GB
PyTorch | 1.10.0a0 | NCF | 37,202,700 samples/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-80GB
PyTorch | 1.10.0a0 | BERT-LARGE | 123 sequences/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB
PyTorch | 1.9.0a0 | Transformer-XL Large | 28,583 total tokens/sec | 1x A100 | DGX A100 | 21.04-py3 | Mixed | 16 | WikiText-103 | A100-SXM-80GB
PyTorch | 1.10.0a0 | Transformer-XL Base | 77,166 total tokens/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 128 | WikiText-103 | A100-SXM4-80GB
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 2,638 images/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB
TensorFlow | 1.15.5 | ResNext101 | 1,301 images/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB
TensorFlow | 1.15.5 | SE-ResNext101 | 1,122 images/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB
TensorFlow | 1.15.5 | U-Net Industrial | 342 images/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 16 | DAGM2007 | A100-SXM4-40GB
TensorFlow | 2.5.0 | U-Net Medical | 148 images/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM4-80GB
TensorFlow | 1.15.5 | VAE-CF | 395,735 users processed/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 24576 | MovieLens 20M | A100-SXM4-80GB
TensorFlow | 2.5.0 | Wide and Deep | 2,341,155 samples/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A100-SXM4-40GB
TensorFlow | 1.15.5 | BERT-LARGE | 117 sequences/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 24 | SQuAD v1.1 | A100-SXM4-80GB
TensorFlow | 2.5.0 | Electra Base Fine Tuning | 365 sequences/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB
TensorFlow | - | EfficientNet-B4 | 332 images/sec | 1x A100 | DGX A100 | - | Mixed | 160 | ImageNet2012 | A100-SXM-80GB
TensorFlow | 1.15.5 | NCF | 42,093,685 samples/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-40GB
TensorFlow | 2.4.0 | Mask R-CNN | 30 samples/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 4 | COCO 2014 | A100-SXM4-40GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
EfficientNet-B4: Basic Augmentation | cuDNN Version = 8.0.5.32 | NCCL Version = 2.7.8 | Installation Source = NGC catalog

A40 Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.9.0 | ResNet-50 v1.5 | 1,192 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 192 | ImageNet2012 | A40
PyTorch | 1.9.0a0 | Mask R-CNN | 14 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.06-py3 | Mixed | 8 | COCO 2014 | A40
PyTorch | 1.9.0a0 | SSD v1.1 | 222 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.06-py3 | Mixed | 128 | COCO 2017 | A40
PyTorch | 1.10.0a0 | Tacotron2 | 24,051 total output mels/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 128 | LJSpeech 1.1 | A40
PyTorch | 1.10.0a0 | WaveGlow | 120,308 output samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 10 | LJSpeech 1.1 | A40
PyTorch | 1.10.0a0 | GNMT V2 | 82,036 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 128 | wmt16-en-de | A40
PyTorch | 1.10.0a0 | NCF | 20,388,435 samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 1048576 | MovieLens 20M | A40
PyTorch | 1.9.0a0 | Transformer-XL Large | 15,301 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 16 | WikiText-103 | A40
PyTorch | 1.10.0a0 | BERT-LARGE | 63 sequences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 32 | SQuAD v1.1 | A40
PyTorch | 1.10.0a0 | FastPitch | 122,507 frames/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 128 | LJSpeech 1.1 | A40
PyTorch | 1.10.0a0 | Transformer-XL Base | 44,361 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 128 | WikiText-103 | A40
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 1,325 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 256 | ImageNet2012 | A40
TensorFlow | 1.15.5 | SSD | 214 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.06-py3 | Mixed | 32 | COCO 2017 | A40
TensorFlow | 1.15.5 | U-Net Industrial | 122 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 16 | DAGM2007 | A40
TensorFlow | 1.15.5 | BERT-LARGE | 55 sentences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 24 | SQuAD v1.1 | A40
TensorFlow | 1.15.5 | VAE-CF | 214,146 users processed/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 24576 | MovieLens 20M | A40
TensorFlow | 2.5.0 | U-Net Medical | 72 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 8 | EM segmentation challenge | A40
TensorFlow | 2.5.0 | Wide and Deep | 943,904 samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A40
TensorFlow | 1.15.5 | ResNext101 | 571 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 256 | ImageNet2012 | A40
TensorFlow | 1.15.5 | SE-ResNext101 | 525 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 256 | ImageNet2012 | A40
TensorFlow | 2.5.0 | Electra Base Fine Tuning | 180 sequences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 32 | SQuAD v1.1 | A40

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384

A30 Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.9.0 | ResNet-50 v1.5 | 1,472 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 192 | ImageNet2012 | A30
PyTorch | 1.9.0a0 | SSD v1.1 | 226 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.06-py3 | Mixed | 64 | COCO 2017 | A30
PyTorch | 1.10.0a0 | Tacotron2 | 19,491 total output mels/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 104 | LJSpeech 1.1 | A30
PyTorch | 1.10.0a0 | WaveGlow | 114,946 output samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 10 | LJSpeech 1.1 | A30
PyTorch | 1.9.0a0 | Transformer | 24,662 words/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 2560 | wmt14-en-de | A30
PyTorch | 1.10.0a0 | FastPitch | 98,275 frames/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 64 | LJSpeech 1.1 | A30
PyTorch | 1.10.0a0 | NCF | 19,593,087 samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 1048576 | MovieLens 20M | A30
PyTorch | 1.10.0a0 | GNMT V2 | 84,818 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 128 | wmt16-en-de | A30
PyTorch | 1.9.0a0 | Transformer-XL Base | 41,124 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.04-py3 | Mixed | 32 | WikiText-103 | A30
PyTorch | 1.10.0a0 | ResNeXt101 | 549 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 112 | ImageNet2012 | A30
PyTorch | 1.9.0a0 | Jasper | 34 sequences/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 16 | LibriSpeech | A30
PyTorch | 1.9.0a0 | Transformer-XL Large | 12,617 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 4 | WikiText-103 | A30
PyTorch | 1.10.0a0 | BERT-LARGE | 51 sequences/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 10 | SQuAD v1.1 | A30
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 1,347 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 256 | ImageNet2012 | A30
TensorFlow | 1.15.5 | ResNext101 | 595 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 128 | ImageNet2012 | A30
TensorFlow | 1.15.5 | SE-ResNext101 | 494 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 96 | ImageNet2012 | A30
TensorFlow | 1.15.5 | U-Net Industrial | 107 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 16 | DAGM2007 | A30
TensorFlow | 2.5.0 | U-Net Medical | 68 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 8 | EM segmentation challenge | A30
TensorFlow | 1.15.5 | VAE-CF | 199,577 users processed/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 24576 | MovieLens 20M | A30
TensorFlow | 2.5.0 | Wide and Deep | 871,647 samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A30
TensorFlow | 2.4.0 | Mask R-CNN | 21 samples/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 4 | COCO 2014 | A30
TensorFlow | 2.5.0 | Electra Base Fine Tuning | 149 sequences/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 16 | SQuAD v1.1 | A30
TensorFlow | 1.15.5 | SSD | 201 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.06-py3 | Mixed | 32 | COCO 2017 | A30

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384

A10 Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.9.0 | ResNet-50 v1.5 | 1,017 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.06-py3 | Mixed | 192 | ImageNet2012 | A10
PyTorch | 1.9.0a0 | SSD v1.1 | 173 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.06-py3 | Mixed | 64 | COCO 2017 | A10
PyTorch | 1.10.0a0 | Tacotron2 | 19,710 total output mels/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 104 | LJSpeech 1.1 | A10
PyTorch | 1.10.0a0 | WaveGlow | 96,531 output samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 10 | LJSpeech 1.1 | A10
PyTorch | 1.9.0a0 | Transformer | 22,248 words/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 2560 | wmt14-en-de | A10
PyTorch | 1.10.0a0 | FastPitch | 94,160 frames/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 64 | LJSpeech 1.1 | A10
PyTorch | 1.9.0a0 | Transformer-XL Base | 34,901 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.04-py3 | Mixed | 32 | WikiText-103 | A10
PyTorch | 1.10.0a0 | GNMT V2 | 65,417 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 128 | wmt16-en-de | A10
PyTorch | 1.9.0a0 | ResNeXt101 | 421 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 128 | ImageNet2012 | A10
PyTorch | 1.9.0a0 | SE-ResNeXt101 | 327 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.03-py3 | Mixed | 128 | ImageNet2012 | A10
PyTorch | 1.10.0a0 | NCF | 16,806,585 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 1048576 | MovieLens 20M | A10
PyTorch | 1.8.0a0 | Jasper | 29 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.02-py3 | Mixed | 32 | LibriSpeech | A10
PyTorch | 1.9.0a0 | Transformer-XL Large | 10,699 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 4 | WikiText-103 | A10
PyTorch | 1.10.0a0 | BERT-LARGE | 40 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 10 | SQuAD v1.1 | A10
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 995 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 256 | ImageNet2012 | A10
TensorFlow | 1.15.5 | ResNext101 | 412 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 128 | ImageNet2012 | A10
TensorFlow | 1.15.5 | SE-ResNext101 | 316 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 96 | ImageNet2012 | A10
TensorFlow | 1.15.5 | U-Net Industrial | 96 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 16 | DAGM2007 | A10
TensorFlow | 2.5.0 | U-Net Medical | 50 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 8 | EM segmentation challenge | A10
TensorFlow | 1.15.5 | VAE-CF | 175,029 users processed/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 24576 | MovieLens 20M | A10
TensorFlow | 2.5.0 | Wide and Deep | 770,616 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A10
TensorFlow | 2.5.0 | Electra Base Fine Tuning | 130 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 16 | SQuAD v1.1 | A10
TensorFlow | 2.4.0 | Mask R-CNN | 18 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 4 | COCO 2014 | A10
TensorFlow | 1.15.5 | SSD | 180 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.06-py3 | Mixed | 32 | COCO 2017 | A10

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384

T4 Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.9.0 | ResNet-50 v1.5 | 486 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4
PyTorch | 1.10.0a0 | ResNeXt101 | 180 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 112 | ImageNet2012 | NVIDIA T4
PyTorch | 1.10.0a0 | Tacotron2 | 17,331 total output mels/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.10.0a0 | WaveGlow | 53,856 output samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.9.0a0 | Transformer | 10,512 words/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 2560 | wmt14-en-de | NVIDIA T4
PyTorch | 1.10.0a0 | FastPitch | 40,379 frames/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 64 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.10.0a0 | GNMT V2 | 32,528 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4
PyTorch | 1.10.0a0 | NCF | 8,091,013 samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 1048576 | MovieLens 20M | NVIDIA T4
PyTorch | 1.10.0a0 | BERT-LARGE | 19 sequences/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 10 | SQuAD v1.1 | NVIDIA T4
PyTorch | 1.9.0a0 | Transformer-XL Base | 17,182 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.04-py3 | Mixed | 32 | WikiText-103 | NVIDIA T4
PyTorch | 1.8.0a0 | Jasper | 14 sequences/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.02-py3 | Mixed | 32 | LibriSpeech | NVIDIA T4
PyTorch | 1.10.0a0 | SE-ResNeXt101 | 146 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 112 | ImageNet2012 | NVIDIA T4
PyTorch | 1.9.0a0 | Transformer-XL Large | 5,231 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 4 | WikiText-103 | NVIDIA T4
TensorFlow | 1.15.5 | ResNet-50 v1.5 | 443 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4
TensorFlow | 1.15.5 | U-Net Industrial | 46 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 16 | DAGM2007 | NVIDIA T4
TensorFlow | 2.5.0 | U-Net Medical | 22 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4
TensorFlow | 1.15.5 | VAE-CF | 81,989 users processed/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 24576 | MovieLens 20M | NVIDIA T4
TensorFlow | 1.15.5 | SSD | 98 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.06-py3 | Mixed | 32 | COCO 2017 | NVIDIA T4
TensorFlow | 2.4.0 | Mask R-CNN | 9 samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 4 | COCO 2014 | NVIDIA T4
TensorFlow | 2.5.0 | Wide and Deep | 351,161 samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | NVIDIA T4
TensorFlow | 1.15.5 | SE-ResNext101 | 167 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 96 | ImageNet2012 | NVIDIA T4
TensorFlow | 1.15.5 | ResNext101 | 203 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 128 | ImageNet2012 | NVIDIA T4
TensorFlow | 2.5.0 | Electra Base Fine Tuning | 65 sequences/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 16 | SQuAD v1.1 | NVIDIA T4

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384

V100 Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.9.0 | ResNet-50 v1.5 | 1,489 images/sec | 1x V100 | DGX-2 | 21.08-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
PyTorch | 1.10.0a0 | ResNeXt101 | 555 images/sec | 1x V100 | DGX-2 | 21.08-py3 | Mixed | 112 | Imagenet2012 | V100-SXM3-32GB
| 1.9.0a0 | SSD v1.1 | 233 images/sec | 1x V100 | DGX-2 | 21.06-py3 | Mixed | 64 | COCO 2017 | V100-SXM3-32GB
| 1.9.0a0 | Tacotron2 | 22,904 total output mels/sec | 1x V100 | DGX-2 | 21.06-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB
| 1.10.0a0 | WaveGlow | 128,867 output samples/sec | 1x V100 | DGX-2 | 21.08-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB
| 1.9.0a0 | Jasper | 44 sequences/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 64 | LibriSpeech | V100-SXM3-32GB
| 1.9.0a0 | Transformer | 32,218 words/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 5120 | wmt14-en-de | V100-SXM3-32GB
| 1.10.0a0 | FastPitch | 121,642 frames/sec | 1x V100 | DGX-2 | 21.08-py3 | Mixed | 64 | LJSpeech 1.1 | V100-SXM3-32GB
| 1.10.0a0 | GNMT V2 | 77,162 total tokens/sec | 1x V100 | DGX-2 | 21.08-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB
| 1.10.0a0 | NCF | 22,168,311 samples/sec | 1x V100 | DGX-2 | 21.08-py3 | Mixed | 1048576 | MovieLens 20M | V100-SXM3-32GB
| 1.10.0a0 | BERT-LARGE | 53 sequences/sec | 1x V100 | DGX-2 | 21.08-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB
| 1.9.0a0 | Transformer-XL Base | 44,072 total tokens/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 32 | WikiText-103 | V100-SXM3-32GB
| 1.9.0a0 | Transformer-XL Large | 15,360 total tokens/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 8 | WikiText-103 | V100-SXM3-32GB
Tensorflow | 1.15.5 | ResNet-50 v1.5 | 1,372 images/sec | 1x V100 | DGX-2 | 21.08-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB
| 1.15.5 | ResNext101 | 618 images/sec | 1x V100 | DGX-2 | 21.08-py3 | Mixed | 128 | Imagenet2012 | V100-SXM3-32GB
| 1.15.5 | SE-ResNext101 | 517 images/sec | 1x V100 | DGX-2 | 21.08-py3 | Mixed | 96 | Imagenet2012 | V100-SXM3-32GB
| 1.15.5 | U-Net Industrial | 116 images/sec | 1x V100 | DGX-2 | 21.08-py3 | Mixed | 16 | DAGM2007 | V100-SXM3-32GB
| 2.5.0 | U-Net Medical | 67 images/sec | 1x V100 | DGX-2 | 21.08-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB
| 1.15.5 | VAE-CF | 222,257 users processed/sec | 1x V100 | DGX-2 | 21.08-py3 | Mixed | 24576 | MovieLens 20M | V100-SXM3-32GB
| 2.5.0 | Wide and Deep | 978,587 samples/sec | 1x V100 | DGX-2 | 21.08-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | V100-SXM3-32GB
| 1.15.5 | BERT-LARGE | 48 sequences/sec | 1x V100 | DGX-2 | 21.08-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB
| 2.5.0 | Electra Base Fine Tuning | 191 sequences/sec | 1x V100 | DGX-2 | 21.08-py3 | Mixed | 32 | SQuAD v1.1 | V100-SXM3-32GB
| 2.4.0 | Mask R-CNN | 22 samples/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 4 | COCO 2014 | V100-SXM3-32GB
| 1.15.5 | SSD | 222 images/sec | 1x V100 | DGX-2 | 21.06-py3 | Mixed | 32 | COCO 2017 | V100-SXM3-32GB

The FastPitch frames/sec metric refers to mel-scale spectrogram frames per second
BERT-Large = BERT-Large fine-tuning (SQuAD v1.1) with a sequence length of 384


Single GPU Training Performance of NVIDIA GPU on Cloud

Benchmarks are reproducible by following links to the NGC catalog scripts

T4 Training Performance on Cloud

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
MXNet | 1.9.0 | ResNet-50 v1.5 | 425 images/sec | 1x T4 | AWS EC2 g4dn.4xlarge | 21.06-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4
PyTorch | 1.9.0a0 | BERT-LARGE | 16 sequences/sec | 1x T4 | AWS EC2 g4dn.4xlarge | 21.06-py3 | Mixed | 10 | SQuAD v1.1 | NVIDIA T4

BERT-Large = BERT-Large fine-tuning (SQuAD v1.1) with a sequence length of 384

AI Inference

Real-world inference demands high throughput and low latency, with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance from data center to edge.

Related Resources

Power high-throughput, low-latency inference with NVIDIA’s complete solution stack:


MLPerf Inference v1.1 Performance Benchmarks

Offline Scenario - Closed Division

Network | Throughput | GPU | Server | GPU Version | Dataset | Target Accuracy
ResNet-50 v1.5 | 313,516 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet | 76.46% Top1
| 283,469 samples/sec | 8x (7x1g.10gb A100) | DGX A100 | A100 SXM-80GB
| 145,742 samples/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-80GB
| 149,178 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30
| 150,315 samples/sec | 8x (4x1g.6gb A30) | Gigabyte G482-Z54 | A30
| 110,197 samples/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10
SSD ResNet-34 | 7,851 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | COCO | 0.2 mAP
| 7,316 samples/sec | 8x (7x1g.10gb A100) | DGX A100 | A100 SXM-80GB
| 3,606 samples/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-80GB
| 3,788 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30
| 3,727 samples/sec | 8x (4x1g.6gb A30) | Gigabyte G482-Z54 | A30
| 2,473 samples/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10
3D-UNet | 487 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | BraTS 2019 | 0.853 DICE mean
| 421 samples/sec | 8x (7x1g.10gb A100) | DGX A100 | A100 SXM-80GB
| 227 samples/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-80GB
| 241 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30
| 225 samples/sec | 8x (4x1g.6gb A30) | Gigabyte G482-Z54 | A30
| 173 samples/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10
RNN-T | 106,918 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech | 7.45% WER
| 50,561 samples/sec | 8x A100 | Gigabyte G242-P31 | A100-PCIe-80GB
| 52,596 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30
| 36,461 samples/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10
BERT | 28,302 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 | 90.07% f1
| 25,677 samples/sec | 8x (7x1g.10gb A100) | DGX A100 | A100 SXM-80GB
| 12,606 samples/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-40GB
| 13,385 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30
| 12,867 samples/sec | 8x (4x1g.6gb A30) | Gigabyte G482-Z54 | A30
| 8,757 samples/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10
DLRM | 2,421,440 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC
| 1,097,730 samples/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-40GB
| 1,083,600 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30
| 772,521 samples/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10

Server Scenario - Closed Division

Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraint (ms) | Dataset
ResNet-50 v1.5 | 260,042 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 76.46% Top1 | 15 | ImageNet
| 70,007 queries/sec | 8x (7x1g.10gb A100) | DGX A100 | A100 SXM-80GB
| 104,012 queries/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-80GB
| 116,014 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30
| 65,004 queries/sec | 8x (4x1g.6gb A30) | Gigabyte G482-Z54 | A30
| 88,014 queries/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10
SSD ResNet-34 | 7,581 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 0.2 mAP | 100 | COCO
| 5,802 queries/sec | 8x (7x1g.10gb A100) | DGX A100 | A100 SXM-80GB
| 3,083 queries/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-80GB
| 3,575 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30
| 3,002 queries/sec | 8x (4x1g.6gb A30) | Gigabyte G482-Z54 | A30
| 2,000 queries/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10
RNN-T | 104,012 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 7.45% WER | 1,000 | LibriSpeech
| 43,005 queries/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-80GB
| 36,999 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30
| 22,600 queries/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10
BERT | 25,795 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 90.07% f1 | 130 | SQuAD v1.1
| 20,497 queries/sec | 8x (7x1g.10gb A100) | DGX A100 | A100 SXM-80GB
| 10,402 queries/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-80GB
| 11,501 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30
| 8,301 queries/sec | 8x (4x1g.6gb A30) | Gigabyte G482-Z54 | A30
| 7,202 queries/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10
DLRM | 2,302,660 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs
| 600,198 queries/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-80GB
| 1,000,530 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30
| 680,257 queries/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10
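Server-scenario results are only valid when tail latency stays within the per-model constraint shown in the table (e.g., 15 ms for ResNet-50 v1.5). The sketch below illustrates that kind of check with a nearest-rank percentile; it is illustrative only, not the official MLPerf LoadGen implementation, and the 99% target is an assumption for the example.

```python
import math

# Illustrative server-scenario latency check (NOT the official MLPerf
# LoadGen implementation): a result is valid only if the tail latency
# stays under the per-model bound, e.g. 15 ms for ResNet-50 v1.5.
def tail_latency(latencies_ms, pct=99.0):
    """Nearest-rank percentile of a list of query latencies."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(pct / 100.0 * len(ordered))
    return ordered[rank - 1]

def meets_constraint(latencies_ms, bound_ms, pct=99.0):
    return tail_latency(latencies_ms, pct) <= bound_ms

# 99 fast queries plus one slow outlier still clear a 15 ms bound at p99
sample = [5.0] * 99 + [40.0]
print(meets_constraint(sample, bound_ms=15.0))  # True
```

The intuition: a server submission can tolerate a small fraction of slow queries, so sustained throughput is reported at the highest rate where the percentile bound still holds.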

Power Efficiency Offline Scenario - Closed Division

Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset
ResNet-50 v1.5 | 244,537 samples/sec | 83 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet
| 125,232 samples/sec | 110.9 samples/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB
| 211,436 samples/sec | 112.03 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-40GB
SSD ResNet-34 | 6,482 samples/sec | 2.04 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | COCO
| 3,295 samples/sec | 2.65 samples/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB
| 5,866 samples/sec | 2.71 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-40GB
3D-UNet | 399 samples/sec | 0.13 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | BraTS 2019
| 203 samples/sec | 0.18 samples/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB
| 345 samples/sec | 0.18 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-40GB
RNN-T | 90,243 samples/sec | 27.73 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech
| 44,495 samples/sec | 37.7 samples/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB
| 84,727 samples/sec | 38.44 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-40GB
BERT | 24,667 samples/sec | 6.95 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1
| 10,573 samples/sec | 8.5 samples/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB
| 20,401 samples/sec | 8.19 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-40GB
DLRM | 2,091,060 samples/sec | 629.03 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs
| 987,260 samples/sec | 786.67 samples/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB
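The two columns above are tied together by a simple relation: dividing throughput by throughput-per-watt recovers the total measured board power. A quick sanity check against the 8x A100 ResNet-50 v1.5 row (the per-GPU split assumes power divides evenly across the eight GPUs, which is an approximation):

```python
# Back out board power from the table: power = throughput / (throughput per watt).
# Values are the 8x A100 ResNet-50 v1.5 row above.
throughput = 244_537            # samples/sec
per_watt = 83                   # samples/sec/watt
total_power_w = throughput / per_watt
per_gpu_w = total_power_w / 8   # approximation: even split across 8 GPUs
print(f"{total_power_w:.0f} W total, ~{per_gpu_w:.0f} W per GPU")
```

That works out to roughly 2,946 W of board power for the eight GPUs, i.e. about 368 W each, consistent with an A100 SXM part running near its power limit.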

Power Efficiency Server Scenario - Closed Division

Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset
ResNet-50 v1.5 | 232,036 queries/sec | 79.14 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet
| 107,013 queries/sec | 94.74 queries/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB
| 185,034 queries/sec | 87.76 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-40GB
SSD ResNet-34 | 6,301 queries/sec | 1.99 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | COCO
| 3,083 queries/sec | 2.49 queries/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB
| 5,703 queries/sec | 2.62 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-40GB
RNN-T | 88,014 queries/sec | 25.46 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech
| 43,406 queries/sec | 33.55 queries/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB
| 75,012 queries/sec | 33.11 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-40GB
BERT | 21,497 queries/sec | 6.22 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1
| 10,203 queries/sec | 8.01 queries/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB
| 17,496 queries/sec | 7.99 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-40GB
DLRM | 2,002,040 queries/sec | 591.77 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs
| 890,424 queries/sec | 672.18 queries/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB

MLPerf™ v1.1 Inference Closed: ResNet-50 v1.5, SSD ResNet-34, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net, DLRM 99% of FP32 accuracy target: 1.1-033, 1.1-037, 1.1-039, 1.1-042, 1.1-043, 1.1-046, 1.1-047, 1.1-048, 1.1-051. The MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
BERT-Large sequence length = 384.
DLRM samples refer to an average of 270 user-item pairs per sample.
4x1g.6gb and 7x1g.10gb denote MIG (Multi-Instance GPU) configurations: the workload runs on four or seven single-GPC slices, each with 6 GB or 10 GB of memory, on a single A30 or A100 respectively.
For MLPerf™ data across the various scenarios, click here
For MLPerf™ latency constraints, click here

NVIDIA Triton Inference Server Delivered Comparable Performance to the Custom Harness in MLPerf v1.1


NVIDIA took the top performance spots on all MLPerf™ Inference 1.1 tests, the AI industry's leading benchmark suite. For inference submissions, we have typically used a custom A100 inference serving harness, designed and optimized specifically to deliver the highest possible inference performance for MLPerf™ workloads, which require running inference on bare metal.

MLPerf™ v1.1 A100 Inference Closed: ResNet-50 v1.5, SSD ResNet-34, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net, DLRM 99% of FP32 accuracy target: 1.1-047, 1.1-049. MLPerf name and logo are trademarks. See www.mlcommons.org for more information.​

The chart compares the performance of Triton to the custom MLPerf™ serving harness across five TensorRT networks on A100 SXM-80GB on bare metal. The results show that Triton is highly efficient, delivering performance nearly equal to that of the heavily optimized MLPerf™ harness.

 

NVIDIA Client Batch Size=1 Performance with Triton Inference Server

Network | Accelerator | Training Framework | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version
ResNet-50 V1.5 Inference | A100-PCIE-40GB | PyTorch | TensorRT | Mixed | 2 | 1 | 64 | 256 | 61.02 | 4,197 inf/sec | - | 20.07-py3
ResNet-50 V1.5 Inference | A100-SXM4-40GB | PyTorch | TensorRT | TF32 | 2 | 1 | 64 | 256 | 48.35 | 5,294 inf/sec | - | 21.03-py3
ResNet-50 V1.5 Inference | NVIDIA T4 | PyTorch | TensorRT | Mixed | 1 | 1 | 64 | 256 | 257.91 | 992 inf/sec | - | 20.07-py3
ResNet-50 V1.5 Inference | V100 SXM2-32GB | PyTorch | TensorRT | FP32 | 4 | 1 | 64 | 384 | 215.79 | 1,781 inf/sec | - | 21.03-py3
BERT Large Inference | A100-PCIE-40GB | TensorFlow | TensorRT | Mixed | 1 | 1 | 8 | 16 | 17.48 | 915 inf/sec | 384 | 20.09-py3
BERT Large Inference | A100-SXM4-40GB | TensorFlow | TensorRT | INT8 | 2 | 1 | 8 | 64 | 56.34 | 1,136 inf/sec | 384 | 20.09-py3
BERT Large Inference | NVIDIA T4 | TensorFlow | TensorRT | Mixed | 1 | 1 | 8 | 16 | 81.14 | 197 inf/sec | 384 | 20.09-py3
DLRM Inference | A100-PCIE-40GB | PyTorch | Torchscript | Mixed | 2 | 1 | 65,536 | 24 | 2.52 | 9,521 inf/sec | - | 21.05-py3
DLRM Inference | A100-SXM4-40GB | PyTorch | Torchscript | Mixed | 2 | 1 | 65,536 | 30 | 2.71 | 11,076 inf/sec | - | 21.03-py3
DLRM Inference | V100-SXM2-32GB | PyTorch | Torchscript | Mixed | 2 | 1 | 65,536 | 26 | 3.67 | 7,083 inf/sec | - | 21.06-py3
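The throughput, latency, and concurrency columns in this table are linked by Little's law: for a closed-loop client keeping a fixed number of requests in flight, throughput is approximately concurrency divided by average latency. A quick check against the first ResNet-50 row:

```python
# Sanity-check the Triton table with Little's law: for a closed-loop
# client, throughput ~= concurrent in-flight requests / average latency.
concurrency = 256          # ResNet-50 V1.5, A100-PCIE-40GB row
latency_s = 61.02e-3       # average latency from the same row, in seconds
estimated = concurrency / latency_s
print(round(estimated))    # ~4,195 inf/sec vs the reported 4,197 inf/sec
```

The same arithmetic reproduces the other rows as well (e.g., the DLRM A100-PCIE row: 24 / 2.52 ms gives roughly 9,524 inf/sec against the reported 9,521), which is a useful way to read dynamic-batching results: the server batches many client-batch-1 requests internally, trading a little latency for much higher throughput.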

Inference Performance of NVIDIA A100, A40, A30, A10, T4 and V100

Benchmarks are reproducible by following links to the NGC catalog scripts

Inference Natural Language Processing

BERT Inference Throughput

DGX A100 server w/ 1x NVIDIA A100 with 7 MIG instances of 1g.5gb | Batch Size = 94 | Precision: INT8 | Sequence Length = 128
DGX-1 server w/ 1x NVIDIA V100 | TensorRT 7.1 | Batch Size = 256 | Precision: Mixed | Sequence Length = 128

 

NVIDIA A100 BERT Inference Benchmarks

Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
BERT-Large with Sparsity | Attention | 94 | 6,188 sequences/sec | - | - | 1x A100 | DGX A100 | - | INT8 | SQuAD v1.1 | - | A100 SXM4-40GB

A100 with 7 MIG instances of 1g.5gb | Sequence length = 128 | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container

Inference Image Classification on CNNs with TensorRT

ResNet-50 v1.5 Throughput

DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.0 | Batch Size = 128 | 21.08-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.0 | Batch Size = 128 | 21.08-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.0 | Batch Size = 128 | 21.08-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.0 | Batch Size = 128 | 21.08-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.0 | Batch Size = 128 | 21.08-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.0 | Batch Size = 128 | 21.08-py3 | Precision: Mixed | Dataset: Synthetic

 
 

ResNet-50 v1.5 Power Efficiency

DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.0 | Batch Size = 128 | 21.08-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.0 | Batch Size = 128 | 21.08-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.0 | Batch Size = 128 | 21.08-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.0 | Batch Size = 128 | 21.08-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.0 | Batch Size = 128 | 21.08-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.0 | Batch Size = 128 | 21.08-py3 | Precision: Mixed | Dataset: Synthetic

 

A100 Full Chip Inference Performance

Network | Batch Size | Full Chip Throughput | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 2 | 4,084 images/sec | 0.49 | 1x A100 | DGX A100 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0 | A100 SXM-80GB
| 8 | 11,498 images/sec | 0.7 | 1x A100 | DGX A100 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0 | A100 SXM-80GB
| 128 | 30,664 images/sec | 4.17 | 1x A100 | DGX A100 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0 | A100 SXM-80GB
| 223 | 32,204 images/sec | 6.92 | 1x A100 | DGX A100 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0 | A100 SXM-80GB
ResNet-50v1.5 | 2 | 4,040 images/sec | 0.5 | 1x A100 | DGX A100 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0 | A100-SXM4-40GB
| 8 | 11,171 images/sec | 0.72 | 1x A100 | DGX A100 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0 | A100 SXM-80GB
| 128 | 29,856 images/sec | 4.29 | 1x A100 | DGX A100 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0 | A100 SXM-80GB
| 214 | 31,042 images/sec | 6.89 | 1x A100 | DGX A100 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0 | A100 SXM-80GB
ResNext101 | 32 | 7,674 samples/sec | 4.17 | 1x A100 | - | - | INT8 | Synthetic | TensorRT 7.2 | A100-SXM4-40GB
EfficientNet-B0 | 128 | 22,346 images/sec | 5.73 | 1x A100 | - | - | INT8 | Synthetic | TensorRT 7.2 | A100-SXM4-40GB
BERT-BASE | 2 | 2,534 sequences/sec | 0.79 | 1x A100 | DGX A100 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A100-SXM4-40GB
| 8 | 6,895 sequences/sec | 1.16 | 1x A100 | DGX A100 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A100-SXM-80GB
| 128 | 13,554 sequences/sec | 9.44 | 1x A100 | DGX A100 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A100-SXM4-40GB
BERT-LARGE | 2 | 1,085 sequences/sec | 1.84 | 1x A100 | DGX A100 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A100-SXM4-40GB
| 8 | 2,333 sequences/sec | 3.43 | 1x A100 | DGX A100 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A100-SXM-80GB
| 128 | 4,485 sequences/sec | 28.54 | 1x A100 | DGX A100 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A100-SXM4-40GB

A hyphen in the Container column indicates a pre-release container | A hyphen in the Server column indicates a pre-production server
BERT-Large: Sequence Length = 128
For BS=1 inference, refer to the Triton Inference Server section
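With synthetic data and a single inference stream, the throughput and latency columns above are two views of one measurement: throughput is approximately batch size divided by per-batch latency. A quick check against the ResNet-50 BS=128 row (the small residual gap reflects measurement jitter and rounding in the published numbers):

```python
# Throughput implied by executing one batch at a time:
# throughput = batch_size / latency.
def throughput_from_latency(batch_size, latency_ms):
    return batch_size / (latency_ms / 1000.0)

# ResNet-50 at BS=128 above: 4.17 ms per batch implies ~30,695 images/sec,
# in line with the reported 30,664 images/sec.
print(round(throughput_from_latency(128, 4.17)))
```

The same relation explains why large batches dominate throughput tables: latency grows sublinearly with batch size until the GPU saturates, so images/sec keeps climbing (compare the BS=2 and BS=128 rows).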

A100 1/7 MIG Inference Performance

NetworkBatch Size1/7 MIG ThroughputLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-5011,504 images/sec0.671x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
22,301 images/sec0.871x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
83,718 images/sec2.151x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
294,263 images/sec6.81x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
1284,646 images/sec27.551x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
ResNet-50v1.511,494 images/sec0.671x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
22,256 images/sec0.891x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
83,630 images/sec2.21x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
284,109 images/sec6.811x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
1284,501 images/sec28.441x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
BERT-BASE1807 sequences/sec1.241x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
21,116 sequences/sec1.791x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
81,676 sequences/sec4.771x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
1282,151 sequences/sec59.521x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
BERT-LARGE1268 sequences/sec3.741x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
2392 sequences/sec5.11x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
8553 sequences/sec14.481x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
128671 sequences/sec190.631x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB

A hyphen in the Container column indicates a pre-release container | A hyphen in the Server column indicates a pre-production server
BERT-Large: Sequence Length = 128

A100 7 MIG Inference Performance

NetworkBatch Size7 MIG ThroughputLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-50110,593 images/sec0.661x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
216,107 images/sec0.871x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
826,088 images/sec2.151x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
2829,994 images/sec6.771x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
12832,498 images/sec27.571x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
ResNet-50v1.5110,364 images/sec0.681x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
215,801 images/sec0.891x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
825,383 images/sec2.211x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
2828,901 images/sec6.781x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
12831,517 images/sec28.431x A100DGX A10021.08-py3INT8SyntheticTensorRT 8.0A100 SXM-80GB
BERT-BASE15,627 sequences/sec1.241x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
27,937 sequences/sec1.761x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
811,771 sequences/sec4.761x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
12815,052 sequences/sec59.531x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
BERT-LARGE11,892 sequences/sec3.71x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
22,733 sequences/sec5.121x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
83,865 sequences/sec14.491x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB
1284,702 sequences/sec190.541x A100DGX A10021.06-py3INT8SyntheticTensorRT 7.2A100 SXM-80GB

A hyphen in the Container column indicates a pre-release container | A hyphen in the Server column indicates a pre-production server
BERT-Large: Sequence Length = 128
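Comparing the three A100 tables above shows how well MIG partitioning scales: seven 1/7 slices running together deliver almost exactly seven times the single-slice throughput, and can even slightly exceed the full, unpartitioned chip. A small check using the ResNet-50 BS=128 rows:

```python
# Compare MIG aggregate throughput against the full chip, using the
# ResNet-50 BS=128 rows from the three A100 inference tables above.
single_mig_1of7 = 4_646   # images/sec, one 1/7 MIG slice
all_seven_mig = 32_498    # images/sec, all 7 MIG instances together
full_chip = 32_204        # images/sec, full (unpartitioned) A100

mig_scaling = all_seven_mig / (7 * single_mig_1of7)
vs_full_chip = all_seven_mig / full_chip
print(f"7-way MIG scaling efficiency: {mig_scaling:.3f}")
print(f"7 MIG instances vs full chip: {vs_full_chip:.3f}")
```

Near-1.0 scaling is the expected behavior, since MIG gives each instance isolated compute, cache, and memory bandwidth rather than sharing them dynamically.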

 

A40 Inference Performance

NetworkBatch SizeThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-5024,707 images/sec25 images/sec/watt0.421x A40GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A40
89,828 images/sec42 images/sec/watt0.811x A40GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A40
11616,775 images/sec- images/sec/watt6.921x A40GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A40
12816,986 images/sec57 images/sec/watt7.541x A40GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A40
ResNet-50v1.524,612 images/sec24 images/sec/watt0.431x A40GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A40
89,547 images/sec40 images/sec/watt0.841x A40GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A40
10916,537 images/sec- images/sec/watt6.591x A40GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A40
12816,210 images/sec54 images/sec/watt7.91x A40GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A40
BERT-BASE22,351 sequences/sec13 sequences/sec/watt0.851x A40GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A40
84,494 sequences/sec19 sequences/sec/watt1.781x A40GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A40
1287,180 sequences/sec27 sequences/sec/watt17.831x A40GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A40
BERT-LARGE2893 sequences/sec4 sequences/sec/watt2.241x A40GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A40
81,666 sequences/sec6 sequences/sec/watt4.81x A40GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A40
1282,216 sequences/sec9 sequences/sec/watt57.761x A40GIGABYTE G482-Z52-0021.05-py3INT8Sample TextTensorRT 7.2A40

Sequence length = 128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container

 

A30 Inference Performance

NetworkBatch SizeThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-5023,487 images/sec38 images/sec/watt0.571x A30GIGABYTE G482-Z52-SW-QZ-00121.03-py3INT8SyntheticTensorRT 7.2A30
88,497 images/sec70 images/sec/watt0.941x A30GIGABYTE G482-Z52-SW-QZ-00121.03-py3INT8SyntheticTensorRT 7.2A30
10916,083 images/sec- images/sec/watt6.781x A30GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A30
12815,612 images/sec95 images/sec/watt8.21x A30GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A30
ResNet-50v1.523,498 images/sec37 images/sec/watt0.571x A30GIGABYTE G482-Z52-SW-QZ-00121.03-py3INT8SyntheticTensorRT 7.2A30
88,330 images/sec68 images/sec/watt0.961x A30GIGABYTE G482-Z52-SW-QZ-00121.03-py3INT8SyntheticTensorRT 7.2A30
10615,495 images/sec- images/sec/watt6.841x A30GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A30
12815,667 images/sec95 images/sec/watt8.171x A30GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A30
BERT-BASE22,052 sequences/sec23 sequences/sec/watt0.971x A30GIGABYTE G482-Z52-0021.06-py3INT8Sample TextTensorRT 7.2A30
84,417 sequences/sec33 sequences/sec/watt1.811x A30GIGABYTE G482-Z52-0021.06-py3INT8Sample TextTensorRT 7.2A30
1286,815 sequences/sec50 sequences/sec/watt18.781x A30GIGABYTE G482-Z52-0021.06-py3INT8Sample TextTensorRT 7.2A30
BERT-LARGE2809 sequences/sec7 sequences/sec/watt2.471x A30GIGABYTE G482-Z52-0021.06-py3INT8Sample TextTensorRT 7.2A30
81,492 sequences/sec11 sequences/sec/watt5.361x A30GIGABYTE G482-Z52-0021.06-py3INT8Sample TextTensorRT 7.2A30
1282,207 sequences/sec15 sequences/sec/watt581x A30GIGABYTE G482-Z52-0021.06-py3INT8Sample TextTensorRT 7.2A30

Sequence length = 128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container

 

A10 Inference Performance

NetworkBatch SizeThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-5024,374 images/sec29 images/sec/watt0.461x A10GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A10
88,131 images/sec54 images/sec/watt0.981x A10GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A10
7511,769 images/sec- images/sec/watt6.81x A10GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A10
12812,173 images/sec81 images/sec/watt10.521x A10GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A10
ResNet-50v1.524,341 images/sec29 images/sec/watt0.461x A10GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A10
87,918 images/sec53 images/sec/watt1.011x A10GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A10
7511,044 images/sec- images/sec/watt6.791x A10GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A10
12811,554 images/sec77 images/sec/watt11.081x A10GIGABYTE G482-Z52-0021.08-py3INT8SyntheticTensorRT 8.0A10
BERT-BASE22,067 sequences/sec16 sequences/sec/watt0.971x A10GIGABYTE G482-Z52-0021.06-py3INT8Sample TextTensorRT 7.2A10
83,598 sequences/sec27 sequences/sec/watt2.221x A10GIGABYTE G482-Z52-0021.06-py3INT8Sample TextTensorRT 7.2A10
1284,766 sequences/sec35 sequences/sec/watt26.861x A10GIGABYTE G482-Z52-0021.06-py3INT8Sample TextTensorRT 7.2A10
BERT-LARGE2766 sequences/sec6 sequences/sec/watt2.611x A10GIGABYTE G482-Z52-0021.06-py3INT8Sample TextTensorRT 7.2A10
81,257 sequences/sec10 sequences/sec/watt6.361x A10GIGABYTE G482-Z52-0021.06-py3INT8Sample TextTensorRT 7.2A10
1281,462 sequences/sec11 sequences/sec/watt881x A10GIGABYTE G482-Z52-0021.06-py3INT8Sample TextTensorRT 7.2A10

Sequence length = 128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container

 

T4 Inference Performance

NetworkBatch SizeThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-5022,105 images/sec30 images/sec/watt0.951x T4Supermicro SYS-1029GQ-TRT21.03-py3INT8SyntheticTensorRT 7.2NVIDIA T4
84,008 images/sec56 images/sec/watt2.041x T4Supermicro SYS-4029GP-TRT21.06-py3INT8SyntheticTensorRT 7.2.3NVIDIA T4
324,771 images/sec- images/sec/watt6.711x T4Supermicro SYS-1029GQ-TRT21.08-py3INT8SyntheticTensorRT 8.0NVIDIA T4
1285,230 images/sec75 images/sec/watt24.481x T4Supermicro SYS-1029GQ-TRT21.08-py3INT8SyntheticTensorRT 8.0NVIDIA T4
ResNet-50v1.522,092 images/sec30 images/sec/watt0.961x T4Supermicro SYS-1029GQ-TRT21.03-py3INT8SyntheticTensorRT 7.2NVIDIA T4
83,745 images/sec54 images/sec/watt2.141x T4Supermicro SYS-4029GP-TRT21.06-py3INT8SyntheticTensorRT 7.2.3NVIDIA T4
294,501 images/sec- images/sec/watt6.441x T4Supermicro SYS-1029GQ-TRT21.08-py3INT8SyntheticTensorRT 8.0NVIDIA T4
1285,049 images/sec72 images/sec/watt25.351x T4Supermicro SYS-1029GQ-TRT21.08-py3INT8SyntheticTensorRT 8.0NVIDIA T4
BERT-BASE21,102 sequences/sec17 sequences/sec/watt1.811x T4Supermicro SYS-4029GP-TRT21.06-py3INT8Sample TextTensorRT 7.2NVIDIA T4
81,766 sequences/sec27 sequences/sec/watt4.531x T4Supermicro SYS-4029GP-TRT21.06-py3INT8Sample TextTensorRT 7.2NVIDIA T4
1281,872 sequences/sec28 sequences/sec/watt681x T4Supermicro SYS-4029GP-TRT21.06-py3INT8Sample TextTensorRT 7.2NVIDIA T4
BERT-LARGE2392 sequences/sec6 sequences/sec/watt5.11x T4Supermicro SYS-4029GP-TRT21.06-py3INT8Sample TextTensorRT 7.2NVIDIA T4
8573 sequences/sec9 sequences/sec/watt13.971x T4Supermicro SYS-4029GP-TRT21.06-py3INT8Sample TextTensorRT 7.2NVIDIA T4
128565 sequences/sec8 sequences/sec/watt2271x T4Supermicro SYS-4029GP-TRT21.06-py3INT8Sample TextTensorRT 7.2NVIDIA T4

Sequence length = 128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container
For BS=1 inference, refer to the Triton Inference Server section

 

V100 Inference Performance

NetworkBatch SizeThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-5022,058 images/sec11 images/sec/watt0.971x V100DGX-221.08-py3INT8SyntheticTensorRT 8.0V100-SXM3-32GB
84,413 images/sec16 images/sec/watt1.811x V100DGX-221.08-py3MixedSyntheticTensorRT 8.0V100-SXM3-32GB
527,917 images/sec- images/sec/watt6.571x V100DGX-221.06-py3INT8SyntheticTensorRT 7.2.3V100-SXM3-32GB
1288,156 images/sec24 images/sec/watt15.691x V100DGX-221.08-py3INT8SyntheticTensorRT 8.0V100-SXM3-32GB
ResNet-50v1.522,054 images/sec11 images/sec/watt0.971x V100DGX-221.08-py3INT8SyntheticTensorRT 8.0V100-SXM3-32GB
84,248 images/sec15 images/sec/watt1.881x V100DGX-221.08-py3INT8SyntheticTensorRT 8.0V100-SXM3-32GB
527,508 images/sec- images/sec/watt6.931x V100DGX-221.06-py3INT8SyntheticTensorRT 7.2.3V100-SXM3-32GB
1287,799 images/sec22 images/sec/watt16.411x V100DGX-221.08-py3MixedSyntheticTensorRT 8.0V100-SXM3-32GB
BERT-BASE21,159 sequences/sec5 sequences/sec/watt1.731x V100DGX-221.06-py3INT8Sample TextTensorRT 7.2V100-SXM3-32GB
82,201 sequences/sec8 sequences/sec/watt3.641x V100DGX-221.06-py3INT8Sample TextTensorRT 7.2V100-SXM3-32GB
1283,174 sequences/sec10 sequences/sec/watt40.331x V100DGX-221.06-py3INT8Sample TextTensorRT 7.2V100-SXM3-32GB
BERT-LARGE2486 sequences/sec2 sequences/sec/watt4.121x V100DGX-221.06-py3INT8Sample TextTensorRT 7.2V100-SXM3-32GB
8790 sequences/sec3 sequences/sec/watt10.121x V100DGX-221.06-py3MixedSample TextTensorRT 7.2V100-SXM3-32GB
128971 sequences/sec3 sequences/sec/watt1321x V100DGX-221.06-py3MixedSample TextTensorRT 7.2V100-SXM3-32GB

Sequence length = 128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container
For BS=1 inference, refer to the Triton Inference Server section

Inference Performance of NVIDIA GPU on Cloud

Benchmarks are reproducible by following links to the NGC catalog scripts

T4 Inference Performance on Cloud

Network | Batch Size | Throughput | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 2 | 2,063 images/sec | 0.97 | 1x T4 | AWS EC2 g4dn.4xlarge | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2 | NVIDIA T4
| 8 | 3,533 images/sec | 2.26 | 1x T4 | AWS EC2 g4dn.4xlarge | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2 | NVIDIA T4
| 128 | 4,555 images/sec | 28.1 | 1x T4 | AWS EC2 g4dn.4xlarge | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2 | NVIDIA T4
BERT-LARGE | 2 | 384 sequences/sec | 5.21 | 1x T4 | AWS EC2 g4dn.4xlarge | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | NVIDIA T4
| 8 | 551 sequences/sec | 14.52 | 1x T4 | AWS EC2 g4dn.4xlarge | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | NVIDIA T4
| 128 | 540 sequences/sec | 237.25 | 1x T4 | AWS EC2 g4dn.4xlarge | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | NVIDIA T4

BERT-Large: Sequence Length = 128

Conversational AI

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.

Related Resources

Download and get started with NVIDIA Riva.


Riva Benchmarks

Automatic Speech Recognition

A100 Best Streaming Throughput Mode (800 ms chunk)
Acoustic model# of streamsAvg Latency (ms)Throughput (RTFX)GPU Version
Quartznet114.41A100 SXM4-40GB
Quartznet256254.364A100 SXM4-40GB
Quartznet512351.2506A100 SXM4-40GB
Quartznet1024630.81005A100 SXM4-40GB
Jasper117.61A100 SXM4-40GB
Jasper256244.9254A100 SXM4-40GB
Jasper512381507A100 SXM4-40GB
Jasper1024749.31,004A100 SXM4-40GB

A100 Best Streaming Latency Mode (100 ms chunk)
Acoustic model# of streamsAvg Latency (ms)Throughput (RTFX)GPU Version
Quartznet19.61A100 SXM4-40GB
Quartznet1625.916A100 SXM4-40GB
Quartznet128132.4128A100 SXM4-40GB
Jasper113.41A100 SXM4-40GB
Jasper1626.316A100 SXM4-40GB
Jasper128258.9128A100 SXM4-40GB

A100 Offline Mode (3200 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 28.1 | 1 | A100 SXM4-40GB
Quartznet | 512 | 566.5 | 505 | A100 SXM4-40GB
Quartznet | 1,024 | 899.3 | 1,000 | A100 SXM4-40GB
Quartznet | 1,512 | 1,303.8 | 1,460 | A100 SXM4-40GB
Jasper | 1 | 31 | 1 | A100 SXM4-40GB
Jasper | 512 | 667.5 | 504 | A100 SXM4-40GB
Jasper | 1,024 | 1,089 | 997 | A100 SXM4-40GB
Jasper | 1,512 | 1,753.8 | 1,449 | A100 SXM4-40GB

V100 Best Streaming Throughput Mode (800 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 14.4 | 1 | V100 SXM2-16GB
Quartznet | 256 | 222.2 | 254 | V100 SXM2-16GB
Quartznet | 512 | 385.2 | 505 | V100 SXM2-16GB
Quartznet | 768 | 574.5 | 752 | V100 SXM2-16GB
Jasper | 1 | 26.8 | 1 | V100 SXM2-16GB
Jasper | 128 | 239.4 | 127 | V100 SXM2-16GB
Jasper | 256 | 416 | 253 | V100 SXM2-16GB
Jasper | 512 | 969.7 | 500 | V100 SXM2-16GB

V100 Best Streaming Latency Mode (100 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 8.8 | 1 | V100 SXM2-16GB
Quartznet | 16 | 22.4 | 16 | V100 SXM2-16GB
Quartznet | 128 | 114.7 | 127 | V100 SXM2-16GB
Jasper | 1 | 21.5 | 1 | V100 SXM2-16GB
Jasper | 16 | 36.9 | 16 | V100 SXM2-16GB
Jasper | 64 | 406.4 | 64 | V100 SXM2-16GB
Jasper | 512 | 969.7 | 500 | V100 SXM2-16GB

V100 Offline Mode (3200 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 32.933 | 1 | V100 SXM2-16GB
Quartznet | 256 | 461.44 | 253 | V100 SXM2-16GB
Quartznet | 512 | 784.73 | 502 | V100 SXM2-16GB
Quartznet | 768 | 1,121.6 | 747 | V100 SXM2-16GB
Quartznet | 1,024 | 1,551.5 | 986 | V100 SXM2-16GB
Jasper | 1 | 48.351 | 1 | V100 SXM2-16GB
Jasper | 256 | 734.99 | 252 | V100 SXM2-16GB
Jasper | 512 | 1,423.3 | 498 | V100 SXM2-16GB
Jasper | 768 | 2,190.2 | 730 | V100 SXM2-16GB

T4 Best Streaming Throughput Mode (800 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 33.183 | 1 | NVIDIA T4
Quartznet | 64 | 162.63 | 64 | NVIDIA T4
Quartznet | 128 | 263.6 | 127 | NVIDIA T4
Quartznet | 256 | 449.28 | 253 | NVIDIA T4
Quartznet | 384 | 732.75 | 376 | NVIDIA T4
Jasper | 1 | 72.377 | 1 | NVIDIA T4
Jasper | 64 | 259.64 | 64 | NVIDIA T4
Jasper | 128 | 450.81 | 127 | NVIDIA T4
Jasper | 256 | 1,200.8 | 249 | NVIDIA T4

T4 Best Streaming Latency Mode (100 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 19.2 | 1 | NVIDIA T4
Quartznet | 16 | 56.4 | 16 | NVIDIA T4
Quartznet | 64 | 242.4 | 64 | NVIDIA T4
Jasper | 1 | 46.9 | 1 | NVIDIA T4
Jasper | 8 | 51.1 | 8 | NVIDIA T4
Jasper | 16 | 84.4 | 16 | NVIDIA T4

T4 Offline Mode (3200 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
Quartznet | 1 | 157.62 | 1 | NVIDIA T4
Quartznet | 256 | 906.17 | 251 | NVIDIA T4
Quartznet | 512 | 1,515.2 | 495 | NVIDIA T4
Jasper | 1 | 96.201 | 1 | NVIDIA T4
Jasper | 256 | 1,758.4 | 247 | NVIDIA T4

ASR Throughput (RTFX) - Number of seconds of audio processed per second | Audio Chunk Size - Server-side configuration indicating the amount of new data to be considered by the acoustic model | ASR Dataset: Librispeech | The latency numbers were measured using the streaming recognition mode, with the BERT-Base punctuation model enabled, a 4-gram language model, a decoder beam width of 128, and timestamps enabled. The client and the server used audio chunks of the same duration (100 ms, 800 ms, or 3,200 ms, depending on the server configuration). The Riva streaming client Riva_streaming_asr_client, provided in the Riva client image, was used with the --simulate_realtime flag to simulate transcription from a microphone; each stream performed 5 iterations over a sample audio file from the Librispeech dataset (1272-135031-0000.wav) | Riva version: v1.0.0-b1 | Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
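RTFX is seconds of audio processed per wall-clock second, and because --simulate_realtime feeds each stream audio at real-time speed, aggregate RTFX is bounded by the number of active streams. A minimal sketch of the metric (the stream count and timings below are hypothetical, chosen only to mirror the "RTFX roughly equals streams" pattern in the streaming tables):

```python
def rtfx(total_audio_seconds, wall_clock_seconds):
    """Real-time factor: seconds of audio processed per second of wall-clock time."""
    return total_audio_seconds / wall_clock_seconds

# Hypothetical run: 256 simulated-realtime streams, each carrying 60 s of audio,
# finishing in ~60.5 s of wall time. RTFX can approach but not exceed 256.
print(rtfx(256 * 60, 60.5))  # ~254, similar to the 256-stream rows above
```

In offline mode the server is handed whole 3,200 ms chunks rather than real-time audio, which is why those tables trade much higher latency for similar aggregate RTFX.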

Natural Language Processing

A100 Benchmarks
Task | # of streams | Avg Latency (ms) | Throughput (seq/sec) | GPU Version
NER | 1 | 3.19 | 311 | A100 SXM4-40GB
NER | 256 | 95.5 | 2,549 | A100 SXM4-40GB
Q&A | 1 | 4.95 | 201 | A100 SXM4-40GB
Q&A | 128 | 279 | 453 | A100 SXM4-40GB

V100 Benchmarks
Task | # of streams | Avg Latency (ms) | Throughput (seq/sec) | GPU Version
NER | 1 | 4.87 | 204 | V100 SXM2-16GB
NER | 256 | 135 | 1,797 | V100 SXM2-16GB
Q&A | 1 | 7.47 | 134 | V100 SXM2-16GB
Q&A | 128 | 521 | 244 | V100 SXM2-16GB

T4 Benchmarks
Task | # of streams | Avg Latency (ms) | Throughput (seq/sec) | GPU Version
NER | 1 | 9.31 | 107 | NVIDIA T4
NER | 256 | 255 | 960 | NVIDIA T4
Q&A | 1 | 11.5 | 87 | NVIDIA T4
Q&A | 128 | 571 | 223 | NVIDIA T4

Named Entity Recognition (NER): 128 seq len, BERT-base | Question Answering (QA): 384 seq len, BERT-large | NLP Throughput (seq/s) - Number of sequences processed per second | Performance of the Riva named entity recognition (NER) service (using a BERT-base model, sequence length of 128) and the Riva question answering (QA) service (using a BERT-large model, sequence length of 384) was measured in Riva. Batch size 1 latency and maximum throughput were measured. Riva version: v1.0.0-b1 | Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
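Batch-1 latency and maximum throughput are measured separately, but Little's law (in-flight requests ≈ throughput × latency) lets you sanity-check the max-throughput rows against the stream counts. A sketch using values copied from the NER/QA tables above:

```python
# Little's law check: streams ~= throughput (seq/s) * avg latency (s).
# Values copied from the max-throughput rows of the NER/QA tables above.
rows = [
    # (gpu, task, streams, avg_latency_ms, throughput_seq_per_s)
    ("A100", "NER", 256, 95.5, 2549),
    ("A100", "Q&A", 128, 279, 453),
    ("V100", "NER", 256, 135, 1797),
    ("T4",   "Q&A", 128, 571, 223),
]
for gpu, task, streams, lat_ms, thr in rows:
    implied_streams = thr * lat_ms / 1000.0
    # Implied concurrency lands within ~5% of the configured stream count.
    print(f"{gpu} {task}: {streams} streams, implied {implied_streams:.0f}")
```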

Text to Speech

A100 Benchmarks
# of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version
1 | 0.06 | 0.04 | 20 | A100 SXM4-40GB
4 | 0.48 | 0.03 | 37 | A100 SXM4-40GB
6 | 0.69 | 0.03 | 42 | A100 SXM4-40GB
8 | 0.88 | 0.03 | 46 | A100 SXM4-40GB
10 | 1.06 | 0.03 | 49 | A100 SXM4-40GB

V100 Benchmarks
# of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version
1 | 0.08 | 0.05 | 14 | V100 SXM2-16GB
4 | 0.77 | 0.05 | 23 | V100 SXM2-16GB
6 | 1.11 | 0.05 | 26 | V100 SXM2-16GB
8 | 1.4 | 0.06 | 28 | V100 SXM2-16GB
10 | 1.74 | 0.07 | 28 | V100 SXM2-16GB

T4 Benchmarks
# of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version
1 | 0.12 | 0.07 | 11 | NVIDIA T4
4 | 1.02 | 0.07 | 17 | NVIDIA T4
6 | 1.59 | 0.07 | 18 | NVIDIA T4
8 | 2.13 | 0.08 | 19 | NVIDIA T4
10 | 2.55 | 0.1 | 18 | NVIDIA T4

TTS Throughput (RTFX) - Number of seconds of audio generated per second | Dataset: LJSpeech | Performance of the Riva text-to-speech (TTS) service was measured for different numbers of parallel streams. Each parallel stream performed 10 iterations over 10 input strings from the LJSpeech dataset. Latency to first audio chunk, latency between successive audio chunks, and throughput were measured. Riva version: v1.0.0-b1 | Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
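TTS RTFX is aggregate across all streams, so dividing it by the stream count gives each stream's synthesis speed relative to real time. A sketch using the 10-stream rows copied from the tables above:

```python
# Per-stream synthesis speed from aggregate RTFX.
# RTFX = seconds of audio generated per wall-clock second across all streams,
# so rtfx / streams is each stream's speedup over real time.
rows = [
    # (gpu, streams, aggregate_rtfx) -- 10-stream rows from the tables above
    ("A100", 10, 49),
    ("V100", 10, 28),
    ("T4", 10, 18),
]
for gpu, streams, rtfx in rows:
    print(f"{gpu}: each of {streams} streams synthesized ~{rtfx / streams:.1f}x real time")
```

This is why chunk-to-chunk latency stays low even at 10 streams: every stream still generates audio faster than it needs to be played back.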


Last updated: September 27th, 2021