Reproducible Performance

Reproduce these results on your own systems by following the instructions in the Measuring Training and Inferencing Performance on NVIDIA AI Platforms Reviewer’s Guide.

Related Resources

HPC Performance

Review the latest GPU-acceleration factors of popular HPC applications.


Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology to test whether AI systems are ready to be deployed in the field to deliver meaningful results.
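The convergence methodology described above can be sketched as a simple timing harness: train until the validation metric reaches the specified target, then report the wall-clock time. This is an illustrative sketch only, not NVIDIA's benchmark harness; `train_one_epoch` and `evaluate` are hypothetical placeholders for a real training loop and validation pass.

```python
import time

def time_to_convergence(train_one_epoch, evaluate, target_accuracy, max_epochs=100):
    """Train until the validation metric reaches the target; report wall-clock minutes.

    `train_one_epoch` and `evaluate` are placeholders for a real training loop.
    Returns (epochs_run, minutes_elapsed).
    """
    start = time.perf_counter()
    for epoch in range(max_epochs):
        train_one_epoch()
        accuracy = evaluate()
        if accuracy >= target_accuracy:
            minutes = (time.perf_counter() - start) / 60
            return epoch + 1, minutes
    raise RuntimeError(f"did not reach {target_accuracy} within {max_epochs} epochs")
```

The key property, as in MLPerf, is that the clock stops only when the quality target is met, so the metric rewards real end-to-end training speed rather than raw throughput.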

Related Resources

Read our blog on convergence for more details.

Get up and running quickly with NVIDIA’s complete solution stack.


NVIDIA Performance on MLPerf 2.0 Training Benchmarks

BERT Time to Train on A100

PyTorch | Precision: Mixed | Dataset: Wikipedia 2020/01/01 | Convergence criteria - refer to MLPerf requirements

MLPerf Training Performance

NVIDIA A100 Performance on MLPerf 2.0 AI Benchmarks - Closed Division

Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
MXNet | ResNet-50 v1.5 | 27.227 | 75.90% classification | 8x A100 | Inspur: NF5688M6 | 2.0-2069 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | ResNet-50 v1.5 | 4.503 | 75.90% classification | 64x A100 | DGX A100 | 2.0-2094 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | ResNet-50 v1.5 | 0.555 | 75.90% classification | 1,024x A100 | DGX A100 | 2.0-2101 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | ResNet-50 v1.5 | 0.319 | 75.90% classification | 4,216x A100 | DGX A100 | 2.0-2107 | Mixed | ImageNet2012 | A100-SXM4-80GB
MXNet | 3D U-Net | 21.278 | 0.908 Mean DICE score | 8x A100 | H3C: R5500G5-Intelx8A100-SXM-80GB | 2.0-2060 | Mixed | KiTS 2019 | A100-SXM4-80GB
MXNet | 3D U-Net | 3.437 | 0.908 Mean DICE score | 72x A100 | Azure: ND96amsr_A100_v4_n9 | 2.0-2007 | Mixed | KiTS 2019 | A100-SXM4-80GB
MXNet | 3D U-Net | 1.216 | 0.908 Mean DICE score | 768x A100 | DGX A100 | 2.0-2100 | Mixed | KiTS 2019 | A100-SXM4-80GB
PyTorch | BERT | 15.869 | 0.72 Mask-LM accuracy | 8x A100 | Inspur: NF5688M6 | 2.0-2070 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | BERT | 2.942 | 0.72 Mask-LM accuracy | 64x A100 | DGX A100 | 2.0-2095 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | BERT | 0.421 | 0.72 Mask-LM accuracy | 1,024x A100 | DGX A100 | 2.0-2102 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | BERT | 0.206 | 0.72 Mask-LM accuracy | 4,096x A100 | DGX A100 | 2.0-2106 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | Mask R-CNN | 40.917 | 0.377 Box min AP and 0.339 Mask min AP | 8x A100 | Inspur: NF5688M6 | 2.0-2070 | Mixed | COCO2017 | A100-SXM4-80GB
PyTorch | Mask R-CNN | 8.447 | 0.377 Box min AP and 0.339 Mask min AP | 64x A100 | DGX A100 | 2.0-2095 | Mixed | COCO2017 | A100-SXM4-80GB
PyTorch | Mask R-CNN | 3.085 | 0.377 Box min AP and 0.339 Mask min AP | 384x A100 | DGX A100 | 2.0-2099 | Mixed | COCO2017 | A100-SXM4-80GB
PyTorch | RNN-T | 28.759 | 0.058 Word Error Rate | 8x A100 | Inspur: NF5488A5 | 2.0-2066 | Mixed | LibriSpeech | A100-SXM4-80GB
PyTorch | RNN-T | 6.91 | 0.058 Word Error Rate | 64x A100 | DGX A100 | 2.0-2095 | Mixed | LibriSpeech | A100-SXM4-80GB
PyTorch | RNN-T | 2.151 | 0.058 Word Error Rate | 1,536x A100 | DGX A100 | 2.0-2104 | Mixed | LibriSpeech | A100-SXM4-80GB
PyTorch | RetinaNet | 84.397 | mAP of 0.34 | 8x A100 | DGX A100 | 2.0-2091 | Mixed | OpenImages | A100-SXM4-80GB
PyTorch | RetinaNet | 14.462 | mAP of 0.34 | 64x A100 | DGX A100 | 2.0-2095 | Mixed | OpenImages | A100-SXM4-80GB
PyTorch | RetinaNet | 4.253 | mAP of 0.34 | 1,280x A100 | DGX A100 | 2.0-2103 | Mixed | OpenImages | A100-SXM4-80GB
TensorFlow | MiniGo | 255.672 | 50% win rate vs. checkpoint | 8x A100 | H3C: R5500G5-AMDx8A100-SXM-80GB | 2.0-2059 | Mixed | Go | A100-SXM4-80GB
TensorFlow | MiniGo | 73.038 | 50% win rate vs. checkpoint | 64x A100 | DGX A100 | 2.0-2096 | Mixed | Go | A100-SXM4-80GB
TensorFlow | MiniGo | 16.231 | 50% win rate vs. checkpoint | 1,792x A100 | DGX A100 | 2.0-2105 | Mixed | Go | A100-SXM4-80GB
NVIDIA Merlin HugeCTR | DLRM | 1.597 | 0.8025 AUC | 8x A100 | Inspur: NF5688M6 | 2.0-2068 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB
NVIDIA Merlin HugeCTR | DLRM | 0.653 | 0.8025 AUC | 64x A100 | DGX A100 | 2.0-2093 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB
NVIDIA Merlin HugeCTR | DLRM | 0.588 | 0.8025 AUC | 112x A100 | DGX A100 | 2.0-2098 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB

MLPerf™ v2.0 Training Closed: MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
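One useful way to read the table above is to convert the time-to-train at two scales into a scaling efficiency, i.e. the fraction of ideal linear speedup retained as GPU count grows. A minimal sketch, using the ResNet-50 v1.5 rows above (27.227 minutes on 8x A100 vs. 4.503 minutes on 64x A100):

```python
def scaling_efficiency(base_gpus, base_minutes, scaled_gpus, scaled_minutes):
    """Fraction of ideal (linear) speedup retained when scaling up the GPU count."""
    speedup = base_minutes / scaled_minutes   # measured speedup from the two runs
    ideal = scaled_gpus / base_gpus           # perfect linear scaling
    return speedup / ideal

# ResNet-50 v1.5 rows from the MLPerf v2.0 table above:
eff = scaling_efficiency(8, 27.227, 64, 4.503)
print(f"{eff:.1%}")  # -> 75.6% of linear scaling
```

Note that end-to-end time-to-train includes fixed startup and evaluation overheads, so efficiency computed this way is a conservative lower bound on pure training-throughput scaling.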


NVIDIA A100 Performance on MLPerf 1.0 Training HPC Benchmarks: Strong Scaling - Closed Division

Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
MXNet | CosmoFlow | 8.04 | Mean average error 0.124 | 1,024x A100 | DGX A100 | 1.0-1120 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB
MXNet | CosmoFlow | 25.78 | Mean average error 0.124 | 128x A100 | DGX A100 | 1.0-1121 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB
PyTorch | DeepCAM | 1.67 | IOU 0.82 | 2,048x A100 | DGX A100 | 1.0-1122 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB
PyTorch | DeepCAM | 2.65 | IOU 0.82 | 512x A100 | DGX A100 | 1.0-1123 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB

NVIDIA A100 Performance on MLPerf 1.0 Training HPC Benchmarks: Weak Scaling - Closed Division

Framework | Network | Throughput | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
MXNet | CosmoFlow | 0.73 models/min | Mean average error 0.124 | 4,096x A100 | DGX A100 | 1.0-1131 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB
PyTorch | DeepCAM | 5.27 models/min | IOU 0.82 | 4,096x A100 | DGX A100 | 1.0-1132 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB

MLPerf™ v1.0 Training HPC Closed: MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For MLPerf™ v1.0 Training HPC rules and guidelines, see mlcommons.org.

Converged Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.12.0a0 | Tacotron2 | 101 | 0.55 Training Loss | 305,059 total output mels/sec | 8x A100 | DGX A100 | 22.05-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM4-80GB
PyTorch | 1.12.0a0 | WaveGlow | 250 | -5.81 Training Loss | 1,709,596 output samples/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-80GB
PyTorch | 1.12.0a0 | GNMT V2 | 16 | 24.16 BLEU Score | 960,545 total tokens/sec | 8x A100 | DGX A100 | 22.05-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-80GB
PyTorch | 1.12.0a0 | NCF | 0.37 | 0.96 Hit Rate at 10 | 152,745,062 samples/sec | 8x A100 | DGX A100 | 22.05-py3 | Mixed | 131072 | MovieLens 20M | A100-SXM4-80GB
PyTorch | 1.12.0a0 | Transformer-XL Base | 177 | 22.45 Perplexity | 742,987 total tokens/sec | 8x A100 | DGX A100 | 22.05-py3 | Mixed | 128 | WikiText-103 | A100-SXM4-80GB
PyTorch | 1.12.0a0 | EfficientNet-B0 | 576 | 76.54 Top 1 | 16,016 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 256 | Imagenet2012 | A100-SXM4-40GB
PyTorch | 1.12.0a0 | EfficientDet-D0 | 454 | 0.34 BBOX mAP | 1,990 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 150 | COCO 2017 | A100-SXM4-80GB
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B0 | 575 | 76.89 Top 1 | 15,489 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 256 | Imagenet2012 | A100-SXM4-80GB
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B4 | 1,252 | 78.06 Top 1 | 6,956 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 128 | Imagenet2012 | A100-SXM4-80GB
PyTorch | 1.12.0a0 | SE3 Transformer | 9 | 0.04 MAE | 21,466 molecules/sec | 8x A100 | DGX A100 | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | A100-SXM4-80GB
Tensorflow | 1.15.5 | ResNext101 | 188 | 79.19 Top 1 | 10,300 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 256 | Imagenet2012 | A100-SXM4-80GB
Tensorflow | 1.15.5 | SE-ResNext101 | 216 | 79.76 Top 1 | 8,960 images/sec | 8x A100 | DGX A100 | 22.05-py3 | Mixed | 256 | Imagenet2012 | A100-SXM4-80GB
Tensorflow | 1.15.5 | U-Net Industrial | 1 | 0.99 IoU Threshold 0.99 | 1,080 images/sec | 8x A100 | DGX A100 | 22.05-py3 | Mixed | 2 | DAGM2007 | A100-SXM4-80GB
Tensorflow | 2.8.0 | U-Net Medical | 5 | 0.89 DICE Score | 973 images/sec | 8x A100 | DGX A100 | 22.05-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM4-80GB
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 3 | 92.55 F1 | 2,823 sequences/sec | 8x A100 | DGX A100 | 22.05-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-80GB
Tensorflow | 2.8.0 | Wide and Deep | 6 | 0.66 MAP at 12 | 4,442,267 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 16384 | Tabular Outbrain Parquet | A100-SXM4-40GB

A40 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.12.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 49,793,335 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 131072 | MovieLens 20M | A40
PyTorch | 1.12.0a0 | Tacotron2 | 117 | 0.56 Training Loss | 262,848 total output mels/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.05-py3 | Mixed | 128 | LJSpeech 1.1 | A40
PyTorch | 1.12.0a0 | WaveGlow | 468 | -5.76 Training Loss | 901,085 output samples/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A40
PyTorch | 1.12.0a0 | GNMT v2 | 45 | 24.3 BLEU Score | 326,251 total tokens/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.05-py3 | Mixed | 128 | wmt16-en-de | A40
PyTorch | 1.12.0a0 | Transformer XL Base | 434 | 22.43 Perplexity | 303,433 total tokens/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.05-py3 | Mixed | 128 | WikiText-103 | A40
PyTorch | 1.12.0a0 | EfficientNet-B0 | 875 | 76.44 Top 1 | 10,150 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.04-py3 | Mixed | 256 | Imagenet2012 | A40
PyTorch | 1.12.0a0 | EfficientDet-D0 | 640 | 0.34 BBOX mAP | 1,266 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.04-py3 | Mixed | 60 | COCO 2017 | A40
PyTorch | 1.12.0a0 | SE3 Transformer | 13 | 0.04 MAE | 13,766 molecules/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | A40
Tensorflow | 1.15.5 | U-Net Industrial | 1 | 0.99 IoU Threshold 0.99 | 660 images/sec | 8x A40 | GIGABYTE G482-Z52-00 | 22.02-py3 | Mixed | 2 | DAGM2007 | A40
Tensorflow | 1.15.5 | ResNeXt101 | 424 | 79.18 Top 1 | 4,541 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.04-py3 | Mixed | 256 | Imagenet2012 | A40
Tensorflow | 1.15.5 | SE-ResNext101 | 468 | 79.74 Top 1 | 4,125 images/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.05-py3 | Mixed | 256 | Imagenet2012 | A40
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 4 | 92.6 F1 | 1,132 sequences/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.05-py3 | Mixed | 32 | SQuaD v1.1 | A40

A30 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.12.0a0 | Tacotron2 | 123 | 0.53 Training Loss | 253,743 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 104 | LJSpeech 1.1 | A30
PyTorch | 1.12.0a0 | WaveGlow | 450 | -5.8 Training Loss | 939,042 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A30
PyTorch | 1.12.0a0 | GNMT V2 | 45 | 24.43 BLEU Score | 325,304 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | wmt16-en-de | A30
PyTorch | 1.12.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 57,626,487 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 131072 | MovieLens 20M | A30
PyTorch | 1.12.0a0 | ResNeXt101 | 541 | 79.47 Top 1 | 3,639 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 112 | Imagenet2012 | A30
PyTorch | 1.12.0a0 | EfficientNet-B0 | 908 | 76.4 Top 1 | 9,667 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | Imagenet2012 | A30
PyTorch | 1.12.0a0 | EfficientDet-D0 | 768 | 0.34 BBOX mAP | 956 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 30 | COCO 2017 | A30
PyTorch | 1.12.0a0 | SE3 Transformer | 12 | 0.04 MAE | 15,418 molecules/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | A30
Tensorflow | 1.15.5 | U-Net Industrial | 1 | 0.99 IoU Threshold 0.99 | 698 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 2 | DAGM2007 | A30
Tensorflow | 2.8.0 | U-Net Medical | 2 | 0.88 DICE Score | 475 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 8 | EM segmentation challenge | A30
Tensorflow | 1.15.5 | ResNeXt101 | 459 | 79.33 Top 1 | 4,213 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | Imagenet2012 | A30
Tensorflow | 1.15.5 | SE-ResNext101 | 557 | 79.83 Top 1 | 3,475 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 96 | Imagenet2012 | A30
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 5 | 92.69 F1 | 990 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | SQuaD v1.1 | A30

A10 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.12.0a0 | Tacotron2 | 143 | 0.54 Training Loss | 218,110 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 104 | LJSpeech 1.1 | A10
PyTorch | 1.12.0a0 | WaveGlow | 546 | -5.86 Training Loss | 771,407 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A10
PyTorch | 1.12.0a0 | GNMT V2 | 52 | 24.1 BLEU Score | 281,087 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | wmt16-en-de | A10
PyTorch | 1.12.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 45,665,439 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 131072 | MovieLens 20M | A10
PyTorch | 1.12.0a0 | EfficientNet-B0 | 1,117 | 76.3 Top 1 | 7,885 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | Imagenet2012 | A10
PyTorch | 1.12.0a0 | EfficientDet-D0 | 790 | 0.34 BBOX mAP | 923 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 30 | COCO 2017 | A10
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B0 | 1,143 | 76.78 Top 1 | 7,729 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | Imagenet2012 | A10
PyTorch | 1.12.0a0 | SE3 Transformer | 15 | 0.04 MAE | 12,058 molecules/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | A10
Tensorflow | 1.15.5 | U-Net Industrial | 1 | 0.99 IoU Threshold 0.99 | 657 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 2 | DAGM2007 | A10
Tensorflow | 2.8.0 | U-Net Medical | 3 | 0.89 DICE Score | 369 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 8 | EM segmentation challenge | A10
Tensorflow | 1.15.5 | ResNext101 | 573 | 79.16 Top 1 | 3,365 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | Imagenet2012 | A10
Tensorflow | 1.15.5 | SE-ResNext101 | 674 | 79.84 Top 1 | 2,869 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 96 | Imagenet2012 | A10
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 6 | 92.64 F1 | 753 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | SQuaD v1.1 | A10

T4 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.12.0a0 | ResNeXt101 | 1,375 | 79.43 Top 1 | 1,432 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | Mixed | 112 | Imagenet2012 | NVIDIA T4
PyTorch | 1.12.0a0 | WaveGlow | 1,120 | -5.82 Training Loss | 387,032 output samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.03-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.12.0a0 | GNMT V2 | 93 | 24.22 BLEU Score | 155,176 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4
PyTorch | 1.12.0a0 | NCF | 2 | 0.96 Hit Rate at 10 | 25,081,490 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA T4
PyTorch | 1.12.0a0 | EfficientNet-B0 | 2,371 | 76.43 Top 1 | 3,702 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4
PyTorch | 1.12.0a0 | EfficientDet-D0 | 1,349 | 0.34 BBOX mAP | 506 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 30 | COCO 2017 | NVIDIA T4
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B0 | 2,480 | 76.67 Top 1 | 3,567 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4
PyTorch | 1.12.0a0 | SE3 Transformer | 37 | 0.04 MAE | 4,666 molecules/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | NVIDIA T4
Tensorflow | 1.15.5 | U-Net Industrial | 2 | 0.99 IoU Threshold 0.99 | 299 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | Mixed | 2 | DAGM2007 | NVIDIA T4
Tensorflow | 1.15.5 | U-Net Medical | 39 | 0.89 DICE Score | 149 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4
Tensorflow | 1.15.5 | ResNext101 | 1,257 | 79.38 Top 1 | 1,533 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4
Tensorflow | 1.15.5 | SE-ResNext101 | 1,580 | 79.91 Top 1 | 1,220 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | Mixed | 96 | Imagenet2012 | NVIDIA T4
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 10 | 92.7 F1 | 378 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | Mixed | 16 | SQuaD v1.1 | NVIDIA T4


V100 Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.12.0a0 | Tacotron2 | 181 | 0.53 Training Loss | 180,095 total output mels/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB
PyTorch | 1.12.0a0 | WaveGlow | 411 | -5.72 Training Loss | 1,035,406 output samples/sec | 8x V100 | DGX-2 | 22.03-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB
PyTorch | 1.12.0a0 | GNMT V2 | 34 | 24.42 BLEU Score | 440,905 total tokens/sec | 8x V100 | DGX-2 | 22.05-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB
PyTorch | 1.12.0a0 | NCF | 1 | 0.96 Hit Rate at 10 | 94,214,173 samples/sec | 8x V100 | DGX-2 | 22.05-py3 | Mixed | 131072 | MovieLens 20M | V100-SXM3-32GB
PyTorch | 1.12.0a0 | EfficientNet-B0 | 1,028 | 76.47 Top 1 | 8,709 images/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 256 | Imagenet2012 | V100-SXM3-32GB
PyTorch | 1.12.0a0 | EfficientDet-D0 | 1,239 | 0.34 BBOX mAP | 565 images/sec | 8x V100 | DGX-2 | 22.05-py3 | Mixed | 60 | COCO 2017 | V100-SXM3-32GB
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B0 | 1,024 | 76.97 Top 1 | 8,737 images/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 256 | Imagenet2012 | V100-SXM3-32GB
PyTorch | 1.12.0a0 | SE3 Transformer | 14 | 0.04 MAE | 13,459 molecules/sec | 8x V100 | DGX-2 | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | V100-SXM3-32GB
Tensorflow | 1.15.5 | U-Net Industrial | 1 | 0.99 IoU Threshold 0.99 | 639 images/sec | 8x V100 | DGX-2 | 22.05-py3 | Mixed | 2 | DAGM2007 | V100-SXM3-32GB
Tensorflow | 1.15.5 | ResNext101 | 419 | 79.4 Top 1 | 4,622 images/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 128 | Imagenet2012 | V100-SXM3-32GB
Tensorflow | 1.15.5 | SE-ResNext101 | 493 | 79.9 Top 1 | 3,945 images/sec | 8x V100 | DGX-2 | 22.05-py3 | Mixed | 96 | Imagenet2012 | V100-SXM3-32GB
Tensorflow | 1.15.5 | U-Net Medical | 12 | 0.89 DICE Score | 466 images/sec | 8x V100 | DGX-2 | 22.05-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB
Tensorflow | 2.8.0 | Wide and Deep | 9 | 0.66 MAP at 12 | 2,921,693 samples/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | V100-SXM3-32GB
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 4 | 92.62 F1 | 1,376 sequences/sec | 8x V100 | DGX-2 | 22.05-py3 | Mixed | 32 | SQuaD v1.1 | V100-SXM3-32GB

Converged Training Performance of NVIDIA GPU on Cloud

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Training Performance on Cloud

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
Tensorflow | - | BERT-LARGE | 12 | 91.38 F1 | 769 sequences/sec | 8x A100 | AWS EC2 p4d.24xlarge | 22.04-py3 | Mixed | 24 | SQuaD v1.1 | A100-SXM4-40GB
Tensorflow | - | BERT-LARGE | 10 | 91.36 F1 | 825 sequences/sec | 8x A100 | Azure Standard_ND96amsr_A100_v4 | 22.05-py3 | Mixed | 24 | SQuaD v1.1 | A100-SXM4-40GB

BERT-Large = BERT-Large Fine Tuning (SQuaD v1.1) with Sequence Length of 384

V100 Training Performance on Cloud

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
Tensorflow | - | BERT-LARGE | 29 | 91.3 F1 | 172 sequences/sec | 8x V100 | AWS EC2 p3.16xlarge | 22.04-py3 | Mixed | 3 | SQuaD v1.1 | V100-SXM2-16GB

BERT-Large = BERT-Large Fine Tuning (SQuaD v1.1) with Sequence Length of 384

Converged Multi-Node Training Performance of NVIDIA GPU

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Multi-Node Training Performance

Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | Total GPUs | Nodes | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P1 | 296 | 1.53 Training Loss | 25,365 sequences/sec | 64x A100 | 8 | Selene | 21.12-py3 | Mixed | 64 | SQuaD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P2 | 169 | 1.35 Training Loss | 5,112 sequences/sec | 64x A100 | 8 | Selene | 21.12-py3 | Mixed | 16 | SQuaD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training E2E | 253 | 1.35 Training Loss | - | 64x A100 | 8 | Selene | 21.12-py3 | Mixed | 16 | SQuaD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P1 | 160 | 1.51 Training Loss | 48,380 sequences/sec | 128x A100 | 16 | Selene | 21.12-py3 | Mixed | 64 | SQuaD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P2 | 87 | 1.34 Training Loss | 9,961 sequences/sec | 128x A100 | 16 | Selene | 21.12-py3 | Mixed | 16 | SQuaD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training E2E | 136 | 1.34 Training Loss | - | 128x A100 | 16 | Selene | 21.12-py3 | Mixed | 16 | SQuaD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P1 | 87 | 1.49 Training Loss | 89,062 sequences/sec | 256x A100 | 32 | Selene | 21.12-py3 | Mixed | 64 | SQuaD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P2 | 46 | 1.34 Training Loss | 19,169 sequences/sec | 256x A100 | 32 | Selene | 21.12-py3 | Mixed | 16 | SQuaD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training E2E | 73 | 1.34 Training Loss | - | 256x A100 | 32 | Selene | 21.12-py3 | Mixed | 16 | SQuaD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P1 | 51 | 1.5 Training Loss | 153,429 sequences/sec | 512x A100 | 64 | Selene | 21.12-py3 | Mixed | 64 | SQuaD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P2 | 25 | 1.33 Training Loss | 36,887 sequences/sec | 512x A100 | 64 | Selene | 21.12-py3 | Mixed | 16 | SQuaD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training E2E | 42 | 1.33 Training Loss | - | 512x A100 | 64 | Selene | 21.12-py3 | Mixed | 16 | SQuaD v1.1 | A100-SXM4-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P1 | 26 | 1.5 Training Loss | 300,769 sequences/sec | 1,024x A100 | 128 | Selene | 21.09-py3 | Mixed | 64 | SQuaD v1.1 | A100-SXM4-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P2 | 13 | 1.35 Training Loss | 74,498 sequences/sec | 1,024x A100 | 128 | Selene | 21.09-py3 | Mixed | 16 | SQuaD v1.1 | A100-SXM4-80GB
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training E2E | 22 | 1.35 Training Loss | - | 1,024x A100 | 128 | Selene | 21.09-py3 | Mixed | 16 | SQuaD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | Transformer | 186 | 18.25 Perplexity | 454,979 total tokens/sec | 16x A100 | 2 | Selene | 21.12-py3 | Mixed | 16 | SQuaD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | Transformer | 105 | 18.27 Perplexity | 822,173 total tokens/sec | 64x A100 | 4 | Selene | 21.12-py3 | Mixed | 16 | SQuaD v1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | Transformer | 63 | 18.34 Perplexity | 1,389,494 total tokens/sec | 64x A100 | 8 | Selene | 21.12-py3 | Mixed | 16 | SQuaD v1.1 | A100-SXM4-80GB

BERT-Large Pre-Training Phase 1 Sequence Length = 128
BERT-Large Pre-Training Phase 2 Sequence Length = 512
Starting from 21.09-py3, ECC is enabled
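Because Phase 1 and Phase 2 use different sequence lengths (128 vs. 512), the sequences/sec figures above are not directly comparable between phases; multiplying by sequence length gives tokens/sec, which is. A quick check using the 64-GPU rows above:

```python
def tokens_per_sec(sequences_per_sec, seq_len):
    """Convert a sequences/sec throughput to tokens/sec for a fixed sequence length."""
    return sequences_per_sec * seq_len

# 64x A100 rows from the multi-node table above:
p1 = tokens_per_sec(25_365, 128)  # Phase 1, sequence length 128
p2 = tokens_per_sec(5_112, 512)   # Phase 2, sequence length 512
print(p1, p2)  # 3246720 2617344
```

On this basis the two phases process tokens at broadly similar rates, even though the raw sequences/sec figures differ by roughly 5x.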

Single-GPU Training

Some scenarios, such as single-GPU throughput, are not typically used in real-world training; the table below is provided for reference as an indication of a platform’s single-chip throughput.

Related Resources


NVIDIA’s complete solution stack, from hardware to software, allows data scientists to deliver unprecedented acceleration at every scale. Visit the NVIDIA NGC catalog to pull containers and quickly get up and running with deep learning.
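Single-GPU numbers here can be combined with the 8-GPU converged results above to estimate multi-GPU scaling. A rough sketch, using the TensorFlow ResNext101 A100 entries (1,322 images/sec on 1x A100 below vs. 10,300 images/sec on 8x A100 in the converged table, both batch size 256, mixed precision); note the runs come from different tables, so this is only an approximation:

```python
single = 1_322   # 1x A100, TensorFlow ResNext101, images/sec
eight = 10_300   # 8x A100, same model, framework, and batch size
efficiency = eight / (8 * single)  # fraction of ideal 8x speedup
print(f"{efficiency:.1%}")  # -> 97.4% of linear scaling
```

Throughput-based scaling like this is typically higher than time-to-train-based scaling, since it excludes startup, data loading, and evaluation overheads.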


Single GPU Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.12.0a0 | Tacotron2 | 40,316 total output mels/sec | 1x A100 | DGX A100 | 22.05-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM4-80GB
PyTorch | 1.12.0a0 | WaveGlow | 230,472 output samples/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-80GB
PyTorch | 1.11.0a0 | FastPitch | 87,184 frames/sec | 1x A100 | DGX A100 | 22.02-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM4-80GB
PyTorch | 1.12.0a0 | GNMT V2 | 170,507 total tokens/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-80GB
PyTorch | 1.12.0a0 | NCF | 39,322,684 samples/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-80GB
PyTorch | 1.12.0a0 | ResNeXt101 | 1,134 images/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 128 | Imagenet2012 | A100-SXM4-80GB
PyTorch | 1.12.0a0 | Transformer-XL Large | 17,068 total tokens/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 16 | WikiText-103 | A100-SXM4-80GB
PyTorch | 1.12.0a0 | Transformer-XL Base | 90,918 total tokens/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 128 | WikiText-103 | A100-SXM4-80GB
PyTorch | 1.12.0a0 | nnU-Net | 1,120 images/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 64 | Medical Segmentation Decathlon | A100-SXM4-80GB
PyTorch | 1.12.0a0 | BERT Large Pre-Training Phase 2 | 289 sequences/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 56 | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | 1.12.0a0 | BERT Large Pre-Training Phase 1 | 853 sequences/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 512 | Wikipedia 2020/01/01 | A100-SXM4-80GB
PyTorch | 1.12.0a0 | EfficientDet-D0 | 270 images/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 150 | COCO 2017 | A100-SXM4-80GB
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B0 | 1,920 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 256 | Imagenet2012 | A100-SXM4-80GB
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B4 | 940 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 128 | Imagenet2012 | A100-SXM4-80GB
PyTorch | 1.12.0a0 | SE3 Transformer | 3,097 molecules/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | A100-SXM4-80GB
Tensorflow | 1.15.5 | ResNext101 | 1,322 images/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 256 | Imagenet2012 | A100-SXM4-80GB
Tensorflow | 1.15.5 | SE-ResNext101 | 1,156 images/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 256 | Imagenet2012 | A100-SXM4-80GB
Tensorflow | 1.15.5 | U-Net Industrial | 371 images/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 16 | DAGM2007 | A100-SXM4-40GB
Tensorflow | 2.8.0 | U-Net Medical | 149 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM4-80GB
Tensorflow | 2.8.0 | Wide and Deep | 1,876,978 samples/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A100-SXM4-80GB
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 372 sequences/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-80GB
Tensorflow | 1.15.5 | NCF | 43,735,836 samples/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-40GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server

A40 Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.12.0a0 | Tacotron2 | 33,988 total output mels/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | LJSpeech 1.1 | A40
PyTorch | 1.12.0a0 | WaveGlow | 144,465 output samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A40
PyTorch | 1.12.0a0 | GNMT V2 | 81,260 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | wmt16-en-de | A40
PyTorch | 1.12.0a0 | NCF | 17,869,902 samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 1048576 | MovieLens 20M | A40
PyTorch | 1.12.0a0 | Transformer-XL Large | 10,184 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | WikiText-103 | A40
PyTorch | 1.12.0a0 | FastPitch | 77,556 frames/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 32 | LJSpeech 1.1 | A40
PyTorch | 1.12.0a0 | Transformer-XL Base | 42,411 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | WikiText-103 | A40
PyTorch | 1.12.0a0 | nnU-Net | 562 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 64 | Medical Segmentation Decathlon | A40
PyTorch | 1.12.0a0 | EfficientNet-B0 | 1,255 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 256 | Imagenet2012 | A40
PyTorch | 1.12.0a0 | EfficientDet-D0 | 172 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 60 | COCO 2017 | A40
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B0 | 1,262 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 256 | Imagenet2012 | A40
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B4 | 485 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | Imagenet2012 | A40
PyTorch | 1.12.0a0 | SE3 Transformer | 1,811 molecules/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | A40
Tensorflow | 1.15.5 | U-Net Industrial | 123 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | DAGM2007 | A40
Tensorflow | 1.15.5 | BERT-LARGE | 51 sentences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 24 | SQuaD v1.1 | A40
Tensorflow | 2.8.0 | U-Net Medical | 70 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 8 | EM segmentation challenge | A40
Tensorflow | 2.8.0 | Wide and Deep | 956,456 samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A40
Tensorflow | 1.15.5 | ResNext101 | 605 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 256 | Imagenet2012 | A40
Tensorflow | 1.15.5 | SE-ResNext101 | 551 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 256 | Imagenet2012 | A40
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 165 sequences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 32 | SQuaD v1.1 | A40

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server

A30 Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.12.0a0 | Tacotron2 | 34,239 total output mels/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 104 | LJSpeech 1.1 | A30
PyTorch | 1.12.0a0 | WaveGlow | 146,477 output samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A30
PyTorch | 1.12.0a0 | FastPitch | 66,964 frames/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | LJSpeech 1.1 | A30
PyTorch | 1.12.0a0 | NCF | 18,779,967 samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 1048576 | MovieLens 20M | A30
PyTorch | 1.12.0a0 | GNMT V2 | 91,406 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | wmt16-en-de | A30
PyTorch | 1.12.0a0 | Transformer-XL Base | 19,067 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 32 | WikiText-103 | A30
PyTorch | 1.12.0a0 | ResNeXt101 | 547 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 112 | Imagenet2012 | A30
PyTorch | 1.12.0a0 | Transformer-XL Large | 7,055 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 4 | WikiText-103 | A30
PyTorch | 1.12.0a0 | nnU-Net | 580 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 64 | Medical Segmentation Decathlon | A30
PyTorch | 1.12.0a0 | EfficientDet-D0 | 145 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 30 | COCO 2017 | A30
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B0 | 1,231 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | Imagenet2012 | A30
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B4 | 348 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 32 | Imagenet2012 | A30
PyTorch | 1.12.0a0 | SE3 Transformer | 2,017 molecules/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | A30
Tensorflow | 1.15.5 | ResNext101 | 587 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | Imagenet2012 | A30
Tensorflow | 1.15.5 | SE-ResNext101 | 497 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 96 | Imagenet2012 | A30
Tensorflow | 1.15.5 | U-Net Industrial | 118 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | DAGM2007 | A30
Tensorflow | 2.8.0 | U-Net Medical | 73 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 8 | EM segmentation challenge | A30
Tensorflow | 1.15.5 | Transformer XL Base | 18,449 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | WikiText-103 | A30
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 165 sequences/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | SQuaD v1.1 | A30
Tensorflow | 1.15.5 | Wide and Deep | 320,517 samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A30

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server

A10 Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.12.0a0 | Tacotron2 | 28,869 total output mels/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 104 | LJSpeech 1.1 | A10
PyTorch | 1.12.0a0 | WaveGlow | 112,773 output samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A10
PyTorch | 1.12.0a0 | FastPitch | 58,928 frames/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | LJSpeech 1.1 | A10
PyTorch | 1.12.0a0 | Transformer-XL Base | 15,821 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 32 | WikiText-103 | A10
PyTorch | 1.12.0a0 | GNMT V2 | 65,590 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | wmt16-en-de | A10
PyTorch | 1.12.0a0 | ResNeXt101 | 397 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 112 | Imagenet2012 | A10
PyTorch | 1.12.0a0 | NCF | 14,965,414 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 1048576 | MovieLens 20M | A10
PyTorch | 1.12.0a0 | Transformer-XL Large | 6,061 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 4 | WikiText-103 | A10
PyTorch | 1.12.0a0 | nnU-Net | 454 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 64 | Medical Segmentation Decathlon | A10
PyTorch | 1.12.0a0 | EfficientDet-D0 | 132 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 30 | COCO 2017 | A10
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B0 | 1,010 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | Imagenet2012 | A10
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B4 | 337 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 32 | Imagenet2012 | A10
PyTorch | 1.12.0a0 | SE3 Transformer | 1,579 molecules/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | A10
Tensorflow | 1.15.5 | ResNext101 | 451 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | Imagenet2012 | A10
Tensorflow | 1.15.5 | SE-ResNext101 | 391 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 96 | Imagenet2012 | A10
Tensorflow | 1.15.5 | U-Net Industrial | 100 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | DAGM2007 | A10
Tensorflow | 2.8.0 | U-Net Medical | 52 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 8 | EM segmentation challenge | A10
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 122 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | SQuaD v1.1 | A10
Tensorflow | 1.15.5 | Wide and Deep | 293,476 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A10

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server

T4 Training Performance

Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
PyTorch | 1.12.0a0 | ResNeXt101 | 186 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 112 | Imagenet2012 | NVIDIA T4
PyTorch | 1.12.0a0 | Tacotron2 | 6,486 total output mels/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | FP32 | 48 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.12.0a0 | WaveGlow | 55,615 output samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.12.0a0 | FastPitch | 30,228 frames/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 16 | LJSpeech 1.1 | NVIDIA T4
PyTorch | 1.12.0a0 | GNMT V2 | 30,820 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4
PyTorch | 1.12.0a0 | NCF | 7,253,124 samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 1048576 | MovieLens 20M | NVIDIA T4
PyTorch | 1.12.0a0 | Transformer-XL Base | 9,024 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 32 | WikiText-103 | NVIDIA T4
PyTorch | 1.12.0a0 | SE-ResNeXt101 | 150 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 112 | Imagenet2012 | NVIDIA T4
PyTorch | 1.12.0a0 | Transformer-XL Large | 2,733 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 4 | WikiText-103 | NVIDIA T4
PyTorch | 1.12.0a0 | nnU-Net | 205 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 64 | Medical Segmentation Decathlon | NVIDIA T4
PyTorch | 1.12.0a0 | EfficientDet-D0 | 68 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 30 | COCO 2017 | NVIDIA T4
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B0 | 480 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B4 | 170 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 32 | Imagenet2012 | NVIDIA T4
PyTorch | 1.12.0a0 | SE3 Transformer | 601 molecules/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | NVIDIA T4
Tensorflow | 1.15.5 | U-Net Industrial | 45 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 16 | DAGM2007 | NVIDIA T4
Tensorflow | 2.8.0 | U-Net Medical | 21 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4
Tensorflow | 1.15.5 | SE-ResNext101 | 164 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 96 | Imagenet2012 | NVIDIA T4
Tensorflow | 1.15.5 | ResNext101 | 199 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4
Tensorflow | 2.8.0 | Electra Base Fine Tuning | 56 sequences/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 16 | SQuaD v1.1 | NVIDIA T4
Tensorflow | 2.8.0 | Wide and Deep | 195,709 samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | NVIDIA T4

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
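The mel-frame rate ties this metric back to audio duration. As an illustrative sketch only (not an additional measurement), assuming LJSpeech's usual 22,050 Hz sampling rate and a hop length of 256 samples per mel frame (the defaults in NVIDIA's FastPitch recipe), frames/sec converts to seconds of audio synthesized per second of compute:

```python
# Convert FastPitch mel-frame throughput to an approximate real-time
# factor, assuming LJSpeech defaults: 22,050 Hz sampling rate and a
# 256-sample hop per mel frame (assumed values, for illustration).
SAMPLE_RATE = 22050
HOP_LENGTH = 256

frames_per_audio_second = SAMPLE_RATE / HOP_LENGTH  # ~86.1 frames per audio second

def audio_seconds_per_second(frames_per_sec):
    """Seconds of audio synthesized per wall-clock second."""
    return frames_per_sec / frames_per_audio_second

# T4 row above: 30,228 frames/sec
print(round(audio_seconds_per_second(30228)))  # ~351x real time
```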



V100 Training Performance

FrameworkFramework VersionNetworkThroughputGPUServerContainerPrecisionBatch SizeDatasetGPU Version
PyTorch1.12.0a0ResNeXt101569 images/sec1x V100DGX-222.05-py3Mixed112Imagenet2012V100-SXM3-32GB
1.12.0a0Tacotron224,775 total output mels/sec1x V100DGX-222.05-py3Mixed104LJSpeech 1.1V100-SXM3-32GB
1.12.0a0WaveGlow144,081 output samples/sec1x V100DGX-222.04-py3Mixed10LJSpeech 1.1V100-SXM3-32GB
1.12.0a0FastPitch68,742 frames/sec1x V100DGX-222.03-py3Mixed64LJSpeech 1.1V100-SXM3-32GB
1.12.0a0GNMT V278,408 total tokens/sec1x V100DGX-222.05-py3Mixed128wmt16-en-deV100-SXM3-32GB
1.12.0a0NCF23,153,985 samples/sec1x V100DGX-222.04-py3Mixed1048576MovieLens 20MV100-SXM3-32GB
1.12.0a0Transformer-XL Base17,951 total tokens/sec1x V100DGX-222.05-py3Mixed32WikiText-103V100-SXM3-32GB
1.12.0a0Transformer-XL Large7,310 total tokens/sec1x V100DGX-222.05-py3Mixed8WikiText-103V100-SXM3-32GB
1.12.0a0nnU-Net659 images/sec1x V100DGX-222.05-py3Mixed64Medical Segmentation DecathlonV100-SXM3-32GB
1.12.0a0EfficientNet-B01,280 images/sec1x V100DGX-222.05-py3Mixed256Imagenet2012V100-SXM3-32GB
1.12.0a0EfficientDet-D0150 images/sec1x V100DGX-222.05-py3Mixed60COCO 2017V100-SXM3-32GB
1.12.0a0EfficientNet-WideSE-B01,279 images/sec1x V100DGX-222.05-py3Mixed256Imagenet2012V100-SXM3-32GB
1.12.0a0EfficientNet-WideSE-B4501 images/sec1x V100DGX-222.04-py3Mixed64Imagenet2012V100-SXM3-32GB
1.12.0a0SE3 Transformer1,818 molecules/sec1x V100DGX-222.05-py3Mixed240Quantum Machines 9V100-SXM3-32GB
TensorFlow1.15.5ResNext101633 images/sec1x V100DGX-222.05-py3Mixed128Imagenet2012V100-SXM3-32GB
1.15.5SE-ResNext101556 images/sec1x V100DGX-222.05-py3Mixed96Imagenet2012V100-SXM3-32GB
1.15.5U-Net Industrial119 images/sec1x V100DGX-222.05-py3Mixed16DAGM2007V100-SXM3-32GB
2.8.0U-Net Medical67 images/sec1x V100DGX-222.05-py3Mixed8EM segmentation challengeV100-SXM3-32GB
2.8.0Wide and Deep1,022,754 samples/sec1x V100DGX-222.04-py3Mixed131072Kaggle Outbrain Click PredictionV100-SXM3-32GB
2.8.0Electra Base Fine Tuning188 sequences/sec1x V100DGX-222.05-py3Mixed32SQuAD v1.1V100-SXM3-32GB
1.15.5Transformer-XL Base18,500 total tokens/sec1x V100DGX-222.05-py3Mixed16WikiText-103V100-SXM3-32GB

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec

Single GPU Training Performance of NVIDIA GPUs on Cloud

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Training Performance on Cloud

FrameworkFramework VersionNetworkThroughputGPUServerContainerPrecisionBatch SizeDatasetGPU Version
MXNet-ResNet-50 v1.52,916 images/sec1x A100GCP A2-HIGHGPU-1G22.04-py3Mixed192ImageNet2012A100-SXM4-40GB

BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384

T4 Training Performance on Cloud

FrameworkFramework VersionNetworkThroughputGPUServerContainerPrecisionBatch SizeDatasetGPU Version
MXNet-ResNet-50 v1.5457 images/sec1x T4AWS EC2 g4dn.4xlarge22.04-py3Mixed192ImageNet2012NVIDIA T4
-ResNet-50 v1.5419 images/sec1x T4GCP N1-HIGHMEM-822.04-py3Mixed192ImageNet2012NVIDIA T4
TensorFlow-ResNet-50 v1.5417 images/sec1x T4AWS EC2 g4dn.4xlarge22.04-py3Mixed256Imagenet2012NVIDIA T4
-ResNet-50 v1.5406 images/sec1x T4GCP N1-HIGHMEM-822.04-py3Mixed256Imagenet2012NVIDIA T4

BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384



V100 Training Performance on Cloud

FrameworkFramework VersionNetworkThroughputGPUServerContainerPrecisionBatch SizeDatasetGPU Version
MXNet-ResNet-50 v1.51,519 images/sec1x V100AWS EC2 p3.2xlarge22.04-py3Mixed192ImageNet2012V100-SXM2-16GB
-ResNet-50 v1.51,434 images/sec1x V100GCP N1-HIGHMEM-822.04-py3Mixed192ImageNet2012V100-SXM2-16GB

BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384

AI Inference

Real-world inference demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance from data center to edge.

Related Resources

Power high-throughput, low-latency inference with NVIDIA’s complete solution stack:


MLPerf Inference v2.0 Performance Benchmarks

Offline Scenario - Closed Division

NetworkThroughputGPUServerGPU VersionDatasetTarget Accuracy
ResNet-50 v1.5312,849 samples/sec8x A100DGX A100A100 SXM-80GBImageNet76.46% Top1
314,929 samples/sec8x A100Gigabyte G492-PD0A100 SXM-80GBImageNet76.46% Top1
138,516 samples/sec4x A100DGX Station A100A100 SXM-80GBImageNet76.46% Top1
5,231 samples/sec1x1g.10gb A100DGX A100A100 SXM-80GBImageNet76.46% Top1
293,451 samples/sec8x A100Gigabyte G482-Z54A100 PCIe-80GBImageNet76.46% Top1
145,947 samples/sec4x A100Gigabyte G242-P31A100 PCIe-80GBImageNet76.46% Top1
147,246 samples/sec8x A30Gigabyte G482-Z54A30ImageNet76.46% Top1
5,089 samples/sec1x1g.6gb A30Gigabyte G482-Z54A30ImageNet76.46% Top1
SSD ResNet-347,923 samples/sec8x A100DGX A100A100 SXM-80GBCOCO0.2 mAP
7,880 samples/sec8x A100Gigabyte G492-PD0A100 SXM-80GBCOCO0.2 mAP
3,397 samples/sec4x A100DGX Station A100A100 SXM-80GBCOCO0.2 mAP
135 samples/sec1x1g.10gb A100DGX A100A100 SXM-80GBCOCO0.2 mAP
7,297 samples/sec8x A100Gigabyte G482-Z54A100 PCIe-80GBCOCO0.2 mAP
3,623 samples/sec4x A100Gigabyte G242-P31A100 PCIe-80GBCOCO0.2 mAP
3,827 samples/sec8x A30Gigabyte G482-Z54A30COCO0.2 mAP
129 samples/sec1x1g.6gb A30Gigabyte G482-Z54A30COCO0.2 mAP
3D-UNet25 samples/sec8x A100DGX A100A100 SXM-80GBKiTS 20190.863 DICE mean
24 samples/sec8x A100Gigabyte G492-PD0A100 SXM-80GBKiTS 20190.863 DICE mean
11 samples/sec4x A100DGX Station A100A100 SXM-80GBKiTS 20190.863 DICE mean
24 samples/sec8x A100Gigabyte G482-Z54A100 PCIe-80GBKiTS 20190.863 DICE mean
12 samples/sec4x A100Gigabyte G242-P31A100 PCIe-80GBKiTS 20190.863 DICE mean
13 samples/sec8x A30Gigabyte G482-Z54A30KiTS 20190.863 DICE mean
RNN-T106,753 samples/sec8x A100DGX A100A100 SXM-80GBLibriSpeech7.45% WER
107,399 samples/sec8x A100Gigabyte G492-PD0A100 SXM-80GBLibriSpeech7.45% WER
49,789 samples/sec4x A100DGX Station A100A100 SXM-80GBLibriSpeech7.45% WER
1,612 samples/sec1x1g.10gb A100DGX A100A100 SXM-80GBLibriSpeech7.45% WER
101,788 samples/sec8x A100Gigabyte G482-Z54A100 PCIe-80GBLibriSpeech7.45% WER
52,752 samples/sec4x A100Gigabyte G242-P31A100 PCIe-80GBLibriSpeech7.45% WER
52,453 samples/sec8x A30Gigabyte G482-Z54A30LibriSpeech7.45% WER
1,432 samples/sec1x1g.6gb A30Gigabyte G482-Z54A30LibriSpeech7.45% WER
BERT27,971 samples/sec8x A100DGX A100A100 SXM-80GBSQuAD v1.190.07% f1
27,894 samples/sec8x A100Gigabyte G492-PD0A100 SXM-80GBSQuAD v1.190.07% f1
11,387 samples/sec4x A100DGX Station A100A100 SXM-80GBSQuAD v1.190.07% f1
484 samples/sec1x1g.10gb A100DGX A100A100 SXM-80GBSQuAD v1.190.07% f1
25,035 samples/sec8x A100Gigabyte G482-Z54A100 PCIe-80GBSQuAD v1.190.07% f1
12,595 samples/sec4x A100Gigabyte G242-P31A100 PCIe-80GBSQuAD v1.190.07% f1
13,340 samples/sec8x A30Gigabyte G482-Z54A30SQuAD v1.190.07% f1
502 samples/sec1x1g.6gb A30Gigabyte G482-Z54A30SQuAD v1.190.07% f1
DLRM2,499,040 samples/sec8x A100DGX A100A100 SXM-80GBCriteo 1TB Click Logs80.25% AUC
2,477,270 samples/sec8x A100Gigabyte G492-PD0A100 SXM-80GBCriteo 1TB Click Logs80.25% AUC
1,065,600 samples/sec4x A100DGX Station A100A100 SXM-80GBCriteo 1TB Click Logs80.25% AUC
40,424 samples/sec1x1g.10gb A100DGX A100A100 SXM-80GBCriteo 1TB Click Logs80.25% AUC
2,313,280 samples/sec8x A100Gigabyte G482-Z54A100 PCIe-80GBCriteo 1TB Click Logs80.25% AUC
1,125,130 samples/sec4x A100Gigabyte G242-P31A100 PCIe-80GBCriteo 1TB Click Logs80.25% AUC
1,105,550 samples/sec8x A30Gigabyte G482-Z54A30Criteo 1TB Click Logs80.25% AUC
35,831 samples/sec1x1g.6gb A30Gigabyte G482-Z54A30Criteo 1TB Click Logs80.25% AUC

Server Scenario - Closed Division

NetworkThroughputGPUServerGPU VersionTarget AccuracyMLPerf Server Latency Constraints (ms)Dataset
ResNet-50 v1.5260,031 queries/sec8x A100DGX A100A100 SXM-80GB76.46% Top115ImageNet
270,027 queries/sec8x A100Gigabyte G492-PD0A100 SXM-80GB76.46% Top115ImageNet
107,000 queries/sec4x A100DGX Station A100A100 SXM-80GB76.46% Top115ImageNet
3,527 queries/sec1x1g.10gb A100DGX A100A100 SXM-80GB76.46% Top115ImageNet
200,007 queries/sec8x A100Gigabyte G482-Z54A100 PCIe-80GB76.46% Top115ImageNet
104,000 queries/sec4x A100Gigabyte G242-P31A100 PCIe-80GB76.46% Top115ImageNet
116,002 queries/sec8x A30Gigabyte G482-Z54A3076.46% Top115ImageNet
3,398 queries/sec1x1g.6gb A30Gigabyte G482-Z54A3076.46% Top115ImageNet
SSD ResNet-347,575 queries/sec8x A100DGX A100A100 SXM-80GB0.2 mAP100COCO
7,505 queries/sec8x A100Gigabyte G492-PD0A100 SXM-80GB0.2 mAP100COCO
3,247 queries/sec4x A100DGX Station A100A100 SXM-80GB0.2 mAP100COCO
98 queries/sec1x1g.10gb A100DGX A100A100 SXM-80GB0.2 mAP100COCO
6,466 queries/sec8x A100Gigabyte G482-Z54A100 PCIe-80GB0.2 mAP100COCO
3,078 queries/sec4x A100Gigabyte G242-P31A100 PCIe-80GB0.2 mAP100COCO
3,570 queries/sec8x A30Gigabyte G482-Z54A300.2 mAP100COCO
95 queries/sec1x1g.6gb A30Gigabyte G482-Z54A300.2 mAP100COCO
RNN-T104,000 queries/sec8x A100DGX A100A100 SXM-80GB7.45% WER1,000LibriSpeech
104,000 queries/sec8x A100Gigabyte G492-PD0A100 SXM-80GB7.45% WER1,000LibriSpeech
44,989 queries/sec4x A100DGX Station A100A100 SXM-80GB7.45% WER1,000LibriSpeech
1,350 queries/sec1x1g.10gb A100DGX A100A100 SXM-80GB7.45% WER1,000LibriSpeech
89,994 queries/sec8x A100Gigabyte G482-Z54A100 PCIe-80GB7.45% WER1,000LibriSpeech
42,989 queries/sec4x A100Gigabyte G242-P31A100 PCIe-80GB7.45% WER1,000LibriSpeech
36,989 queries/sec8x A30Gigabyte G482-Z54A307.45% WER1,000LibriSpeech
1,100 queries/sec1x1g.6gb A30Gigabyte G482-Z54A307.45% WER1,000LibriSpeech
BERT25,792 queries/sec8x A100DGX A100A100 SXM-80GB90.07% f1130SQuAD v1.1
25,391 queries/sec8x A100Gigabyte G492-PD0A100 SXM-80GB90.07% f1130SQuAD v1.1
10,794 queries/sec4x A100DGX Station A100A100 SXM-80GB90.07% f1130SQuAD v1.1
380 queries/sec1x1g.10gb A100DGX A100A100 SXM-80GB90.07% f1130SQuAD v1.1
22,989 queries/sec8x A100Gigabyte G482-Z54A100 PCIe-80GB90.07% f1130SQuAD v1.1
10,394 queries/sec4x A100Gigabyte G242-P31A100 PCIe-80GB90.07% f1130SQuAD v1.1
11,491 queries/sec8x A30Gigabyte G482-Z54A3090.07% f1130SQuAD v1.1
380 queries/sec1x1g.6gb A30Gigabyte G482-Z54A3090.07% f1130SQuAD v1.1
DLRM2,302,640 queries/sec8x A100DGX A100A100 SXM-80GB80.25% AUC30Criteo 1TB Click Logs
1,951,890 queries/sec8x A100Gigabyte G492-PD0A100 SXM-80GB80.25% AUC30Criteo 1TB Click Logs
950,448 queries/sec4x A100DGX Station A100A100 SXM-80GB80.25% AUC30Criteo 1TB Click Logs
35,989 queries/sec1x1g.10gb A100DGX A100A100 SXM-80GB80.25% AUC30Criteo 1TB Click Logs
1,300,850 queries/sec8x A100Gigabyte G482-Z54A100 PCIe-80GB80.25% AUC30Criteo 1TB Click Logs
600,183 queries/sec4x A100Gigabyte G242-P31A100 PCIe-80GB80.25% AUC30Criteo 1TB Click Logs
960,456 queries/sec8x A30Gigabyte G482-Z54A3080.25% AUC30Criteo 1TB Click Logs
30,987 queries/sec1x1g.6gb A30Gigabyte G482-Z54A3080.25% AUC30Criteo 1TB Click Logs

Power Efficiency Offline Scenario - Closed Division

NetworkThroughputThroughput per WattGPUServerGPU VersionDataset
ResNet-50 v1.5250,242 samples/sec86.74 samples/sec/watt8x A100DGX A100A100 SXM-80GBImageNet
268,462 samples/sec95.36 samples/sec/watt8x A100Gigabyte G492-PD0A100 SXM-80GBImageNet
128,665 samples/sec113.68 samples/sec/watt4x A100DGX Station A100A100 SXM-80GBImageNet
211,065 samples/sec113.44 samples/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBImageNet
104,893 samples/sec105.20 samples/sec/watt4x A100Gigabyte G242-P31A100 PCIe-80GBImageNet
SSD ResNet-346,576 samples/sec2.11 samples/sec/watt8x A100DGX A100A100 SXM-80GBCOCO
6,521 samples/sec2.31 samples/sec/watt8x A100Gigabyte G492-PD0A100 SXM-80GBCOCO
3,307 samples/sec2.67 samples/sec/watt4x A100DGX Station A100A100 SXM-80GBCOCO
5,778 samples/sec2.75 samples/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBCOCO
2,894 samples/sec2.57 samples/sec/watt4x A100Gigabyte G242-P31A100 PCIe-80GBCOCO
3D-UNet21 samples/sec0.007 samples/sec/watt8x A100DGX A100A100 SXM-80GBKiTS 2019
20 samples/sec0.008 samples/sec/watt8x A100Gigabyte G492-PD0A100 SXM-80GBKiTS 2019
11 samples/sec0.009 samples/sec/watt4x A100DGX Station A100A100 SXM-80GBKiTS 2019
19 samples/sec0.010 samples/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBKiTS 2019
10 samples/sec0.010 samples/sec/watt4x A100Gigabyte G242-P31A100 PCIe-80GBKiTS 2019
RNN-T90,730 samples/sec27.94 samples/sec/watt8x A100DGX A100A100 SXM-80GBLibriSpeech
90,946 samples/sec31.89 samples/sec/watt8x A100Gigabyte G492-PD0A100 SXM-80GBLibriSpeech
44,966 samples/sec37.87 samples/sec/watt4x A100DGX Station A100A100 SXM-80GBLibriSpeech
85,952 samples/sec39.16 samples/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBLibriSpeech
42,945 samples/sec37.86 samples/sec/watt4x A100Gigabyte G242-P31A100 PCIe-80GBLibriSpeech
BERT24,794 samples/sec6.99 samples/sec/watt8x A100DGX A100A100 SXM-80GBSQuAD v1.1
20,706 samples/sec7.38 samples/sec/watt8x A100Gigabyte G492-PD0A100 SXM-80GBSQuAD v1.1
10,828 samples/sec8.64 samples/sec/watt4x A100DGX Station A100A100 SXM-80GBSQuAD v1.1
19,993 samples/sec8.47 samples/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBSQuAD v1.1
10,047 samples/sec8.06 samples/sec/watt4x A100Gigabyte G242-P31A100 PCIe-80GBSQuAD v1.1
DLRM2,140,540 samples/sec646.23 samples/sec/watt8x A100DGX A100A100 SXM-80GBCriteo 1TB Click Logs
1,940,830 samples/sec701.53 samples/sec/watt8x A100Gigabyte G492-PD0A100 SXM-80GBCriteo 1TB Click Logs
1,001,010 samples/sec797.59 samples/sec/watt4x A100DGX Station A100A100 SXM-80GBCriteo 1TB Click Logs
1,845,900 samples/sec795.67 samples/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBCriteo 1TB Click Logs
953,749 samples/sec768.81 samples/sec/watt4x A100Gigabyte G242-P31A100 PCIe-80GBCriteo 1TB Click Logs
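Throughput per watt is measured throughput divided by total board power, so the implied power draw can be back-computed from any row. A quick sanity check against the 8x A100 DGX A100 DLRM row above (this division is an illustration, not a reported measurement):

```python
# Back-compute implied total board power from a throughput-per-watt row.
# Values taken from the 8x A100 DGX A100 DLRM offline row above.
throughput = 2_140_540        # samples/sec
per_watt = 646.23             # samples/sec/watt

total_power_w = throughput / per_watt   # implied total board power
per_gpu_w = total_power_w / 8           # rough per-GPU figure

print(f"{total_power_w:.0f} W total, {per_gpu_w:.0f} W per GPU")
```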

Power Efficiency Server Scenario - Closed Division

NetworkThroughputThroughput per WattGPUServerGPU VersionDataset
ResNet-50 v1.5229,016 queries/sec78.69 queries/sec/watt8x A100DGX A100A100 SXM-80GBImageNet
230,018 queries/sec81.52 queries/sec/watt8x A100Gigabyte G492-PD0A100 SXM-80GBImageNet
107,000 queries/sec94.59 queries/sec/watt4x A100DGX Station A100A100 SXM-80GBImageNet
185,005 queries/sec88.70 queries/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBImageNet
92,496 queries/sec93.88 queries/sec/watt4x A100Gigabyte G242-P31A100 PCIe-80GBImageNet
SSD ResNet-346,298 queries/sec2.01 queries/sec/watt8x A100DGX A100A100 SXM-80GBCOCO
6,298 queries/sec2.23 queries/sec/watt8x A100Gigabyte G492-PD0A100 SXM-80GBCOCO
3,078 queries/sec2.50 queries/sec/watt4x A100DGX Station A100A100 SXM-80GBCOCO
5,697 queries/sec2.72 queries/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBCOCO
2,748 queries/sec2.48 queries/sec/watt4x A100Gigabyte G242-P31A100 PCIe-80GBCOCO
RNN-T87,992 queries/sec25.47 queries/sec/watt8x A100DGX A100A100 SXM-80GBLibriSpeech
74,990 queries/sec26.37 queries/sec/watt8x A100Gigabyte G492-PD0A100 SXM-80GBLibriSpeech
43,388 queries/sec33.53 queries/sec/watt4x A100DGX Station A100A100 SXM-80GBLibriSpeech
74,990 queries/sec34.09 queries/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBLibriSpeech
37,489 queries/sec32.89 queries/sec/watt4x A100Gigabyte G242-P31A100 PCIe-80GBLibriSpeech
BERT21,492 queries/sec6.36 queries/sec/watt8x A100DGX A100A100 SXM-80GBSQuAD v1.1
20,992 queries/sec6.47 queries/sec/watt8x A100Gigabyte G492-PD0A100 SXM-80GBSQuAD v1.1
10,195 queries/sec8.01 queries/sec/watt4x A100DGX Station A100A100 SXM-80GBSQuAD v1.1
17,292 queries/sec8.09 queries/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBSQuAD v1.1
9,995 queries/sec7.99 queries/sec/watt4x A100Gigabyte G242-P31A100 PCIe-80GBSQuAD v1.1
DLRM2,001,990 queries/sec593.66 queries/sec/watt8x A100DGX A100A100 SXM-80GBCriteo 1TB Click Logs
1,831,680 queries/sec651.00 queries/sec/watt8x A100Gigabyte G492-PD0A100 SXM-80GBCriteo 1TB Click Logs
870,363 queries/sec649.49 queries/sec/watt4x A100DGX Station A100A100 SXM-80GBCriteo 1TB Click Logs
750,272 queries/sec358.95 queries/sec/watt8x A100Gigabyte G482-Z54A100 PCIe-80GBCriteo 1TB Click Logs
500,121 queries/sec408.95 queries/sec/watt4x A100Gigabyte G242-P31A100 PCIe-80GBCriteo 1TB Click Logs

MLPerf™ v2.0 Inference Closed: ResNet-50 v1.5, SSD ResNet-34, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net 99% of FP32 accuracy target, DLRM 99% of FP32 accuracy target: 2.0-073, 2.0-075, 2.0-077, 2.0-078, 2.0-080, 2.0-081, 2.0-083, 2.0-084, 2.0-090, 2.0-094, 2.0-095, 2.0-097, 2.0-098. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
BERT-Large sequence length = 384.
DLRM samples/sec refers to an average of 270 user-item pairs per sample
1x1g.6gb and 1x1g.10gb are notations used to refer to the MIG configuration. In these examples, the workload runs on a single MIG slice with 6GB of memory on an A30 or 10GB of memory on an A100, respectively.
For MLPerf™ various scenario data, click here
For MLPerf™ latency constraints, click here
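Since each DLRM sample averages 270 user-item pairs, samples/sec can be converted to pairs/sec. A sketch using the 8x A100 DGX A100 offline row (illustrative arithmetic only):

```python
# Convert DLRM samples/sec to user-item pairs/sec, using the stated
# average of 270 pairs per sample (8x A100 DGX A100 offline row above).
PAIRS_PER_SAMPLE = 270
samples_per_sec = 2_499_040

pairs_per_sec = samples_per_sec * PAIRS_PER_SAMPLE
print(f"{pairs_per_sec:,} pairs/sec")  # ~675 million pairs/sec
```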

NVIDIA Triton Inference Server Delivered Comparable Performance to Custom Harness in MLPerf v2.0


NVIDIA landed top performance spots on all MLPerf™ Inference 2.0 tests, the AI industry’s leading benchmark competition. For inference submissions, we have typically used a custom A100 inference serving harness, designed and optimized specifically to deliver the highest possible inference performance for MLPerf™ workloads, which require running inference on bare metal.

MLPerf™ v2.0 A100 Inference Closed: ResNet-50 v1.5, SSD ResNet-34, BERT 99% of FP32 accuracy target, DLRM 99% of FP32 accuracy target: 2.0-094, 2.0-096. MLPerf name and logo are trademarks. See www.mlcommons.org for more information.​

 

NVIDIA Client Batch Size 1 and 2 Performance with Triton Inference Server

A100 Triton Inference Server Performance

NetworkAcceleratorTraining FrameworkFramework BackendPrecisionModel Instances on TritonClient Batch SizeDynamic Batch Size (Triton)Number of Concurrent Client RequestsLatency (ms)ThroughputSequence/Input LengthTriton Container Version
BERT Large InferenceA100-SXM4-80GBTensorFlowTensorRTMixed4112439.31611 inf/sec38422.05-py3
BERT Large InferenceA100-SXM4-80GBTensorFlowTensorRTMixed2211642.69750 inf/sec38422.05-py3
BERT Large InferenceA100-PCIE-40GBTensorFlowTensorRTMixed4112444.77536 inf/sec38422.05-py3
BERT Large InferenceA100-PCIE-40GBTensorFlowTensorRTMixed1212482.05585 inf/sec38422.05-py3
BERT Base InferenceA100-SXM4-80GBTensorFlowTensorRTMixed411247.553,178 inf/sec12822.05-py3
BERT Base InferenceA100-SXM4-40GBTensorFlowTensorRTMixed121207.945,039 inf/sec12822.05-py3
BERT Base InferenceA100-PCIE-40GBTensorFlowTensorRTMixed4112473,427 inf/sec12822.05-py3
BERT Base InferenceA100-PCIE-40GBTensorFlowTensorRTMixed121208.344,793 inf/sec12822.05-py3
DLRM InferenceA100-SXM4-40GBPyTorchTensorRTMixed2165,536262.0712,560 inf/sec-22.05-py3
DLRM InferenceA100-SXM4-40GBPyTorchTensorRTMixed2265,536282.2524,919 inf/sec-22.05-py3
DLRM InferenceA100-PCIE-40GBPyTorchTensorRTMixed4165,536302.3712,672 inf/sec-22.05-py3
DLRM InferenceA100-PCIE-40GBPyTorchTensorRTMixed2265,536282.1426,206 inf/sec-22.05-py3
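In a closed-loop Triton client, steady-state throughput is roughly concurrency x client batch size / latency (a Little's-law style relation). The first A100 BERT Large row above (batch 1, 24 concurrent requests, 39.31 ms) lines up with its reported 611 inf/sec:

```python
# Sanity check for closed-loop clients:
# throughput ~= concurrent_requests * client_batch_size / latency.
# Numbers from the first A100 BERT Large Triton row above.
concurrency = 24
client_batch = 1
latency_s = 39.31e-3  # 39.31 ms

throughput = concurrency * client_batch / latency_s
print(round(throughput))  # ~611 inf/sec, matching the table
```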

A30 Triton Inference Server Performance

NetworkAcceleratorTraining FrameworkFramework BackendPrecisionModel Instances on TritonClient Batch SizeDynamic Batch Size (Triton)Number of Concurrent Client RequestsLatency (ms)ThroughputSequence/Input LengthTriton Container Version
BERT Large InferenceA30TensorFlowTensorRTMixed4112474.63321 inf/sec38422.05-py3
BERT Large InferenceA30TensorFlowTensorRTMixed2211691.98348 inf/sec38422.05-py3
BERT Base InferenceA30TensorFlowTensorRTMixed4112410.072,382 inf/sec12822.05-py3
BERT Base InferenceA30TensorFlowTensorRTMixed1212415.723,053 inf/sec12822.05-py3

A10 Triton Inference Server Performance

NetworkAcceleratorTraining FrameworkFramework BackendPrecisionModel Instances on TritonClient Batch SizeDynamic Batch Size (Triton)Number of Concurrent Client RequestsLatency (ms)ThroughputSequence/Input LengthTriton Container Version
BERT Large InferenceA10TensorFlowTensorRTMixed41124111.82214 inf/sec38422.05-py3
BERT Large InferenceA10TensorFlowTensorRTMixed22116141.66226 inf/sec38422.05-py3
BERT Base InferenceA10TensorFlowTensorRTMixed4112413.161,824 inf/sec12822.05-py3
BERT Base InferenceA10TensorFlowTensorRTMixed2211614.012,285 inf/sec12822.05-py3

T4 Triton Inference Server Performance

NetworkAcceleratorTraining FrameworkFramework BackendPrecisionModel Instances on TritonClient Batch SizeDynamic Batch Size (Triton)Number of Concurrent Client RequestsLatency (ms)ThroughputSequence/Input LengthTriton Container Version
BERT Large InferenceNVIDIA T4TensorFlowTensorRTMixed4112485283 inf/sec38422.03-py3
BERT Large InferenceNVIDIA T4TensorFlowTensorRTMixed2212484.47568 inf/sec38422.03-py3
BERT Base InferenceNVIDIA T4TensorFlowTensorRTMixed4112427.95859 inf/sec12822.05-py3
BERT Base InferenceNVIDIA T4TensorFlowTensorRTMixed1212450.44952 inf/sec12822.05-py3


V100 Triton Inference Server Performance

NetworkAcceleratorTraining FrameworkFramework BackendPrecisionModel Instances on TritonClient Batch SizeDynamic Batch Size (Triton)Number of Concurrent Client RequestsLatency (ms)ThroughputSequence/Input LengthTriton Container Version
BERT Large InferenceV100 SXM2-32GBTensorFlowTensorRTMixed41124105.96227 inf/sec38422.03-py3
BERT Large InferenceV100 SXM2-32GBTensorFlowTensorRTMixed22116125.94254 inf/sec38422.03-py3
BERT Base InferenceV100 SXM2-32GBTensorFlowTensorRTMixed4112417.601,363 inf/sec12822.03-py3
BERT Base InferenceV100 SXM2-32GBTensorFlowTensorRTMixed2211614.832,158 inf/sec12822.03-py3
DLRM InferenceV100-SXM2-32GBPyTorchTensorRTMixed2165,536304.157,228 inf/sec-22.03-py3
DLRM InferenceV100-SXM2-32GBPyTorchTensorRTMixed2265,536304.1114,599 inf/sec-22.03-py3

Inference Performance of NVIDIA A100, A40, A30, A10, A2, T4 and V100

Benchmarks are reproducible by following links to the NGC catalog scripts

Inference Natural Language Processing

BERT Inference Throughput

DGX A100 server w/ 1x NVIDIA A100 with 7 MIG instances of 1g.5gb | Batch Size = 94 | Precision: INT8 | Sequence Length = 128
DGX-1 server w/ 1x NVIDIA V100 | TensorRT 7.1 | Batch Size = 256 | Precision: Mixed | Sequence Length = 128

 

NVIDIA A100 BERT Inference Benchmarks

NetworkNetwork TypeBatch SizeThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
BERT-Large with SparsityAttention946,188 sequences/sec--1x A100DGX A100-INT8SQuAD v1.1-A100 SXM4-40GB

A100 with 7 MIG instances of 1g.5gb | Sequence length=128 | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
Starting from 21.09-py3, ECC is enabled

Inference Image Classification on CNNs with TensorRT

ResNet-50 v1.5 Throughput

DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-4029GP-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: Mixed | Dataset: Synthetic

 
 

ResNet-50 v1.5 Power Efficiency

DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-4029GP-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: Mixed | Dataset: Synthetic

 

A100 Full Chip Inference Performance

NetworkBatch SizeFull Chip ThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-50811,677 images/sec58 images/sec/watt0.691x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100-SXM4-80GB
12830,532 images/sec80 images/sec/watt4.191x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100-SXM4-80GB
ResNet-50v1.5811,434 images/sec57 images/sec/watt0.71x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100-SXM4-80GB
12829,533 images/sec78 images/sec/watt4.331x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100-SXM4-80GB
BERT-BASE (for Batch Sizes 1 and 2, please refer to the Triton Inference Server page)
87,435 sequences/sec30 sequences/sec/watt1.081x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100-SXM4-80GB
12814,979 sequences/sec38 sequences/sec/watt8.551x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100-SXM4-80GB
BERT-LARGE (for Batch Sizes 1 and 2, please refer to the Triton Inference Server page)
82,675 sequences/sec3 sequences/sec/watt7.591x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100-SXM4-80GB
1284,806 sequences/sec3 sequences/sec/watt92.041x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100-SXM4-80GB
EfficientNet-B088,910 images/sec58 images/sec/watt0.91x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100-SXM4-80GB
12828,897 images/sec89 images/sec/watt4.431x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100-SXM4-80GB
EfficientNet-B482,498 images/sec11 images/sec/watt3.21x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100-SXM4-80GB
1284,413 images/sec12 images/sec/watt29.011x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100-SXM4-80GB

Containers with a hyphen indicate a pre-release container | Servers with a hyphen indicate a pre-production server
BERT-Large: Sequence Length = 128
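At large batch sizes these pipelines are throughput-bound, so reported latency is close to batch size divided by throughput. Checking the batch-128 ResNet-50 row above (30,532 images/sec, 4.19 ms) as an illustration:

```python
# At large batch, latency ~= batch_size / throughput.
# Numbers from the batch-128 ResNet-50 A100 full-chip row above.
batch = 128
throughput = 30532  # images/sec

latency_ms = batch / throughput * 1000
print(f"{latency_ms:.2f} ms")  # ~4.19 ms, matching the table
```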

A100 1/7 MIG Inference Performance

NetworkBatch Size1/7 MIG ThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-5083,692 images/sec33 images/sec/watt2.171x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100 SXM4-80GB
1284,596 images/sec38 images/sec/watt27.851x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100 SXM4-80GB
ResNet-50v1.583,583 images/sec32 images/sec/watt2.231x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100 SXM4-80GB
1284,443 images/sec37 images/sec/watt28.811x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100 SXM4-80GB
BERT-BASE81,806 sequences/sec14 sequences/sec/watt4.431x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100 SXM4-80GB
1282,174 sequences/sec17 sequences/sec/watt58.891x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100 SXM4-80GB
BERT-LARGE8584 sequences/sec5 sequences/sec/watt13.691x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100 SXM4-80GB
128673 sequences/sec5 sequences/sec/watt190.161x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100 SXM4-80GB

Containers with a hyphen indicate a pre-release container | Servers with a hyphen indicate a pre-production server
BERT-Large: Sequence Length = 128

A100 7 MIG Inference Performance

NetworkBatch Size7 MIG ThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-50825,485 images/sec80 images/sec/watt2.221x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100 SXM4-80GB
12831,964 images/sec84 images/sec/watt28.051x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100 SXM4-80GB
ResNet-50v1.5824,788 images/sec78 images/sec/watt2.261x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100 SXM4-80GB
12830,899 images/sec81 images/sec/watt29.041x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100 SXM4-80GB
BERT-BASE812,522 sequences/sec33 sequences/sec/watt4.491x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100 SXM4-80GB
12814,478 sequences/sec37 sequences/sec/watt61.941x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100 SXM4-80GB
BERT-LARGE83,980 sequences/sec10 sequences/sec/watt14.11x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100 SXM4-80GB
1284,434 sequences/sec11 sequences/sec/watt202.321x A100DGX A10022.05-py3INT8SyntheticTensorRT 8.2.3A100 SXM4-80GB

Containers with a hyphen indicate a pre-release container | Servers with a hyphen indicate a pre-production server
BERT-Large: Sequence Length = 128
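MIG slices scale nearly linearly: the 7-MIG aggregate is close to seven times the single-slice (1/7 MIG) figure. Comparing the batch-128 ResNet-50 entries from the two A100 MIG tables above (illustrative arithmetic only):

```python
# Compare 7-MIG aggregate throughput to 7x the 1/7-MIG figure for
# ResNet-50 at batch 128 (numbers from the A100 MIG tables above).
single_slice = 4596    # images/sec on one 1/7 MIG slice
all_slices = 31964     # images/sec across all 7 MIG slices

efficiency = all_slices / (7 * single_slice)
print(f"{efficiency:.1%} scaling efficiency")  # ~99%
```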

 

A40 Inference Performance

NetworkBatch SizeThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-50810,044 images/sec41 images/sec/watt0.81x A40GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A40
12816,070 images/sec54 images/sec/watt7.961x A40GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A40
ResNet-50v1.589,718 images/sec38 images/sec/watt0.821x A40GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A40
12815,358 images/sec51 images/sec/watt8.331x A40GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A40
BERT-BASE85,322 sequences/sec18 sequences/sec/watt1.51x A40GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A40
1287,528 sequences/sec25 sequences/sec/watt171x A40GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A40
BERT-LARGE81,722 sequences/sec2 sequences/sec/watt4.621x A40GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A40
1282,242 sequences/sec2 sequences/sec/watt57.11x A40GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A40
EfficientNet-B088,047 images/sec42 images/sec/watt0.991x A40GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A40
12816,445 images/sec55 images/sec/watt7.781x A40GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A40
EfficientNet-B481,878 images/sec7 images/sec/watt4.261x A40GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A40
1282,533 images/sec8 images/sec/watt50.541x A40GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A40

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container

 

A30 Inference Performance

NetworkBatch SizeThroughputEfficiencyLatency (ms)GPUServerContainerPrecisionDatasetFrameworkGPU Version
ResNet-5087,003 images/sec46 images/sec/watt1.141x A30GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A30
12815,984 images/sec96 images/sec/watt8.011x A30GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A30
ResNet-50v1.587,406 images/sec50 images/sec/watt1.081x A30GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A30
12815,413 images/sec94 images/sec/watt8.31x A30GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A30
BERT-BASE (for Batch Sizes 1 and 2, please refer to the Triton Inference Server page)
84,981 sequences/sec32 sequences/sec/watt1.611x A30GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A30
1287,108 sequences/sec43 sequences/sec/watt18.011x A30GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A30
BERT-LARGE (for Batch Sizes 1 and 2, please refer to the Triton Inference Server page)
81,696 sequences/sec4 sequences/sec/watt4.721x A30GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A30
1282,257 sequences/sec4 sequences/sec/watt56.71x A30GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A30
EfficientNet-B087,126 images/sec71 images/sec/watt1.121x A30GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A30
12816,143 images/sec99 images/sec/watt7.931x A30GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A30
EfficientNet-B481,620 images/sec12 images/sec/watt4.941x A30GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A30
1282,237 images/sec14 images/sec/watt57.231x A30GIGABYTE G482-Z52-0022.05-py3INT8SyntheticTensorRT 8.2.3A30

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Container versions with a hyphen indicate a pre-release container
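For these single-GPU TensorRT runs, throughput and latency are two views of the same measurement: one batch completes per latency window, so throughput is roughly batch size divided by latency. The sketch below sanity-checks that relationship against the A30 ResNet-50 rows above; it is an illustration of how the columns relate, not part of the benchmark harness.

```python
# Sanity-check: for a single-stream TensorRT benchmark, throughput is
# approximately batch_size / latency. The rows below are the A30 ResNet-50
# entries from the table above (latency in ms).
def implied_throughput(batch_size: int, latency_ms: float) -> float:
    """Images per second implied by one batch completing per latency window."""
    return batch_size / (latency_ms / 1000.0)

rows = [
    (8, 1.14, 7_003),     # batch 8: reported 7,003 images/sec
    (128, 8.01, 15_984),  # batch 128: reported 15,984 images/sec
]
for batch, latency_ms, reported in rows:
    implied = implied_throughput(batch, latency_ms)
    # Implied and reported throughput agree to within a fraction of a percent.
    print(f"batch {batch}: implied {implied:,.0f} vs reported {reported:,}")
```

The small residual difference comes from rounding in the published latency figures.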

 

A30 1/4 MIG Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 8 | 3,579 images/sec | 43 images/sec/watt | 2.24 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| ResNet-50 | 128 | 4,526 images/sec | 50 images/sec/watt | 28.28 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| ResNet-50v1.5 | 8 | 3,467 images/sec | 42 images/sec/watt | 2.31 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| ResNet-50v1.5 | 128 | 4,389 images/sec | 48 images/sec/watt | 29.17 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| BERT-BASE | 8 | 1,802 sequences/sec | 20 sequences/sec/watt | 4.44 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| BERT-BASE | 128 | 2,151 sequences/sec | 21 sequences/sec/watt | 59.51 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| BERT-LARGE | 8 | 561 sequences/sec | 6 sequences/sec/watt | 14.27 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| BERT-LARGE | 128 | 675 sequences/sec | 7 sequences/sec/watt | 189.71 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Container versions with a hyphen indicate a pre-release container

 

A30 4 MIG Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 8 | 13,856 images/sec | 84 images/sec/watt | 2.33 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| ResNet-50 | 128 | 16,996 images/sec | 104 images/sec/watt | 30.24 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| ResNet-50v1.5 | 8 | 13,543 images/sec | 82 images/sec/watt | 2.36 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| ResNet-50v1.5 | 128 | 16,333 images/sec | 100 images/sec/watt | 31.49 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| BERT-BASE | 8 | 6,524 sequences/sec | 40 sequences/sec/watt | 5.01 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| BERT-BASE | 128 | 7,397 sequences/sec | 45 sequences/sec/watt | 70.34 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| BERT-LARGE | 8 | 1,998 sequences/sec | 12 sequences/sec/watt | 16.25 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| BERT-LARGE | 128 | 2,312 sequences/sec | 14 sequences/sec/watt | 223.41 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Container versions with a hyphen indicate a pre-release container
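The MIG tables above can be read against the full-GPU A30 table to see what partitioning costs or buys. A small sketch, under the assumption that the "4 MIG" rows report combined throughput across all four concurrently running instances (ResNet-50, batch 128, INT8 rows):

```python
# Compare aggregate throughput when the A30 is split into four MIG slices
# against the unpartitioned GPU (ResNet-50, batch 128, INT8 rows above).
# Assumption: the "4 MIG" row reports combined throughput of all four slices.
full_gpu = 15_984        # images/sec, A30 unpartitioned
four_mig = 16_996        # images/sec, A30 split into 4 concurrent MIG instances
one_quarter_mig = 4_526  # images/sec, a single 1/4 MIG slice in isolation

# Running four slices concurrently slightly beats the unpartitioned GPU...
print(f"4 MIG vs full GPU: {four_mig / full_gpu:.2f}x")
# ...but each slice loses a little versus running alone, due to shared
# memory bandwidth and power budget.
print(f"scaling across slices: {four_mig / (4 * one_quarter_mig):.2f}")
```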

 

A10 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 8 | 7,938 images/sec | 53 images/sec/watt | 1.01 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 |
| ResNet-50 | 128 | 11,532 images/sec | 77 images/sec/watt | 11.1 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 |
| ResNet-50v1.5 | 8 | 7,685 images/sec | 51 images/sec/watt | 1.04 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 |
| ResNet-50v1.5 | 128 | 10,630 images/sec | 71 images/sec/watt | 12.04 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 |
| BERT-BASE | 1, 2 | For batch sizes 1 and 2, please refer to the Triton Inference Server page | | | | | | | | | |
| BERT-BASE | 8 | 3,992 sequences/sec | 27 sequences/sec/watt | 2 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 |
| BERT-BASE | 128 | 4,882 sequences/sec | 33 sequences/sec/watt | 26.22 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 |
| BERT-LARGE | 1, 2 | For batch sizes 1 and 2, please refer to the Triton Inference Server page | | | | | | | | | |
| BERT-LARGE | 8 | 1,264 sequences/sec | 3 sequences/sec/watt | 6.33 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 |
| BERT-LARGE | 128 | 1,472 sequences/sec | 3 sequences/sec/watt | 86.94 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 |
| EfficientNet-B0 | 8 | 7,057 images/sec | 47 images/sec/watt | 1.13 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 |
| EfficientNet-B0 | 128 | 11,793 images/sec | 79 images/sec/watt | 10.85 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 |
| EfficientNet-B4 | 8 | 1,454 images/sec | 10 images/sec/watt | 5.5 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 |
| EfficientNet-B4 | 128 | 1,761 images/sec | 12 images/sec/watt | 72.68 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 |

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Container versions with a hyphen indicate a pre-release container

 

A2 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 8 | 2,596 images/sec | 43 images/sec/watt | 3.08 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
| ResNet-50 | 128 | 3,008 images/sec | 50 images/sec/watt | 42.55 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
| ResNet-50v1.5 | 8 | 2,512 images/sec | 42 images/sec/watt | 3.18 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
| ResNet-50v1.5 | 128 | 2,888 images/sec | 48 images/sec/watt | 44.32 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
| BERT-BASE | 8 | 1,055 sequences/sec | 18 sequences/sec/watt | 7.59 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
| BERT-BASE | 128 | 1,105 sequences/sec | 18 sequences/sec/watt | 115.8 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
| BERT-LARGE | 8 | 313 sequences/sec | 2 sequences/sec/watt | 25.55 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
| BERT-LARGE | 128 | 334 sequences/sec | 2 sequences/sec/watt | 382.85 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
| EfficientNet-B0 | 8 | 2,501 images/sec | 51 images/sec/watt | 3.2 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
| EfficientNet-B0 | 128 | 3,175 images/sec | 54 images/sec/watt | 40.32 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
| EfficientNet-B4 | 8 | 438 images/sec | 7 images/sec/watt | 18.29 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
| EfficientNet-B4 | 128 | 481 images/sec | 8 images/sec/watt | 266.12 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
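Since the Efficiency column is defined as throughput divided by board power, an approximate board power can be recovered from any row. The sketch below does this for the A2 ResNet-50 batch-128 row above; because the published efficiency figures are rounded to whole numbers, the result is only a rough estimate.

```python
# Efficiency in these tables is throughput divided by board power, so an
# approximate board power can be recovered from each row. Efficiency values
# are rounded to whole numbers, so the result is a rough estimate only.
def implied_board_power(throughput: float, efficiency: float) -> float:
    """Watts implied by throughput (items/sec) and efficiency (items/sec/watt)."""
    return throughput / efficiency

# A2 ResNet-50, batch 128: 3,008 images/sec at 50 images/sec/watt
print(f"{implied_board_power(3_008, 50):.0f} W")  # ~60 W
```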

 

T4 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 8 | 3,787 images/sec | 54 images/sec/watt | 2.11 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |
| ResNet-50 | 128 | 4,707 images/sec | 67 images/sec/watt | 27.19 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |
| ResNet-50v1.5 | 8 | 3,576 images/sec | 51 images/sec/watt | 2.24 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |
| ResNet-50v1.5 | 128 | 4,426 images/sec | 63 images/sec/watt | 28.92 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |
| BERT-BASE | 1, 2 | For batch sizes 1 and 2, please refer to the Triton Inference Server page | | | | | | | | | |
| BERT-BASE | 8 | 1,540 sequences/sec | 22 sequences/sec/watt | 5.19 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |
| BERT-BASE | 128 | 1,792 sequences/sec | 26 sequences/sec/watt | 71.45 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |
| BERT-LARGE | 1, 2 | For batch sizes 1 and 2, please refer to the Triton Inference Server page | | | | | | | | | |
| BERT-LARGE | 8 | 555 sequences/sec | 2 sequences/sec/watt | 14.41 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |
| BERT-LARGE | 128 | 535 sequences/sec | 2 sequences/sec/watt | 239.13 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |
| EfficientNet-B0 | 8 | 4,479 images/sec | 64 images/sec/watt | 1.79 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |
| EfficientNet-B0 | 128 | 5,944 images/sec | 86 images/sec/watt | 21.53 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |
| EfficientNet-B4 | 8 | 732 images/sec | 10 images/sec/watt | 10.93 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |
| EfficientNet-B4 | 128 | 789 images/sec | 11 images/sec/watt | 162.21 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Container versions with a hyphen indicate a pre-release container



V100 Inference Performance

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 8 | 4,276 images/sec | 16 images/sec/watt | 1.87 | 1x V100 | DGX-2 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB |
| ResNet-50 | 128 | 7,916 images/sec | 23 images/sec/watt | 16.17 | 1x V100 | DGX-2 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB |
| ResNet-50v1.5 | 8 | 4,199 images/sec | 15 images/sec/watt | 1.91 | 1x V100 | DGX-2 | 22.05-py3 | Mixed | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB |
| ResNet-50v1.5 | 128 | 7,564 images/sec | 22 images/sec/watt | 16.92 | 1x V100 | DGX-2 | 22.05-py3 | Mixed | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB |
| BERT-BASE | 1, 2 | For batch sizes 1 and 2, please refer to the Triton Inference Server page | | | | | | | | | |
| BERT-BASE | 8 | 1,997 sequences/sec | 6 sequences/sec/watt | 4.01 | 1x V100 | DGX-2 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB |
| BERT-BASE | 128 | 3,150 sequences/sec | 9 sequences/sec/watt | 40.64 | 1x V100 | DGX-2 | 22.05-py3 | Mixed | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB |
| BERT-LARGE | 1, 2 | For batch sizes 1 and 2, please refer to the Triton Inference Server page | | | | | | | | | |
| BERT-LARGE | 8 | 765 sequences/sec | 1 sequences/sec/watt | 10.45 | 1x V100 | DGX-2 | 22.05-py3 | Mixed | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB |
| BERT-LARGE | 128 | 968 sequences/sec | 1 sequences/sec/watt | 132.24 | 1x V100 | DGX-2 | 22.05-py3 | Mixed | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB |
| EfficientNet-B0 | 8 | 4,233 images/sec | 21 images/sec/watt | 1.89 | 1x V100 | DGX-2 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB |
| EfficientNet-B0 | 128 | 8,406 images/sec | 27 images/sec/watt | 15.23 | 1x V100 | DGX-2 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB |
| EfficientNet-B4 | 8 | 835 images/sec | 3 images/sec/watt | 9.58 | 1x V100 | DGX-2 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB |
| EfficientNet-B4 | 128 | 1,172 images/sec | 4 images/sec/watt | 109.2 | 1x V100 | DGX-2 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB |

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Container versions with a hyphen indicate a pre-release container


Inference Performance of NVIDIA GPU on Cloud

Benchmarks can be reproduced by following the links to the NGC catalog scripts.

A100 Inference Performance on Cloud

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 11,535 images/sec | 62 images/sec/watt | 0.69 | 1x A100 | GCP A2-HIGHGPU-1G | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB |
| ResNet-50v1.5 | 128 | 28,303 images/sec | 110 images/sec/watt | 4.52 | 1x A100 | GCP A2-HIGHGPU-1G | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB |
| ResNet-50v1.5 | 8 | 11,204 images/sec | 61 images/sec/watt | 0.71 | 1x A100 | AWS EC2 p4d.24xlarge | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB |
| ResNet-50v1.5 | 128 | 28,291 images/sec | 108 images/sec/watt | 4.52 | 1x A100 | AWS EC2 p4d.24xlarge | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB |
| ResNet-50v1.5 | 8 | 11,371 images/sec | - | 0.7 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB |
| ResNet-50v1.5 | 128 | 29,603 images/sec | - | 4.32 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB |
| BERT-LARGE | 8 | 2,669 sequences/sec | 10 sequences/sec/watt | 3 | 1x A100 | AWS EC2 p4d.24xlarge | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB |
| BERT-LARGE | 128 | 4,906 sequences/sec | 12 sequences/sec/watt | 26.09 | 1x A100 | AWS EC2 p4d.24xlarge | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB |
| BERT-BASE | 8 | 6,763 sequences/sec | 31 sequences/sec/watt | 1.18 | 1x A100 | AWS EC2 p4d.24xlarge | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB |
| BERT-BASE | 128 | 15,199 sequences/sec | 39 sequences/sec/watt | 8.42 | 1x A100 | AWS EC2 p4d.24xlarge | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB |

BERT-Large: Sequence Length = 128

T4 Inference Performance on Cloud

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 3,301 images/sec | 47 images/sec/watt | 2.42 | 1x T4 | GCP N1-HIGHMEM-8 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |


V100 Inference Performance on Cloud

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 4,122 images/sec | 18 images/sec/watt | 1.94 | 1x V100 | GCP N1-HIGHMEM-8 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM2-16GB |
| ResNet-50v1.5 | 128 | 7,343 images/sec | 25 images/sec/watt | 17.43 | 1x V100 | GCP N1-HIGHMEM-8 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM2-16GB |
| ResNet-50v1.5 | 8 | 3,824 images/sec | - | 2.09 | 1x V100 | Azure Standard_NC6s_v3 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM2-16GB |
| ResNet-50v1.5 | 128 | 7,043 images/sec | - | 18.17 | 1x V100 | Azure Standard_NC6s_v3 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM2-16GB |
| BERT-BASE | 8 | 6,763 sequences/sec | 31 sequences/sec/watt | 1.18 | 1x V100 | AWS EC2 p3.2xlarge | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM2-16GB |
| BERT-BASE | 128 | 15,199 sequences/sec | 39 sequences/sec/watt | 8.42 | 1x V100 | AWS EC2 p3.2xlarge | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM2-16GB |

Conversational AI

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.

Related Resources

Download and get started with NVIDIA Riva.


Riva Benchmarks

A100 ASR Benchmarks

A100 Best Streaming Throughput Mode (800 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 10.3 | 1 | A100 SXM4-40GB |
| Citrinet | 256 | 167.4 | 253 | A100 SXM4-40GB |
| Citrinet | 512 | 293.8 | 503 | A100 SXM4-40GB |
| Citrinet | 1024 | 661.8 | 988 | A100 SXM4-40GB |
| Quartznet | 1 | 17.2 | 1 | A100 SXM4-40GB |
| Quartznet | 256 | 142.8 | 254 | A100 SXM4-40GB |
| Quartznet | 512 | 214.2 | 505 | A100 SXM4-40GB |
| Quartznet | 1024 | 377.8 | 998 | A100 SXM4-40GB |
| Jasper | 1 | 20.9 | 1 | A100 SXM4-40GB |
| Jasper | 256 | 173.3 | 254 | A100 SXM4-40GB |
| Jasper | 512 | 286 | 504 | A100 SXM4-40GB |
| Jasper | 1024 | 700.6 | 989 | A100 SXM4-40GB |

A100 Best Streaming Latency Mode (100 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 9.8 | 1 | A100 SXM4-40GB |
| Citrinet | 16 | 26.8 | 16 | A100 SXM4-40GB |
| Citrinet | 128 | 91.1 | 127 | A100 SXM4-40GB |
| Quartznet | 1 | 9.1 | 1 | A100 SXM4-40GB |
| Quartznet | 16 | 17.9 | 16 | A100 SXM4-40GB |
| Quartznet | 128 | 55.5 | 127 | A100 SXM4-40GB |
| Jasper | 1 | 13.5 | 1 | A100 SXM4-40GB |
| Jasper | 16 | 31.5 | 16 | A100 SXM4-40GB |
| Jasper | 128 | 98.5 | 127 | A100 SXM4-40GB |

A100 Offline Mode (3200 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 11.6 | 1 | A100 SXM4-40GB |
| Citrinet | 512 | 366.8 | 503 | A100 SXM4-40GB |
| Citrinet | 1,024 | 680.4 | 989 | A100 SXM4-40GB |
| Citrinet | 1,512 | 981.3 | 1,437 | A100 SXM4-40GB |
| Quartznet | 1 | 34.4 | 1 | A100 SXM4-40GB |
| Quartznet | 512 | 457.5 | 504 | A100 SXM4-40GB |
| Quartznet | 1,024 | 941.9 | 989 | A100 SXM4-40GB |
| Quartznet | 1,512 | 1,592.7 | 1,421 | A100 SXM4-40GB |
| Jasper | 1 | 35.8 | 1 | A100 SXM4-40GB |
| Jasper | 512 | 631.3 | 503 | A100 SXM4-40GB |
| Jasper | 1,024 | 1,495.5 | 977 | A100 SXM4-40GB |
| Jasper | 1,512 | 2,544.2 | 1,395 | A100 SXM4-40GB |

ASR Throughput (RTFX): number of seconds of audio processed per second.
Audio Chunk Size: server-side configuration indicating the amount of new audio data to be considered by the acoustic model.
ASR Dataset: LibriSpeech.
Latency was measured in streaming recognition mode with the BERT-Base punctuation model enabled, a 4-gram language model, a decoder beam width of 128, and timestamps enabled. The client and the server used audio chunks of the same duration (100 ms, 800 ms, or 3,200 ms, depending on the server configuration). The Riva streaming client Riva_streaming_asr_client, provided in the Riva client image, was run with the --simulate_realtime flag to simulate transcription from a microphone; each stream performed 5 iterations over a sample audio file from the LibriSpeech dataset (1272-135031-0000.wav).
Riva version: v1.10.0-b1.
Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA A30, NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz.
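The RTFX metric in these tables is total seconds of audio processed per second of wall-clock time, aggregated over all parallel streams. A minimal sketch of that bookkeeping (the stream counts and durations below are illustrative, not taken from the benchmark runs):

```python
# RTFX: total seconds of audio processed per second of wall-clock time,
# summed over all parallel streams.
def rtfx(num_streams: int, audio_seconds_per_stream: float,
         wall_clock_seconds: float) -> float:
    """Real-time factor aggregated across streams."""
    return num_streams * audio_seconds_per_stream / wall_clock_seconds

# With --simulate_realtime, each stream feeds audio at real-time pace, so
# 256 streams each processing 60 s of audio over ~60 s of wall time give an
# RTFX of ~256 -- which is why RTFX tracks the stream count in the tables.
print(rtfx(256, 60.0, 60.0))  # 256.0
```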

A30 ASR Benchmarks

A30 Best Streaming Throughput Mode (800 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 15.3 | 1 | A30 |
| Citrinet | 256 | 262.4 | 253 | A30 |
| Citrinet | 512 | 494.5 | 500 | A30 |
| Citrinet | 1024 | 12,500 | 690 | A30 |
| Quartznet | 1 | 19.7 | 1 | A30 |
| Quartznet | 256 | 177.9 | 254 | A30 |
| Quartznet | 512 | 293.7 | 504 | A30 |
| Quartznet | 1024 | 654.7 | 987 | A30 |
| Jasper | 1 | 22.3 | 1 | A30 |
| Jasper | 256 | 252.1 | 253 | A30 |
| Jasper | 512 | 454.1 | 502 | A30 |
| Jasper | 1024 | 7,770.8 | 722 | A30 |

A30 Best Streaming Latency Mode (100 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 14.1 | 1 | A30 |
| Citrinet | 16 | 44.9 | 16 | A30 |
| Citrinet | 128 | 177.6 | 127 | A30 |
| Quartznet | 1 | 10.2 | 1 | A30 |
| Quartznet | 16 | 25.8 | 16 | A30 |
| Quartznet | 128 | 65.5 | 127 | A30 |
| Jasper | 1 | 15.1 | 1 | A30 |
| Jasper | 16 | 40.3 | 16 | A30 |
| Jasper | 128 | 2,663.2 | 120 | A30 |

A30 Offline Mode (3200 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 16.9 | 1 | A30 |
| Citrinet | 512 | 574.1 | 501 | A30 |
| Citrinet | 1,024 | 1,166.1 | 979 | A30 |
| Citrinet | 1,512 | 8,992.4 | 1,108 | A30 |
| Quartznet | 1 | 41.4 | 1 | A30 |
| Quartznet | 512 | 696.4 | 502 | A30 |
| Quartznet | 1,024 | 1,536.5 | 974 | A30 |
| Quartznet | 1,512 | 2,712.4 | 1,392 | A30 |
| Jasper | 1 | 40.3 | 1 | A30 |
| Jasper | 512 | 1,149.5 | 498 | A30 |
| Jasper | 1,024 | 2,981.7 | 948 | A30 |
| Jasper | 1,512 | 18,136 | 978 | A30 |


V100 ASR Benchmarks

V100 Best Streaming Throughput Mode (800 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 12.5 | 1 | V100 SXM2-16GB |
| Citrinet | 256 | 284.5 | 253 | V100 SXM2-16GB |
| Citrinet | 512 | 553.9 | 499 | V100 SXM2-16GB |
| Citrinet | 768 | 4,443.9 | 650 | V100 SXM2-16GB |
| Quartznet | 1 | 13.7 | 1 | V100 SXM2-16GB |
| Quartznet | 256 | 196.4 | 254 | V100 SXM2-16GB |
| Quartznet | 512 | 308.1 | 502 | V100 SXM2-16GB |
| Quartznet | 768 | 458.1 | 748 | V100 SXM2-16GB |
| Jasper | 1 | 23.6 | 1 | V100 SXM2-16GB |
| Jasper | 128 | 191.7 | 127 | V100 SXM2-16GB |
| Jasper | 256 | 336.6 | 253 | V100 SXM2-16GB |
| Jasper | 512 | 937.1 | 497 | V100 SXM2-16GB |

V100 Best Streaming Latency Mode (100 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 11.2 | 1 | V100 SXM2-16GB |
| Citrinet | 16 | 34.9 | 16 | V100 SXM2-16GB |
| Citrinet | 128 | 213.3 | 127 | V100 SXM2-16GB |
| Quartznet | 1 | 8.3 | 1 | V100 SXM2-16GB |
| Quartznet | 16 | 17.7 | 16 | V100 SXM2-16GB |
| Quartznet | 128 | 183.1 | 127 | V100 SXM2-16GB |
| Jasper | 1 | 19.3 | 1 | V100 SXM2-16GB |
| Jasper | 16 | 38.9 | 16 | V100 SXM2-16GB |
| Jasper | 64 | 123.1 | 64 | V100 SXM2-16GB |

V100 Offline Mode (3200 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 12.7 | 1 | V100 SXM2-16GB |
| Citrinet | 256 | 350.9 | 252 | V100 SXM2-16GB |
| Citrinet | 512 | 627.1 | 499 | V100 SXM2-16GB |
| Citrinet | 768 | 936.7 | 738 | V100 SXM2-16GB |
| Citrinet | 1,024 | 1,500.5 | 972 | V100 SXM2-16GB |
| Quartznet | 1 | 29.5 | 1 | V100 SXM2-16GB |
| Quartznet | 256 | 365.4 | 253 | V100 SXM2-16GB |
| Quartznet | 512 | 669.9 | 501 | V100 SXM2-16GB |
| Quartznet | 768 | 1,199.8 | 737 | V100 SXM2-16GB |
| Quartznet | 1,024 | 1,662.2 | 965 | V100 SXM2-16GB |
| Jasper | 1 | 35.3 | 1 | V100 SXM2-16GB |
| Jasper | 256 | 740.5 | 251 | V100 SXM2-16GB |
| Jasper | 512 | 1,757.2 | 489 | V100 SXM2-16GB |
| Jasper | 768 | 3,138.9 | 711 | V100 SXM2-16GB |




T4 ASR Benchmarks

T4 Best Streaming Throughput Mode (800 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 26 | 1 | NVIDIA T4 |
| Citrinet | 64 | 178.7 | 64 | NVIDIA T4 |
| Citrinet | 128 | 300.4 | 127 | NVIDIA T4 |
| Citrinet | 256 | 710.4 | 249 | NVIDIA T4 |
| Citrinet | 384 | 8,847.0 | 290 | NVIDIA T4 |
| Quartznet | 1 | 28.4 | 1 | NVIDIA T4 |
| Quartznet | 64 | 144.1 | 64 | NVIDIA T4 |
| Quartznet | 128 | 190.3 | 127 | NVIDIA T4 |
| Quartznet | 256 | 296.5 | 252 | NVIDIA T4 |
| Quartznet | 384 | 422.7 | 376 | NVIDIA T4 |
| Jasper | 1 | 74.8 | 1 | NVIDIA T4 |
| Jasper | 64 | 218.8 | 64 | NVIDIA T4 |
| Jasper | 128 | 359.5 | 126 | NVIDIA T4 |
| Jasper | 256 | 1,030.6 | 249 | NVIDIA T4 |

T4 Best Streaming Latency Mode (100 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 22.6 | 1 | NVIDIA T4 |
| Citrinet | 16 | 66.1 | 16 | NVIDIA T4 |
| Citrinet | 64 | 1,803.7 | 62 | NVIDIA T4 |
| Quartznet | 1 | 16.1 | 1 | NVIDIA T4 |
| Quartznet | 16 | 40.7 | 16 | NVIDIA T4 |
| Quartznet | 64 | 104.5 | 64 | NVIDIA T4 |
| Jasper | 1 | 46.6 | 1 | NVIDIA T4 |
| Jasper | 8 | 47.4 | 8 | NVIDIA T4 |
| Jasper | 16 | 72 | 16 | NVIDIA T4 |

T4 Offline Mode (3200 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 28.3 | 1 | NVIDIA T4 |
| Citrinet | 256 | 709.2 | 250 | NVIDIA T4 |
| Citrinet | 512 | 3,510.8 | 449 | NVIDIA T4 |
| Quartznet | 1 | 54.2 | 1 | NVIDIA T4 |
| Quartznet | 256 | 770.9 | 251 | NVIDIA T4 |
| Quartznet | 512 | 1,685.9 | 486 | NVIDIA T4 |
| Jasper | 1 | 96.7 | 1 | NVIDIA T4 |
| Jasper | 256 | 1,888.4 | 245 | NVIDIA T4 |


A100 TTS Benchmarks

| Model | # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| FastPitch + Hifi-GAN | 1 | 0.03 | 0.003 | 133 | A100 SXM4-40GB |
| FastPitch + Hifi-GAN | 4 | 0.04 | 0.006 | 340 | A100 SXM4-40GB |
| FastPitch + Hifi-GAN | 6 | 0.06 | 0.007 | 390 | A100 SXM4-40GB |
| FastPitch + Hifi-GAN | 8 | 0.07 | 0.009 | 443 | A100 SXM4-40GB |
| FastPitch + Hifi-GAN | 10 | 0.07 | 0.009 | 464 | A100 SXM4-40GB |
| Tacotron 2 + WaveGlow | 1 | 0.05 | 0.02 | 34 | A100 SXM4-40GB |
| Tacotron 2 + WaveGlow | 4 | 0.26 | 0.03 | 59 | A100 SXM4-40GB |
| Tacotron 2 + WaveGlow | 6 | 0.38 | 0.03 | 66 | A100 SXM4-40GB |
| Tacotron 2 + WaveGlow | 8 | 0.51 | 0.04 | 70 | A100 SXM4-40GB |
| Tacotron 2 + WaveGlow | 10 | 0.61 | 0.04 | 73 | A100 SXM4-40GB |

TTS Throughput (RTFX): number of seconds of audio generated per second.
Dataset: LJSpeech.
Performance of the Riva text-to-speech (TTS) service was measured for different numbers of parallel streams. Each parallel stream performed 10 iterations over 10 input strings from the LJSpeech dataset. Latency to first audio chunk, latency between successive audio chunks, and throughput were measured.
Riva version: v1.10.0-b1.
Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA A30, NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz.
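Since TTS RTFX is seconds of audio generated per second of compute, the table can be used to estimate end-to-end synthesis time for an utterance. A rough sketch using the single-stream FastPitch + Hifi-GAN row on A100 above (a simplification: it treats RTFX as constant over the utterance):

```python
# TTS RTFX is seconds of audio generated per second of compute, so the
# wall-clock time to synthesize an utterance is roughly the latency to
# first audio plus audio duration divided by RTFX.
def synthesis_time(audio_seconds: float, rtfx: float,
                   first_audio_latency: float = 0.0) -> float:
    """Approximate wall-clock time to generate `audio_seconds` of speech."""
    return first_audio_latency + audio_seconds / rtfx

# ~10 s of speech at RTFX 133 with 0.03 s latency to first audio
print(f"{synthesis_time(10.0, 133, 0.03):.3f} s")  # ~0.105 s
```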

A30 TTS Benchmarks

| Model | # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| FastPitch + Hifi-GAN | 1 | 0.03 | 0.003 | 133 | A30 |
| FastPitch + Hifi-GAN | 4 | 0.04 | 0.006 | 340 | A30 |
| FastPitch + Hifi-GAN | 6 | 0.06 | 0.007 | 390 | A30 |
| FastPitch + Hifi-GAN | 8 | 0.07 | 0.009 | 443 | A30 |
| FastPitch + Hifi-GAN | 10 | 0.07 | 0.009 | 464 | A30 |
| Tacotron 2 + WaveGlow | 1 | 0.07 | 0.03 | 25 | A30 |
| Tacotron 2 + WaveGlow | 4 | 0.33 | 0.04 | 45 | A30 |
| Tacotron 2 + WaveGlow | 6 | 0.51 | 0.05 | 48 | A30 |
| Tacotron 2 + WaveGlow | 8 | 0.69 | 0.06 | 50 | A30 |
| Tacotron 2 + WaveGlow | 10 | 0.84 | 0.06 | 50 | A30 |


V100 TTS Benchmarks

| Model | # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| FastPitch + Hifi-GAN | 1 | 0.03 | 0.005 | 107 | V100 SXM2-16GB |
| FastPitch + Hifi-GAN | 4 | 0.07 | 0.01 | 212 | V100 SXM2-16GB |
| FastPitch + Hifi-GAN | 6 | 0.10 | 0.01 | 226 | V100 SXM2-16GB |
| FastPitch + Hifi-GAN | 8 | 0.13 | 0.02 | 236 | V100 SXM2-16GB |
| FastPitch + Hifi-GAN | 10 | 0.15 | 0.02 | 232 | V100 SXM2-16GB |
| Tacotron 2 + WaveGlow | 1 | 0.06 | 0.03 | 25 | V100 SXM2-16GB |
| Tacotron 2 + WaveGlow | 4 | 0.39 | 0.05 | 37 | V100 SXM2-16GB |
| Tacotron 2 + WaveGlow | 6 | 0.60 | 0.06 | 40 | V100 SXM2-16GB |
| Tacotron 2 + WaveGlow | 8 | 0.81 | 0.06 | 43 | V100 SXM2-16GB |
| Tacotron 2 + WaveGlow | 10 | 0.98 | 0.07 | 43 | V100 SXM2-16GB |




T4 TTS Benchmarks

| Model | # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| FastPitch + Hifi-GAN | 1 | 0.05 | 0.006 | 73 | NVIDIA T4 |
| FastPitch + Hifi-GAN | 4 | 0.11 | 0.02 | 132 | NVIDIA T4 |
| FastPitch + Hifi-GAN | 6 | 0.15 | 0.02 | 141 | NVIDIA T4 |
| FastPitch + Hifi-GAN | 8 | 0.19 | 0.03 | 148 | NVIDIA T4 |
| FastPitch + Hifi-GAN | 10 | 0.21 | 0.03 | 150 | NVIDIA T4 |
| Tacotron 2 + WaveGlow | 1 | 0.11 | 0.05 | 15 | NVIDIA T4 |
| Tacotron 2 + WaveGlow | 4 | 0.72 | 0.11 | 18 | NVIDIA T4 |
| Tacotron 2 + WaveGlow | 6 | 1.16 | 0.14 | 19 | NVIDIA T4 |
| Tacotron 2 + WaveGlow | 8 | 1.64 | 0.16 | 19 | NVIDIA T4 |
| Tacotron 2 + WaveGlow | 10 | 2.07 | 0.17 | 19 | NVIDIA T4 |


 

Last updated: June 29th, 2022