Reproducible Performance

Reproduce these results on your systems by following the instructions in the Measuring Training and Inferencing Performance on NVIDIA AI Platforms Reviewer's Guide.

Related Resources

HPC Performance

Review the latest GPU-acceleration factors of popular HPC applications.


Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Training to convergence is the best methodology for testing whether an AI system is ready to be deployed in the field and deliver meaningful results.
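
The time-to-train numbers throughout this page follow that methodology: run until the quality target is reached, then report wall-clock minutes. A minimal sketch of such a measurement loop (the epoch/evaluation callbacks and the target value below are toy placeholders, not any real benchmark harness):

```python
import time

def train_to_convergence(train_one_epoch, evaluate, target, max_epochs=100):
    """Train until the validation metric reaches the target; report
    wall-clock time-to-train in minutes, as in the tables below."""
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if evaluate() >= target:
            return epoch, (time.perf_counter() - start) / 60
    raise RuntimeError("did not reach the quality target within max_epochs")

# Toy stand-ins for a real training loop and validation pass.
state = {"acc": 0.70}
def train_one_epoch():   # pretend each epoch adds two points of accuracy
    state["acc"] += 0.02
def evaluate():
    return state["acc"]

epochs, minutes = train_to_convergence(train_one_epoch, evaluate, target=0.759)
print(epochs)  # converges after 3 toy epochs
```

Note that raw throughput alone is not the headline metric here; a run only counts once the target quality is met.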

Related Resources

Read our blog on convergence for more details.

Get up and running quickly with NVIDIA’s complete solution stack:


NVIDIA Performance on MLPerf 2.1 Training Benchmarks

BERT Time to Train on A100

PyTorch | Precision: Mixed | Dataset: Wikipedia 2020/01/01 | Convergence criteria - refer to MLPerf requirements

MLPerf Training Performance

NVIDIA Performance on MLPerf 2.1 AI Benchmarks: Single Node - Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| MXNet | ResNet-50 v1.5 | 14.746 | 75.90% classification | 8x H100 | DGX H100 | 2.1-2091 | Mixed | ImageNet2012 | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 27.688 | 75.90% classification | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2038 | Mixed | ImageNet2012 | A100-SXM4-80GB |
| MXNet | 3D U-Net | 13.101 | 0.908 Mean DICE score | 8x H100 | DGX H100 | 2.1-2091 | Mixed | KiTS 2019 | H100-SXM5-80GB |
| MXNet | 3D U-Net | 22.989 | 0.908 Mean DICE score | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2038 | Mixed | KiTS 2019 | A100-SXM4-80GB |
| PyTorch | BERT | 6.378 | 0.72 Mask-LM accuracy | 8x H100 | DGX H100 | 2.1-2091 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | BERT | 16.549 | 0.72 Mask-LM accuracy | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2039 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB |
| PyTorch | Mask R-CNN | 20.348 | 0.377 Box min AP and 0.339 Mask min AP | 8x H100 | DGX H100 | 2.1-2091 | Mixed | COCO2017 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 37.916 | 0.377 Box min AP and 0.339 Mask min AP | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2039 | Mixed | COCO2017 | A100-SXM4-80GB |
| PyTorch | RNN-T | 18.202 | 0.058 Word Error Rate | 8x H100 | DGX H100 | 2.1-2091 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RNN-T | 29.948 | 0.058 Word Error Rate | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2039 | Mixed | LibriSpeech | A100-SXM4-80GB |
| PyTorch | RetinaNet | 38.050 | 34.0% mAP | 8x H100 | DGX H100 | 2.1-2091 | Mixed | OpenImages | H100-SXM5-80GB |
| PyTorch | RetinaNet | 82.529 | 34.0% mAP | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2039 | Mixed | OpenImages | A100-SXM4-80GB |
| TensorFlow | MiniGo | 174.584 | 50% win rate vs. checkpoint | 8x H100 | DGX H100 | 2.1-2091 | Mixed | Go | H100-SXM5-80GB |
| TensorFlow | MiniGo | 161.848 | 50% win rate vs. checkpoint | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2040 | Mixed | Go | A100-SXM4-80GB |
| NVIDIA Merlin HugeCTR | DLRM | 1.063 | 0.8025 AUC | 8x H100 | DGX H100 | 2.1-2091 | Mixed | Criteo AI Lab's Terabyte Click-Through-Rate (CTR) | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM | 1.625 | 0.8025 AUC | 8x A100 | Fujitsu: PRIMERGY-GX2570M6-hugectr | 2.1-2033 | Mixed | Criteo AI Lab's Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB |

NVIDIA Performance on MLPerf 2.1 AI Benchmarks: Multi Node - Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| MXNet | ResNet-50 v1.5 | 4.508 | 75.90% classification | 32x H100 | DGX H100 | 2.1-2093 | Mixed | ImageNet2012 | H100-SXM5-80GB |
| MXNet | ResNet-50 v1.5 | 4.523 | 75.90% classification | 64x A100 | DGX A100 | 2.1-2065 | Mixed | ImageNet2012 | A100-SXM4-80GB |
| MXNet | ResNet-50 v1.5 | 0.555 | 75.90% classification | 1,024x A100 | DGX A100 | 2.1-2073 | Mixed | ImageNet2012 | A100-SXM4-80GB |
| MXNet | ResNet-50 v1.5 | 0.319 | 75.90% classification | 4,216x A100 | DGX A100 | 2.1-2080 | Mixed | ImageNet2012 | A100-SXM4-80GB |
| MXNet | 3D U-Net | 5.347 | 0.908 Mean DICE score | 24x H100 | DGX H100 | 2.1-2092 | Mixed | KiTS 2019 | H100-SXM5-80GB |
| MXNet | 3D U-Net | 3.437 | 0.908 Mean DICE score | 72x A100 | Azure: ND96amsr_A100_v4_n9 | 2.1-2009 | Mixed | KiTS 2019 | A100-SXM4-80GB |
| MXNet | 3D U-Net | 1.216 | 0.908 Mean DICE score | 768x A100 | DGX A100 | 2.1-2072 | Mixed | KiTS 2019 | A100-SXM4-80GB |
| PyTorch | BERT | 1.797 | 0.72 Mask-LM accuracy | 32x H100 | DGX H100 | 2.1-2093 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| PyTorch | BERT | 2.497 | 0.72 Mask-LM accuracy | 64x A100 | DGX A100 | 2.1-2068 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB |
| PyTorch | BERT | 0.421 | 0.72 Mask-LM accuracy | 1,024x A100 | DGX A100 | 2.1-2074 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB |
| PyTorch | BERT | 0.208 | 0.72 Mask-LM accuracy | 4,096x A100 | DGX A100 | 2.1-2079 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB |
| PyTorch | Mask R-CNN | 7.338 | 0.377 Box min AP and 0.339 Mask min AP | 32x H100 | DGX H100 | 2.1-2093 | Mixed | COCO2017 | H100-SXM5-80GB |
| PyTorch | Mask R-CNN | 8.293 | 0.377 Box min AP and 0.339 Mask min AP | 64x A100 | HPE-ProLiant-XL675d-Gen10-Plus_A100-SXM-80GB_pytorch | 2.1-2049 | Mixed | COCO2017 | A100-SXM4-80GB |
| PyTorch | Mask R-CNN | 2.750 | 0.377 Box min AP and 0.339 Mask min AP | 384x A100 | DGX A100 | 2.1-2071 | Mixed | COCO2017 | A100-SXM4-80GB |
| PyTorch | RNN-T | 7.534 | 0.058 Word Error Rate | 32x H100 | DGX H100 | 2.1-2093 | Mixed | LibriSpeech | H100-SXM5-80GB |
| PyTorch | RNN-T | 6.910 | 0.058 Word Error Rate | 64x A100 | DGX A100 | 2.1-2066 | Mixed | LibriSpeech | A100-SXM4-80GB |
| PyTorch | RNN-T | 2.151 | 0.058 Word Error Rate | 1,536x A100 | DGX A100 | 2.1-2076 | Mixed | LibriSpeech | A100-SXM4-80GB |
| PyTorch | RetinaNet | 11.798 | 34.0% mAP | 32x H100 | DGX H100 | 2.1-2093 | Mixed | OpenImages | H100-SXM5-80GB |
| PyTorch | RetinaNet | 12.763 | 34.0% mAP | 64x A100 | DGX A100 | 2.1-2068 | Mixed | OpenImages | A100-SXM4-80GB |
| PyTorch | RetinaNet | 2.349 | 34.0% mAP | 1,280x A100 | DGX A100 | 2.1-2075 | Mixed | OpenImages | A100-SXM4-80GB |
| PyTorch | RetinaNet | 1.843 | 34.0% mAP | 2,048x A100 | DGX A100 | 2.1-2078 | Mixed | OpenImages | A100-SXM4-80GB |
| TensorFlow | MiniGo | 92.522 | 50% win rate vs. checkpoint | 32x H100 | DGX H100 | 2.1-2093 | Mixed | Go | H100-SXM5-80GB |
| TensorFlow | MiniGo | 73.038 | 50% win rate vs. checkpoint | 64x A100 | DGX A100 | 2.1-2067 | Mixed | Go | A100-SXM4-80GB |
| TensorFlow | MiniGo | 16.231 | 50% win rate vs. checkpoint | 1,792x A100 | DGX A100 | 2.1-2077 | Mixed | Go | A100-SXM4-80GB |
| NVIDIA Merlin HugeCTR | DLRM | 0.515 | 0.8025 AUC | 32x H100 | DGX H100 | 2.1-2093 | Mixed | Criteo AI Lab's Terabyte Click-Through-Rate (CTR) | H100-SXM5-80GB |
| NVIDIA Merlin HugeCTR | DLRM | 0.653 | 0.8025 AUC | 64x A100 | DGX A100 | 2.1-2064 | Mixed | Criteo AI Lab's Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB |
| NVIDIA Merlin HugeCTR | DLRM | 0.588 | 0.8025 AUC | 112x A100 | DGX A100 | 2.1-2070 | Mixed | Criteo AI Lab's Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB |

MLPerf™ v2.1 Training Closed: 2.1-2038, 2.1-2039, 2.1-2033, 2.1-2040, 2.1-2065, 2.1-2068, 2.1-2049, 2.1-2066, 2.1-2064, 2.1-2067, 2.1-2009, 2.1-2070, 2.1-2071, 2.1-2072, 2.1-2073, 2.1-2074, 2.1-2075, 2.1-2076, 2.1-2077, 2.1-2078, 2.1-2079, 2.1-2080, 2.1-2091, 2.1-2092, 2.1-2093 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
H100 SXM5-80GB is a preview submission


NVIDIA A100 Performance on MLPerf 2.0 Training HPC Benchmarks: Strong Scaling - Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| PyTorch | CosmoFlow | 3.79 | Mean average error 0.124 | 512x A100 | DGX A100 | 2.0-8006 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB |
| PyTorch | DeepCAM | 1.57 | IOU 0.82 | 2,048x A100 | DGX A100 | 2.0-8005 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB |
| PyTorch | OpenCatalyst | 21.93 | Forces mean absolute error 0.036 | 512x A100 | DGX A100 | 2.0-8006 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | A100-SXM4-80GB |

NVIDIA A100 Performance on MLPerf 2.0 Training HPC Benchmarks: Weak Scaling - Closed Division

| Framework | Network | Throughput | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| PyTorch | CosmoFlow | 4.21 models/min | Mean average error 0.124 | 4,096x A100 | DGX A100 | 2.0-8014 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB |
| PyTorch | DeepCAM | 6.40 models/min | IOU 0.82 | 4,096x A100 | DGX A100 | 2.0-8014 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB |
| PyTorch | OpenCatalyst | 0.66 models/min | Forces mean absolute error 0.036 | 4,096x A100 | DGX A100 | 2.0-8014 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | A100-SXM4-80GB |

MLPerf™ v2.0 Training HPC Closed: 2.0-8005, 2.0-8006, 2.0-8014 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For MLPerf™ v2.0 Training HPC rules and guidelines, click here

Converged Training Performance of NVIDIA Data Center GPUs

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.0.0a0 | Tacotron2 | 100 | .56 Training Loss | 309,182 total output mels/sec | 8x A100 | DGX A100 | 23.03-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM4-80GB |
| PyTorch | 2.0.0a0 | WaveGlow | 226 | -5.83 Training Loss | 1,866,026 output samples/sec | 8x A100 | DGX A100 | 23.03-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-80GB |
| PyTorch | 2.0.0a0 | GNMT v2 | 15 | 24.52 BLEU Score | 1,008,081 total tokens/sec | 8x A100 | DGX A100 | 23.03-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-80GB |
| PyTorch | 2.0.0a0 | NCF | 0.36 | .96 Hit Rate at 10 | 157,366,919 samples/sec | 8x A100 | DGX A100 | 23.03-py3 | Mixed | 131072 | MovieLens 20M | A100-SXM4-80GB |
| PyTorch | 2.0.0a0 | Transformer XL Large | 376 | 17.82 Perplexity | 221,307 total tokens/sec | 8x A100 | DGX A100 | 23.03-py3 | Mixed | 16 | WikiText-103 | A100-SXM4-80GB |
| PyTorch | 2.0.0a0 | Transformer XL Base | 178 | 21.54 Perplexity | 744,427 total tokens/sec | 8x A100 | DGX A100 | 23.03-py3 | Mixed | 128 | WikiText-103 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | EfficientDet-D0 | 477 | .34 BBOX mAP | 1,888 images/sec | 8x A100 | DGX A100 | 22.12-py3 | Mixed | 150 | COCO 2017 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | TFT - Traffic | 2 | .08 Test P90 | 132,904 items/sec | 8x A100 | DGX A100 | 22.12-py3 | Mixed | 1024 | Traffic | A100-SXM4-80GB |
| PyTorch | 2.0.0a0 | TFT - Electricity | 2 | .03 Test P90 | 128,663 items/sec | 8x A100 | DGX A100 | 23.03-py3 | Mixed | 1024 | Electricity | A100-SXM4-80GB |
| PyTorch | 2.0.0a0 | HiFiGAN | 1,844 | 9.67 Training Loss | 57,480 total output mels/sec | 8x A100 | DGX A100 | 23.03-py3 | Mixed | 16 | LJSpeech 1.1 | A100-SXM4-80GB |
| TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 1,080 images/sec | 8x A100 | DGX A100 | 23.02-py3 | Mixed | 2 | DAGM2007 | A100-SXM4-80GB |
| TensorFlow | 2.11.0 | U-Net Medical | 4 | .89 DICE Score | 1,016 images/sec | 8x A100 | DGX A100 | 23.03-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM4-80GB |
| TensorFlow | 2.11.0 | Electra Fine Tuning | 3 | 99.42 F1 | 2,766 sequences/sec | 8x A100 | DGX A100 | 23.03-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| TensorFlow | 2.11.0 | EfficientNet-B0 | 536 | 76.52 Top 1 | 20,310 images/sec | 8x A100 | DGX A100 | 23.03-py3 | Mixed | 1024 | ImageNet2012 | A100-SXM4-80GB |
| TensorFlow | 2.11.0 | Wide and Deep | 5 | .66 MAP at 12 | 7,374,523 samples/sec | 8x A100 | DGX A100 | 23.03-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | A100-SXM4-80GB |
| TensorFlow | 2.11.0 | SIM | 1 | .82 AUC | 3,098,865 samples/sec | 8x A100 | DGX A100 | 23.03-py3 | Mixed | 16384 | Amazon Reviews | A100-SXM4-80GB |

A40 Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.0.0a0 | NCF | 1 | .96 Hit Rate at 10 | 47,497,889 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 131072 | MovieLens 20M | A40 |
| PyTorch | 2.0.0a0 | Tacotron2 | 112 | .56 Training Loss | 271,434 total output mels/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.03-py3 | Mixed | 128 | LJSpeech 1.1 | A40 |
| PyTorch | 2.0.0a0 | WaveGlow | 459 | -5.7 Training Loss | 915,775 output samples/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.03-py3 | Mixed | 10 | LJSpeech 1.1 | A40 |
| PyTorch | 2.0.0a0 | GNMT v2 | 46 | 24.41 BLEU Score | 328,835 total tokens/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.03-py3 | Mixed | 128 | wmt16-en-de | A40 |
| PyTorch | 2.0.0a0 | Transformer XL Large | 925 | 17.82 Perplexity | 90,187 total tokens/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.03-py3 | Mixed | 16 | WikiText-103 | A40 |
| PyTorch | 1.14.0a0 | Transformer XL Base | 437 | 21.59 Perplexity | 304,533 total tokens/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.02-py3 | Mixed | 128 | WikiText-103 | A40 |
| PyTorch | 2.0.0a0 | FastPitch | 140 | .17 Training Loss | 610,189 frames/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.03-py3 | Mixed | 32 | LJSpeech 1.1 | A40 |
| PyTorch | 2.0.0a0 | TFT - Traffic | 2 | .08 Test P90 | 74,780 items/sec | 8x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 1024 | Traffic | A40 |
| PyTorch | 2.0.0a0 | TFT - Electricity | 4 | .03 Test P90 | 74,746 items/sec | 8x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 1024 | Electricity | A40 |
| TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 738 images/sec | 8x A40 | GIGABYTE G482-Z52-00 | 23.02-py3 | Mixed | 2 | DAGM2007 | A40 |
| TensorFlow | 2.11.0 | Electra Fine Tuning | 4 | 92.44 F1 | 1,127 sequences/sec | 8x A40 | Supermicro AS -4124GS-TNR | 23.03-py3 | Mixed | 32 | SQuAD v1.1 | A40 |
| TensorFlow | 2.11.0 | SIM | 1 | .81 AUC | 2,316,262 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 16384 | Amazon Reviews | A40 |

A30 Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.0.0a0 | Tacotron2 | 250 | .53 Training Loss | 251,114 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 104 | LJSpeech 1.1 | A30 |
| PyTorch | 1.14.0a0 | WaveGlow | 432 | -5.67 Training Loss | 972,594 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.02-py3 | Mixed | 10 | LJSpeech 1.1 | A30 |
| PyTorch | 2.0.0a0 | GNMT v2 | 48 | 24.15 BLEU Score | 310,897 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 128 | wmt16-en-de | A30 |
| PyTorch | 2.0.0a0 | NCF | 1 | .96 Hit Rate at 10 | 53,500,822 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 131072 | MovieLens 20M | A30 |
| PyTorch | 2.0.0a0 | FastPitch | 162 | .17 Training Loss | 516,048 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 16 | LJSpeech 1.1 | A30 |
| PyTorch | 2.0.0a0 | Transformer XL Base | 145 | 22.77 Perplexity | 231,038 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 32 | WikiText-103 | A30 |
| PyTorch | 2.0.0a0 | EfficientDet-D0 | 898 | .34 BBOX mAP | 802 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 30 | COCO 2017 | A30 |
| PyTorch | 2.0.0a0 | TFT - Traffic | 2 | .08 Test P90 | 80,767 items/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 1024 | Traffic | A30 |
| PyTorch | 1.14.0a0 | TFT - Electricity | 3 | .03 Test P90 | 79,171 items/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.01-py3 | Mixed | 1024 | Electricity | A30 |
| TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 681 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.02-py3 | Mixed | 2 | DAGM2007 | A30 |
| TensorFlow | 1.15.5 | U-Net Medical | 6 | .9 DICE Score | 460 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.02-py3 | Mixed | 8 | EM segmentation challenge | A30 |
| TensorFlow | 1.15.5 | Wide and Deep | 150 | .68 MAP at 12 | 1,031,440 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | TF32 | 16384 | Kaggle Outbrain Click Prediction | A30 |
| TensorFlow | 2.11.0 | Electra Fine Tuning | 5 | 92.67 F1 | 974 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 16 | SQuAD v1.1 | A30 |
| TensorFlow | 2.11.0 | SIM | 1 | .81 AUC | 2,206,332 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 16384 | Amazon Reviews | A30 |

A10 Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.0.0a0 | Tacotron2 | 138 | .53 Training Loss | 221,082 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 104 | LJSpeech 1.1 | A10 |
| PyTorch | 2.0.0a0 | WaveGlow | 588 | -5.8 Training Loss | 713,663 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 10 | LJSpeech 1.1 | A10 |
| PyTorch | 2.0.0a0 | GNMT v2 | 58 | 24.18 BLEU Score | 256,851 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 128 | wmt16-en-de | A10 |
| PyTorch | 2.0.0a0 | NCF | 1 | .96 Hit Rate at 10 | 44,699,425 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 131072 | MovieLens 20M | A10 |
| PyTorch | 2.0.0a0 | FastPitch | 184 | .17 Training Loss | 449,885 frames/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 16 | LJSpeech 1.1 | A10 |
| PyTorch | 2.0.0a0 | Transformer-XL Base | 191 | 22.81 Perplexity | 174,655 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 32 | WikiText-103 | A10 |
| PyTorch | 2.0.0a0 | EfficientDet-D0 | 939 | .34 BBOX mAP | 767 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 30 | COCO 2017 | A10 |
| PyTorch | 1.14.0a0 | TFT - Traffic | 3 | .08 Test P90 | 63,062 items/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.02-py3 | Mixed | 1024 | Traffic | A10 |
| TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 657 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 2 | DAGM2007 | A10 |
| TensorFlow | 1.15.5 | U-Net Medical | 13 | .89 DICE Score | 355 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 8 | EM segmentation challenge | A10 |
| TensorFlow | 2.11.0 | Wide and Deep | 181 | .68 MAP at 12 | 741,087 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.02-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | A10 |
| TensorFlow | 2.11.0 | Electra Fine Tuning | 6 | 92.72 F1 | 738 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 16 | SQuAD v1.1 | A10 |
| TensorFlow | 2.11.0 | SIM | 1 | .81 AUC | 1,961,379 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | Mixed | 16384 | Amazon Reviews | A10 |

T4 Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 2.0.0a0 | Tacotron2 | 225 | .5 Training Loss | 133,611 total output mels/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4 |
| PyTorch | 2.0.0a0 | WaveGlow | 1,007 | -5.69 Training Loss | 415,453 output samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4 |
| PyTorch | 2.0.0a0 | GNMT v2 | 112 | 24.31 BLEU Score | 159,382 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4 |
| PyTorch | 2.0.0a0 | NCF | 2 | .96 Hit Rate at 10 | 25,978,081 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA T4 |
| PyTorch | 1.14.0a0 | FastPitch | 344 | .17 Training Loss | 233,644 frames/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 23.02-py3 | Mixed | 16 | LJSpeech 1.1 | NVIDIA T4 |
| PyTorch | 2.0.0a0 | Transformer-XL Base | 343 | 22.84 Perplexity | 96,444 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | Mixed | 32 | WikiText-103 | NVIDIA T4 |
| PyTorch | 2.0.0a0 | TFT - Traffic | 5 | .08 Test P90 | 33,043 items/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | Mixed | 1024 | Traffic | NVIDIA T4 |
| PyTorch | 2.0.0a0 | TFT - Electricity | 7 | .03 Test P90 | 33,482 items/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | Mixed | 1024 | Electricity | NVIDIA T4 |
| TensorFlow | 1.15.5 | U-Net Industrial | 2 | .99 IoU Threshold 0.99 | 329 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | Mixed | 2 | DAGM2007 | NVIDIA T4 |
| TensorFlow | 1.15.5 | U-Net Medical | 70 | .9 DICE Score | 155 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4 |
| TensorFlow | 1.15.5 | Wide and Deep | 217 | .68 MAP at 12 | 640,843 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | NVIDIA T4 |
| TensorFlow | 2.11.0 | Electra Fine Tuning | 10 | 92.7 F1 | 373 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 23.02-py3 | Mixed | 16 | SQuAD v1.1 | NVIDIA T4 |
| TensorFlow | 2.11.0 | SIM | 2 | .8 AUC | 1,128,165 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | Mixed | 16384 | Amazon Reviews | NVIDIA T4 |


V100 Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.14.0a0 | Tacotron2 | 145 | .53 Training Loss | 223,647 total output mels/sec | 8x V100 | DGX-2 | 23.02-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB |
| PyTorch | 2.0.0a0 | WaveGlow | 389 | -5.73 Training Loss | 1,098,179 output samples/sec | 8x V100 | DGX-2 | 23.03-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB |
| PyTorch | 2.0.0a0 | GNMT v2 | 34 | 24.23 BLEU Score | 447,868 total tokens/sec | 8x V100 | DGX-2 | 23.03-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB |
| PyTorch | 2.0.0a0 | NCF | 1 | .96 Hit Rate at 10 | 94,887,080 samples/sec | 8x V100 | DGX-2 | 23.03-py3 | Mixed | 131072 | MovieLens 20M | V100-SXM3-32GB |
| PyTorch | 2.0.0a0 | FastPitch | 169 | .17 Training Loss | 518,517 frames/sec | 8x V100 | DGX-2 | 23.03-py3 | Mixed | 16 | LJSpeech 1.1 | V100-SXM3-32GB |
| PyTorch | 2.0.0a0 | Transformer-XL Base | 108 | 22.79 Perplexity | 308,781 total tokens/sec | 8x V100 | DGX-2 | 23.03-py3 | Mixed | 32 | WikiText-103 | V100-SXM3-32GB |
| PyTorch | 1.14.0a0 | EfficientDet-D0 | 1,226 | .34 BBOX mAP | 584 images/sec | 8x V100 | DGX-2 | 23.02-py3 | Mixed | 60 | COCO 2017 | V100-SXM3-32GB |
| PyTorch | 2.0.0a0 | TFT - Traffic | 2 | .08 Test P90 | 87,644 items/sec | 8x V100 | DGX-2 | 23.03-py3 | Mixed | 1024 | Traffic | V100-SXM3-32GB |
| PyTorch | 2.0.0a0 | TFT - Electricity | 3 | .03 Test P90 | 87,857 items/sec | 8x V100 | DGX-2 | 23.03-py3 | Mixed | 1024 | Electricity | V100-SXM3-32GB |
| TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 643 images/sec | 8x V100 | DGX-2 | 23.03-py3 | Mixed | 2 | DAGM2007 | V100-SXM3-32GB |
| TensorFlow | 1.15.5 | U-Net Medical | 13 | .89 DICE Score | 460 images/sec | 8x V100 | DGX-2 | 23.03-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB |
| TensorFlow | 2.11.0 | Wide and Deep | 185 | .68 MAP at 12 | 4,689,385 samples/sec | 8x V100 | DGX-2 | 23.03-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | V100-SXM3-32GB |
| TensorFlow | 2.11.0 | Electra Fine Tuning | 5 | 92.64 F1 | 1,359 sequences/sec | 8x V100 | DGX-2 | 23.03-py3 | Mixed | 32 | SQuAD v1.1 | V100-SXM3-32GB |
| TensorFlow | 2.11.0 | SIM | 1 | .82 AUC | 1,991,609 samples/sec | 8x V100 | DGX-2 | 23.03-py3 | Mixed | 16384 | Amazon Reviews | V100-SXM3-32GB |

Converged Training Performance of NVIDIA Data Center GPUs on Cloud

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | - | ResNet-50 v1.5 | 81 | 77.18 Top 1 | 24,513 images/sec | 8x A100 | Azure Standard_ND96amsr_A100_v4 | 22.12-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
| MXNet | - | ResNet-50 v1.5 | 84 | 77.17 Top 1 | 23,595 images/sec | 8x A100 | GCP A2-HIGHGPU-8G | 23.02-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-40GB |

Converged Multi-Node Training Performance of NVIDIA Data Center GPUs

Benchmarks are reproducible by following links to the NGC catalog scripts

A100 Multi-Node Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | Number of Nodes | Number of GPUs | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 1606 | 1.32 Final Loss | 4,754.06 sequences/sec | 1 | 8 | Selene | 23.01-py3 | Mixed | 256 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 494 | 1.28 Final Loss | 1,780.29 sequences/sec | 1 | 8 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 1235 | 1.28 Final Loss | - | 1 | 8 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 804 | 1.37 Final Loss | 9,457.6 sequences/sec | 2 | 16 | Selene | 23.01-py3 | Mixed | 256 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 245 | 1.27 Final Loss | 3,479.51 sequences/sec | 2 | 16 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 618 | 1.27 Final Loss | - | 2 | 16 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 209 | 1.42 Final Loss | 37,052.87 sequences/sec | 8 | 64 | Selene | 23.01-py3 | Mixed | 256 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 64 | 1.28 Final Loss | 13,805.48 sequences/sec | 8 | 64 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 160 | 1.28 Final Loss | - | 8 | 64 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 119 | 1.41 Final Loss | 71,775.3 sequences/sec | 16 | 128 | Selene | 23.01-py3 | Mixed | 256 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 35 | 1.28 Final Loss | 25,005.24 sequences/sec | 16 | 128 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 91 | 1.28 Final Loss | - | 16 | 128 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 56 | 1.41 Final Loss | 138,802.3 sequences/sec | 32 | 256 | Selene | 23.01-py3 | Mixed | 256 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 18 | 1.27 Final Loss | 51,148.79 sequences/sec | 32 | 256 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 44 | 1.27 Final Loss | - | 32 | 256 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 32 | 1.4 Final Loss | 245,286.91 sequences/sec | 64 | 512 | Selene | 23.01-py3 | Mixed | 128 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 11 | 1.26 Final Loss | 89,069.47 sequences/sec | 64 | 512 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 25 | 1.26 Final Loss | - | 64 | 512 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P1 | 23 | 1.4 Final Loss | 343,139.17 sequences/sec | 128 | 1024 | Selene | 23.01-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training P2 | 8 | 1.24 Final Loss | 143,583.49 sequences/sec | 128 | 1024 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.14.0a0 | BERT-Large Pre-Training E2E | 18 | 1.24 Final Loss | - | 128 | 1024 | Selene | 23.01-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |

BERT-Large Pre-Training Phase 1 Sequence Length = 128
BERT-Large Pre-Training Phase 2 Sequence Length = 512
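
The Phase 1 throughput column above also makes the scaling behavior easy to quantify: dividing measured speedup by ideal (linear) speedup gives scaling efficiency. A back-of-the-envelope check with sequences/sec copied from the table (note that the per-GPU batch size also shrinks at the largest scales, which accounts for part of the drop):

```python
def scaling_efficiency(base_gpus, base_tput, scaled_gpus, scaled_tput):
    """Fraction of ideal linear speedup retained when scaling out."""
    return (scaled_tput / base_tput) / (scaled_gpus / base_gpus)

# BERT-Large Phase 1 sequences/sec at 8, 128, and 1,024 GPUs (from the table).
eff_128 = scaling_efficiency(8, 4754.06, 128, 71775.3)
eff_1024 = scaling_efficiency(8, 4754.06, 1024, 343139.17)
print(f"{eff_128:.0%} at 128 GPUs, {eff_1024:.0%} at 1,024 GPUs")
```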

AI Inference

Real-world inference demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance from the data center to the edge.
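
Both metrics can be read off the same set of per-request timings: throughput is completed requests over the measurement window, while latency is usually reported at percentiles. A rough sketch of that bookkeeping (the latency samples here are synthetic):

```python
def summarize(latencies_ms, window_s):
    """Throughput plus p50/p99 latency from per-request measurements."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    return {
        "throughput": n / window_s,              # requests/sec
        "p50_ms": ordered[int(0.50 * (n - 1))],  # median latency
        "p99_ms": ordered[int(0.99 * (n - 1))],  # tail latency
    }

# 1,000 synthetic request latencies collected over a 2-second window.
latencies = [10 + (i % 100) * 0.1 for i in range(1000)]
stats = summarize(latencies, window_s=2.0)
print(stats["throughput"])  # 500.0 requests/sec
```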

Related Resources

Power high-throughput, low-latency inference with NVIDIA’s complete solution stack:


MLPerf Inference v3.0 Performance Benchmarks

Offline Scenario - Closed Division

| Network | Throughput | GPU | Server | GPU Version | Dataset | Target Accuracy |
|---|---|---|---|---|---|---|
| ResNet-50 v1.5 | 727,437 samples/sec | 8x H100 | DGX H100 | H100 SXM-80GB | ImageNet | 76.46% Top1 |
| ResNet-50 v1.5 | 456,919 samples/sec | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | ImageNet | 76.46% Top1 |
| ResNet-50 v1.5 | 324,612 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet | 76.46% Top1 |
| ResNet-50 v1.5 | 319,091 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | ImageNet | 76.46% Top1 |
| ResNet-50 v1.5 | 13,158 samples/sec | 1x L4 | Gigabyte G482-Z54 | NVIDIA L4 | ImageNet | 76.46% Top1 |
| RetinaNet | 12,164 samples/sec | 8x H100 | DGX H100 | H100 SXM-80GB | OpenImages | 0.3755 mAP |
| RetinaNet | 7,975 samples/sec | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | OpenImages | 0.3755 mAP |
| RetinaNet | 5,803 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | OpenImages | 0.3755 mAP |
| RetinaNet | 5,377 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | OpenImages | 0.3755 mAP |
| RetinaNet | 179 samples/sec | 1x L4 | Gigabyte G482-Z54 | NVIDIA L4 | OpenImages | 0.3755 mAP |
| 3D-UNet | 55 samples/sec | 8x H100 | DGX H100 | H100 SXM-80GB | KiTS 2019 | 0.863 DICE mean |
| 3D-UNet | 38 samples/sec | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | KiTS 2019 | 0.863 DICE mean |
| 3D-UNet | 31 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | KiTS 2019 | 0.863 DICE mean |
| 3D-UNet | 29 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | KiTS 2019 | 0.863 DICE mean |
| 3D-UNet | 1.08 samples/sec | 1x L4 | Gigabyte G482-Z54 | NVIDIA L4 | KiTS 2019 | 0.863 DICE mean |
| RNN-T | 179,738 samples/sec | 8x H100 | DGX H100 | H100 SXM-80GB | LibriSpeech | 7.45% WER |
| RNN-T | 119,788 samples/sec | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | LibriSpeech | 7.45% WER |
| RNN-T | 106,221 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech | 7.45% WER |
| RNN-T | 99,332 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | LibriSpeech | 7.45% WER |
| RNN-T | 3,980 samples/sec | 1x L4 | Gigabyte G482-Z54 | NVIDIA L4 | LibriSpeech | 7.45% WER |
| BERT | 73,108 samples/sec | 8x H100 | DGX H100 | H100 SXM-80GB | SQuAD v1.1 | 90.87% f1 |
| BERT | 46,538 samples/sec | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | SQuAD v1.1 | 90.87% f1 |
| BERT | 28,276 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 | 90.87% f1 |
| BERT | 25,602 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | SQuAD v1.1 | 90.87% f1 |
| BERT | 1,032 samples/sec | 1x L4 | Gigabyte G482-Z54 | NVIDIA L4 | SQuAD v1.1 | 90.87% f1 |
| DLRM | 5,366,820 samples/sec | 8x H100 | DGX H100 | H100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
| DLRM | 3,713,980 samples/sec | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | Criteo 1TB Click Logs | 80.25% AUC |
| DLRM | 2,443,310 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
| DLRM | 2,262,170 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | Criteo 1TB Click Logs | 80.25% AUC |
| DLRM | 94,603 samples/sec | 1x L4 | Gigabyte G482-Z54 | NVIDIA L4 | Criteo 1TB Click Logs | 80.25% AUC |
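
One convenient way to read the offline table is as a per-network speedup of the 8x H100 SXM system over the 8x A100 SXM system (throughputs copied from the rows above; the ratios are a derived comparison, not an MLPerf metric):

```python
# (8x H100 SXM samples/sec, 8x A100 SXM samples/sec) from the offline table
offline = {
    "ResNet-50 v1.5": (727437, 324612),
    "RetinaNet": (12164, 5803),
    "3D-UNet": (55, 31),
    "RNN-T": (179738, 106221),
    "BERT": (73108, 28276),
    "DLRM": (5366820, 2443310),
}
speedups = {net: h100 / a100 for net, (h100, a100) in offline.items()}
for net, s in speedups.items():
    print(f"{net}: {s:.2f}x")
```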

Server Scenario - Closed Division

| Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset |
|---|---|---|---|---|---|---|---|
| ResNet-50 v1.5 | 600,179 queries/sec | 8x H100 | DGX H100 | H100 SXM-80GB | 76.46% Top1 | 15 | ImageNet |
| ResNet-50 v1.5 | 368,054 queries/sec | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | 76.46% Top1 | 15 | ImageNet |
| ResNet-50 v1.5 | 300,029 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 76.46% Top1 | 15 | ImageNet |
| ResNet-50 v1.5 | 236,016 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 76.46% Top1 | 15 | ImageNet |
| ResNet-50 v1.5 | 12,199 queries/sec | 1x L4 | Gigabyte G482-Z54 | NVIDIA L4 | 76.46% Top1 | 15 | ImageNet |
| RetinaNet | 11,519 queries/sec | 8x H100 | DGX H100 | H100 SXM-80GB | 0.3755 mAP | 100 | OpenImages |
| RetinaNet | 7,363 queries/sec | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | 0.3755 mAP | 100 | OpenImages |
| RetinaNet | 5,603 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 0.3755 mAP | 100 | OpenImages |
| RetinaNet | 4,642 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 0.3755 mAP | 100 | OpenImages |
| RetinaNet | 155 queries/sec | 1x L4 | Gigabyte G482-Z54 | NVIDIA L4 | 0.3755 mAP | 100 | OpenImages |
| RNN-T | 144,006 queries/sec | 8x H100 | DGX H100 | H100 SXM-80GB | 7.45% WER | 1,000 | LibriSpeech |
| RNN-T | 100,001 queries/sec | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | 7.45% WER | 1,000 | LibriSpeech |
| RNN-T | 104,000 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 7.45% WER | 1,000 | LibriSpeech |
| RNN-T | 89,999 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 7.45% WER | 1,000 | LibriSpeech |
| RNN-T | 3,801 queries/sec | 1x L4 | Gigabyte G482-Z54 | NVIDIA L4 | 7.45% WER | 1,000 | LibriSpeech |
| BERT | 59,598 queries/sec | 8x H100 | DGX H100 | H100 SXM-80GB | 90.87% f1 | 130 | SQuAD v1.1 |
| BERT | 36,808 queries/sec | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | 90.87% f1 | 130 | SQuAD v1.1 |
| BERT | 25,404 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 90.87% f1 | 130 | SQuAD v1.1 |
| BERT | 23,008 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 90.87% f1 | 130 | SQuAD v1.1 |
| BERT | 899 queries/sec | 1x L4 | Gigabyte G482-Z54 | NVIDIA L4 | 90.87% f1 | 130 | SQuAD v1.1 |
| DLRM | 5,776,720 queries/sec | 8x H100 | DGX H100 | H100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs |
| DLRM | 2,703,670 queries/sec | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs |
| DLRM | 2,390,870 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs |
| DLRM | 1,601,250 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs |
| DLRM | 89,000 queries/sec | 1x L4 | Gigabyte G482-Z54 | NVIDIA L4 | 80.25% AUC | 30 | Criteo 1TB Click Logs |

Power Efficiency Offline Scenario - Closed Division

| Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset |
|---|---|---|---|---|---|---|
| ResNet-50 v1.5 | 353,232 samples/sec | 159.14 samples/sec/watt | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | ImageNet |
| ResNet-50 v1.5 | 256,083 samples/sec | 124.93 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | ImageNet |
| RetinaNet | 5,975 samples/sec | 2.65 samples/sec/watt | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | OpenImages |
| RetinaNet | 4,687 samples/sec | 1.84 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | OpenImages |
| 3D-UNet | 27 samples/sec | 0.013 samples/sec/watt | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | KiTS 2019 |
| 3D-UNet | 20 samples/sec | 0.011 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | KiTS 2019 |
| RNN-T | 99,210 samples/sec | 45.24 samples/sec/watt | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | LibriSpeech |
| RNN-T | 78,878 samples/sec | 39.03 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | LibriSpeech |
| BERT | 39,196 samples/sec | 13.07 samples/sec/watt | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | SQuAD v1.1 |
| BERT | 22,397 samples/sec | 8.87 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | SQuAD v1.1 |
| DLRM | 3,046,880 samples/sec | 1,111.31 samples/sec/watt | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | Criteo 1TB Click Logs |
| DLRM | 1,573,310 samples/sec | 742.62 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | Criteo 1TB Click Logs |
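
Because throughput per watt is just throughput divided by average power, dividing the two columns recovers the implied system power draw during the run (an illustrative back-calculation, not a separately measured value):

```python
def implied_power_watts(samples_per_sec, samples_per_sec_per_watt):
    """Average power implied by a throughput and an efficiency figure."""
    return samples_per_sec / samples_per_sec_per_watt

# ResNet-50 v1.5 offline rows: 8x H100 PCIe vs. 8x A100 PCIe systems.
h100_w = implied_power_watts(353232, 159.14)
a100_w = implied_power_watts(256083, 124.93)
print(round(h100_w), round(a100_w))  # roughly 2220 W and 2050 W per system
```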

Power Efficiency Server Scenario - Closed Division

| Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset |
|---|---|---|---|---|---|---|
| ResNet-50 v1.5 | 240,018 queries/sec | 108.44 queries/sec/watt | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | ImageNet |
| ResNet-50 v1.5 | 203,512 queries/sec | 97.60 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | ImageNet |
| RetinaNet | 5,603 queries/sec | 2.42 queries/sec/watt | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | OpenImages |
| RetinaNet | 3,600 queries/sec | 1.78 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | OpenImages |
| RNN-T | 88,000 queries/sec | 39.66 queries/sec/watt | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | LibriSpeech |
| RNN-T | 75,001 queries/sec | 34.08 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | LibriSpeech |
| BERT | 33,004 queries/sec | 10.78 queries/sec/watt | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | SQuAD v1.1 |
| BERT | 17,299 queries/sec | 7.99 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | SQuAD v1.1 |
| DLRM | 1,501,100 queries/sec | 633.42 queries/sec/watt | 8x H100 | Gigabyte G482-Z54 | H100 PCIe-80GB | Criteo 1TB Click Logs |
| DLRM | 1,000,520 queries/sec | 493.49 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | Criteo 1TB Click Logs |

MLPerf™ v3.0 Inference Closed: ResNet-50 v1.5, RetinaNet, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net 99.9% of FP32 accuracy target, DLRM 99.9% of FP32 accuracy target: 3.0-0065, 3.0-0068, 3.0-0071, 3.0-0073, 3.0-0123, 3.0-0066, 3.0-0074. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
NVIDIA L4 is a preview submission
BERT-Large sequence length = 384.
DLRM samples refer to an average of 270 pairs per sample.
For MLPerf™ various scenario data, click here
For MLPerf™ latency constraints, click here

NVIDIA Triton Inference Server Delivered Comparable Performance to Custom Harness in MLPerf v3.0


NVIDIA set new records in MLPerf™ Inference 3.0, the AI industry's leading benchmark. For inference submissions, NVIDIA has typically used a custom A100 inference serving harness, designed and optimized specifically to provide the highest possible inference performance for MLPerf™ workloads, which require running inference on bare metal.

MLPerf™ v3.0 A100 Inference Closed: ResNet-50 v1.5, RetinaNet, BERT 99% of FP32 accuracy target, DLRM 99.9% of FP32 accuracy target: 3.0-0068, 3.0-0069. MLPerf name and logo are trademarks. See www.mlcommons.org for more information.


NVIDIA Client Batch Size 1 and 2 Performance with Triton Inference Server

A100 Triton Inference Server Performance

| Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT Large Inference | A100-SXM4-80GB | tensorflow | TensorRT | Mixed | 2 | 1 | 1 | 24 | 32.716 | 734 inf/sec | 384 | 23.03-py3 |
| BERT Large Inference | A100-SXM4-40GB | tensorflow | TensorRT | Mixed | 4 | 2 | 1 | 24 | 61.47 | 781 inf/sec | 384 | 23.03-py3 |
| BERT Large Inference | A100-PCIE-40GB | tensorflow | TensorRT | Mixed | 2 | 1 | 1 | 24 | 40.087 | 599 inf/sec | 384 | 23.03-py3 |
| BERT Large Inference | A100-PCIE-40GB | tensorflow | TensorRT | Mixed | 4 | 2 | 1 | 24 | 74.137 | 647 inf/sec | 384 | 23.03-py3 |
| BERT Base Inference | A100-SXM4-40GB | tensorflow | TensorRT | Mixed | 4 | 1 | 1 | 24 | 3.604 | 5,548 inf/sec | 128 | 23.03-py3 |
| BERT Base Inference | A100-SXM4-40GB | tensorflow | TensorRT | Mixed | 2 | 2 | 1 | 20 | 6.248 | 6,401 inf/sec | 128 | 23.03-py3 |
| BERT Base Inference | A100-PCIE-40GB | tensorflow | TensorRT | Mixed | 2 | 1 | 1 | 24 | 4.822 | 4,976 inf/sec | 128 | 23.03-py3 |
| BERT Base Inference | A100-PCIE-40GB | tensorflow | TensorRT | Mixed | 4 | 2 | 1 | 24 | 8.368 | 5,735 inf/sec | 128 | 23.03-py3 |
| DLRM Inference | A100-SXM4-40GB | ts-trace | PyTorch | Mixed | 2 | 1 | 65,536 | 30 | 1.231 | 24,347 inf/sec | - | 23.03-py3 |
| DLRM Inference | A100-SXM4-80GB | ts-trace | PyTorch | Mixed | 2 | 2 | 65,536 | 30 | 1.311 | 45,733 inf/sec | - | 23.03-py3 |
| DLRM Inference | A100-PCIE-40GB | ts-trace | PyTorch | Mixed | 1 | 1 | 65,536 | 30 | 1.294 | 23,174 inf/sec | - | 23.03-py3 |
| DLRM Inference | A100-PCIE-40GB | ts-trace | PyTorch | Mixed | 1 | 2 | 65,536 | 30 | 1.237 | 48,469 inf/sec | - | 23.03-py3 |
| ResNet-50 v1.5 | A100-SXM4-40GB | tensorrt | PyTorch | Mixed | 4 | 1 | 64 | 53 | 2.618 | 15,695 inf/sec | - | 23.03-py3 |
| ResNet-50 v1.5 | A100-PCIE-40GB | tensorrt | PyTorch | Mixed | 2 | 1 | 64 | 32 | 5.423 | 15,102 inf/sec | - | 23.03-py3 |
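
The BERT rows above are internally consistent with a closed-loop client: in steady state, in-flight inferences (concurrency × client batch size) ≈ latency × throughput, i.e. Little's law. A quick sanity check against three rows (numbers copied from the table):

```python
def expected_throughput(concurrency, client_batch, latency_ms):
    """Closed-loop steady state: in-flight samples / per-request latency."""
    return concurrency * client_batch / (latency_ms / 1000.0)

# (concurrency, client batch, latency ms, reported inf/sec) from the table
rows = [
    (24, 1, 32.716, 734),   # BERT Large, A100-SXM4-80GB
    (24, 2, 61.47, 781),    # BERT Large, A100-SXM4-40GB
    (20, 2, 6.248, 6401),   # BERT Base, A100-SXM4-40GB
]
for conc, batch, lat, reported in rows:
    est = expected_throughput(conc, batch, lat)
    print(f"estimated {est:.0f} vs. reported {reported} inf/sec")
```

The estimates land within about 1% of the reported throughputs, which is a useful check when reproducing these numbers with your own client.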

A30 Triton Inference Server Performance

| Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT Large Inference | A30 | tensorflow | TensorRT | Mixed | 2 | 1 | 1 | 20 | 56.058 | 357 inf/sec | 384 | 23.03-py3 |
| BERT Large Inference | A30 | tensorflow | TensorRT | Mixed | 4 | 2 | 1 | 24 | 128.875 | 372 inf/sec | 384 | 23.03-py3 |
| BERT Base Inference | A30 | tensorflow | TensorRT | Mixed | 2 | 1 | 1 | 24 | 7.74 | 3,100 inf/sec | 128 | 23.03-py3 |
| BERT Base Inference | A30 | tensorflow | TensorRT | Mixed | 4 | 2 | 1 | 24 | 13.558 | 3,540 inf/sec | 128 | 23.03-py3 |

A10 Triton Inference Server Performance

Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version
BERT Large Inference | A10 | tensorflow | TensorRT | Mixed | 4 | 1 | 1 | 24 | 102.234 | 235 inf/sec | 384 | 23.03-py3
BERT Large Inference | A10 | tensorflow | TensorRT | Mixed | 4 | 2 | 1 | 24 | 203.052 | 236 inf/sec | 384 | 23.03-py3
BERT Base Inference | A10 | tensorflow | TensorRT | Mixed | 2 | 1 | 1 | 24 | 11.098 | 2,162 inf/sec | 128 | 23.03-py3
BERT Base Inference | A10 | tensorflow | TensorRT | Mixed | 2 | 2 | 1 | 20 | 17.394 | 2,299 inf/sec | 128 | 23.03-py3

T4 Triton Inference Server Performance

Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version
BERT Large Inference | NVIDIA T4 | tensorflow | TensorRT | Mixed | 1 | 1 | 1 | 24 | 273.874 | 88 inf/sec | 384 | 23.03-py3
BERT Large Inference | NVIDIA T4 | tensorflow | TensorRT | Mixed | 1 | 2 | 1 | 20 | 430.037 | 93 inf/sec | 384 | 23.03-py3
BERT Base Inference | NVIDIA T4 | tensorflow | TensorRT | Mixed | 1 | 1 | 1 | 24 | 25.406 | 945 inf/sec | 128 | 23.03-py3
BERT Base Inference | NVIDIA T4 | tensorflow | TensorRT | Mixed | 4 | 2 | 1 | 24 | 52.019 | 923 inf/sec | 128 | 23.03-py3


V100 Triton Inference Server Performance

Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version
BERT Large Inference | V100 SXM2-32GB | tensorflow | TensorRT | Mixed | 4 | 1 | 1 | 24 | 92.871 | 258 inf/sec | 384 | 23.03-py3
BERT Large Inference | V100 SXM2-32GB | tensorflow | TensorRT | Mixed | 4 | 2 | 1 | 24 | 176.975 | 271 inf/sec | 384 | 23.03-py3
BERT Base Inference | V100 SXM2-32GB | tensorflow | TensorRT | Mixed | 2 | 1 | 1 | 24 | 11.776 | 2,038 inf/sec | 128 | 23.03-py3
BERT Base Inference | V100 SXM2-32GB | tensorflow | TensorRT | Mixed | 2 | 2 | 1 | 20 | 16.419 | 2,436 inf/sec | 128 | 23.03-py3
DLRM Inference | V100-SXM2-32GB | pytorch_libtorch | PyTorch | Mixed | 2 | 1 | 65,536 | 30 | 1.813 | 16,538 inf/sec | - | 22.12-py3
DLRM Inference | V100-SXM2-32GB | ts-trace | PyTorch | Mixed | 2 | 2 | 65,536 | 30 | 2.043 | 29,350 inf/sec | - | 23.03-py3

Inference Performance of NVIDIA Data Center GPUs

Benchmarks are reproducible by following the links to the NGC catalog scripts.

Inference Image Classification on CNNs with TensorRT

ResNet-50 v1.5 Throughput

DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.5.3 | Batch Size = 128 | 23.03-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.5.3 | Batch Size = 128 | 23.03-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.5.3 | Batch Size = 128 | 23.03-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.5.3 | Batch Size = 128 | 23.03-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-4029GP-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.5.3 | Batch Size = 128 | 23.03-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.5.3 | Batch Size = 128 | 23.03-py3 | Precision: INT8 | Dataset: Synthetic

 
 

ResNet-50 v1.5 Power Efficiency

DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.5.3 | Batch Size = 128 | 23.03-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.5.3 | Batch Size = 128 | 23.03-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.5.3 | Batch Size = 128 | 23.03-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.5.3 | Batch Size = 128 | 23.03-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-4029GP-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.5.3 | Batch Size = 128 | 23.03-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.5.3 | Batch Size = 128 | 23.03-py3 | Precision: INT8 | Dataset: Synthetic

 

H100 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
Stable Diffusion v2.1 (512x512) | 1 | 2.1 images/sec | - | 475.28 | 1x H100 | DGX H100 | - | Mixed | LAION-5B | TensorRT 8.6.0 | H100-SXM5-80GB
Stable Diffusion v2.1 (512x512) | 4 | 3.21 images/sec | - | 1,244.73 | 1x H100 | DGX H100 | - | Mixed | LAION-5B | TensorRT 8.6.0 | H100-SXM5-80GB

512x512 image size, 50 denoising steps for Stable Diffusion

L40 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
Stable Diffusion v2.1 (1,024x1,024) | 1 | 0.2 images/sec | - | 5,072.49 | 1x L40 | GIGABYTE G482-Z54-00 | - | Mixed | LAION-5B | TensorRT 8.5.2 | L40
ResNet-50 | 32 | 27,107 images/sec | 91 images/sec/watt | 1.18 | 1x L40 | GIGABYTE G482-Z54-00 | 23.02-py3 | INT8 | Synthetic | TensorRT 8.5.3 | L40
ResNet-50v1.5 | 32 | 25,935 images/sec | 87 images/sec/watt | 1.23 | 1x L40 | GIGABYTE G482-Z54-00 | 23.02-py3 | INT8 | Synthetic | TensorRT 8.5.3 | L40
BERT-BASE | 24 | 8,149 sequences/sec | 27 sequences/sec/watt | 2.95 | 1x L40 | GIGABYTE G482-Z54-00 | 23.02-py3 | INT8 | Synthetic | TensorRT 8.5.3 | L40
BERT-LARGE | 8 | 2,422 sequences/sec | 8 sequences/sec/watt | 3.3 | 1x L40 | GIGABYTE G482-Z54-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | L40
BERT-LARGE | 12 | 2,453 sequences/sec | 9 sequences/sec/watt | 4.89 | 1x L40 | GIGABYTE G482-Z54-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | L40
BERT-LARGE | 24 | 2,942 sequences/sec | 10 sequences/sec/watt | 8.16 | 1x L40 | GIGABYTE G482-Z54-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | L40
EfficientNet-B0 | 64 | 40,617 images/sec | 137 images/sec/watt | 1.58 | 1x L40 | GIGABYTE G482-Z54-00 | 23.02-py3 | INT8 | Synthetic | TensorRT 8.5.3 | L40
EfficientNet-B4 | 8 | 4,226 images/sec | 15 images/sec/watt | 1.89 | 1x L40 | GIGABYTE G482-Z54-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | L40
EfficientNet-B4 | 16 | 5,076 images/sec | 17 images/sec/watt | 3.15 | 1x L40 | GIGABYTE G482-Z54-00 | 23.02-py3 | INT8 | Synthetic | TensorRT 8.5.3 | L40

Sequence length=128 for BERT-BASE and BERT-LARGE
1,024x1,024 image size, 50 denoising steps for Stable Diffusion

L4 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
Stable Diffusion v2.1 (512x512) | 1 | 0.47 images/sec | - | 2,113.07 | 1x L4 | GIGABYTE G482-Z54-00 | - | Mixed | LAION-5B | TensorRT 8.6.0 | L4
ResNet-50 | 8 | 9,486 images/sec | 132 images/sec/watt | 0.84 | 1x L4 | GIGABYTE G482-Z54-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | L4
ResNet-50v1.5 | 8 | 9,130 images/sec | 127 images/sec/watt | 0.88 | 1x L4 | GIGABYTE G482-Z54-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | L4
BERT-BASE | 8 | 3,256 sequences/sec | 45 sequences/sec/watt | 2.46 | 1x L4 | GIGABYTE G482-Z54-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | L4
BERT-BASE | 38 | 3,816 sequences/sec | 53 sequences/sec/watt | 9.96 | 1x L4 | GIGABYTE G482-Z54-00 | 23.02-py3 | INT8 | Synthetic | TensorRT 8.5.3 | L4
BERT-LARGE | 8 | 1,025 sequences/sec | 14 sequences/sec/watt | 7.8 | 1x L4 | GIGABYTE G482-Z54-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | L4
BERT-LARGE | 12 | 1,224 sequences/sec | 17 sequences/sec/watt | 9.8 | 1x L4 | GIGABYTE G482-Z54-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | L4
EfficientNet-B0 | 8 | 10,267 images/sec | 145 images/sec/watt | 0.78 | 1x L4 | GIGABYTE G482-Z54-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | L4
EfficientNet-B4 | 8 | 1,732 images/sec | 24 images/sec/watt | 4.62 | 1x L4 | GIGABYTE G482-Z54-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | L4

Sequence length=128 for BERT-BASE and BERT-LARGE
512x512 image size, 50 denoising steps for Stable Diffusion

A100 Full Chip Inference Performance

Network | Batch Size | Full Chip Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
Stable Diffusion v2.1 (512x512) | 1 | 1.25 images/sec | - | 803.02 | 1x A100 | DGX A100 | - | Mixed | LAION-5B | TensorRT 8.6.0 | A100-SXM4-80GB
Stable Diffusion v2.1 (512x512) | 4 | 1.65 images/sec | - | 2,428.39 | 1x A100 | DGX A100 | - | Mixed | LAION-5B | TensorRT 8.6.0 | A100-SXM4-80GB
ResNet-50 | 8 | 11,623 images/sec | 59 images/sec/watt | 0.69 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
ResNet-50 | 128 | 30,780 images/sec | 81 images/sec/watt | 4.16 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
ResNet-50 | 225 | 32,533 images/sec | - | 6.91 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
ResNet-50v1.5 | 8 | 11,638 images/sec | 62 images/sec/watt | 0.69 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-40GB
ResNet-50v1.5 | 128 | 29,965 images/sec | 78 images/sec/watt | 4.27 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
ResNet-50v1.5 | 216 | 31,307 images/sec | - | 6.9 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
BERT-BASE | 1 | For Batch Size 1, please refer to the Triton Inference Server page
BERT-BASE | 2 | For Batch Size 2, please refer to the Triton Inference Server page
BERT-BASE | 8 | 6,069 sequences/sec | 22 sequences/sec/watt | 1.32 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
BERT-BASE | 128 | 11,227 sequences/sec | 28 sequences/sec/watt | 11.4 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
BERT-LARGE | 1 | For Batch Size 1, please refer to the Triton Inference Server page
BERT-LARGE | 2 | For Batch Size 2, please refer to the Triton Inference Server page
BERT-LARGE | 8 | 2,324 sequences/sec | 8 sequences/sec/watt | 3.44 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
BERT-LARGE | 128 | 4,030 sequences/sec | 10 sequences/sec/watt | 31.76 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
EfficientNet-B0 | 8 | 9,182 images/sec | 65 images/sec/watt | 0.87 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-40GB
EfficientNet-B0 | 128 | 30,360 images/sec | 95 images/sec/watt | 4.22 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
EfficientNet-B0 | 224 | 32,915 images/sec | - | 6.81 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
EfficientNet-B4 | 8 | 2,596 images/sec | 11 images/sec/watt | 3.08 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
EfficientNet-B4 | 25 | 3,717 images/sec | - | 6.73 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
EfficientNet-B4 | 128 | 4,587 images/sec | 12 images/sec/watt | 27.91 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB

Sequence length=128 for BERT-BASE and BERT-LARGE
512x512 image size, 50 denoising steps for Stable Diffusion
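For the single-GPU rows above, throughput and latency are linked by a simple relation: with one batch in flight, throughput is roughly batch size divided by latency. A quick sanity check of this relation in Python, using the ResNet-50 row above (batch 128, 4.16 ms, 30,780 images/sec) as the example:

```python
def throughput_from_latency(batch_size: int, latency_ms: float) -> float:
    """Images/sec implied by one in-flight batch completing every latency_ms."""
    return batch_size / (latency_ms / 1000.0)

# ResNet-50 on A100, batch 128 at 4.16 ms (row above) reports 30,780 images/sec.
implied = throughput_from_latency(128, 4.16)
assert abs(implied - 30_780) / 30_780 < 0.01  # within 1% of the reported number
```

The same arithmetic explains why the "max batch under ~7 ms" rows (e.g., batch 225 at 6.91 ms) show the highest throughput: batch size is raised until the latency budget is exhausted.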

A100 1/7 MIG Inference Performance

Network | Batch Size | 1/7 MIG Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 3,749 images/sec | 32 images/sec/watt | 2.13 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
ResNet-50 | 30 | 4,347 images/sec | - | 6.9 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
ResNet-50 | 128 | 4,705 images/sec | 40 images/sec/watt | 27.2 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
ResNet-50v1.5 | 8 | 3,656 images/sec | 33 images/sec/watt | 2.19 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
ResNet-50v1.5 | 28 | 4,160 images/sec | - | 6.73 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
ResNet-50v1.5 | 128 | 4,553 images/sec | 37 images/sec/watt | 28.11 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
BERT-BASE | 8 | 1,540 sequences/sec | 13 sequences/sec/watt | 5.19 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
BERT-BASE | 128 | 1,715 sequences/sec | 13 sequences/sec/watt | 74.64 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
BERT-LARGE | 8 | 521 sequences/sec | 4 sequences/sec/watt | 15.34 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
BERT-LARGE | 128 | 586 sequences/sec | 5 sequences/sec/watt | 218.55 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB

Sequence length=128 for BERT-BASE and BERT-LARGE

A100 7 MIG Inference Performance

Network | Batch Size | 7 MIG Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 25,879 images/sec | 90 images/sec/watt | 2.17 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
ResNet-50 | 29 | 30,131 images/sec | - | 6.74 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
ResNet-50 | 128 | 32,808 images/sec | 89 images/sec/watt | 27 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
ResNet-50v1.5 | 8 | 25,302 images/sec | 78 images/sec/watt | 2.22 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
ResNet-50v1.5 | 29 | 29,295 images/sec | - | 6.94 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
ResNet-50v1.5 | 128 | 31,736 images/sec | 83 images/sec/watt | 28.27 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
BERT-BASE | 8 | 10,717 sequences/sec | 27 sequences/sec/watt | 5.25 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
BERT-BASE | 128 | 11,649 sequences/sec | 30 sequences/sec/watt | 77.04 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
BERT-LARGE | 8 | 3,605 sequences/sec | 9 sequences/sec/watt | 15.57 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB
BERT-LARGE | 128 | 3,954 sequences/sec | 10 sequences/sec/watt | 226.89 | 1x A100 | DGX A100 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A100-SXM4-80GB

Sequence length=128 for BERT-BASE and BERT-LARGE

 

A40 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 10,082 images/sec | 38 images/sec/watt | 0.79 | 1x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A40
ResNet-50 | 106 | 15,874 images/sec | - | 6.68 | 1x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A40
ResNet-50 | 128 | 15,713 images/sec | 53 images/sec/watt | 8.15 | 1x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A40
ResNet-50v1.5 | 8 | 9,754 images/sec | 37 images/sec/watt | 0.82 | 1x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A40
ResNet-50v1.5 | 101 | 15,060 images/sec | - | 6.71 | 1x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A40
ResNet-50v1.5 | 128 | 15,090 images/sec | 50 images/sec/watt | 8.48 | 1x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A40
BERT-BASE | 8 | 4,109 sequences/sec | 15 sequences/sec/watt | 1.95 | 1x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A40
BERT-BASE | 128 | 5,634 sequences/sec | 19 sequences/sec/watt | 22.72 | 1x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A40
BERT-LARGE | 8 | 1,471 sequences/sec | 5 sequences/sec/watt | 5.44 | 1x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A40
BERT-LARGE | 128 | 1,953 sequences/sec | 7 sequences/sec/watt | 65.53 | 1x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A40
EfficientNet-B0 | 8 | 9,329 images/sec | 50 images/sec/watt | 0.86 | 1x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A40
EfficientNet-B0 | 128 | 19,322 images/sec | 65 images/sec/watt | 6.62 | 1x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A40
EfficientNet-B0 | 130 | 19,408 images/sec | - | 6.7 | 1x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A40
EfficientNet-B4 | 8 | 1,960 images/sec | 7 images/sec/watt | 4.08 | 1x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A40
EfficientNet-B4 | 14 | 2,188 images/sec | - | 6.4 | 1x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A40
EfficientNet-B4 | 128 | 2,646 images/sec | 9 images/sec/watt | 48.37 | 1x A40 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A40

Sequence length=128 for BERT-BASE and BERT-LARGE

 

A30 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 8,901 images/sec | 71 images/sec/watt | 0.9 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
ResNet-50 | 106 | 15,804 images/sec | - | 6.71 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
ResNet-50 | 128 | 16,019 images/sec | 97 images/sec/watt | 7.99 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
ResNet-50v1.5 | 8 | 8,752 images/sec | 66 images/sec/watt | 0.91 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
ResNet-50v1.5 | 102 | 15,294 images/sec | - | 6.67 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
ResNet-50v1.5 | 128 | 15,562 images/sec | 95 images/sec/watt | 8.23 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
BERT-BASE | 1 | For Batch Size 1, please refer to the Triton Inference Server page
BERT-BASE | 2 | For Batch Size 2, please refer to the Triton Inference Server page
BERT-BASE | 8 | 4,144 sequences/sec | 26 sequences/sec/watt | 1.93 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
BERT-BASE | 128 | 5,538 sequences/sec | 34 sequences/sec/watt | 23.1 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
BERT-LARGE | 1 | For Batch Size 1, please refer to the Triton Inference Server page
BERT-LARGE | 2 | For Batch Size 2, please refer to the Triton Inference Server page
BERT-LARGE | 8 | 1,455 sequences/sec | 9 sequences/sec/watt | 5.5 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
BERT-LARGE | 128 | 1,958 sequences/sec | 12 sequences/sec/watt | 65.37 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
EfficientNet-B0 | 8 | 7,560 images/sec | 76 images/sec/watt | 1.06 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
EfficientNet-B0 | 111 | 16,669 images/sec | - | 6.65 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
EfficientNet-B0 | 128 | 16,765 images/sec | 101 images/sec/watt | 7.64 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
EfficientNet-B4 | 8 | 1,719 images/sec | 12 images/sec/watt | 4.65 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
EfficientNet-B4 | 13 | 1,944 images/sec | - | 6.69 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
EfficientNet-B4 | 128 | 2,344 images/sec | 14 images/sec/watt | 54.61 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30

Sequence length=128 for BERT-BASE and BERT-LARGE

 

A30 1/4 MIG Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 3,633 images/sec | 43 images/sec/watt | 2.2 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
ResNet-50 | 29 | 4,259 images/sec | - | 6.81 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
ResNet-50 | 128 | 4,630 images/sec | 49 images/sec/watt | 27.65 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
ResNet-50v1.5 | 8 | 3,546 images/sec | 42 images/sec/watt | 2.26 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
ResNet-50v1.5 | 28 | 4,106 images/sec | - | 6.81 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
ResNet-50v1.5 | 128 | 4,482 images/sec | 48 images/sec/watt | 28.56 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
BERT-BASE | 8 | 1,514 sequences/sec | 16 sequences/sec/watt | 5.28 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
BERT-BASE | 128 | 1,623 sequences/sec | 17 sequences/sec/watt | 78.86 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
BERT-LARGE | 8 | 511 sequences/sec | 6 sequences/sec/watt | 15.67 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
BERT-LARGE | 128 | 568 sequences/sec | 6 sequences/sec/watt | 15.67 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30

Sequence length=128 for BERT-BASE and BERT-LARGE

 

A30 4 MIG Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 14,094 images/sec | 86 images/sec/watt | 2.28 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
ResNet-50 | 27 | 16,262 images/sec | - | 6.81 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
ResNet-50 | 128 | 17,287 images/sec | 106 images/sec/watt | 29.71 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
ResNet-50v1.5 | 8 | 13,728 images/sec | 83 images/sec/watt | 2.33 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
ResNet-50v1.5 | 26 | 15,707 images/sec | - | 6.65 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
ResNet-50v1.5 | 128 | 16,713 images/sec | 101 images/sec/watt | 30.73 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
BERT-BASE | 8 | 5,437 sequences/sec | 34 sequences/sec/watt | 6 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
BERT-BASE | 128 | 5,757 sequences/sec | 35 sequences/sec/watt | 90.74 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
BERT-LARGE | 8 | 1,850 sequences/sec | 11 sequences/sec/watt | 17.42 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30
BERT-LARGE | 128 | 2,024 sequences/sec | 12 sequences/sec/watt | 254.33 | 1x A30 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A30

Sequence length=128 for BERT-BASE and BERT-LARGE

 

A10 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 8,124 images/sec | 54 images/sec/watt | 0.98 | 1x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A10
ResNet-50 | 72 | 10,867 images/sec | - | 6.63 | 1x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A10
ResNet-50 | 128 | 11,438 images/sec | 76 images/sec/watt | 11.19 | 1x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A10
ResNet-50v1.5 | 8 | 7,618 images/sec | 51 images/sec/watt | 1.05 | 1x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A10
ResNet-50v1.5 | 69 | 10,462 images/sec | - | 6.6 | 1x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A10
ResNet-50v1.5 | 128 | 10,887 images/sec | 73 images/sec/watt | 11.76 | 1x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A10
BERT-BASE | 1 | For Batch Size 1, please refer to the Triton Inference Server page
BERT-BASE | 2 | For Batch Size 2, please refer to the Triton Inference Server page
BERT-BASE | 8 | 3,244 sequences/sec | 22 sequences/sec/watt | 2.47 | 1x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A10
BERT-BASE | 128 | 3,889 sequences/sec | 26 sequences/sec/watt | 32.91 | 1x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A10
BERT-LARGE | 1 | For Batch Size 1, please refer to the Triton Inference Server page
BERT-LARGE | 2 | For Batch Size 2, please refer to the Triton Inference Server page
BERT-LARGE | 8 | 1,080 sequences/sec | 7 sequences/sec/watt | 7.41 | 1x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A10
BERT-LARGE | 128 | 1,244 sequences/sec | 8 sequences/sec/watt | 102.9 | 1x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A10
EfficientNet-B0 | 8 | 8,248 images/sec | 55 images/sec/watt | 0.97 | 1x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A10
EfficientNet-B0 | 128 | 14,047 images/sec | 94 images/sec/watt | 9.11 | 1x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A10
EfficientNet-B4 | 8 | 1,518 images/sec | 10 images/sec/watt | 5.27 | 1x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A10
EfficientNet-B4 | 128 | 1,828 images/sec | 12 images/sec/watt | 70.01 | 1x A10 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A10

Sequence length=128 for BERT-BASE and BERT-LARGE

 

A2 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 2,616 images/sec | 44 images/sec/watt | 3.06 | 1x A2 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A2
ResNet-50 | 19 | 2,889 images/sec | - | 6.58 | 1x A2 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A2
ResNet-50 | 128 | 3,033 images/sec | 51 images/sec/watt | 42.2 | 1x A2 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A2
ResNet-50v1.5 | 8 | 2,505 images/sec | 42 images/sec/watt | 3.19 | 1x A2 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A2
ResNet-50v1.5 | 18 | 2,753 images/sec | - | 6.54 | 1x A2 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A2
ResNet-50v1.5 | 128 | 2,894 images/sec | 49 images/sec/watt | 44.24 | 1x A2 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A2
BERT-BASE | 8 | 869 sequences/sec | 14 sequences/sec/watt | 9.21 | 1x A2 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A2
BERT-BASE | 128 | 955 sequences/sec | 16 sequences/sec/watt | 134.05 | 1x A2 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A2
BERT-LARGE | 8 | 291 sequences/sec | 5 sequences/sec/watt | 27.46 | 1x A2 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A2
BERT-LARGE | 128 | 313 sequences/sec | 5 sequences/sec/watt | 409.55 | 1x A2 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A2
EfficientNet-B0 | 8 | 3,050 images/sec | 58 images/sec/watt | 2.62 | 1x A2 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A2
EfficientNet-B0 | 24 | 3,601 images/sec | - | 6.66 | 1x A2 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A2
EfficientNet-B0 | 128 | 3,885 images/sec | 65 images/sec/watt | 32.94 | 1x A2 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A2
EfficientNet-B4 | 8 | 468 images/sec | 8 images/sec/watt | 17.08 | 1x A2 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A2
EfficientNet-B4 | 128 | 515 images/sec | 9 images/sec/watt | 248.65 | 1x A2 | GIGABYTE G482-Z52-00 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | A2

Sequence length=128 for BERT-BASE and BERT-LARGE

 

T4 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
Stable Diffusion v2.1 (512x512) | 1 | 0.23 images/sec | - | 4,432.1 | 1x T4 | Supermicro SYS-4029GP-TRT | - | Mixed | LAION-5B | TensorRT 8.5.2 | NVIDIA T4
ResNet-50 | 8 | 3,750 images/sec | 54 images/sec/watt | 2.13 | 1x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | NVIDIA T4
ResNet-50 | 30 | 4,469 images/sec | - | 6.71 | 1x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | NVIDIA T4
ResNet-50 | 128 | 4,777 images/sec | 68 images/sec/watt | 26.8 | 1x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | NVIDIA T4
ResNet-50v1.5 | 8 | 3,603 images/sec | 51 images/sec/watt | 2.22 | 1x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | NVIDIA T4
ResNet-50v1.5 | 28 | 4,175 images/sec | - | 6.71 | 1x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | NVIDIA T4
ResNet-50v1.5 | 128 | 4,445 images/sec | 64 images/sec/watt | 28.8 | 1x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | NVIDIA T4
BERT-BASE | 1 | For Batch Size 1, please refer to the Triton Inference Server page
BERT-BASE | 2 | For Batch Size 2, please refer to the Triton Inference Server page
BERT-BASE | 8 | 1,668 sequences/sec | 24 sequences/sec/watt | 4.8 | 1x T4 | Supermicro SYS-1029GQ-TRT | 23.01-py3 | INT8 | Synthetic | TensorRT 8.5.2 | NVIDIA T4
BERT-BASE | 128 | 1,767 sequences/sec | 25 sequences/sec/watt | 72.44 | 1x T4 | Supermicro SYS-1029GQ-TRT | 23.01-py3 | INT8 | Synthetic | TensorRT 8.5.2 | NVIDIA T4
BERT-LARGE | 1 | For Batch Size 1, please refer to the Triton Inference Server page
BERT-LARGE | 2 | For Batch Size 2, please refer to the Triton Inference Server page
BERT-LARGE | 8 | 456 sequences/sec | 7 sequences/sec/watt | 17.54 | 1x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | NVIDIA T4
BERT-LARGE | 128 | 488 sequences/sec | 7 sequences/sec/watt | 262.14 | 1x T4 | Supermicro SYS-1029GQ-TRT | 23.01-py3 | INT8 | Synthetic | TensorRT 8.5.2 | NVIDIA T4
EfficientNet-B0 | 8 | 4,702 images/sec | 67 images/sec/watt | 1.7 | 1x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | NVIDIA T4
EfficientNet-B0 | 40 | 5,957 images/sec | - | 6.55 | 1x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | NVIDIA T4
EfficientNet-B0 | 128 | 6,293 images/sec | 90 images/sec/watt | 20.34 | 1x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | NVIDIA T4
EfficientNet-B4 | 8 | 775 images/sec | 11 images/sec/watt | 10.32 | 1x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | NVIDIA T4
EfficientNet-B4 | 128 | 840 images/sec | 12 images/sec/watt | 152.39 | 1x T4 | Supermicro SYS-4029GP-TRT | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | NVIDIA T4

Sequence length=128 for BERT-BASE and BERT-LARGE
512x512 image size, 50 denoising steps for Stable Diffusion



V100 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 4,396 images/sec | 15 images/sec/watt | 1.82 | 1x V100 | DGX-2 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | V100-SXM3-32GB
ResNet-50 | 128 | 7,855 images/sec | 23 images/sec/watt | 16.29 | 1x V100 | DGX-2 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | V100-SXM3-32GB
ResNet-50v1.5 | 8 | 4,310 images/sec | 15 images/sec/watt | 1.86 | 1x V100 | DGX-2 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | V100-SXM3-32GB
ResNet-50v1.5 | 128 | 7,553 images/sec | 22 images/sec/watt | 16.95 | 1x V100 | DGX-2 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | V100-SXM3-32GB
BERT-BASE | 1 | For Batch Size 1, please refer to the Triton Inference Server page
BERT-BASE | 2 | For Batch Size 2, please refer to the Triton Inference Server page
BERT-BASE | 8 | 2,261 sequences/sec | 7 sequences/sec/watt | 3.54 | 1x V100 | DGX-2 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | V100-SXM3-32GB
BERT-BASE | 128 | 3,107 sequences/sec | 9 sequences/sec/watt | 41.2 | 1x V100 | DGX-2 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | V100-SXM3-32GB
BERT-LARGE | 1 | For Batch Size 1, please refer to the Triton Inference Server page
BERT-LARGE | 2 | For Batch Size 2, please refer to the Triton Inference Server page
BERT-LARGE | 8 | 727 sequences/sec | 2 sequences/sec/watt | 11.01 | 1x V100 | DGX-2 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | V100-SXM3-32GB
BERT-LARGE | 128 | 947 sequences/sec | 3 sequences/sec/watt | 135.17 | 1x V100 | DGX-2 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | V100-SXM3-32GB
EfficientNet-B0 | 8 | 4,735 images/sec | 22 images/sec/watt | 1.69 | 1x V100 | DGX-2 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | V100-SXM3-32GB
EfficientNet-B0 | 128 | 9,509 images/sec | 30 images/sec/watt | 13.46 | 1x V100 | DGX-2 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | V100-SXM3-32GB
EfficientNet-B4 | 8 | 950 images/sec | 3 images/sec/watt | 8.42 | 1x V100 | DGX-2 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | V100-SXM3-32GB
EfficientNet-B4 | 128 | 1,265 images/sec | 4 images/sec/watt | 101.18 | 1x V100 | DGX-2 | 23.03-py3 | INT8 | Synthetic | TensorRT 8.5.3 | V100-SXM3-32GB

Sequence length=128 for BERT-BASE and BERT-LARGE


Inference Performance of NVIDIA Data Center GPUs on Cloud

Benchmarks are reproducible by following the links to the NGC catalog scripts.

A100 Inference Performance on Cloud

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50v1.5 | 8 | 11,390 images/sec | - | 0.7 | 1x A100 | GCP A2-HIGHGPU-1G | 23.02-py3 | INT8 | Synthetic | - | A100-SXM4-40GB
ResNet-50v1.5 | 128 | 28,449 images/sec | - | 4.5 | 1x A100 | GCP A2-HIGHGPU-1G | 23.02-py3 | INT8 | Synthetic | - | A100-SXM4-40GB
ResNet-50v1.5 | 8 | 11,402 images/sec | - | 0.7 | 1x A100 | AWS EC2 p4d.24xlarge | 23.02-py3 | INT8 | Synthetic | - | A100-SXM4-40GB
ResNet-50v1.5 | 128 | 28,458 images/sec | - | 4.5 | 1x A100 | AWS EC2 p4d.24xlarge | 23.02-py3 | INT8 | Synthetic | - | A100-SXM4-40GB
ResNet-50v1.5 | 8 | 11,578 images/sec | - | 0.69 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.12-py3 | INT8 | Synthetic | - | A100-SXM4-80GB
ResNet-50v1.5 | 128 | 29,492 images/sec | - | 4.34 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.12-py3 | INT8 | Synthetic | - | A100-SXM4-80GB
BERT-LARGE | 8 | 2,291 sequences/sec | - | 3.49 | 1x A100 | GCP A2-HIGHGPU-1G | 23.02-py3 | INT8 | Synthetic | - | A100-SXM4-40GB
BERT-LARGE | 128 | 3,905 sequences/sec | - | 32.78 | 1x A100 | GCP A2-HIGHGPU-1G | 23.02-py3 | INT8 | Synthetic | - | A100-SXM4-40GB
BERT-LARGE | 8 | 2,679 sequences/sec | - | 2.99 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.12-py3 | INT8 | Synthetic | - | A100-SXM4-80GB
BERT-LARGE | 128 | 4,787 sequences/sec | - | 26.74 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.12-py3 | INT8 | Synthetic | - | A100-SXM4-80GB
BERT-LARGE | 8 | 2,303 sequences/sec | - | 3.47 | 1x A100 | AWS EC2 p4d.24xlarge | 23.02-py3 | INT8 | Synthetic | - | A100-SXM4-80GB
BERT-LARGE | 128 | 3,998 sequences/sec | - | 32.02 | 1x A100 | AWS EC2 p4d.24xlarge | 23.02-py3 | INT8 | Synthetic | - | A100-SXM4-80GB

BERT-Large: Sequence Length = 128

Conversational AI

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.

Related Resources

Download and get started with NVIDIA Riva.


Riva Benchmarks

A100 ASR Benchmarks

A100 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 10.8 | 1 | A100 SXM4-40GB
citrinet | n-gram | 64 | 63.9 | 63 | A100 SXM4-40GB
citrinet | n-gram | 128 | 107.5 | 126 | A100 SXM4-40GB
citrinet | n-gram | 256 | 172.3 | 248 | A100 SXM4-40GB
citrinet | n-gram | 384 | 242 | 367 | A100 SXM4-40GB
citrinet | n-gram | 512 | 318 | 482 | A100 SXM4-40GB
citrinet | n-gram | 768 | 484 | 703 | A100 SXM4-40GB
conformer | n-gram | 1 | 16 | 1 | A100 SXM4-40GB
conformer | n-gram | 64 | 97.2 | 63 | A100 SXM4-40GB
conformer | n-gram | 128 | 138 | 126 | A100 SXM4-40GB
conformer | n-gram | 256 | 233 | 247 | A100 SXM4-40GB
conformer | n-gram | 384 | 334 | 365 | A100 SXM4-40GB
conformer | n-gram | 512 | 469 | 478 | A100 SXM4-40GB

A100 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 9.9 | 1 | A100 SXM4-40GB
citrinet | n-gram | 8 | 14.48 | 8 | A100 SXM4-40GB
citrinet | n-gram | 16 | 23 | 16 | A100 SXM4-40GB
citrinet | n-gram | 32 | 34.9 | 32 | A100 SXM4-40GB
citrinet | n-gram | 48 | 46.2 | 48 | A100 SXM4-40GB
citrinet | n-gram | 64 | 55.4 | 63 | A100 SXM4-40GB
conformer | n-gram | 1 | 13.92 | 1 | A100 SXM4-40GB
conformer | n-gram | 8 | 26.19 | 8 | A100 SXM4-40GB
conformer | n-gram | 16 | 37.1 | 16 | A100 SXM4-40GB
conformer | n-gram | 32 | 52.4 | 32 | A100 SXM4-40GB
conformer | n-gram | 48 | 61.9 | 48 | A100 SXM4-40GB
conformer | n-gram | 64 | 75.5 | 63 | A100 SXM4-40GB

A100 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version
citrinet | n-gram | 32 | 4000 | A100 SXM4-40GB
conformer | n-gram | 32 | 1460 | A100 SXM4-40GB

ASR Throughput (RTFX) - Number of seconds of audio processed per second | Riva version: v2.10.0 | ASR Dataset - Librispeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz
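RTFX as defined here (seconds of audio processed per second of wall-clock time) is straightforward to compute from a benchmark run; a minimal sketch (the function name and example figures are illustrative, not taken from these measurements):

```python
def rtfx(audio_seconds: float, wall_clock_seconds: float) -> float:
    """Real-time factor (RTFX): seconds of audio processed per second of compute."""
    return audio_seconds / wall_clock_seconds

# In streaming mode, N real-time streams kept up with (no backlog) give RTFX ~= N,
# since each stream delivers one second of audio per second of wall-clock time.
print(rtfx(audio_seconds=512.0, wall_clock_seconds=1.0))   # -> 512.0
# Offline mode: e.g. one hour of audio transcribed in 36 s is RTFX 100.
print(rtfx(audio_seconds=3600.0, wall_clock_seconds=36.0))  # -> 100.0
```

This is why the streaming tables show throughput tracking the stream count closely (e.g., 512 streams at RTFX 482), while offline mode, free of latency constraints, reaches much higher RTFX.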

A30 ASR Benchmarks

A30 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 14.88 | 1 | A30
citrinet | n-gram | 64 | 106.1 | 63 | A30
citrinet | n-gram | 128 | 149 | 125 | A30
citrinet | n-gram | 256 | 274.5 | 245 | A30
citrinet | n-gram | 384 | 422 | 359 | A30
citrinet | n-gram | 512 | 620 | 467 | A30
conformer | n-gram | 1 | 21 | 1 | A30
conformer | n-gram | 64 | 140 | 63 | A30
conformer | n-gram | 128 | 209 | 125 | A30
conformer | n-gram | 256 | 374.4 | 244 | A30

A30 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 13.839 | 1 | A30
citrinet | n-gram | 8 | 23.4 | 8 | A30
citrinet | n-gram | 16 | 40.48 | 16 | A30
citrinet | n-gram | 32 | 60.2 | 32 | A30
citrinet | n-gram | 48 | 64.3 | 48 | A30
citrinet | n-gram | 64 | 83 | 63 | A30
conformer | n-gram | 1 | 18.934 | 1 | A30
conformer | n-gram | 8 | 40.6 | 8 | A30
conformer | n-gram | 16 | 52.9 | 16 | A30
conformer | n-gram | 32 | 65.9 | 32 | A30
conformer | n-gram | 48 | 91.2 | 47 | A30

A30 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version
citrinet | n-gram | 32 | 2755 | A30
conformer | n-gram | 32 | 904 | A30


A10 ASR Benchmarks

A10 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 12.18 | 1 | A10
citrinet | n-gram | 64 | 82.7 | 63 | A10
citrinet | n-gram | 128 | 152 | 126 | A10
citrinet | n-gram | 256 | 292.3 | 247 | A10
citrinet | n-gram | 384 | 433 | 363 | A10
citrinet | n-gram | 512 | 612 | 476 | A10
conformer | n-gram | 1 | 15.66 | 1 | A10
conformer | n-gram | 64 | 137 | 63 | A10
conformer | n-gram | 128 | 242 | 125 | A10
conformer | n-gram | 256 | 436 | 245 | A10

A10 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 11.53 | 1 | A10
citrinet | n-gram | 8 | 17.9 | 8 | A10
citrinet | n-gram | 16 | 29.2 | 16 | A10
citrinet | n-gram | 32 | 49.3 | 32 | A10
citrinet | n-gram | 48 | 67.6 | 48 | A10
citrinet | n-gram | 64 | 81.8 | 63 | A10
conformer | n-gram | 1 | 13.988 | 1 | A10
conformer | n-gram | 8 | 27.3 | 8 | A10
conformer | n-gram | 16 | 39.3 | 16 | A10
conformer | n-gram | 32 | 69 | 32 | A10
conformer | n-gram | 48 | 102.1 | 47.445 | A10

A10 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version
citrinet | n-gram | 32 | 2410 | A10
conformer | n-gram | 32 | 907 | A10


V100 ASR Benchmarks

V100 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 13.7 | 1 | V100 SXM2-16GB
citrinet | n-gram | 64 | 85 | 63 | V100 SXM2-16GB
citrinet | n-gram | 128 | 154.6 | 125 | V100 SXM2-16GB
citrinet | n-gram | 256 | 281 | 246 | V100 SXM2-16GB
citrinet | n-gram | 384 | 416 | 362 | V100 SXM2-16GB
citrinet | n-gram | 512 | 573 | 473 | V100 SXM2-16GB
conformer | n-gram | 1 | 20.4 | 1 | V100 SXM2-16GB
conformer | n-gram | 64 | 146 | 63 | V100 SXM2-16GB
conformer | n-gram | 128 | 244 | 125 | V100 SXM2-16GB
conformer | n-gram | 256 | 415 | 244 | V100 SXM2-16GB

V100 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 12.8 | 1 | V100 SXM2-16GB
citrinet | n-gram | 8 | 20.67 | 8 | V100 SXM2-16GB
citrinet | n-gram | 16 | 29.1 | 16 | V100 SXM2-16GB
citrinet | n-gram | 32 | 47.6 | 32 | V100 SXM2-16GB
citrinet | n-gram | 48 | 64.7 | 48 | V100 SXM2-16GB
citrinet | n-gram | 64 | 77 | 63 | V100 SXM2-16GB
conformer | n-gram | 1 | 19.2 | 1 | V100 SXM2-16GB
conformer | n-gram | 8 | 41.3 | 8 | V100 SXM2-16GB
conformer | n-gram | 16 | 49.7 | 16 | V100 SXM2-16GB
conformer | n-gram | 32 | 80.1 | 32 | V100 SXM2-16GB
conformer | n-gram | 48 | 111.9 | 47 | V100 SXM2-16GB

V100 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version
citrinet | n-gram | 32 | 2415 | V100 SXM2-16GB
conformer | n-gram | 32 | 895 | V100 SXM2-16GB




T4 ASR Benchmarks

T4 Best Streaming Throughput Mode (800 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 23.23 | 1 | NVIDIA T4
citrinet | n-gram | 64 | 157.6 | 63 | NVIDIA T4
citrinet | n-gram | 128 | 323.2 | 123 | NVIDIA T4
citrinet | n-gram | 256 | 753.7 | 236 | NVIDIA T4
conformer | n-gram | 1 | 36.07 | 1 | NVIDIA T4
conformer | n-gram | 64 | 238.7 | 63 | NVIDIA T4
conformer | n-gram | 128 | 514.5 | 122 | NVIDIA T4

T4 Best Streaming Latency Mode (160 ms chunk)
Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version
citrinet | n-gram | 1 | 20.76 | 1 | NVIDIA T4
citrinet | n-gram | 8 | 41.7 | 8 | NVIDIA T4
citrinet | n-gram | 16 | 54.7 | 16 | NVIDIA T4
citrinet | n-gram | 32 | 76.81 | 32 | NVIDIA T4
citrinet | n-gram | 48 | 122.6 | 47 | NVIDIA T4
conformer | n-gram | 1 | 28.6 | 1 | NVIDIA T4
conformer | n-gram | 8 | 56.8 | 8 | NVIDIA T4
conformer | n-gram | 16 | 62.18 | 16 | NVIDIA T4
conformer | n-gram | 32 | 144 | 31.499 | NVIDIA T4

T4 Offline Mode (1600 ms chunk)
Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version
citrinet | n-gram | 32 | 1185 | NVIDIA T4
conformer | n-gram | 32 | 425 | NVIDIA T4


A100 TTS Benchmarks

Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version
FastPitch + Hifi-GAN | 1 | 0.021 | 0.003 | 147 | A100 SXM4-40GB
FastPitch + Hifi-GAN | 4 | 0.041 | 0.005 | 327 | A100 SXM4-40GB
FastPitch + Hifi-GAN | 6 | 0.06 | 0.006 | 366 | A100 SXM4-40GB
FastPitch + Hifi-GAN | 8 | 0.071 | 0.008 | 403 | A100 SXM4-40GB
FastPitch + Hifi-GAN | 10 | 0.079 | 0.008 | 423 | A100 SXM4-40GB

TTS Throughput (RTFX) - Number of seconds of audio generated per second | Riva version: v2.10.0 | TTS Dataset - LJSpeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz
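The two TTS latency columns combine into a simple end-to-end estimate for streaming synthesis. The sketch below is illustrative only: the function name is hypothetical, the chunk count is an arbitrary example (not a Riva parameter), and the input numbers are the A100 single-stream row above.

```python
# Hedged sketch: combining "latency to first audio" with "latency between
# audio chunks" into the wall-clock time until the final chunk arrives.
# Function name and chunk count are hypothetical illustrations.

def streaming_latency(first_audio_s: float, between_chunks_s: float,
                      num_chunks: int) -> float:
    """Time until the last audio chunk of a streamed utterance is delivered."""
    return first_audio_s + (num_chunks - 1) * between_chunks_s

# A100 single-stream row: 0.021 s to first audio, 0.003 s between chunks.
total = streaming_latency(0.021, 0.003, num_chunks=10)
print(round(total, 3))  # prints 0.048
```

The key point the tables make is that only the first chunk pays the full synthesis pipeline latency; subsequent chunks arrive an order of magnitude faster, so perceived responsiveness is dominated by the first column.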

A30 TTS Benchmarks

Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version
FastPitch + Hifi-GAN | 1 | 0.022 | 0.004 | 128 | A30
FastPitch + Hifi-GAN | 4 | 0.048 | 0.007 | 267 | A30
FastPitch + Hifi-GAN | 6 | 0.084 | 0.008 | 270 | A30
FastPitch + Hifi-GAN | 8 | 0.102 | 0.009 | 302 | A30
FastPitch + Hifi-GAN | 10 | 0.118 | 0.009 | 316 | A30


A10 TTS Benchmarks

Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version
FastPitch + Hifi-GAN | 1 | 0.022 | 0.004 | 131 | A10
FastPitch + Hifi-GAN | 4 | 0.055 | 0.008 | 235 | A10
FastPitch + Hifi-GAN | 6 | 0.09 | 0.009 | 247 | A10
FastPitch + Hifi-GAN | 8 | 0.118 | 0.01 | 260 | A10
FastPitch + Hifi-GAN | 10 | 0.137 | 0.011 | 265 | A10


V100 TTS Benchmarks

Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version
FastPitch + Hifi-GAN | 1 | 0.026 | 0.005 | 102 | V100 SXM2-16GB
FastPitch + Hifi-GAN | 4 | 0.061 | 0.009 | 207 | V100 SXM2-16GB
FastPitch + Hifi-GAN | 6 | 0.105 | 0.01 | 214 | V100 SXM2-16GB
FastPitch + Hifi-GAN | 8 | 0.129 | 0.012 | 229 | V100 SXM2-16GB
FastPitch + Hifi-GAN | 10 | 0.148 | 0.013 | 239 | V100 SXM2-16GB




T4 TTS Benchmarks

Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version
FastPitch + Hifi-GAN | 1 | 0.028 | 0.007 | 85 | NVIDIA T4
FastPitch + Hifi-GAN | 4 | 0.085 | 0.015 | 133 | NVIDIA T4
FastPitch + Hifi-GAN | 6 | 0.139 | 0.019 | 139 | NVIDIA T4
FastPitch + Hifi-GAN | 8 | 0.194 | 0.022 | 142 | NVIDIA T4
FastPitch + Hifi-GAN | 10 | 0.227 | 0.025 | 144 | NVIDIA T4



Last updated: April 5th, 2023