Reproducible Performance

Reproduce these results on your own systems by following the instructions in the Measuring Training and Inferencing Performance on NVIDIA AI Platforms Reviewer’s Guide.

Related Resources

HPC Performance

Review the latest GPU-acceleration factors of popular HPC applications.


Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology to test whether AI systems are ready to be deployed in the field to deliver meaningful results.
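Concretely, "time to train" in the tables below is wall-clock time until the model first meets the benchmark's quality target. A minimal sketch of that stopping rule (the `train_epoch` and `evaluate` callables are placeholders for illustration, not part of any benchmark harness):

```python
def train_to_convergence(train_epoch, evaluate, target, max_epochs=100):
    """Run training epochs until a quality target is met.

    train_epoch: callable running one pass over the training data (placeholder).
    evaluate:    callable returning the current quality metric (placeholder).
    target:      quality threshold, e.g. 0.7590 top-1 accuracy for ResNet-50.
    Returns the epoch at which the target was first reached.
    """
    for epoch in range(1, max_epochs + 1):
        train_epoch()
        if evaluate() >= target:
            return epoch  # time-to-train is measured up to this point
    raise RuntimeError(f"quality target {target} not reached in {max_epochs} epochs")
```

MLPerf additionally fixes the evaluation cadence, dataset, and target per benchmark; refer to the MLPerf rules for the exact definitions.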

Related Resources

Read our blog on convergence for more details.

Get up and running quickly with NVIDIA’s complete solution stack:


NVIDIA Performance on MLPerf 2.1 Training Benchmarks

BERT Time to Train on A100

PyTorch | Precision: Mixed | Dataset: Wikipedia 2020/01/01 | Convergence criteria - refer to MLPerf requirements
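"Mixed" precision here means most math runs in FP16 while master weights stay in FP32; the classic pitfall is small gradients underflowing in FP16, which is why AMP implementations apply loss scaling. A framework-free sketch of that underflow problem, using Python's IEEE binary16 round-trip (illustrative only — not NVIDIA's AMP code):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE binary16 storage."""
    return struct.unpack('e', struct.pack('e', x))[0]

grad = 1e-8                    # a tiny gradient
lost = to_fp16(grad)           # underflows to 0.0: smallest fp16 subnormal is ~6e-8
scale = 65536.0                # loss scale applied before backpropagation
kept = to_fp16(grad * scale)   # scaled value is representable in fp16
recovered = kept / scale       # unscale in fp32 before the optimizer step
```

Without the scale the gradient is silently zeroed; with it, the value survives the FP16 round-trip and is recovered to within FP16's relative precision.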

MLPerf Training Performance

NVIDIA Performance on MLPerf 2.1 AI Benchmarks: Single Node - Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| MXNet | ResNet-50 v1.5 | 14.746 | 75.90% classification | 8x H100 | DGX H100 | 2.1-2091 | Mixed | ImageNet2012 | H100-SXM5-80GB |
| | | 27.688 | 75.90% classification | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2038 | Mixed | ImageNet2012 | A100-SXM4-80GB |
| | 3D U-Net | 13.101 | 0.908 Mean DICE score | 8x H100 | DGX H100 | 2.1-2091 | Mixed | KiTS 2019 | H100-SXM5-80GB |
| | | 22.989 | 0.908 Mean DICE score | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2038 | Mixed | KiTS 2019 | A100-SXM4-80GB |
| PyTorch | BERT | 6.378 | 0.72 Mask-LM accuracy | 8x H100 | DGX H100 | 2.1-2091 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| | | 16.549 | 0.72 Mask-LM accuracy | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2039 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB |
| | Mask R-CNN | 20.348 | 0.377 Box min AP and 0.339 Mask min AP | 8x H100 | DGX H100 | 2.1-2091 | Mixed | COCO2017 | H100-SXM5-80GB |
| | | 37.916 | 0.377 Box min AP and 0.339 Mask min AP | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2039 | Mixed | COCO2017 | A100-SXM4-80GB |
| | RNN-T | 18.202 | 0.058 Word Error Rate | 8x H100 | DGX H100 | 2.1-2091 | Mixed | LibriSpeech | H100-SXM5-80GB |
| | | 29.948 | 0.058 Word Error Rate | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2039 | Mixed | LibriSpeech | A100-SXM4-80GB |
| | RetinaNet | 38.050 | 34.0% mAP | 8x H100 | DGX H100 | 2.1-2091 | Mixed | OpenImages | H100-SXM5-80GB |
| | | 82.529 | 34.0% mAP | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2039 | Mixed | OpenImages | A100-SXM4-80GB |
| TensorFlow | MiniGo | 174.584 | 50% win rate vs. checkpoint | 8x H100 | DGX H100 | 2.1-2091 | Mixed | Go | H100-SXM5-80GB |
| | | 161.848 | 50% win rate vs. checkpoint | 8x A100 | GIGABYTE: G492-ZD2 | 2.1-2040 | Mixed | Go | A100-SXM4-80GB |
| NVIDIA Merlin HugeCTR | DLRM | 1.063 | 0.8025 AUC | 8x H100 | DGX H100 | 2.1-2091 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | H100-SXM5-80GB |
| | | 1.625 | 0.8025 AUC | 8x A100 | Fujitsu: PRIMERGY-GX2570M6-hugectr | 2.1-2033 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB |

NVIDIA Performance on MLPerf 2.1 AI Benchmarks: Multi Node - Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| MXNet | ResNet-50 v1.5 | 4.508 | 75.90% classification | 32x H100 | DGX H100 | 2.1-2093 | Mixed | ImageNet2012 | H100-SXM5-80GB |
| | | 4.523 | 75.90% classification | 64x A100 | DGX A100 | 2.1-2065 | Mixed | ImageNet2012 | A100-SXM4-80GB |
| | | 0.555 | 75.90% classification | 1,024x A100 | DGX A100 | 2.1-2073 | Mixed | ImageNet2012 | A100-SXM4-80GB |
| | | 0.319 | 75.90% classification | 4,216x A100 | DGX A100 | 2.1-2080 | Mixed | ImageNet2012 | A100-SXM4-80GB |
| | 3D U-Net | 5.347 | 0.908 Mean DICE score | 24x H100 | DGX H100 | 2.1-2092 | Mixed | KiTS 2019 | H100-SXM5-80GB |
| | | 3.437 | 0.908 Mean DICE score | 72x A100 | Azure: ND96amsr_A100_v4_n9 | 2.1-2009 | Mixed | KiTS 2019 | A100-SXM4-80GB |
| | | 1.216 | 0.908 Mean DICE score | 768x A100 | DGX A100 | 2.1-2072 | Mixed | KiTS 2019 | A100-SXM4-80GB |
| PyTorch | BERT | 1.797 | 0.72 Mask-LM accuracy | 32x H100 | DGX H100 | 2.1-2093 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB |
| | | 2.497 | 0.72 Mask-LM accuracy | 64x A100 | DGX A100 | 2.1-2068 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB |
| | | 0.421 | 0.72 Mask-LM accuracy | 1,024x A100 | DGX A100 | 2.1-2074 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB |
| | | 0.208 | 0.72 Mask-LM accuracy | 4,096x A100 | DGX A100 | 2.1-2079 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB |
| | Mask R-CNN | 7.338 | 0.377 Box min AP and 0.339 Mask min AP | 32x H100 | DGX H100 | 2.1-2093 | Mixed | COCO2017 | H100-SXM5-80GB |
| | | 8.293 | 0.377 Box min AP and 0.339 Mask min AP | 64x A100 | HPE-ProLiant-XL675d-Gen10-Plus_A100-SXM-80GB_pytorch | 2.1-2049 | Mixed | COCO2017 | A100-SXM4-80GB |
| | | 2.750 | 0.377 Box min AP and 0.339 Mask min AP | 384x A100 | DGX A100 | 2.1-2071 | Mixed | COCO2017 | A100-SXM4-80GB |
| | RNN-T | 7.534 | 0.058 Word Error Rate | 32x H100 | DGX H100 | 2.1-2093 | Mixed | LibriSpeech | H100-SXM5-80GB |
| | | 6.910 | 0.058 Word Error Rate | 64x A100 | DGX A100 | 2.1-2066 | Mixed | LibriSpeech | A100-SXM4-80GB |
| | | 2.151 | 0.058 Word Error Rate | 1,536x A100 | DGX A100 | 2.1-2076 | Mixed | LibriSpeech | A100-SXM4-80GB |
| | RetinaNet | 11.798 | 34.0% mAP | 32x H100 | DGX H100 | 2.1-2093 | Mixed | OpenImages | H100-SXM5-80GB |
| | | 12.763 | 34.0% mAP | 64x A100 | DGX A100 | 2.1-2068 | Mixed | OpenImages | A100-SXM4-80GB |
| | | 2.349 | 34.0% mAP | 1,280x A100 | DGX A100 | 2.1-2075 | Mixed | OpenImages | A100-SXM4-80GB |
| | | 1.843 | 34.0% mAP | 2,048x A100 | DGX A100 | 2.1-2078 | Mixed | OpenImages | A100-SXM4-80GB |
| TensorFlow | MiniGo | 92.522 | 50% win rate vs. checkpoint | 32x H100 | DGX H100 | 2.1-2093 | Mixed | Go | H100-SXM5-80GB |
| | | 73.038 | 50% win rate vs. checkpoint | 64x A100 | DGX A100 | 2.1-2067 | Mixed | Go | A100-SXM4-80GB |
| | | 16.231 | 50% win rate vs. checkpoint | 1,792x A100 | DGX A100 | 2.1-2077 | Mixed | Go | A100-SXM4-80GB |
| NVIDIA Merlin HugeCTR | DLRM | 0.515 | 0.8025 AUC | 32x H100 | DGX H100 | 2.1-2093 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | H100-SXM5-80GB |
| | | 0.653 | 0.8025 AUC | 64x A100 | DGX A100 | 2.1-2064 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB |
| | | 0.588 | 0.8025 AUC | 112x A100 | DGX A100 | 2.1-2070 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB |

MLPerf™ v2.1 Training Closed: 2.1-2038, 2.1-2039, 2.1-2033, 2.1-2040, 2.1-2065, 2.1-2068, 2.1-2049, 2.1-2066, 2.1-2064, 2.1-2067, 2.1-2009, 2.1-2070, 2.1-2071, 2.1-2072, 2.1-2073, 2.1-2074, 2.1-2075, 2.1-2076, 2.1-2077, 2.1-2078, 2.1-2079, 2.1-2080, 2.1-2091, 2.1-2092, 2.1-2093 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
H100 SXM5-80GB is a preview submission


NVIDIA A100 Performance on MLPerf 2.0 Training HPC Benchmarks: Strong Scaling - Closed Division

| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| PyTorch | CosmoFlow | 3.79 | Mean average error 0.124 | 512x A100 | DGX A100 | 2.0-8006 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB |
| | DeepCAM | 1.57 | IOU 0.82 | 2,048x A100 | DGX A100 | 2.0-8005 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB |
| | OpenCatalyst | 21.93 | Forces mean absolute error 0.036 | 512x A100 | DGX A100 | 2.0-8006 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | A100-SXM4-80GB |

NVIDIA A100 Performance on MLPerf 2.0 Training HPC Benchmarks: Weak Scaling - Closed Division

| Framework | Network | Throughput | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| PyTorch | CosmoFlow | 4.21 models/min | Mean average error 0.124 | 4,096x A100 | DGX A100 | 2.0-8014 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB |
| | DeepCAM | 6.40 models/min | IOU 0.82 | 4,096x A100 | DGX A100 | 2.0-8014 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB |
| | OpenCatalyst | 0.66 models/min | Forces mean absolute error 0.036 | 4,096x A100 | DGX A100 | 2.0-8014 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | A100-SXM4-80GB |

MLPerf™ v2.0 Training HPC Closed: 2.0-8005, 2.0-8006, 2.0-8014 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For MLPerf™ v2.0 Training HPC rules and guidelines, click here

Converged Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100

Benchmarks are reproducible by following the links to the NGC catalog scripts.
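Each Container entry below names a monthly NGC tag. As an illustration of how to get started (image name and run flags assumed from NVIDIA's standard NGC PyTorch container; the exact benchmark scripts are linked from the NGC catalog):

```shell
# Pull the PyTorch NGC container matching the "22.09-py3" entries below
docker pull nvcr.io/nvidia/pytorch:22.09-py3

# Launch with GPU access and shared-memory settings typical for data loaders
docker run --gpus all --ipc=host -it --rm nvcr.io/nvidia/pytorch:22.09-py3
```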

A100 Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.13.0a0 | Tacotron2 | 98 | .56 Training Loss | 316,564 total output mels/sec | 8x A100 | DGX A100 | 22.09-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM4-80GB |
| | 1.13.0a0 | WaveGlow | 241 | -5.72 Training Loss | 1,763,443 output samples/sec | 8x A100 | DGX A100 | 22.07-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-80GB |
| | 1.13.0a0 | GNMT v2 | 16 | 24.4 BLEU Score | 962,349 total tokens/sec | 8x A100 | DGX A100 | 22.08-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-80GB |
| | 1.13.0a0 | NCF | 0.35 | .96 Hit Rate at 10 | 160,067,506 samples/sec | 8x A100 | DGX A100 | 22.09-py3 | Mixed | 131072 | MovieLens 20M | A100-SXM4-80GB |
| | 1.13.0a0 | Transformer XL Base | 181 | 22.34 Perplexity | 732,282 total tokens/sec | 8x A100 | DGX A100 | 22.09-py3 | Mixed | 128 | WikiText-103 | A100-SXM4-80GB |
| | 1.13.0a0 | SE3 Transformer | 9 | .04 MAE | 22,339 molecules/sec | 8x A100 | DGX A100 | 22.08-py3 | Mixed | 240 | Quantum Machines 9 | A100-SXM4-80GB |
| Tensorflow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 1,072 images/sec | 8x A100 | DGX A100 | 22.09-py3 | Mixed | 2 | DAGM2007 | A100-SXM4-80GB |
| | 2.9.1 | U-Net Medical | 2 | .89 DICE Score | 1,057 images/sec | 8x A100 | DGX A100 | 22.09-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM4-40GB |
| | 2.9.1 | Electra Fine Tuning | 3 | 92.35 F1 | 2,635 sequences/sec | 8x A100 | DGX A100 | 22.09-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-40GB |
| | 2.9.1 | EfficientNet-B0 | 528 | 76.48 Top 1 | 20,347 images/sec | 8x A100 | DGX A100 | 22.09-py3 | Mixed | 1024 | Imagenet2012 | A100-SXM4-80GB |
| | 2.9.1 | SIM | 1 | .82 AUC | 3,097,360 samples/sec | 8x A100 | DGX A100 | 22.09-py3 | Mixed | 16384 | Amazon Reviews | A100-SXM4-40GB |

A40 Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.13.0a0 | NCF | 1 | .96 Hit Rate at 10 | 50,352,046 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 131072 | MovieLens 20M | A40 |
| | 1.13.0a0 | Tacotron2 | 115 | .56 Training Loss | 268,967 total output mels/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.07-py3 | Mixed | 128 | LJSpeech 1.1 | A40 |
| | 1.13.0a0 | WaveGlow | 464 | -5.74 Training Loss | 907,704 output samples/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.08-py3 | Mixed | 10 | LJSpeech 1.1 | A40 |
| | 1.13.0a0 | GNMT v2 | 54 | 24.24 BLEU Score | 324,183 total tokens/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.08-py3 | Mixed | 128 | wmt16-en-de | A40 |
| | 1.13.0a0 | Transformer XL Base | 438 | 22.41 Perplexity | 304,580 total tokens/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.09-py3 | Mixed | 128 | WikiText-103 | A40 |
| | 1.13.0a0 | SE3 Transformer | 14 | .04 MAE | 13,306 molecules/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.09-py3 | Mixed | 240 | Quantum Machines 9 | A40 |
| Tensorflow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 734 images/sec | 8x A40 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 2 | DAGM2007 | A40 |
| | 2.9.1 | Electra Fine Tuning | 4 | 92.66 F1 | 1,128 sequences/sec | 8x A40 | Supermicro AS -4124GS-TNR | 22.09-py3 | Mixed | 32 | SQuaD v1.1 | A40 |

A30 Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.13.0a0 | Tacotron2 | 118 | .53 Training Loss | 262,520 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 104 | LJSpeech 1.1 | A30 |
| | 1.13.0a0 | WaveGlow | 426 | -5.68 Training Loss | 986,408 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 10 | LJSpeech 1.1 | A30 |
| | 1.13.0a0 | GNMT v2 | 54 | 24.42 BLEU Score | 324,219 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 128 | wmt16-en-de | A30 |
| | 1.13.0a0 | NCF | 1 | .96 Hit Rate at 10 | 57,621,616 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 131072 | MovieLens 20M | A30 |
| | 1.13.0a0 | FastPitch | 435 | 2.7 Training Loss | 180,819 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.07-py3 | Mixed | 16 | LJSpeech 1.1 | A30 |
| | 1.13.0a0 | Transformer XL Base | 147 | 23.69 Perplexity | 228,197 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.07-py3 | Mixed | 32 | WikiText-103 | A30 |
| | 1.13.0a0 | SE3 Transformer | 12 | .04 MAE | 16,339 molecules/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 240 | Quantum Machines 9 | A30 |
| Tensorflow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 674 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 2 | DAGM2007 | A30 |
| | 2.9.1 | U-Net Medical | 3 | .89 DICE Score | 469 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 8 | EM segmentation challenge | A30 |
| | 2.9.1 | Electra Fine Tuning | 5 | 92.6 F1 | 975 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 16 | SQuaD v1.1 | A30 |
| | 2.9.1 | SIM | 3 | .82 AUC | 2,257,568 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 16384 | Amazon Reviews | A30 |

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec

A10 Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.13.0a0 | Tacotron2 | 141 | .54 Training Loss | 215,014 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 104 | LJSpeech 1.1 | A10 |
| | 1.13.0a0 | WaveGlow | 590 | -5.75 Training Loss | 710,549 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.07-py3 | Mixed | 10 | LJSpeech 1.1 | A10 |
| | 1.13.0a0 | GNMT V2 | 56 | 24.31 BLEU Score | 256,704 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 128 | wmt16-en-de | A10 |
| | 1.13.0a0 | NCF | 1 | .96 Hit Rate at 10 | 43,387,067 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 131072 | MovieLens 20M | A10 |
| | 1.13.0a0 | SE3 Transformer | 15 | .04 MAE | 12,019 molecules/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 240 | Quantum Machines 9 | A10 |
| Tensorflow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 645 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 2 | DAGM2007 | A10 |
| | 1.15.5 | U-Net Medical | 13 | .9 DICE Score | 344 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.07-py3 | Mixed | 8 | EM segmentation challenge | A10 |
| | 2.8.0 | Electra Base Fine Tuning | 6 | 92.64 F1 | 753 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | SQuaD v1.1 | A10 |
| | 2.9.1 | SIM | 1 | .83 AUC | 2,155,776 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 16384 | Amazon Reviews | A10 |

T4 Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.13.0a0 | Tacotron2 | 228 | .53 Training Loss | 133,890 total output mels/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4 |
| | 1.13.0a0 | WaveGlow | 999 | -5.69 Training Loss | 420,050 output samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.07-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4 |
| | 1.13.0a0 | GNMT v2 | 109 | 24.21 BLEU Score | 132,047 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4 |
| | 1.13.0a0 | NCF | 2 | .96 Hit Rate at 10 | 26,243,408 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA T4 |
| | 1.13.0a0 | SE3 Transformer | 37 | .04 MAE | 4,742 molecules/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 240 | Quantum Machines 9 | NVIDIA T4 |
| Tensorflow | 1.15.5 | U-Net Industrial | 2 | .99 IoU Threshold 0.99 | 300 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 2 | DAGM2007 | NVIDIA T4 |
| | 1.15.5 | U-Net Medical | 39 | .9 DICE Score | 156 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4 |
| | 2.9.1 | Electra Fine Tuning | 10 | 92.74 F1 | 376 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 16 | SQuaD v1.1 | NVIDIA T4 |
| | 1.15.5 | Transformer XL Base | 938 | 22.87 Perplexity | 35,000 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 16 | WikiText-103 | NVIDIA T4 |
| | 2.9.1 | SIM | 2 | .81 AUC | 1,079,721 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 16384 | Amazon Reviews | NVIDIA T4 |


V100 Training Performance

| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.13.0a0 | Tacotron2 | 176 | .49 Training Loss | 170,958 total output mels/sec | 8x V100 | DGX-2 | 22.08-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB |
| | 1.13.0a0 | WaveGlow | 399 | -5.81 Training Loss | 1,069,377 output samples/sec | 8x V100 | DGX-2 | 22.08-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB |
| | 1.13.0a0 | GNMT v2 | 33 | 24.1 BLEU Score | 443,670 total tokens/sec | 8x V100 | DGX-2 | 22.08-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB |
| | 1.13.0a0 | NCF | 1 | .96 Hit Rate at 10 | 99,209,033 samples/sec | 8x V100 | DGX-2 | 22.09-py3 | Mixed | 131072 | MovieLens 20M | V100-SXM3-32GB |
| | 1.13.0a0 | SE3 Transformer | 14 | .04 MAE | 13,382 molecules/sec | 8x V100 | DGX-2 | 22.09-py3 | Mixed | 240 | Quantum Machines 9 | V100-SXM3-32GB |
| Tensorflow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 634 images/sec | 8x V100 | DGX-2 | 22.06-py3 | Mixed | 2 | DAGM2007 | V100-SXM3-32GB |
| | 1.15.5 | U-Net Medical | 13 | .9 DICE Score | 464 images/sec | 8x V100 | DGX-2 | 22.09-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB |
| | 1.15.5 | Transformer XL Base | 318 | 22.32 Perplexity | 103,869 total tokens/sec | 8x V100 | DGX-2 | 22.08-py3 | Mixed | 16 | WikiText-103 | V100-SXM3-32GB |
| | 2.9.1 | Electra Fine Tuning | 5 | 92.36 F1 | 1,365 sequences/sec | 8x V100 | DGX-2 | 22.09-py3 | Mixed | 32 | SQuaD v1.1 | V100-SXM3-32GB |
| | 2.9.1 | SIM | 1 | .8 AUC | 2,190,082 samples/sec | 8x V100 | DGX-2 | 22.09-py3 | Mixed | 16384 | Amazon Reviews | V100-SXM3-32GB |

Single-GPU Training

Some scenarios, such as single-GPU throughput, aren’t representative of real-world training. The table below is provided for reference as an indication of a platform’s single-chip throughput.

Related Resources

Achieve unprecedented acceleration at every scale with NVIDIA’s complete solution stack.


NVIDIA’s complete solution stack, from hardware to software, allows data scientists to deliver unprecedented acceleration at every scale. Visit the NVIDIA NGC catalog to pull containers and quickly get up and running with deep learning.
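Throughput figures of this kind reduce to samples processed per wall-clock second, measured after a warm-up phase so that one-time costs (JIT compilation, allocator ramp-up, cache population) are excluded. A generic, framework-agnostic sketch (`step_fn` is a stand-in for one training step):

```python
import time

def measure_throughput(step_fn, batch_size, iters=100, warmup=10):
    """Return samples/sec for a repeated training step, excluding warm-up."""
    for _ in range(warmup):            # warm-up iterations are not timed
        step_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed
```

For example, a step that processes a batch of 256 images in 77 ms yields roughly 3,300 images/sec, matching the order of magnitude of the ResNet-50 row below.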


Single GPU Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100

Benchmarks are reproducible by following the links to the NGC catalog scripts.

A100 Training Performance

| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.1 | ResNet-50 v1.5 | 3,307 images/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
| PyTorch | 1.13.0a0 | Tacotron2 | 42,388 total output mels/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 128 | LJSpeech 1.1 | A100-SXM4-80GB |
| | 1.13.0a0 | WaveGlow | 255,902 output samples/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-80GB |
| | 1.13.0a0 | FastPitch | 107,080 frames/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 32 | LJSpeech 1.1 | A100-SXM4-80GB |
| | 1.13.0a0 | GNMT v2 | 170,862 total tokens/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-80GB |
| | 1.13.0a0 | NCF | 41,257,218 samples/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-80GB |
| | 1.13.0a0 | Transformer XL Large | 17,053 total tokens/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 16 | WikiText-103 | A100-SXM4-80GB |
| | 1.13.0a0 | Transformer XL Base | 89,417 total tokens/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 128 | WikiText-103 | A100-SXM4-80GB |
| | 1.13.0a0 | nnU-Net | 1,128 images/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 64 | Medical Segmentation Decathlon | A100-SXM4-80GB |
| | 1.13.0a0 | EfficientNet-B4 | 389 images/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 128 | Imagenet2012 | A100-SXM4-80GB |
| | 1.13.0a0 | BERT Large Pre-Training Phase 2 | 302 sequences/sec | 1x A100 | DGX A100 | 22.07-py3 | Mixed | 56 | Wikipedia 2020/01/01 | A100-SXM4-80GB |
| | 1.13.0a0 | BERT Large Pre-Training Phase 1 | 853 sequences/sec | 1x A100 | DGX A100 | 22.07-py3 | Mixed | 512 | Wikipedia 2020/01/01 | A100-SXM4-80GB |
| | 1.13.0a0 | EfficientNet-WideSE-B4 | 388 images/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 128 | Imagenet2012 | A100-SXM4-80GB |
| | 1.13.0a0 | SE3 Transformer | 2,983 molecules/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 240 | Quantum Machines 9 | A100-SXM4-80GB |
| | 1.13.0a0 | TFT - Traffic | 17,483 items/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 1024 | Traffic | A100-SXM4-80GB |
| | 1.13.0a0 | TFT - Electricity | 17,355 items/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 1024 | Electricity | A100-SXM4-80GB |
| Tensorflow | 1.15.5 | U-Net Industrial | 351 images/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 16 | DAGM2007 | A100-SXM4-40GB |
| | 2.9.1 | U-Net Medical | 150 images/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM4-80GB |
| | 2.9.1 | Electra Base Fine Tuning | 369 sequences/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM4-80GB |
| | 1.15.5 | NCF | 47,214,021 samples/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-40GB |
| | 2.9.1 | EfficientNet-B0 | 3,271 images/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 1024 | Imagenet2012 | A100-SXM4-80GB |
| | 2.9.1 | SIM | 594,102 samples/sec | 1x A100 | DGX A100 | 22.09-py3 | Mixed | 131072 | Amazon Reviews | A100-SXM4-80GB |

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server

A40 Training Performance

| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.13.0a0 | Tacotron2 | 37,587 total output mels/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 128 | LJSpeech 1.1 | A40 |
| | 1.13.0a0 | WaveGlow | 149,922 output samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 10 | LJSpeech 1.1 | A40 |
| | 1.13.0a0 | GNMT v2 | 80,500 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 128 | wmt16-en-de | A40 |
| | 1.13.0a0 | NCF | 19,607,776 samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 1048576 | MovieLens 20M | A40 |
| | 1.13.0a0 | Transformer XL Large | 10,105 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 16 | WikiText-103 | A40 |
| | 1.13.0a0 | FastPitch | 94,386 frames/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 32 | LJSpeech 1.1 | A40 |
| | 1.13.0a0 | Transformer XL Base | 41,964 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 128 | WikiText-103 | A40 |
| | 1.13.0a0 | nnU-Net | 561 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 64 | Medical Segmentation Decathlon | A40 |
| | 1.13.0a0 | EfficientNet-B4 | 182 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 64 | Imagenet2012 | A40 |
| | 1.13.0a0 | EfficientNet-WideSE-B4 | 182 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 64 | Imagenet2012 | A40 |
| | 1.13.0a0 | SE3 Transformer | 1,804 molecules/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 240 | Quantum Machines 9 | A40 |
| | 1.13.0a0 | TFT - Traffic | 9,684 items/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 1024 | Traffic | A40 |
| | 1.13.0a0 | TFT - Electricity | 9,480 items/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 1024 | Electricity | A40 |
| Tensorflow | 1.15.5 | U-Net Industrial | 123 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 16 | DAGM2007 | A40 |
| | 1.15.5 | U-Net Medical | 67 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 8 | EM segmentation challenge | A40 |
| | 2.9.1 | EfficientNet-B0 | 928 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.07-py3 | TF32 | 512 | Imagenet2012 | A40 |

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server

A30 Training Performance

| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.13.0a0 | Tacotron2 | 34,616 total output mels/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 104 | LJSpeech 1.1 | A30 |
| | 1.13.0a0 | WaveGlow | 155,908 output samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 10 | LJSpeech 1.1 | A30 |
| | 1.13.0a0 | FastPitch | 88,937 frames/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 16 | LJSpeech 1.1 | A30 |
| | 1.13.0a0 | NCF | 21,661,040 samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 1048576 | MovieLens 20M | A30 |
| | 1.13.0a0 | GNMT v2 | 91,540 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 128 | wmt16-en-de | A30 |
| | 1.13.0a0 | Transformer XL Base | 19,014 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 32 | WikiText-103 | A30 |
| | 1.13.0a0 | Transformer XL Large | 7,020 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 4 | WikiText-103 | A30 |
| | 1.13.0a0 | nnU-Net | 597 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 64 | Medical Segmentation Decathlon | A30 |
| | 1.13.0a0 | EfficientNet-B4 | 189 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 32 | Imagenet2012 | A30 |
| | 1.13.0a0 | EfficientNet-WideSE-B4 | 188 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 32 | Imagenet2012 | A30 |
| | 1.13.0a0 | SE3 Transformer | 2,135 molecules/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 240 | Quantum Machines 9 | A30 |
| | 1.13.0a0 | TFT - Traffic | 10,535 items/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 1024 | Traffic | A30 |
| | 1.13.0a0 | TFT - Electricity | 10,498 items/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 1024 | Electricity | A30 |
| Tensorflow | 1.15.5 | U-Net Industrial | 117 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 16 | DAGM2007 | A30 |
| | 2.9.1 | U-Net Medical | 74 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 8 | EM segmentation challenge | A30 |
| | 1.15.5 | Transformer XL Base | 18,647 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 16 | WikiText-103 | A30 |
| | 2.9.1 | Electra Base Fine Tuning | 162 sequences/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 16 | SQuaD v1.1 | A30 |
| | 2.9.1 | EfficientNet-B0 | 1,620 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 512 | Imagenet2012 | A30 |
| | 2.9.1 | SIM | 396,452 samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 131072 | Amazon Reviews | A30 |

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server

A10 Training Performance

| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.13.0a0 | Tacotron2 | 28,564 total output mels/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 104 | LJSpeech 1.1 | A10 |
| | 1.13.0a0 | WaveGlow | 114,489 output samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 10 | LJSpeech 1.1 | A10 |
| | 1.13.0a0 | FastPitch | 73,387 frames/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 16 | LJSpeech 1.1 | A10 |
| | 1.13.0a0 | Transformer XL Base | 15,619 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 32 | WikiText-103 | A10 |
| | 1.13.0a0 | GNMT v2 | 64,911 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 128 | wmt16-en-de | A10 |
| | 1.13.0a0 | NCF | 16,363,689 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 1048576 | MovieLens 20M | A10 |
| | 1.13.0a0 | Transformer XL Large | 5,976 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 4 | WikiText-103 | A10 |
| | 1.13.0a0 | nnU-Net | 448 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 64 | Medical Segmentation Decathlon | A10 |
| | 1.13.0a0 | EfficientNet-B4 | 145 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 32 | Imagenet2012 | A10 |
| | 1.13.0a0 | EfficientNet-WideSE-B4 | 145 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 32 | Imagenet2012 | A10 |
| | 1.13.0a0 | SE3 Transformer | 1,576 molecules/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 240 | Quantum Machines 9 | A10 |
| | 1.13.0a0 | TFT - Traffic | 7,938 items/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 1024 | Traffic | A10 |
| | 1.13.0a0 | TFT - Electricity | 7,964 items/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 1024 | Electricity | A10 |
| Tensorflow | 1.15.5 | U-Net Industrial | 99 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | Mixed | 16 | DAGM2007 | A10 |
| | 2.9.1 | U-Net Medical | 51 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 8 | EM segmentation challenge | A10 |
| | 2.9.1 | Electra Base Fine Tuning | 120 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 16 | SQuaD v1.1 | A10 |
| | 2.9.1 | EfficientNet-B0 | 1,343 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 512 | Imagenet2012 | A10 |
| | 2.9.1 | SIM | 369,569 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | Mixed | 131072 | Amazon Reviews | A10 |

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server

T4 Training Performance

| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.13.0a0 | Tacotron2 | 17,962 total output mels/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4 |
| | 1.13.0a0 | WaveGlow | 55,246 output samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4 |
| | 1.13.0a0 | FastPitch | 33,297 frames/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 16 | LJSpeech 1.1 | NVIDIA T4 |
| | 1.13.0a0 | GNMT v2 | 30,655 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4 |
| | 1.13.0a0 | NCF | 7,713,244 samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 1048576 | MovieLens 20M | NVIDIA T4 |
| | 1.13.0a0 | Transformer XL Base | 9,011 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 32 | WikiText-103 | NVIDIA T4 |
| | 1.13.0a0 | Transformer XL Large | 2,729 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 4 | WikiText-103 | NVIDIA T4 |
| | 1.13.0a0 | nnU-Net | 202 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 64 | Medical Segmentation Decathlon | NVIDIA T4 |
| | 1.13.0a0 | EfficientNet-B4 | 68 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 32 | Imagenet2012 | NVIDIA T4 |
| | 1.13.0a0 | EfficientNet-WideSE-B4 | 67 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 32 | Imagenet2012 | NVIDIA T4 |
| | 1.13.0a0 | SE3 Transformer | 610 molecules/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 240 | Quantum Machines 9 | NVIDIA T4 |
| | 1.13.0a0 | TFT - Traffic | 4,297 items/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 1024 | Traffic | NVIDIA T4 |
| | 1.13.0a0 | TFT - Electricity | 4,262 items/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 1024 | Electricity | NVIDIA T4 |
| Tensorflow | 1.15.5 | U-Net Industrial | 44 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 16 | DAGM2007 | NVIDIA T4 |
| | 1.15.5 | U-Net Medical | 21 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4 |
| | 2.9.1 | Electra Base Fine Tuning | 56 sequences/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 16 | SQuaD v1.1 | NVIDIA T4 |
| | 2.9.1 | EfficientNet-B0 | 579 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 256 | Imagenet2012 | NVIDIA T4 |
| | 2.9.1 | SIM | 168,087 samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | Mixed | 131072 | Amazon Reviews | NVIDIA T4 |

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec



V100 Training Performance

| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.13.0a0 | Tacotron2 | 23,774 total output mels/sec | 1x V100 | DGX-2 | 22.08-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB |
| | 1.13.0a0 | WaveGlow | 155,021 output samples/sec | 1x V100 | DGX-2 | 22.08-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB |
| | 1.13.0a0 | FastPitch | 87,274 frames/sec | 1x V100 | DGX-2 | 22.09-py3 | Mixed | 16 | LJSpeech 1.1 | V100-SXM3-32GB |
| | 1.13.0a0 | GNMT v2 | 78,543 total tokens/sec | 1x V100 | DGX-2 | 22.08-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB |
| | 1.13.0a0 | NCF | 24,127,863 samples/sec | 1x V100 | DGX-2 | 22.09-py3 | Mixed | 1048576 | MovieLens 20M | V100-SXM3-32GB |
| | 1.13.0a0 | Transformer XL Base | 17,830 total tokens/sec | 1x V100 | DGX-2 | 22.09-py3 | Mixed | 32 | WikiText-103 | V100-SXM3-32GB |
| | 1.13.0a0 | Transformer XL Large | 7,211 total tokens/sec | 1x V100 | DGX-2 | 22.09-py3 | Mixed | 8 | WikiText-103 | V100-SXM3-32GB |
| | 1.13.0a0 | nnU-Net | 657 images/sec | 1x V100 | DGX-2 | 22.09-py3 | Mixed | 64 | Medical Segmentation Decathlon | V100-SXM3-32GB |
| | 1.13.0a0 | EfficientNet-B4 | 220 images/sec | 1x V100 | DGX-2 | 22.09-py3 | Mixed | 64 | Imagenet2012 | V100-SXM3-32GB |
| | 1.13.0a0 | EfficientNet-WideSE-B4 | 220 images/sec | 1x V100 | DGX-2 | 22.09-py3 | Mixed | 64 | Imagenet2012 | V100-SXM3-32GB |
| | 1.13.0a0 | SE3 Transformer | 1,989 molecules/sec | 1x V100 | DGX-2 | 22.09-py3 | Mixed | 240 | Quantum Machines 9 | V100-SXM3-32GB |
| | 1.13.0a0 | TFT - Traffic | 11,685 items/sec | 1x V100 | DGX-2 | 22.09-py3 | Mixed | 1024 | Traffic | V100-SXM3-32GB |
| | 1.13.0a0 | TFT - Electricity | 11,602 items/sec | 1x V100 | DGX-2 | 22.09-py3 | Mixed | 1024 | Electricity | V100-SXM3-32GB |
| Tensorflow | 1.15.5 | U-Net Industrial | 118 images/sec | 1x V100 | DGX-2 | 22.08-py3 | Mixed | 16 | DAGM2007 | V100-SXM3-32GB |
| | 1.15.5 | U-Net Medical | 68 images/sec | 1x V100 | DGX-2 | 22.09-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB |
| | 2.9.1 | Electra Base Fine Tuning | 188 sequences/sec | 1x V100 | DGX-2 | 22.09-py3 | Mixed | 32 | SQuaD v1.1 | V100-SXM3-32GB |
| | 1.15.5 | Transformer XL Base | 18,514 total tokens/sec | 1x V100 | DGX-2 | 22.08-py3 | Mixed | 16 | WikiText-103 | V100-SXM3-32GB |
| | 2.9.1 | EfficientNet-B0 | 621 images/sec | 1x V100 | DGX-2 | 22.09-py3 | FP32 | 256 | Imagenet2012 | V100-SXM3-32GB |
| | 2.9.1 | SIM | 351,636 samples/sec | 1x V100 | DGX-2 | 22.09-py3 | Mixed | 131072 | Amazon Reviews | V100-SXM3-32GB |

FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec

Single GPU Training Performance of NVIDIA GPU on Cloud

Benchmarks are reproducible by following the links to the NGC catalog scripts.

A100 Training Performance on Cloud

| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | - | ResNet-50 v1.5 | 2,887 images/sec | 1x A100 | GCP A2-HIGHGPU-1G | 22.07-py3 | Mixed | 192 | ImageNet2012 | A100-SXM4-40GB |
| PyTorch | - | DLRM | 3,450,000 records/sec | 1x A100 | GCP A2-HIGHGPU-1G | 22.07-py3 | Mixed | 32768 | Criteo Terabyte Dataset | A100-SXM4-40GB |

BERT-Large = BERT-Large Fine Tuning (Squadv1.1) with Sequence Length of 384

T4 Training Performance on Cloud

| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | - | ResNet-50 v1.5 | 450 images/sec | 1x T4 | AWS EC2 g4dn.4xlarge | 22.06-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4 |
| | - | ResNet-50 v1.5 | 432 images/sec | 1x T4 | GCP N1-HIGHMEM-8 | 22.07-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4 |
| TensorFlow | - | ResNet-50 v1.5 | 419 images/sec | 1x T4 | AWS EC2 g4dn.4xlarge | 22.06-py3 | Mixed | 256 | Imagenet2012 | NVIDIA T4 |

BERT-Large = BERT-Large Fine Tuning (Squadv1.1) with Sequence Length of 384



V100 Training Performance on Cloud

| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | - | ResNet-50 v1.5 | 1,467 images/sec | 1x V100 | GCP N1-HIGHMEM-8 | 22.07-py3 | Mixed | 192 | ImageNet2012 | V100-SXM2-16GB |

BERT-Large = BERT-Large Fine Tuning (Squadv1.1) with Sequence Length of 384

AI Inference

Real-world inferencing demands high throughput and low latencies with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance from data center to edge.
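For server-style deployments (as in the MLPerf server scenario below), the figure of merit is tail latency: a result only counts if a high percentile of per-query latencies stays under the benchmark's bound. A small nearest-rank percentile check captures the idea (illustrative only; MLPerf's LoadGen implements the official measurement):

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank p-th percentile of a list of latencies in milliseconds."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def meets_latency_bound(latencies_ms, bound_ms, p=99):
    """True if the p-th percentile latency is within the benchmark's bound."""
    return percentile(latencies_ms, p) <= bound_ms
```

Note that mean latency can look healthy while the p99 violates the bound, which is why throughput alone is not sufficient for serving workloads.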

Related Resources

Power high-throughput, low-latency inference with NVIDIA’s complete solution stack:


MLPerf Inference v2.1 Performance Benchmarks

Offline Scenario - Closed Division

| Network | Throughput | GPU | Server | GPU Version | Dataset | Target Accuracy |
|---|---|---|---|---|---|---|
| ResNet-50 v1.5 | 81,292 samples/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | ImageNet | 76.46% Top1 |
| | 335,144 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet | 76.46% Top1 |
| | 5,589 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | ImageNet | 76.46% Top1 |
| | 316,342 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | ImageNet | 76.46% Top1 |
| RetinaNet | 960 samples/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | OpenImages | 0.3755 mAP |
| | 4,739 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | OpenImages | 0.3755 mAP |
| | 74 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | OpenImages | 0.3755 mAP |
| | 4,345 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | OpenImages | 0.3755 mAP |
| 3D-UNet | 5 samples/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | KiTS 2019 | 0.863 DICE mean |
| | 26 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | KiTS 2019 | 0.863 DICE mean |
| | 0.51 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | KiTS 2019 | 0.863 DICE mean |
| | 25 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | KiTS 2019 | 0.863 DICE mean |
| RNN-T | 22,885 samples/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | LibriSpeech | 7.45% WER |
| | 106,726 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech | 7.45% WER |
| | 1,918 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | LibriSpeech | 7.45% WER |
| | 102,784 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | LibriSpeech | 7.45% WER |
| BERT | 7,921 samples/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | SQuAD v1.1 | 90.87% f1 |
| | 13,968 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 | 90.87% f1 |
| | 1,757 samples/sec | 1x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 | 90.87% f1 |
| | 247 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 | 90.87% f1 |
| | 12,822 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | SQuAD v1.1 | 90.87% f1 |
| DLRM | 695,298 samples/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
| | 2,443,220 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
| | 314,992 samples/sec | 1x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
| | 38,995 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
| | 2,291,310 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | Criteo 1TB Click Logs | 80.25% AUC |

Server Scenario - Closed Division

Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraint (ms) | Dataset
ResNet-50 v1.5 | 58,995 queries/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | 76.46% Top1 | 15 | ImageNet
ResNet-50 v1.5 | 300,064 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 76.46% Top1 | 15 | ImageNet
ResNet-50 v1.5 | 3,527 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 76.46% Top1 | 15 | ImageNet
ResNet-50 v1.5 | 236,057 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 76.46% Top1 | 15 | ImageNet
RetinaNet | 848 queries/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | 0.3755 mAP | 100 | OpenImages
RetinaNet | 4,096 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 0.3755 mAP | 100 | OpenImages
RetinaNet | 45 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 0.3755 mAP | 100 | OpenImages
RetinaNet | 3,997 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 0.3755 mAP | 100 | OpenImages
RNN-T | 21,488 queries/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | 7.45% WER | 1,000 | LibriSpeech
RNN-T | 104,020 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 7.45% WER | 1,000 | LibriSpeech
RNN-T | 1,347 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 7.45% WER | 1,000 | LibriSpeech
RNN-T | 90,005 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 7.45% WER | 1,000 | LibriSpeech
BERT | 6,195 queries/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | 90.87% f1 | 130 | SQuAD v1.1
BERT | 12,815 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 90.87% f1 | 130 | SQuAD v1.1
BERT | 1,572 queries/sec | 1x A100 | DGX A100 | A100 SXM-80GB | 90.87% f1 | 130 | SQuAD v1.1
BERT | 164 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 90.87% f1 | 130 | SQuAD v1.1
BERT | 10,795 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 90.87% f1 | 130 | SQuAD v1.1
DLRM | 545,174 queries/sec | 1x H100 | NVIDIA H100 | H100-SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs
DLRM | 2,390,910 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs
DLRM | 298,565 queries/sec | 1x A100 | DGX A100 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs
DLRM | 35,991 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs
DLRM | 1,326,940 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs

Power Efficiency Offline Scenario - Closed Division

Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset
ResNet-50 v1.5 | 288,733 samples/sec | 93.68 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet
ResNet-50 v1.5 | 252,721 samples/sec | 122.19 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | ImageNet
RetinaNet | 4,122 samples/sec | 1.32 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | OpenImages
RetinaNet | 3,805 samples/sec | 1.73 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | OpenImages
3D-UNet | 23 samples/sec | 0.008 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | KiTS 2019
3D-UNet | 19 samples/sec | 0.011 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | KiTS 2019
RNN-T | 84,508 samples/sec | 27.79 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech
RNN-T | 78,750 samples/sec | 38.88 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | LibriSpeech
BERT | 11,152 samples/sec | 3.33 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1
BERT | 11,158 samples/sec | 4.37 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | SQuAD v1.1
DLRM | 2,128,420 samples/sec | 641.77 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs

Power Efficiency Server Scenario - Closed Division

Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset
ResNet-50 v1.5 | 229,055 queries/sec | 78.93 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet
ResNet-50 v1.5 | 185,047 queries/sec | 87.2 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | ImageNet
RetinaNet | 3,896 queries/sec | 1.25 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | OpenImages
RetinaNet | 2,296 queries/sec | 1.21 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | OpenImages
RNN-T | 88,003 queries/sec | 25.44 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech
RNN-T | 74,995 queries/sec | 33.88 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | LibriSpeech
BERT | 9,995 queries/sec | 2.93 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1
BERT | 7,494 queries/sec | 3.45 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | SQuAD v1.1
DLRM | 2,002,080 queries/sec | 592.73 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs
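The Throughput per Watt column also implies the measured board power: dividing throughput by efficiency recovers the total power draw across all eight GPUs. A minimal sketch, using numbers taken from the server-scenario rows above (the helper name is ours, not part of the benchmark):

```python
# Recover the implied total board power (watts) from a power-efficiency row:
# power = throughput / (throughput per watt).
def implied_power_watts(throughput: float, per_watt: float) -> float:
    return throughput / per_watt

# ResNet-50 v1.5, 8x A100 on DGX A100, server scenario:
# 229,055 queries/sec at 78.93 queries/sec/watt -> roughly 2.9 kW total,
# i.e. about 363 W per A100 board.
total_w = implied_power_watts(229_055, 78.93)
print(round(total_w), round(total_w / 8))
```

The same arithmetic explains why the PCIe submissions post higher efficiency at lower absolute throughput: the PCIe boards draw considerably less power than the SXM boards.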

MLPerf™ v2.1 Inference Closed: ResNet-50 v1.5, RetinaNet, RNN-T, BERT 99.9% of FP32 accuracy target, 3D U-Net 99.9% of FP32 accuracy target, DLRM 99.9% of FP32 accuracy target: 2.1-0082, 2.1-0084, 2.1-0085, 2.1-0087, 2.1-0088, 2.1-0089, 2.1-0121, 2.1-0122. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
H100 SXM-80GB results are a preview submission.
BERT-Large sequence length = 384.
For DLRM, one sample refers to an average of 270 pairs.
1x1g.10gb denotes a MIG (Multi-Instance GPU) configuration: the workload runs on a single MIG slice with 10GB of memory on a single A100.
For MLPerf™ data across the various scenarios, click here
For MLPerf™ latency constraints, click here
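Because the offline rows above mix single-GPU and eight-GPU submissions, a quick way to compare chips is to normalize to per-accelerator throughput. A sketch using the ResNet-50 v1.5 offline numbers from the table (the helper name is ours):

```python
# Normalize multi-GPU offline throughput to per-accelerator numbers so that
# submissions with different GPU counts can be compared directly.
def per_gpu(throughput: float, num_gpus: int) -> float:
    return throughput / num_gpus

h100 = per_gpu(81_292, 1)    # 1x H100 ResNet-50 v1.5 offline row
a100 = per_gpu(335_144, 8)   # 8x A100 DGX A100 row, same benchmark
print(f"H100: {h100:.0f}/s, A100: {a100:.0f}/s, ratio {h100 / a100:.2f}x")
```

On these numbers a single H100 delivers roughly 1.9x the per-GPU ResNet-50 throughput of an A100; the caveat is that the H100 figures are a preview submission.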

NVIDIA Triton Inference Server Delivered Comparable Performance to Custom Harness in MLPerf v2.1


NVIDIA landed top performance spots on all MLPerf™ Inference 2.1 tests, the AI industry's leading benchmark competition. For inference submissions, we have typically used a custom A100 inference serving harness, designed and optimized specifically to deliver the highest possible inference performance for MLPerf™ workloads, which require running inference on bare metal.

MLPerf™ v2.1 A100 Inference Closed: ResNet-50 v1.5, RetinaNet, BERT 99.9% of FP32 accuracy target, DLRM 99.9% of FP32 accuracy target: 2.1-0088, 2.1-0090. MLPerf name and logo are trademarks. See www.mlcommons.org for more information.​

 

NVIDIA Client Batch Size 1 and 2 Performance with Triton Inference Server

A100 Triton Inference Server Performance

Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version
BERT Large Inference | A100-SXM4-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 32.705 | 734 inf/sec | 384 | 22.08-py3
BERT Large Inference | A100-SXM4-80GB | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 62.407 | 770 inf/sec | 384 | 22.08-py3
BERT Large Inference | A100-PCIE-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 38.956 | 616 inf/sec | 384 | 22.08-py3
BERT Large Inference | A100-PCIE-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 74.039 | 648 inf/sec | 384 | 22.08-py3
BERT Base Inference | A100-SXM4-80GB | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 4.194 | 5,721 inf/sec | 128 | 22.08-py3
BERT Base Inference | A100-SXM4-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 7.009 | 6,848 inf/sec | 128 | 22.08-py3
BERT Base Inference | A100-PCIE-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 4.819 | 4,979 inf/sec | 128 | 22.08-py3
BERT Base Inference | A100-PCIE-40GB | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 8.186 | 5,862 inf/sec | 128 | 22.08-py3
DLRM Inference | A100-SXM4-40GB | pytorch_libtorch | PyTorch | Mixed | 2 | 1 | 65,536 | 30 | 2.351 | 12,756 inf/sec | - | 22.08-py3
DLRM Inference | A100-SXM4-40GB | pytorch_libtorch | PyTorch | Mixed | 2 | 2 | 65,536 | 28 | 2.223 | 25,185 inf/sec | - | 22.08-py3
DLRM Inference | A100-PCIE-40GB | pytorch_libtorch | PyTorch | Mixed | 4 | 1 | 65,536 | 30 | 2.316 | 12,946 inf/sec | - | 22.08-py3
DLRM Inference | A100-PCIE-40GB | pytorch_libtorch | PyTorch | Mixed | 4 | 2 | 65,536 | 30 | 2.152 | 27,875 inf/sec | - | 22.07-py3
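The latency and throughput columns in these Triton tables are internally consistent with Little's law for a closed-loop load generator: sustained throughput is approximately concurrent requests x client batch size / latency. A small sanity check against the first BERT Large row above (24 concurrent requests, client batch 1, 32.705 ms):

```python
# Little's law for a closed-loop client: with N concurrent requests in flight,
# each carrying `batch` inferences and completing in `latency_ms`, the
# sustained throughput is N * batch / latency (in seconds).
def littles_law_throughput(concurrency: int, batch: int, latency_ms: float) -> float:
    return concurrency * batch / (latency_ms / 1000.0)

# BERT Large on A100-SXM4-40GB: 24 requests, batch 1, 32.705 ms
# -> ~734 inf/sec, matching the table row.
print(round(littles_law_throughput(24, 1, 32.705)))  # 734
```

The same check works on the other rows, which is a useful way to verify a reproduction run: if measured throughput falls well below this bound, the client, not the server, is the bottleneck.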

A30 Triton Inference Server Performance

Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version
BERT Large Inference | A30 | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 68.117 | 352 inf/sec | 384 | 22.08-py3
BERT Large Inference | A30 | tensorrt_plan | TensorRT | Mixed | 2 | 2 | 1 | 16 | 87.427 | 366 inf/sec | 384 | 22.08-py3
BERT Base Inference | A30 | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 7.679 | 3,125 inf/sec | 128 | 22.08-py3
BERT Base Inference | A30 | tensorrt_plan | TensorRT | Mixed | 2 | 2 | 1 | 16 | 9.502 | 3,367 inf/sec | 128 | 22.08-py3

A10 Triton Inference Server Performance

Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version
BERT Large Inference | A10 | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 107.907 | 222 inf/sec | 384 | 22.08-py3
BERT Large Inference | A10 | tensorrt_plan | TensorRT | Mixed | 2 | 2 | 1 | 24 | 211.233 | 228 inf/sec | 384 | 22.08-py3
BERT Base Inference | A10 | tensorrt_plan | TensorRT | Mixed | 2 | 1 | 1 | 24 | 11.078 | 2,166 inf/sec | 128 | 22.08-py3
BERT Base Inference | A10 | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 21.261 | 2,257 inf/sec | 128 | 22.08-py3

T4 Triton Inference Server Performance

Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version
BERT Large Inference | NVIDIA T4 | tensorrt_plan | TensorRT | Mixed | 1 | 1 | 1 | 8 | 92.264 | 87 inf/sec | 384 | 22.08-py3
BERT Large Inference | NVIDIA T4 | tensorrt_plan | TensorRT | Mixed | 1 | 2 | 1 | 8 | 183.106 | 87 inf/sec | 384 | 22.08-py3
BERT Base Inference | NVIDIA T4 | tensorrt_plan | TensorRT | Mixed | 1 | 1 | 1 | 24 | 26.479 | 906 inf/sec | 128 | 22.08-py3
BERT Base Inference | NVIDIA T4 | tensorrt_plan | TensorRT | Mixed | 1 | 2 | 1 | 20 | 43.281 | 924 inf/sec | 128 | 22.08-py3


V100 Triton Inference Server Performance

Network | Accelerator | Model Format | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version
BERT Large Inference | V100 SXM2-32GB | tensorrt_plan | TensorRT | Mixed | 2 | 1 | 1 | 24 | 96.163 | 249 inf/sec | 384 | 22.08-py3
BERT Large Inference | V100 SXM2-32GB | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 189.991 | 253 inf/sec | 384 | 22.08-py3
BERT Base Inference | V100 SXM2-32GB | tensorrt_plan | TensorRT | Mixed | 4 | 1 | 1 | 24 | 12.567 | 1,910 inf/sec | 128 | 22.08-py3
BERT Base Inference | V100 SXM2-32GB | tensorrt_plan | TensorRT | Mixed | 4 | 2 | 1 | 24 | 20.907 | 2,295 inf/sec | 128 | 22.08-py3
DLRM Inference | V100-SXM2-32GB | pytorch_libtorch | PyTorch | Mixed | 2 | 1 | 65,536 | 30 | 3.358 | 8,931 inf/sec | - | 22.08-py3
DLRM Inference | V100-SXM2-32GB | pytorch_libtorch | PyTorch | Mixed | 2 | 2 | 65,536 | 30 | 3.532 | 16,983 inf/sec | - | 22.08-py3

Inference Performance of NVIDIA A100, A40, A30, A10, A2, T4 and V100

Benchmarks are reproducible by following the links to the NGC catalog scripts.

Inference Image Classification on CNNs with TensorRT

ResNet-50 v1.5 Throughput

DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.5.0 | Batch Size = 128 | 22.09-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.5.0 | Batch Size = 128 | 22.09-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.4.2 | Batch Size = 128 | 22.08-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.5.0 | Batch Size = 128 | 22.09-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.5.0 | Batch Size = 128 | 22.09-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.5.0 | Batch Size = 128 | 22.09-py3 | Precision: Mixed | Dataset: Synthetic

 
 

ResNet-50 v1.5 Power Efficiency

DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.5.0 | Batch Size = 128 | 22.09-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.5.0 | Batch Size = 128 | 22.09-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.4.2 | Batch Size = 128 | 22.08-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.5.0 | Batch Size = 128 | 22.09-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.5.0 | Batch Size = 128 | 22.09-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.5.0 | Batch Size = 128 | 22.09-py3 | Precision: Mixed | Dataset: Synthetic

 

A100 Full Chip Inference Performance

Network | Batch Size | Full Chip Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 11,827 images/sec | 64 images/sec/watt | 0.68 | 1x A100 | DGX A100 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A100-SXM4-40GB
ResNet-50 | 128 | 29,514 images/sec | 86 images/sec/watt | 4.34 | 1x A100 | DGX A100 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A100-SXM4-40GB
ResNet-50 | 219 | 31,828 images/sec | - | 6.88 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB
ResNet-50v1.5 | 8 | 11,587 images/sec | 56 images/sec/watt | 0.69 | 1x A100 | DGX A100 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A100-SXM4-80GB
ResNet-50v1.5 | 128 | 29,927 images/sec | 77 images/sec/watt | 4.28 | 1x A100 | DGX A100 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A100-SXM4-80GB
ResNet-50v1.5 | 213 | 30,810 images/sec | - | 6.91 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB
BERT-BASE | 1, 2 | For Batch Sizes 1 and 2, refer to the Triton Inference Server results above
BERT-BASE | 8 | 7,315 sequences/sec | 27 sequences/sec/watt | 1.09 | 1x A100 | DGX A100 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A100-SXM4-80GB
BERT-BASE | 128 | 15,191 sequences/sec | 38 sequences/sec/watt | 8.43 | 1x A100 | DGX A100 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A100-SXM4-40GB
BERT-LARGE | 1, 2 | For Batch Sizes 1 and 2, refer to the Triton Inference Server results above
BERT-LARGE | 8 | 2,682 sequences/sec | 9 sequences/sec/watt | 2.98 | 1x A100 | DGX A100 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A100-SXM4-80GB
BERT-LARGE | 128 | 4,964 sequences/sec | 13 sequences/sec/watt | 25.78 | 1x A100 | DGX A100 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A100-SXM4-40GB
EfficientNet-B0 | 8 | 9,152 images/sec | 65 images/sec/watt | 0.87 | 1x A100 | DGX A100 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A100-SXM4-40GB
EfficientNet-B0 | 128 | 30,289 images/sec | 96 images/sec/watt | 4.23 | 1x A100 | DGX A100 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A100-SXM4-80GB
EfficientNet-B4 | 8 | 2,586 images/sec | 12 images/sec/watt | 3.09 | 1x A100 | DGX A100 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A100-SXM4-40GB
EfficientNet-B4 | 128 | 4,394 images/sec | 13 images/sec/watt | 29.13 | 1x A100 | DGX A100 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A100-SXM4-40GB

A hyphen in the Container column indicates a pre-release container | A hyphen in the Server column indicates a pre-production server
BERT-Large: Sequence Length = 128

A100 1/7 MIG Inference Performance

Network | Batch Size | 1/7 MIG Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 3,723 images/sec | 34 images/sec/watt | 2.15 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB
ResNet-50 | 30 | 4,309 images/sec | - | 6.96 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB
ResNet-50 | 128 | 4,671 images/sec | 37 images/sec/watt | 27.4 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB
ResNet-50v1.5 | 8 | 3,640 images/sec | 32 images/sec/watt | 2.2 | 1x A100 | DGX A100 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A100-SXM4-80GB
ResNet-50v1.5 | 28 | 4,125 images/sec | - | 6.79 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB
ResNet-50v1.5 | 128 | 4,532 images/sec | 37 images/sec/watt | 28.24 | 1x A100 | DGX A100 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A100-SXM4-80GB
BERT-BASE | 8 | 1,886 sequences/sec | 15 sequences/sec/watt | 4.24 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB
BERT-BASE | 128 | 2,342 sequences/sec | 17 sequences/sec/watt | 54.65 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB
BERT-LARGE | 8 | 618 sequences/sec | 5 sequences/sec/watt | 12.95 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB
BERT-LARGE | 128 | 744 sequences/sec | 5 sequences/sec/watt | 172.08 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB

A hyphen in the Container column indicates a pre-release container | A hyphen in the Server column indicates a pre-production server
BERT-Large: Sequence Length = 128

A100 7 MIG Inference Performance

Network | Batch Size | 7 MIG Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 25,682 images/sec | 79 images/sec/watt | 2.18 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB
ResNet-50 | 29 | 29,917 images/sec | - | 6.79 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB
ResNet-50 | 128 | 32,531 images/sec | 88 images/sec/watt | 27.62 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB
ResNet-50v1.5 | 8 | 25,056 images/sec | 77 images/sec/watt | 2.24 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB
ResNet-50v1.5 | 28 | 28,819 images/sec | - | 6.8 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB
ResNet-50v1.5 | 128 | 31,490 images/sec | 82 images/sec/watt | 28.54 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB
BERT-BASE | 8 | 13,095 sequences/sec | 34 sequences/sec/watt | 4.29 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB
BERT-BASE | 128 | 15,342 sequences/sec | 40 sequences/sec/watt | 58.53 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB
BERT-LARGE | 8 | 4,214 sequences/sec | 11 sequences/sec/watt | 13.31 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB
BERT-LARGE | 128 | 4,812 sequences/sec | 12 sequences/sec/watt | 186.61 | 1x A100 | DGX A100 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A100-SXM4-80GB

A hyphen in the Container column indicates a pre-release container | A hyphen in the Server column indicates a pre-production server
BERT-Large: Sequence Length = 128
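A useful way to read the 1/7 MIG and 7 MIG tables together: seven 1/7 slices running concurrently should roughly reproduce the 7 MIG aggregate, and the ratio measures how cleanly MIG partitions the chip. A sketch using the ResNet-50 batch-128 rows from the two tables (the helper name is ours):

```python
# MIG scaling efficiency: aggregate throughput of all 7 MIG slices relative
# to 7x the single-slice throughput from the 1/7 MIG table.
def mig_scaling(single_slice: float, seven_mig: float, slices: int = 7) -> float:
    return seven_mig / (single_slice * slices)

# ResNet-50, batch 128: 4,671 images/sec on one 1/7 slice vs
# 32,531 images/sec with all 7 slices active -> near-linear scaling.
print(f"{mig_scaling(4_671, 32_531):.1%}")  # 99.5%
```

BERT-LARGE scales a little less cleanly (7 x 744 = 5,208 vs 4,812 measured, about 92%), which is consistent with the larger model contending more for shared memory bandwidth.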

 

A40 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 9,917 images/sec | 38 images/sec/watt | 0.81 | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A40
ResNet-50 | 106 | 15,867 images/sec | - | 6.68 | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A40
ResNet-50 | 128 | 15,867 images/sec | 53 images/sec/watt | 8.07 | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A40
ResNet-50v1.5 | 8 | 9,686 images/sec | 37 images/sec/watt | 0.83 | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A40
ResNet-50v1.5 | 101 | 15,016 images/sec | - | 6.73 | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A40
ResNet-50v1.5 | 128 | 15,171 images/sec | 51 images/sec/watt | 8.44 | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A40
BERT-BASE | 8 | 5,609 sequences/sec | 19 sequences/sec/watt | 1.43 | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A40
BERT-BASE | 128 | 7,753 sequences/sec | 26 sequences/sec/watt | 16.51 | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A40
BERT-LARGE | 8 | 1,724 sequences/sec | 6 sequences/sec/watt | 4.64 | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A40
BERT-LARGE | 128 | 2,386 sequences/sec | 8 sequences/sec/watt | 53.65 | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A40
EfficientNet-B0 | 8 | 9,194 images/sec | 50 images/sec/watt | 0.87 | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A40
EfficientNet-B0 | 128 | 19,286 images/sec | 65 images/sec/watt | 6.64 | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A40
EfficientNet-B4 | 8 | 1,950 images/sec | 7 images/sec/watt | 4.1 | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A40
EfficientNet-B4 | 128 | 2,648 images/sec | 9 images/sec/watt | 48.34 | 1x A40 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A40

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container

 

A30 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 8,915 images/sec | 71 images/sec/watt | 0.9 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
ResNet-50 | 105 | 15,657 images/sec | - | 6.71 | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A30
ResNet-50 | 128 | 16,083 images/sec | 98 images/sec/watt | 7.96 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
ResNet-50v1.5 | 8 | 8,737 images/sec | 68 images/sec/watt | 0.92 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
ResNet-50v1.5 | 101 | 15,071 images/sec | - | 6.7 | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A30
ResNet-50v1.5 | 128 | 15,582 images/sec | 95 images/sec/watt | 8.21 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
BERT-BASE | 1, 2 | For Batch Sizes 1 and 2, refer to the Triton Inference Server results above
BERT-BASE | 8 | 5,146 sequences/sec | 31 sequences/sec/watt | 1.55 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
BERT-BASE | 128 | 7,327 sequences/sec | 45 sequences/sec/watt | 17.47 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
BERT-LARGE | 1, 2 | For Batch Sizes 1 and 2, refer to the Triton Inference Server results above
BERT-LARGE | 8 | 1,776 sequences/sec | 12 sequences/sec/watt | 4.5 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
BERT-LARGE | 128 | 2,456 sequences/sec | 15 sequences/sec/watt | 52.12 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
EfficientNet-B0 | 8 | 7,595 images/sec | 74 images/sec/watt | 1.05 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
EfficientNet-B0 | 128 | 16,798 images/sec | 102 images/sec/watt | 7.62 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
EfficientNet-B4 | 8 | 1,718 images/sec | 12 images/sec/watt | 4.66 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
EfficientNet-B4 | 128 | 2,357 images/sec | 14 images/sec/watt | 54.31 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container

 

A30 1/4 MIG Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 3,625 images/sec | 44 images/sec/watt | 2.21 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
ResNet-50 | 29 | 4,253 images/sec | - | 6.82 | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A30
ResNet-50 | 128 | 4,605 images/sec | 54 images/sec/watt | 27.79 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
ResNet-50v1.5 | 8 | 3,541 images/sec | 44 images/sec/watt | 2.26 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
ResNet-50v1.5 | 27 | 4,143 images/sec | - | 6.52 | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A30
ResNet-50v1.5 | 128 | 4,458 images/sec | 51 images/sec/watt | 28.71 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
BERT-BASE | 8 | 1,873 sequences/sec | 21 sequences/sec/watt | 4.27 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
BERT-BASE | 128 | 2,276 sequences/sec | 23 sequences/sec/watt | 56.23 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
BERT-LARGE | 8 | 601 sequences/sec | 7 sequences/sec/watt | 13.3 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
BERT-LARGE | 128 | 742 sequences/sec | 7 sequences/sec/watt | 172.55 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container

 

A30 4 MIG Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 13,975 images/sec | 85 images/sec/watt | 2.29 | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A30
ResNet-50 | 27 | 16,175 images/sec | - | 6.7 | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A30
ResNet-50 | 128 | 17,191 images/sec | 104 images/sec/watt | 29.88 | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A30
ResNet-50v1.5 | 8 | 13,516 images/sec | 82 images/sec/watt | 2.38 | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A30
ResNet-50v1.5 | 26 | 15,607 images/sec | - | 6.7 | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A30
ResNet-50v1.5 | 128 | 16,658 images/sec | 101 images/sec/watt | 30.9 | 1x A30 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A30
BERT-BASE | 8 | 6,894 sequences/sec | 42 sequences/sec/watt | 4.66 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
BERT-BASE | 128 | 7,732 sequences/sec | 47 sequences/sec/watt | 66.45 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
BERT-LARGE | 8 | 2,187 sequences/sec | 13 sequences/sec/watt | 14.68 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30
BERT-LARGE | 128 | 2,458 sequences/sec | 15 sequences/sec/watt | 208.61 | 1x A30 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A30

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container

 

A10 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 7,774 images/sec | 52 images/sec/watt | 1.03 | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A10
ResNet-50 | 71 | 10,857 images/sec | - | 6.63 | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A10
ResNet-50 | 128 | 11,284 images/sec | 77 images/sec/watt | 11.34 | 1x A10 | GIGABYTE G482-Z52-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A10
ResNet-50v1.5 | 8 | 7,640 images/sec | 51 images/sec/watt | 1.05 | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A10
ResNet-50v1.5 | 128 | 10,668 images/sec | 71 images/sec/watt | 12 | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A10
BERT-BASE | 8 | 4,154 sequences/sec | 28 sequences/sec/watt | 1.93 | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A10
BERT-BASE | 128 | 5,004 sequences/sec | 33 sequences/sec/watt | 25.58 | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A10
BERT-LARGE | 8 | 1,251 sequences/sec | 8 sequences/sec/watt | 6.4 | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A10
BERT-LARGE | 128 | 1,537 sequences/sec | 11 sequences/sec/watt | 83.28 | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A10
EfficientNet-B0 | 8 | 8,416 images/sec | 56 images/sec/watt | 0.95 | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A10
EfficientNet-B0 | 128 | 13,900 images/sec | 93 images/sec/watt | 9.21 | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A10
EfficientNet-B4 | 8 | 1,531 images/sec | 10 images/sec/watt | 5.23 | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A10
EfficientNet-B4 | 128 | 1,836 images/sec | 12 images/sec/watt | 69.71 | 1x A10 | GIGABYTE G482-Z52-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A10

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container

 

A2 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 2,610 images/sec | 43 images/sec/watt | 3.06 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A2
ResNet-50 | 19 | 2,901 images/sec | - | 6.55 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A2
ResNet-50 | 128 | 3,027 images/sec | 51 images/sec/watt | 42.28 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A2
ResNet-50v1.5 | 8 | 2,520 images/sec | 42 images/sec/watt | 3.17 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A2
ResNet-50v1.5 | 18 | 2,761 images/sec | - | 6.52 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A2
ResNet-50v1.5 | 128 | 2,939 images/sec | 49 images/sec/watt | 43.55 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A2
BERT-BASE | 8 | 1,131 sequences/sec | 19 sequences/sec/watt | 7.07 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A2
BERT-BASE | 128 | 1,194 sequences/sec | 20 sequences/sec/watt | 107.17 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A2
BERT-LARGE | 8 | 329 sequences/sec | 5 sequences/sec/watt | 24.35 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A2
BERT-LARGE | 128 | 364 sequences/sec | 6 sequences/sec/watt | 351.87 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A2
EfficientNet-B0 | 8 | 3,054 images/sec | 59 images/sec/watt | 2.62 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A2
EfficientNet-B0 | 128 | 3,924 images/sec | 66 images/sec/watt | 32.62 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A2
EfficientNet-B4 | 8 | 469 images/sec | 8 images/sec/watt | 17.04 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | A2
EfficientNet-B4 | 128 | 519 images/sec | 9 images/sec/watt | 246.49 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | A2

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power

 

T4 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 3,901 images/sec | 56 images/sec/watt | 2.05 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | NVIDIA T4
ResNet-50 | 30 | 4,562 images/sec | - | 6.58 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | NVIDIA T4
ResNet-50 | 128 | 4,966 images/sec | 71 images/sec/watt | 25.78 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | NVIDIA T4
ResNet-50v1.5 | 8 | 3,639 images/sec | 52 images/sec/watt | 2.2 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | NVIDIA T4
ResNet-50v1.5 | 27 | 4,213 images/sec | - | 6.65 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.08-py3 | INT8 | Synthetic | TensorRT 8.4.2 | NVIDIA T4
ResNet-50v1.5 | 128 | 4,918 images/sec | 70 images/sec/watt | 26.03 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | NVIDIA T4
BERT-BASE | 1, 2 | For Batch Sizes 1 and 2, refer to the Triton Inference Server results above
BERT-BASE | 8 | 1,681 sequences/sec | 24 sequences/sec/watt | 4.76 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | NVIDIA T4
BERT-BASE | 128 | 1,856 sequences/sec | 27 sequences/sec/watt | 69 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | NVIDIA T4
BERT-LARGE | 1, 2 | For Batch Sizes 1 and 2, refer to the Triton Inference Server results above
BERT-LARGE | 8 | 559 sequences/sec | 8 sequences/sec/watt | 14.3 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | NVIDIA T4
BERT-LARGE | 128 | 535 sequences/sec | 8 sequences/sec/watt | 239.34 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | NVIDIA T4
EfficientNet-B0 | 8 | 4,812 images/sec | 69 images/sec/watt | 1.66 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | NVIDIA T4
EfficientNet-B0 | 128 | 6,313 images/sec | 91 images/sec/watt | 20.28 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | NVIDIA T4
EfficientNet-B4 | 8 | 793 images/sec | 11 images/sec/watt | 10.08 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | NVIDIA T4
EfficientNet-B4 | 128 | 852 images/sec | 12 images/sec/watt | 150.19 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | NVIDIA T4

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container



V100 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 4,426 images/sec | 16 images/sec/watt | 1.81 | 1x V100 | DGX-2 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | V100-SXM3-32GB
ResNet-50 | 128 | 7,889 images/sec | 23 images/sec/watt | 16.23 | 1x V100 | DGX-2 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | V100-SXM3-32GB
ResNet-50v1.5 | 8 | 4,320 images/sec | 14 images/sec/watt | 1.85 | 1x V100 | DGX-2 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | V100-SXM3-32GB
ResNet-50v1.5 | 128 | 7,502 images/sec | 22 images/sec/watt | 17.06 | 1x V100 | DGX-2 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | V100-SXM3-32GB
BERT-BASE | 1, 2 | For Batch Sizes 1 and 2, refer to the Triton Inference Server results above
BERT-BASE | 8 | 2,369 sequences/sec | 7 sequences/sec/watt | 3.38 | 1x V100 | DGX-2 | 22.09-py3 | Mixed | Synthetic | TensorRT 8.5.0 | V100-SXM3-32GB
BERT-BASE | 128 | 3,186 sequences/sec | 10 sequences/sec/watt | 40.17 | 1x V100 | DGX-2 | 22.09-py3 | Mixed | Synthetic | TensorRT 8.5.0 | V100-SXM3-32GB
BERT-LARGE | 1, 2 | For Batch Sizes 1 and 2, refer to the Triton Inference Server results above
BERT-LARGE | 8 | 802 sequences/sec | 2 sequences/sec/watt | 9.97 | 1x V100 | DGX-2 | 22.09-py3 | Mixed | Synthetic | TensorRT 8.5.0 | V100-SXM3-32GB
BERT-LARGE | 128 | 970 sequences/sec | 3 sequences/sec/watt | 131.9 | 1x V100 | DGX-2 | 22.09-py3 | Mixed | Synthetic | TensorRT 8.5.0 | V100-SXM3-32GB
EfficientNet-B0 | 8 | 4,567 images/sec | 22 images/sec/watt | 1.75 | 1x V100 | DGX-2 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | V100-SXM3-32GB
EfficientNet-B0 | 128 | 9,390 images/sec | 28 images/sec/watt | 13.63 | 1x V100 | DGX-2 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | V100-SXM3-32GB
EfficientNet-B4 | 8 | 931 images/sec | 3 images/sec/watt | 8.59 | 1x V100 | DGX-2 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | V100-SXM3-32GB
EfficientNet-B4 | 128 | 1,290 images/sec | 4 images/sec/watt | 99.21 | 1x V100 | DGX-2 | 22.09-py3 | INT8 | Synthetic | TensorRT 8.5.0 | V100-SXM3-32GB

Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container


Inference Performance of NVIDIA GPU on Cloud

Benchmarks are reproducible by following the links to the scripts in the NGC catalog

A100 Inference Performance on Cloud

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 11,495 images/sec | - | 0.7 | 1x A100 | GCP A2-HIGHGPU-1G | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | A100-SXM4-40GB |
| | 128 | 28,222 images/sec | - | 4.54 | 1x A100 | GCP A2-HIGHGPU-1G | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | A100-SXM4-40GB |
| | 8 | 11,288 images/sec | - | 0.71 | 1x A100 | AWS EC2 p4d.24xlarge | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | A100-SXM4-40GB |
| | 128 | 28,211 images/sec | - | 4.54 | 1x A100 | AWS EC2 p4d.24xlarge | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | A100-SXM4-40GB |
| | 8 | 11,334 images/sec | - | 0.71 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.08-py3 | INT8 | Synthetic | - | A100-SXM4-80GB |
| | 128 | 29,613 images/sec | - | 4.32 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.08-py3 | INT8 | Synthetic | - | A100-SXM4-40GB |
| BERT-LARGE | 8 | 2,569 sequences/sec | - | 3.11 | 1x A100 | AWS EC2 p4d.24xlarge | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | A100-SXM4-40GB |
| | 128 | 5,008 sequences/sec | - | 25.56 | 1x A100 | AWS EC2 p4d.24xlarge | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | A100-SXM4-40GB |
| | 8 | 2,698 sequences/sec | - | 2.96 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.08-py3 | INT8 | Synthetic | - | A100-SXM4-80GB |
| | 128 | 4,907 sequences/sec | - | 26.09 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.08-py3 | INT8 | Synthetic | - | A100-SXM4-80GB |

BERT-Large: Sequence Length = 128

T4 Inference Performance on Cloud

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 3,351 images/sec | - | 2.39 | 1x T4 | GCP N1-HIGHMEM-8 | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | NVIDIA T4 |
| | 128 | 3,885 images/sec | - | 32.95 | 1x T4 | GCP N1-HIGHMEM-8 | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | NVIDIA T4 |
| ResNet-50v1.5 | 8 | 3,308 images/sec | - | 2.42 | 1x T4 | AWS EC2 g4dn.4xlarge | 22.06-py3 | INT8 | Synthetic | TensorRT 8.2.5 | NVIDIA T4 |
| | 128 | 4,143 images/sec | - | 30.89 | 1x T4 | AWS EC2 g4dn.4xlarge | 22.06-py3 | INT8 | Synthetic | TensorRT 8.2.5 | NVIDIA T4 |
| BERT-LARGE | 8 | 475 sequences/sec | - | 16.83 | 1x T4 | GCP N1-HIGHMEM-8 | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | NVIDIA T4 |
| | 128 | 430 sequences/sec | - | 297.77 | 1x T4 | GCP N1-HIGHMEM-8 | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | NVIDIA T4 |


V100 Inference Performance on Cloud

| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 4,257 images/sec | - | 1.88 | 1x V100 | GCP N1-HIGHMEM-8 | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | V100-SXM2-16GB |
| | 128 | 7,360 images/sec | - | 17.39 | 1x V100 | GCP N1-HIGHMEM-8 | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | V100-SXM2-16GB |
| | 8 | 3,824 images/sec | - | 2.09 | 1x V100 | Azure Standard_NC6s_v3 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM2-16GB |
| | 128 | 7,043 images/sec | - | 18.17 | 1x V100 | Azure Standard_NC6s_v3 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM2-16GB |
| BERT-LARGE | 8 | 684 sequences/sec | - | 11.69 | 1x V100 | GCP N1-HIGHMEM-8 | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | V100-SXM2-16GB |
| | 128 | 915 sequences/sec | - | 139.89 | 1x V100 | GCP N1-HIGHMEM-8 | 22.07-py3 | INT8 | Synthetic | TensorRT 8.4.1 | V100-SXM2-16GB |
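For these single-GPU runs, the reported latency is batch size divided by throughput, i.e. the time to process one full batch. A quick sanity check against the V100 ResNet-50v1.5 rows above (the helper function is ours, not part of any benchmark script):

```python
def latency_ms(batch_size: int, throughput_per_sec: float) -> float:
    """Time to process one batch, in milliseconds."""
    return batch_size / throughput_per_sec * 1000.0

# V100 on GCP N1-HIGHMEM-8, ResNet-50v1.5 (rows above):
print(round(latency_ms(8, 4257), 2))    # 1.88 ms
print(round(latency_ms(128, 7360), 2))  # 17.39 ms
```

Both values match the Latency (ms) column, which confirms how the column is derived.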

Conversational AI

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.

Related Resources

Download and get started with NVIDIA Riva.


Riva Benchmarks

A100 ASR Benchmarks

A100 Best Streaming Throughput Mode (800 ms chunk)
| Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| citrinet | n-gram | 1 | 11.4 | 1 | A100 SXM4-40GB |
| citrinet | n-gram | 64 | 64.1 | 64 | A100 SXM4-40GB |
| citrinet | n-gram | 128 | 103 | 126 | A100 SXM4-40GB |
| citrinet | n-gram | 256 | 166.7 | 250 | A100 SXM4-40GB |
| citrinet | n-gram | 384 | 235 | 371 | A100 SXM4-40GB |
| citrinet | n-gram | 512 | 311 | 490 | A100 SXM4-40GB |
| citrinet | n-gram | 768 | 492 | 717 | A100 SXM4-40GB |
| conformer | n-gram | 1 | 16.8 | 1 | A100 SXM4-40GB |
| conformer | n-gram | 64 | 109 | 64 | A100 SXM4-40GB |
| conformer | n-gram | 128 | 130 | 126 | A100 SXM4-40GB |
| conformer | n-gram | 256 | 236 | 249 | A100 SXM4-40GB |
| conformer | n-gram | 384 | 342 | 369 | A100 SXM4-40GB |
| conformer | n-gram | 512 | 485 | 486 | A100 SXM4-40GB |

A100 Best Streaming Latency Mode (160 ms chunk)
| Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| citrinet | n-gram | 1 | 10.47 | 1 | A100 SXM4-40GB |
| citrinet | n-gram | 8 | 15.14 | 8 | A100 SXM4-40GB |
| citrinet | n-gram | 16 | 26.2 | 16 | A100 SXM4-40GB |
| citrinet | n-gram | 32 | 39.1 | 32 | A100 SXM4-40GB |
| citrinet | n-gram | 48 | 48 | 48 | A100 SXM4-40GB |
| citrinet | n-gram | 64 | 55.4 | 64 | A100 SXM4-40GB |
| conformer | n-gram | 1 | 14.69 | 1 | A100 SXM4-40GB |
| conformer | n-gram | 8 | 37.7 | 8 | A100 SXM4-40GB |
| conformer | n-gram | 16 | 41.5 | 16 | A100 SXM4-40GB |
| conformer | n-gram | 32 | 55.7 | 32 | A100 SXM4-40GB |
| conformer | n-gram | 48 | 66.8 | 48 | A100 SXM4-40GB |
| conformer | n-gram | 64 | 82.2 | 63 | A100 SXM4-40GB |

A100 Offline Mode (1600 ms chunk)
| Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| citrinet | n-gram | 32 | 4390 | A100 SXM4-40GB |
| conformer | n-gram | 32 | 1700 | A100 SXM4-40GB |

ASR Throughput (RTFX) - Number of seconds of audio processed per second | Riva version: v2.6.0 | ASR Dataset - Librispeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz
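As the footnote states, RTFX is an aggregate real-time factor: total seconds of audio processed per second of wall-clock time, summed across all concurrent streams. A minimal sketch of that ratio (the stream count and timing below are illustrative, chosen to line up with the 256-stream citrinet row, not measured values):

```python
def rtfx(audio_seconds_processed: float, wall_clock_seconds: float) -> float:
    """Aggregate real-time factor across all concurrent streams."""
    return audio_seconds_processed / wall_clock_seconds

# 256 real-time streams, each carrying 60 s of audio, finished in
# ~61.4 s of wall-clock time -> RTFX of about 250.
print(round(rtfx(256 * 60, 61.4)))  # 250
```

An RTFX roughly equal to the stream count means every stream is keeping up with real time.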

A30 ASR Benchmarks

A30 Best Streaming Throughput Mode (800 ms chunk)
| Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| citrinet | n-gram | 1 | 14.64 | 1 | A30 |
| citrinet | n-gram | 64 | 101 | 63 | A30 |
| citrinet | n-gram | 128 | 152 | 126 | A30 |
| citrinet | n-gram | 256 | 272 | 249 | A30 |
| citrinet | n-gram | 384 | 393 | 368 | A30 |
| citrinet | n-gram | 512 | 569 | 484 | A30 |
| conformer | n-gram | 1 | 21.76 | 1 | A30 |
| conformer | n-gram | 64 | 134 | 63 | A30 |
| conformer | n-gram | 128 | 216 | 126 | A30 |
| conformer | n-gram | 256 | 397 | 248 | A30 |
| conformer | n-gram | 384 | 672 | 364 | A30 |

A30 Best Streaming Latency Mode (160 ms chunk)
| Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| citrinet | n-gram | 1 | 13.74 | 1 | A30 |
| citrinet | n-gram | 8 | 29.4 | 8 | A30 |
| citrinet | n-gram | 16 | 44.2 | 16 | A30 |
| citrinet | n-gram | 32 | 58.7 | 32 | A30 |
| citrinet | n-gram | 48 | 65.8 | 48 | A30 |
| citrinet | n-gram | 64 | 83 | 63 | A30 |
| conformer | n-gram | 1 | 20.32 | 1 | A30 |
| conformer | n-gram | 8 | 42.2 | 8 | A30 |
| conformer | n-gram | 16 | 51.5 | 16 | A30 |
| conformer | n-gram | 32 | 71.3 | 32 | A30 |
| conformer | n-gram | 48 | 103.9 | 48 | A30 |
| conformer | n-gram | 64 | 126.8 | 63 | A30 |

A30 Offline Mode (1600 ms chunk)
| Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| citrinet | n-gram | 32 | 3142 | A30 |
| conformer | n-gram | 32 | 1120 | A30 |

ASR Throughput (RTFX) - Number of seconds of audio processed per second | Riva version: v2.6.0 | ASR Dataset - Librispeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz

A10 ASR Benchmarks

A10 Best Streaming Throughput Mode (800 ms chunk)
| Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| citrinet | n-gram | 1 | 12.93 | 1 | A10 |
| citrinet | n-gram | 64 | 88.5 | 64 | A10 |
| citrinet | n-gram | 128 | 162.6 | 126 | A10 |
| citrinet | n-gram | 256 | 316 | 248 | A10 |
| citrinet | n-gram | 384 | 486 | 367 | A10 |
| citrinet | n-gram | 512 | 710 | 481 | A10 |
| conformer | n-gram | 1 | 15.33 | 1 | A10 |
| conformer | n-gram | 64 | 133 | 63 | A10 |
| conformer | n-gram | 128 | 234 | 126 | A10 |
| conformer | n-gram | 256 | 434 | 247 | A10 |

A10 Best Streaming Latency Mode (160 ms chunk)
| Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| citrinet | n-gram | 1 | 10.405 | 1 | A10 |
| citrinet | n-gram | 8 | 20.22 | 8 | A10 |
| citrinet | n-gram | 16 | 29.8 | 16 | A10 |
| citrinet | n-gram | 32 | 49.1 | 32 | A10 |
| citrinet | n-gram | 48 | 67.6 | 48 | A10 |
| citrinet | n-gram | 64 | 84.7 | 63 | A10 |
| conformer | n-gram | 1 | 13.49 | 1 | A10 |
| conformer | n-gram | 8 | 33.8 | 8 | A10 |
| conformer | n-gram | 16 | 40.9 | 16 | A10 |
| conformer | n-gram | 32 | 71.5 | 32 | A10 |
| conformer | n-gram | 48 | 108 | 48 | A10 |
| conformer | n-gram | 64 | 140 | 63 | A10 |

A10 Offline Mode (1600 ms chunk)
| Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| citrinet | n-gram | 32 | 2719 | A10 |
| conformer | n-gram | 32 | 992 | A10 |

ASR Throughput (RTFX) - Number of seconds of audio processed per second | Riva version: v2.6.0 | ASR Dataset - Librispeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz

V100 ASR Benchmarks

V100 Best Streaming Throughput Mode (800 ms chunk)
| Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| citrinet | n-gram | 1 | 13.91 | 1 | V100 SXM2-16GB |
| citrinet | n-gram | 64 | 87.9 | 63 | V100 SXM2-16GB |
| citrinet | n-gram | 128 | 153 | 125 | V100 SXM2-16GB |
| citrinet | n-gram | 256 | 283.7 | 246 | V100 SXM2-16GB |
| citrinet | n-gram | 384 | 407 | 363 | V100 SXM2-16GB |
| citrinet | n-gram | 512 | 590 | 474 | V100 SXM2-16GB |
| conformer | n-gram | 1 | 22.3 | 1 | V100 SXM2-16GB |
| conformer | n-gram | 64 | 153 | 63 | V100 SXM2-16GB |
| conformer | n-gram | 128 | 230.6 | 125 | V100 SXM2-16GB |
| conformer | n-gram | 256 | 400 | 245 | V100 SXM2-16GB |
| conformer | n-gram | 384 | 716 | 359 | V100 SXM2-16GB |

V100 Best Streaming Latency Mode (160 ms chunk)
| Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| citrinet | n-gram | 1 | 13.26 | 1 | V100 SXM2-16GB |
| citrinet | n-gram | 8 | 24.32 | 8 | V100 SXM2-16GB |
| citrinet | n-gram | 16 | 32.9 | 16 | V100 SXM2-16GB |
| citrinet | n-gram | 32 | 50.5 | 32 | V100 SXM2-16GB |
| citrinet | n-gram | 48 | 65 | 48 | V100 SXM2-16GB |
| citrinet | n-gram | 64 | 84.1 | 63 | V100 SXM2-16GB |
| conformer | n-gram | 1 | 19.7 | 1 | V100 SXM2-16GB |
| conformer | n-gram | 8 | 55 | 8 | V100 SXM2-16GB |
| conformer | n-gram | 16 | 52.3 | 16 | V100 SXM2-16GB |
| conformer | n-gram | 32 | 76.7 | 32 | V100 SXM2-16GB |
| conformer | n-gram | 48 | 119.8 | 47 | V100 SXM2-16GB |
| conformer | n-gram | 64 | 143 | 63 | V100 SXM2-16GB |

V100 Offline Mode (1600 ms chunk)
| Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| citrinet | n-gram | 32 | 2693 | V100 SXM2-16GB |
| conformer | n-gram | 32 | 964 | V100 SXM2-16GB |

ASR Throughput (RTFX) - Number of seconds of audio processed per second | Riva version: v2.6.0 | ASR Dataset - Librispeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz



T4 ASR Benchmarks

T4 Best Streaming Throughput Mode (800 ms chunk)
| Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| citrinet | n-gram | 1 | 26.7 | 1 | NVIDIA T4 |
| citrinet | n-gram | 64 | 170.8 | 63 | NVIDIA T4 |
| citrinet | n-gram | 128 | 342 | 125 | NVIDIA T4 |
| citrinet | n-gram | 256 | 736 | 242 | NVIDIA T4 |
| conformer | n-gram | 1 | 59.1 | 1 | NVIDIA T4 |
| conformer | n-gram | 64 | 310 | 63 | NVIDIA T4 |
| conformer | n-gram | 128 | 505 | 124 | NVIDIA T4 |

T4 Best Streaming Latency Mode (160 ms chunk)
| Acoustic Model | Language Model | # of Streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| citrinet | n-gram | 1 | 25.9 | 1 | NVIDIA T4 |
| citrinet | n-gram | 8 | 57 | 8 | NVIDIA T4 |
| citrinet | n-gram | 16 | 60.5 | 16 | NVIDIA T4 |
| citrinet | n-gram | 32 | 93.1 | 32 | NVIDIA T4 |
| citrinet | n-gram | 48 | 139.7 | 47 | NVIDIA T4 |
| conformer | n-gram | 1 | 53.4 | 1 | NVIDIA T4 |
| conformer | n-gram | 8 | 82 | 8 | NVIDIA T4 |
| conformer | n-gram | 16 | 104.1 | 16 | NVIDIA T4 |
| conformer | n-gram | 32 | 239 | 32 | NVIDIA T4 |

T4 Offline Mode (1600 ms chunk)
| Acoustic Model | Language Model | # of Streams | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| citrinet | n-gram | 32 | 1322 | NVIDIA T4 |
| conformer | n-gram | 32 | 488 | NVIDIA T4 |

ASR Throughput (RTFX) - Number of seconds of audio processed per second | Riva version: v2.6.0 | ASR Dataset - Librispeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz

A100 TTS Benchmarks

| Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| FastPitch + Hifi-GAN | 1 | 0.021 | 0.003 | 145 | A100 SXM4-40GB |
| FastPitch + Hifi-GAN | 4 | 0.037 | 0.006 | 336 | A100 SXM4-40GB |
| FastPitch + Hifi-GAN | 6 | 0.046 | 0.007 | 395 | A100 SXM4-40GB |
| FastPitch + Hifi-GAN | 8 | 0.056 | 0.009 | 421 | A100 SXM4-40GB |
| FastPitch + Hifi-GAN | 10 | 0.059 | 0.01 | 434 | A100 SXM4-40GB |

TTS Throughput (RTFX) - Number of seconds of audio generated per second | Riva version: v2.6.0 | TTS Dataset - LJSpeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz
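For TTS the ratio runs the other way: RTFX is seconds of audio generated per second of wall-clock time, so values far above 1 mean synthesis runs much faster than playback. A sketch with hypothetical numbers (the durations below are made up for illustration):

```python
def tts_rtfx(audio_seconds_generated: float, wall_clock_seconds: float) -> float:
    """Seconds of speech synthesized per second of wall-clock time."""
    return audio_seconds_generated / wall_clock_seconds

# e.g. synthesizing 87 s of speech in 0.2 s of GPU time:
print(round(tts_rtfx(87.0, 0.2)))  # 435
```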

A30 TTS Benchmarks

| Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| FastPitch + Hifi-GAN | 1 | 0.022 | 0.004 | 127 | A30 |
| FastPitch + Hifi-GAN | 4 | 0.044 | 0.007 | 267 | A30 |
| FastPitch + Hifi-GAN | 6 | 0.064 | 0.009 | 292 | A30 |
| FastPitch + Hifi-GAN | 8 | 0.082 | 0.011 | 310 | A30 |
| FastPitch + Hifi-GAN | 10 | 0.091 | 0.013 | 318 | A30 |

TTS Throughput (RTFX) - Number of seconds of audio generated per second | Riva version: v2.6.0 | TTS Dataset - LJSpeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz

A10 TTS Benchmarks

| Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| FastPitch + Hifi-GAN | 1 | 0.021 | 0.004 | 127 | A10 |
| FastPitch + Hifi-GAN | 4 | 0.049 | 0.008 | 235 | A10 |
| FastPitch + Hifi-GAN | 6 | 0.072 | 0.011 | 250 | A10 |
| FastPitch + Hifi-GAN | 8 | 0.096 | 0.014 | 256 | A10 |

TTS Throughput (RTFX) - Number of seconds of audio generated per second | Riva version: v2.6.0 | TTS Dataset - LJSpeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz

V100 TTS Benchmarks

| Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| FastPitch + Hifi-GAN | 1 | 0.024 | 0.005 | 104 | V100 SXM2-16GB |
| FastPitch + Hifi-GAN | 4 | 0.055 | 0.009 | 215 | V100 SXM2-16GB |
| FastPitch + Hifi-GAN | 6 | 0.08 | 0.012 | 227 | V100 SXM2-16GB |
| FastPitch + Hifi-GAN | 8 | 0.108 | 0.015 | 232 | V100 SXM2-16GB |
| FastPitch + Hifi-GAN | 10 | 0.119 | 0.018 | 235 | V100 SXM2-16GB |

TTS Throughput (RTFX) - Number of seconds of audio generated per second | Riva version: v2.6.0 | TTS Dataset - LJSpeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz



T4 TTS Benchmarks

| Model | # of Streams | Avg Latency to First Audio (sec) | Avg Latency Between Audio Chunks (sec) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| FastPitch + Hifi-GAN | 1 | 0.05 | 0.007 | 64 | NVIDIA T4 |
| FastPitch + Hifi-GAN | 4 | 0.096 | 0.016 | 121 | NVIDIA T4 |
| FastPitch + Hifi-GAN | 6 | 0.142 | 0.022 | 127 | NVIDIA T4 |
| FastPitch + Hifi-GAN | 8 | 0.188 | 0.028 | 132 | NVIDIA T4 |
| FastPitch + Hifi-GAN | 10 | 0.218 | 0.03 | 134 | NVIDIA T4 |

TTS Throughput (RTFX) - Number of seconds of audio generated per second | Riva version: v2.6.0 | TTS Dataset - LJSpeech | Hardware: DGX A100 (1x A100 SXM4-40GB) with EPYC 7742@2.25GHz, NVIDIA A30 with EPYC 7742@2.25GHz, NVIDIA A10 with EPYC 7763@2.45GHz, DGX-1 (1x V100-SXM2-16GB) with Xeon E5-2698@2.20GHz, and NVIDIA T4 with Gold 6240@2.60GHz

 

Last updated: November 9th, 2022