NVIDIA Data Center Deep Learning Product Performance
Reproducible Performance
Reproduce these results on your systems by following the instructions in the Measuring Training and Inferencing Performance on NVIDIA AI Platforms Reviewer’s Guide.
Related Resources
Read why training to convergence is essential for enterprise AI adoption.
Learn how cloud service providers and OEMs raise the bar on AI training with NVIDIA AI in MLPerf Training.
Access containers in the NVIDIA NGC™ catalog.
Learn how MLPerf Benchmarks show why AI is the future of HPC.
HPC Performance
Review the latest GPU-acceleration factors of popular HPC applications.
Training to Convergence
Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology to test whether AI systems are ready to be deployed in the field to deliver meaningful results.
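Time-to-train at a fixed quality target can be sketched as a simple loop: train until a validation metric reaches the target, then report the elapsed wall-clock time. The sketch below is illustrative only; `train_epoch`, `evaluate`, and the target value are placeholders, not part of any NVIDIA benchmark harness.

```python
import time

def train_to_convergence(train_epoch, evaluate, target_metric, max_epochs=100):
    """Run training epochs until evaluate() meets the target quality.

    Returns (epochs_run, minutes_elapsed). Raises if the target is never met,
    mirroring how a converged-training benchmark only counts runs that reach
    the required accuracy.
    """
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_epoch()
        if evaluate() >= target_metric:
            return epoch, (time.perf_counter() - start) / 60
    raise RuntimeError(f"target {target_metric} not reached in {max_epochs} epochs")
```

The same skeleton applies whether the quality target is Top-1 accuracy, an F1 score, or a BLEU score, as in the tables below.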
Related Resources
Read our blog on convergence for more details.
Get up and running quickly with NVIDIA’s complete solution stack:
Pull software containers from NVIDIA NGC.
Learn how NVIDIA A100 Tensor Core GPUs provide unprecedented acceleration at every scale, setting records in MLPerf.
NVIDIA Performance on MLPerf 1.1 Training Benchmarks
BERT Time to Train on A100
PyTorch | Precision: Mixed | Dataset: Wikipedia 2020/01/01 | Convergence criteria - refer to MLPerf requirements
MLPerf Training Performance
NVIDIA A100 Performance on MLPerf 1.1 AI Benchmarks - Closed Division
MLPerf™ v1.1 Training Closed: MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
NVIDIA A100 Performance on MLPerf 1.0 Training HPC Benchmarks: Strong Scaling - Closed Division
| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| MXNet | CosmoFlow | 8.04 | Mean average error 0.124 | 1,024x A100 | DGX A100 | 1.0-1120 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB |
| | | 25.78 | Mean average error 0.124 | 128x A100 | DGX A100 | 1.0-1121 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB |
| PyTorch | DeepCAM | 1.67 | IOU 0.82 | 2,048x A100 | DGX A100 | 1.0-1122 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB |
| | | 2.65 | IOU 0.82 | 512x A100 | DGX A100 | 1.0-1123 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB |
NVIDIA A100 Performance on MLPerf 1.0 Training HPC Benchmarks: Weak Scaling - Closed Division
| Framework | Network | Throughput | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| MXNet | CosmoFlow | 0.73 models/min | Mean average error 0.124 | 4,096x A100 | DGX A100 | 1.0-1131 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB |
| PyTorch | DeepCAM | 5.27 models/min | IOU 0.82 | 4,096x A100 | DGX A100 | 1.0-1132 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB |
MLPerf™ v1.0 Training HPC Closed: MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For MLPerf™ v1.0 Training HPC rules and guidelines, see the MLCommons website.
Converged Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100
Benchmarks are reproducible by following links to the NGC catalog scripts
A100 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 84 | 77.2 Top 1 | 23,377 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 192 | ImageNet2012 | A100-SXM4-80GB |
| PyTorch | 1.12.0a0 | SSD v1.1 | 46 | .26 mAP | 27,436 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 128 | COCO 2017 | A100-SXM4-80GB |
| | 1.12.0a0 | Tacotron2 | 108 | .56 Training Loss | 289,107 total output mels/sec | 8x A100 | DGX A100 | 22.04-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM4-40GB |
| | 1.12.0a0 | WaveGlow | 250 | -5.81 Training Loss | 1,709,596 output samples/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-80GB |
| | 1.12.0a0 | GNMT V2 | 16 | 24.11 BLEU Score | 936,798 total tokens/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-80GB |
| | 1.12.0a0 | NCF | 0.35 | .96 Hit Rate at 10 | 160,016,529 samples/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 131072 | MovieLens 20M | A100-SXM4-80GB |
| | 1.12.0a0 | BERT-LARGE | 3 | 90.51 F1 | 1,022 sequences/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| | 1.12.0a0 | Transformer-XL Base | 186 | 22.42 Perplexity | 707,615 total tokens/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 128 | WikiText-103 | A100-SXM4-40GB |
| | 1.11.0a0 | BERT-Large Pre-Training P1 | 1,781 | - | 4,591 sequences/sec | 8x A100 | DGX A100 | 21.11-py3 | Mixed | - | Wikipedia | A100-SXM4-80GB |
| | 1.11.0a0 | BERT-Large Pre-Training P2 | 653 | 1.24 Final Loss | 1,657 sequences/sec | 8x A100 | DGX A100 | 21.11-py3 | Mixed | - | Wikipedia | A100-SXM4-80GB |
| | 1.11.0a0 | BERT-Large Pre-Training E2E | 2,434 | 1.24 Final Loss | - | 8x A100 | DGX A100 | 21.11-py3 | Mixed | - | Wikipedia | A100-SXM4-80GB |
| | 1.12.0a0 | EfficientNet-B0 | 576 | 76.54 Top 1 | 16,016 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-40GB |
| | 1.12.0a0 | EfficientNet-B4 | 1,341 | 78.06 Top 1 | 6,974 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 128 | ImageNet2012 | A100-SXM4-80GB |
| | 1.12.0a0 | EfficientDet-D0 | 454 | .34 BBOX mAP | 1,990 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 150 | COCO 2017 | A100-SXM4-80GB |
| | 1.12.0a0 | EfficientNet-WideSE-B0 | 575 | 76.89 Top 1 | 15,489 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
| | 1.12.0a0 | EfficientNet-WideSE-B4 | 1,252 | 78.06 Top 1 | 6,956 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 128 | ImageNet2012 | A100-SXM4-80GB |
| | 1.12.0a0 | SE3 Transformer | 9 | .04 MAE | 21,658 molecules/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 240 | Quantum Machines 9 | A100-SXM4-80GB |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 95 | 76.92 Top 1 | 20,614 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
| | 1.15.5 | ResNeXt101 | 188 | 79.19 Top 1 | 10,300 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
| | 1.15.5 | SE-ResNeXt101 | 216 | 79.58 Top 1 | 8,977 images/sec | 8x A100 | DGX A100 | 22.03-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
| | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 1,062 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 2 | DAGM2007 | A100-SXM4-80GB |
| | 2.8.0 | U-Net Medical | 2 | .89 DICE Score | 1,030 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM4-80GB |
| | 2.8.0 | Electra Base Fine Tuning | 3 | 92.5 F1 | 2,820 sequences/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| | 2.8.0 | EfficientNet-B0 | 544 | 76.36 Top 1 | 19,751 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 1024 | ImageNet2012 | A100-SXM4-80GB |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
BERT Large Pre-Training accuracy and time to train data for 8-GPU configuration estimated with measured data from 32-node (256-GPU) end-to-end run | Sequence Length for Phase 1 = 128 and Phase 2 = 512 | Global Batch Size for Phase 1 = 65,536 and Phase 2 = 32,768
Starting from 21.09-py3, ECC is enabled
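The Phase 1 and Phase 2 global batch sizes in the footnote above determine how many micro-batches each GPU accumulates before an optimizer step. As a worked example (the per-GPU micro-batch sizes of 64 and 16 are assumptions for illustration, not stated in the footnote):

```python
def grad_accumulation_steps(global_batch, num_gpus, per_gpu_batch):
    """Micro-batches each GPU must accumulate so one optimizer step
    sees the full global batch."""
    assert global_batch % (num_gpus * per_gpu_batch) == 0
    return global_batch // (num_gpus * per_gpu_batch)

# Phase 1: global batch 65,536 on 256 GPUs, assumed micro-batch of 64 -> 4 steps
phase1_steps = grad_accumulation_steps(65536, 256, 64)
# Phase 2: global batch 32,768 on 256 GPUs, assumed micro-batch of 16 -> 8 steps
phase2_steps = grad_accumulation_steps(32768, 256, 16)
```

The same arithmetic explains why the estimated 8-GPU run needs a much larger accumulation factor to keep the global batch size fixed.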
A40 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 195 | 77.23 Top 1 | 10,100 images/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.04-py3 | Mixed | 192 | ImageNet2012 | A40 |
| PyTorch | 1.12.0a0 | NCF | 1 | .96 Hit Rate at 10 | 48,771,822 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 131072 | MovieLens 20M | A40 |
| | 1.12.0a0 | BERT-LARGE | 8 | 90.61 F1 | 391 sequences/sec | 8x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 32 | SQuAD v1.1 | A40 |
| | 1.11.0a0 | Tacotron2 | 117 | .55 Training Loss | 263,709 total output mels/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.02-py3 | Mixed | 128 | LJSpeech 1.1 | A40 |
| | 1.12.0a0 | WaveGlow | 468 | -5.76 Training Loss | 901,085 output samples/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A40 |
| | 1.12.0a0 | GNMT v2 | 45 | 24.12 BLEU Score | 324,767 total tokens/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.04-py3 | Mixed | 128 | wmt16-en-de | A40 |
| | 1.11.0a0 | Transformer-XL Large | 927 | 18.52 Perplexity | 89,138 total tokens/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.01-py3 | Mixed | 16 | WikiText-103 | A40 |
| | 1.12.0a0 | Transformer-XL Base | 424 | 22.5 Perplexity | 311,204 total tokens/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.04-py3 | Mixed | 128 | WikiText-103 | A40 |
| | 1.12.0a0 | EfficientNet-B0 | 875 | 76.44 Top 1 | 10,150 images/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.04-py3 | Mixed | 256 | ImageNet2012 | A40 |
| | 1.12.0a0 | EfficientNet-B4 | 539 | 78.67 Top 1 | 3,649 images/sec | 8x A40 | GIGABYTE G482-Z52-00 | 22.03-py3 | Mixed | 128 | ImageNet2012 | A40 |
| | 1.12.0a0 | EfficientDet-D0 | 640 | .34 BBOX mAP | 1,266 images/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.04-py3 | Mixed | 60 | COCO 2017 | A40 |
| | 1.12.0a0 | SE3 Transformer | 13 | .04 MAE | 13,786 molecules/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.04-py3 | Mixed | 240 | Quantum Machines 9 | A40 |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 214 | 76.84 Top 1 | 9,004 images/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.04-py3 | Mixed | 256 | ImageNet2012 | A40 |
| | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 660 images/sec | 8x A40 | GIGABYTE G482-Z52-00 | 22.02-py3 | Mixed | 2 | DAGM2007 | A40 |
| | 1.15.5 | ResNeXt101 | 424 | 79.18 Top 1 | 4,541 images/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.04-py3 | Mixed | 256 | ImageNet2012 | A40 |
| | 1.15.5 | SE-ResNeXt101 | 471 | 79.61 Top 1 | 4,093 images/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.01-py3 | Mixed | 256 | ImageNet2012 | A40 |
| | 2.8.0 | Electra Base Fine Tuning | 4 | 92.46 F1 | 1,133 sequences/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.04-py3 | Mixed | 32 | SQuAD v1.1 | A40 |
Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
A30 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 175 | 77.28 Top 1 | 11,175 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 192 | ImageNet2012 | A30 |
| PyTorch | 1.12.0a0 | Tacotron2 | 133 | .52 Training Loss | 233,210 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 104 | LJSpeech 1.1 | A30 |
| | 1.12.0a0 | WaveGlow | 450 | -5.8 Training Loss | 939,042 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A30 |
| | 1.12.0a0 | GNMT V2 | 45 | 24.17 BLEU Score | 322,906 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | wmt16-en-de | A30 |
| | 1.12.0a0 | NCF | 1 | .96 Hit Rate at 10 | 58,315,718 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 131072 | MovieLens 20M | A30 |
| | 1.12.0a0 | BERT-LARGE | 10 | 90.42 F1 | 291 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 10 | SQuAD v1.1 | A30 |
| | 1.11.0a0 | ResNeXt101 | 521 | 79.6 Top 1 | 3,783 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.02-py3 | Mixed | 112 | ImageNet2012 | A30 |
| | 1.11.0a0 | FastPitch | 299 | 2.7 Training Loss | 267,992 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.01-py3 | Mixed | 32 | LJSpeech 1.1 | A30 |
| | 1.12.0a0 | EfficientNet-B0 | 908 | 76.4 Top 1 | 9,667 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | ImageNet2012 | A30 |
| | 1.12.0a0 | EfficientNet-B4 | 831 | 78.48 Top 1 | 2,365 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.03-py3 | Mixed | 32 | ImageNet2012 | A30 |
| | 1.12.0a0 | EfficientDet-D0 | 768 | .34 BBOX mAP | 956 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 30 | COCO 2017 | A30 |
| | 1.12.0a0 | SE3 Transformer | 12 | .04 MAE | 15,418 molecules/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 240 | Quantum Machines 9 | A30 |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 197 | 76.98 Top 1 | 9,764 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 256 | ImageNet2012 | A30 |
| | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 681 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 2 | DAGM2007 | A30 |
| | 2.8.0 | U-Net Medical | 2 | .89 DICE Score | 481 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 8 | EM segmentation challenge | A30 |
| | 1.15.5 | ResNeXt101 | 459 | 79.33 Top 1 | 4,213 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | ImageNet2012 | A30 |
| | 1.15.5 | SE-ResNeXt101 | 564 | 79.79 Top 1 | 3,432 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.01-py3 | Mixed | 96 | ImageNet2012 | A30 |
| | 2.8.0 | Electra Base Fine Tuning | 5 | 92.75 F1 | 989 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 16 | SQuAD v1.1 | A30 |
Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
A10 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 251 | 77.28 Top 1 | 7,761 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 192 | ImageNet2012 | A10 |
| PyTorch | 1.12.0a0 | Tacotron2 | 143 | .53 Training Loss | 216,719 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 104 | LJSpeech 1.1 | A10 |
| | 1.12.0a0 | WaveGlow | 546 | -5.86 Training Loss | 771,407 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A10 |
| | 1.11.0a0 | SSD v1.1 | 107 | .19 mAP | 9,498 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.02-py3 | Mixed | 64 | COCO 2017 | A10 |
| | 1.12.0a0 | GNMT V2 | 52 | 24.25 BLEU Score | 279,952 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | wmt16-en-de | A10 |
| | 1.12.0a0 | NCF | 1 | .96 Hit Rate at 10 | 45,600,484 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 131072 | MovieLens 20M | A10 |
| | 1.12.0a0 | BERT-LARGE | 14 | 91.16 F1 | 215 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.03-py3 | Mixed | 10 | SQuAD v1.1 | A10 |
| | 1.12.0a0 | EfficientNet-B0 | 1,117 | 76.3 Top 1 | 7,885 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | ImageNet2012 | A10 |
| | 1.12.0a0 | EfficientNet-B4 | 874 | 78.19 Top 1 | 2,231 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.03-py3 | Mixed | 32 | ImageNet2012 | A10 |
| | 1.12.0a0 | EfficientDet-D0 | 790 | .34 BBOX mAP | 923 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 30 | COCO 2017 | A10 |
| | 1.12.0a0 | EfficientNet-WideSE-B0 | 1,143 | 76.78 Top 1 | 7,729 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | ImageNet2012 | A10 |
| | 1.12.0a0 | SE3 Transformer | 15 | .04 MAE | 12,096 molecules/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 240 | Quantum Machines 9 | A10 |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 262 | 76.86 Top 1 | 7,356 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 256 | ImageNet2012 | A10 |
| | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 612 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.03-py3 | Mixed | 2 | DAGM2007 | A10 |
| | 2.8.0 | U-Net Medical | 3 | .89 DICE Score | 370 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 8 | EM segmentation challenge | A10 |
| | 1.15.5 | ResNeXt101 | 573 | 79.16 Top 1 | 3,365 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | ImageNet2012 | A10 |
| | 1.15.5 | SE-ResNeXt101 | 709 | 79.85 Top 1 | 2,740 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.03-py3 | Mixed | 96 | ImageNet2012 | A10 |
| | 2.8.0 | Electra Base Fine Tuning | 5 | 92.57 F1 | 791 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 16 | SQuAD v1.1 | A10 |
Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
T4 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.12.0a0 | ResNeXt101 | 1,382 | 79.66 Top 1 | 1,429 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 112 | ImageNet2012 | NVIDIA T4 |
| | 1.12.0a0 | WaveGlow | 1,120 | -5.82 Training Loss | 387,032 output samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.03-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4 |
| | 1.12.0a0 | GNMT V2 | 93 | 24.22 BLEU Score | 155,176 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4 |
| | 1.12.0a0 | NCF | 2 | .96 Hit Rate at 10 | 26,370,536 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA T4 |
| | 1.12.0a0 | BERT-LARGE | 24 | 90.69 F1 | 125 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 10 | SQuAD v1.1 | NVIDIA T4 |
| | 1.12.0a0 | EfficientNet-B0 | 2,371 | 76.43 Top 1 | 3,702 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 128 | ImageNet2012 | NVIDIA T4 |
| | 1.12.0a0 | EfficientNet-B4 | 1,577 | 78.55 Top 1 | 1,252 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.03-py3 | Mixed | 32 | ImageNet2012 | NVIDIA T4 |
| | 1.12.0a0 | EfficientDet-D0 | 1,349 | .34 BBOX mAP | 506 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 30 | COCO 2017 | NVIDIA T4 |
| | 1.12.0a0 | EfficientNet-WideSE-B0 | 2,480 | 76.67 Top 1 | 3,567 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 128 | ImageNet2012 | NVIDIA T4 |
| | 1.12.0a0 | SE3 Transformer | 38 | .04 MAE | 4,646 molecules/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 240 | Quantum Machines 9 | NVIDIA T4 |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 585 | 77.06 Top 1 | 3,288 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4 |
| | 1.15.5 | U-Net Industrial | 2 | .99 IoU Threshold 0.99 | 298 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 2 | DAGM2007 | NVIDIA T4 |
| | 1.15.5 | U-Net Medical | 38 | .89 DICE Score | 151 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4 |
| | 1.15.5 | ResNeXt101 | 1,257 | 79.38 Top 1 | 1,533 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 128 | ImageNet2012 | NVIDIA T4 |
| | 1.15.5 | SE-ResNeXt101 | 1,626 | 79.72 Top 1 | 1,197 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.03-py3 | Mixed | 96 | ImageNet2012 | NVIDIA T4 |
| | 2.8.0 | Electra Base Fine Tuning | 10 | 92.54 F1 | 376 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 16 | SQuAD v1.1 | NVIDIA T4 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
V100 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 166 | 77.26 Top 1 | 12,042 images/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB |
| PyTorch | 1.12.0a0 | Tacotron2 | 181 | .53 Training Loss | 180,095 total output mels/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB |
| | 1.12.0a0 | WaveGlow | 411 | -5.72 Training Loss | 1,035,406 output samples/sec | 8x V100 | DGX-2 | 22.03-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB |
| | 1.12.0a0 | GNMT V2 | 34 | 24.17 BLEU Score | 437,690 total tokens/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB |
| | 1.12.0a0 | NCF | 1 | .96 Hit Rate at 10 | 100,288,565 samples/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 131072 | MovieLens 20M | V100-SXM3-32GB |
| | 1.12.0a0 | BERT-LARGE | 8 | 90.61 F1 | 385 sequences/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB |
| | 1.12.0a0 | EfficientNet-B0 | 1,028 | 76.47 Top 1 | 8,709 images/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB |
| | 1.12.0a0 | EfficientNet-B4 | 603 | 78.65 Top 1 | 3,320 images/sec | 8x V100 | DGX-2 | 22.03-py3 | Mixed | 64 | ImageNet2012 | V100-SXM3-32GB |
| | 1.12.0a0 | EfficientDet-D0 | 1,236 | .34 BBOX mAP | 576 images/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 60 | COCO 2017 | V100-SXM3-32GB |
| | 1.12.0a0 | EfficientNet-WideSE-B0 | 1,024 | 76.97 Top 1 | 8,737 images/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB |
| | 1.12.0a0 | SE3 Transformer | 14 | .04 MAE | 13,114 molecules/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 240 | Quantum Machines 9 | V100-SXM3-32GB |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 187 | 76.99 Top 1 | 10,316 images/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB |
| | 1.15.5 | ResNeXt101 | 419 | 79.4 Top 1 | 4,622 images/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 128 | ImageNet2012 | V100-SXM3-32GB |
| | 1.15.5 | SE-ResNeXt101 | 499 | 79.63 Top 1 | 3,894 images/sec | 8x V100 | DGX-2 | 22.03-py3 | Mixed | 96 | ImageNet2012 | V100-SXM3-32GB |
| | 1.15.5 | U-Net Medical | 12 | .89 DICE Score | 461 images/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB |
| | 2.8.0 | Wide and Deep | 9 | .66 MAP at 12 | 2,921,693 samples/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | V100-SXM3-32GB |
| | 2.8.0 | Electra Base Fine Tuning | 4 | 92.35 F1 | 1,375 sequences/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 32 | SQuAD v1.1 | V100-SXM3-32GB |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
Converged Training Performance of NVIDIA GPU on Cloud
Benchmarks are reproducible by following links to the NGC catalog scripts
A100 Training Performance on Cloud
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | - | BERT-LARGE | 3 | 91.05 F1 | 896 sequences/sec | 8x A100 | AWS EC2 p4d.24xlarge | 21.12-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-40GB |
| | - | BERT-LARGE | 3 | 91.05 F1 | 885 sequences/sec | 8x A100 | GCP A2-HIGHGPU-8G | 21.11-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-40GB |
| TensorFlow | - | BERT-LARGE | 13 | 91.38 F1 | 747 sequences/sec | 8x A100 | GCP A2-HIGHGPU-8G | 22.01-py3 | Mixed | 24 | SQuAD v1.1 | A100-SXM4-40GB |
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
V100 Training Performance on Cloud
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | - | BERT-LARGE | 8 | 90.8 F1 | 358 sequences/sec | 8x V100 | GCP N1-HIGHMEM-64 | 22.01-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM2-16GB |
| | - | BERT-LARGE | 8 | 91.03 F1 | 360 sequences/sec | 8x V100 | AWS EC2 p3.16xlarge | 21.12-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM2-16GB |
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
Converged Multi-Node Training Performance of NVIDIA GPU
Benchmarks are reproducible by following links to the NGC catalog scripts
A100 Multi-Node Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | Total GPUs | Nodes | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P1 | 296 | 1.53 Training Loss | 25,365 sequences/sec | 64x A100 | 8 | Selene | 21.12-py3 | Mixed | 64 | Wikipedia | A100-SXM4-80GB |
| | 1.11.0a0 | BERT-LARGE Pre-Training P2 | 169 | 1.35 Training Loss | 5,112 sequences/sec | 64x A100 | 8 | Selene | 21.12-py3 | Mixed | 16 | Wikipedia | A100-SXM4-80GB |
| | 1.11.0a0 | BERT-LARGE Pre-Training E2E | 253 | 1.35 Training Loss | - | 64x A100 | 8 | Selene | 21.12-py3 | Mixed | 16 | Wikipedia | A100-SXM4-80GB |
| | 1.11.0a0 | BERT-LARGE Pre-Training P1 | 160 | 1.51 Training Loss | 48,380 sequences/sec | 128x A100 | 16 | Selene | 21.12-py3 | Mixed | 64 | Wikipedia | A100-SXM4-80GB |
| | 1.11.0a0 | BERT-LARGE Pre-Training P2 | 87 | 1.34 Training Loss | 9,961 sequences/sec | 128x A100 | 16 | Selene | 21.12-py3 | Mixed | 16 | Wikipedia | A100-SXM4-80GB |
| | 1.11.0a0 | BERT-LARGE Pre-Training E2E | 136 | 1.34 Training Loss | - | 128x A100 | 16 | Selene | 21.12-py3 | Mixed | 16 | Wikipedia | A100-SXM4-80GB |
| | 1.11.0a0 | BERT-LARGE Pre-Training P1 | 87 | 1.49 Training Loss | 89,062 sequences/sec | 256x A100 | 32 | Selene | 21.12-py3 | Mixed | 64 | Wikipedia | A100-SXM4-80GB |
| | 1.11.0a0 | BERT-LARGE Pre-Training P2 | 46 | 1.34 Training Loss | 19,169 sequences/sec | 256x A100 | 32 | Selene | 21.12-py3 | Mixed | 16 | Wikipedia | A100-SXM4-80GB |
| | 1.11.0a0 | BERT-LARGE Pre-Training E2E | 73 | 1.34 Training Loss | - | 256x A100 | 32 | Selene | 21.12-py3 | Mixed | 16 | Wikipedia | A100-SXM4-80GB |
| | 1.11.0a0 | BERT-LARGE Pre-Training P1 | 51 | 1.5 Training Loss | 153,429 sequences/sec | 512x A100 | 64 | Selene | 21.12-py3 | Mixed | 64 | Wikipedia | A100-SXM4-80GB |
| | 1.11.0a0 | BERT-LARGE Pre-Training P2 | 25 | 1.33 Training Loss | 36,887 sequences/sec | 512x A100 | 64 | Selene | 21.12-py3 | Mixed | 16 | Wikipedia | A100-SXM4-80GB |
| | 1.11.0a0 | BERT-LARGE Pre-Training E2E | 42 | 1.33 Training Loss | - | 512x A100 | 64 | Selene | 21.12-py3 | Mixed | 16 | Wikipedia | A100-SXM4-80GB |
| | 1.10.0a0 | BERT-LARGE Pre-Training P1 | 26 | 1.5 Training Loss | 300,769 sequences/sec | 1,024x A100 | 128 | Selene | 21.09-py3 | Mixed | 64 | Wikipedia | A100-SXM4-80GB |
| | 1.10.0a0 | BERT-LARGE Pre-Training P2 | 13 | 1.35 Training Loss | 74,498 sequences/sec | 1,024x A100 | 128 | Selene | 21.09-py3 | Mixed | 16 | Wikipedia | A100-SXM4-80GB |
| | 1.10.0a0 | BERT-LARGE Pre-Training E2E | 22 | 1.35 Training Loss | - | 1,024x A100 | 128 | Selene | 21.09-py3 | Mixed | 16 | Wikipedia | A100-SXM4-80GB |
| | 1.11.0a0 | Transformer | 186 | 18.25 Perplexity | 454,979 total tokens/sec | 16x A100 | 2 | Selene | 21.12-py3 | Mixed | 16 | WikiText-103 | A100-SXM4-80GB |
| | 1.11.0a0 | Transformer | 105 | 18.27 Perplexity | 822,173 total tokens/sec | 32x A100 | 4 | Selene | 21.12-py3 | Mixed | 16 | WikiText-103 | A100-SXM4-80GB |
| | 1.11.0a0 | Transformer | 63 | 18.34 Perplexity | 1,389,494 total tokens/sec | 64x A100 | 8 | Selene | 21.12-py3 | Mixed | 16 | WikiText-103 | A100-SXM4-80GB |
BERT-Large Pre-Training Phase 1 Sequence Length = 128
BERT-Large Pre-Training Phase 2 Sequence Length = 512
Starting from 21.09-py3, ECC is enabled
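The Phase 1 throughput column in the multi-node table above also lets you gauge how efficiently the job scales. A small helper (illustrative only; not part of the benchmark scripts) that compares measured throughput against perfectly linear scaling from the 64-GPU baseline:

```python
def scaling_efficiency(base_gpus, base_throughput, gpus, throughput):
    """Measured throughput as a fraction of linear scaling from a baseline."""
    ideal = base_throughput * gpus / base_gpus
    return throughput / ideal

# BERT-LARGE Pre-Training P1, from the table: 25,365 seq/s on 64 GPUs
# vs. 153,429 seq/s on 512 GPUs -> roughly 76% of linear scaling.
efficiency = scaling_efficiency(64, 25365, 512, 153429)
```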
Single-GPU Training
Some scenarios, such as single-GPU throughput, aren’t representative of real-world training; the tables below are provided for reference as an indication of a platform’s single-chip throughput.
Related Resources
Achieve unprecedented acceleration at every scale with NVIDIA’s complete solution stack.
Pull software containers from NVIDIA NGC.
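Single-chip throughput figures like those below are typically derived by timing a fixed number of training steps after a warmup period. A minimal, framework-agnostic sketch (the `step_fn` callable is a placeholder for one training iteration, not an NVIDIA API):

```python
import time

def measure_throughput(step_fn, batch_size, warmup=10, iters=100):
    """Return samples/sec over `iters` timed steps, excluding warmup.

    Warmup iterations absorb one-time costs (kernel autotuning, caching)
    so they do not skew the steady-state rate.
    """
    for _ in range(warmup):
        step_fn()
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    elapsed = time.perf_counter() - start
    return iters * batch_size / elapsed
```

Dividing out the batch size is what turns step times into the images/sec, sequences/sec, or samples/sec units used in the tables.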
Single GPU Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100
Benchmarks are reproducible by following links to the NGC catalog scripts
A100 Training Performance
| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 3,163 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 192 | ImageNet2012 | A100-SXM4-80GB |
| PyTorch | 1.12.0a0 | SSD v1.1 | 438 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 128 | COCO 2017 | A100-SXM4-80GB |
| | 1.12.0a0 | Mask R-CNN | 32 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 8 | COCO 2014 | A100-SXM4-80GB |
| | 1.12.0a0 | Tacotron2 | 40,128 total output mels/sec | 1x A100 | DGX A100 | 22.04-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM4-80GB |
| | 1.12.0a0 | WaveGlow | 230,472 output samples/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-80GB |
| | 1.11.0a0 | FastPitch | 87,184 frames/sec | 1x A100 | DGX A100 | 22.02-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM4-80GB |
| | 1.12.0a0 | GNMT V2 | 166,318 total tokens/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-80GB |
| | 1.12.0a0 | NCF | 39,322,684 samples/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-80GB |
| | 1.12.0a0 | ResNeXt101 | 1,128 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 128 | ImageNet2012 | A100-SXM4-80GB |
| | 1.12.0a0 | BERT-LARGE | 133 sequences/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| | 1.12.0a0 | Transformer-XL Large | 17,298 total tokens/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 16 | WikiText-103 | A100-SXM4-80GB |
| | 1.12.0a0 | Transformer-XL Base | 91,649 total tokens/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 128 | WikiText-103 | A100-SXM4-80GB |
| | 1.12.0a0 | nnU-Net | 1,184 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 64 | Medical Segmentation Decathlon | A100-SXM4-80GB |
| | 1.12.0a0 | EfficientNet-B4 | 944 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 128 | ImageNet2012 | A100-SXM4-80GB |
| | 1.11.0a0 | BERT Large Pre-Training Phase 1 | 432 sequences/sec | 1x A100 | DGX A100 | 22.02-py3 | Mixed | 64 | Wikipedia+BookCorpus | A100-SXM4-80GB |
| | 1.11.0a0 | BERT Large Pre-Training Phase 2 | 84 sequences/sec | 1x A100 | DGX A100 | 22.01-py3 | Mixed | 16 | Wikipedia+BookCorpus | A100-SXM4-80GB |
| | 1.12.0a0 | EfficientDet-D0 | 273 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 150 | COCO 2017 | A100-SXM4-80GB |
| | 1.12.0a0 | EfficientNet-WideSE-B0 | 1,920 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
| | 1.12.0a0 | EfficientNet-WideSE-B4 | 940 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 128 | ImageNet2012 | A100-SXM4-80GB |
| | 1.12.0a0 | SE3 Transformer | 3,124 molecules/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 240 | Quantum Machines 9 | A100-SXM4-80GB |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 2,670 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
| 1.15.5 | ResNext101 | 1,323 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 256 | Imagenet2012 | A100-SXM4-80GB | |
| 1.15.5 | SE-ResNext101 | 1,151 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 256 | Imagenet2012 | A100-SXM4-80GB | |
| 1.15.5 | U-Net Industrial | 378 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 16 | DAGM2007 | A100-SXM4-40GB | |
| 2.8.0 | U-Net Medical | 149 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM4-80GB | |
| 2.8.0 | Wide and Deep | 2,854,325 samples/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A100-SXM4-40GB | |
| 1.15.5 | BERT-LARGE | 117 sentences/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 24 | SQuAD v1.1 | A100-SXM4-80GB | |
| 2.8.0 | Electra Base Fine Tuning | 371 sequences/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB | |
| 1.15.5 | NCF | 44,423,032 samples/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-40GB | |
| 1.15.5 | 3D-UNet Medical | 19 volumes/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 2 | EM segmentation challenge | A100-SXM4-80GB | |
| 2.8.0 | EfficientNet-B0 | 3,084 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 1024 | Imagenet2012 | A100-SXM4-80GB |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
EfficientNet-B4: Basic Augmentation | cuDNN Version = 8.0.5.32 | NCCL Version = 2.7.8 | Installation Source = NGC catalog
Starting from 21.09-py3, ECC is enabled
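The throughput figures in these tables are all of the form samples processed per unit of wall-clock time. A minimal sketch of that computation — the batch size and timing values below are illustrative, not taken from these runs:

```python
def throughput(batch_size, steps, elapsed_s):
    """Samples processed per second: (batch size x steps) / wall-clock seconds."""
    return batch_size * steps / elapsed_s

# e.g. batch 128 for 1,000 training steps completed in 60 s:
print(f"{throughput(128, 1000, 60.0):,.0f} images/sec")  # → 2,133 images/sec
```

The same formula underlies the per-domain units above (images/sec, sequences/sec, tokens/sec); only the definition of a "sample" changes.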
A40 Training Performance
| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 1,380 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 192 | ImageNet2012 | A40 |
| PyTorch | 1.12.0a0 | SSD v1.1 | 184 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | COCO 2017 | A40 |
| 1.12.0a0 | Mask R-CNN | 19 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 8 | COCO 2014 | A40 | |
| 1.12.0a0 | Tacotron2 | 35,551 total output mels/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | LJSpeech 1.1 | A40 | |
| 1.12.0a0 | WaveGlow | 144,465 output samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A40 | |
| 1.12.0a0 | GNMT V2 | 80,454 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | wmt16-en-de | A40 | |
| 1.12.0a0 | NCF | 18,508,924 samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 1048576 | MovieLens 20M | A40 | |
| 1.12.0a0 | Transformer-XL Large | 10,479 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 16 | WikiText-103 | A40 | |
| 1.12.0a0 | BERT-LARGE | 58 sequences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 32 | SQuAD v1.1 | A40 | |
| 1.12.0a0 | Transformer-XL Base | 44,053 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | WikiText-103 | A40 | |
| 1.12.0a0 | nnU-Net | 565 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 64 | Medical Segmentation Decathlon | A40 | |
| 1.12.0a0 | EfficientNet-B0 | 1,299 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 256 | Imagenet2012 | A40 | |
| 1.12.0a0 | EfficientNet-B4 | 485 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | Imagenet2012 | A40 | |
| 1.11.0a0 | BERT Large Pre-Training Phase 1 | 202 sequences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.01-py3 | Mixed | 64 | Wikipedia+BookCorpus | A40 | |
| 1.11.0a0 | BERT Large Pre-Training Phase 2 | 38 sequences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.01-py3 | Mixed | 16 | Wikipedia+BookCorpus | A40 | |
| 1.12.0a0 | EfficientDet-D0 | 173 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 60 | COCO 2017 | A40 | |
| 1.12.0a0 | EfficientNet-WideSE-B0 | 1,314 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 256 | Imagenet2012 | A40 | |
| 1.12.0a0 | EfficientNet-WideSE-B4 | 485 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | Imagenet2012 | A40 | |
| 1.12.0a0 | SE3 Transformer | 1,864 molecules/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 240 | Quantum Machines 9 | A40 | |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 1,214 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 256 | ImageNet2012 | A40 |
| 1.15.5 | U-Net Industrial | 122 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 16 | DAGM2007 | A40 | |
| 1.15.5 | BERT-LARGE | 51 sentences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 24 | SQuAD v1.1 | A40 | |
| 2.8.0 | U-Net Medical | 71 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 8 | EM segmentation challenge | A40 | |
| 2.8.0 | Wide and Deep | 956,456 samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A40 | |
| 1.15.5 | ResNext101 | 605 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 256 | Imagenet2012 | A40 | |
| 1.15.5 | SE-ResNext101 | 550 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 256 | Imagenet2012 | A40 | |
| 2.8.0 | Electra Base Fine Tuning | 165 sequences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 32 | SQuAD v1.1 | A40 | |
| 1.15.5 | 3D-UNet Medical | 9 volumes/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 2 | EM segmentation challenge | A40 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
A30 Training Performance
| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 1,596 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 192 | ImageNet2012 | A30 |
| PyTorch | 1.12.0a0 | SSD v1.1 | 223 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 64 | COCO 2017 | A30 |
| 1.12.0a0 | Mask R-CNN | 21 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 8 | COCO 2014 | A30 | |
| 1.12.0a0 | Tacotron2 | 33,410 total output mels/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 104 | LJSpeech 1.1 | A30 | |
| 1.12.0a0 | WaveGlow | 146,477 output samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A30 | |
| 1.12.0a0 | FastPitch | 65,893 frames/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 16 | LJSpeech 1.1 | A30 | |
| 1.12.0a0 | NCF | 20,521,958 samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 1048576 | MovieLens 20M | A30 | |
| 1.12.0a0 | GNMT V2 | 90,710 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | wmt16-en-de | A30 | |
| 1.12.0a0 | Transformer-XL Base | 19,242 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 32 | WikiText-103 | A30 | |
| 1.12.0a0 | ResNeXt101 | 545 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 112 | Imagenet2012 | A30 | |
| 1.12.0a0 | Transformer-XL Large | 7,264 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 4 | WikiText-103 | A30 | |
| 1.12.0a0 | BERT-LARGE | 54 sequences/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 10 | SQuAD v1.1 | A30 | |
| 1.12.0a0 | nnU-Net | 585 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 64 | Medical Segmentation Decathlon | A30 | |
| 1.12.0a0 | EfficientNet-B4 | 370 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 32 | Imagenet2012 | A30 | |
| 1.11.0a0 | BERT Large Pre-Training Phase 1 | 182 sequences/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.01-py3 | Mixed | 16 | Wikipedia+BookCorpus | A30 | |
| 1.11.0a0 | BERT Large Pre-Training Phase 2 | 35 sequences/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.01-py3 | Mixed | 4 | Wikipedia+BookCorpus | A30 | |
| 1.12.0a0 | EfficientDet-D0 | 148 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 30 | COCO 2017 | A30 | |
| 1.12.0a0 | EfficientNet-WideSE-B0 | 1,320 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | Imagenet2012 | A30 | |
| 1.12.0a0 | EfficientNet-WideSE-B4 | 348 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 32 | Imagenet2012 | A30 | |
| 1.12.0a0 | SE3 Transformer | 2,017 molecules/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 240 | Quantum Machines 9 | A30 | |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 1,336 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 256 | ImageNet2012 | A30 |
| 1.15.5 | ResNext101 | 588 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | Imagenet2012 | A30 | |
| 1.15.5 | SE-ResNext101 | 495 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 96 | Imagenet2012 | A30 | |
| 1.15.5 | U-Net Industrial | 118 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 16 | DAGM2007 | A30 | |
| 2.8.0 | U-Net Medical | 75 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 8 | EM segmentation challenge | A30 | |
| 1.15.5 | Transformer XL Base | 18,351 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 16 | WikiText-103 | A30 | |
| 2.8.0 | Electra Base Fine Tuning | 164 sequences/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 16 | SQuAD v1.1 | A30 | |
| 1.15.5 | 3D-UNet Medical | 9 volumes/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 2 | EM segmentation challenge | A30 | |
| 2.8.0 | EfficientNet-B0 | 1,516 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 512 | Imagenet2012 | A30 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
A10 Training Performance
| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 1,064 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 192 | ImageNet2012 | A10 |
| PyTorch | 1.12.0a0 | SSD v1.1 | 156 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 64 | COCO 2017 | A10 |
| 1.12.0a0 | Mask R-CNN | 17 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 8 | COCO 2014 | A10 | |
| 1.12.0a0 | Tacotron2 | 28,336 total output mels/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 104 | LJSpeech 1.1 | A10 | |
| 1.12.0a0 | WaveGlow | 112,773 output samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A10 | |
| 1.12.0a0 | FastPitch | 58,245 frames/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 16 | LJSpeech 1.1 | A10 | |
| 1.12.0a0 | Transformer-XL Base | 15,980 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 32 | WikiText-103 | A10 | |
| 1.12.0a0 | GNMT V2 | 65,024 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | wmt16-en-de | A10 | |
| 1.12.0a0 | ResNeXt101 | 400 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 112 | Imagenet2012 | A10 | |
| 1.12.0a0 | NCF | 15,554,865 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 1048576 | MovieLens 20M | A10 | |
| 1.12.0a0 | Transformer-XL Large | 6,074 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 4 | WikiText-103 | A10 | |
| 1.12.0a0 | BERT-LARGE | 38 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 10 | SQuAD v1.1 | A10 | |
| 1.12.0a0 | nnU-Net | 453 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 64 | Medical Segmentation Decathlon | A10 | |
| 1.12.0a0 | EfficientNet-B0 | 1,003 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.03-py3 | Mixed | 128 | Imagenet2012 | A10 | |
| 1.12.0a0 | EfficientNet-B4 | 348 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 32 | Imagenet2012 | A10 | |
| 1.11.0a0 | BERT Large Pre-Training Phase 1 | 147 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.01-py3 | Mixed | 16 | Wikipedia+BookCorpus | A10 | |
| 1.11.0a0 | BERT Large Pre-Training Phase 2 | 28 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.01-py3 | Mixed | 4 | Wikipedia+BookCorpus | A10 | |
| 1.12.0a0 | EfficientDet-D0 | 136 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 30 | COCO 2017 | A10 | |
| 1.12.0a0 | EfficientNet-WideSE-B0 | 1,044 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | Imagenet2012 | A10 | |
| 1.12.0a0 | EfficientNet-WideSE-B4 | 337 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 32 | Imagenet2012 | A10 | |
| 1.12.0a0 | SE3 Transformer | 1,585 molecules/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 240 | Quantum Machines 9 | A10 | |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 956 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 256 | ImageNet2012 | A10 |
| 1.15.5 | ResNext101 | 451 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | Imagenet2012 | A10 | |
| 1.15.5 | SE-ResNext101 | 393 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 96 | Imagenet2012 | A10 | |
| 1.15.5 | U-Net Industrial | 100 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 16 | DAGM2007 | A10 | |
| 2.8.0 | U-Net Medical | 52 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 8 | EM segmentation challenge | A10 | |
| 2.8.0 | Electra Base Fine Tuning | 122 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 16 | SQuAD v1.1 | A10 | |
| 1.15.5 | 3D-UNet Medical | 7 volumes/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 2 | EM segmentation challenge | A10 | |
| 2.8.0 | EfficientNet-B0 | 1,312 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 512 | Imagenet2012 | A10 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
T4 Training Performance
| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 473 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4 |
| PyTorch | 1.12.0a0 | ResNeXt101 | 186 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 112 | Imagenet2012 | NVIDIA T4 |
| 1.12.0a0 | SSD v1.1 | 75 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 64 | COCO 2017 | NVIDIA T4 | |
| 1.12.0a0 | Tacotron2 | 6,486 total output mels/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | FP32 | 48 | LJSpeech 1.1 | NVIDIA T4 | |
| 1.12.0a0 | WaveGlow | 55,615 output samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4 | |
| 1.12.0a0 | FastPitch | 30,273 frames/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 16 | LJSpeech 1.1 | NVIDIA T4 | |
| 1.12.0a0 | GNMT V2 | 31,371 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4 | |
| 1.12.0a0 | NCF | 7,320,853 samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 1048576 | MovieLens 20M | NVIDIA T4 | |
| 1.12.0a0 | BERT-LARGE | 17 sequences/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 10 | SQuAD v1.1 | NVIDIA T4 | |
| 1.12.0a0 | Transformer-XL Base | 9,225 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 32 | WikiText-103 | NVIDIA T4 | |
| 1.12.0a0 | SE-ResNeXt101 | 150 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 112 | Imagenet2012 | NVIDIA T4 | |
| 1.12.0a0 | Transformer-XL Large | 2,787 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 4 | WikiText-103 | NVIDIA T4 | |
| 1.12.0a0 | nnU-Net | 206 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 64 | Medical Segmentation Decathlon | NVIDIA T4 | |
| 1.12.0a0 | EfficientNet-B0 | 478 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4 | |
| 1.12.0a0 | EfficientNet-B4 | 174 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 32 | Imagenet2012 | NVIDIA T4 | |
| 1.11.0a0 | BERT Large Pre-Training Phase 1 | 14 sequences/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.02-py3 | Mixed | 8 | Wikipedia+BookCorpus | NVIDIA T4 | |
| 1.11.0a0 | BERT Large Pre-Training Phase 2 | 3 sequences/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.02-py3 | Mixed | 2 | Wikipedia+BookCorpus | NVIDIA T4 | |
| 1.12.0a0 | EfficientDet-D0 | 67 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 30 | COCO 2017 | NVIDIA T4 | |
| 1.12.0a0 | EfficientNet-WideSE-B0 | 473 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4 | |
| 1.12.0a0 | EfficientNet-WideSE-B4 | 170 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 32 | Imagenet2012 | NVIDIA T4 | |
| 1.12.0a0 | SE3 Transformer | 607 molecules/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 240 | Quantum Machines 9 | NVIDIA T4 | |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 424 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4 |
| 1.15.5 | U-Net Industrial | 44 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 16 | DAGM2007 | NVIDIA T4 | |
| 1.15.5 | U-Net Medical | 21 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4 | |
| 1.15.5 | SE-ResNext101 | 163 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 96 | Imagenet2012 | NVIDIA T4 | |
| 1.15.5 | ResNext101 | 200 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4 | |
| 2.8.0 | Electra Base Fine Tuning | 56 sequences/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 16 | SQuAD v1.1 | NVIDIA T4 | |
| 1.15.5 | 3D-UNet Medical | 3 volumes/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 2 | EM segmentation challenge | NVIDIA T4 | |
| 2.8.0 | EfficientNet-B0 | 539 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 256 | Imagenet2012 | NVIDIA T4 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
V100 Training Performance
| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 1,582 images/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB |
| PyTorch | 1.12.0a0 | ResNeXt101 | 567 images/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 112 | Imagenet2012 | V100-SXM3-32GB |
| 1.12.0a0 | Mask R-CNN | 21 images/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 8 | COCO 2014 | V100-SXM3-32GB | |
| 1.12.0a0 | SSD v1.1 | 235 images/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 64 | COCO 2017 | V100-SXM3-32GB | |
| 1.12.0a0 | Tacotron2 | 24,913 total output mels/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB | |
| 1.12.0a0 | WaveGlow | 144,081 output samples/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB | |
| 1.12.0a0 | FastPitch | 68,742 frames/sec | 1x V100 | DGX-2 | 22.03-py3 | Mixed | 64 | LJSpeech 1.1 | V100-SXM3-32GB | |
| 1.12.0a0 | GNMT V2 | 77,710 total tokens/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB | |
| 1.12.0a0 | NCF | 23,153,985 samples/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 1048576 | MovieLens 20M | V100-SXM3-32GB | |
| 1.12.0a0 | BERT-LARGE | 56 sequences/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB | |
| 1.12.0a0 | Transformer-XL Base | 18,025 total tokens/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 32 | WikiText-103 | V100-SXM3-32GB | |
| 1.12.0a0 | Transformer-XL Large | 7,366 total tokens/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 8 | WikiText-103 | V100-SXM3-32GB | |
| 1.12.0a0 | nnU-Net | 664 images/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 64 | Medical Segmentation Decathlon | V100-SXM3-32GB | |
| 1.12.0a0 | EfficientNet-B0 | 1,280 images/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 256 | Imagenet2012 | V100-SXM3-32GB | |
| 1.12.0a0 | EfficientNet-B4 | 505 images/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 64 | Imagenet2012 | V100-SXM3-32GB | |
| 1.11.0a0 | BERT Large Pre-Training Phase 1 | 182 sequences/sec | 1x V100 | DGX-2 | 22.01-py3 | Mixed | 16 | Wikipedia+BookCorpus | V100-SXM3-32GB | |
| 1.11.0a0 | BERT Large Pre-Training Phase 2 | 37 sequences/sec | 1x V100 | DGX-2 | 22.01-py3 | Mixed | 4 | Wikipedia+BookCorpus | V100-SXM3-32GB | |
| 1.12.0a0 | EfficientDet-D0 | 165 images/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 60 | COCO 2017 | V100-SXM3-32GB | |
| 1.12.0a0 | EfficientNet-WideSE-B0 | 1,273 images/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 256 | Imagenet2012 | V100-SXM3-32GB | |
| 1.12.0a0 | EfficientNet-WideSE-B4 | 501 images/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 64 | Imagenet2012 | V100-SXM3-32GB | |
| 1.12.0a0 | SE3 Transformer | 1,873 molecules/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 240 | Quantum Machines 9 | V100-SXM3-32GB | |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 1,385 images/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB |
| 1.15.5 | ResNext101 | 635 images/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 128 | Imagenet2012 | V100-SXM3-32GB | |
| 1.15.5 | SE-ResNext101 | 555 images/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 96 | Imagenet2012 | V100-SXM3-32GB | |
| 1.15.5 | U-Net Industrial | 118 images/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 16 | DAGM2007 | V100-SXM3-32GB | |
| 1.15.5 | U-Net Medical | 67 images/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB | |
| 2.8.0 | Wide and Deep | 1,022,754 samples/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | V100-SXM3-32GB | |
| 1.15.5 | BERT-LARGE | 48 sentences/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB | |
| 2.8.0 | Electra Base Fine Tuning | 188 sequences/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 32 | SQuAD v1.1 | V100-SXM3-32GB | |
| 1.15.5 | Transformer XL Base | 18,574 total tokens/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 16 | WikiText-103 | V100-SXM3-32GB | |
| 1.15.5 | 3D-UNet Medical | 10 volumes/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 2 | EM segmentation challenge | V100-SXM3-32GB |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
Single-GPU Training Performance of NVIDIA GPUs on Cloud
Benchmarks are reproducible by following links to the NGC catalog scripts
A100 Training Performance on Cloud
| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | - | ResNet-50 v1.5 | 2,916 images/sec | 1x A100 | GCP A2-HIGHGPU-1G | 22.04-py3 | Mixed | 192 | ImageNet2012 | A100-SXM4-40GB |
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
T4 Training Performance on Cloud
| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | - | ResNet-50 v1.5 | 457 images/sec | 1x T4 | AWS EC2 g4dn.4xlarge | 22.04-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4 |
| - | ResNet-50 v1.5 | 419 images/sec | 1x T4 | GCP N1-HIGHMEM-8 | 22.04-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4 | |
| PyTorch | - | BERT-LARGE | 16 sequences/sec | 1x T4 | AWS EC2 g4dn.4xlarge | 21.06-py3 | Mixed | 10 | SQuAD v1.1 | NVIDIA T4 |
| TensorFlow | - | ResNet-50 v1.5 | 417 images/sec | 1x T4 | AWS EC2 g4dn.4xlarge | 22.04-py3 | Mixed | 256 | Imagenet2012 | NVIDIA T4 |
| - | ResNet-50 v1.5 | 406 images/sec | 1x T4 | GCP N1-HIGHMEM-8 | 22.04-py3 | Mixed | 256 | Imagenet2012 | NVIDIA T4 |
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
V100 Training Performance on Cloud
| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | - | ResNet-50 v1.5 | 1,519 images/sec | 1x V100 | AWS EC2 p3.2xlarge | 22.04-py3 | Mixed | 192 | ImageNet2012 | V100-SXM2-16GB |
| - | ResNet-50 v1.5 | 1,434 images/sec | 1x V100 | GCP N1-HIGHMEM-8 | 22.04-py3 | Mixed | 192 | ImageNet2012 | V100-SXM2-16GB |
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
AI Inference
Real-world inference demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into production with the highest performance from data center to edge.
Related Resources
Learn how NVIDIA landed top performance spots on all MLPerf Inference 2.0 tests.
Read the inference whitepaper to explore the evolving landscape and get an overview of inference platforms.
Learn how Dynamic Batching can increase throughput on Triton with Benefits of Triton.
For additional data on Triton performance in the offline and online server scenarios, refer to ResNet-50 v1.5
Power high-throughput, low-latency inference with NVIDIA’s complete solution stack:
Achieve the most efficient inference performance with NVIDIA® TensorRT™ running on NVIDIA Tensor Core GPUs.
Maximize performance and simplify the deployment of AI models with the NVIDIA Triton™ Inference Server.
Pull software containers from NVIDIA NGC to race into production.
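Dynamic batching, mentioned above, is enabled per model in Triton's `config.pbtxt` model configuration. A minimal sketch of the relevant stanza — the preferred batch sizes and queue delay below are illustrative values, not tuned recommendations:

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

Triton then groups individual inference requests into server-side batches, trading up to `max_queue_delay_microseconds` of queuing delay for higher GPU utilization.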
MLPerf Inference v2.0 Performance Benchmarks
Offline Scenario - Closed Division
| Network | Throughput | GPU | Server | GPU Version | Dataset | Target Accuracy |
|---|---|---|---|---|---|---|
| ResNet-50 v1.5 | 312,849 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet | 76.46% Top1 |
| 314,929 samples/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | ImageNet | 76.46% Top1 | |
| 138,516 samples/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | ImageNet | 76.46% Top1 | |
| 5,231 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | ImageNet | 76.46% Top1 | |
| 293,451 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | ImageNet | 76.46% Top1 | |
| 145,947 samples/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | ImageNet | 76.46% Top1 | |
| 147,246 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30 | ImageNet | 76.46% Top1 | |
| 5,089 samples/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | ImageNet | 76.46% Top1 | |
| SSD ResNet-34 | 7,923 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | COCO | 0.2 mAP |
| 7,880 samples/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | COCO | 0.2 mAP | |
| 3,397 samples/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | COCO | 0.2 mAP | |
| 135 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | COCO | 0.2 mAP | |
| 7,297 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | COCO | 0.2 mAP | |
| 3,623 samples/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | COCO | 0.2 mAP | |
| 3,827 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30 | COCO | 0.2 mAP | |
| 129 samples/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | COCO | 0.2 mAP | |
| 3D-UNet | 25 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | KiTS 2019 | 0.863 DICE mean |
| 24 samples/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | KiTS 2019 | 0.863 DICE mean | |
| 11 samples/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | KiTS 2019 | 0.863 DICE mean | |
| 24 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | KiTS 2019 | 0.863 DICE mean | |
| 12 samples/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | KiTS 2019 | 0.863 DICE mean | |
| 13 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30 | KiTS 2019 | 0.863 DICE mean | |
| RNN-T | 106,753 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech | 7.45% WER |
| 107,399 samples/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | LibriSpeech | 7.45% WER | |
| 49,789 samples/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | LibriSpeech | 7.45% WER | |
| 1,612 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | LibriSpeech | 7.45% WER | |
| 101,788 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | LibriSpeech | 7.45% WER | |
| 52,752 samples/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | LibriSpeech | 7.45% WER | |
| 52,453 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30 | LibriSpeech | 7.45% WER | |
| 1,432 samples/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | LibriSpeech | 7.45% WER | |
| BERT | 27,971 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 | 90.07% f1 |
| 27,894 samples/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | SQuAD v1.1 | 90.07% f1 | |
| 11,387 samples/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | SQuAD v1.1 | 90.07% f1 | |
| 484 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 | 90.07% f1 | |
| 25,035 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | SQuAD v1.1 | 90.07% f1 | |
| 12,595 samples/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | SQuAD v1.1 | 90.07% f1 | |
| 13,340 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30 | SQuAD v1.1 | 90.07% f1 | |
| 502 samples/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | SQuAD v1.1 | 90.07% f1 | |
| DLRM | 2,499,040 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
| 2,477,270 samples/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC | |
| 1,065,600 samples/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC | |
| 40,424 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC | |
| 2,313,280 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | Criteo 1TB Click Logs | 80.25% AUC | |
| 1,125,130 samples/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | Criteo 1TB Click Logs | 80.25% AUC | |
| 1,105,550 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30 | Criteo 1TB Click Logs | 80.25% AUC | |
| 35,831 samples/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | Criteo 1TB Click Logs | 80.25% AUC |
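Unlike the offline scenario, MLPerf's server scenario only counts a result as valid if query tail latency stays within a per-model bound (for example, 15 ms for ResNet-50 v1.5). A simplified sketch of such a check — the nearest-rank percentile used here is an illustration, not the official MLPerf measurement rule:

```python
def meets_latency_bound(latencies_ms, bound_ms, percentile=99.0):
    """True if the given percentile of query latencies is within the bound."""
    ordered = sorted(latencies_ms)
    # Nearest-rank percentile: index of the k-th smallest sample.
    k = max(0, int(len(ordered) * percentile / 100.0 + 0.5) - 1)
    return ordered[k] <= bound_ms

# 1,000 queries mostly at 10 ms, with a few 30 ms outliers:
lat = [10.0] * 990 + [30.0] * 10
print(meets_latency_bound(lat, bound_ms=15.0))  # p99 here is 10 ms → True
```

A throughput number in the server tables is therefore the highest query rate at which a constraint like this still holds.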
Server Scenario - Closed Division
| Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset |
|---|---|---|---|---|---|---|---|
| ResNet-50 v1.5 | 260,031 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 76.46% Top1 | 15 | ImageNet |
| 270,027 queries/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | 76.46% Top1 | 15 | ImageNet | |
| 107,000 queries/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | 76.46% Top1 | 15 | ImageNet | |
| 3,527 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 76.46% Top1 | 15 | ImageNet | |
| 200,007 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 76.46% Top1 | 15 | ImageNet | |
| 104,000 queries/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | 76.46% Top1 | 15 | ImageNet | |
| 116,002 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30 | 76.46% Top1 | 15 | ImageNet | |
| 3,398 queries/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | 76.46% Top1 | 15 | ImageNet | |
| SSD ResNet-34 | 7,575 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 0.2 mAP | 100 | COCO |
| 7,505 queries/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | 0.2 mAP | 100 | COCO | |
| 3,247 queries/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | 0.2 mAP | 100 | COCO | |
| 98 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 0.2 mAP | 100 | COCO | |
| 6,466 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 0.2 mAP | 100 | COCO | |
| 3,078 queries/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | 0.2 mAP | 100 | COCO | |
| 3,570 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30 | 0.2 mAP | 100 | COCO | |
| 95 queries/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | 0.2 mAP | 100 | COCO | |
| RNN-T | 104,000 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 7.45% WER | 1,000 | LibriSpeech |
| 104,000 queries/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | 7.45% WER | 1,000 | LibriSpeech | |
| 44,989 queries/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | 7.45% WER | 1,000 | LibriSpeech | |
| 1,350 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 7.45% WER | 1,000 | LibriSpeech | |
| 89,994 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 7.45% WER | 1,000 | LibriSpeech | |
| 42,989 queries/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | 7.45% WER | 1,000 | LibriSpeech | |
| 36,989 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30 | 7.45% WER | 1,000 | LibriSpeech | |
| 1,100 queries/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | 7.45% WER | 1,000 | LibriSpeech | |
| BERT | 25,792 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 90.07% f1 | 130 | SQuAD v1.1 |
| 25,391 queries/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | 90.07% f1 | 130 | SQuAD v1.1 | |
| 10,794 queries/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | 90.07% f1 | 130 | SQuAD v1.1 | |
| 380 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 90.07% f1 | 130 | SQuAD v1.1 | |
| 22,989 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 90.07% f1 | 130 | SQuAD v1.1 | |
| 10,394 queries/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | 90.07% f1 | 130 | SQuAD v1.1 | |
| 11,491 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30 | 90.07% f1 | 130 | SQuAD v1.1 | |
| 380 queries/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | 90.07% f1 | 130 | SQuAD v1.1 | |
| DLRM | 2,302,640 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs |
| 1,951,890 queries/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs | |
| 950,448 queries/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs | |
| 35,989 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs | |
| 1,300,850 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs | |
| 600,183 queries/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs | |
| 960,456 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30 | 80.25% AUC | 30 | Criteo 1TB Click Logs | |
| 30,987 queries/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | 80.25% AUC | 30 | Criteo 1TB Click Logs |
Power Efficiency Offline Scenario - Closed Division
| Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset |
|---|---|---|---|---|---|---|
| ResNet-50 v1.5 | 250,242 samples/sec | 86.74 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet |
| 268,462 samples/sec | 95.36 samples/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | ImageNet | |
| 128,665 samples/sec | 113.68 samples/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | ImageNet | |
| 211,065 samples/sec | 113.44 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | ImageNet | |
| 104,893 samples/sec | 105.20 samples/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | ImageNet | |
| SSD ResNet-34 | 6,576 samples/sec | 2.11 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | COCO |
| 6,521 samples/sec | 2.31 samples/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | COCO | |
| 3,307 samples/sec | 2.67 samples/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | COCO | |
| 5,778 samples/sec | 2.75 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | COCO | |
| 2,894 samples/sec | 2.57 samples/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | COCO | |
| 3D-UNet | 21 samples/sec | 0.007 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | KiTS 2019 |
| 20 samples/sec | 0.008 samples/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | KiTS 2019 | |
| 11 samples/sec | 0.009 samples/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | KiTS 2019 | |
| 19 samples/sec | 0.010 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | KiTS 2019 | |
| 10 samples/sec | 0.010 samples/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | KiTS 2019 | |
| RNN-T | 90,730 samples/sec | 27.94 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech |
| 90,946 samples/sec | 31.89 samples/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | LibriSpeech | |
| 44,966 samples/sec | 37.87 samples/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | LibriSpeech | |
| 85,952 samples/sec | 39.16 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | LibriSpeech | |
| 42,945 samples/sec | 37.86 samples/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | LibriSpeech | |
| BERT | 24,794 samples/sec | 6.99 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 |
| 20,706 samples/sec | 7.38 samples/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | SQuAD v1.1 | |
| 10,828 samples/sec | 8.64 samples/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | SQuAD v1.1 | |
| 19,993 samples/sec | 8.47 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | SQuAD v1.1 | |
| 10,047 samples/sec | 8.06 samples/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | SQuAD v1.1 | |
| DLRM | 2,140,540 samples/sec | 646.23 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs |
| 1,940,830 samples/sec | 701.53 samples/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | Criteo 1TB Click Logs | |
| 1,001,010 samples/sec | 797.59 samples/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | Criteo 1TB Click Logs | |
| 1,845,900 samples/sec | 795.67 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | Criteo 1TB Click Logs | |
| 953,749 samples/sec | 768.81 samples/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | Criteo 1TB Click Logs |
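The Throughput and Throughput per Watt columns in the table above together imply the measured board power: power = throughput / efficiency. A minimal sketch (Python; the input numbers are copied from the ResNet-50 v1.5 DGX A100 row above, and the per-GPU figure assumes power divides evenly across the 8 GPUs, which the submission does not state):

```python
# Derive the implied total board power (W) from an offline power-efficiency row.

def implied_power(throughput, per_watt):
    """Total board power in watts implied by a throughput/efficiency pair."""
    return throughput / per_watt

# ResNet-50 v1.5, 8x A100 SXM-80GB on DGX A100 (row above).
total_w = implied_power(250_242, 86.74)
per_gpu_w = total_w / 8  # illustrative only: assumes an even split across GPUs

print(f"total board power ~ {total_w:.0f} W, per GPU ~ {per_gpu_w:.0f} W")
```

This works out to roughly 2,900 W total, or about 360 W per GPU, which is plausibly within the A100 SXM module's board power limit.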
Power Efficiency Server Scenario - Closed Division
| Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset |
|---|---|---|---|---|---|---|
| ResNet-50 v1.5 | 229,016 queries/sec | 78.69 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet |
| 230,018 queries/sec | 81.52 queries/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | ImageNet | |
| 107,000 queries/sec | 94.59 queries/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | ImageNet | |
| 185,005 queries/sec | 88.70 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | ImageNet | |
| 92,496 queries/sec | 93.88 queries/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | ImageNet | |
| SSD ResNet-34 | 6,298 queries/sec | 2.01 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | COCO |
| 6,298 queries/sec | 2.23 queries/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | COCO | |
| 3,078 queries/sec | 2.50 queries/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | COCO | |
| 5,697 queries/sec | 2.72 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | COCO | |
| 2,748 queries/sec | 2.48 queries/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | COCO | |
| RNN-T | 87,992 queries/sec | 25.47 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech |
| 74,990 queries/sec | 26.37 queries/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | LibriSpeech | |
| 43,388 queries/sec | 33.53 queries/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | LibriSpeech | |
| 74,990 queries/sec | 34.09 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | LibriSpeech | |
| 37,489 queries/sec | 32.89 queries/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | LibriSpeech | |
| BERT | 21,492 queries/sec | 6.36 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 |
| 20,992 queries/sec | 6.47 queries/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | SQuAD v1.1 | |
| 10,195 queries/sec | 8.01 queries/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | SQuAD v1.1 | |
| 17,292 queries/sec | 8.09 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | SQuAD v1.1 | |
| 9,995 queries/sec | 7.99 queries/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | SQuAD v1.1 | |
| DLRM | 2,001,990 queries/sec | 593.66 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs |
| 1,831,680 queries/sec | 651.00 queries/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | Criteo 1TB Click Logs | |
| 870,363 queries/sec | 649.49 queries/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | Criteo 1TB Click Logs | |
| 750,272 queries/sec | 358.95 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | Criteo 1TB Click Logs | |
| 500,121 queries/sec | 408.95 queries/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | Criteo 1TB Click Logs |
MLPerf™ v2.0 Inference Closed: ResNet-50 v1.5, SSD ResNet-34, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net 99% of FP32 accuracy target, DLRM 99% of FP32 accuracy target: 2.0-073, 2.0-075, 2.0-077, 2.0-078, 2.0-080, 2.0-081, 2.0-083, 2.0-084, 2.0-090, 2.0-094, 2.0-095, 2.0-097, 2.0-098. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
BERT-Large sequence length = 384.
DLRM samples refer to an average of 270 pairs per sample.
1x1g.6gb and 1x1g.10gb are notations for the MIG (Multi-Instance GPU) configuration used: the workload runs on a single MIG slice, with 6GB of memory on an A30 or 10GB of memory on an A100, respectively.
For data on the various MLPerf™ scenarios, click here
For MLPerf™ latency constraints, click here
NVIDIA Triton Inference Server Delivered Comparable Performance to Custom Harness in MLPerf v2.0
NVIDIA landed top performance spots on all MLPerf™ Inference 2.0 tests, the AI industry's leading benchmark. For inference submissions, we have typically used a custom A100 inference serving harness, designed and optimized specifically to deliver the highest possible inference performance for MLPerf™ workloads, which require running inference on bare metal.
MLPerf™ v2.0 A100 Inference Closed: ResNet-50 v1.5, SSD ResNet-34, BERT 99% of FP32 accuracy target, DLRM 99% of FP32 accuracy target: 2.0-094, 2.0-096. MLPerf name and logo are trademarks. See www.mlcommons.org for more information.
NVIDIA Client Batch Size 1 and 2 Performance with Triton Inference Server
A100 Triton Inference Server Performance
| Network | Accelerator | Training Framework | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 V1.5 Inference | A100-SXM4-40GB | PyTorch | TensorRT | TF32 | 2 | 1 | 64 | 256 | 48.35 | 5,294 inf/sec | - | 21.03-py3 |
| ResNet-50 V1.5 Inference | A100-PCIE-40GB | PyTorch | TensorRT | Mixed | 2 | 1 | 64 | 256 | 61.02 | 4,197 inf/sec | - | 20.07-py3 |
| BERT Large Inference | A100-SXM4-40GB | TensorRT | TensorRT | Mixed | 4 | 1 | 1 | 24 | 39.31 | 611 inf/sec | 384 | 22.03-py3 |
| BERT Large Inference | A100-SXM4-40GB | TensorRT | TensorRT | Mixed | 2 | 2 | 1 | 16 | 42.94 | 746 inf/sec | 384 | 22.03-py3 |
| BERT Large Inference | A100-PCIE-40GB | TensorRT | TensorRT | Mixed | 4 | 1 | 1 | 24 | 44.76 | 536 inf/sec | 384 | 22.03-py3 |
| BERT Large Inference | A100-PCIE-40GB | TensorRT | TensorRT | Mixed | 2 | 2 | 1 | 16 | 52.32 | 611 inf/sec | 384 | 22.03-py3 |
| BERT Base Inference | A100-SXM4-40GB | TensorRT | TensorRT | Mixed | 4 | 1 | 1 | 24 | 7.48 | 3,206 inf/sec | 128 | 22.03-py3 |
| BERT Base Inference | A100-SXM4-40GB | TensorRT | TensorRT | Mixed | 1 | 2 | 1 | 24 | 9.46 | 5,076 inf/sec | 128 | 22.03-py3 |
| BERT Base Inference | A100-PCIE-40GB | TensorRT | TensorRT | Mixed | 4 | 1 | 1 | 24 | 7.10 | 3,380 inf/sec | 128 | 22.03-py3 |
| BERT Base Inference | A100-PCIE-40GB | TensorRT | TensorRT | Mixed | 1 | 2 | 1 | 20 | 8.41 | 4,755 inf/sec | 128 | 22.03-py3 |
| DLRM Inference | A100-SXM4-40GB | PyTorch | TensorRT | Mixed | 4 | 1 | 65,536 | 30 | 3.27 | 9,183 inf/sec | - | 22.03-py3 |
| DLRM Inference | A100-SXM4-40GB | PyTorch | TensorRT | Mixed | 1 | 2 | 65,536 | 26 | 2.54 | 20,492 inf/sec | - | 22.03-py3 |
| DLRM Inference | A100-PCIE-40GB | PyTorch | TensorRT | Mixed | 2 | 1 | 65,536 | 30 | 2.7 | 11,120 inf/sec | - | 22.03-py3 |
| DLRM Inference | A100-PCIE-40GB | PyTorch | TensorRT | Mixed | 2 | 2 | 65,536 | 26 | 2.35 | 22,132 inf/sec | - | 22.03-py3 |
A30 Triton Inference Server Performance
| Network | Accelerator | Training Framework | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT Large Inference | A30 | TensorRT | TensorRT | Mixed | 4 | 1 | 1 | 24 | 74.10 | 324 inf/sec | 384 | 22.03-py3 |
| BERT Large Inference | A30 | TensorRT | TensorRT | Mixed | 2 | 2 | 1 | 16 | 92.55 | 346 inf/sec | 384 | 22.03-py3 |
| BERT Base Inference | A30 | TensorRT | TensorRT | Mixed | 4 | 1 | 1 | 24 | 10.31 | 2,328 inf/sec | 128 | 22.03-py3 |
| BERT Base Inference | A30 | TensorRT | TensorRT | Mixed | 1 | 2 | 1 | 20 | 13.28 | 3,012 inf/sec | 128 | 22.03-py3 |
A10 Triton Inference Server Performance
| Network | Accelerator | Training Framework | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT Large Inference | A10 | TensorRT | TensorRT | Mixed | 4 | 1 | 1 | 24 | 112.65 | 213 inf/sec | 384 | 22.03-py3 |
| BERT Large Inference | A10 | TensorRT | TensorRT | Mixed | 1 | 2 | 1 | 20 | 175 | 229 inf/sec | 384 | 22.03-py3 |
| BERT Base Inference | A10 | TensorRT | TensorRT | Mixed | 4 | 1 | 1 | 24 | 13.34 | 1,799 inf/sec | 128 | 22.03-py3 |
| BERT Base Inference | A10 | TensorRT | TensorRT | Mixed | 2 | 2 | 1 | 16 | 13.99 | 2,287 inf/sec | 128 | 22.03-py3 |
T4 Triton Inference Server Performance
| Network | Accelerator | Training Framework | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 V1.5 Inference | NVIDIA T4 | PyTorch | TensorRT | Mixed | 1 | 1 | 64 | 256 | 257.91 | 992 inf/sec | - | 20.07-py3 |
| BERT Large Inference | NVIDIA T4 | TensorRT | TensorRT | Mixed | 4 | 1 | 1 | 24 | 85 | 283 inf/sec | 384 | 22.03-py3 |
| BERT Large Inference | NVIDIA T4 | TensorRT | TensorRT | Mixed | 2 | 2 | 1 | 24 | 84.47 | 568 inf/sec | 384 | 22.03-py3 |
| BERT Base Inference | NVIDIA T4 | TensorRT | TensorRT | Mixed | 4 | 1 | 1 | 24 | 27.90 | 860 inf/sec | 128 | 22.03-py3 |
| BERT Base Inference | NVIDIA T4 | TensorRT | TensorRT | Mixed | 1 | 2 | 1 | 24 | 49.38 | 972 inf/sec | 128 | 22.03-py3 |
V100 Triton Inference Server Performance
| Network | Accelerator | Training Framework | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 V1.5 Inference | V100 SXM2-32GB | PyTorch | TensorRT | FP32 | 4 | 1 | 64 | 384 | 215.79 | 1,781 inf/sec | - | 21.03-py3 |
| BERT Large Inference | V100 SXM2-32GB | TensorRT | TensorRT | Mixed | 4 | 1 | 1 | 24 | 105.96 | 227 inf/sec | 384 | 22.03-py3 |
| BERT Large Inference | V100 SXM2-32GB | TensorRT | TensorRT | Mixed | 2 | 2 | 1 | 16 | 125.94 | 254 inf/sec | 384 | 22.03-py3 |
| BERT Base Inference | V100 SXM2-32GB | TensorRT | TensorRT | Mixed | 4 | 1 | 1 | 24 | 17.60 | 1,363 inf/sec | 128 | 22.03-py3 |
| BERT Base Inference | V100 SXM2-32GB | TensorRT | TensorRT | Mixed | 2 | 2 | 1 | 16 | 14.83 | 2,158 inf/sec | 128 | 22.03-py3 |
| DLRM Inference | V100-SXM2-32GB | PyTorch | TensorRT | Mixed | 2 | 1 | 65,536 | 30 | 4.15 | 7,228 inf/sec | - | 22.03-py3 |
| DLRM Inference | V100-SXM2-32GB | PyTorch | TensorRT | Mixed | 2 | 2 | 65,536 | 30 | 4.11 | 14,599 inf/sec | - | 22.03-py3 |
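In a closed-loop Triton benchmark, steady-state throughput, client concurrency, and average latency are linked by Little's law: throughput ≈ concurrent requests / latency. A quick sanity check against two rows from the A100 table above (Python; this merely reconstructs the reported numbers, it is not how the benchmark computes them):

```python
# Little's law: throughput = concurrency / latency (latency converted to seconds).

def littles_law_throughput(concurrent_requests, latency_ms):
    return concurrent_requests / (latency_ms / 1000.0)

# ResNet-50 v1.5, A100-SXM4-40GB: 256 concurrent clients at 48.35 ms
print(littles_law_throughput(256, 48.35))  # ~5,295; the table reports 5,294 inf/sec

# BERT Large, A100-SXM4-40GB: 24 concurrent clients at 39.31 ms
print(littles_law_throughput(24, 39.31))   # ~611; the table reports 611 inf/sec
```

The close agreement confirms the latency column is the end-to-end client latency at the stated concurrency.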
Inference Performance of NVIDIA A100, A40, A30, A10, A2, T4 and V100
Benchmarks are reproducible by following links to the NGC catalog scripts
Inference Natural Language Processing
BERT Inference Throughput
DGX A100 server w/ 1x NVIDIA A100 with 7 MIG instances of 1g.5gb | Batch Size = 94 | Precision: INT8 | Sequence Length = 128
DGX-1 server w/ 1x NVIDIA V100 | TensorRT 7.1 | Batch Size = 256 | Precision: Mixed | Sequence Length = 128
NVIDIA A100 BERT Inference Benchmarks
| Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT-Large with Sparsity | Attention | 94 | 6,188 sequences/sec | - | - | 1x A100 | DGX A100 | - | INT8 | SQuAD v1.1 | - | A100 SXM4-40GB |
A100 with 7 MIG instances of 1g.5gb | Sequence length=128 | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
Starting from 21.09-py3, ECC is enabled
Inference Image Classification on CNNs with TensorRT
ResNet-50 v1.5 Throughput
DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.2.3 | Batch Size = 128 | 22.04-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.2.3 | Batch Size = 128 | 22.04-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.2.3 | Batch Size = 128 | 22.04-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.2.3 | Batch Size = 128 | 22.04-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.2.3 | Batch Size = 128 | 22.04-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.2.3 | Batch Size = 128 | 22.04-py3 | Precision: Mixed | Dataset: Synthetic
ResNet-50 v1.5 Power Efficiency
DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.2.3 | Batch Size = 128 | 22.04-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.2.3 | Batch Size = 128 | 22.04-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.2.3 | Batch Size = 128 | 22.04-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.2.3 | Batch Size = 128 | 22.04-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.2.3 | Batch Size = 128 | 22.04-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.2.3 | Batch Size = 128 | 22.04-py3 | Precision: Mixed | Dataset: Synthetic
A100 Full Chip Inference Performance
| Network | Batch Size | Full Chip Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 11,516 images/sec | 59 images/sec/watt | 0.69 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-80GB | |
| 128 | 30,487 images/sec | 79 images/sec/watt | 4.2 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-80GB | |
| 223 | 30,530 images/sec | 81 images/sec/watt | 4.19 | 1x A100 | DGX A100 | 22.01-py3 | INT8 | Synthetic | TensorRT 8.2.2 | A100 SXM4-80GB | |
| ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 11,327 images/sec | 57 images/sec/watt | 0.71 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-80GB | |
| 128 | 29,434 images/sec | 75 images/sec/watt | 4.35 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-80GB | |
| ResNeXt101 | 32 | 7,674 samples/sec | - samples/sec/watt | 4.17 | 1x A100 | - | - | INT8 | Synthetic | TensorRT 7.2 | A100 SXM4-80GB |
| BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 7,239 sequences/sec | 30 sequences/sec/watt | 1.11 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB | |
| 128 | 15,020 sequences/sec | 38 sequences/sec/watt | 8.52 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB | |
| BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 2,653 sequences/sec | 3 sequences/sec/watt | 3.02 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB | |
| 128 | 4,819 sequences/sec | 4 sequences/sec/watt | 26.56 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB |
Containers with a hyphen indicate a pre-release container | Servers with a hyphen indicate a pre-production server
BERT-Large: Sequence Length = 128
For BS=1 inference refer to the Triton Inference Server section
Starting from 21.09-py3, ECC is enabled
A100 1/7 MIG Inference Performance
| Network | Batch Size | 1/7 MIG Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 8 | 3,656 images/sec | 31 images/sec/watt | 2.18 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB |
| 128 | 4,589 images/sec | 38 images/sec/watt | 27.9 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB | |
| ResNet-50v1.5 | 8 | 3,586 images/sec | 31 images/sec/watt | 2.23 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB |
| 128 | 4,436 images/sec | 36 images/sec/watt | 28.85 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB | |
| BERT-BASE | 8 | 1,789 sequences/sec | 15 sequences/sec/watt | 4.47 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB |
| 128 | 2,176 sequences/sec | 17 sequences/sec/watt | 58.82 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB | |
| BERT-LARGE | 8 | 586 sequences/sec | 5 sequences/sec/watt | 13.65 | 1x A100 | DGX A100 | 22.03-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB |
| 128 | 673 sequences/sec | 5 sequences/sec/watt | 190.1 | 1x A100 | DGX A100 | 22.03-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB |
Containers with a hyphen indicate a pre-release container | Servers with a hyphen indicate a pre-production server
BERT-Large: Sequence Length = 128
Starting from 21.09-py3, ECC is enabled
A100 7 MIG Inference Performance
| Network | Batch Size | 7 MIG Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 8 | 25,296 images/sec | 78 images/sec/watt | 2.24 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB |
| 128 | 31,883 images/sec | 82 images/sec/watt | 28.31 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB | |
| ResNet-50v1.5 | 8 | 24,768 images/sec | 75 images/sec/watt | 2.27 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB |
| 128 | 30,919 images/sec | 80 images/sec/watt | 29.03 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB | |
| BERT-BASE | 8 | 12,637 sequences/sec | 32 sequences/sec/watt | 4.46 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB |
| 128 | 14,304 sequences/sec | 36 sequences/sec/watt | 62.66 | 1x A100 | DGX A100 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB | |
| BERT-LARGE | 8 | 3,991 sequences/sec | 10 sequences/sec/watt | 14.01 | 1x A100 | DGX A100 | 22.03-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB |
| 128 | 5,167 sequences/sec | 13 sequences/sec/watt | 173.5 | 1x A100 | DGX A100 | 22.01-py3 | INT8 | Synthetic | TensorRT 8.2.2 | A100 SXM4-80GB |
Containers with a hyphen indicate a pre-release container | Servers with a hyphen indicate a pre-production server
BERT-Large: Sequence Length = 128
Starting from 21.09-py3, ECC is enabled
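Comparing the 1/7 MIG and 7 MIG tables shows the partitions scaling almost linearly: seven 1g.10gb slices deliver close to seven times the single-slice throughput. A small check (Python; figures copied from the ResNet-50, batch-size-128 rows of the two tables above):

```python
# MIG scaling efficiency: aggregate 7-slice throughput vs. 7x the single-slice figure.

def scaling_efficiency(single_slice, all_slices, n_slices=7):
    return all_slices / (single_slice * n_slices)

# ResNet-50, batch 128, A100 SXM4-80GB (rows above): 4,589 img/s per slice,
# 31,883 img/s with all seven slices active.
eff = scaling_efficiency(4_589, 31_883)
print(f"7-MIG scaling efficiency: {eff:.1%}")  # roughly 99%
```

Near-linear scaling is expected here because each MIG slice has dedicated compute, L2, and memory bandwidth, so concurrent slices contend for very little.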
A40 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 8 | 9,999 images/sec | 41 images/sec/watt | 0.8 | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 |
| 128 | 16,101 images/sec | 54 images/sec/watt | 7.95 | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 | |
| ResNet-50v1.5 | 8 | 9,726 images/sec | 39 images/sec/watt | 0.82 | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 |
| 128 | 15,378 images/sec | 51 images/sec/watt | 8.32 | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 | |
| BERT-BASE | 8 | 5,257 sequences/sec | 18 sequences/sec/watt | 1.52 | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 |
| 128 | 7,509 sequences/sec | 25 sequences/sec/watt | 17.05 | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 | |
| BERT-LARGE | 8 | 1,731 sequences/sec | 2 sequences/sec/watt | 4.62 | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 |
| 128 | 2,246 sequences/sec | 2 sequences/sec/watt | 57 | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
Starting from 21.09-py3, ECC is enabled
A30 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 7,034 images/sec | 45 images/sec/watt | 1.14 | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
| 128 | 15,929 images/sec | 97 images/sec/watt | 8.04 | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
| ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 7,221 images/sec | 51 images/sec/watt | 1.11 | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
| 128 | 15,428 images/sec | 94 images/sec/watt | 8.3 | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
| BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 4,975 sequences/sec | 30 sequences/sec/watt | 1.61 | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
| 128 | 7,116 sequences/sec | 43 sequences/sec/watt | 17.99 | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
| BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 1,688 sequences/sec | 3 sequences/sec/watt | 4.74 | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
| 128 | 2,561 sequences/sec | 16 sequences/sec/watt | 49.99 | 1x A30 | GIGABYTE G482-Z52-00 | 22.01-py3 | INT8 | Synthetic | TensorRT 8.2.2 | A30 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
Starting from 21.09-py3, ECC is enabled
A30 1/4 MIG Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 8 | 3,556 images/sec | 43 images/sec/watt | 2.25 | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| 128 | 4,500 images/sec | 50 images/sec/watt | 28.45 | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
| ResNet-50v1.5 | 8 | 3,463 images/sec | 42 images/sec/watt | 2.31 | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| 128 | 4,360 images/sec | 51 images/sec/watt | 29.36 | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
| BERT-BASE | 8 | 1,793 sequences/sec | 20 sequences/sec/watt | 4.46 | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| 128 | 2,143 sequences/sec | 22 sequences/sec/watt | 59.72 | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
| BERT-LARGE | 8 | 563 sequences/sec | 6 sequences/sec/watt | 14.22 | 1x A30 | GIGABYTE G482-Z52-00 | 22.03-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| 128 | 674 sequences/sec | 7 sequences/sec/watt | 189.93 | 1x A30 | GIGABYTE G482-Z52-00 | 22.03-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
Starting from 21.09-py3, ECC is enabled
A30 4 MIG Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 8 | 13,969 images/sec | 85 images/sec/watt | 2.31 | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| 128 | 16,926 images/sec | 103 images/sec/watt | 30.4 | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
| ResNet-50v1.5 | 8 | 13,543 images/sec | 83 images/sec/watt | 2.36 | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| 128 | 16,413 images/sec | 100 images/sec/watt | 31.34 | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
| BERT-BASE | 8 | 6,479 sequences/sec | 39 sequences/sec/watt | 5.06 | 1x A30 | GIGABYTE G482-Z52-00 | 22.03-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
| 128 | 7,402 sequences/sec | 45 sequences/sec/watt | 70.3 | 1x A30 | GIGABYTE G482-Z52-00 | 22.03-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
Starting from 21.09-py3, ECC is enabled
A10 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 7,874 images/sec | 53 images/sec/watt | 1.02 | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 | |
| 128 | 11,513 images/sec | 77 images/sec/watt | 11.12 | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 | |
| ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 7,681 images/sec | 52 images/sec/watt | 1.04 | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 | |
| 128 | 10,575 images/sec | 71 images/sec/watt | 12.1 | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 | |
| BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 4,005 sequences/sec | 27 sequences/sec/watt | 2 | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 | |
| 128 | 4,852 sequences/sec | 33 sequences/sec/watt | 26.38 | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 | |
| BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 1,286 sequences/sec | 3 sequences/sec/watt | 6.22 | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 | |
| 128 | 1,449 sequences/sec | 10 sequences/sec/watt | 88.35 | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
Starting from 21.09-py3, ECC is enabled
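The Efficiency column is throughput divided by board power, as noted beneath each table. A quick sketch of the arithmetic; the 150 W figure below is the A10's nominal board power and is an assumption here, since the tables use power measured during the run:

```python
def efficiency(throughput: float, board_power_watts: float) -> float:
    """Perf-per-watt as in the Efficiency column: throughput / board power."""
    return throughput / board_power_watts

# A10, ResNet-50 at batch 128: 11,513 images/sec at an assumed 150 W board power
print(round(efficiency(11513, 150)))  # → 77 images/sec/watt
```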
A2 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 8 | 2,602 images/sec | 43 images/sec/watt | 3.07 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
| 128 | 2,999 images/sec | 50 images/sec/watt | 42.68 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 | |
| ResNet-50v1.5 | 8 | 2,509 images/sec | 42 images/sec/watt | 3.19 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
| 128 | 2,892 images/sec | 49 images/sec/watt | 44.27 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 | |
| BERT-BASE | 8 | 1,052 sequences/sec | 18 sequences/sec/watt | 7.6 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
| 128 | 1,114 sequences/sec | 19 sequences/sec/watt | 114.93 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 | |
| BERT-LARGE | 8 | 316 sequences/sec | 5 sequences/sec/watt | 25.32 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
| 128 | 335 sequences/sec | 6 sequences/sec/watt | 382.43 | 1x A2 | GIGABYTE MZ52-G41-00 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
T4 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 3,691 images/sec | 53 images/sec/watt | 2.17 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 | |
| 128 | 4,732 images/sec | 68 images/sec/watt | 27.05 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 | |
| ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 3,135 images/sec | 45 images/sec/watt | 2.55 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 | |
| 128 | 4,426 images/sec | 63 images/sec/watt | 28.92 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 | |
| BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 1,647 sequences/sec | 24 sequences/sec/watt | 4.86 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 | |
| 128 | 1,780 sequences/sec | 25 sequences/sec/watt | 71.93 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 | |
| BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 554 sequences/sec | 2 sequences/sec/watt | 14.43 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 | |
| 128 | 546 sequences/sec | 8 sequences/sec/watt | 234.33 | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
For BS=1 inference, refer to the Triton Inference Server section
Starting from 21.09-py3, ECC is enabled
V100 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 4,303 images/sec | 15 images/sec/watt | 1.86 | 1x V100 | DGX-2 | 22.04-py3 | Mixed | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB | |
| 128 | 7,833 images/sec | 23 images/sec/watt | 16.34 | 1x V100 | DGX-2 | 22.04-py3 | Mixed | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB | |
| ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 4,198 images/sec | 15 images/sec/watt | 1.91 | 1x V100 | DGX-2 | 22.04-py3 | Mixed | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB | |
| 128 | 7,514 images/sec | 22 images/sec/watt | 17.03 | 1x V100 | DGX-2 | 22.04-py3 | Mixed | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB | |
| BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 2,052 sequences/sec | 7 sequences/sec/watt | 3.9 | 1x V100 | DGX-2 | 22.04-py3 | Mixed | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB | |
| 128 | 3,153 sequences/sec | 9 sequences/sec/watt | 40.6 | 1x V100 | DGX-2 | 22.04-py3 | Mixed | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB | |
| BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server page | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server page | ||||||||||
| 8 | 761 sequences/sec | 1 sequences/sec/watt | 10.52 | 1x V100 | DGX-2 | 22.04-py3 | Mixed | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB | |
| 128 | 967 sequences/sec | 1 sequences/sec/watt | 132.33 | 1x V100 | DGX-2 | 22.04-py3 | Mixed | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Containers with a hyphen indicate a pre-release container
For BS=1 inference, refer to the Triton Inference Server section
Starting from 21.09-py3, ECC is enabled
Inference Performance of NVIDIA GPU on Cloud
Benchmarks are reproducible by following the links to the scripts in the NGC catalog
A100 Inference Performance on Cloud
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 11,535 images/sec | 62 images/sec/watt | 0.69 | 1x A100 | GCP A2-HIGHGPU-1G | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB |
| 128 | 28,303 images/sec | 110 images/sec/watt | 4.52 | 1x A100 | GCP A2-HIGHGPU-1G | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB | |
| 128 | 28,343 images/sec | 109 images/sec/watt | 4.5 | 1x A100 | AWS EC2 p4d.24xlarge | 21.12-py3 | INT8 | Synthetic | TensorRT 8.2.1 | A100-SXM4-40GB | |
| BERT-LARGE | 8 | 2,729 sequences/sec | 10 sequences/sec/watt | 2.93 | 1x A100 | GCP A2-HIGHGPU-1G | 22.01-py3 | INT8 | Synthetic | TensorRT 8.2.2 | A100-SXM4-40GB |
| 128 | 5,078 sequences/sec | 13 sequences/sec/watt | 25.21 | 1x A100 | GCP A2-HIGHGPU-1G | 22.01-py3 | INT8 | Synthetic | TensorRT 8.2.2 | A100-SXM4-40GB |
BERT-Large: Sequence Length = 128
Starting from 21.09-py3, ECC is enabled
T4 Inference Performance on Cloud
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 3,301 images/sec | 47 images/sec/watt | 2.42 | 1x T4 | GCP N1-HIGHMEM-8 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |
| 128 | 4,304 images/sec | 62 images/sec/watt | 29.74 | 1x T4 | GCP N1-HIGHMEM-8 | 22.01-py3 | INT8 | Synthetic | TensorRT 8.2.2 | NVIDIA T4 | |
| 8 | 3,452 images/sec | 50 images/sec/watt | 2.3 | 1x T4 | AWS EC2 g4dn.4xlarge | 21.12-py3 | INT8 | Synthetic | TensorRT 8.2.1 | NVIDIA T4 | |
| 128 | 4,019 images/sec | 57 images/sec/watt | 32 | 1x T4 | AWS EC2 g4dn.4xlarge | 21.12-py3 | INT8 | Synthetic | TensorRT 8.2.1 | NVIDIA T4 | |
| BERT-LARGE | 8 | 512 sequences/sec | 7 sequences/sec/watt | 15.62 | 1x T4 | GCP N1-HIGHMEM-8 | 22.01-py3 | INT8 | Synthetic | TensorRT 8.2.2 | NVIDIA T4 |
| 128 | 467 sequences/sec | 7 sequences/sec/watt | 274.02 | 1x T4 | GCP N1-HIGHMEM-8 | 22.01-py3 | INT8 | Synthetic | TensorRT 8.2.2 | NVIDIA T4 | |
| 8 | 530 sequences/sec | - | 15 | 1x T4 | AWS EC2 g4dn.4xlarge | 21.12-py3 | INT8 | Synthetic | TensorRT 8.2.1 | NVIDIA T4 | |
| 128 | 532 sequences/sec | - | 240 | 1x T4 | AWS EC2 g4dn.4xlarge | 21.12-py3 | INT8 | Synthetic | TensorRT 8.2.1 | NVIDIA T4 |
BERT-Large: Sequence Length = 128
Starting from 21.09-py3, ECC is enabled
V100 Inference Performance on Cloud
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 4,122 images/sec | 18 images/sec/watt | 1.94 | 1x V100 | GCP N1-HIGHMEM-8 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM2-16GB |
| 128 | 7,343 images/sec | 25 images/sec/watt | 17.43 | 1x V100 | GCP N1-HIGHMEM-8 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM2-16GB | |
| 8 | 4,176 images/sec | 17 images/sec/watt | 1.9 | 1x V100 | AWS EC2 p3.2xlarge | 21.12-py3 | INT8 | Synthetic | TensorRT 8.2.1 | V100-SXM2-16GB | |
| 128 | 7,282 images/sec | 25 images/sec/watt | 18 | 1x V100 | AWS EC2 p3.2xlarge | 21.12-py3 | INT8 | Synthetic | TensorRT 8.2.1 | V100-SXM2-16GB | |
| BERT-LARGE | 8 | 688 sequences/sec | 2 sequences/sec/watt | 11.63 | 1x V100 | GCP N1-HIGHMEM-8 | 22.01-py3 | INT8 | Synthetic | TensorRT 8.2.2 | V100-SXM2-16GB |
| 128 | 914 sequences/sec | 3 sequences/sec/watt | 140.03 | 1x V100 | GCP N1-HIGHMEM-8 | 22.01-py3 | INT8 | Synthetic | TensorRT 8.2.2 | V100-SXM2-16GB | |
| 8 | 745 sequences/sec | - | 11 | 1x V100 | AWS EC2 p3.2xlarge | 21.12-py3 | INT8 | Synthetic | TensorRT 8.2.1 | V100-SXM2-16GB | |
| 128 | 918 sequences/sec | - | 139 | 1x V100 | AWS EC2 p3.2xlarge | 21.12-py3 | INT8 | Synthetic | TensorRT 8.2.1 | V100-SXM2-16GB |
BERT-Large: Sequence Length = 128
Starting from 21.09-py3, ECC is enabled
Conversational AI
NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.
Related Resources
Download and get started with NVIDIA Riva.
Riva Benchmarks
A100 ASR Benchmarks
A100 Best Streaming Throughput Mode (800 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 10.3 | 1 | A100 SXM4-40GB |
| Citrinet | 256 | 167.4 | 253 | A100 SXM4-40GB |
| Citrinet | 512 | 293.8 | 503 | A100 SXM4-40GB |
| Citrinet | 1024 | 661.8 | 988 | A100 SXM4-40GB |
| Quartznet | 1 | 17.2 | 1 | A100 SXM4-40GB |
| Quartznet | 256 | 142.8 | 254 | A100 SXM4-40GB |
| Quartznet | 512 | 214.2 | 505 | A100 SXM4-40GB |
| Quartznet | 1024 | 377.8 | 998 | A100 SXM4-40GB |
| Jasper | 1 | 20.9 | 1 | A100 SXM4-40GB |
| Jasper | 256 | 173.3 | 254 | A100 SXM4-40GB |
| Jasper | 512 | 286 | 504 | A100 SXM4-40GB |
| Jasper | 1024 | 700.6 | 989 | A100 SXM4-40GB |
A100 Best Streaming Latency Mode (100 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 9.8 | 1 | A100 SXM4-40GB |
| Citrinet | 16 | 26.8 | 16 | A100 SXM4-40GB |
| Citrinet | 128 | 91.1 | 127 | A100 SXM4-40GB |
| Quartznet | 1 | 9.1 | 1 | A100 SXM4-40GB |
| Quartznet | 16 | 17.9 | 16 | A100 SXM4-40GB |
| Quartznet | 128 | 55.5 | 127 | A100 SXM4-40GB |
| Jasper | 1 | 13.5 | 1 | A100 SXM4-40GB |
| Jasper | 16 | 31.5 | 16 | A100 SXM4-40GB |
| Jasper | 128 | 98.5 | 127 | A100 SXM4-40GB |
A100 Offline Mode (3200 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 11.6 | 1 | A100 SXM4-40GB |
| Citrinet | 512 | 366.8 | 503 | A100 SXM4-40GB |
| Citrinet | 1,024 | 680.4 | 989 | A100 SXM4-40GB |
| Citrinet | 1,512 | 981.3 | 1,437 | A100 SXM4-40GB |
| Quartznet | 1 | 34.4 | 1 | A100 SXM4-40GB |
| Quartznet | 512 | 457.5 | 504 | A100 SXM4-40GB |
| Quartznet | 1,024 | 941.9 | 989 | A100 SXM4-40GB |
| Quartznet | 1,512 | 1,592.7 | 1,421 | A100 SXM4-40GB |
| Jasper | 1 | 35.8 | 1 | A100 SXM4-40GB |
| Jasper | 512 | 631.3 | 503 | A100 SXM4-40GB |
| Jasper | 1,024 | 1,495.5 | 977 | A100 SXM4-40GB |
| Jasper | 1,512 | 2,544.2 | 1,395 | A100 SXM4-40GB |
ASR Throughput (RTFX) - Number of seconds of audio processed per second | Audio Chunk Size – Server-side configuration indicating the amount of new data to be considered by the acoustic model | ASR Dataset: Librispeech | The latency numbers were measured using the streaming recognition mode, with the BERT-Base punctuation model enabled, a 4-gram language model, a decoder beam width of 128, and timestamps enabled. The client and the server used audio chunks of the same duration (100 ms, 800 ms, or 3,200 ms, depending on the server configuration). The Riva streaming client riva_streaming_asr_client, provided in the Riva client image, was used with the --simulate_realtime flag to simulate transcription from a microphone; each stream performed 5 iterations over a sample audio file from the Librispeech dataset (1272-135031-0000.wav) | Riva version: v1.10.0-b1 | Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA A30, NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
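RTFX follows directly from the definition above: total seconds of audio processed divided by wall-clock seconds. A hypothetical sketch (the stream count and timings below are illustrative, not taken from Riva client output):

```python
def rtfx(total_audio_seconds: float, wall_clock_seconds: float) -> float:
    """ASR throughput in RTFX: seconds of audio processed per second of wall time."""
    return total_audio_seconds / wall_clock_seconds

# 256 simulated real-time streams of 60 s audio each; finishing in ~60.7 s of
# wall time yields an RTFX slightly below the stream count, as in the tables
print(round(rtfx(256 * 60, 60.7)))  # → 253
```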
A30 ASR Benchmarks
A30 Best Streaming Throughput Mode (800 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 15.3 | 1 | A30 |
| Citrinet | 256 | 262.4 | 253 | A30 |
| Citrinet | 512 | 494.5 | 500 | A30 |
| Citrinet | 1024 | 12,500 | 690 | A30 |
| Quartznet | 1 | 19.7 | 1 | A30 |
| Quartznet | 256 | 177.9 | 254 | A30 |
| Quartznet | 512 | 293.7 | 504 | A30 |
| Quartznet | 1024 | 654.7 | 987 | A30 |
| Jasper | 1 | 22.3 | 1 | A30 |
| Jasper | 256 | 252.1 | 253 | A30 |
| Jasper | 512 | 454.1 | 502 | A30 |
| Jasper | 1024 | 7,770.8 | 722 | A30 |
A30 Best Streaming Latency Mode (100 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 14.1 | 1 | A30 |
| Citrinet | 16 | 44.9 | 16 | A30 |
| Citrinet | 128 | 177.6 | 127 | A30 |
| Quartznet | 1 | 10.2 | 1 | A30 |
| Quartznet | 16 | 25.8 | 16 | A30 |
| Quartznet | 128 | 65.5 | 127 | A30 |
| Jasper | 1 | 15.1 | 1 | A30 |
| Jasper | 16 | 40.3 | 16 | A30 |
| Jasper | 128 | 2,663.2 | 120 | A30 |
A30 Offline Mode (3200 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 16.9 | 1 | A30 |
| Citrinet | 512 | 574.1 | 501 | A30 |
| Citrinet | 1,024 | 1,166.1 | 979 | A30 |
| Citrinet | 1,512 | 8,992.4 | 1,108 | A30 |
| Quartznet | 1 | 41.4 | 1 | A30 |
| Quartznet | 512 | 696.4 | 502 | A30 |
| Quartznet | 1,024 | 1,536.5 | 974 | A30 |
| Quartznet | 1,512 | 2,712.4 | 1,392 | A30 |
| Jasper | 1 | 40.3 | 1 | A30 |
| Jasper | 512 | 1,149.5 | 498 | A30 |
| Jasper | 1,024 | 2,981.7 | 948 | A30 |
| Jasper | 1,512 | 18,136 | 978 | A30 |
Measurement configuration identical to the A100 ASR benchmarks above.
V100 ASR Benchmarks
V100 Best Streaming Throughput Mode (800 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 12.5 | 1 | V100 SXM2-16GB |
| Citrinet | 256 | 284.5 | 253 | V100 SXM2-16GB |
| Citrinet | 512 | 553.9 | 499 | V100 SXM2-16GB |
| Citrinet | 768 | 4,443.9 | 650 | V100 SXM2-16GB |
| Quartznet | 1 | 13.7 | 1 | V100 SXM2-16GB |
| Quartznet | 256 | 196.4 | 254 | V100 SXM2-16GB |
| Quartznet | 512 | 308.1 | 502 | V100 SXM2-16GB |
| Quartznet | 768 | 458.1 | 748 | V100 SXM2-16GB |
| Jasper | 1 | 23.6 | 1 | V100 SXM2-16GB |
| Jasper | 128 | 191.7 | 127 | V100 SXM2-16GB |
| Jasper | 256 | 336.6 | 253 | V100 SXM2-16GB |
| Jasper | 512 | 937.1 | 497 | V100 SXM2-16GB |
V100 Best Streaming Latency Mode (100 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 11.2 | 1 | V100 SXM2-16GB |
| Citrinet | 16 | 34.9 | 16 | V100 SXM2-16GB |
| Citrinet | 128 | 213.3 | 127 | V100 SXM2-16GB |
| Quartznet | 1 | 8.3 | 1 | V100 SXM2-16GB |
| Quartznet | 16 | 17.7 | 16 | V100 SXM2-16GB |
| Quartznet | 128 | 183.1 | 127 | V100 SXM2-16GB |
| Jasper | 1 | 19.3 | 1 | V100 SXM2-16GB |
| Jasper | 16 | 38.9 | 16 | V100 SXM2-16GB |
| Jasper | 64 | 123.1 | 64 | V100 SXM2-16GB |
V100 Offline Mode (3200 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 12.7 | 1 | V100 SXM2-16GB |
| Citrinet | 256 | 350.9 | 252 | V100 SXM2-16GB |
| Citrinet | 512 | 627.1 | 499 | V100 SXM2-16GB |
| Citrinet | 768 | 936.7 | 738 | V100 SXM2-16GB |
| Citrinet | 1,024 | 1,500.5 | 972 | V100 SXM2-16GB |
| Quartznet | 1 | 29.5 | 1 | V100 SXM2-16GB |
| Quartznet | 256 | 365.4 | 253 | V100 SXM2-16GB |
| Quartznet | 512 | 669.9 | 501 | V100 SXM2-16GB |
| Quartznet | 768 | 1,199.8 | 737 | V100 SXM2-16GB |
| Quartznet | 1,024 | 1,662.2 | 965 | V100 SXM2-16GB |
| Jasper | 1 | 35.3 | 1 | V100 SXM2-16GB |
| Jasper | 256 | 740.5 | 251 | V100 SXM2-16GB |
| Jasper | 512 | 1,757.2 | 489 | V100 SXM2-16GB |
| Jasper | 768 | 3,138.9 | 711 | V100 SXM2-16GB |
Measurement configuration identical to the A100 ASR benchmarks above.
T4 ASR Benchmarks
T4 Best Streaming Throughput Mode (800 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 26 | 1 | NVIDIA T4 |
| Citrinet | 64 | 178.7 | 64 | NVIDIA T4 |
| Citrinet | 128 | 300.4 | 127 | NVIDIA T4 |
| Citrinet | 256 | 710.4 | 249 | NVIDIA T4 |
| Citrinet | 384 | 8,847.0 | 290 | NVIDIA T4 |
| Quartznet | 1 | 28.4 | 1 | NVIDIA T4 |
| Quartznet | 64 | 144.1 | 64 | NVIDIA T4 |
| Quartznet | 128 | 190.3 | 127 | NVIDIA T4 |
| Quartznet | 256 | 296.5 | 252 | NVIDIA T4 |
| Quartznet | 384 | 422.7 | 376 | NVIDIA T4 |
| Jasper | 1 | 74.8 | 1 | NVIDIA T4 |
| Jasper | 64 | 218.8 | 64 | NVIDIA T4 |
| Jasper | 128 | 359.5 | 126 | NVIDIA T4 |
| Jasper | 256 | 1,030.6 | 249 | NVIDIA T4 |
T4 Best Streaming Latency Mode (100 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 22.6 | 1 | NVIDIA T4 |
| Citrinet | 16 | 66.1 | 16 | NVIDIA T4 |
| Citrinet | 64 | 1,803.7 | 62 | NVIDIA T4 |
| Quartznet | 1 | 16.1 | 1 | NVIDIA T4 |
| Quartznet | 16 | 40.7 | 16 | NVIDIA T4 |
| Quartznet | 64 | 104.5 | 64 | NVIDIA T4 |
| Jasper | 1 | 46.6 | 1 | NVIDIA T4 |
| Jasper | 8 | 47.4 | 8 | NVIDIA T4 |
| Jasper | 16 | 72 | 16 | NVIDIA T4 |
T4 Offline Mode (3200 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Citrinet | 1 | 28.3 | 1 | NVIDIA T4 |
| Citrinet | 256 | 709.2 | 250 | NVIDIA T4 |
| Citrinet | 512 | 3,510.8 | 449 | NVIDIA T4 |
| Quartznet | 1 | 54.2 | 1 | NVIDIA T4 |
| Quartznet | 256 | 770.9 | 251 | NVIDIA T4 |
| Quartznet | 512 | 1,685.9 | 486 | NVIDIA T4 |
| Jasper | 1 | 96.7 | 1 | NVIDIA T4 |
| Jasper | 256 | 1,888.4 | 245 | NVIDIA T4 |
Measurement configuration identical to the A100 ASR benchmarks above.
A100 TTS Benchmarks
| Model | # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| FastPitch + Hifi-GAN | 1 | 0.03 | 0.003 | 133 | A100 SXM4-40GB |
| FastPitch + Hifi-GAN | 4 | 0.04 | 0.006 | 340 | A100 SXM4-40GB |
| FastPitch + Hifi-GAN | 6 | 0.06 | 0.007 | 390 | A100 SXM4-40GB |
| FastPitch + Hifi-GAN | 8 | 0.07 | 0.009 | 443 | A100 SXM4-40GB |
| FastPitch + Hifi-GAN | 10 | 0.07 | 0.009 | 464 | A100 SXM4-40GB |
| Tacotron 2 + WaveGlow | 1 | 0.05 | 0.02 | 34 | A100 SXM4-40GB |
| Tacotron 2 + WaveGlow | 4 | 0.26 | 0.03 | 59 | A100 SXM4-40GB |
| Tacotron 2 + WaveGlow | 6 | 0.38 | 0.03 | 66 | A100 SXM4-40GB |
| Tacotron 2 + WaveGlow | 8 | 0.51 | 0.04 | 70 | A100 SXM4-40GB |
| Tacotron 2 + WaveGlow | 10 | 0.61 | 0.04 | 73 | A100 SXM4-40GB |
TTS Throughput (RTFX) - Number of seconds of audio generated per second | Dataset: LJSpeech | Performance of the Riva text-to-speech (TTS) service was measured for different numbers of parallel streams. Each parallel stream performed 10 iterations over 10 input strings from the LJSpeech dataset. Latency to first audio chunk, latency between successive audio chunks, and throughput were measured. Riva version: v1.10.0-b1 | Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA A30, NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
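The two TTS latency metrics can be derived from the arrival timestamps of the audio chunks returned for a request. A hypothetical sketch (the timestamps below are illustrative, not measured values):

```python
def tts_latencies(request_time, chunk_times):
    """Latency to first audio chunk and average latency between successive chunks."""
    first = chunk_times[0] - request_time
    gaps = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    return first, sum(gaps) / len(gaps)

# Chunk arrival times (s) for one stream: first audio after 30 ms, then ~3 ms gaps
first, gap = tts_latencies(0.0, [0.030, 0.033, 0.036, 0.039])
print(round(first, 3), round(gap, 3))  # → 0.03 0.003
```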
A30 TTS Benchmarks
| Model | # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| FastPitch + Hifi-GAN | 1 | 0.03 | 0.003 | 133 | A30 |
| FastPitch + Hifi-GAN | 4 | 0.04 | 0.006 | 340 | A30 |
| FastPitch + Hifi-GAN | 6 | 0.06 | 0.007 | 390 | A30 |
| FastPitch + Hifi-GAN | 8 | 0.07 | 0.009 | 443 | A30 |
| FastPitch + Hifi-GAN | 10 | 0.07 | 0.009 | 464 | A30 |
| Tacotron 2 + WaveGlow | 1 | 0.07 | 0.03 | 25 | A30 |
| Tacotron 2 + WaveGlow | 4 | 0.33 | 0.04 | 45 | A30 |
| Tacotron 2 + WaveGlow | 6 | 0.51 | 0.05 | 48 | A30 |
| Tacotron 2 + WaveGlow | 8 | 0.69 | 0.06 | 50 | A30 |
| Tacotron 2 + WaveGlow | 10 | 0.84 | 0.06 | 50 | A30 |
Measurement configuration identical to the A100 TTS benchmarks above.
V100 TTS Benchmarks
| Model | # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| FastPitch + Hifi-GAN | 1 | 0.03 | 0.005 | 107 | V100 SXM2-16GB |
| FastPitch + Hifi-GAN | 4 | 0.07 | 0.01 | 212 | V100 SXM2-16GB |
| FastPitch + Hifi-GAN | 6 | 0.10 | 0.01 | 226 | V100 SXM2-16GB |
| FastPitch + Hifi-GAN | 8 | 0.13 | 0.02 | 236 | V100 SXM2-16GB |
| FastPitch + Hifi-GAN | 10 | 0.15 | 0.02 | 232 | V100 SXM2-16GB |
| Tacotron 2 + WaveGlow | 1 | 0.06 | 0.03 | 25 | V100 SXM2-16GB |
| Tacotron 2 + WaveGlow | 4 | 0.39 | 0.05 | 37 | V100 SXM2-16GB |
| Tacotron 2 + WaveGlow | 6 | 0.60 | 0.06 | 40 | V100 SXM2-16GB |
| Tacotron 2 + WaveGlow | 8 | 0.81 | 0.06 | 43 | V100 SXM2-16GB |
| Tacotron 2 + WaveGlow | 10 | 0.98 | 0.07 | 43 | V100 SXM2-16GB |
Measurement configuration identical to the A100 TTS benchmarks above.
T4 TTS Benchmarks
| Model | # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|---|
| FastPitch + Hifi-GAN | 1 | 0.05 | 0.006 | 73 | NVIDIA T4 |
| FastPitch + Hifi-GAN | 4 | 0.11 | 0.02 | 132 | NVIDIA T4 |
| FastPitch + Hifi-GAN | 6 | 0.15 | 0.02 | 141 | NVIDIA T4 |
| FastPitch + Hifi-GAN | 8 | 0.19 | 0.03 | 148 | NVIDIA T4 |
| FastPitch + Hifi-GAN | 10 | 0.21 | 0.03 | 150 | NVIDIA T4 |
| Tacotron 2 + WaveGlow | 1 | 0.11 | 0.05 | 15 | NVIDIA T4 |
| Tacotron 2 + WaveGlow | 4 | 0.72 | 0.11 | 18 | NVIDIA T4 |
| Tacotron 2 + WaveGlow | 6 | 1.16 | 0.14 | 19 | NVIDIA T4 |
| Tacotron 2 + WaveGlow | 8 | 1.64 | 0.16 | 19 | NVIDIA T4 |
| Tacotron 2 + WaveGlow | 10 | 2.07 | 0.17 | 19 | NVIDIA T4 |
Measurement configuration identical to the A100 TTS benchmarks above.
Last updated: May 13th, 2022