NVIDIA Data Center Deep Learning Product Performance
Reproducible Performance
Reproduce these results on your own systems by following the instructions in the Measuring Training and Inferencing Performance on NVIDIA AI Platforms Reviewer’s Guide.
Related Resources
Read why training to convergence is essential for enterprise AI adoption.
Learn how cloud services and OEMs raise the bar on AI training with NVIDIA AI in MLPerf Training.
Access containers in the NVIDIA NGC™ catalog.
Learn how MLPerf Benchmarks show why AI is the future of HPC.
HPC Performance
Review the latest GPU-acceleration factors of popular HPC applications.
Training to Convergence
Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Training to convergence is the best test of whether an AI system is ready to deliver meaningful results in the field.
Related Resources
Read our blog on convergence for more details.
Get up and running quickly with NVIDIA’s complete solution stack:
Pull software containers from NVIDIA NGC.
Learn how NVIDIA A100 Tensor Core GPUs provide unprecedented acceleration at every scale, setting records in MLPerf.
NVIDIA Performance on MLPerf 1.1 Training Benchmarks
BERT Time to Train on A100
PyTorch | Precision: Mixed | Dataset: Wikipedia 2020/01/01 | Convergence criteria - refer to MLPerf requirements
MLPerf Training Performance
NVIDIA A100 Performance on MLPerf 1.1 AI Benchmarks - Closed Division
| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| MXNet | ResNet-50 v1.5 | 27.568 | 75.90% classification | 8x A100 | Inspur: NF5488A5 | 1.1-2050 | Mixed | ImageNet2012 | A100-SXM4-80GB |
| MXNet | ResNet-50 v1.5 | 4.534 | 75.90% classification | 64x A100 | DGX A100 | 1.1-2070 | Mixed | ImageNet2012 | A100-SXM4-80GB |
| MXNet | ResNet-50 v1.5 | 0.575 | 75.90% classification | 1,024x A100 | DGX A100 | 1.1-2078 | Mixed | ImageNet2012 | A100-SXM4-80GB |
| MXNet | ResNet-50 v1.5 | 0.347 | 75.90% classification | 4,320x A100 | DGX A100 | 1.1-2082 | Mixed | ImageNet2012 | A100-SXM4-80GB |
| MXNet | SSD | 7.979 | 23.0% mAP | 8x A100 | Inspur: NF5488A5 | 1.1-2050 | Mixed | COCO2017 | A100-SXM4-80GB |
| MXNet | SSD | 1.517 | 23.0% mAP | 64x A100 | Azure: 8x Standard_ND96amsr_v4 | 1.1-2005 | Mixed | COCO2017 | A100-SXM4-80GB |
| MXNet | SSD | 0.454 | 23.0% mAP | 1,024x A100 | DGX A100 | 1.1-2078 | Mixed | COCO2017 | A100-SXM4-80GB |
| MXNet | 3D U-Net | 23.464 | 0.908 Mean DICE score | 8x A100 | Inspur: NF5688M6 | 1.1-2051 | Mixed | KiTS19 | A100-SXM4-80GB |
| MXNet | 3D U-Net | 3.757 | 0.908 Mean DICE score | 72x A100 | DGX A100 | 1.1-2072 | Mixed | KiTS19 | A100-SXM4-80GB |
| MXNet | 3D U-Net | 1.262 | 0.908 Mean DICE score | 768x A100 | Azure: 96x Standard_ND96amsr_v4 | 1.1-2012 | Mixed | KiTS19 | A100-SXM4-80GB |
| PyTorch | BERT | 19.389 | 0.712 Mask-LM accuracy | 8x A100 | Inspur: NF5688M6 | 1.1-2053 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB |
| PyTorch | BERT | 3.038 | 0.712 Mask-LM accuracy | 64x A100 | DGX A100 | 1.1-2071 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB |
| PyTorch | BERT | 0.558 | 0.712 Mask-LM accuracy | 1,024x A100 | DGX A100 | 1.1-2079 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB |
| PyTorch | BERT | 0.226 | 0.712 Mask-LM accuracy | 4,320x A100 | DGX A100 | 1.1-2083 | Mixed | Wikipedia 2020/01/01 | A100-SXM4-80GB |
| PyTorch | Mask R-CNN | 45.667 | 0.377 Box min AP and 0.339 Mask min AP | 8x A100 | Inspur: NF5688M6 | 1.1-2053 | Mixed | COCO2017 | A100-SXM4-80GB |
| PyTorch | Mask R-CNN | 14.499 | 0.377 Box min AP and 0.339 Mask min AP | 32x A100 | DGX A100 | 1.1-2068 | Mixed | COCO2017 | A100-SXM4-80GB |
| PyTorch | Mask R-CNN | 3.242 | 0.377 Box min AP and 0.339 Mask min AP | 408x A100 | DGX A100 | 1.1-2076 | Mixed | COCO2017 | A100-SXM4-80GB |
| PyTorch | RNN-T | 33.377 | 0.058 Word Error Rate | 8x A100 | Inspur: NF5488A5 | 1.1-2052 | Mixed | LibriSpeech | A100-SXM4-80GB |
| PyTorch | RNN-T | 4.408 | 0.058 Word Error Rate | 128x A100 | DGX A100 | 1.1-2074 | Mixed | LibriSpeech | A100-SXM4-80GB |
| PyTorch | RNN-T | 2.375 | 0.058 Word Error Rate | 1,536x A100 | DGX A100 | 1.1-2080 | Mixed | LibriSpeech | A100-SXM4-80GB |
| TensorFlow | MiniGo | 264.868 | 50% win rate vs. checkpoint | 8x A100 | DGX A100 | 1.1-2067 | Mixed | Go | A100-SXM4-80GB |
| TensorFlow | MiniGo | 29.093 | 50% win rate vs. checkpoint | 256x A100 | DGX A100 | 1.1-2075 | Mixed | Go | A100-SXM4-80GB |
| TensorFlow | MiniGo | 15.465 | 50% win rate vs. checkpoint | 1,792x A100 | DGX A100 | 1.1-2081 | Mixed | Go | A100-SXM4-80GB |
| NVIDIA Merlin HugeCTR | DLRM | 1.698 | 0.8025 AUC | 8x A100 | Inspur: NF5688M6 | 1.1-2049 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB |
| NVIDIA Merlin HugeCTR | DLRM | 0.685 | 0.8025 AUC | 64x A100 | DGX A100 | 1.1-2069 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB |
| NVIDIA Merlin HugeCTR | DLRM | 0.633 | 0.8025 AUC | 112x A100 | DGX A100 | 1.1-2073 | Mixed | Criteo AI Lab’s Terabyte Click-Through-Rate (CTR) | A100-SXM4-80GB |
MLPerf™ v1.1 Training Closed: The MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
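Time-to-train entries like the ResNet-50 v1.5 rows above can be read as strong-scaling efficiency: the measured speedup from adding GPUs divided by the ideal linear speedup. A minimal sketch — the timings come from the table above; the helper function itself is illustrative, not part of MLPerf:

```python
def scaling_efficiency(t_base_min, gpus_base, t_scaled_min, gpus_scaled):
    """Strong-scaling efficiency: measured speedup divided by ideal speedup."""
    actual = t_base_min / t_scaled_min   # measured speedup from adding GPUs
    ideal = gpus_scaled / gpus_base      # perfectly linear speedup
    return actual / ideal

# ResNet-50 v1.5 rows from the MLPerf v1.1 table above:
# 27.568 min on 8x A100 vs. 4.534 min on 64x A100
eff = scaling_efficiency(27.568, 8, 4.534, 64)
print(f"8 -> 64 GPUs: {eff:.1%} scaling efficiency")
```

At roughly 76% efficiency from 8 to 64 GPUs, most but not all of the added hardware translates into faster convergence.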
NVIDIA A100 Performance on MLPerf 1.0 Training HPC Benchmarks: Strong Scaling - Closed Division
| Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| MXNet | CosmoFlow | 8.04 | Mean average error 0.124 | 1,024x A100 | DGX A100 | 1.0-1120 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB |
| MXNet | CosmoFlow | 25.78 | Mean average error 0.124 | 128x A100 | DGX A100 | 1.0-1121 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB |
| PyTorch | DeepCAM | 1.67 | IOU 0.82 | 2,048x A100 | DGX A100 | 1.0-1122 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB |
| PyTorch | DeepCAM | 2.65 | IOU 0.82 | 512x A100 | DGX A100 | 1.0-1123 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB |
NVIDIA A100 Performance on MLPerf 1.0 Training HPC Benchmarks: Weak Scaling - Closed Division
| Framework | Network | Throughput | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|
| MXNet | CosmoFlow | 0.73 models/min | Mean average error 0.124 | 4,096x A100 | DGX A100 | 1.0-1131 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB |
| PyTorch | DeepCAM | 5.27 models/min | IOU 0.82 | 4,096x A100 | DGX A100 | 1.0-1132 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB |
MLPerf™ v1.0 Training HPC Closed: The MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For MLPerf™ v1.0 Training HPC rules and guidelines, see the MLCommons website.
Converged Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100
Benchmarks are reproducible by following links to the NGC catalog scripts
A100 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 85 | 77.14 Top 1 | 23,184 images/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 192 | ImageNet2012 | A100-SXM4-80GB |
| PyTorch | 1.8.0a0 | Mask R-CNN | 176 | .34 AP Segm | 167 images/sec | 8x A100 | DGX A100 | 20.12-py3 | TF32 | 8 | COCO 2014 | A100-SXM-80GB |
| PyTorch | 1.6.0a0 | SSD v1.1 | 43 | .25 mAP | 3,092 images/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 128 | COCO 2017 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | Tacotron2 | 99 | .56 Training Loss | 306,044 total output mels/sec | 8x A100 | DGX A100 | 21.08-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | WaveGlow | 285 | -5.84 Training Loss | 1,486,357 output samples/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-80GB |
| PyTorch | 1.6.0a0 | Jasper | 3,600 | 3.53 dev-clean WER | 603 sequences/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 64 | LibriSpeech | A100 SXM4-40GB |
| PyTorch | 1.10.0a0 | Transformer | 214 | 27.78 BLEU Score | 470,914 tokens/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 10240 | wmt14-en-de | A100-SXM4-80GB |
| PyTorch | 1.6.0a0 | FastPitch | 216 | .18 Training Loss | 1,040,206 frames/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | 32 | LJSpeech 1.1 | A100 SXM4-40GB |
| PyTorch | 1.10.0a0 | GNMT V2 | 17 | 24.3 BLEU Score | 916,105 total tokens/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-80GB |
| PyTorch | 1.10.0a0 | NCF | 0.38 | .96 Hit Rate at 10 | 152,159,129 samples/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 131072 | MovieLens 20M | A100-SXM4-80GB |
| PyTorch | 1.10.0a0 | BERT-LARGE | 3 | 91.05 F1 | 922 sequences/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| PyTorch | 1.9.0a0 | Transformer-XL Large | 408 | 14.03 Perplexity | 202,130 total tokens/sec | 8x A100 | DGX A100 | 21.06-py3 | Mixed | 16 | WikiText-103 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | Transformer-XL Base | 210 | 22.53 Perplexity | 628,500 total tokens/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 128 | WikiText-103 | A100-SXM4-80GB |
| PyTorch | 1.6.0a0 | BERT-Large Pre-Training P1 | 2,379 | - | 3,231 sequences/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | - | Wikipedia+BookCorpus | A100-SXM4-40GB |
| PyTorch | 1.6.0a0 | BERT-Large Pre-Training P2 | 1,377 | 1.34 Final Loss | 630 sequences/sec | 8x A100 | DGX A100 | 20.06-py3 | Mixed | - | Wikipedia+BookCorpus | A100-SXM4-40GB |
| PyTorch | 1.6.0a0 | BERT-Large Pre-Training E2E | 3,756 | 1.34 Final Loss | - | 8x A100 | DGX A100 | 20.06-py3 | Mixed | - | Wikipedia+BookCorpus | A100-SXM4-40GB |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 95 | 76.97 Top 1 | 20,388 images/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
| TensorFlow | 1.15.5 | ResNext101 | 192 | 79.34 Top 1 | 10,078 images/sec | 8x A100 | DGX A100 | 21.08-py3 | Mixed | 256 | ImageNet2012 | A100-SXM-80GB |
| TensorFlow | 1.15.5 | SE-ResNext101 | 226 | 79.71 Top 1 | 8,578 images/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
| TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 1,051 images/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 2 | DAGM2007 | A100-SXM4-80GB |
| TensorFlow | 1.15.5 | U-Net Medical | 6 | .9 Dice Score | 952 images/sec | 8x A100 | DGX A100 | 21.03-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM-80GB |
| TensorFlow | 1.15.5 | VAE-CF | 1 | .43 NDCG@100 | 1,642,346 users processed/sec | 8x A100 | DGX A100 | 21.09-py3 | TF32 | 3072 | MovieLens 20M | A100-SXM4-80GB |
| TensorFlow | 2.6.0 | Wide and Deep | 7 | .66 MAP at 12 | 3,662,293 samples/sec | 8x A100 | DGX A100 | 21.09-py3 | TF32 | 16384 | Kaggle Outbrain Click Prediction | A100-SXM4-80GB |
| TensorFlow | 1.15.5 | BERT-LARGE | 11 | 91.48 F1 | 859 sequences/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 24 | SQuAD v1.1 | A100-SXM4-80GB |
| TensorFlow | 2.6.0 | Electra Base Fine Tuning | 3 | 92.58 F1 | 2,843 sequences/sec | 8x A100 | DGX A100 | 21.09-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
| TensorFlow | 2.2.0 | EfficientNet-B4 | 4,231 | 82.81 Top 1 | 2,535 images/sec | 8x A100 | DGX A100 | 20.08-py3 | Mixed | 160 | ImageNet2012 | A100-SXM-80GB |
| TensorFlow | 1.15.5 | V-Net Medical | 2 | .84 Anterior DICE | 1,227 images/sec | 8x A100 | DGX A100 | 21.09-py3 | TF32 | 2 | Hippocampus head and body from Medical Segmentation Decathlon | A100-SXM4-40GB |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine-Tuning (SQuAD v1.1) with Sequence Length of 384
BERT-Large Pre-Training Sequence Length for Phase 1 = 128 and Phase 2 = 512 | Batch Size for Phase 1 = 65,536 and Phase 2 = 32,768
EfficientNet-B4: Mixup = 0.2 | Auto-Augmentation | cuDNN Version = 8.0.5.39 | NCCL Version = 2.7.8
Starting from 21.09-py3, ECC is enabled
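As the footnotes above indicate, BERT-Large pre-training runs in two phases with different sequence lengths and batch sizes, and the end-to-end (E2E) row is simply the sum of the two phase times. A quick arithmetic check using the minutes from the A100 table above:

```python
# BERT-Large pre-training times (minutes) from the A100 table above
phase1 = 2379   # Phase 1, sequence length 128
phase2 = 1377   # Phase 2, sequence length 512
e2e = 3756      # end-to-end row

# The E2E figure is the sum of the two phases
assert phase1 + phase2 == e2e
print(f"End-to-end pre-training: {e2e / 60:.1f} hours on 8x A100")
```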
A40 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 203 | 77.2 Top 1 | 9,650 images/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 192 | ImageNet2012 | A40 |
| PyTorch | 1.9.0a0 | NCF | 1 | .96 Hit Rate at 10 | 59,667,265 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 131072 | MovieLens 20M | A40 |
| PyTorch | 1.10.0a0 | BERT-LARGE | 8 | 91.03 F1 | 387 sequences/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 32 | SQuAD v1.1 | A40 |
| PyTorch | 1.10.0a0 | Tacotron2 | 131 | .55 Training Loss | 234,625 total output mels/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 128 | LJSpeech 1.1 | A40 |
| PyTorch | 1.10.0a0 | WaveGlow | 562 | -5.8 Training Loss | 747,941 output samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 10 | LJSpeech 1.1 | A40 |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 224 | 76.88 Top 1 | 8,626 images/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 256 | ImageNet2012 | A40 |
| TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.95 | 626 images/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 2 | DAGM2007 | A40 |
| TensorFlow | 1.15.5 | SE-ResNext101 | 542 | 79.52 Top 1 | 3,553 images/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 256 | ImageNet2012 | A40 |
| TensorFlow | 2.6.0 | Electra Base Fine Tuning | 4 | 92.52 F1 | 1,090 sequences/sec | 8x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 32 | SQuAD v1.1 | A40 |
Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine-Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
A30 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.8.0 | ResNet-50 v1.5 | 182 | 77.34 Top 1 | 10,739 images/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 192 | ImageNet2012 | A30 |
| PyTorch | 1.9.0a0 | Tacotron2 | 215 | .54 Training Loss | 144,326 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 104 | LJSpeech 1.1 | A30 |
| PyTorch | 1.9.0a0 | WaveGlow | 533 | -5.82 Training Loss | 794,511 output samples/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 10 | LJSpeech 1.1 | A30 |
| PyTorch | 1.9.0a0 | Transformer | 1,108 | 27.58 BLEU Score | 87,584 words/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 2560 | wmt14-en-de | A30 |
| PyTorch | 1.9.0a0 | GNMT V2 | 81 | 24.65 BLEU Score | 219,582 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | TF32 | 128 | wmt16-en-de | A30 |
| PyTorch | 1.10.0a0 | NCF | 1 | .96 Hit Rate at 10 | 55,275,748 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 131072 | MovieLens 20M | A30 |
| PyTorch | 1.10.0a0 | BERT-LARGE | 11 | 91.11 F1 | 278 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 10 | SQuAD v1.1 | A30 |
| PyTorch | 1.9.0a0 | Transformer-XL Base | 151 | 22.16 Perplexity | 219,994 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 32 | WikiText-103 | A30 |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 198 | 76.78 Top 1 | 9,798 images/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 256 | ImageNet2012 | A30 |
| TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 576 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 2 | DAGM2007 | A30 |
| TensorFlow | 1.15.5 | U-Net Medical | 9 | .9 DICE Score | 461 images/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 8 | EM segmentation challenge | A30 |
| TensorFlow | 1.15.5 | VAE-CF | 1 | .43 NDCG@100 | 828,913 users processed/sec | 8x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | TF32 | 3072 | MovieLens 20M | A30 |
| TensorFlow | 1.15.5 | SE-ResNext101 | 573 | 79.83 Top 1 | 3,399 images/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 96 | ImageNet2012 | A30 |
| TensorFlow | 2.4.0 | Electra Base Fine Tuning | 6 | 92.65 F1 | 904 sequences/sec | 8x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 16 | SQuAD v1.1 | A30 |
| TensorFlow | 1.15.5 | V-Net Medical | 3 | .84 Anterior DICE | 509 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 2 | Hippocampus head and body from Medical Segmentation Decathlon | A30 |
Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine-Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
A10 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.8.0 | ResNet-50 v1.5 | 242 | 77.25 Top 1 | 8,117 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 192 | ImageNet2012 | A10 |
| PyTorch | 1.9.0a0 | SE-ResNeXt101 | 996 | 80.24 Top 1 | 1,953 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 112 | ImageNet2012 | A10 |
| PyTorch | 1.9.0a0 | Tacotron2 | 204 | .5 Training Loss | 151,946 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 104 | LJSpeech 1.1 | A10 |
| PyTorch | 1.9.0a0 | WaveGlow | 637 | -5.84 Training Loss | 664,022 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 10 | LJSpeech 1.1 | A10 |
| PyTorch | 1.9.0a0 | Transformer | 1,365 | 27.8 BLEU Score | 70,844 words/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 2560 | wmt14-en-de | A10 |
| PyTorch | 1.9.0a0 | FastPitch | 177 | .25 Training Loss | 467,464 frames/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 32 | LJSpeech 1.1 | A10 |
| PyTorch | 1.9.0a0 | GNMT V2 | 61 | 24.49 BLEU Score | 292,052 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 128 | wmt16-en-de | A10 |
| PyTorch | 1.10.0a0 | NCF | 1 | .96 Hit Rate at 10 | 44,205,872 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 131072 | MovieLens 20M | A10 |
| PyTorch | 1.10.0a0 | BERT-LARGE | 13 | 91.54 F1 | 224 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 10 | SQuAD v1.1 | A10 |
| PyTorch | 1.9.0a0 | Transformer-XL Base | 176 | 22.16 Perplexity | 187,731 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 32 | WikiText-103 | A10 |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 266 | 76.74 Top 1 | 7,283 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 256 | ImageNet2012 | A10 |
| TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 526 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 2 | DAGM2007 | A10 |
| TensorFlow | 1.15.5 | U-Net Medical | 14 | .9 DICE Score | 324 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 8 | EM segmentation challenge | A10 |
| TensorFlow | 1.15.5 | VAE-CF | 1 | .43 NDCG@100 | 614,709 users processed/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | TF32 | 3072 | MovieLens 20M | A10 |
| TensorFlow | 1.15.5 | SE-ResNext101 | 866 | 79.65 Top 1 | 2,240 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 96 | ImageNet2012 | A10 |
| TensorFlow | 2.4.0 | Electra Base Fine Tuning | 6 | 92.62 F1 | 745 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 16 | SQuAD v1.1 | A10 |
| TensorFlow | 1.15.5 | V-Net Medical | 3 | .84 Anterior DICE | 576 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 2 | Hippocampus head and body from Medical Segmentation Decathlon | A10 |
Server with a hyphen is a pre-production server
BERT-Large = BERT-Large Fine-Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
T4 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 507 | 77.28 Top 1 | 3,860 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4 |
| PyTorch | 1.9.0a0 | SE-ResNeXt101 | 1,770 | 79.94 Top 1 | 1,102 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 112 | ImageNet2012 | NVIDIA T4 |
| PyTorch | 1.10.0a0 | Tacotron2 | 241 | .53 Training Loss | 125,992 total output mels/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4 |
| PyTorch | 1.10.0a0 | WaveGlow | 1,041 | -5.82 Training Loss | 400,494 output samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4 |
| PyTorch | 1.10.0a0 | Transformer | 2,234 | 27.56 BLEU Score | 42,963 tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.09-py3 | Mixed | 2560 | wmt14-en-de | NVIDIA T4 |
| PyTorch | 1.7.0a0 | FastPitch | 319 | .21 Training Loss | 281,406 frames/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.10-py3 | Mixed | 32 | LJSpeech 1.1 | NVIDIA T4 |
| PyTorch | 1.10.0a0 | GNMT V2 | 92 | 24.12 BLEU Score | 157,601 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.09-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4 |
| PyTorch | 1.10.0a0 | NCF | 2 | .96 Hit Rate at 10 | 28,643,324 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA T4 |
| PyTorch | 1.10.0a0 | BERT-LARGE | 24 | 91.34 F1 | 125 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.09-py3 | Mixed | 10 | SQuAD v1.1 | NVIDIA T4 |
| PyTorch | 1.9.0a0 | Transformer-XL Base | 318 | 22.12 Perplexity | 103,740 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.06-py3 | Mixed | 32 | WikiText-103 | NVIDIA T4 |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 550 | 76.81 Top 1 | 3,496 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4 |
| TensorFlow | 1.15.5 | U-Net Industrial | 2 | .99 IoU Threshold | 284 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.09-py3 | Mixed | 2 | DAGM2007 | NVIDIA T4 |
| TensorFlow | 1.15.5 | U-Net Medical | 31 | .9 DICE Score | 155 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4 |
| TensorFlow | 1.15.5 | VAE-CF | 2 | .43 NDCG@100 | 354,626 users processed/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.09-py3 | Mixed | 3072 | MovieLens 20M | NVIDIA T4 |
| TensorFlow | 1.15.4 | SSD | 112 | .28 mAP | 549 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 20.12-py3 | Mixed | 32 | COCO 2017 | NVIDIA T4 |
| TensorFlow | 1.15.5 | Mask R-CNN | 492 | .34 AP Segm | 53 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.03-py3 | Mixed | 4 | COCO 2014 | NVIDIA T4 |
| TensorFlow | 1.15.5 | ResNext101 | 1,222 | 79.22 Top 1 | 1,577 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.08-py3 | Mixed | 128 | ImageNet2012 | NVIDIA T4 |
| TensorFlow | 1.15.5 | SE-ResNext101 | 1,648 | 79.86 Top 1 | 1,169 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.09-py3 | Mixed | 96 | ImageNet2012 | NVIDIA T4 |
| TensorFlow | 2.6.0 | Electra Base Fine Tuning | 9 | 92.48 F1 | 396 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | NVIDIA T4 |
| TensorFlow | 2.6.0 | Wide and Deep | 27 | .66 MAP at 12 | 764,329 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 21.09-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | NVIDIA T4 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine-Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
V100 Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 171 | 77.3 Top 1 | 11,633 images/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB |
| PyTorch | 1.10.0a0 | Mask R-CNN | 265 | .16 AP Segm | 111 images/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 8 | COCO 2014 | V100-SXM3-32GB |
| PyTorch | 1.10.0a0 | Tacotron2 | 191 | .53 Training Loss | 152,506 total output mels/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB |
| PyTorch | 1.10.0a0 | WaveGlow | 456 | -5.74 Training Loss | 931,487 output samples/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB |
| PyTorch | 1.6.0a0 | Jasper | 6,300 | 3.49 dev-clean WER | 312 sequences/sec | 8x V100 | DGX-2 | 20.06-py3 | Mixed | 64 | LibriSpeech | V100 SXM2-32GB |
| PyTorch | 1.10.0a0 | Transformer | 474 | 27.58 BLEU Score | 208,238 tokens/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 5120 | wmt14-en-de | V100-SXM3-32GB |
| PyTorch | 1.6.0a0 | FastPitch | 354 | .18 Training Loss | 570,968 frames/sec | 8x V100 | DGX-1 | 20.06-py3 | Mixed | 32 | LJSpeech 1.1 | V100 SXM2-16GB |
| PyTorch | 1.10.0a0 | GNMT V2 | 34 | 24.35 BLEU Score | 438,681 total tokens/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB |
| PyTorch | 1.10.0a0 | NCF | 1 | .96 Hit Rate at 10 | 97,664,714 samples/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 131072 | MovieLens 20M | V100-SXM3-32GB |
| PyTorch | 1.10.0a0 | BERT-LARGE | 8 | 91.31 F1 | 370 sequences/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB |
| PyTorch | 1.9.0a0 | Transformer-XL Base | 116 | 22.05 Perplexity | 284,635 total tokens/sec | 8x V100 | DGX-2 | 21.06-py3 | Mixed | 32 | WikiText-103 | V100-SXM3-32GB |
| PyTorch | 1.10.0a0 | ResNeXt101 | 500 | 79.5 Top 1 | 3,915 images/sec | 8x V100 | DGX-2 | 21.08-py3 | Mixed | 112 | ImageNet2012 | V100-SXM3-32GB |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 189 | 77.07 Top 1 | 10,230 images/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB |
| TensorFlow | 1.15.5 | ResNext101 | 425 | 79.3 Top 1 | 4,558 images/sec | 8x V100 | DGX-2 | 21.08-py3 | Mixed | 128 | ImageNet2012 | V100-SXM3-32GB |
| TensorFlow | 1.15.5 | SE-ResNext101 | 534 | 79.8 Top 1 | 3,633 images/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 96 | ImageNet2012 | V100-SXM3-32GB |
| TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 665 images/sec | 8x V100 | DGX-2 | 21.06-py3 | Mixed | 2 | DAGM2007 | V100-SXM3-32GB |
| TensorFlow | 1.15.5 | U-Net Medical | 12 | .89 DICE Score | 465 images/sec | 8x V100 | DGX-2 | 21.06-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB |
| TensorFlow | 1.15.5 | VAE-CF | 1 | .43 NDCG@100 | 907,352 users processed/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 3072 | MovieLens 20M | V100-SXM3-32GB |
| TensorFlow | 2.6.0 | Wide and Deep | 12 | .66 MAP at 12 | 2,132,710 samples/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | V100-SXM3-32GB |
| TensorFlow | 1.15.5 | BERT-LARGE | 18 | 91.54 F1 | 333 sequences/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM3-32GB |
| TensorFlow | 2.6.0 | Electra Base Fine Tuning | 4 | 92.49 F1 | 1,459 sequences/sec | 8x V100 | DGX-2 | 21.09-py3 | Mixed | 32 | SQuAD v1.1 | V100-SXM3-32GB |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine-Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
Converged Training Performance of NVIDIA GPUs in the Cloud
Benchmarks are reproducible by following links to the NGC catalog scripts
A100 Training Performance on Cloud
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.9.0a0 | BERT-LARGE | 3 | 91.31 F1 | 876 sequences/sec | 8x A100 | AWS EC2 p4d.24xlarge | 21.06-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-40GB |
| PyTorch | 1.10.0a0 | BERT-LARGE | 3 | 91.05 F1 | 877 sequences/sec | 8x A100 | GCP A2-HIGHGPU-8G | 21.09-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-40GB |
| TensorFlow | 1.15.5 | BERT-LARGE | 13 | 91.4 F1 | 759 sequences/sec | 8x A100 | GCP A2-HIGHGPU-8G | 21.09-py3 | Mixed | 24 | SQuAD v1.1 | A100-SXM4-40GB |
BERT-Large = BERT-Large Fine-Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
V100 Training Performance on Cloud
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.10.0a0 | BERT-LARGE | 8 | 91.25 F1 | 355 sequences/sec | 8x V100 | GCP N1-HIGHMEM-64 | 21.09-py3 | Mixed | 10 | SQuAD v1.1 | V100-SXM2-16GB |
BERT-Large = BERT-Large Fine-Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
Converged Multi-Node Training Performance of NVIDIA GPUs
Benchmarks are reproducible by following links to the NGC catalog scripts
A100 Multi-Node Training Performance
| Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Nodes | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P1 | 322 | 1.54 Training Loss | 25,927 sequences/sec | 8x A100 | 8 | Selene | 21.09-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P2 | 168 | 1.36 Training Loss | 5,146 sequences/sec | 8x A100 | 8 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training E2E | 271 | 1.36 Training Loss | - | 8x A100 | 8 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P1 | 165 | 1.52 Training Loss | 50,548 sequences/sec | 8x A100 | 16 | Selene | 21.09-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P2 | 86 | 1.35 Training Loss | 10,101 sequences/sec | 8x A100 | 16 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training E2E | 138 | 1.35 Training Loss | - | 8x A100 | 16 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P1 | 83 | 1.49 Training Loss | 93,888 sequences/sec | 8x A100 | 32 | Selene | 21.09-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P2 | 45 | 1.34 Training Loss | 19,597 sequences/sec | 8x A100 | 32 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training E2E | 70 | 1.34 Training Loss | - | 8x A100 | 32 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P1 | 46 | 1.5 Training Loss | 167,820 sequences/sec | 8x A100 | 64 | Selene | 21.09-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P2 | 25 | 1.33 Training Loss | 37,847 sequences/sec | 8x A100 | 64 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training E2E | 39 | 1.33 Training Loss | - | 8x A100 | 64 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P1 | 26 | 1.5 Training Loss | 300,769 sequences/sec | 8x A100 | 128 | Selene | 21.09-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P2 | 13 | 1.35 Training Loss | 74,498 sequences/sec | 8x A100 | 128 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training E2E | 22 | 1.35 Training Loss | - | 8x A100 | 128 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | Transformer | 190 | 18.35 Perplexity | 444,469 total tokens/sec | 8x A100 | 2 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | Transformer | 106 | 18.31 Perplexity | 799,988 total tokens/sec | 8x A100 | 4 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | Transformer | 65 | 18.26 Perplexity | 1,333,045 total tokens/sec | 8x A100 | 8 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM-80GB |
BERT-Large Pre-Training Phase 1 Sequence Length = 128
BERT-Large Pre-Training Phase 2 Sequence Length = 512
Starting from 21.09-py3, ECC is enabled
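The multi-node Phase 1 throughput figures above can be normalized against the 8-node run to see how close scaling stays to linear as nodes are added. A small sketch — the throughput values are from the table; the loop itself is illustrative:

```python
# Phase 1 pre-training throughput (sequences/sec) by node count,
# taken from the multi-node table above
throughput = {8: 25_927, 16: 50_548, 32: 93_888, 64: 167_820, 128: 300_769}

for nodes, seq_per_sec in throughput.items():
    ideal = nodes / 8                    # linear scaling from the 8-node run
    actual = seq_per_sec / throughput[8] # measured speedup
    print(f"{nodes:>3} nodes: {actual / ideal:.1%} of linear scaling")
```

Even at 128 nodes (1,024 GPUs), throughput stays above 70% of perfectly linear scaling relative to the 8-node baseline.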
Single-GPU Training
Some scenarios, such as single-GPU throughput, aren’t used in real-world training. The table below provides them for reference, as an indication of a platform’s single-chip throughput.
Related Resources
Achieve unprecedented acceleration at every scale with NVIDIA’s complete solution stack.
Pull software containers from NVIDIA NGC.
Single GPU Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100
Benchmarks are reproducible by following links to the NGC catalog scripts
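As a minimal sketch of the reproduction workflow (assuming Docker and the NVIDIA Container Toolkit are already installed), pulling one of the framework containers referenced in the Container column below and launching it with GPU access looks like this; the `21.09-py3` PyTorch tag is taken from the tables, everything else is standard Docker usage:

```shell
# Hedged sketch: pull the PyTorch NGC container used in these benchmarks and
# launch it with GPU access. Requires Docker plus the NVIDIA Container Toolkit.
IMAGE="nvcr.io/nvidia/pytorch:21.09-py3"
if command -v docker >/dev/null 2>&1; then
  docker pull "$IMAGE"
  # --gpus all exposes every GPU to the container; the benchmark scripts
  # linked from the NGC catalog are then run inside it
  docker run --gpus all --rm "$IMAGE" nvidia-smi
else
  echo "docker not found; install Docker and the NVIDIA Container Toolkit first"
fi
```

The per-model training scripts themselves are linked from the NGC catalog entries for each network.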
A100 Training Performance
| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 2,963 images/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 192 | ImageNet2012 | A100-SXM-80GB |
| PyTorch | 1.10.0a0 | Mask R-CNN | 29 images/sec | 1x A100 | DGX A100 | 21.09-py3 | TF32 | 8 | COCO 2014 | A100-SXM-80GB |
| 1.9.0a0 | SSD v1.1 | 447 images/sec | 1x A100 | DGX A100 | 21.06-py3 | Mixed | 128 | COCO 2017 | A100-SXM-80GB | |
| 1.10.0a0 | Tacotron2 | 40,553 total output mels/sec | 1x A100 | DGX A100 | 21.09-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM-80GB | |
| 1.10.0a0 | WaveGlow | 202,578 output samples/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM-80GB | |
| 1.10.0a0 | Jasper | 80 sequences/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 64 | LibriSpeech | A100-SXM-80GB | |
| 1.6.0a0 | Transformer | 82,618 words/sec | 1x A100 | DGX A100 | 20.06-py3 | Mixed | 10240 | wmt14-en-de | A100 SXM4-40GB | |
| 1.10.0a0 | FastPitch | 180,663 frames/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 128 | LJSpeech 1.1 | A100-SXM4-80GB | |
| 1.10.0a0 | GNMT V2 | 158,825 total tokens/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 128 | wmt16-en-de | A100-SXM-80GB | |
| 1.10.0a0 | NCF | 37,073,053 samples/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM-80GB | |
| 1.10.0a0 | BERT-LARGE | 122 sequences/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM-80GB | |
| 1.9.0a0 | Transformer-XL Large | 28,503 total tokens/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 16 | WikiText-103 | A100-SXM-80GB | |
| 1.10.0a0 | Transformer-XL Base | 76,960 total tokens/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 128 | WikiText-103 | A100-SXM-80GB | |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 2,649 images/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 256 | ImageNet2012 | A100-SXM-80GB |
| 1.15.5 | ResNext101 | 1,297 images/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 256 | Imagenet2012 | A100-SXM-80GB | |
| 1.15.5 | SE-ResNext101 | 1,120 images/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 256 | Imagenet2012 | A100-SXM-80GB | |
| 1.15.5 | U-Net Industrial | 346 images/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 16 | DAGM2007 | A100-SXM4-40GB | |
| 2.6.0 | U-Net Medical | 150 images/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM-80GB | |
| 1.15.5 | VAE-CF | 403,011 users processed/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 24576 | MovieLens 20M | A100-SXM-80GB | |
| 2.6.0 | Wide and Deep | 2,417,876 samples/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A100-SXM4-40GB | |
| 1.15.5 | BERT-LARGE | 116 sequences/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 24 | SQuaD v1.1 | A100-SXM-80GB | |
| 2.6.0 | Electra Base Fine Tuning | 380 sequences/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 32 | SQuaD v1.1 | A100-SXM-80GB | |
| - | EfficientNet-B4 | 332 images/sec | 1x A100 | DGX A100 | - | Mixed | 160 | ImageNet2012 | A100-SXM-80GB | |
| 1.15.5 | NCF | 42,093,685 samples/sec | 1x A100 | DGX A100 | 21.08-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-40GB | |
| 2.4.0 | Mask R-CNN | 30 samples/sec | 1x A100 | DGX A100 | 21.05-py3 | Mixed | 4 | COCO 2014 | A100-SXM4-40GB | |
| 1.15.5 | V-Net Medical | 1,723 images/sec | 1x A100 | DGX A100 | 21.09-py3 | Mixed | 32 | Hippocampus head and body from Medical Segmentation Decathlon | A100-SXM4-40GB |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
EfficientNet-B4: Basic Augmentation | cuDNN Version = 8.0.5.32 | NCCL Version = 2.7.8 | Installation Source = NGC catalog
Starting from 21.09-py3, ECC is enabled
A40 Training Performance
| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 1,192 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 192 | ImageNet2012 | A40 |
| PyTorch | 1.10.0a0 | Mask R-CNN | 14 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | TF32 | 8 | COCO 2014 | A40 |
| 1.9.0a0 | SSD v1.1 | 222 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.06-py3 | Mixed | 128 | COCO 2017 | A40 | |
| 1.10.0a0 | Tacotron2 | 24,495 total output mels/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 128 | LJSpeech 1.1 | A40 | |
| 1.10.0a0 | WaveGlow | 120,308 output samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 10 | LJSpeech 1.1 | A40 | |
| 1.10.0a0 | GNMT V2 | 82,036 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 128 | wmt16-en-de | A40 | |
| 1.10.0a0 | NCF | 20,388,435 samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 1048576 | MovieLens 20M | A40 | |
| 1.9.0a0 | Transformer-XL Large | 15,301 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 16 | WikiText-103 | A40 | |
| 1.10.0a0 | BERT-LARGE | 58 sequences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 32 | SQuaD v1.1 | A40 | |
| 1.10.0a0 | FastPitch | 122,507 frames/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 128 | LJSpeech 1.1 | A40 | |
| 1.10.0a0 | Transformer-XL Base | 44,361 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 128 | WikiText-103 | A40 | |
| 1.10.0a0 | Jasper | 43 sequences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 64 | LibriSpeech | A40 | |
| 1.10.0a0 | Transformer | 30,906 tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 5120 | wmt14-en-de | A40 | |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 1,325 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 256 | ImageNet2012 | A40 |
| 1.15.5 | SSD | 214 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.06-py3 | Mixed | 32 | COCO 2017 | A40 | |
| 1.15.5 | U-Net Industrial | 112 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 16 | DAGM2007 | A40 | |
| 1.15.5 | BERT-LARGE | 55 sentences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 24 | SQuaD v1.1 | A40 | |
| 1.15.5 | VAE-CF | 214,146 users processed/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 24576 | MovieLens 20M | A40 | |
| 2.6.0 | U-Net Medical | 68 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 8 | EM segmentation challenge | A40 | |
| 2.6.0 | Wide and Deep | 879,070 samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A40 | |
| 1.15.5 | ResNext101 | 571 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 256 | Imagenet2012 | A40 | |
| 1.15.5 | SE-ResNext101 | 525 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 256 | Imagenet2012 | A40 | |
| 2.5.0 | Electra Base Fine Tuning | 180 sequences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 32 | SQuaD v1.1 | A40 | |
| 1.15.5 | V-Net Medical | 878 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 32 | Hippocampus head and body from Medical Segmentation Decathlon | A40 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
A30 Training Performance
| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 1,452 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 192 | ImageNet2012 | A30 |
| PyTorch | 1.9.0a0 | SSD v1.1 | 226 images/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.06-py3 | Mixed | 64 | COCO 2017 | A30 |
| 1.10.0a0 | Tacotron2 | 19,716 total output mels/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 104 | LJSpeech 1.1 | A30 | |
| 1.10.0a0 | WaveGlow | 117,277 output samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 10 | LJSpeech 1.1 | A30 | |
| 1.10.0a0 | Transformer | 24,302 tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 2560 | wmt14-en-de | A30 | |
| 1.10.0a0 | FastPitch | 98,275 frames/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 64 | LJSpeech 1.1 | A30 | |
| 1.10.0a0 | NCF | 19,512,965 samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 1048576 | MovieLens 20M | A30 | |
| 1.10.0a0 | GNMT V2 | 84,768 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 128 | wmt16-en-de | A30 | |
| 1.9.0a0 | Transformer-XL Base | 41,124 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.04-py3 | Mixed | 32 | WikiText-103 | A30 | |
| 1.10.0a0 | ResNeXt101 | 539 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 112 | Imagenet2012 | A30 | |
| 1.10.0a0 | Jasper | 34 sequences/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 16 | LibriSpeech | A30 | |
| 1.9.0a0 | Transformer-XL Large | 12,617 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 4 | WikiText-103 | A30 | |
| 1.10.0a0 | BERT-LARGE | 51 sequences/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 10 | SQuaD v1.1 | A30 | |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 1,336 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 256 | ImageNet2012 | A30 |
| 1.15.5 | ResNext101 | 592 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 128 | Imagenet2012 | A30 | |
| 1.15.5 | SE-ResNext101 | 489 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 96 | Imagenet2012 | A30 | |
| 1.15.5 | U-Net Industrial | 109 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 16 | DAGM2007 | A30 | |
| 2.6.0 | U-Net Medical | 71 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 8 | EM segmentation challenge | A30 | |
| 1.15.5 | VAE-CF | 200,126 users processed/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 24576 | MovieLens 20M | A30 | |
| 2.6.0 | Wide and Deep | 851,130 samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A30 | |
| 2.4.0 | Mask R-CNN | 21 samples/sec | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.05-py3 | Mixed | 4 | COCO 2014 | A30 | |
| 2.6.0 | Electra Base Fine Tuning | 153 sequences/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 16 | SQuaD v1.1 | A30 | |
| 1.15.5 | SSD | 201 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.06-py3 | Mixed | 32 | COCO 2017 | A30 | |
| 1.15.5 | V-Net Medical | 959 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 32 | Hippocampus head and body from Medical Segmentation Decathlon | A30 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
A10 Training Performance
| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 1,017 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.06-py3 | Mixed | 192 | ImageNet2012 | A10 |
| PyTorch | 1.9.0a0 | SSD v1.1 | 173 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.06-py3 | Mixed | 64 | COCO 2017 | A10 |
| 1.10.0a0 | Tacotron2 | 19,636 total output mels/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 104 | LJSpeech 1.1 | A10 | |
| 1.10.0a0 | WaveGlow | 96,531 output samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 10 | LJSpeech 1.1 | A10 | |
| 1.10.0a0 | Transformer | 20,756 tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 2560 | wmt14-en-de | A10 | |
| 1.10.0a0 | FastPitch | 94,160 frames/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 64 | LJSpeech 1.1 | A10 | |
| 1.9.0a0 | Transformer-XL Base | 34,901 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.04-py3 | Mixed | 32 | WikiText-103 | A10 | |
| 1.10.0a0 | GNMT V2 | 61,526 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 128 | wmt16-en-de | A10 | |
| 1.10.0a0 | ResNeXt101 | 298 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 112 | Imagenet2012 | A10 | |
| 1.10.0a0 | SE-ResNeXt101 | 246 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 112 | Imagenet2012 | A10 | |
| 1.10.0a0 | NCF | 16,806,585 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 1048576 | MovieLens 20M | A10 | |
| 1.10.0a0 | Jasper | 25 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 16 | LibriSpeech | A10 | |
| 1.9.0a0 | Transformer-XL Large | 10,699 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 4 | WikiText-103 | A10 | |
| 1.10.0a0 | BERT-LARGE | 40 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 10 | SQuaD v1.1 | A10 | |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 995 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 256 | ImageNet2012 | A10 |
| 1.15.5 | ResNext101 | 412 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 128 | Imagenet2012 | A10 | |
| 1.15.5 | SE-ResNext101 | 293 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 96 | Imagenet2012 | A10 | |
| 1.15.5 | U-Net Industrial | 90 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 16 | DAGM2007 | A10 | |
| 2.6.0 | U-Net Medical | 47 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 8 | EM segmentation challenge | A10 | |
| 1.15.5 | VAE-CF | 175,029 users processed/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | Mixed | 24576 | MovieLens 20M | A10 | |
| 2.6.0 | Wide and Deep | 721,854 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A10 | |
| 2.6.0 | Electra Base Fine Tuning | 119 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | Mixed | 16 | SQuaD v1.1 | A10 | |
| 2.4.0 | Mask R-CNN | 18 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.05-py3 | Mixed | 4 | COCO 2014 | A10 | |
| 1.15.5 | SSD | 180 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 21.06-py3 | Mixed | 32 | COCO 2017 | A10 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
T4 Training Performance
| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 444 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.09-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4 |
| PyTorch | 1.10.0a0 | ResNeXt101 | 180 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 112 | Imagenet2012 | NVIDIA T4 |
| 1.10.0a0 | Tacotron2 | 17,331 total output mels/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 104 | LJSpeech 1.1 | NVIDIA T4 | |
| 1.10.0a0 | WaveGlow | 53,856 output samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4 | |
| 1.10.0a0 | Transformer | 9,816 tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.09-py3 | Mixed | 2560 | wmt14-en-de | NVIDIA T4 | |
| 1.10.0a0 | FastPitch | 40,379 frames/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 64 | LJSpeech 1.1 | NVIDIA T4 | |
| 1.10.0a0 | GNMT V2 | 30,244 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.09-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4 | |
| 1.10.0a0 | NCF | 8,091,013 samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 1048576 | MovieLens 20M | NVIDIA T4 | |
| 1.10.0a0 | BERT-LARGE | 19 sequences/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 10 | SQuaD v1.1 | NVIDIA T4 | |
| 1.9.0a0 | Transformer-XL Base | 17,182 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.04-py3 | Mixed | 32 | WikiText-103 | NVIDIA T4 | |
| 1.10.0a0 | Jasper | 11 sequences/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.09-py3 | Mixed | 16 | LibriSpeech | NVIDIA T4 | |
| 1.10.0a0 | SE-ResNeXt101 | 146 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 112 | Imagenet2012 | NVIDIA T4 | |
| 1.9.0a0 | Transformer-XL Large | 5,231 total tokens/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 4 | WikiText-103 | NVIDIA T4 | |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 412 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.09-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4 |
| 1.15.5 | U-Net Industrial | 46 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 16 | DAGM2007 | NVIDIA T4 | |
| 2.6.0 | U-Net Medical | 21 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.09-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4 | |
| 1.15.5 | VAE-CF | 81,989 users processed/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | Mixed | 24576 | MovieLens 20M | NVIDIA T4 | |
| 1.15.5 | SSD | 98 images/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.06-py3 | Mixed | 32 | COCO 2017 | NVIDIA T4 | |
| 2.4.0 | Mask R-CNN | 9 samples/sec | 1x T4 | Supermicro SYS-4029GP-TRT | 21.05-py3 | Mixed | 4 | COCO 2014 | NVIDIA T4 | |
| 2.6.0 | Wide and Deep | 330,004 samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.09-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | NVIDIA T4 | |
| 1.15.5 | SE-ResNext101 | 155 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.09-py3 | Mixed | 96 | Imagenet2012 | NVIDIA T4 | |
| 1.15.5 | ResNext101 | 188 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.09-py3 | Mixed | 128 | Imagenet2012 | NVIDIA T4 | |
| 2.6.0 | Electra Base Fine Tuning | 59 sequences/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.09-py3 | Mixed | 16 | SQuaD v1.1 | NVIDIA T4 | |
| 1.15.5 | V-Net Medical | 427 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.09-py3 | Mixed | 32 | Hippocampus head and body from Medical Segmentation Decathlon | NVIDIA T4 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
V100 Training Performance
| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 1,490 images/sec | 1x V100 | DGX-2 | 21.09-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB |
| PyTorch | 1.10.0a0 | ResNeXt101 | 548 images/sec | 1x V100 | DGX-2 | 21.09-py3 | Mixed | 112 | Imagenet2012 | V100-SXM3-32GB |
| 1.9.0a0 | SSD v1.1 | 233 images/sec | 1x V100 | DGX-2 | 21.06-py3 | Mixed | 64 | COCO 2017 | V100-SXM3-32GB | |
| 1.10.0a0 | Tacotron2 | 22,417 total output mels/sec | 1x V100 | DGX-2 | 21.09-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB | |
| 1.10.0a0 | WaveGlow | 129,673 output samples/sec | 1x V100 | DGX-2 | 21.09-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB | |
| 1.10.0a0 | Jasper | 42 sequences/sec | 1x V100 | DGX-2 | 21.09-py3 | Mixed | 64 | LibriSpeech | V100-SXM3-32GB | |
| 1.10.0a0 | Transformer | 31,644 tokens/sec | 1x V100 | DGX-2 | 21.09-py3 | Mixed | 5120 | wmt14-en-de | V100-SXM3-32GB | |
| 1.10.0a0 | FastPitch | 121,642 frames/sec | 1x V100 | DGX-2 | 21.08-py3 | Mixed | 64 | LJSpeech 1.1 | V100-SXM3-32GB | |
| 1.10.0a0 | GNMT V2 | 76,246 total tokens/sec | 1x V100 | DGX-2 | 21.09-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB | |
| 1.10.0a0 | NCF | 22,004,456 samples/sec | 1x V100 | DGX-2 | 21.09-py3 | Mixed | 1048576 | MovieLens 20M | V100-SXM3-32GB | |
| 1.10.0a0 | BERT-LARGE | 53 sequences/sec | 1x V100 | DGX-2 | 21.09-py3 | Mixed | 10 | SQuaD v1.1 | V100-SXM3-32GB | |
| 1.9.0a0 | Transformer-XL Base | 44,072 total tokens/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 32 | WikiText-103 | V100-SXM3-32GB | |
| 1.9.0a0 | Transformer-XL Large | 15,360 total tokens/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 8 | WikiText-103 | V100-SXM3-32GB | |
| TensorFlow | 1.15.5 | ResNet-50 v1.5 | 1,369 images/sec | 1x V100 | DGX-2 | 21.09-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB |
| 1.15.5 | ResNext101 | 617 images/sec | 1x V100 | DGX-2 | 21.09-py3 | Mixed | 128 | Imagenet2012 | V100-SXM3-32GB | |
| 1.15.5 | SE-ResNext101 | 516 images/sec | 1x V100 | DGX-2 | 21.09-py3 | Mixed | 96 | Imagenet2012 | V100-SXM3-32GB | |
| 1.15.5 | U-Net Industrial | 115 images/sec | 1x V100 | DGX-2 | 21.09-py3 | Mixed | 16 | DAGM2007 | V100-SXM3-32GB | |
| 2.6.0 | U-Net Medical | 67 images/sec | 1x V100 | DGX-2 | 21.09-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB | |
| 1.15.5 | VAE-CF | 221,975 users processed/sec | 1x V100 | DGX-2 | 21.09-py3 | Mixed | 24576 | MovieLens 20M | V100-SXM3-32GB | |
| 2.6.0 | Wide and Deep | 968,852 samples/sec | 1x V100 | DGX-2 | 21.09-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | V100-SXM3-32GB | |
| 1.15.5 | BERT-LARGE | 48 sequences/sec | 1x V100 | DGX-2 | 21.09-py3 | Mixed | 10 | SQuaD v1.1 | V100-SXM3-32GB | |
| 2.6.0 | Electra Base Fine Tuning | 195 sequences/sec | 1x V100 | DGX-2 | 21.09-py3 | Mixed | 32 | SQuaD v1.1 | V100-SXM3-32GB | |
| 2.4.0 | Mask R-CNN | 22 samples/sec | 1x V100 | DGX-2 | 21.05-py3 | Mixed | 4 | COCO 2014 | V100-SXM3-32GB | |
| 1.15.5 | SSD | 222 images/sec | 1x V100 | DGX-2 | 21.06-py3 | Mixed | 32 | COCO 2017 | V100-SXM3-32GB | |
| 1.15.5 | V-Net Medical | 1,070 images/sec | 1x V100 | DGX-2 | 21.09-py3 | Mixed | 32 | Hippocampus head and body from Medical Segmentation Decathlon | V100-SXM3-32GB |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
Single GPU Training Performance of NVIDIA GPUs on Cloud
Benchmarks are reproducible by following links to the NGC catalog scripts
A100 Training Performance on Cloud
| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 2,724 images/sec | 1x A100 | GCP A2-HIGHGPU-1G | 21.09-py3 | Mixed | 192 | ImageNet2012 | A100-SXM4-40GB |
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
T4 Training Performance on Cloud
| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 425 images/sec | 1x T4 | AWS EC2 g4dn.4xlarge | 21.06-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4 |
| 1.9.0 | ResNet-50 v1.5 | 389 images/sec | 1x T4 | GCP N1-HIGHMEM-8 | 21.09-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4 | |
| PyTorch | 1.9.0a0 | BERT-LARGE | 16 sequences/sec | 1x T4 | AWS EC2 g4dn.4xlarge | 21.06-py3 | Mixed | 10 | SQuaD v1.1 | NVIDIA T4 |
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
V100 Training Performance on Cloud
| Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| MXNet | 1.9.0 | ResNet-50 v1.5 | 1,397 images/sec | 1x V100 | GCP N1-HIGHMEM-8 | 21.09-py3 | Mixed | 192 | ImageNet2012 | V100-SXM2-16GB |
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Starting from 21.09-py3, ECC is enabled
AI Inference
Real-world inference demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance from data center to edge.
Related Resources
Learn how NVIDIA landed top performance spots on all MLPerf Inference 1.1 tests.
Read the inference whitepaper to explore the evolving landscape and get an overview of inference platforms.
Power high-throughput, low-latency inference with NVIDIA’s complete solution stack:
Achieve the most efficient inference performance with NVIDIA® TensorRT™ running on NVIDIA Tensor Core GPUs.
Maximize performance and simplify the deployment of AI models with the NVIDIA Triton™ Inference Server.
Pull software containers from NVIDIA NGC to race into production.
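As an illustrative, hedged sketch of the TensorRT path mentioned above: the `trtexec` benchmarking tool shipped in the TensorRT NGC container can build and time an FP16 engine from an ONNX model. Here `model.onnx` is a placeholder file, not an asset from this page:

```shell
# Hypothetical sketch: build and benchmark a TensorRT FP16 engine with trtexec.
# "model.onnx" is a placeholder model supplied by the user.
if command -v trtexec >/dev/null 2>&1; then
  # --fp16 enables Tensor Core FP16 kernels; --saveEngine serializes the plan
  trtexec --onnx=model.onnx --fp16 --saveEngine=model.plan
else
  echo "trtexec not found; run inside the nvcr.io/nvidia/tensorrt container"
fi
```

`trtexec` reports end-to-end throughput and latency percentiles, which is useful for comparing against the per-scenario numbers below.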
MLPerf Inference v1.1 Performance Benchmarks
Offline Scenario - Closed Division
| Network | Throughput | GPU | Server | GPU Version | Dataset | Target Accuracy |
|---|---|---|---|---|---|---|
| ResNet-50 v1.5 | 313,516 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet | 76.46% Top1 |
| 283,469 samples/sec | 8x (7x1g.10gb A100) | DGX A100 | A100 SXM-80GB | |||
| 145,742 samples/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-80GB | |||
| 149,178 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30 | |||
| 150,315 samples/sec | 8x (4x1g.6gb A30) | Gigabyte G482-Z54 | A30 | |||
| 110,197 samples/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10 | |||
| SSD ResNet-34 | 7,851 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | COCO | 0.2 mAP |
| 7,316 samples/sec | 8x (7x1g.10gb A100) | DGX A100 | A100 SXM-80GB | |||
| 3,606 samples/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-80GB | |||
| 3,788 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30 | |||
| 3,727 samples/sec | 8x (4x1g.6gb A30) | Gigabyte G482-Z54 | A30 | |||
| 2,473 samples/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10 | |||
| 3D-UNet | 487 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | BraTS 2019 | 0.853 DICE mean |
| 421 samples/sec | 8x (7x1g.10gb A100) | DGX A100 | A100 SXM-80GB | |||
| 227 samples/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-80GB | |||
| 241 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30 | |||
| 225 samples/sec | 8x (4x1g.6gb A30) | Gigabyte G482-Z54 | A30 | |||
| 173 samples/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10 | |||
| RNN-T | 106,918 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech | 7.45% WER |
| 50,561 samples/sec | 8x A100 | Gigabyte G242-P31 | A100-PCIe-80GB | |||
| 52,596 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30 | |||
| 36,461 samples/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10 | |||
| BERT | 28,302 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 | 90.07% f1 |
| 25,677 samples/sec | 8x (7x1g.10gb A100) | DGX A100 | A100 SXM-80GB | |||
| 12,606 samples/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-80GB | |||
| 13,385 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30 | |||
| 12,867 samples/sec | 8x (4x1g.6gb A30) | Gigabyte G482-Z54 | A30 | |||
| 8,757 samples/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10 | |||
| DLRM | 2,421,440 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
| 1,097,730 samples/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-80GB | |||
| 1,083,600 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30 | |||
| 772,521 samples/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10 |
Server Scenario - Closed Division
| Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset |
|---|---|---|---|---|---|---|---|
| ResNet-50 v1.5 | 260,042 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 76.46% Top1 | 15 | ImageNet |
| 70,007 queries/sec | 8x (7x1g.10gb A100) | DGX A100 | A100 SXM-80GB | ||||
| 104,012 queries/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-80GB | ||||
| 116,014 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30 | ||||
| 65,004 queries/sec | 8x (4x1g.6gb A30) | Gigabyte G482-Z54 | A30 | ||||
| 88,014 queries/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10 | ||||
| SSD ResNet-34 | 7,581 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 0.2 mAP | 100 | COCO |
| 5,802 queries/sec | 8x (7x1g.10gb A100) | DGX A100 | A100 SXM-80GB | ||||
| 3,083 queries/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-80GB | ||||
| 3,575 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30 | ||||
| 3,002 queries/sec | 8x (4x1g.6gb A30) | Gigabyte G482-Z54 | A30 | ||||
| 2,000 queries/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10 | ||||
| RNN-T | 104,012 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 7.45% WER | 1,000 | LibriSpeech |
| 43,005 queries/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-80GB | ||||
| 36,999 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30 | ||||
| 22,600 queries/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10 | ||||
| BERT | 25,795 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 90.07% f1 | 130 | SQuAD v1.1 |
| 20,497 queries/sec | 8x (7x1g.10gb A100) | DGX A100 | A100 SXM-80GB | ||||
| 10,402 queries/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-80GB | ||||
| 11,501 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30 | ||||
| 8,301 queries/sec | 8x (4x1g.6gb A30) | Gigabyte G482-Z54 | A30 | ||||
| 7,202 queries/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10 | ||||
| DLRM | 2,302,660 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs |
| 600,198 queries/sec | 4x A100 | Gigabyte G242-P31 | A100-PCIe-80GB | ||||
| 1,000,530 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30 | ||||
| 680,257 queries/sec | 8x A10 | Supermicro 4029GP-TRT-OTO-28 | A10 |
Power Efficiency Offline Scenario - Closed Division
| Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset |
|---|---|---|---|---|---|---|
| ResNet-50 v1.5 | 244,537 samples/sec | 83 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet |
| 125,232 samples/sec | 110.9 samples/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB | ||
| 211,436 samples/sec | 112.03 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-40GB | ||
| SSD ResNet-34 | 6,482 samples/sec | 2.04 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | COCO |
| 3,295 samples/sec | 2.65 samples/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB | ||
| 5,866 samples/sec | 2.71 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-40GB | ||
| 3D-UNet | 399 samples/sec | 0.13 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | BraTS 2019 |
| 203 samples/sec | 0.18 samples/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB | ||
| 345 samples/sec | 0.18 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-40GB | ||
| RNN-T | 90,243 samples/sec | 27.73 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech |
| 44,495 samples/sec | 37.7 samples/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB | ||
| 84,727 samples/sec | 38.44 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-40GB | ||
| BERT | 24,667 samples/sec | 6.95 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 |
| 10,573 samples/sec | 8.5 samples/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB | ||
| 20,401 samples/sec | 8.19 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-40GB | ||
| DLRM | 2,091,060 samples/sec | 629.03 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs |
| 987,260 samples/sec | 786.67 samples/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB |
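The throughput-per-watt figures above can be cross-checked against total board power: dividing throughput by efficiency recovers the aggregate power draw of all GPUs in the submission. A minimal sketch using the ResNet-50 v1.5 row above (the even per-GPU split is an illustrative assumption, not a measured value):

```python
# Recover aggregate board power from an MLPerf power-efficiency row:
# power (W) = throughput (samples/sec) / efficiency (samples/sec/watt)
def board_power(throughput, per_watt):
    return throughput / per_watt

# ResNet-50 v1.5 Offline on 8x A100 (DGX A100), from the table above
total_watts = board_power(244_537, 83.0)
per_gpu = total_watts / 8  # assumes power is split evenly across the 8 GPUs

print(round(total_watts), round(per_gpu))  # ~2946 W total, ~368 W per GPU
```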
Power Efficiency Server Scenario - Closed Division
| Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset |
|---|---|---|---|---|---|---|
| ResNet-50 v1.5 | 232,036 queries/sec | 79.14 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet |
| 107,013 queries/sec | 94.74 queries/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB | ||
| 185,034 queries/sec | 87.76 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-40GB | ||
| SSD ResNet-34 | 6,301 queries/sec | 1.99 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | COCO |
| 3,083 queries/sec | 2.49 queries/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB | ||
| 5,703 queries/sec | 2.62 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-40GB | ||
| RNN-T | 88,014 queries/sec | 25.46 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech |
| 43,406 queries/sec | 33.55 queries/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB | ||
| 75,012 queries/sec | 33.11 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-40GB | ||
| BERT | 21,497 queries/sec | 6.22 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 |
| 10,203 queries/sec | 8.01 queries/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB | ||
| 17,496 queries/sec | 7.99 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-40GB | ||
| DLRM | 2,002,040 queries/sec | 591.77 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs |
| 890,424 queries/sec | 672.18 queries/sec/watt | 4x A100 | DGX-Station-A100 | A100 SXM-80GB |
MLPerf™ v1.1 Inference Closed: ResNet-50 v1.5, SSD ResNet-34, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net, DLRM 99% of FP32 accuracy target: 1.1-033, 1.1-037, 1.1-039, 1.1-042, 1.1-043, 1.1-046, 1.1-047, 1.1-048, 1.1-051. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
BERT-Large sequence length = 384.
DLRM samples refer to an average of 270 pairs per sample
4x1g.6gb and 7x1g.10gb are shorthand for MIG configurations. In these examples, the workload runs on 4 or 7 single-GPC slices, each with 6GB or 10GB of memory, on a single A30 or A100 respectively.
For MLPerf™ data across various scenarios, click here
For MLPerf™ latency constraints, click here
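The MIG shorthand described in the notes above (e.g. 7x1g.10gb) can be decoded mechanically into instance count, GPCs per instance, and memory per instance. A small illustrative parser (this is not an NVIDIA tool, just a sketch of the notation):

```python
import re

def parse_mig(config: str):
    """Parse MIG shorthand like '7x1g.10gb' into
    (instance count, GPCs per instance, memory per instance in GB)."""
    m = re.fullmatch(r"(\d+)x(\d+)g\.(\d+)gb", config)
    if not m:
        raise ValueError(f"not a MIG config string: {config!r}")
    count, gpcs, mem_gb = map(int, m.groups())
    return count, gpcs, mem_gb

print(parse_mig("7x1g.10gb"))  # 7 instances, 1 GPC and 10 GB each (A100)
print(parse_mig("4x1g.6gb"))   # 4 instances, 1 GPC and 6 GB each (A30)
```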
NVIDIA Triton Inference Server Delivered Comparable Performance to Custom Harness in MLPerf v1.1
NVIDIA landed top performance spots on all MLPerf™ Inference 1.1 tests, the AI industry's leading benchmark competition. For inference submissions, we have typically used a custom A100 inference serving harness, designed and optimized specifically to deliver the highest possible inference performance on MLPerf™ workloads, which require running inference on bare metal.
MLPerf™ v1.1 A100 Inference Closed: ResNet-50 v1.5, SSD ResNet-34, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net, DLRM 99% of FP32 accuracy target: 1.1-047, 1.1-049. MLPerf name and logo are trademarks. See www.mlcommons.org for more information.
The chart compares the performance of Triton to the custom MLPerf™ serving harness across five different TensorRT networks on A100 SXM-80GB on bare metal. The results show that Triton is highly efficient, delivering performance nearly identical to the highly optimized MLPerf™ harness.
NVIDIA Client Batch Size 1 Performance with Triton Inference Server
A100 Triton Inference Server Performance
| Network | Accelerator | Training Framework | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 V1.5 Inference | A100-SXM4-40GB | PyTorch | TensorRT | TF32 | 2 | 1 | 64 | 256 | 48.35 | 5,294 inf/sec | - | 21.03-py3 |
| ResNet-50 V1.5 Inference | A100-PCIE-40GB | PyTorch | TensorRT | Mixed | 2 | 1 | 64 | 256 | 61.02 | 4,197 inf/sec | - | 20.07-py3 |
| BERT Large Inference | A100-SXM4-40GB | TensorFlow | TensorRT | INT8 | 2 | 1 | 8 | 64 | 56.34 | 1,136 inf/sec | 384 | 20.09-py3 |
| BERT Large Inference | A100-PCIE-40GB | TensorFlow | TensorRT | Mixed | 1 | 1 | 8 | 16 | 17.48 | 915 inf/sec | 384 | 20.09-py3 |
| DLRM Inference | A100-SXM4-40GB | PyTorch | Torchscript | Mixed | 2 | 1 | 65,536 | 30 | 2.71 | 11,076 inf/sec | - | 21.03-py3 |
| DLRM Inference | A100-PCIE-40GB | PyTorch | Torchscript | Mixed | 2 | 1 | 65,536 | 24 | 2.52 | 9,521 inf/sec | - | 21.05-py3 |
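The throughput and latency columns in these Triton tables are consistent with Little's law: sustained throughput ≈ concurrent in-flight requests ÷ average latency. A quick sanity check against the A100-SXM4-40GB ResNet-50 row above (256 concurrent client requests, 48.35 ms latency):

```python
def littles_law_throughput(concurrency, latency_ms):
    """Estimate steady-state throughput (inferences/sec) from the number of
    in-flight requests and the average end-to-end latency."""
    return concurrency / (latency_ms / 1000.0)

# A100-SXM4-40GB ResNet-50 v1.5 row: 256 concurrent requests, 48.35 ms latency
est = littles_law_throughput(256, 48.35)
print(round(est))  # ~5295, close to the 5,294 inf/sec reported in the table
```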
T4 Triton Inference Server Performance
| Network | Accelerator | Training Framework | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 V1.5 Inference | NVIDIA T4 | PyTorch | TensorRT | Mixed | 1 | 1 | 64 | 256 | 257.91 | 992 inf/sec | - | 20.07-py3 |
| BERT Large Inference | NVIDIA T4 | TensorFlow | TensorRT | Mixed | 1 | 1 | 8 | 16 | 81.14 | 197 inf/sec | 384 | 20.09-py3 |
V100 Triton Inference Server Performance
| Network | Accelerator | Training Framework | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 V1.5 Inference | V100 SXM2-32GB | PyTorch | TensorRT | FP32 | 4 | 1 | 64 | 384 | 215.79 | 1,781 inf/sec | - | 21.03-py3 |
| DLRM Inference | V100-SXM2-32GB | PyTorch | Torchscript | Mixed | 2 | 1 | 65,536 | 26 | 3.67 | 7,083 inf/sec | - | 21.06-py3 |
Inference Performance of NVIDIA A100, A40, A30, A10, A2, T4 and V100
Benchmarks are reproducible by following the links to the NGC catalog scripts
Inference Natural Language Processing
BERT Inference Throughput
DGX A100 server w/ 1x NVIDIA A100 with 7 MIG instances of 1g.5gb | Batch Size = 94 | Precision: INT8 | Sequence Length = 128
DGX-1 server w/ 1x NVIDIA V100 | TensorRT 7.1 | Batch Size = 256 | Precision: Mixed | Sequence Length = 128
NVIDIA A100 BERT Inference Benchmarks
| Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT-Large with Sparsity | Attention | 94 | 6,188 sequences/sec | - | - | 1x A100 | DGX A100 | - | INT8 | SQuAD v1.1 | - | A100 SXM4-40GB |
A100 with 7 MIG instances of 1g.5gb | Sequence length=128 | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container
Starting from 21.09-py3, ECC is enabled
Inference Image Classification on CNNs with TensorRT
ResNet-50 v1.5 Throughput
DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: Mixed | Dataset: Synthetic
ResNet-50 v1.5 Power Efficiency
DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-1029GQ-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.0 | Batch Size = 128 | 21.09-py3 | Precision: Mixed | Dataset: Synthetic
A100 Full Chip Inference Performance
| Network | Batch Size | Full Chip Throughput | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | ||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | |||||||||
| 8 | 11,468 images/sec | 0.7 | 1x A100 | DGX A100 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100 SXM-80GB | |
| 128 | 30,671 images/sec | 4.17 | 1x A100 | DGX A100 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100 SXM-80GB | |
| 223 | 32,204 images/sec | 6.92 | 1x A100 | DGX A100 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | A100 SXM-80GB | |
| ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | ||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | |||||||||
| 8 | 11,203 images/sec | 0.71 | 1x A100 | DGX A100 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100-SXM4-40GB | |
| 128 | 29,855 images/sec | 4.29 | 1x A100 | DGX A100 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100-SXM-80GB | |
| 214 | 31,042 images/sec | 6.89 | 1x A100 | DGX A100 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | A100 SXM-80GB | |
| ResNext101 | 32 | 7,674 samples/sec | 4.17 | 1x A100 | - | - | INT8 | Synthetic | TensorRT 7.2 | A100-SXM4-40GB |
| EfficientNet-B0 | 128 | 22,346 images/sec | 5.73 | 1x A100 | - | - | INT8 | Synthetic | TensorRT 7.2 | A100-SXM4-40GB |
| BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server tab | ||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | |||||||||
| 8 | 6,895 sequences/sec | 1.16 | 1x A100 | DGX A100 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A100-SXM-80GB | |
| 128 | 13,554 sequences/sec | 9.44 | 1x A100 | DGX A100 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A100-SXM4-40GB | |
| BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server tab | ||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | |||||||||
| 8 | 2,333 sequences/sec | 3.43 | 1x A100 | DGX A100 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A100-SXM-80GB | |
| 128 | 4,485 sequences/sec | 28.54 | 1x A100 | DGX A100 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A100-SXM4-40GB |
A hyphen in the Container column indicates a pre-release container | A hyphen in the Server column indicates a pre-production server
BERT-Large: Sequence Length = 128
For BS=1 inference refer to the Triton Inference Server section
Starting from 21.09-py3, ECC is enabled
A100 1/7 MIG Inference Performance
| Network | Batch Size | 1/7 MIG Throughput | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | ||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | |||||||||
| 8 | 3,725 images/sec | 2.15 | 1x A100 | DGX A100 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100 SXM-80GB | |
| 29 | 4,277 images/sec | 6.78 | 1x A100 | DGX A100 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100 SXM-80GB | |
| 128 | 4,642 images/sec | 27.58 | 1x A100 | DGX A100 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100 SXM-80GB | |
| ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | ||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | |||||||||
| 8 | 3,623 images/sec | 2.21 | 1x A100 | DGX A100 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100 SXM-80GB | |
| 28 | 4,107 images/sec | 6.82 | 1x A100 | DGX A100 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100 SXM-80GB | |
| 128 | 4,501 images/sec | 28.44 | 1x A100 | DGX A100 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100 SXM-80GB | |
| BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server tab | ||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | |||||||||
| 8 | 1,676 sequences/sec | 4.77 | 1x A100 | DGX A100 | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2 | A100 SXM-80GB | |
| 128 | 2,151 sequences/sec | 59.52 | 1x A100 | DGX A100 | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2 | A100 SXM-80GB | |
| BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server tab | ||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | |||||||||
| 8 | 553 sequences/sec | 14.48 | 1x A100 | DGX A100 | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2 | A100 SXM-80GB | |
| 128 | 671 sequences/sec | 190.63 | 1x A100 | DGX A100 | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2 | A100 SXM-80GB |
A hyphen in the Container column indicates a pre-release container | A hyphen in the Server column indicates a pre-production server
BERT-Large: Sequence Length = 128
Starting from 21.09-py3, ECC is enabled
A100 7 MIG Inference Performance
| Network | Batch Size | 7 MIG Throughput | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | ||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | |||||||||
| 8 | 26,033 images/sec | 2.15 | 1x A100 | DGX A100 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100 SXM-80GB | |
| 29 | 29,905 images/sec | 6.79 | 1x A100 | DGX A100 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100 SXM-80GB | |
| 128 | 32,470 images/sec | 27.59 | 1x A100 | DGX A100 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100 SXM-80GB | |
| ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | ||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | |||||||||
| 8 | 25,299 images/sec | 2.21 | 1x A100 | DGX A100 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100 SXM-80GB | |
| 28 | 28,868 images/sec | 6.79 | 1x A100 | DGX A100 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100 SXM-80GB | |
| 128 | 31,510 images/sec | 28.44 | 1x A100 | DGX A100 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100 SXM-80GB | |
| BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server tab | ||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | |||||||||
| 8 | 11,771 sequences/sec | 4.76 | 1x A100 | DGX A100 | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2 | A100 SXM-80GB | |
| 128 | 15,052 sequences/sec | 59.53 | 1x A100 | DGX A100 | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2 | A100 SXM-80GB | |
| BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server tab | ||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | |||||||||
| 8 | 3,865 sequences/sec | 14.49 | 1x A100 | DGX A100 | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2 | A100 SXM-80GB | |
| 128 | 4,702 sequences/sec | 190.54 | 1x A100 | DGX A100 | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2 | A100 SXM-80GB |
A hyphen in the Container column indicates a pre-release container | A hyphen in the Server column indicates a pre-production server
BERT-Large: Sequence Length = 128
Starting from 21.09-py3, ECC is enabled
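Comparing the 1/7 MIG and 7 MIG tables shows that seven isolated instances scale almost linearly: seven times the single-instance throughput lands within a fraction of a percent of the measured aggregate. A quick check using the ResNet-50 batch-128 numbers from the two tables:

```python
single_instance = 4_642   # images/sec, 1/7 MIG, ResNet-50, batch 128
all_seven = 32_470        # images/sec, 7 MIG instances, same workload

ideal = 7 * single_instance       # perfect linear scaling
efficiency = all_seven / ideal    # measured vs. ideal
print(ideal, f"{efficiency:.1%}")  # 32494 and ~99.9% scaling efficiency
```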
A40 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 9,637 images/sec | 40 images/sec/watt | 0.83 | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A40 | |
| 116 | 16,775 images/sec | - | 6.92 | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | A40 | |
| 128 | 15,943 images/sec | 53 images/sec/watt | 8.03 | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A40 | |
| ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 9,447 images/sec | 39 images/sec/watt | 0.85 | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A40 | |
| 109 | 16,537 images/sec | - | 6.59 | 1x A40 | GIGABYTE G482-Z52-00 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | A40 | |
| 128 | 15,174 images/sec | 51 images/sec/watt | 8.44 | 1x A40 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A40 | |
| BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 4,494 sequences/sec | 19 sequences/sec/watt | 1.78 | 1x A40 | GIGABYTE G482-Z52-00 | 21.05-py3 | INT8 | Sample Text | TensorRT 7.2 | A40 | |
| 128 | 7,180 sequences/sec | 27 sequences/sec/watt | 17.83 | 1x A40 | GIGABYTE G482-Z52-00 | 21.05-py3 | INT8 | Sample Text | TensorRT 7.2 | A40 | |
| BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 1,666 sequences/sec | 6 sequences/sec/watt | 4.8 | 1x A40 | GIGABYTE G482-Z52-00 | 21.05-py3 | INT8 | Sample Text | TensorRT 7.2 | A40 | |
| 128 | 2,216 sequences/sec | 9 sequences/sec/watt | 57.76 | 1x A40 | GIGABYTE G482-Z52-00 | 21.05-py3 | INT8 | Sample Text | TensorRT 7.2 | A40 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container
Starting from 21.09-py3, ECC is enabled
A30 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 8,497 images/sec | 70 images/sec/watt | 0.94 | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A30 | |
| 109 | 16,083 images/sec | - | 6.78 | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | A30 | |
| 128 | 15,385 images/sec | 94 images/sec/watt | 8.32 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30 | |
| ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 8,330 images/sec | 68 images/sec/watt | 0.96 | 1x A30 | GIGABYTE G482-Z52-SW-QZ-001 | 21.03-py3 | INT8 | Synthetic | TensorRT 7.2 | A30 | |
| 106 | 15,495 images/sec | - | 6.84 | 1x A30 | GIGABYTE G482-Z52-00 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | A30 | |
| 128 | 15,411 images/sec | 93 images/sec/watt | 8.31 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30 | |
| BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 4,417 sequences/sec | 33 sequences/sec/watt | 1.81 | 1x A30 | GIGABYTE G482-Z52-00 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A30 | |
| 128 | 6,815 sequences/sec | 50 sequences/sec/watt | 18.78 | 1x A30 | GIGABYTE G482-Z52-00 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A30 | |
| BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 1,492 sequences/sec | 11 sequences/sec/watt | 5.36 | 1x A30 | GIGABYTE G482-Z52-00 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A30 | |
| 128 | 2,207 sequences/sec | 15 sequences/sec/watt | 58 | 1x A30 | GIGABYTE G482-Z52-00 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A30 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container
Starting from 21.09-py3, ECC is enabled
A30 1/4 MIG Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 3,636 images/sec | 45 images/sec/watt | 2.2 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30 | |
| 28 | 4,212 images/sec | - | 6.65 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30 | |
| 128 | 4,593 images/sec | 51 images/sec/watt | 27.87 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30 | |
| ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 3,551 images/sec | 43 images/sec/watt | 2.25 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30 | |
| 28 | 4,090 images/sec | - | 6.85 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30 | |
| 128 | 4,445 images/sec | 49 images/sec/watt | 28.8 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30 |
Efficiency based on board power
A hyphen in the Container column indicates a pre-release container
Starting from 21.09-py3, ECC is enabled
A30 4 MIG Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 14,553 images/sec | 43 images/sec/watt | 2.2 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30 | |
| 29 | 16,975 images/sec | - | 6.83 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30 | |
| 128 | 18,316 images/sec | 51 images/sec/watt | 27.95 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30 | |
| ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 14,178 images/sec | 42 images/sec/watt | 2.26 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30 | |
| 28 | 16,376 images/sec | - | 6.84 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30 | |
| 128 | 17,750 images/sec | 50 images/sec/watt | 28.85 | 1x A30 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A30 |
Efficiency based on board power
A hyphen in the Container column indicates a pre-release container
Starting from 21.09-py3, ECC is enabled
A10 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 7,954 images/sec | 53 images/sec/watt | 1.01 | 1x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A10 | |
| 75 | 11,769 images/sec | - | 6.8 | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | A10 | |
| 128 | 11,264 images/sec | 75 images/sec/watt | 11.36 | 1x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A10 | |
| ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 7,683 images/sec | 51 images/sec/watt | 1.04 | 1x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A10 | |
| 75 | 11,044 images/sec | - | 6.79 | 1x A10 | GIGABYTE G482-Z52-00 | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | A10 | |
| 128 | 10,676 images/sec | 71 images/sec/watt | 11.99 | 1x A10 | GIGABYTE G482-Z52-00 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A10 | |
| BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 3,598 sequences/sec | 27 sequences/sec/watt | 2.22 | 1x A10 | GIGABYTE G482-Z52-00 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A10 | |
| 128 | 4,766 sequences/sec | 35 sequences/sec/watt | 26.86 | 1x A10 | GIGABYTE G482-Z52-00 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A10 | |
| BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 1,257 sequences/sec | 10 sequences/sec/watt | 6.36 | 1x A10 | GIGABYTE G482-Z52-00 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A10 | |
| 128 | 1,462 sequences/sec | 11 sequences/sec/watt | 88 | 1x A10 | GIGABYTE G482-Z52-00 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | A10 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container
Starting from 21.09-py3, ECC is enabled
A2 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 v1.5 | 1 | 1,377 images/sec | 31 images/sec/watt | 0.73 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | A2 |
| 2 | 1,823 images/sec | 37 images/sec/watt | 1.10 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | A2 | |
| 8 | 2,475 images/sec | 43 images/sec/watt | 3.23 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | A2 | |
| BERT Base | 1 | 619 sequences/sec | 13 sequences/sec/watt | 1.62 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | A2 |
| 2 | 789 sequences/sec | 15 sequences/sec/watt | 2.53 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | A2 | |
| 8 | 1,018 sequences/sec | 19 sequences/sec/watt | 7.86 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | A2 | |
| BERT Large | 1 | 241 sequences/sec | 4 sequences/sec/watt | 4.14 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | A2 |
| 2 | 261 sequences/sec | 5 sequences/sec/watt | 7.66 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | A2 | |
| 8 | 317 sequences/sec | 6 sequences/sec/watt | 25.23 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | A2 | |
| EfficientDet-D0 | 1 | 280 images/sec | - | 3.57 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | INT8 | Synthetic | TensorRT 8.2.06 | A2 |
| QuartzNet | 1 | 323 images/sec | - | 3.10 | 1x A2 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | INT8 | Synthetic | TensorRT 8.2.06 | A2 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
T4 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 4,008 images/sec | 56 images/sec/watt | 2.04 | 1x T4 | Supermicro SYS-4029GP-TRT | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | NVIDIA T4 | |
| 32 | 4,771 images/sec | - | 6.71 | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0.1 | NVIDIA T4 | |
| 128 | 4,879 images/sec | 70 images/sec/watt | 26.23 | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | NVIDIA T4 | |
| ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 3,745 images/sec | 54 images/sec/watt | 2.14 | 1x T4 | Supermicro SYS-4029GP-TRT | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | NVIDIA T4 | |
| 29 | 4,501 images/sec | - | 6.44 | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.08-py3 | INT8 | Synthetic | TensorRT 8.0 | NVIDIA T4 | |
| 128 | 4,596 images/sec | 66 images/sec/watt | 27.85 | 1x T4 | Supermicro SYS-1029GQ-TRT | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | NVIDIA T4 | |
| BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 1,766 sequences/sec | 27 sequences/sec/watt | 4.53 | 1x T4 | Supermicro SYS-4029GP-TRT | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | NVIDIA T4 | |
| 128 | 1,872 sequences/sec | 28 sequences/sec/watt | 68 | 1x T4 | Supermicro SYS-4029GP-TRT | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | NVIDIA T4 | |
| BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 573 sequences/sec | 9 sequences/sec/watt | 13.97 | 1x T4 | Supermicro SYS-4029GP-TRT | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | NVIDIA T4 | |
| 128 | 565 sequences/sec | 8 sequences/sec/watt | 227 | 1x T4 | Supermicro SYS-4029GP-TRT | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | NVIDIA T4 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A hyphen in the Container column indicates a pre-release container
For BS=1 inference refer to the Triton Inference Server section
Starting from 21.09-py3, ECC is enabled
V100 Inference Performance
| Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 4,362 images/sec | 16 images/sec/watt | 1.83 | 1x V100 | DGX-2 | 21.09-py3 | Mixed | Synthetic | TensorRT 8.0.3 | V100-SXM3-32GB | |
| 52 | 7,917 images/sec | - | 6.57 | 1x V100 | DGX-2 | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | V100-SXM3-32GB | |
| 128 | 8,165 images/sec | 23 images/sec/watt | 15.68 | 1x V100 | DGX-2 | 21.09-py3 | Mixed | Synthetic | TensorRT 8.0.3 | V100-SXM3-32GB | |
| ResNet-50v1.5 | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 4,285 images/sec | 15 images/sec/watt | 1.87 | 1x V100 | DGX-2 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | V100-SXM3-32GB | |
| 52 | 7,508 images/sec | - | 6.93 | 1x V100 | DGX-2 | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2.3 | V100-SXM3-32GB | |
| 128 | 7,773 images/sec | 22 images/sec/watt | 16.47 | 1x V100 | DGX-2 | 21.09-py3 | Mixed | Synthetic | TensorRT 8.0.3 | V100-SXM3-32GB | |
| BERT-BASE | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 2,201 sequences/sec | 8 sequences/sec/watt | 3.64 | 1x V100 | DGX-2 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | V100-SXM3-32GB | |
| 128 | 3,174 sequences/sec | 10 sequences/sec/watt | 40.33 | 1x V100 | DGX-2 | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | V100-SXM3-32GB | |
| BERT-LARGE | 1 | For Batch Size 1, please refer to Triton Inference Server tab | |||||||||
| 2 | For Batch Size 2, please refer to Triton Inference Server tab | ||||||||||
| 8 | 790 sequences/sec | 3 sequences/sec/watt | 10.12 | 1x V100 | DGX-2 | 21.06-py3 | Mixed | Sample Text | TensorRT 7.2 | V100-SXM3-32GB | |
| 128 | 971 sequences/sec | 3 sequences/sec/watt | 132 | 1x V100 | DGX-2 | 21.06-py3 | Mixed | Sample Text | TensorRT 7.2 | V100-SXM3-32GB |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
Container versions with a hyphen indicate a pre-release container
For BS=1 inference refer to the Triton Inference Server section
Starting from 21.09-py3, ECC is enabled
Inference Performance of NVIDIA GPU on Cloud
Benchmarks are reproducible by following the links to the scripts in the NGC catalog
A100 Inference Performance on Cloud
| Network | Batch Size | Throughput | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 10,961 images/sec | 0.73 | 1x A100 | GCP A2-HIGHGPU-1G | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100-SXM4-40GB |
| 128 | 27,779 images/sec | 4.61 | 1x A100 | GCP A2-HIGHGPU-1G | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | A100-SXM4-40GB |
Starting from 21.09-py3, ECC is enabled
T4 Inference Performance on Cloud
| Network | Batch Size | Throughput | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 3,516 images/sec | 2.28 | 1x T4 | GCP N1-HIGHMEM-8 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | NVIDIA T4 |
| 128 | 4,090 images/sec | 31.3 | 1x T4 | GCP N1-HIGHMEM-8 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | NVIDIA T4 | |
| 8 | 3,533 images/sec | 2.26 | 1x T4 | AWS EC2 g4dn.4xlarge | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2 | NVIDIA T4 | |
| 128 | 4,555 images/sec | 28.1 | 1x T4 | AWS EC2 g4dn.4xlarge | 21.06-py3 | INT8 | Synthetic | TensorRT 7.2 | NVIDIA T4 | |
| BERT-LARGE | 8 | 551 sequences/sec | 14.52 | 1x T4 | AWS EC2 g4dn.4xlarge | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | NVIDIA T4 |
| 128 | 540 sequences/sec | 237.25 | 1x T4 | AWS EC2 g4dn.4xlarge | 21.06-py3 | INT8 | Sample Text | TensorRT 7.2 | NVIDIA T4 |
BERT-Large: Sequence Length = 128
Starting from 21.09-py3, ECC is enabled
V100 Inference Performance on Cloud
| Network | Batch Size | Throughput | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-50v1.5 | 8 | 4,217 images/sec | 1.9 | 1x V100 | GCP N1-HIGHMEM-8 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | V100-SXM2-16GB |
| 128 | 7,436 images/sec | 17.21 | 1x V100 | GCP N1-HIGHMEM-8 | 21.09-py3 | INT8 | Synthetic | TensorRT 8.0.3 | V100-SXM2-16GB |
Starting from 21.09-py3, ECC is enabled
Conversational AI
NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.
Related Resources
Download and get started with NVIDIA Riva.
Riva Benchmarks
Automatic Speech Recognition
A100 Best Streaming Throughput Mode (800 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Quartznet | 1 | 14.4 | 1 | A100 SXM4-40GB |
| Quartznet | 256 | 254.3 | 64 | A100 SXM4-40GB |
| Quartznet | 512 | 351.2 | 506 | A100 SXM4-40GB |
| Quartznet | 1024 | 630.8 | 1005 | A100 SXM4-40GB |
| Jasper | 1 | 17.6 | 1 | A100 SXM4-40GB |
| Jasper | 256 | 244.9 | 254 | A100 SXM4-40GB |
| Jasper | 512 | 381 | 507 | A100 SXM4-40GB |
| Jasper | 1024 | 749.3 | 1,004 | A100 SXM4-40GB |
A100 Best Streaming Latency Mode (100 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Quartznet | 1 | 9.6 | 1 | A100 SXM4-40GB |
| Quartznet | 16 | 25.9 | 16 | A100 SXM4-40GB |
| Quartznet | 128 | 132.4 | 128 | A100 SXM4-40GB |
| Jasper | 1 | 13.4 | 1 | A100 SXM4-40GB |
| Jasper | 16 | 26.3 | 16 | A100 SXM4-40GB |
| Jasper | 128 | 258.9 | 128 | A100 SXM4-40GB |
A100 Offline Mode (3200 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Quartznet | 1 | 28.1 | 1 | A100 SXM4-40GB |
| Quartznet | 512 | 566.5 | 505 | A100 SXM4-40GB |
| Quartznet | 1,024 | 899.3 | 1,000 | A100 SXM4-40GB |
| Quartznet | 1,512 | 1,303.8 | 1,460 | A100 SXM4-40GB |
| Jasper | 1 | 31 | 1 | A100 SXM4-40GB |
| Jasper | 512 | 667.5 | 504 | A100 SXM4-40GB |
| Jasper | 1,024 | 1,089 | 997 | A100 SXM4-40GB |
| Jasper | 1,512 | 1,753.8 | 1,449 | A100 SXM4-40GB |
V100 Best Streaming Throughput Mode (800 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Quartznet | 1 | 14.4 | 1 | V100 SXM2-16GB |
| Quartznet | 256 | 222.2 | 254 | V100 SXM2-16GB |
| Quartznet | 512 | 385.2 | 505 | V100 SXM2-16GB |
| Quartznet | 768 | 574.5 | 752 | V100 SXM2-16GB |
| Jasper | 1 | 26.8 | 1 | V100 SXM2-16GB |
| Jasper | 128 | 239.4 | 127 | V100 SXM2-16GB |
| Jasper | 256 | 416 | 253 | V100 SXM2-16GB |
| Jasper | 512 | 969.7 | 500 | V100 SXM2-16GB |
V100 Best Streaming Latency Mode (100 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Quartznet | 1 | 8.8 | 1 | V100 SXM2-16GB |
| Quartznet | 16 | 22.4 | 16 | V100 SXM2-16GB |
| Quartznet | 128 | 114.7 | 127 | V100 SXM2-16GB |
| Jasper | 1 | 21.5 | 1 | V100 SXM2-16GB |
| Jasper | 16 | 36.9 | 16 | V100 SXM2-16GB |
| Jasper | 64 | 406.4 | 64 | V100 SXM2-16GB |
| Jasper | 512 | 969.7 | 500 | V100 SXM2-16GB |
V100 Offline Mode (3200 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Quartznet | 1 | 32.933 | 1 | V100 SXM2-16GB |
| Quartznet | 256 | 461.44 | 253 | V100 SXM2-16GB |
| Quartznet | 512 | 784.73 | 502 | V100 SXM2-16GB |
| Quartznet | 768 | 1,121.6 | 747 | V100 SXM2-16GB |
| Quartznet | 1,024 | 1,551.5 | 986 | V100 SXM2-16GB |
| Jasper | 1 | 48.351 | 1 | V100 SXM2-16GB |
| Jasper | 256 | 734.99 | 252 | V100 SXM2-16GB |
| Jasper | 512 | 1,423.3 | 498 | V100 SXM2-16GB |
| Jasper | 768 | 2,190.2 | 730 | V100 SXM2-16GB |
T4 Best Streaming Throughput Mode (800 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Quartznet | 1 | 33.183 | 1 | NVIDIA T4 |
| Quartznet | 64 | 162.63 | 64 | NVIDIA T4 |
| Quartznet | 128 | 263.6 | 127 | NVIDIA T4 |
| Quartznet | 256 | 449.28 | 253 | NVIDIA T4 |
| Quartznet | 384 | 732.75 | 376 | NVIDIA T4 |
| Jasper | 1 | 72.377 | 1 | NVIDIA T4 |
| Jasper | 64 | 259.64 | 64 | NVIDIA T4 |
| Jasper | 128 | 450.81 | 127 | NVIDIA T4 |
| Jasper | 256 | 1,200.8 | 249 | NVIDIA T4 |
T4 Best Streaming Latency Mode (100 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Quartznet | 1 | 19.2 | 1 | NVIDIA T4 |
| Quartznet | 16 | 56.4 | 16 | NVIDIA T4 |
| Quartznet | 64 | 242.4 | 64 | NVIDIA T4 |
| Jasper | 1 | 46.9 | 1 | NVIDIA T4 |
| Jasper | 8 | 51.1 | 8 | NVIDIA T4 |
| Jasper | 16 | 84.4 | 16 | NVIDIA T4 |
T4 Offline Mode (3200 ms chunk)
| Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| Quartznet | 1 | 157.62 | 1 | NVIDIA T4 |
| Quartznet | 256 | 906.17 | 251 | NVIDIA T4 |
| Quartznet | 512 | 1,515.2 | 495 | NVIDIA T4 |
| Jasper | 1 | 96.201 | 1 | NVIDIA T4 |
| Jasper | 256 | 1,758.4 | 247 | NVIDIA T4 |
ASR Throughput (RTFX) - Number of seconds of audio processed per second | Audio Chunk Size - Server-side configuration indicating the amount of new data to be considered by the acoustic model | ASR Dataset: Librispeech | The latency numbers were measured using the streaming recognition mode, with the BERT-Base punctuation model enabled, a 4-gram language model, a decoder beam width of 128, and timestamps enabled. The client and the server used audio chunks of the same duration (100 ms, 800 ms, or 3,200 ms, depending on the server configuration). The Riva streaming client riva_streaming_asr_client, provided in the Riva client image, was used with the --simulate_realtime flag to simulate transcription from a microphone, with each stream performing 5 iterations over a sample audio file from the Librispeech dataset (1272-135031-0000.wav) | Riva version: v1.0.0-b1 | Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
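The RTFX metric above can be computed as the total audio duration processed across all streams divided by the wall-clock time it took. A minimal sketch (the function name and the illustrative durations are ours; with --simulate_realtime the RTFX approaches the number of concurrent streams, as the tables show):

```python
def rtfx(num_streams: int, audio_seconds_per_stream: float, wall_clock_seconds: float) -> float:
    """Real-time factor (RTFX): seconds of audio processed per second of wall-clock time."""
    return num_streams * audio_seconds_per_stream / wall_clock_seconds

# 256 simulated-realtime streams, each feeding 60 s of audio over ~60.5 s wall clock
print(round(rtfx(256, 60, 60.5)))  # 254
```

An RTFX of 1 means one stream is transcribed exactly as fast as the audio arrives; values above the stream count would only occur in offline (faster-than-realtime) processing.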
Natural Language Processing
A100 Benchmarks
| Task | # of streams | Avg Latency (ms) | Throughput (seq/sec) | GPU Version |
|---|---|---|---|---|
| NER | 1 | 3.19 | 311 | A100 SXM4-40GB |
| NER | 256 | 95.5 | 2549 | A100 SXM4-40GB |
| Q&A | 1 | 4.95 | 201 | A100 SXM4-40GB |
| Q&A | 128 | 279 | 453 | A100 SXM4-40GB |
V100 Benchmarks
| Task | # of streams | Avg Latency (ms) | Throughput (seq/sec) | GPU Version |
|---|---|---|---|---|
| NER | 1 | 4.87 | 204 | V100 SXM2-16GB |
| NER | 256 | 135 | 1,797 | V100 SXM2-16GB |
| Q&A | 1 | 7.47 | 134 | V100 SXM2-16GB |
| Q&A | 128 | 521 | 244 | V100 SXM2-16GB |
T4 Benchmarks
| Task | # of streams | Avg Latency (ms) | Throughput (seq/sec) | GPU Version |
|---|---|---|---|---|
| NER | 1 | 9.31 | 107 | NVIDIA T4 |
| NER | 256 | 255 | 960 | NVIDIA T4 |
| Q&A | 1 | 11.5 | 87 | NVIDIA T4 |
| Q&A | 128 | 571 | 223 | NVIDIA T4 |
Named Entity Recognition (NER): 128 seq len, BERT-base | Question Answering (QA): 384 seq len, BERT-large | NLP Throughput (seq/s) - Number of sequences processed per second | Performance of the Riva named entity recognition (NER) service (using a BERT-base model, sequence length of 128) and the Riva question answering (QA) service (using a BERT-large model, sequence length of 384) was measured at batch size 1 latency and at maximum throughput. Riva version: v1.0.0-b1 | Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
Text to Speech
A100 Benchmarks
| # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| 1 | 0.06 | 0.04 | 20 | A100 SXM4-40GB |
| 4 | 0.48 | 0.03 | 37 | A100 SXM4-40GB |
| 6 | 0.69 | 0.03 | 42 | A100 SXM4-40GB |
| 8 | 0.88 | 0.03 | 46 | A100 SXM4-40GB |
| 10 | 1.06 | 0.03 | 49 | A100 SXM4-40GB |
V100 Benchmarks
| # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| 1 | 0.08 | 0.05 | 14 | V100 SXM2-16GB |
| 4 | 0.77 | 0.05 | 23 | V100 SXM2-16GB |
| 6 | 1.11 | 0.05 | 26 | V100 SXM2-16GB |
| 8 | 1.4 | 0.06 | 28 | V100 SXM2-16GB |
| 10 | 1.74 | 0.07 | 28 | V100 SXM2-16GB |
T4 Benchmarks
| # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
|---|---|---|---|---|
| 1 | 0.12 | 0.07 | 11 | NVIDIA T4 |
| 4 | 1.02 | 0.07 | 17 | NVIDIA T4 |
| 6 | 1.59 | 0.07 | 18 | NVIDIA T4 |
| 8 | 2.13 | 0.08 | 19 | NVIDIA T4 |
| 10 | 2.55 | 0.1 | 18 | NVIDIA T4 |
TTS Throughput (RTFX) - Number of seconds of audio generated per second | Dataset: LJSpeech | Performance of the Riva text-to-speech (TTS) service was measured for different numbers of parallel streams. Each parallel stream performed 10 iterations over 10 input strings from the LJSpeech dataset. Latency to first audio chunk, latency between successive audio chunks, and throughput were measured. Riva version: v1.0.0-b1 | Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
Last updated: December 1st, 2021