NVIDIA Data Center Deep Learning Product Performance
Reproducible Performance
Reproduce these results on your own systems by following the instructions in the Measuring Training and Inferencing Performance on NVIDIA AI Platforms Reviewer’s Guide.
Related Resources
Read why training to convergence is essential for enterprise AI adoption.
Learn how cloud services and OEMs raise the bar on AI training with NVIDIA AI in MLPerf Training.
Access containers in the NVIDIA NGC™ catalog.
Learn how MLPerf Benchmarks show why AI is the future of HPC.
HPC Performance
Review the latest GPU-acceleration factors of popular HPC applications.
Training to Convergence
Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology to test whether AI systems are ready to be deployed in the field to deliver meaningful results.
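As an illustration, the Time to Train (mins) figures in the tables below are simply the wall-clock time until a validation metric reaches the published quality target. A minimal sketch of that measurement, assuming placeholder train_one_epoch and evaluate functions:

```python
import time

# Minimal sketch of measuring "Time to Train (mins)": train until the
# validation metric reaches the published quality target, then report
# wall-clock minutes. train_one_epoch and evaluate are placeholders
# standing in for a full training and validation pipeline.
def time_to_train(model, target_metric, max_epochs=1000):
    start = time.perf_counter()
    for epoch in range(max_epochs):
        train_one_epoch(model)
        if evaluate(model) >= target_metric:  # convergence criterion met
            break
    return (time.perf_counter() - start) / 60.0  # minutes
```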
Related Resources
Read our blog on convergence for more details.
Get up and running quickly with NVIDIA’s complete solution stack:
Pull software containers from NVIDIA NGC.
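For example, the containers used in the tables below can typically be pulled and started with Docker plus the NVIDIA Container Toolkit (tags match the Container column; commands shown as an illustration): `docker pull nvcr.io/nvidia/pytorch:22.05-py3`, then `docker run --gpus all -it --rm nvcr.io/nvidia/pytorch:22.05-py3`.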
NVIDIA Performance on MLPerf 2.0 Training Benchmarks
BERT Time to Train on A100
PyTorch | Precision: Mixed | Dataset: Wikipedia 2020/01/01 | Convergence criteria - refer to MLPerf requirements
MLPerf Training Performance
NVIDIA A100 Performance on MLPerf 2.0 AI Benchmarks - Closed Division
MLPerf™ v2.0 Training Closed: MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
NVIDIA A100 Performance on MLPerf 1.0 Training HPC Benchmarks: Strong Scaling - Closed Division
Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|
MXNet | CosmoFlow | 8.04 | Mean average error 0.124 | 1,024x A100 | DGX A100 | 1.0-1120 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB |
MXNet | CosmoFlow | 25.78 | Mean average error 0.124 | 128x A100 | DGX A100 | 1.0-1121 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB |
PyTorch | DeepCAM | 1.67 | IOU 0.82 | 2,048x A100 | DGX A100 | 1.0-1122 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB |
PyTorch | DeepCAM | 2.65 | IOU 0.82 | 512x A100 | DGX A100 | 1.0-1123 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB |
NVIDIA A100 Performance on MLPerf 1.0 Training HPC Benchmarks: Weak Scaling - Closed Division
Framework | Network | Throughput | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|
MXNet | CosmoFlow | 0.73 models/min | Mean average error 0.124 | 4,096x A100 | DGX A100 | 1.0-1131 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | A100-SXM4-80GB |
PyTorch | DeepCAM | 5.27 models/min | IOU 0.82 | 4,096x A100 | DGX A100 | 1.0-1132 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | A100-SXM4-80GB |
MLPerf™ v1.0 Training HPC Closed: MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
For MLPerf™ v1.0 Training HPC rules and guidelines, click here
Converged Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100
Benchmarks are reproducible by following links to the NGC catalog scripts
A100 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.12.0a0 | Tacotron2 | 101 | .55 Training Loss | 305,059 total output mels/sec | 8x A100 | DGX A100 | 22.05-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | WaveGlow | 250 | -5.81 Training Loss | 1,709,596 output samples/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | GNMT V2 | 16 | 24.16 BLEU Score | 960,545 total tokens/sec | 8x A100 | DGX A100 | 22.05-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | NCF | 0.37 | .96 Hit Rate at 10 | 152,745,062 samples/sec | 8x A100 | DGX A100 | 22.05-py3 | Mixed | 131072 | MovieLens 20M | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | Transformer-XL Base | 177 | 22.45 Perplexity | 742,987 total tokens/sec | 8x A100 | DGX A100 | 22.05-py3 | Mixed | 128 | WikiText-103 | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | EfficientNet-B0 | 576 | 76.54 Top 1 | 16,016 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-40GB |
PyTorch | 1.12.0a0 | EfficientDet-D0 | 454 | .34 BBOX mAP | 1,990 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 150 | COCO 2017 | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B0 | 575 | 76.89 Top 1 | 15,489 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B4 | 1,252 | 78.06 Top 1 | 6,956 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 128 | ImageNet2012 | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | SE3 Transformer | 9 | .04 MAE | 21,466 molecules/sec | 8x A100 | DGX A100 | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | A100-SXM4-80GB |
TensorFlow | 1.15.5 | ResNeXt101 | 188 | 79.19 Top 1 | 10,300 images/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
TensorFlow | 1.15.5 | SE-ResNeXt101 | 216 | 79.76 Top 1 | 8,960 images/sec | 8x A100 | DGX A100 | 22.05-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 1,080 images/sec | 8x A100 | DGX A100 | 22.05-py3 | Mixed | 2 | DAGM2007 | A100-SXM4-80GB |
TensorFlow | 2.8.0 | U-Net Medical | 5 | .89 DICE Score | 973 images/sec | 8x A100 | DGX A100 | 22.05-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM4-80GB |
TensorFlow | 2.8.0 | Electra Base Fine Tuning | 3 | 92.55 F1 | 2,823 sequences/sec | 8x A100 | DGX A100 | 22.05-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
TensorFlow | 2.8.0 | Wide and Deep | 6 | .66 MAP at 12 | 4,442,267 samples/sec | 8x A100 | DGX A100 | 22.04-py3 | Mixed | 16384 | Tabular Outbrain Parquet | A100-SXM4-40GB |
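Rows labeled Mixed in these tables use automatic mixed precision; TF32 rows use Ampere’s default TF32 math. A minimal PyTorch sketch of a mixed-precision training step (illustrative only, not the exact NGC script; model, optimizer, and loss_fn are placeholders):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

# Minimal sketch of a mixed-precision ("Mixed") training step using
# PyTorch automatic mixed precision.
scaler = GradScaler()

def amp_step(model, x, y, optimizer, loss_fn):
    optimizer.zero_grad()
    with autocast():                 # run the forward pass in reduced precision
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()    # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)           # unscale gradients, then update weights
    scaler.update()
    return loss
```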
A40 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.12.0a0 | NCF | 1 | .96 Hit Rate at 10 | 49,793,335 samples/sec | 8x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 131072 | MovieLens 20M | A40 |
PyTorch | 1.12.0a0 | Tacotron2 | 117 | .56 Training Loss | 262,848 total output mels/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.05-py3 | Mixed | 128 | LJSpeech 1.1 | A40 |
PyTorch | 1.12.0a0 | WaveGlow | 468 | -5.76 Training Loss | 901,085 output samples/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A40 |
PyTorch | 1.12.0a0 | GNMT V2 | 45 | 24.3 BLEU Score | 326,251 total tokens/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.05-py3 | Mixed | 128 | wmt16-en-de | A40 |
PyTorch | 1.12.0a0 | Transformer-XL Base | 434 | 22.43 Perplexity | 303,433 total tokens/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.05-py3 | Mixed | 128 | WikiText-103 | A40 |
PyTorch | 1.12.0a0 | EfficientNet-B0 | 875 | 76.44 Top 1 | 10,150 images/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.04-py3 | Mixed | 256 | ImageNet2012 | A40 |
PyTorch | 1.12.0a0 | EfficientDet-D0 | 640 | .34 BBOX mAP | 1,266 images/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.04-py3 | Mixed | 60 | COCO 2017 | A40 |
PyTorch | 1.12.0a0 | SE3 Transformer | 13 | .04 MAE | 13,766 molecules/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | A40 |
TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 660 images/sec | 8x A40 | GIGABYTE G482-Z52-00 | 22.02-py3 | Mixed | 2 | DAGM2007 | A40 |
TensorFlow | 1.15.5 | ResNeXt101 | 424 | 79.18 Top 1 | 4,541 images/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.04-py3 | Mixed | 256 | ImageNet2012 | A40 |
TensorFlow | 1.15.5 | SE-ResNeXt101 | 468 | 79.74 Top 1 | 4,125 images/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.05-py3 | Mixed | 256 | ImageNet2012 | A40 |
TensorFlow | 2.8.0 | Electra Base Fine Tuning | 4 | 92.6 F1 | 1,132 sequences/sec | 8x A40 | Supermicro AS-4124GS-TNR | 22.05-py3 | Mixed | 32 | SQuAD v1.1 | A40 |
A30 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.12.0a0 | Tacotron2 | 123 | .53 Training Loss | 253,743 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 104 | LJSpeech 1.1 | A30 |
PyTorch | 1.12.0a0 | WaveGlow | 450 | -5.8 Training Loss | 939,042 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A30 |
PyTorch | 1.12.0a0 | GNMT V2 | 45 | 24.43 BLEU Score | 325,304 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | wmt16-en-de | A30 |
PyTorch | 1.12.0a0 | NCF | 1 | .96 Hit Rate at 10 | 57,626,487 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 131072 | MovieLens 20M | A30 |
PyTorch | 1.12.0a0 | ResNeXt101 | 541 | 79.47 Top 1 | 3,639 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 112 | ImageNet2012 | A30 |
PyTorch | 1.12.0a0 | EfficientNet-B0 | 908 | 76.4 Top 1 | 9,667 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | ImageNet2012 | A30 |
PyTorch | 1.12.0a0 | EfficientDet-D0 | 768 | .34 BBOX mAP | 956 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 30 | COCO 2017 | A30 |
PyTorch | 1.12.0a0 | SE3 Transformer | 12 | .04 MAE | 15,418 molecules/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | A30 |
TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 698 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 2 | DAGM2007 | A30 |
TensorFlow | 2.8.0 | U-Net Medical | 2 | .88 DICE Score | 475 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 8 | EM segmentation challenge | A30 |
TensorFlow | 1.15.5 | ResNeXt101 | 459 | 79.33 Top 1 | 4,213 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | ImageNet2012 | A30 |
TensorFlow | 1.15.5 | SE-ResNeXt101 | 557 | 79.83 Top 1 | 3,475 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 96 | ImageNet2012 | A30 |
TensorFlow | 2.8.0 | Electra Base Fine Tuning | 5 | 92.69 F1 | 990 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | SQuAD v1.1 | A30 |
A10 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.12.0a0 | Tacotron2 | 143 | .54 Training Loss | 218,110 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 104 | LJSpeech 1.1 | A10 |
PyTorch | 1.12.0a0 | WaveGlow | 546 | -5.86 Training Loss | 771,407 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A10 |
PyTorch | 1.12.0a0 | GNMT V2 | 52 | 24.1 BLEU Score | 281,087 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | wmt16-en-de | A10 |
PyTorch | 1.12.0a0 | NCF | 1 | .96 Hit Rate at 10 | 45,665,439 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 131072 | MovieLens 20M | A10 |
PyTorch | 1.12.0a0 | EfficientNet-B0 | 1,117 | 76.3 Top 1 | 7,885 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | ImageNet2012 | A10 |
PyTorch | 1.12.0a0 | EfficientDet-D0 | 790 | .34 BBOX mAP | 923 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 30 | COCO 2017 | A10 |
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B0 | 1,143 | 76.78 Top 1 | 7,729 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | ImageNet2012 | A10 |
PyTorch | 1.12.0a0 | SE3 Transformer | 15 | .04 MAE | 12,058 molecules/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | A10 |
TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 657 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 2 | DAGM2007 | A10 |
TensorFlow | 2.8.0 | U-Net Medical | 3 | .89 DICE Score | 369 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 8 | EM segmentation challenge | A10 |
TensorFlow | 1.15.5 | ResNeXt101 | 573 | 79.16 Top 1 | 3,365 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | ImageNet2012 | A10 |
TensorFlow | 1.15.5 | SE-ResNeXt101 | 674 | 79.84 Top 1 | 2,869 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 96 | ImageNet2012 | A10 |
TensorFlow | 2.8.0 | Electra Base Fine Tuning | 6 | 92.64 F1 | 753 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | SQuAD v1.1 | A10 |
T4 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.12.0a0 | ResNeXt101 | 1,375 | 79.43 Top 1 | 1,432 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | Mixed | 112 | ImageNet2012 | NVIDIA T4 |
PyTorch | 1.12.0a0 | WaveGlow | 1,120 | -5.82 Training Loss | 387,032 output samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.03-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4 |
PyTorch | 1.12.0a0 | GNMT V2 | 93 | 24.22 BLEU Score | 155,176 total tokens/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4 |
PyTorch | 1.12.0a0 | NCF | 2 | .96 Hit Rate at 10 | 25,081,490 samples/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | Mixed | 131072 | MovieLens 20M | NVIDIA T4 |
PyTorch | 1.12.0a0 | EfficientNet-B0 | 2,371 | 76.43 Top 1 | 3,702 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 128 | ImageNet2012 | NVIDIA T4 |
PyTorch | 1.12.0a0 | EfficientDet-D0 | 1,349 | .34 BBOX mAP | 506 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 30 | COCO 2017 | NVIDIA T4 |
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B0 | 2,480 | 76.67 Top 1 | 3,567 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 128 | ImageNet2012 | NVIDIA T4 |
PyTorch | 1.12.0a0 | SE3 Transformer | 37 | .04 MAE | 4,666 molecules/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | NVIDIA T4 |
TensorFlow | 1.15.5 | U-Net Industrial | 2 | .99 IoU Threshold 0.99 | 299 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | Mixed | 2 | DAGM2007 | NVIDIA T4 |
TensorFlow | 1.15.5 | U-Net Medical | 39 | .89 DICE Score | 149 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4 |
TensorFlow | 1.15.5 | ResNeXt101 | 1,257 | 79.38 Top 1 | 1,533 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.04-py3 | Mixed | 128 | ImageNet2012 | NVIDIA T4 |
TensorFlow | 1.15.5 | SE-ResNeXt101 | 1,580 | 79.91 Top 1 | 1,220 images/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | Mixed | 96 | ImageNet2012 | NVIDIA T4 |
TensorFlow | 2.8.0 | Electra Base Fine Tuning | 10 | 92.7 F1 | 378 sequences/sec | 8x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | Mixed | 16 | SQuAD v1.1 | NVIDIA T4 |
V100 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.12.0a0 | Tacotron2 | 181 | .53 Training Loss | 180,095 total output mels/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | WaveGlow | 411 | -5.72 Training Loss | 1,035,406 output samples/sec | 8x V100 | DGX-2 | 22.03-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | GNMT V2 | 34 | 24.42 BLEU Score | 440,905 total tokens/sec | 8x V100 | DGX-2 | 22.05-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | NCF | 1 | .96 Hit Rate at 10 | 94,214,173 samples/sec | 8x V100 | DGX-2 | 22.05-py3 | Mixed | 131072 | MovieLens 20M | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | EfficientNet-B0 | 1,028 | 76.47 Top 1 | 8,709 images/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | EfficientDet-D0 | 1,239 | .34 BBOX mAP | 565 images/sec | 8x V100 | DGX-2 | 22.05-py3 | Mixed | 60 | COCO 2017 | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B0 | 1,024 | 76.97 Top 1 | 8,737 images/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | SE3 Transformer | 14 | .04 MAE | 13,459 molecules/sec | 8x V100 | DGX-2 | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | V100-SXM3-32GB |
TensorFlow | 1.15.5 | U-Net Industrial | 1 | .99 IoU Threshold 0.99 | 639 images/sec | 8x V100 | DGX-2 | 22.05-py3 | Mixed | 2 | DAGM2007 | V100-SXM3-32GB |
TensorFlow | 1.15.5 | ResNeXt101 | 419 | 79.4 Top 1 | 4,622 images/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 128 | ImageNet2012 | V100-SXM3-32GB |
TensorFlow | 1.15.5 | SE-ResNeXt101 | 493 | 79.9 Top 1 | 3,945 images/sec | 8x V100 | DGX-2 | 22.05-py3 | Mixed | 96 | ImageNet2012 | V100-SXM3-32GB |
TensorFlow | 1.15.5 | U-Net Medical | 12 | .89 DICE Score | 466 images/sec | 8x V100 | DGX-2 | 22.05-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB |
TensorFlow | 2.8.0 | Wide and Deep | 9 | .66 MAP at 12 | 2,921,693 samples/sec | 8x V100 | DGX-2 | 22.04-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | V100-SXM3-32GB |
TensorFlow | 2.8.0 | Electra Base Fine Tuning | 4 | 92.62 F1 | 1,376 sequences/sec | 8x V100 | DGX-2 | 22.05-py3 | Mixed | 32 | SQuAD v1.1 | V100-SXM3-32GB |
Converged Training Performance of NVIDIA GPU on Cloud
Benchmarks are reproducible by following links to the NGC catalog scripts
A100 Training Performance on Cloud
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
TensorFlow | - | BERT-LARGE | 12 | 91.38 F1 | 769 sequences/sec | 8x A100 | AWS EC2 p4d.24xlarge | 22.04-py3 | Mixed | 24 | SQuAD v1.1 | A100-SXM4-40GB |
TensorFlow | - | BERT-LARGE | 10 | 91.36 F1 | 825 sequences/sec | 8x A100 | Azure Standard_ND96amsr_A100_v4 | 22.05-py3 | Mixed | 24 | SQuAD v1.1 | A100-SXM4-40GB |
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
V100 Training Performance on Cloud
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
TensorFlow | - | BERT-LARGE | 29 | 91.3 F1 | 172 sequences/sec | 8x V100 | AWS EC2 p3.16xlarge | 22.04-py3 | Mixed | 3 | SQuAD v1.1 | V100-SXM2-16GB |
BERT-Large = BERT-Large Fine Tuning (SQuAD v1.1) with Sequence Length of 384
Converged Multi-Node Training Performance of NVIDIA GPU
Benchmarks are reproducible by following links to the NGC catalog scripts
A100 Multi-Node Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | Total GPUs | Nodes | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P1 | 296 | 1.53 Training Loss | 25,365 sequences/sec | 64x A100 | 8 | Selene | 21.12-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM4-80GB |
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P2 | 169 | 1.35 Training Loss | 5,112 sequences/sec | 64x A100 | 8 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB |
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training E2E | 253 | 1.35 Training Loss | - | 64x A100 | 8 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB |
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P1 | 160 | 1.51 Training Loss | 48,380 sequences/sec | 128x A100 | 16 | Selene | 21.12-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM4-80GB |
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P2 | 87 | 1.34 Training Loss | 9,961 sequences/sec | 128x A100 | 16 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB |
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training E2E | 136 | 1.34 Training Loss | - | 128x A100 | 16 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB |
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P1 | 87 | 1.49 Training Loss | 89,062 sequences/sec | 256x A100 | 32 | Selene | 21.12-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM4-80GB |
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P2 | 46 | 1.34 Training Loss | 19,169 sequences/sec | 256x A100 | 32 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB |
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training E2E | 73 | 1.34 Training Loss | - | 256x A100 | 32 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB |
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P1 | 51 | 1.5 Training Loss | 153,429 sequences/sec | 512x A100 | 64 | Selene | 21.12-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM4-80GB |
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training P2 | 25 | 1.33 Training Loss | 36,887 sequences/sec | 512x A100 | 64 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB |
PyTorch | 1.11.0a0 | BERT-LARGE Pre-Training E2E | 42 | 1.33 Training Loss | - | 512x A100 | 64 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB |
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P1 | 26 | 1.5 Training Loss | 300,769 sequences/sec | 1,024x A100 | 128 | Selene | 21.09-py3 | Mixed | 64 | SQuAD v1.1 | A100-SXM4-80GB |
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training P2 | 13 | 1.35 Training Loss | 74,498 sequences/sec | 1,024x A100 | 128 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB |
PyTorch | 1.10.0a0 | BERT-LARGE Pre-Training E2E | 22 | 1.35 Training Loss | - | 1,024x A100 | 128 | Selene | 21.09-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB |
PyTorch | 1.11.0a0 | Transformer | 186 | 18.25 Perplexity | 454,979 total tokens/sec | 16x A100 | 2 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB |
PyTorch | 1.11.0a0 | Transformer | 105 | 18.27 Perplexity | 822,173 total tokens/sec | 64x A100 | 4 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB |
PyTorch | 1.11.0a0 | Transformer | 63 | 18.34 Perplexity | 1,389,494 total tokens/sec | 64x A100 | 8 | Selene | 21.12-py3 | Mixed | 16 | SQuAD v1.1 | A100-SXM4-80GB |
BERT-Large Pre-Training Phase 1 Sequence Length = 128
BERT-Large Pre-Training Phase 2 Sequence Length = 512
Starting from 21.09-py3, ECC is enabled
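Multi-node runs like these use one process per GPU with data-parallel gradient averaging over NCCL. A minimal PyTorch DistributedDataParallel sketch (assumes a torchrun-style launcher that sets the usual RANK/WORLD_SIZE/LOCAL_RANK environment variables; build_model is a placeholder):

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

# Minimal multi-node data-parallel setup sketch: one process per GPU,
# NCCL for inter-GPU communication, gradients averaged automatically by
# DistributedDataParallel during backward(). build_model is a placeholder.
dist.init_process_group(backend="nccl")      # reads RANK/WORLD_SIZE from env
local_rank = int(os.environ["LOCAL_RANK"])   # set by the launcher per process
torch.cuda.set_device(local_rank)
model = DistributedDataParallel(build_model().cuda(), device_ids=[local_rank])
```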
Single-GPU Training
Some scenarios aren’t used in real-world training, such as single-GPU throughput. The table below provides an indication of a platform’s single-chip throughput.
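For reference, single-GPU throughput here means samples processed per second of wall-clock training time. A minimal sketch of how such a number can be measured, with warmup excluded and GPU work synchronized (model, loader, optimizer, and loss_fn are placeholders):

```python
import time

import torch

# Minimal sketch of measuring single-GPU training throughput in
# samples/sec: skip warmup iterations (startup and autotuning cost),
# then time a fixed number of steps and synchronize so queued GPU
# work is fully counted.
def measure_throughput(model, loader, optimizer, loss_fn, warmup=10, iters=100):
    model.cuda().train()
    batches = iter(loader)
    for _ in range(warmup):
        x, y = next(batches)
        loss = loss_fn(model(x.cuda()), y.cuda())
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    torch.cuda.synchronize()
    start, seen = time.perf_counter(), 0
    for _ in range(iters):
        x, y = next(batches)
        loss = loss_fn(model(x.cuda()), y.cuda())
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        seen += x.size(0)
    torch.cuda.synchronize()                     # wait for in-flight GPU work
    return seen / (time.perf_counter() - start)  # samples/sec
```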
Related Resources
Achieve unprecedented acceleration at every scale with NVIDIA’s complete solution stack.
Pull software containers from NVIDIA NGC.
Single GPU Training Performance of NVIDIA A100, A40, A30, A10, T4 and V100
Benchmarks are reproducible by following links to the NGC catalog scripts
A100 Training Performance
Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.12.0a0 | Tacotron2 | 40,316 total output mels/sec | 1x A100 | DGX A100 | 22.05-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | WaveGlow | 230,472 output samples/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A100-SXM4-80GB |
PyTorch | 1.11.0a0 | FastPitch | 87,184 frames/sec | 1x A100 | DGX A100 | 22.02-py3 | TF32 | 128 | LJSpeech 1.1 | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | GNMT V2 | 170,507 total tokens/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 128 | wmt16-en-de | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | NCF | 39,322,684 samples/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | ResNeXt101 | 1,134 images/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 128 | ImageNet2012 | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | Transformer-XL Large | 17,068 total tokens/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 16 | WikiText-103 | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | Transformer-XL Base | 90,918 total tokens/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 128 | WikiText-103 | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | nnU-Net | 1,120 images/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 64 | Medical Segmentation Decathlon | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | BERT Large Pre-Training Phase 2 | 289 sequences/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 56 | Wikipedia 2020/01/01 | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | BERT Large Pre-Training Phase 1 | 853 sequences/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 512 | Wikipedia 2020/01/01 | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | EfficientDet-D0 | 270 images/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 150 | COCO 2017 | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B0 | 1,920 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B4 | 940 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 128 | ImageNet2012 | A100-SXM4-80GB |
PyTorch | 1.12.0a0 | SE3 Transformer | 3,097 molecules/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | A100-SXM4-80GB |
TensorFlow | 1.15.5 | ResNeXt101 | 1,322 images/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
TensorFlow | 1.15.5 | SE-ResNeXt101 | 1,156 images/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 256 | ImageNet2012 | A100-SXM4-80GB |
TensorFlow | 1.15.5 | U-Net Industrial | 371 images/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 16 | DAGM2007 | A100-SXM4-40GB |
TensorFlow | 2.8.0 | U-Net Medical | 149 images/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 8 | EM segmentation challenge | A100-SXM4-80GB |
TensorFlow | 2.8.0 | Wide and Deep | 1,876,978 samples/sec | 1x A100 | DGX A100 | 22.04-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A100-SXM4-80GB |
TensorFlow | 2.8.0 | Electra Base Fine Tuning | 372 sequences/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 32 | SQuAD v1.1 | A100-SXM4-80GB |
TensorFlow | 1.15.5 | NCF | 43,735,836 samples/sec | 1x A100 | DGX A100 | 22.05-py3 | Mixed | 1048576 | MovieLens 20M | A100-SXM4-40GB |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
A40 Training Performance
Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.12.0a0 | Tacotron2 | 33,988 total output mels/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | LJSpeech 1.1 | A40 |
PyTorch | 1.12.0a0 | WaveGlow | 144,465 output samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A40 |
PyTorch | 1.12.0a0 | GNMT V2 | 81,260 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | wmt16-en-de | A40 |
PyTorch | 1.12.0a0 | NCF | 17,869,902 samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 1048576 | MovieLens 20M | A40 |
PyTorch | 1.12.0a0 | Transformer-XL Large | 10,184 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | WikiText-103 | A40 |
PyTorch | 1.12.0a0 | FastPitch | 77,556 frames/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 32 | LJSpeech 1.1 | A40 |
PyTorch | 1.12.0a0 | Transformer-XL Base | 42,411 total tokens/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | WikiText-103 | A40 |
PyTorch | 1.12.0a0 | nnU-Net | 562 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 64 | Medical Segmentation Decathlon | A40 |
PyTorch | 1.12.0a0 | EfficientNet-B0 | 1,255 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 256 | ImageNet2012 | A40 |
PyTorch | 1.12.0a0 | EfficientDet-D0 | 172 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 60 | COCO 2017 | A40 |
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B0 | 1,262 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 256 | ImageNet2012 | A40 |
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B4 | 485 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 128 | ImageNet2012 | A40 |
PyTorch | 1.12.0a0 | SE3 Transformer | 1,811 molecules/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | A40 |
TensorFlow | 1.15.5 | U-Net Industrial | 123 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | DAGM2007 | A40 |
TensorFlow | 1.15.5 | BERT-LARGE | 51 sentences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 24 | SQuAD v1.1 | A40 |
TensorFlow | 2.8.0 | U-Net Medical | 70 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 8 | EM segmentation challenge | A40 |
TensorFlow | 2.8.0 | Wide and Deep | 956,456 samples/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A40 |
TensorFlow | 1.15.5 | ResNeXt101 | 605 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 256 | ImageNet2012 | A40 |
TensorFlow | 1.15.5 | SE-ResNeXt101 | 551 images/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 256 | ImageNet2012 | A40 |
TensorFlow | 2.8.0 | Electra Base Fine Tuning | 165 sequences/sec | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 32 | SQuAD v1.1 | A40 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
A30 Training Performance
Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.12.0a0 | Tacotron2 | 34,239 total output mels/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 104 | LJSpeech 1.1 | A30 |
PyTorch | 1.12.0a0 | WaveGlow | 146,477 output samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A30 |
PyTorch | 1.12.0a0 | FastPitch | 66,964 frames/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | LJSpeech 1.1 | A30 |
PyTorch | 1.12.0a0 | NCF | 18,779,967 samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 1048576 | MovieLens 20M | A30 |
PyTorch | 1.12.0a0 | GNMT V2 | 91,406 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | wmt16-en-de | A30 |
PyTorch | 1.12.0a0 | Transformer-XL Base | 19,067 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 32 | WikiText-103 | A30 |
PyTorch | 1.12.0a0 | ResNeXt101 | 547 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 112 | ImageNet2012 | A30 |
PyTorch | 1.12.0a0 | Transformer-XL Large | 7,055 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 4 | WikiText-103 | A30 |
PyTorch | 1.12.0a0 | nnU-Net | 580 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 64 | Medical Segmentation Decathlon | A30 |
PyTorch | 1.12.0a0 | EfficientDet-D0 | 145 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 30 | COCO 2017 | A30 |
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B0 | 1,231 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | ImageNet2012 | A30 |
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B4 | 348 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 32 | ImageNet2012 | A30 |
PyTorch | 1.12.0a0 | SE3 Transformer | 2,017 molecules/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | A30 |
TensorFlow | 1.15.5 | ResNeXt101 | 587 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | ImageNet2012 | A30 |
TensorFlow | 1.15.5 | SE-ResNeXt101 | 497 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 96 | ImageNet2012 | A30 |
TensorFlow | 1.15.5 | U-Net Industrial | 118 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | DAGM2007 | A30 |
TensorFlow | 2.8.0 | U-Net Medical | 73 images/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 8 | EM segmentation challenge | A30 |
TensorFlow | 1.15.5 | Transformer-XL Base | 18,449 total tokens/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | WikiText-103 | A30 |
TensorFlow | 2.8.0 | Electra Base Fine Tuning | 165 sequences/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | SQuAD v1.1 | A30 |
TensorFlow | 1.15.5 | Wide and Deep | 320,517 samples/sec | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A30 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
A10 Training Performance
Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.12.0a0 | Tacotron2 | 28,869 total output mels/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 104 | LJSpeech 1.1 | A10 |
PyTorch | 1.12.0a0 | WaveGlow | 112,773 output samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | A10 |
PyTorch | 1.12.0a0 | FastPitch | 58,928 frames/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | LJSpeech 1.1 | A10 |
PyTorch | 1.12.0a0 | Transformer-XL Base | 15,821 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 32 | WikiText-103 | A10 |
PyTorch | 1.12.0a0 | GNMT V2 | 65,590 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | wmt16-en-de | A10 |
PyTorch | 1.12.0a0 | ResNeXt101 | 397 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 112 | ImageNet2012 | A10 |
PyTorch | 1.12.0a0 | NCF | 14,965,414 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 1048576 | MovieLens 20M | A10 |
PyTorch | 1.12.0a0 | Transformer-XL Large | 6,061 total tokens/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 4 | WikiText-103 | A10 |
PyTorch | 1.12.0a0 | nnU-Net | 454 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 64 | Medical Segmentation Decathlon | A10 |
PyTorch | 1.12.0a0 | EfficientDet-D0 | 132 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 30 | COCO 2017 | A10 |
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B0 | 1,010 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | ImageNet2012 | A10 |
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B4 | 337 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.04-py3 | Mixed | 32 | ImageNet2012 | A10 |
PyTorch | 1.12.0a0 | SE3 Transformer | 1,579 molecules/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | A10 |
TensorFlow | 1.15.5 | ResNeXt101 | 451 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 128 | ImageNet2012 | A10 |
TensorFlow | 1.15.5 | SE-ResNeXt101 | 391 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 96 | ImageNet2012 | A10 |
TensorFlow | 1.15.5 | U-Net Industrial | 100 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | DAGM2007 | A10 |
TensorFlow | 2.8.0 | U-Net Medical | 52 images/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 8 | EM segmentation challenge | A10 |
TensorFlow | 2.8.0 | Electra Base Fine Tuning | 122 sequences/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 16 | SQuAD v1.1 | A10 |
TensorFlow | 1.15.5 | Wide and Deep | 293,476 samples/sec | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | A10 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec | Server with a hyphen indicates a pre-production server
T4 Training Performance
Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.12.0a0 | ResNeXt101 | 186 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 112 | ImageNet2012 | NVIDIA T4 |
PyTorch | 1.12.0a0 | Tacotron2 | 6,486 total output mels/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | FP32 | 48 | LJSpeech 1.1 | NVIDIA T4 |
PyTorch | 1.12.0a0 | WaveGlow | 55,615 output samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | NVIDIA T4 |
PyTorch | 1.12.0a0 | FastPitch | 30,228 frames/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 16 | LJSpeech 1.1 | NVIDIA T4 |
PyTorch | 1.12.0a0 | GNMT V2 | 30,820 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 128 | wmt16-en-de | NVIDIA T4 |
PyTorch | 1.12.0a0 | NCF | 7,253,124 samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 1048576 | MovieLens 20M | NVIDIA T4 |
PyTorch | 1.12.0a0 | Transformer-XL Base | 9,024 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 32 | WikiText-103 | NVIDIA T4 |
PyTorch | 1.12.0a0 | SE-ResNeXt101 | 150 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 112 | ImageNet2012 | NVIDIA T4 |
PyTorch | 1.12.0a0 | Transformer-XL Large | 2,733 total tokens/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 4 | WikiText-103 | NVIDIA T4 |
PyTorch | 1.12.0a0 | nnU-Net | 205 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 64 | Medical Segmentation Decathlon | NVIDIA T4 |
PyTorch | 1.12.0a0 | EfficientDet-D0 | 68 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 30 | COCO 2017 | NVIDIA T4 |
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B0 | 480 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 128 | ImageNet2012 | NVIDIA T4 |
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B4 | 170 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.04-py3 | Mixed | 32 | ImageNet2012 | NVIDIA T4 |
PyTorch | 1.12.0a0 | SE3 Transformer | 601 molecules/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | NVIDIA T4 |
TensorFlow | 1.15.5 | U-Net Industrial | 45 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 16 | DAGM2007 | NVIDIA T4 |
TensorFlow | 2.8.0 | U-Net Medical | 21 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 8 | EM segmentation challenge | NVIDIA T4 |
TensorFlow | 1.15.5 | SE-ResNeXt101 | 164 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 96 | ImageNet2012 | NVIDIA T4 |
TensorFlow | 1.15.5 | ResNeXt101 | 199 images/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 128 | ImageNet2012 | NVIDIA T4 |
TensorFlow | 2.8.0 | Electra Base Fine Tuning | 56 sequences/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 16 | SQuAD v1.1 | NVIDIA T4 |
TensorFlow | 2.8.0 | Wide and Deep | 195,709 samples/sec | 1x T4 | Supermicro SYS-1029GQ-TRT | 22.05-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | NVIDIA T4 |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
V100 Training Performance
Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
PyTorch | 1.12.0a0 | ResNeXt101 | 569 images/sec | 1x V100 | DGX-2 | 22.05-py3 | Mixed | 112 | ImageNet2012 | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | Tacotron2 | 24,775 total output mels/sec | 1x V100 | DGX-2 | 22.05-py3 | Mixed | 104 | LJSpeech 1.1 | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | WaveGlow | 144,081 output samples/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 10 | LJSpeech 1.1 | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | FastPitch | 68,742 frames/sec | 1x V100 | DGX-2 | 22.03-py3 | Mixed | 64 | LJSpeech 1.1 | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | GNMT V2 | 78,408 total tokens/sec | 1x V100 | DGX-2 | 22.05-py3 | Mixed | 128 | wmt16-en-de | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | NCF | 23,153,985 samples/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 1048576 | MovieLens 20M | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | Transformer-XL Base | 17,951 total tokens/sec | 1x V100 | DGX-2 | 22.05-py3 | Mixed | 32 | WikiText-103 | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | Transformer-XL Large | 7,310 total tokens/sec | 1x V100 | DGX-2 | 22.05-py3 | Mixed | 8 | WikiText-103 | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | nnU-Net | 659 images/sec | 1x V100 | DGX-2 | 22.05-py3 | Mixed | 64 | Medical Segmentation Decathlon | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | EfficientNet-B0 | 1,280 images/sec | 1x V100 | DGX-2 | 22.05-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | EfficientDet-D0 | 150 images/sec | 1x V100 | DGX-2 | 22.05-py3 | Mixed | 60 | COCO 2017 | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B0 | 1,279 images/sec | 1x V100 | DGX-2 | 22.05-py3 | Mixed | 256 | ImageNet2012 | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | EfficientNet-WideSE-B4 | 501 images/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 64 | ImageNet2012 | V100-SXM3-32GB |
PyTorch | 1.12.0a0 | SE3 Transformer | 1,818 molecules/sec | 1x V100 | DGX-2 | 22.05-py3 | Mixed | 240 | Quantum Machines 9 | V100-SXM3-32GB |
TensorFlow | 1.15.5 | ResNeXt101 | 633 images/sec | 1x V100 | DGX-2 | 22.05-py3 | Mixed | 128 | ImageNet2012 | V100-SXM3-32GB |
TensorFlow | 1.15.5 | SE-ResNeXt101 | 556 images/sec | 1x V100 | DGX-2 | 22.05-py3 | Mixed | 96 | ImageNet2012 | V100-SXM3-32GB |
TensorFlow | 1.15.5 | U-Net Industrial | 119 images/sec | 1x V100 | DGX-2 | 22.05-py3 | Mixed | 16 | DAGM2007 | V100-SXM3-32GB |
TensorFlow | 2.8.0 | U-Net Medical | 67 images/sec | 1x V100 | DGX-2 | 22.05-py3 | Mixed | 8 | EM segmentation challenge | V100-SXM3-32GB |
TensorFlow | 2.8.0 | Wide and Deep | 1,022,754 samples/sec | 1x V100 | DGX-2 | 22.04-py3 | Mixed | 131072 | Kaggle Outbrain Click Prediction | V100-SXM3-32GB |
TensorFlow | 2.8.0 | Electra Base Fine Tuning | 188 sequences/sec | 1x V100 | DGX-2 | 22.05-py3 | Mixed | 32 | SQuAD v1.1 | V100-SXM3-32GB |
TensorFlow | 1.15.5 | Transformer-XL Base | 18,500 total tokens/sec | 1x V100 | DGX-2 | 22.05-py3 | Mixed | 16 | WikiText-103 | V100-SXM3-32GB |
FastPitch throughput metric frames/sec refers to mel-scale spectrogram frames/sec
Single GPU Training Performance of NVIDIA GPU on Cloud
Benchmarks are reproducible by following links to the NGC catalog scripts
A100 Training Performance on Cloud
Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
MXNet | - | ResNet-50 v1.5 | 2,916 images/sec | 1x A100 | GCP A2-HIGHGPU-1G | 22.04-py3 | Mixed | 192 | ImageNet2012 | A100-SXM4-40GB |
T4 Training Performance on Cloud
Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
MXNet | - | ResNet-50 v1.5 | 457 images/sec | 1x T4 | AWS EC2 g4dn.4xlarge | 22.04-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4 |
MXNet | - | ResNet-50 v1.5 | 419 images/sec | 1x T4 | GCP N1-HIGHMEM-8 | 22.04-py3 | Mixed | 192 | ImageNet2012 | NVIDIA T4 |
TensorFlow | - | ResNet-50 v1.5 | 417 images/sec | 1x T4 | AWS EC2 g4dn.4xlarge | 22.04-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4 |
TensorFlow | - | ResNet-50 v1.5 | 406 images/sec | 1x T4 | GCP N1-HIGHMEM-8 | 22.04-py3 | Mixed | 256 | ImageNet2012 | NVIDIA T4 |
V100 Training Performance on Cloud
Framework | Framework Version | Network | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|
MXNet | - | ResNet-50 v1.5 | 1,519 images/sec | 1x V100 | AWS EC2 p3.2xlarge | 22.04-py3 | Mixed | 192 | ImageNet2012 | V100-SXM2-16GB |
MXNet | - | ResNet-50 v1.5 | 1,434 images/sec | 1x V100 | GCP N1-HIGHMEM-8 | 22.04-py3 | Mixed | 192 | ImageNet2012 | V100-SXM2-16GB |
AI Inference
Real-world inferencing demands high throughput and low latencies with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance from data center to edge.
Related Resources
Learn how NVIDIA landed top performance spots on all MLPerf Inference 2.0 tests.
Read the inference whitepaper to explore the evolving landscape and get an overview of inference platforms.
Learn how dynamic batching can increase throughput in Benefits of Triton.
For additional data on Triton performance in offline and online server scenarios, refer to ResNet-50 v1.5.
Power high-throughput, low-latency inference with NVIDIA’s complete solution stack:
Achieve the most efficient inference performance with NVIDIA® TensorRT™ running on NVIDIA Tensor Core GPUs.
Maximize performance and simplify the deployment of AI models with the NVIDIA Triton™ Inference Server.
Pull software containers from NVIDIA NGC to race into production.
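As a client-side illustration, sending a request to a running Triton server takes only a few lines. The sketch below is an assumption-laden example: it presumes a server at localhost:8000 hosting a hypothetical model named "resnet50" with an FP32 input tensor "input" and an output tensor "output":

```python
import numpy as np
import tritonclient.http as httpclient

# Minimal Triton Inference Server client sketch (assumptions: server at
# localhost:8000, a hypothetical "resnet50" model with FP32 input
# "input" of shape [1, 3, 224, 224] and output tensor "output").
client = httpclient.InferenceServerClient(url="localhost:8000")
infer_input = httpclient.InferInput("input", [1, 3, 224, 224], "FP32")
infer_input.set_data_from_numpy(np.zeros((1, 3, 224, 224), dtype=np.float32))
response = client.infer("resnet50", inputs=[infer_input])
print(response.as_numpy("output").shape)  # hypothetical output tensor name
```

Server-side dynamic batching (see Benefits of Triton above) is typically enabled per model in its config.pbtxt, for example with a `dynamic_batching { max_queue_delay_microseconds: 100 }` stanza.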
MLPerf Inference v2.0 Performance Benchmarks
Offline Scenario - Closed Division
In the offline scenario, the entire dataset is available up front, so the benchmark measures maximum achievable throughput.
Network | Throughput | GPU | Server | GPU Version | Dataset | Target Accuracy |
---|---|---|---|---|---|---|
ResNet-50 v1.5 | 312,849 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet | 76.46% Top1 |
ResNet-50 v1.5 | 314,929 samples/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | ImageNet | 76.46% Top1 |
ResNet-50 v1.5 | 138,516 samples/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | ImageNet | 76.46% Top1 |
ResNet-50 v1.5 | 5,231 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | ImageNet | 76.46% Top1 |
ResNet-50 v1.5 | 293,451 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | ImageNet | 76.46% Top1 |
ResNet-50 v1.5 | 145,947 samples/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | ImageNet | 76.46% Top1 |
ResNet-50 v1.5 | 147,246 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30 | ImageNet | 76.46% Top1 |
ResNet-50 v1.5 | 5,089 samples/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | ImageNet | 76.46% Top1 |
SSD ResNet-34 | 7,923 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | COCO | 0.2 mAP |
SSD ResNet-34 | 7,880 samples/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | COCO | 0.2 mAP |
SSD ResNet-34 | 3,397 samples/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | COCO | 0.2 mAP |
SSD ResNet-34 | 135 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | COCO | 0.2 mAP |
SSD ResNet-34 | 7,297 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | COCO | 0.2 mAP |
SSD ResNet-34 | 3,623 samples/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | COCO | 0.2 mAP |
SSD ResNet-34 | 3,827 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30 | COCO | 0.2 mAP |
SSD ResNet-34 | 129 samples/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | COCO | 0.2 mAP |
3D-UNet | 25 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | KiTS 2019 | 0.863 DICE mean |
3D-UNet | 24 samples/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | KiTS 2019 | 0.863 DICE mean |
3D-UNet | 11 samples/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | KiTS 2019 | 0.863 DICE mean |
3D-UNet | 24 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | KiTS 2019 | 0.863 DICE mean |
3D-UNet | 12 samples/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | KiTS 2019 | 0.863 DICE mean |
3D-UNet | 13 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30 | KiTS 2019 | 0.863 DICE mean |
RNN-T | 106,753 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech | 7.45% WER |
RNN-T | 107,399 samples/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | LibriSpeech | 7.45% WER |
RNN-T | 49,789 samples/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | LibriSpeech | 7.45% WER |
RNN-T | 1,612 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | LibriSpeech | 7.45% WER |
RNN-T | 101,788 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | LibriSpeech | 7.45% WER |
RNN-T | 52,752 samples/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | LibriSpeech | 7.45% WER |
RNN-T | 52,453 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30 | LibriSpeech | 7.45% WER |
RNN-T | 1,432 samples/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | LibriSpeech | 7.45% WER |
BERT | 27,971 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 | 90.07% f1 |
BERT | 27,894 samples/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | SQuAD v1.1 | 90.07% f1 |
BERT | 11,387 samples/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | SQuAD v1.1 | 90.07% f1 |
BERT | 484 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 | 90.07% f1 |
BERT | 25,035 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | SQuAD v1.1 | 90.07% f1 |
BERT | 12,595 samples/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | SQuAD v1.1 | 90.07% f1 |
BERT | 13,340 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30 | SQuAD v1.1 | 90.07% f1 |
BERT | 502 samples/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | SQuAD v1.1 | 90.07% f1 |
DLRM | 2,499,040 samples/sec | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
DLRM | 2,477,270 samples/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
DLRM | 1,065,600 samples/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
DLRM | 40,424 samples/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs | 80.25% AUC |
DLRM | 2,313,280 samples/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | Criteo 1TB Click Logs | 80.25% AUC |
DLRM | 1,125,130 samples/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | Criteo 1TB Click Logs | 80.25% AUC |
DLRM | 1,105,550 samples/sec | 8x A30 | Gigabyte G482-Z54 | A30 | Criteo 1TB Click Logs | 80.25% AUC |
DLRM | 35,831 samples/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | Criteo 1TB Click Logs | 80.25% AUC |
Server Scenario - Closed Division
In the server scenario, queries arrive randomly (Poisson distributed) and results must be returned within the per-network latency constraints listed below.
Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraints (ms) | Dataset |
---|---|---|---|---|---|---|---|
ResNet-50 v1.5 | 260,031 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 76.46% Top1 | 15 | ImageNet |
ResNet-50 v1.5 | 270,027 queries/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | 76.46% Top1 | 15 | ImageNet |
ResNet-50 v1.5 | 107,000 queries/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | 76.46% Top1 | 15 | ImageNet |
ResNet-50 v1.5 | 3,527 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 76.46% Top1 | 15 | ImageNet |
ResNet-50 v1.5 | 200,007 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 76.46% Top1 | 15 | ImageNet |
ResNet-50 v1.5 | 104,000 queries/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | 76.46% Top1 | 15 | ImageNet |
ResNet-50 v1.5 | 116,002 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30 | 76.46% Top1 | 15 | ImageNet |
ResNet-50 v1.5 | 3,398 queries/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | 76.46% Top1 | 15 | ImageNet |
SSD ResNet-34 | 7,575 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 0.2 mAP | 100 | COCO |
SSD ResNet-34 | 7,505 queries/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | 0.2 mAP | 100 | COCO |
SSD ResNet-34 | 3,247 queries/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | 0.2 mAP | 100 | COCO |
SSD ResNet-34 | 98 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 0.2 mAP | 100 | COCO |
SSD ResNet-34 | 6,466 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 0.2 mAP | 100 | COCO |
SSD ResNet-34 | 3,078 queries/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | 0.2 mAP | 100 | COCO |
SSD ResNet-34 | 3,570 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30 | 0.2 mAP | 100 | COCO |
SSD ResNet-34 | 95 queries/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | 0.2 mAP | 100 | COCO |
RNN-T | 104,000 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 7.45% WER | 1,000 | LibriSpeech |
RNN-T | 104,000 queries/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | 7.45% WER | 1,000 | LibriSpeech |
RNN-T | 44,989 queries/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | 7.45% WER | 1,000 | LibriSpeech |
RNN-T | 1,350 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 7.45% WER | 1,000 | LibriSpeech |
RNN-T | 89,994 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 7.45% WER | 1,000 | LibriSpeech |
RNN-T | 42,989 queries/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | 7.45% WER | 1,000 | LibriSpeech |
RNN-T | 36,989 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30 | 7.45% WER | 1,000 | LibriSpeech |
RNN-T | 1,100 queries/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | 7.45% WER | 1,000 | LibriSpeech |
BERT | 25,792 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 90.07% f1 | 130 | SQuAD v1.1 |
BERT | 25,391 queries/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | 90.07% f1 | 130 | SQuAD v1.1 |
BERT | 10,794 queries/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | 90.07% f1 | 130 | SQuAD v1.1 |
BERT | 380 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 90.07% f1 | 130 | SQuAD v1.1 |
BERT | 22,989 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 90.07% f1 | 130 | SQuAD v1.1 |
BERT | 10,394 queries/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | 90.07% f1 | 130 | SQuAD v1.1 |
BERT | 11,491 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30 | 90.07% f1 | 130 | SQuAD v1.1 |
BERT | 380 queries/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | 90.07% f1 | 130 | SQuAD v1.1 |
DLRM | 2,302,640 queries/sec | 8x A100 | DGX A100 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs |
DLRM | 1,951,890 queries/sec | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs |
DLRM | 950,448 queries/sec | 4x A100 | DGX Station A100 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs |
DLRM | 35,989 queries/sec | 1x1g.10gb A100 | DGX A100 | A100 SXM-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs |
DLRM | 1,300,850 queries/sec | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs |
DLRM | 600,183 queries/sec | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | 80.25% AUC | 30 | Criteo 1TB Click Logs |
DLRM | 960,456 queries/sec | 8x A30 | Gigabyte G482-Z54 | A30 | 80.25% AUC | 30 | Criteo 1TB Click Logs |
DLRM | 30,987 queries/sec | 1x1g.6gb A30 | Gigabyte G482-Z54 | A30 | 80.25% AUC | 30 | Criteo 1TB Click Logs |
Power Efficiency Offline Scenario - Closed Division
Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset |
---|---|---|---|---|---|---|
ResNet-50 v1.5 | 250,242 samples/sec | 86.74 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet |
ResNet-50 v1.5 | 268,462 samples/sec | 95.36 samples/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | ImageNet |
ResNet-50 v1.5 | 128,665 samples/sec | 113.68 samples/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | ImageNet |
ResNet-50 v1.5 | 211,065 samples/sec | 113.44 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | ImageNet |
ResNet-50 v1.5 | 104,893 samples/sec | 105.20 samples/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | ImageNet |
SSD ResNet-34 | 6,576 samples/sec | 2.11 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | COCO |
SSD ResNet-34 | 6,521 samples/sec | 2.31 samples/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | COCO |
SSD ResNet-34 | 3,307 samples/sec | 2.67 samples/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | COCO |
SSD ResNet-34 | 5,778 samples/sec | 2.75 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | COCO |
SSD ResNet-34 | 2,894 samples/sec | 2.57 samples/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | COCO |
3D-UNet | 21 samples/sec | 0.007 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | KiTS 2019 |
3D-UNet | 20 samples/sec | 0.008 samples/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | KiTS 2019 |
3D-UNet | 11 samples/sec | 0.009 samples/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | KiTS 2019 |
3D-UNet | 19 samples/sec | 0.010 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | KiTS 2019 |
3D-UNet | 10 samples/sec | 0.010 samples/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | KiTS 2019 |
RNN-T | 90,730 samples/sec | 27.94 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech |
RNN-T | 90,946 samples/sec | 31.89 samples/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | LibriSpeech |
RNN-T | 44,966 samples/sec | 37.87 samples/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | LibriSpeech |
RNN-T | 85,952 samples/sec | 39.16 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | LibriSpeech |
RNN-T | 42,945 samples/sec | 37.86 samples/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | LibriSpeech |
BERT | 24,794 samples/sec | 6.99 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 |
BERT | 20,706 samples/sec | 7.38 samples/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | SQuAD v1.1 |
BERT | 10,828 samples/sec | 8.64 samples/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | SQuAD v1.1 |
BERT | 19,993 samples/sec | 8.47 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | SQuAD v1.1 |
BERT | 10,047 samples/sec | 8.06 samples/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | SQuAD v1.1 |
DLRM | 2,140,540 samples/sec | 646.23 samples/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs |
DLRM | 1,940,830 samples/sec | 701.53 samples/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | Criteo 1TB Click Logs |
DLRM | 1,001,010 samples/sec | 797.59 samples/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | Criteo 1TB Click Logs |
DLRM | 1,845,900 samples/sec | 795.67 samples/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | Criteo 1TB Click Logs |
DLRM | 953,749 samples/sec | 768.81 samples/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | Criteo 1TB Click Logs |
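Dividing throughput by throughput per Watt recovers the average system power draw during the run; for example, the 8x A100 DGX A100 ResNet-50 v1.5 entry implies roughly 250,242 / 86.74 ≈ 2,885 W.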
Power Efficiency Server Scenario - Closed Division
Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset |
---|---|---|---|---|---|---|
ResNet-50 v1.5 | 229,016 queries/sec | 78.69 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | ImageNet |
ResNet-50 v1.5 | 230,018 queries/sec | 81.52 queries/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | ImageNet |
ResNet-50 v1.5 | 107,000 queries/sec | 94.59 queries/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | ImageNet |
ResNet-50 v1.5 | 185,005 queries/sec | 88.70 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | ImageNet |
ResNet-50 v1.5 | 92,496 queries/sec | 93.88 queries/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | ImageNet |
SSD ResNet-34 | 6,298 queries/sec | 2.01 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | COCO |
SSD ResNet-34 | 6,298 queries/sec | 2.23 queries/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | COCO |
SSD ResNet-34 | 3,078 queries/sec | 2.50 queries/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | COCO |
SSD ResNet-34 | 5,697 queries/sec | 2.72 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | COCO |
SSD ResNet-34 | 2,748 queries/sec | 2.48 queries/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | COCO |
RNN-T | 87,992 queries/sec | 25.47 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | LibriSpeech |
RNN-T | 74,990 queries/sec | 26.37 queries/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | LibriSpeech |
RNN-T | 43,388 queries/sec | 33.53 queries/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | LibriSpeech |
RNN-T | 74,990 queries/sec | 34.09 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | LibriSpeech |
RNN-T | 37,489 queries/sec | 32.89 queries/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | LibriSpeech |
BERT | 21,492 queries/sec | 6.36 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | SQuAD v1.1 |
BERT | 20,992 queries/sec | 6.47 queries/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | SQuAD v1.1 |
BERT | 10,195 queries/sec | 8.01 queries/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | SQuAD v1.1 |
BERT | 17,292 queries/sec | 8.09 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | SQuAD v1.1 |
BERT | 9,995 queries/sec | 7.99 queries/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | SQuAD v1.1 |
DLRM | 2,001,990 queries/sec | 593.66 queries/sec/watt | 8x A100 | DGX A100 | A100 SXM-80GB | Criteo 1TB Click Logs |
DLRM | 1,831,680 queries/sec | 651.00 queries/sec/watt | 8x A100 | Gigabyte G492-PD0 | A100 SXM-80GB | Criteo 1TB Click Logs |
DLRM | 870,363 queries/sec | 649.49 queries/sec/watt | 4x A100 | DGX Station A100 | A100 SXM-80GB | Criteo 1TB Click Logs |
DLRM | 750,272 queries/sec | 358.95 queries/sec/watt | 8x A100 | Gigabyte G482-Z54 | A100 PCIe-80GB | Criteo 1TB Click Logs |
DLRM | 500,121 queries/sec | 408.95 queries/sec/watt | 4x A100 | Gigabyte G242-P31 | A100 PCIe-80GB | Criteo 1TB Click Logs |
MLPerf™ v2.0 Inference Closed: ResNet-50 v1.5, SSD ResNet-34, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net 99% of FP32 accuracy target, DLRM 99% of FP32 accuracy target: 2.0-073, 2.0-075, 2.0-077, 2.0-078, 2.0-080, 2.0-081, 2.0-083, 2.0-084, 2.0-090, 2.0-094, 2.0-095, 2.0-097, 2.0-098. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
BERT-Large sequence length = 384.
DLRM samples refer to an average of 270 pairs per sample
1x1g.6gb and 1x1g.10gb are notations used to refer to the MIG configuration. In this example, the workload runs on a single MIG slice with 6GB of memory on an A30 or 10GB of memory on an A100.
For MLPerf™ data across the various scenarios, click here
For MLPerf™ latency constraints, click here
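The MIG slice a benchmark runs on can be verified programmatically. Below is a minimal sketch using NVML's Python bindings (an assumption for illustration: the pynvml package is installed and MIG mode is enabled on GPU 0); the profile names it prints, such as 1g.5gb or 1g.10gb, match the notation above.

```python
# Minimal sketch: enumerate MIG instances with NVML's Python bindings.
# Assumes the pynvml package is installed and MIG mode is enabled on GPU 0.
import pynvml

pynvml.nvmlInit()
parent = pynvml.nvmlDeviceGetHandleByIndex(0)

# Upper bound on MIG devices a single GPU can expose (7 on A100).
for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(parent)):
    try:
        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(parent, i)
    except pynvml.NVMLError:
        continue  # this MIG slot is not populated
    name = pynvml.nvmlDeviceGetName(mig)  # e.g. "... MIG 1g.10gb"
    if isinstance(name, bytes):           # older pynvml returns bytes
        name = name.decode()
    mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
    print(f"MIG {i}: {name}, {mem.total / 2**30:.0f} GiB")

pynvml.nvmlShutdown()
```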
NVIDIA Triton Inference Server Delivered Comparable Performance to a Custom Harness in MLPerf v2.0
NVIDIA landed top performance spots on all MLPerf™ Inference 2.0 tests, the AI industry's leading benchmark competition. For inference submissions, we have typically used a custom A100 inference serving harness, designed and optimized specifically to deliver the highest possible inference performance for MLPerf™ workloads, which require running inference on bare metal. In MLPerf™ v2.0, Triton Inference Server delivered performance comparable to that highly tuned custom harness.
MLPerf™ v2.0 A100 Inference Closed: ResNet-50 v1.5, SSD ResNet-34, BERT 99% of FP32 accuracy target, DLRM 99% of FP32 accuracy target: 2.0-094, 2.0-096. MLPerf name and logo are trademarks. See www.mlcommons.org for more information.
NVIDIA Client Batch Size 1 and 2 Performance with Triton Inference Server
A100 Triton Inference Server Performance
Network | Accelerator | Training Framework | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BERT Large Inference | A100-SXM4-80GB | TensorFlow | TensorRT | Mixed | 4 | 1 | 1 | 24 | 39.31 | 611 inf/sec | 384 | 22.05-py3 |
BERT Large Inference | A100-SXM4-80GB | TensorFlow | TensorRT | Mixed | 2 | 2 | 1 | 16 | 42.69 | 750 inf/sec | 384 | 22.05-py3 |
BERT Large Inference | A100-PCIE-40GB | TensorFlow | TensorRT | Mixed | 4 | 1 | 1 | 24 | 44.77 | 536 inf/sec | 384 | 22.05-py3 |
BERT Large Inference | A100-PCIE-40GB | TensorFlow | TensorRT | Mixed | 1 | 2 | 1 | 24 | 82.05 | 585 inf/sec | 384 | 22.05-py3 |
BERT Base Inference | A100-SXM4-80GB | TensorFlow | TensorRT | Mixed | 4 | 1 | 1 | 24 | 7.55 | 3,178 inf/sec | 128 | 22.05-py3 |
BERT Base Inference | A100-SXM4-40GB | TensorFlow | TensorRT | Mixed | 1 | 2 | 1 | 20 | 7.94 | 5,039 inf/sec | 128 | 22.05-py3 |
BERT Base Inference | A100-PCIE-40GB | TensorFlow | TensorRT | Mixed | 4 | 1 | 1 | 24 | 7 | 3,427 inf/sec | 128 | 22.05-py3 |
BERT Base Inference | A100-PCIE-40GB | TensorFlow | TensorRT | Mixed | 1 | 2 | 1 | 20 | 8.34 | 4,793 inf/sec | 128 | 22.05-py3 |
DLRM Inference | A100-SXM4-40GB | PyTorch | TensorRT | Mixed | 2 | 1 | 65,536 | 26 | 2.07 | 12,560 inf/sec | - | 22.05-py3 |
DLRM Inference | A100-SXM4-40GB | PyTorch | TensorRT | Mixed | 2 | 2 | 65,536 | 28 | 2.25 | 24,919 inf/sec | - | 22.05-py3 |
DLRM Inference | A100-PCIE-40GB | PyTorch | TensorRT | Mixed | 4 | 1 | 65,536 | 30 | 2.37 | 12,672 inf/sec | - | 22.05-py3 |
DLRM Inference | A100-PCIE-40GB | PyTorch | TensorRT | Mixed | 2 | 2 | 65,536 | 28 | 2.14 | 26,206 inf/sec | - | 22.05-py3 |
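The client batch size and concurrent-request counts in the table above are driven from the client side. A minimal sketch of such a client with Triton's Python HTTP API is shown below (the model name bert_large and the tensor names input_ids/logits are illustrative assumptions, not the exact harness used for these measurements):

```python
# Minimal sketch of a Triton HTTP client. The model and tensor names are
# illustrative placeholders; adapt them to the deployed model repository.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

batch, seq_len = 1, 384  # client batch size 1, BERT-Large sequence length
input_ids = np.zeros((batch, seq_len), dtype=np.int32)  # dummy token IDs

inputs = [httpclient.InferInput("input_ids", list(input_ids.shape), "INT32")]
inputs[0].set_data_from_numpy(input_ids)
outputs = [httpclient.InferRequestedOutput("logits")]

result = client.infer(model_name="bert_large", inputs=inputs, outputs=outputs)
print(result.as_numpy("logits").shape)
```

In practice, sweeps over concurrency and client batch size like the ones tabulated here are typically automated with the perf_analyzer tool that ships with the Triton client SDK.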
A30 Triton Inference Server Performance
Network | Accelerator | Training Framework | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BERT Large Inference | A30 | TensorFlow | TensorRT | Mixed | 4 | 1 | 1 | 24 | 74.63 | 321 inf/sec | 384 | 22.05-py3 |
BERT Large Inference | A30 | TensorFlow | TensorRT | Mixed | 2 | 2 | 1 | 16 | 91.98 | 348 inf/sec | 384 | 22.05-py3 |
BERT Base Inference | A30 | TensorFlow | TensorRT | Mixed | 4 | 1 | 1 | 24 | 10.07 | 2,382 inf/sec | 128 | 22.05-py3 |
BERT Base Inference | A30 | TensorFlow | TensorRT | Mixed | 1 | 2 | 1 | 24 | 15.72 | 3,053 inf/sec | 128 | 22.05-py3 |
A10 Triton Inference Server Performance
Network | Accelerator | Training Framework | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BERT Large Inference | A10 | TensorFlow | TensorRT | Mixed | 4 | 1 | 1 | 24 | 111.82 | 214 inf/sec | 384 | 22.05-py3 |
BERT Large Inference | A10 | TensorFlow | TensorRT | Mixed | 2 | 2 | 1 | 16 | 141.66 | 226 inf/sec | 384 | 22.05-py3 |
BERT Base Inference | A10 | TensorFlow | TensorRT | Mixed | 4 | 1 | 1 | 24 | 13.16 | 1,824 inf/sec | 128 | 22.05-py3 |
BERT Base Inference | A10 | TensorFlow | TensorRT | Mixed | 2 | 2 | 1 | 16 | 14.01 | 2,285 inf/sec | 128 | 22.05-py3 |
T4 Triton Inference Server Performance
Network | Accelerator | Training Framework | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BERT Large Inference | NVIDIA T4 | TensorFlow | TensorRT | Mixed | 4 | 1 | 1 | 24 | 85 | 283 inf/sec | 384 | 22.03-py3 |
BERT Large Inference | NVIDIA T4 | TensorFlow | TensorRT | Mixed | 2 | 2 | 1 | 24 | 84.47 | 568 inf/sec | 384 | 22.03-py3 |
BERT Base Inference | NVIDIA T4 | TensorFlow | TensorRT | Mixed | 4 | 1 | 1 | 24 | 27.95 | 859 inf/sec | 128 | 22.05-py3 |
BERT Base Inference | NVIDIA T4 | TensorFlow | TensorRT | Mixed | 1 | 2 | 1 | 24 | 50.44 | 952 inf/sec | 128 | 22.05-py3 |
V100 Triton Inference Server Performance
Network | Accelerator | Training Framework | Framework Backend | Precision | Model Instances on Triton | Client Batch Size | Dynamic Batch Size (Triton) | Number of Concurrent Client Requests | Latency (ms) | Throughput | Sequence/Input Length | Triton Container Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BERT Large Inference | V100 SXM2-32GB | TensorFlow | TensorRT | Mixed | 4 | 1 | 1 | 24 | 105.96 | 227 inf/sec | 384 | 22.03-py3 |
BERT Large Inference | V100 SXM2-32GB | TensorFlow | TensorRT | Mixed | 2 | 2 | 1 | 16 | 125.94 | 254 inf/sec | 384 | 22.03-py3 |
BERT Base Inference | V100 SXM2-32GB | TensorFlow | TensorRT | Mixed | 4 | 1 | 1 | 24 | 17.60 | 1,363 inf/sec | 128 | 22.03-py3 |
BERT Base Inference | V100 SXM2-32GB | TensorFlow | TensorRT | Mixed | 2 | 2 | 1 | 16 | 14.83 | 2,158 inf/sec | 128 | 22.03-py3 |
DLRM Inference | V100-SXM2-32GB | PyTorch | TensorRT | Mixed | 2 | 1 | 65,536 | 30 | 4.15 | 7,228 inf/sec | - | 22.03-py3 |
DLRM Inference | V100-SXM2-32GB | PyTorch | TensorRT | Mixed | 2 | 2 | 65,536 | 30 | 4.11 | 14,599 inf/sec | - | 22.03-py3 |
Inference Performance of NVIDIA A100, A40, A30, A10, A2, T4 and V100
Benchmarks are reproducible by following links to the NGC catalog scripts
Inference Natural Language Processing
BERT Inference Throughput
DGX A100 server w/ 1x NVIDIA A100 with 7 MIG instances of 1g.5gb | Batch Size = 94 | Precision: INT8 | Sequence Length = 128
DGX-1 server w/ 1x NVIDIA V100 | TensorRT 7.1 | Batch Size = 256 | Precision: Mixed | Sequence Length = 128
NVIDIA A100 BERT Inference Benchmarks
Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|---|
BERT-Large with Sparsity | Attention | 94 | 6,188 sequences/sec | - | - | 1x A100 | DGX A100 | - | INT8 | SQuAD v1.1 | - | A100 SXM4-40GB
A100 with 7 MIG instances of 1g.5gb | Sequence length=128 | Efficiency based on board power
A container version with a hyphen indicates a pre-release container
Starting from 21.09-py3, ECC is enabled
Inference Image Classification on CNNs with TensorRT
ResNet-50 v1.5 Throughput
DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-4029GP-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: Mixed | Dataset: Synthetic
ResNet-50 v1.5 Power Efficiency
DGX A100: EPYC 7742@2.25GHz w/ 1x NVIDIA A100-SXM-80GB | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A30 | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A40 | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
GIGABYTE G482-Z52-00: EPYC 7742@2.25GHz w/ 1x NVIDIA A10 | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
Supermicro SYS-4029GP-TRT: Xeon Gold 6240 @2.6 GHz w/ 1x NVIDIA T4 | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: INT8 | Dataset: Synthetic
DGX-2: Platinum 8168 @2.7GHz w/ 1x NVIDIA V100-SXM3-32GB | TensorRT 8.2.3 | Batch Size = 128 | 22.05-py3 | Precision: Mixed | Dataset: Synthetic
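The TensorRT engines behind these numbers come from the NGC containers listed above, but the general shape of building an INT8 engine is worth sketching. The snippet below is a hedged outline using the TensorRT 8.x Python API; resnet50.onnx is a placeholder, the ONNX model is assumed to have a fixed batch dimension, and a real INT8 build additionally needs a calibrator or explicit per-tensor dynamic ranges.

```python
# Hedged sketch: build an INT8 TensorRT engine from an ONNX model.
# "resnet50.onnx" is a placeholder; a real INT8 build also needs an
# IInt8Calibrator (or explicit dynamic ranges), omitted here.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("resnet50.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = my_calibrator  # supply calibration data here

engine_bytes = builder.build_serialized_network(network, config)
with open("resnet50_int8.plan", "wb") as f:
    f.write(engine_bytes)
```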
A100 Full Chip Inference Performance
Network | Batch Size | Full Chip Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 11,677 images/sec | 58 images/sec/watt | 0.69 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-80GB |
128 | 30,532 images/sec | 80 images/sec/watt | 4.19 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-80GB | |
ResNet-50v1.5 | 8 | 11,434 images/sec | 57 images/sec/watt | 0.7 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-80GB |
128 | 29,533 images/sec | 78 images/sec/watt | 4.33 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-80GB | |
BERT-BASE | 1 | For Batch Size 1, please refer to the Triton Inference Server page | |||||||||
2 | For Batch Size 2, please refer to the Triton Inference Server page | ||||||||||
8 | 7,435 sequences/sec | 30 sequences/sec/watt | 1.08 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-80GB | |
128 | 14,979 sequences/sec | 38 sequences/sec/watt | 8.55 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-80GB | |
BERT-LARGE | 1 | For Batch Size 1, please refer to the Triton Inference Server page | |||||||||
2 | For Batch Size 2, please refer to the Triton Inference Server page | ||||||||||
8 | 2,675 sequences/sec | 3 sequences/sec/watt | 7.59 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-80GB | |
128 | 4,806 sequences/sec | 3 sequences/sec/watt | 92.04 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-80GB | |
EfficientNet-B0 | 8 | 8,910 images/sec | 58 images/sec/watt | 0.9 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-80GB |
128 | 28,897 images/sec | 89 images/sec/watt | 4.43 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-80GB | |
EfficientNet-B4 | 8 | 2,498 images/sec | 11 images/sec/watt | 3.2 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-80GB
128 | 4,413 images/sec | 12 images/sec/watt | 29.01 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-80GB
A container version with a hyphen indicates a pre-release container | A server name with a hyphen indicates a pre-production server
BERT-Large: Sequence Length = 128
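The throughput and latency columns above are mutually consistent under steady-state execution: throughput is approximately batch size divided by per-batch latency. A quick check against the batch-128 ResNet-50 row:

```python
# Sanity check: throughput ~= batch size / latency.
# ResNet-50, batch 128: table reports 30,532 images/sec at 4.19 ms.
batch_size = 128
latency_s = 4.19e-3
print(batch_size / latency_s)  # ~30,549 images/sec, matching the table
```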
A100 1/7 MIG Inference Performance
Network | Batch Size | 1/7 MIG Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 3,692 images/sec | 33 images/sec/watt | 2.17 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB |
128 | 4,596 images/sec | 38 images/sec/watt | 27.85 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB | |
ResNet-50v1.5 | 8 | 3,583 images/sec | 32 images/sec/watt | 2.23 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB |
128 | 4,443 images/sec | 37 images/sec/watt | 28.81 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB | |
BERT-BASE | 8 | 1,806 sequences/sec | 14 sequences/sec/watt | 4.43 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB |
128 | 2,174 sequences/sec | 17 sequences/sec/watt | 58.89 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB | |
BERT-LARGE | 8 | 584 sequences/sec | 5 sequences/sec/watt | 13.69 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB |
128 | 673 sequences/sec | 5 sequences/sec/watt | 190.16 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB |
A container version with a hyphen indicates a pre-release container | A server name with a hyphen indicates a pre-production server
BERT-Large: Sequence Length = 128
A100 7 MIG Inference Performance
Network | Batch Size | 7 MIG Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 25,485 images/sec | 80 images/sec/watt | 2.22 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB |
128 | 31,964 images/sec | 84 images/sec/watt | 28.05 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB | |
ResNet-50v1.5 | 8 | 24,788 images/sec | 78 images/sec/watt | 2.26 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB |
128 | 30,899 images/sec | 81 images/sec/watt | 29.04 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB | |
BERT-BASE | 8 | 12,522 sequences/sec | 33 sequences/sec/watt | 4.49 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB |
128 | 14,478 sequences/sec | 37 sequences/sec/watt | 61.94 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB | |
BERT-LARGE | 8 | 3,980 sequences/sec | 10 sequences/sec/watt | 14.1 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB |
128 | 4,434 sequences/sec | 11 sequences/sec/watt | 202.32 | 1x A100 | DGX A100 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100 SXM4-80GB |
A container version with a hyphen indicates a pre-release container | A server name with a hyphen indicates a pre-production server
BERT-Large: Sequence Length = 128
A40 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 10,044 images/sec | 41 images/sec/watt | 0.8 | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 |
128 | 16,070 images/sec | 54 images/sec/watt | 7.96 | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 | |
ResNet-50v1.5 | 8 | 9,718 images/sec | 38 images/sec/watt | 0.82 | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 |
128 | 15,358 images/sec | 51 images/sec/watt | 8.33 | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 | |
BERT-BASE | 8 | 5,322 sequences/sec | 18 sequences/sec/watt | 1.5 | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 |
128 | 7,528 sequences/sec | 25 sequences/sec/watt | 17 | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 | |
BERT-LARGE | 8 | 1,722 sequences/sec | 2 sequences/sec/watt | 4.62 | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 |
128 | 2,242 sequences/sec | 2 sequences/sec/watt | 57.1 | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 | |
EfficientNet-B0 | 8 | 8,047 images/sec | 42 images/sec/watt | 0.99 | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 |
128 | 16,445 images/sec | 55 images/sec/watt | 7.78 | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 | |
EfficientNet-B4 | 8 | 1,878 images/sec | 7 images/sec/watt | 4.26 | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 |
128 | 2,533 images/sec | 8 images/sec/watt | 50.54 | 1x A40 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A40 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A container version with a hyphen indicates a pre-release container
A30 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 7,003 images/sec | 46 images/sec/watt | 1.14 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
128 | 15,984 images/sec | 96 images/sec/watt | 8.01 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
ResNet-50v1.5 | 8 | 7,406 images/sec | 50 images/sec/watt | 1.08 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
128 | 15,413 images/sec | 94 images/sec/watt | 8.3 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
BERT-BASE | 1 | For Batch Size 1, please refer to the Triton Inference Server page | |||||||||
2 | For Batch Size 2, please refer to the Triton Inference Server page | ||||||||||
8 | 4,981 sequences/sec | 32 sequences/sec/watt | 1.61 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
128 | 7,108 sequences/sec | 43 sequences/sec/watt | 18.01 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
BERT-LARGE | 1 | For Batch Size 1, please refer to the Triton Inference Server page | |||||||||
2 | For Batch Size 2, please refer to the Triton Inference Server page | ||||||||||
8 | 1,696 sequences/sec | 4 sequences/sec/watt | 4.72 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
128 | 2,257 sequences/sec | 4 sequences/sec/watt | 56.7 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
EfficientNet-B0 | 8 | 7,126 images/sec | 71 images/sec/watt | 1.12 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
128 | 16,143 images/sec | 99 images/sec/watt | 7.93 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
EfficientNet-B4 | 8 | 1,620 images/sec | 12 images/sec/watt | 4.94 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
128 | 2,237 images/sec | 14 images/sec/watt | 57.23 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A container version with a hyphen indicates a pre-release container
A30 1/4 MIG Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 3,579 images/sec | 43 images/sec/watt | 2.24 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
128 | 4,526 images/sec | 50 images/sec/watt | 28.28 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
ResNet-50v1.5 | 8 | 3,467 images/sec | 42 images/sec/watt | 2.31 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
128 | 4,389 images/sec | 48 images/sec/watt | 29.17 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
BERT-BASE | 8 | 1,802 sequences/sec | 20 sequences/sec/watt | 4.44 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
128 | 2,151 sequences/sec | 21 sequences/sec/watt | 59.51 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
BERT-LARGE | 8 | 561 sequences/sec | 6 sequences/sec/watt | 14.27 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
128 | 675 sequences/sec | 7 sequences/sec/watt | 189.71 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A container version with a hyphen indicates a pre-release container
A30 4 MIG Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 13,856 images/sec | 84 images/sec/watt | 2.33 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
128 | 16,996 images/sec | 104 images/sec/watt | 30.24 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
ResNet-50v1.5 | 8 | 13,543 images/sec | 82 images/sec/watt | 2.36 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
128 | 16,333 images/sec | 100 images/sec/watt | 31.49 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
BERT-BASE | 8 | 6,524 sequences/sec | 40 sequences/sec/watt | 5.01 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
128 | 7,397 sequences/sec | 45 sequences/sec/watt | 70.34 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 | |
BERT-LARGE | 8 | 1,998 sequences/sec | 12 sequences/sec/watt | 16.25 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
128 | 2,312 sequences/sec | 14 sequences/sec/watt | 223.41 | 1x A30 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A30 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A container version with a hyphen indicates a pre-release container
A10 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 7,938 images/sec | 53 images/sec/watt | 1.01 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 |
128 | 11,532 images/sec | 77 images/sec/watt | 11.1 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 | |
ResNet-50v1.5 | 8 | 7,685 images/sec | 51 images/sec/watt | 1.04 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 |
128 | 10,630 images/sec | 71 images/sec/watt | 12.04 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 | |
BERT-BASE | 1 | For Batch Size 1, please refer to the Triton Inference Server page | |||||||||
2 | For Batch Size 2, please refer to the Triton Inference Server page | ||||||||||
8 | 3,992 sequences/sec | 27 sequences/sec/watt | 2 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 | |
128 | 4,882 sequences/sec | 33 sequences/sec/watt | 26.22 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 | |
BERT-LARGE | 1 | For Batch Size 1, please refer to the Triton Inference Server page | |||||||||
2 | For Batch Size 2, please refer to the Triton Inference Server page | ||||||||||
8 | 1,264 sequences/sec | 3 sequences/sec/watt | 6.33 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 | |
128 | 1,472 sequences/sec | 3 sequences/sec/watt | 86.94 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 | |
EfficientNet-B0 | 8 | 7,057 images/sec | 47 images/sec/watt | 1.13 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 |
128 | 11,793 images/sec | 79 images/sec/watt | 10.85 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 | |
EfficientNet-B4 | 8 | 1,454 images/sec | 10 images/sec/watt | 5.5 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 |
128 | 1,761 images/sec | 12 images/sec/watt | 72.68 | 1x A10 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A10 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A container version with a hyphen indicates a pre-release container
A2 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 2,596 images/sec | 43 images/sec/watt | 3.08 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
128 | 3,008 images/sec | 50 images/sec/watt | 42.55 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 | |
ResNet-50v1.5 | 8 | 2,512 images/sec | 42 images/sec/watt | 3.18 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
128 | 2,888 images/sec | 48 images/sec/watt | 44.32 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 | |
BERT-BASE | 8 | 1,055 sequences/sec | 18 sequences/sec/watt | 7.59 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
128 | 1,105 sequences/sec | 18 sequences/sec/watt | 115.8 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 | |
BERT-LARGE | 8 | 313 sequences/sec | 2 sequences/sec/watt | 25.55 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
128 | 334 sequences/sec | 2 sequences/sec/watt | 382.85 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 | |
EfficientNet-B0 | 8 | 2,501 images/sec | 51 images/sec/watt | 3.2 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
128 | 3,175 images/sec | 54 images/sec/watt | 40.32 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 | |
EfficientNet-B4 | 8 | 438 images/sec | 7 images/sec/watt | 18.29 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
128 | 481 images/sec | 8 images/sec/watt | 266.12 | 1x A2 | GIGABYTE G482-Z52-00 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A2 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
T4 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 3,787 images/sec | 54 images/sec/watt | 2.11 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |
128 | 4,707 images/sec | 67 images/sec/watt | 27.19 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 | |
ResNet-50v1.5 | 8 | 3,576 images/sec | 51 images/sec/watt | 2.24 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |
128 | 4,426 images/sec | 63 images/sec/watt | 28.92 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 | |
BERT-BASE | 1 | For Batch Size 1, please refer to the Triton Inference Server page | |||||||||
2 | For Batch Size 2, please refer to the Triton Inference Server page | ||||||||||
8 | 1,540 sequences/sec | 22 sequences/sec/watt | 5.19 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 | |
128 | 1,792 sequences/sec | 26 sequences/sec/watt | 71.45 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 | |
BERT-LARGE | 1 | For Batch Size 1, please refer to the Triton Inference Server page | |||||||||
2 | For Batch Size 2, please refer to the Triton Inference Server page | ||||||||||
8 | 555 sequences/sec | 2 sequences/sec/watt | 14.41 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 | |
128 | 535 sequences/sec | 2 sequences/sec/watt | 239.13 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 | |
EfficientNet-B0 | 8 | 4,479 images/sec | 64 images/sec/watt | 1.79 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |
128 | 5,944 images/sec | 86 images/sec/watt | 21.53 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 | |
EfficientNet-B4 | 8 | 732 images/sec | 10 images/sec/watt | 10.93 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |
128 | 789 images/sec | 11 images/sec/watt | 162.21 | 1x T4 | Supermicro SYS-4029GP-TRT | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A container version with a hyphen indicates a pre-release container
V100 Inference Performance
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50 | 8 | 4,276 images/sec | 16 images/sec/watt | 1.87 | 1x V100 | DGX-2 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB |
128 | 7,916 images/sec | 23 images/sec/watt | 16.17 | 1x V100 | DGX-2 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB | |
ResNet-50v1.5 | 8 | 4,199 images/sec | 15 images/sec/watt | 1.91 | 1x V100 | DGX-2 | 22.05-py3 | Mixed | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB |
128 | 7,564 images/sec | 22 images/sec/watt | 16.92 | 1x V100 | DGX-2 | 22.05-py3 | Mixed | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB | |
BERT-BASE | 1 | For Batch Size 1, please refer to the Triton Inference Server page | |||||||||
2 | For Batch Size 2, please refer to the Triton Inference Server page | ||||||||||
8 | 1,997 sequences/sec | 6 sequences/sec/watt | 4.01 | 1x V100 | DGX-2 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB | |
128 | 3,150 sequences/sec | 9 sequences/sec/watt | 40.64 | 1x V100 | DGX-2 | 22.05-py3 | Mixed | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB | |
BERT-LARGE | 1 | For Batch Size 1, please refer to the Triton Inference Server page | |||||||||
2 | For Batch Size 2, please refer to the Triton Inference Server page | ||||||||||
8 | 765 sequences/sec | 1 sequences/sec/watt | 10.45 | 1x V100 | DGX-2 | 22.05-py3 | Mixed | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB | |
128 | 968 sequences/sec | 1 sequences/sec/watt | 132.24 | 1x V100 | DGX-2 | 22.05-py3 | Mixed | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB | |
EfficientNet-B0 | 8 | 4,233 images/sec | 21 images/sec/watt | 1.89 | 1x V100 | DGX-2 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB |
128 | 8,406 images/sec | 27 images/sec/watt | 15.23 | 1x V100 | DGX-2 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB | |
EfficientNet-B4 | 8 | 835 images/sec | 3 images/sec/watt | 9.58 | 1x V100 | DGX-2 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB |
128 | 1,172 images/sec | 4 images/sec/watt | 109.2 | 1x V100 | DGX-2 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM3-32GB |
Sequence length=128 for BERT-BASE and BERT-LARGE | Efficiency based on board power
A container version with a hyphen indicates a pre-release container
Inference Performance of NVIDIA GPU on Cloud
Benchmarks are reproducible by following links to the NGC catalog scripts
A100 Inference Performance on Cloud
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50v1.5 | 8 | 11,535 images/sec | 62 images/sec/watt | 0.69 | 1x A100 | GCP A2-HIGHGPU-1G | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB |
128 | 28,303 images/sec | 110 images/sec/watt | 4.52 | 1x A100 | GCP A2-HIGHGPU-1G | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB | |
8 | 11,204 images/sec | 61 images/sec/watt | 0.71 | 1x A100 | AWS EC2 p4d.24xlarge | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB | |
128 | 28,291 images/sec | 108 images/sec/watt | 4.52 | 1x A100 | AWS EC2 p4d.24xlarge | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB | |
8 | 11,371 images/sec | - images/sec/watt | 0.7 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB | |
128 | 29,603 images/sec | - images/sec/watt | 4.32 | 1x A100 | Azure Standard_ND96amsr_A100_v4 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB | |
BERT-LARGE | 8 | 2,669 sequences/sec | 10 sequences/sec/watt | 3 | 1x A100 | AWS EC2 p4d.24xlarge | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB
128 | 4,906 sequences/sec | 12 sequences/sec/watt | 26.09 | 1x A100 | AWS EC2 p4d.24xlarge | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB |
BERT-BASE | 8 | 6,763 sequences/sec | 31 sequences/sec/watt | 1.18 | 1x A100 | AWS EC2 p4d.24xlarge | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB |
128 | 15,199 sequences/sec | 39 sequences/sec/watt | 8.42 | 1x A100 | AWS EC2 p4d.24xlarge | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | A100-SXM4-40GB |
BERT-Large: Sequence Length = 128
T4 Inference Performance on Cloud
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50v1.5 | 8 | 3,301 images/sec | 47 images/sec/watt | 2.42 | 1x T4 | GCP N1-HIGHMEM-8 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | NVIDIA T4 |
V100 Inference Performance on Cloud
Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version |
---|---|---|---|---|---|---|---|---|---|---|---|
ResNet-50v1.5 | 8 | 4,122 images/sec | 18 images/sec/watt | 1.94 | 1x V100 | GCP N1-HIGHMEM-8 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM2-16GB |
128 | 7,343 images/sec | 25 images/sec/watt | 17.43 | 1x V100 | GCP N1-HIGHMEM-8 | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM2-16GB | |
8 | 3,824 images/sec | - images/sec/watt | 2.09 | 1x V100 | Azure Standard_NC6s_v3 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM2-16GB | |
128 | 7,043 images/sec | - images/sec/watt | 18.17 | 1x V100 | Azure Standard_NC6s_v3 | 22.05-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM2-16GB | |
BERT-BASE | 8 | 6,763 sequences/sec | 31 sequences/sec/watt | 1.18 | 1x V100 | AWS EC2 p3.2xlarge | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM2-16GB |
128 | 15,199 sequences/sec | 39 sequences/sec/watt | 8.42 | 1x V100 | AWS EC2 p3.2xlarge | 22.04-py3 | INT8 | Synthetic | TensorRT 8.2.3 | V100-SXM2-16GB |
Conversational AI
NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.
Related Resources
Download and get started with NVIDIA Riva.
Riva Benchmarks
A100 ASR Benchmarks
A100 Best Streaming Throughput Mode (800 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|
Citrinet | 1 | 10.3 | 1 | A100 SXM4-40GB |
Citrinet | 256 | 167.4 | 253 | A100 SXM4-40GB |
Citrinet | 512 | 293.8 | 503 | A100 SXM4-40GB |
Citrinet | 1024 | 661.8 | 988 | A100 SXM4-40GB |
Quartznet | 1 | 17.2 | 1 | A100 SXM4-40GB |
Quartznet | 256 | 142.8 | 254 | A100 SXM4-40GB |
Quartznet | 512 | 214.2 | 505 | A100 SXM4-40GB |
Quartznet | 1024 | 377.8 | 998 | A100 SXM4-40GB |
Jasper | 1 | 20.9 | 1 | A100 SXM4-40GB |
Jasper | 256 | 173.3 | 254 | A100 SXM4-40GB |
Jasper | 512 | 286 | 504 | A100 SXM4-40GB |
Jasper | 1024 | 700.6 | 989 | A100 SXM4-40GB |
A100 Best Streaming Latency Mode (100 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|
Citrinet | 1 | 9.8 | 1 | A100 SXM4-40GB |
Citrinet | 16 | 26.8 | 16 | A100 SXM4-40GB |
Citrinet | 128 | 91.1 | 127 | A100 SXM4-40GB |
Quartznet | 1 | 9.1 | 1 | A100 SXM4-40GB |
Quartznet | 16 | 17.9 | 16 | A100 SXM4-40GB |
Quartznet | 128 | 55.5 | 127 | A100 SXM4-40GB |
Jasper | 1 | 13.5 | 1 | A100 SXM4-40GB |
Jasper | 16 | 31.5 | 16 | A100 SXM4-40GB |
Jasper | 128 | 98.5 | 127 | A100 SXM4-40GB |
A100 Offline Mode (3200 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|
Citrinet | 1 | 11.6 | 1 | A100 SXM4-40GB |
Citrinet | 512 | 366.8 | 503 | A100 SXM4-40GB |
Citrinet | 1,024 | 680.4 | 989 | A100 SXM4-40GB |
Citrinet | 1,512 | 981.3 | 1,437 | A100 SXM4-40GB |
Quartznet | 1 | 34.4 | 1 | A100 SXM4-40GB |
Quartznet | 512 | 457.5 | 504 | A100 SXM4-40GB |
Quartznet | 1,024 | 941.9 | 989 | A100 SXM4-40GB |
Quartznet | 1,512 | 1,592.7 | 1,421 | A100 SXM4-40GB |
Jasper | 1 | 35.8 | 1 | A100 SXM4-40GB |
Jasper | 512 | 631.3 | 503 | A100 SXM4-40GB |
Jasper | 1,024 | 1,495.5 | 977 | A100 SXM4-40GB |
Jasper | 1,512 | 2,544.2 | 1,395 | A100 SXM4-40GB |
ASR Throughput (RTFX) - Number of seconds of audio processed per second | Audio Chunk Size - Server-side configuration indicating the amount of new data to be considered by the acoustic model | ASR Dataset: LibriSpeech | The latency numbers were measured using the streaming recognition mode, with the BERT-Base punctuation model enabled, a 4-gram language model, a decoder beam width of 128, and timestamps enabled. The client and the server used audio chunks of the same duration (100 ms, 800 ms, or 3,200 ms depending on the server configuration). The Riva streaming client riva_streaming_asr_client, provided in the Riva client image, was used with the --simulate_realtime flag to simulate transcription from a microphone, with each stream performing 5 iterations over a sample audio file from the LibriSpeech dataset (1272-135031-0000.wav) | Riva version: v1.10.0-b1 | Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA A30, NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
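RTFX can be read as an aggregate real-time factor: it is the number of seconds of audio the server processes per second of wall-clock time, so sustaining N concurrent real-time streams requires an RTFX of roughly N. A small worked example against the offline Citrinet row above:

```python
# RTFX worked example: Citrinet offline mode at 1,512 streams reports
# RTFX 1,437, i.e. 1,437 seconds of audio processed per wall-clock second.
audio_seconds = 3600          # one hour of audio
rtfx = 1437
print(audio_seconds / rtfx)   # ~2.5 s of wall time to transcribe it
```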
A30 ASR Benchmarks
A30 Best Streaming Throughput Mode (800 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|
Citrinet | 1 | 15.3 | 1 | A30 |
Citrinet | 256 | 262.4 | 253 | A30 |
Citrinet | 512 | 494.5 | 500 | A30 |
Citrinet | 1024 | 12,500 | 690 | A30 |
Quartznet | 1 | 19.7 | 1 | A30 |
Quartznet | 256 | 177.9 | 254 | A30 |
Quartznet | 512 | 293.7 | 504 | A30 |
Quartznet | 1024 | 654.7 | 987 | A30 |
Jasper | 1 | 22.3 | 1 | A30 |
Jasper | 256 | 252.1 | 253 | A30 |
Jasper | 512 | 454.1 | 502 | A30 |
Jasper | 1024 | 7,770.8 | 722 | A30 |
A30 Best Streaming Latency Mode (100 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|
Citrinet | 1 | 14.1 | 1 | A30 |
Citrinet | 16 | 44.9 | 16 | A30 |
Citrinet | 128 | 177.6 | 127 | A30 |
Quartznet | 1 | 10.2 | 1 | A30 |
Quartznet | 16 | 25.8 | 16 | A30 |
Quartznet | 128 | 65.5 | 127 | A30 |
Jasper | 1 | 15.1 | 1 | A30 |
Jasper | 16 | 40.3 | 16 | A30 |
Jasper | 128 | 2,663.2 | 120 | A30 |
A30 Offline Mode (3200 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|
Citrinet | 1 | 16.9 | 1 | A30 |
Citrinet | 512 | 574.1 | 501 | A30 |
Citrinet | 1,024 | 1,166.1 | 979 | A30 |
Citrinet | 1,512 | 8,992.4 | 1,108 | A30 |
Quartznet | 1 | 41.4 | 1 | A30 |
Quartznet | 512 | 696.4 | 502 | A30 |
Quartznet | 1,024 | 1,536.5 | 974 | A30 |
Quartznet | 1,512 | 2,712.4 | 1,392 | A30 |
Jasper | 1 | 40.3 | 1 | A30 |
Jasper | 512 | 1,149.5 | 498 | A30 |
Jasper | 1,024 | 2,981.7 | 948 | A30 |
Jasper | 1,512 | 18,136 | 978 | A30 |
ASR Throughput (RTFX) - Number of seconds of audio processed per second | Audio Chunk Size - Server-side configuration indicating the amount of new data to be considered by the acoustic model | ASR Dataset: LibriSpeech | The latency numbers were measured using the streaming recognition mode, with the BERT-Base punctuation model enabled, a 4-gram language model, a decoder beam width of 128, and timestamps enabled. The client and the server used audio chunks of the same duration (100 ms, 800 ms, or 3,200 ms depending on the server configuration). The Riva streaming client riva_streaming_asr_client, provided in the Riva client image, was used with the --simulate_realtime flag to simulate transcription from a microphone, with each stream performing 5 iterations over a sample audio file from the LibriSpeech dataset (1272-135031-0000.wav) | Riva version: v1.10.0-b1 | Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA A30, NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
V100 ASR Benchmarks
V100 Best Streaming Throughput Mode (800 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|
Citrinet | 1 | 12.5 | 1 | V100 SXM2-16GB |
Citrinet | 256 | 284.5 | 253 | V100 SXM2-16GB |
Citrinet | 512 | 553.9 | 499 | V100 SXM2-16GB |
Citrinet | 768 | 4,443.9 | 650 | V100 SXM2-16GB |
Quartznet | 1 | 13.7 | 1 | V100 SXM2-16GB |
Quartznet | 256 | 196.4 | 254 | V100 SXM2-16GB |
Quartznet | 512 | 308.1 | 502 | V100 SXM2-16GB |
Quartznet | 768 | 458.1 | 748 | V100 SXM2-16GB |
Jasper | 1 | 23.6 | 1 | V100 SXM2-16GB |
Jasper | 128 | 191.7 | 127 | V100 SXM2-16GB |
Jasper | 256 | 336.6 | 253 | V100 SXM2-16GB |
Jasper | 512 | 937.1 | 497 | V100 SXM2-16GB |
V100 Best Streaming Latency Mode (100 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|
Citrinet | 1 | 11.2 | 1 | V100 SXM2-16GB |
Citrinet | 16 | 34.9 | 16 | V100 SXM2-16GB |
Citrinet | 128 | 213.3 | 127 | V100 SXM2-16GB |
Quartznet | 1 | 8.3 | 1 | V100 SXM2-16GB |
Quartznet | 16 | 17.7 | 16 | V100 SXM2-16GB |
Quartznet | 128 | 183.1 | 127 | V100 SXM2-16GB |
Jasper | 1 | 19.3 | 1 | V100 SXM2-16GB |
Jasper | 16 | 38.9 | 16 | V100 SXM2-16GB |
Jasper | 64 | 123.1 | 64 | V100 SXM2-16GB |
V100 Offline Mode (3200 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|
Citrinet | 1 | 12.7 | 1 | V100 SXM2-16GB |
Citrinet | 256 | 350.9 | 252 | V100 SXM2-16GB |
Citrinet | 512 | 627.1 | 499 | V100 SXM2-16GB |
Citrinet | 768 | 936.7 | 738 | V100 SXM2-16GB |
Citrinet | 1,024 | 1,500.5 | 972 | V100 SXM2-16GB |
Quartznet | 1 | 29.5 | 1 | V100 SXM2-16GB |
Quartznet | 256 | 365.4 | 253 | V100 SXM2-16GB |
Quartznet | 512 | 669.9 | 501 | V100 SXM2-16GB |
Quartznet | 768 | 1,199.8 | 737 | V100 SXM2-16GB |
Quartznet | 1,024 | 1,662.2 | 965 | V100 SXM2-16GB |
Jasper | 1 | 35.3 | 1 | V100 SXM2-16GB |
Jasper | 256 | 740.5 | 251 | V100 SXM2-16GB |
Jasper | 512 | 1,757.2 | 489 | V100 SXM2-16GB |
Jasper | 768 | 3,138.9 | 711 | V100 SXM2-16GB |
ASR Throughput (RTFX) - Number of seconds of audio processed per second | Audio Chunk Size - Server-side configuration indicating the amount of new data to be considered by the acoustic model | ASR Dataset: LibriSpeech | The latency numbers were measured using the streaming recognition mode, with the BERT-Base punctuation model enabled, a 4-gram language model, a decoder beam width of 128, and timestamps enabled. The client and the server used audio chunks of the same duration (100 ms, 800 ms, or 3,200 ms depending on the server configuration). The Riva streaming client riva_streaming_asr_client, provided in the Riva client image, was used with the --simulate_realtime flag to simulate transcription from a microphone, with each stream performing 5 iterations over a sample audio file from the LibriSpeech dataset (1272-135031-0000.wav) | Riva version: v1.10.0-b1 | Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA A30, NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
T4 ASR Benchmarks
T4 Best Streaming Throughput Mode (800 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|
Citrinet | 1 | 26 | 1 | NVIDIA T4 |
Citrinet | 64 | 178.7 | 64 | NVIDIA T4 |
Citrinet | 128 | 300.4 | 127 | NVIDIA T4 |
Citrinet | 256 | 710.4 | 249 | NVIDIA T4 |
Citrinet | 384 | 8,847.0 | 290 | NVIDIA T4 |
Quartznet | 1 | 28.4 | 1 | NVIDIA T4 |
Quartznet | 64 | 144.1 | 64 | NVIDIA T4 |
Quartznet | 128 | 190.3 | 127 | NVIDIA T4 |
Quartznet | 256 | 296.5 | 252 | NVIDIA T4 |
Quartznet | 384 | 422.7 | 376 | NVIDIA T4 |
Jasper | 1 | 74.8 | 1 | NVIDIA T4 |
Jasper | 64 | 218.8 | 64 | NVIDIA T4 |
Jasper | 128 | 359.5 | 126 | NVIDIA T4 |
Jasper | 256 | 1,030.6 | 249 | NVIDIA T4 |
T4 Best Streaming Latency Mode (100 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|
Citrinet | 1 | 22.6 | 1 | NVIDIA T4 |
Citrinet | 16 | 66.1 | 16 | NVIDIA T4 |
Citrinet | 64 | 1,803.7 | 62 | NVIDIA T4 |
Quartznet | 1 | 16.1 | 1 | NVIDIA T4 |
Quartznet | 16 | 40.7 | 16 | NVIDIA T4 |
Quartznet | 64 | 104.5 | 64 | NVIDIA T4 |
Jasper | 1 | 46.6 | 1 | NVIDIA T4 |
Jasper | 8 | 47.4 | 8 | NVIDIA T4 |
Jasper | 16 | 72 | 16 | NVIDIA T4 |
T4 Offline Mode (3200 ms chunk)
Acoustic model | # of streams | Avg Latency (ms) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|
Citrinet | 1 | 28.3 | 1 | NVIDIA T4 |
Citrinet | 256 | 709.2 | 250 | NVIDIA T4 |
Citrinet | 512 | 3,510.8 | 449 | NVIDIA T4 |
Quartznet | 1 | 54.2 | 1 | NVIDIA T4 |
Quartznet | 256 | 770.9 | 251 | NVIDIA T4 |
Quartznet | 512 | 1,685.9 | 486 | NVIDIA T4 |
Jasper | 1 | 96.7 | 1 | NVIDIA T4 |
Jasper | 256 | 1,888.4 | 245 | NVIDIA T4 |
ASR Throughput (RTFX) - Number of seconds of audio processed per second | Audio Chunk Size - Server-side configuration indicating the amount of new data to be considered by the acoustic model | ASR Dataset: LibriSpeech | The latency numbers were measured using the streaming recognition mode, with the BERT-Base punctuation model enabled, a 4-gram language model, a decoder beam width of 128, and timestamps enabled. The client and the server used audio chunks of the same duration (100 ms, 800 ms, or 3,200 ms depending on the server configuration). The Riva streaming client riva_streaming_asr_client, provided in the Riva client image, was used with the --simulate_realtime flag to simulate transcription from a microphone, with each stream performing 5 iterations over a sample audio file from the LibriSpeech dataset (1272-135031-0000.wav) | Riva version: v1.10.0-b1 | Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA A30, NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
A100 TTS Benchmarks
Model | # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|---|
FastPitch + Hifi-GAN | 1 | 0.03 | 0.003 | 133 | A100 SXM4-40GB |
FastPitch + Hifi-GAN | 4 | 0.04 | 0.006 | 340 | A100 SXM4-40GB |
FastPitch + Hifi-GAN | 6 | 0.06 | 0.007 | 390 | A100 SXM4-40GB |
FastPitch + Hifi-GAN | 8 | 0.07 | 0.009 | 443 | A100 SXM4-40GB |
FastPitch + Hifi-GAN | 10 | 0.07 | 0.009 | 464 | A100 SXM4-40GB |
Tacotron 2 + WaveGlow | 1 | 0.05 | 0.02 | 34 | A100 SXM4-40GB |
Tacotron 2 + WaveGlow | 4 | 0.26 | 0.03 | 59 | A100 SXM4-40GB |
Tacotron 2 + WaveGlow | 6 | 0.38 | 0.03 | 66 | A100 SXM4-40GB |
Tacotron 2 + WaveGlow | 8 | 0.51 | 0.04 | 70 | A100 SXM4-40GB |
Tacotron 2 + WaveGlow | 10 | 0.61 | 0.04 | 73 | A100 SXM4-40GB |
TTS Throughput (RTFX) - Number of seconds of audio generated per second | Dataset: LJSpeech | Performance of the Riva text-to-speech (TTS) service was measured for different numbers of parallel streams, with each parallel stream performing 10 iterations over 10 input strings from the LJSpeech dataset. Latency to first audio chunk, latency between successive audio chunks, and throughput were measured. Riva version: v1.10.0-b1 | Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA A30, NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
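For TTS, RTFX is the aggregate number of seconds of audio generated per second of wall time; dividing by the number of parallel streams gives a per-stream real-time factor. Using the ten-stream FastPitch + Hifi-GAN row above:

```python
# Per-stream real-time factor from the aggregate TTS RTFX above:
# FastPitch + Hifi-GAN at 10 parallel streams reports RTFX 464.
streams, rtfx = 10, 464
per_stream = rtfx / streams   # ~46x faster than real time per stream
print(per_stream)
print(10 / per_stream)        # ~0.22 s of compute for 10 s of audio
```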
A30 TTS Benchmarks
Model | # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|---|
FastPitch + Hifi-GAN | 1 | 0.03 | 0.003 | 133 | A30 |
FastPitch + Hifi-GAN | 4 | 0.04 | 0.006 | 340 | A30 |
FastPitch + Hifi-GAN | 6 | 0.06 | 0.007 | 390 | A30 |
FastPitch + Hifi-GAN | 8 | 0.07 | 0.009 | 443 | A30 |
FastPitch + Hifi-GAN | 10 | 0.07 | 0.009 | 464 | A30 |
Tacotron 2 + WaveGlow | 1 | 0.07 | 0.03 | 25 | A30 |
Tacotron 2 + WaveGlow | 4 | 0.33 | 0.04 | 45 | A30 |
Tacotron 2 + WaveGlow | 6 | 0.51 | 0.05 | 48 | A30 |
Tacotron 2 + WaveGlow | 8 | 0.69 | 0.06 | 50 | A30 |
Tacotron 2 + WaveGlow | 10 | 0.84 | 0.06 | 50 | A30 |
TTS Throughput (RTFX) - Number of seconds of audio generated per second | Dataset: LJSpeech | Performance of the Riva text-to-speech (TTS) service was measured for different numbers of parallel streams, with each parallel stream performing 10 iterations over 10 input strings from the LJSpeech dataset. Latency to first audio chunk, latency between successive audio chunks, and throughput were measured. Riva version: v1.10.0-b1 | Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA A30, NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
V100 TTS Benchmarks
Model | # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|---|
FastPitch + Hifi-GAN | 1 | 0.03 | 0.005 | 107 | V100 SXM2-16GB |
FastPitch + Hifi-GAN | 4 | 0.07 | 0.01 | 212 | V100 SXM2-16GB |
FastPitch + Hifi-GAN | 6 | 0.10 | 0.01 | 226 | V100 SXM2-16GB |
FastPitch + Hifi-GAN | 8 | 0.13 | 0.02 | 236 | V100 SXM2-16GB |
FastPitch + Hifi-GAN | 10 | 0.15 | 0.02 | 232 | V100 SXM2-16GB |
Tacotron 2 + WaveGlow | 1 | 0.06 | 0.03 | 25 | V100 SXM2-16GB |
Tacotron 2 + WaveGlow | 4 | 0.39 | 0.05 | 37 | V100 SXM2-16GB |
Tacotron 2 + WaveGlow | 6 | 0.60 | 0.06 | 40 | V100 SXM2-16GB |
Tacotron 2 + WaveGlow | 8 | 0.81 | 0.06 | 43 | V100 SXM2-16GB |
Tacotron 2 + WaveGlow | 10 | 0.98 | 0.07 | 43 | V100 SXM2-16GB |
TTS Throughput (RTFX) - Number of seconds of audio generated per second | Dataset: LJSpeech | Performance of the Riva text-to-speech (TTS) service was measured for different numbers of parallel streams, with each parallel stream performing 10 iterations over 10 input strings from the LJSpeech dataset. Latency to first audio chunk, latency between successive audio chunks, and throughput were measured. Riva version: v1.10.0-b1 | Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA A30, NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
T4 TTS Benchmarks
Model | # of streams | Avg Latency to first audio (sec) | Avg Latency between audio chunks (sec) | Throughput (RTFX) | GPU Version |
---|---|---|---|---|---|
FastPitch + Hifi-GAN | 1 | 0.05 | 0.006 | 73 | NVIDIA T4 |
FastPitch + Hifi-GAN | 4 | 0.11 | 0.02 | 132 | NVIDIA T4 |
FastPitch + Hifi-GAN | 6 | 0.15 | 0.02 | 141 | NVIDIA T4 |
FastPitch + Hifi-GAN | 8 | 0.19 | 0.03 | 148 | NVIDIA T4 |
FastPitch + Hifi-GAN | 10 | 0.21 | 0.03 | 150 | NVIDIA T4 |
Tacotron 2 + WaveGlow | 1 | 0.11 | 0.05 | 15 | NVIDIA T4 |
Tacotron 2 + WaveGlow | 4 | 0.72 | 0.11 | 18 | NVIDIA T4 |
Tacotron 2 + WaveGlow | 6 | 1.16 | 0.14 | 19 | NVIDIA T4 |
Tacotron 2 + WaveGlow | 8 | 1.64 | 0.16 | 19 | NVIDIA T4 |
Tacotron 2 + WaveGlow | 10 | 2.07 | 0.17 | 19 | NVIDIA T4 |
TTS Throughput (RTFX) - Number of seconds of audio generated per second | Dataset: LJSpeech | Performance of the Riva text-to-speech (TTS) service was measured for different numbers of parallel streams, with each parallel stream performing 10 iterations over 10 input strings from the LJSpeech dataset. Latency to first audio chunk, latency between successive audio chunks, and throughput were measured. Riva version: v1.10.0-b1 | Hardware: DGX A100 (1x A100 SXM4-40GB), NVIDIA DGX-1 (1x V100-SXM2-16GB), NVIDIA A30, NVIDIA T4 with 2x Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
Last updated: June 29th, 2022