AI Inference

Real-world AI inference demands high throughput and low latency, delivered with maximum efficiency across use cases. NVIDIA's industry-leading inference solutions let customers deploy AI models into production quickly, with the highest performance from the data center to the edge.


MLPerf Inference v3.1 Performance Benchmarks

Offline Scenario - Closed Division

Network | Throughput | GPU | Server | GPU Version | Target Accuracy | Dataset
ResNet-50 | 707,537 samples/sec | 8x H100 | AS-8125GS-TNHR | H100-SXM-80GB | 76.46% Top1 | ImageNet (224x224)
ResNet-50 | 93,198 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 76.46% Top1 | ImageNet (224x224)
ResNet-50 | 12,882 samples/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 76.46% Top1 | ImageNet (224x224)
RetinaNet | 14,196 samples/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | 0.3755 mAP | OpenImages (800x800)
RetinaNet | 1,849 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 0.3755 mAP | OpenImages (800x800)
RetinaNet | 226 samples/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 0.3755 mAP | OpenImages (800x800)
BERT | 71,213 samples/sec | 8x H100 | GIGABYTE G593-SD0 | H100-SXM-80GB | 90.87% F1 | SQuAD v1.1
BERT | 10,163 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 90.87% F1 | SQuAD v1.1
BERT | 1,029 samples/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 90.87% F1 | SQuAD v1.1
GPT-J | 107 samples/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN DailyMail
GPT-J | 13 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN DailyMail
GPT-J | 1 sample/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | CNN DailyMail
DLRMv2 | 344,370 samples/sec | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | 80.31% AUC | Synthetic Multihot Criteo Dataset
DLRMv2 | 49,002 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 80.31% AUC | Synthetic Multihot Criteo Dataset
DLRMv2 | 3,673 samples/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 80.31% AUC | Synthetic Multihot Criteo Dataset
3D-UNet | 52 samples/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | 0.863 DICE mean | KiTS 2019
3D-UNet | 7 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 0.863 DICE mean | KiTS 2019
3D-UNet | 1 sample/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 0.863 DICE mean | KiTS 2019
RNN-T | 187,469 samples/sec | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | 7.45% WER | LibriSpeech dev-clean
RNN-T | 25,975 samples/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 7.45% WER | LibriSpeech dev-clean
RNN-T | 3,899 samples/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 7.45% WER | LibriSpeech dev-clean

Server Scenario - Closed Division

Network | Throughput | GPU | Server | GPU Version | Target Accuracy | MLPerf Server Latency Constraint | Dataset
ResNet-50 | 620,874 queries/sec | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | 76.46% Top1 | 15 ms | ImageNet (224x224)
ResNet-50 | 77,018 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 76.46% Top1 | 15 ms | ImageNet (224x224)
ResNet-50 | 12,204 queries/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 76.46% Top1 | 15 ms | ImageNet (224x224)
RetinaNet | 13,021 queries/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | 0.3755 mAP | 100 ms | OpenImages (800x800)
RetinaNet | 1,731 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 0.3755 mAP | 100 ms | OpenImages (800x800)
RetinaNet | 200 queries/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 0.3755 mAP | 100 ms | OpenImages (800x800)
BERT | 57,331 queries/sec | 8x H100 | Dell PowerEdge XE9680 | H100-SXM-80GB | 90.87% F1 | 130 ms | SQuAD v1.1
BERT | 7,704 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 90.87% F1 | 130 ms | SQuAD v1.1
BERT | 899 queries/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 90.87% F1 | 130 ms | SQuAD v1.1
GPT-J | 86 queries/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN DailyMail
GPT-J | 11 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN DailyMail
GPT-J | 1 query/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | rouge1=42.9865, rouge2=20.1235, rougeL=29.9881 | 20 s | CNN DailyMail
DLRMv2 | 327,051 queries/sec | 8x H100 | SYS-821GE-TNHR | H100-SXM-80GB | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset
DLRMv2 | 48,517 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset
DLRMv2 | 3,305 queries/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 80.31% AUC | 60 ms | Synthetic Multihot Criteo Dataset
RNN-T | 180,017 queries/sec | 8x H100 | GIGABYTE G593-SD0 | H100-SXM-80GB | 7.45% WER | 1000 ms | LibriSpeech dev-clean
RNN-T | 24,008 queries/sec | 1x GH200 | NVIDIA GH200-GraceHopper-Superchip | GH200-GraceHopper-Superchip | 7.45% WER | 1000 ms | LibriSpeech dev-clean
RNN-T | 3,755 queries/sec | 1x L4 | ASROCKRACK 1U1G-MILAN | NVIDIA L4 | 7.45% WER | 1000 ms | LibriSpeech dev-clean

Power Efficiency Offline Scenario - Closed Division

Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset
ResNet-50 | 474,849 samples/sec | 117 samples/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | ImageNet (224x224)
RetinaNet | 10,114 samples/sec | 2 samples/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | OpenImages (800x800)
BERT | 54,050 samples/sec | 11 samples/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | SQuAD v1.1
GPT-J | 65 samples/sec | 0.017 samples/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | CNN DailyMail
DLRMv2 | 273,527 samples/sec | 49 samples/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | Synthetic Multihot Criteo Dataset
3D-UNet | 38 samples/sec | 0.009 samples/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | KiTS 2019
RNN-T | 125,479 samples/sec | 30 samples/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | LibriSpeech dev-clean

Power Efficiency Server Scenario - Closed Division

Network | Throughput | Throughput per Watt | GPU | Server | GPU Version | Dataset
ResNet-50 | 400,094 queries/sec | 97 queries/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | ImageNet (224x224)
RetinaNet | 8,802 queries/sec | 2 queries/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | OpenImages (800x800)
BERT | 42,416 queries/sec | 8 queries/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | SQuAD v1.1
GPT-J | 49 queries/sec | 0.013 queries/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | CNN DailyMail
DLRMv2 | 244,023 queries/sec | 42 queries/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | Synthetic Multihot Criteo Dataset
RNN-T | 112,015 queries/sec | 25 queries/sec/watt | 8x H100 | DGX-H100 | H100-SXM-80GB | LibriSpeech dev-clean

MLPerf™ v3.1 Inference Closed: ResNet-50 v1.5, RetinaNet, RNN-T, BERT 99% of FP32 accuracy target, 3D U-Net 99.9% of FP32 accuracy target, GPT-J 99% of FP32 accuracy target, DLRM 99% of FP32 accuracy target: 3.1-0069, 3.1-0077, 3.1-0106, 3.1-0110, 3.1-0132, 3.1-0135, 3.1-0109. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
BERT-Large sequence length = 384.
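The per-watt figures in the power-efficiency tables above are simply measured throughput divided by average system power draw during the run. A minimal sketch of that arithmetic (the power figure below is derived from the table, not an independently reported measurement):

```python
def throughput_per_watt(throughput: float, avg_power_w: float) -> float:
    """Samples/sec (or queries/sec) per watt, as reported in the
    power-efficiency tables: throughput divided by average power."""
    return throughput / avg_power_w

# Illustrative: ResNet-50 offline on 8x H100 reports 474,849 samples/sec
# at 117 samples/sec/watt, implying roughly 4,059 W average system draw.
implied_power_w = 474_849 / 117
print(round(implied_power_w))  # 4059
```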

LLM Inference Performance of NVIDIA Data Center Products

H200 Inference Performance

Model | Batch Size | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
LLaMA 13B | 1024 | 1 | 128 | 128 | 11,819 output tokens/sec | 1x H200 | HGX H200 | FP8 | TRT-LLM 0.5.0, TensorRT 9.1.0.4 | NVIDIA H200
LLaMA 13B | 128 | 1 | 128 | 2048 | 4,750 output tokens/sec | 1x H200 | HGX H200 | FP8 | TRT-LLM 0.5.0, TensorRT 9.1.0.4 | NVIDIA H200
LLaMA 13B | 64 | 1 | 2048 | 128 | 1,349 output tokens/sec | 1x H200 | HGX H200 | FP8 | TRT-LLM 0.5.0, TensorRT 9.1.0.4 | NVIDIA H200
LLaMA 70B | 512 | 4 | 128 | 2048 | 6,616 output tokens/sec | 4x H200 | HGX H200 | FP8 | TRT-LLM 0.5.0, TensorRT 9.1.0.4 | NVIDIA H200
LLaMA 70B | 512 | 1 | 128 | 128 | 3,014 output tokens/sec | 1x H200 | HGX H200 | FP8 | TRT-LLM 0.5.0, TensorRT 9.1.0.4 | NVIDIA H200
LLaMA 70B | 64 | 2 | 2048 | 128 | 682 output tokens/sec | 2x H200 | HGX H200 | FP8 | TRT-LLM 0.5.0, TensorRT 9.1.0.4 | NVIDIA H200

TP: Tensor Parallelism
Batch size per GPU
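Because rows mix tensor-parallel degrees (and therefore GPU counts), comparing them directly can mislead; normalizing total throughput by GPU count puts the rows on a common footing. A small illustrative helper (not part of any NVIDIA tool):

```python
def per_gpu_throughput(total_tokens_per_sec: float, num_gpus: int) -> float:
    """Output tokens/sec per GPU, for comparing runs with different
    tensor-parallel degrees (TP) and GPU counts."""
    return total_tokens_per_sec / num_gpus

# LLaMA 70B with TP=4 on 4x H200 reports 6,616 output tokens/sec total,
# i.e. 1,654 output tokens/sec per GPU.
print(per_gpu_throughput(6616, 4))  # 1654.0
```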

H100 Inference Performance - High Throughput

Model | Batch Size | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
GPT-J 6B | 64 | 1 | 128 | 128 | 10,907 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
GPT-J 6B | 64 | 1 | 128 | 2048 | 6,179 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
GPT-J 6B | 64 | 1 | 2048 | 128 | 2,229 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
GPT-J 6B | 64 | 1 | 2048 | 2048 | 2,980 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
LLaMA 7B | 64 | 1 | 128 | 128 | 9,193 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
LLaMA 7B | 64 | 1 | 128 | 2048 | 5,367 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
LLaMA 7B | 64 | 1 | 2048 | 128 | 2,058 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
LLaMA 7B | 32 | 1 | 2048 | 2048 | 2,230 output tokens/sec | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
LLaMA 70B | 64 | 4 | 128 | 128 | 3,317 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
LLaMA 70B | 64 | 4 | 128 | 2048 | 2,616 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
LLaMA 70B | 64 | 4 | 2048 | 128 | 843 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
LLaMA 70B | 64 | 4 | 2048 | 2048 | 1,583 output tokens/sec | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
Falcon 180B | 96 | 8 | 128 | 128 | 2,686 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
Falcon 180B | 96 | 8 | 128 | 2048 | 2,073 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
Falcon 180B | 64 | 8 | 2048 | 128 | 465 output tokens/sec | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB

TP: Tensor Parallelism
Batch size per GPU

L40S Inference Performance - High Throughput

Model | Batch Size | TP | Input Length | Output Length | Throughput | GPU | Server | Precision | Framework | GPU Version
GPT-J 6B | 64 | 1 | 128 | 128 | 3,630 output tokens/sec | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S
GPT-J 6B | 64 | 1 | 128 | 2048 | 1,859 output tokens/sec | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S
GPT-J 6B | 32 | 1 | 2048 | 128 | 616 output tokens/sec | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S
GPT-J 6B | 32 | 1 | 2048 | 2048 | 757 output tokens/sec | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S
LLaMA 7B | 64 | 1 | 128 | 128 | 3,240 output tokens/sec | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S
LLaMA 7B | 64 | 1 | 128 | 2048 | 1,622 output tokens/sec | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S
LLaMA 7B | 32 | 1 | 2048 | 128 | 581 output tokens/sec | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S
LLaMA 7B | 16 | 1 | 2048 | 2048 | 531 output tokens/sec | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S

TP: Tensor Parallelism
Batch size per GPU

H100 Inference Performance - Low Latency

Model | Batch Size | TP | Input Length | 1st Token Latency | GPU | Server | Precision | Framework | GPU Version
GPT-J 6B | 1 | 1 | 128 | 7 ms | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
GPT-J 6B | 1 | 1 | 2048 | 29 ms | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
LLaMA 7B | 1 | 1 | 128 | 7 ms | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
LLaMA 7B | 1 | 1 | 2048 | 36 ms | 1x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
LLaMA 70B | 1 | 4 | 128 | 26 ms | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
LLaMA 70B | 1 | 4 | 2048 | 109 ms | 4x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
Falcon 180B | 1 | 8 | 128 | 27 ms | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB
Falcon 180B | 1 | 8 | 2048 | 205 ms | 8x H100 | DGX H100 | FP8 | TensorRT-LLM 0.5.0 | H100-SXM5-80GB

TP: Tensor Parallelism
Batch size per GPU
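First-token latency covers only the prefill phase; a user-visible response also includes the decode time for every generated token. A rough end-to-end estimate under a simple model (the per-token decode rate below is an illustrative assumption, not a figure from the table):

```python
def est_total_latency_ms(first_token_ms: float,
                         per_token_ms: float,
                         output_tokens: int) -> float:
    """Rough end-to-end generation latency: prefill time (first token)
    plus decode time for each remaining token."""
    return first_token_ms + per_token_ms * (output_tokens - 1)

# Illustrative: 7 ms first-token latency (GPT-J 6B, 128-token input)
# with an assumed 10 ms/token decode rate over a 128-token generation.
print(est_total_latency_ms(7, 10, 128))  # 1277.0
```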

L40S Inference Performance - Low Latency

Model | Batch Size | TP | Input Length | 1st Token Latency | GPU | Server | Precision | Framework | GPU Version
GPT-J 6B | 1 | 1 | 128 | 12 ms | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S
GPT-J 6B | 1 | 1 | 2048 | 71 ms | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S
LLaMA 7B | 1 | 1 | 128 | 14 ms | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S
LLaMA 7B | 1 | 1 | 2048 | 73 ms | 1x L40S | ASROCKRACK 4U8G-ROME | FP8 | TensorRT-LLM 0.5.0 | NVIDIA L40S

TP: Tensor Parallelism
Batch size per GPU

Inference Performance of NVIDIA Data Center Products

H100 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
Stable Diffusion v2.1 (512x512) | 1 | 2.1 images/sec | - | 475.28 | 1x H100 | DGX H100 | - | Mixed | LAION-5B | TensorRT 8.6.0 | H100-SXM5-80GB
Stable Diffusion v2.1 (512x512) | 4 | 3.21 images/sec | - | 1244.73 | 1x H100 | DGX H100 | - | Mixed | LAION-5B | TensorRT 8.6.0 | H100-SXM5-80GB
ResNet-50 | 8 | 20,687 images/sec | 73 images/sec/watt | 0.39 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
ResNet-50 | 128 | 60,124 images/sec | 110 images/sec/watt | 2.13 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
ResNet-50 | 496 | 70,998 images/sec | - | 6.99 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
ResNet-50v1.5 | 8 | 20,119 images/sec | 65 images/sec/watt | 0.4 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
ResNet-50v1.5 | 128 | 58,062 images/sec | 101 images/sec/watt | 2.2 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
ResNet-50v1.5 | 473 | 68,145 images/sec | - | 6.99 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
BERT-BASE | 8 | 9,103 sequences/sec | 21 sequences/sec/watt | 0.88 | 1x H100 | DGX H100 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
BERT-BASE | 128 | 24,828 sequences/sec | 36 sequences/sec/watt | 5.16 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
BERT-LARGE | 8 | 3,948 sequences/sec | 9 sequences/sec/watt | 2.03 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
BERT-LARGE | 128 | 8,313 sequences/sec | 12 sequences/sec/watt | 15.4 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
EfficientNet-B0 | 8 | 15,945 images/sec | 65 images/sec/watt | 0.5 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
EfficientNet-B0 | 128 | 54,823 images/sec | 118 images/sec/watt | 2.33 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
EfficientNet-B0 | 467 | 66,972 images/sec | - | 6.99 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
EfficientNet-B4 | 8 | 4,438 images/sec | 14 images/sec/watt | 1.8 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
EfficientNet-B4 | 53 | 7,686 images/sec | - | 7.03 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
EfficientNet-B4 | 128 | 8,479 images/sec | 15 images/sec/watt | 15.1 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
HF Swin Base | 8 | 3,762 samples/sec | 8 samples/sec/watt | 2.13 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
HF Swin Base | 32 | 5,628 samples/sec | 9 samples/sec/watt | 5.69 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
HF Swin Large | 8 | 2,517 samples/sec | 5 samples/sec/watt | 3.18 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
HF Swin Large | 32 | 3,409 samples/sec | 5 samples/sec/watt | 9.39 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
HF ViT Base | 8 | 6,717 samples/sec | 12 samples/sec/watt | 1.19 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
HF ViT Base | 64 | 10,222 samples/sec | 15 samples/sec/watt | 6.26 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
HF ViT Large | 8 | 2,722 samples/sec | 4 samples/sec/watt | 2.94 | 1x H100 | DGX H100 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
HF ViT Large | 64 | 3,388 samples/sec | 5 samples/sec/watt | 18.89 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
Megatron BERT Large QAT | 8 | 4,839 sequences/sec | 12 sequences/sec/watt | 1.65 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
Megatron BERT Large QAT | 128 | 12,230 sequences/sec | 18 sequences/sec/watt | 10.47 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
QuartzNet | 8 | 6,339 samples/sec | 22 samples/sec/watt | 1.26 | 1x H100 | DGX H100 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB
QuartzNet | 128 | 32,670 samples/sec | 88 samples/sec/watt | 3.92 | 1x H100 | DGX H100 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | H100-SXM5-80GB

512x512 image size, 50 denoising steps for Stable Diffusion
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
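For these throughput-oriented rows, the latency column is approximately the time to process one batch, i.e. batch size divided by throughput. A quick sanity check of that relationship against a row above:

```python
def batch_latency_ms(batch_size: int, throughput_per_sec: float) -> float:
    """Approximate per-batch latency (ms) implied by a throughput figure."""
    return batch_size / throughput_per_sec * 1000.0

# ResNet-50 on H100 at batch 8 reports 20,687 images/sec; the implied
# per-batch latency is ~0.39 ms, matching the reported latency column.
print(round(batch_latency_ms(8, 20687), 2))  # 0.39
```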

L40S Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 23,574 images/sec | 80 images/sec/watt | 0.34 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
ResNet-50 | 32 | 39,117 images/sec | 118 images/sec/watt | 0.82 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
ResNet-50v1.5 | 8 | 22,947 images/sec | 77 images/sec/watt | 0.35 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
ResNet-50v1.5 | 32 | 37,073 images/sec | 110 images/sec/watt | 0.86 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
BERT-BASE | 8 | 8,285 sequences/sec | 28 sequences/sec/watt | 0.97 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
BERT-BASE | 128 | 13,036 sequences/sec | 38 sequences/sec/watt | 9.82 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
BERT-LARGE | 8 | 3,184 sequences/sec | 10 sequences/sec/watt | 2.51 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
BERT-LARGE | 24 | 4,214 sequences/sec | 13 sequences/sec/watt | 5.52 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
EfficientDet-D0 | 2 | 2,182 images/sec | 13 images/sec/watt | 0.92 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
EfficientDet-D0 | 8 | 4,505 images/sec | 17 images/sec/watt | 1.78 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
EfficientNet-B0 | 8 | 20,092 images/sec | 103 images/sec/watt | 0.4 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
EfficientNet-B0 | 32 | 40,149 images/sec | 140 images/sec/watt | 0.8 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
EfficientNet-B4 | 8 | 5,022 images/sec | 18 images/sec/watt | 1.59 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
EfficientNet-B4 | 16 | 5,902 images/sec | 18 images/sec/watt | 2.71 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
HF Swin Base | 8 | 3,166 samples/sec | 9 samples/sec/watt | 2.53 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
HF Swin Base | 16 | 3,605 samples/sec | 11 samples/sec/watt | 4.44 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
HF Swin Large | 8 | 1,615 samples/sec | 5 samples/sec/watt | 4.95 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
HF Swin Large | 16 | 1,756 samples/sec | 5 samples/sec/watt | 9.11 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
HF ViT Base | 12 | 3,981 samples/sec | 13 samples/sec/watt | 3.01 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
HF ViT Large | 8 | 1,368 samples/sec | 4 samples/sec/watt | 5.85 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
Megatron BERT Large QAT | 8 | 3,848 sequences/sec | 11 sequences/sec/watt | 2.08 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
Megatron BERT Large QAT | 24 | 4,874 sequences/sec | 14 sequences/sec/watt | 4.92 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
QuartzNet | 8 | 7,535 samples/sec | 32 samples/sec/watt | 1.06 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S
QuartzNet | 128 | 22,297 samples/sec | 66 samples/sec/watt | 5.74 | 1x L40S | Supermicro SYS-521GE-TNRT | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | NVIDIA L40S

BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

L40 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
Stable Diffusion v2.1 (1,024x1,024) | 1 | 0.2 images/sec | - | 5072.49 | 1x L40 | GIGABYTE G482-Z54-00 | - | Mixed | LAION-5B | TensorRT 8.5.2 | L40
ResNet-50 | 8 | 18,572 images/sec | 71 images/sec/watt | 0.43 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40
ResNet-50 | 32 | 28,637 images/sec | 96 images/sec/watt | 1.12 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40
ResNet-50v1.5 | 8 | 18,025 images/sec | 67 images/sec/watt | 0.44 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40
ResNet-50v1.5 | 32 | 27,061 images/sec | 90 images/sec/watt | 1.18 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40
BERT-BASE | 128 | 7,629 sequences/sec | 26 sequences/sec/watt | 16.78 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40
BERT-LARGE | 8 | 2,448 sequences/sec | 8 sequences/sec/watt | 3.27 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40
BERT-LARGE | 12 | 2,580 sequences/sec | 9 sequences/sec/watt | 3.27 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40
BERT-LARGE | 24 | 2,725 sequences/sec | 10 sequences/sec/watt | 8.81 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40
EfficientNet-B0 | 128 | 38,862 images/sec | 130 images/sec/watt | 3.29 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40
EfficientNet-B4 | 8 | 4,707 images/sec | 16 images/sec/watt | 1.7 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40
HF Swin Base | 8 | 2,354 samples/sec | 8 samples/sec/watt | 3.4 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40
HF Swin Base | 32 | 2,370 samples/sec | 8 samples/sec/watt | 13.5 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40
HF Swin Large | 8 | 1,173 samples/sec | 4 samples/sec/watt | 6.82 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | L40
HF ViT Base | 8 | 2,647 samples/sec | 9 samples/sec/watt | 3.02 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | L40
HF ViT Base | 64 | 2,653 samples/sec | 9 samples/sec/watt | 24.12 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | L40
HF ViT Large | 8 | 829 samples/sec | 3 samples/sec/watt | 9.64 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | L40
Megatron BERT Large QAT | 8 | 3,594 sequences/sec | 13 sequences/sec/watt | 2.23 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40
Megatron BERT Large QAT | 128 | 4,335 sequences/sec | 14 sequences/sec/watt | 29.52 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40
QuartzNet | 8 | 7,208 samples/sec | 31 samples/sec/watt | 1.11 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40
QuartzNet | 128 | 17,638 samples/sec | 59 samples/sec/watt | 7.26 | 1x L40 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L40

1,024x1,024 image size, 50 denoising steps for Stable Diffusion
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

L4 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
Stable Diffusion v2.1 (512x512) | 1 | 0.47 images/sec | - | 2113.07 | 1x L4 | GIGABYTE G482-Z54-00 | - | Mixed | LAION-5B | TensorRT 8.6.0 | L4
ResNet-50 | 8 | 10,172 images/sec | 143 images/sec/watt | 0.79 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4
ResNet-50 | 32 | 10,413 images/sec | 144 images/sec/watt | 3.07 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4
ResNet-50v1.5 | 8 | 9,663 images/sec | 134 images/sec/watt | 0.83 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4
ResNet-50v1.5 | 32 | 10,146 images/sec | 141 images/sec/watt | 3.15 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4
BERT-BASE | 8 | 3,554 sequences/sec | 50 sequences/sec/watt | 2.25 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4
BERT-BASE | 24 | 4,064 sequences/sec | 56 sequences/sec/watt | 5.91 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4
BERT-LARGE | 8 | 1,097 sequences/sec | 15 sequences/sec/watt | 7.29 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4
BERT-LARGE | 12 | 1,293 sequences/sec | 18 sequences/sec/watt | 9.28 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4
EfficientNet-B4 | 8 | 1,817 images/sec | 25 images/sec/watt | 4.4 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4
HF Swin Base | 8 | 1,052 samples/sec | 15 samples/sec/watt | 7.6 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4
HF Swin Large | 8 | 524 samples/sec | 7 samples/sec/watt | 15.28 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4
HF ViT Base | 8 | 1,304 samples/sec | 18 samples/sec/watt | 6.14 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4
HF ViT Large | 8 | 396 samples/sec | 5 samples/sec/watt | 20.21 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | L4
Megatron BERT Large QAT | 8 | 1,516 sequences/sec | 22 sequences/sec/watt | 5.28 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4
QuartzNet | 8 | 4,523 samples/sec | 63 samples/sec/watt | 1.77 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4
QuartzNet | 128 | 6,019 samples/sec | 84 samples/sec/watt | 21.26 | 1x L4 | GIGABYTE G482-Z54-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | L4

512x512 image size, 50 denoising steps for Stable Diffusion
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

 

A40 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 11,504 images/sec | 40 images/sec/watt | 0.7 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
ResNet-50 | 109 | 15,968 images/sec | - | 6.89 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
ResNet-50 | 128 | 15,932 images/sec | 53 images/sec/watt | 8.03 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
ResNet-50v1.5 | 8 | 11,011 images/sec | 38 images/sec/watt | 0.73 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
ResNet-50v1.5 | 106 | 15,458 images/sec | - | 6.92 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
ResNet-50v1.5 | 128 | 15,294 images/sec | 51 images/sec/watt | 8.37 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
BERT-BASE | 8 | 4,355 sequences/sec | 15 sequences/sec/watt | 1.84 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
BERT-BASE | 128 | 5,622 sequences/sec | 19 sequences/sec/watt | 22.77 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
BERT-LARGE | 8 | 1,572 sequences/sec | 5 sequences/sec/watt | 5.09 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
BERT-LARGE | 128 | 1,937 sequences/sec | 6 sequences/sec/watt | 66.08 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
EfficientNet-B0 | 8 | 10,848 images/sec | 56 images/sec/watt | 0.74 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
EfficientNet-B0 | 128 | 19,747 images/sec | 66 images/sec/watt | 6.48 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
EfficientNet-B0 | 136 | 19,875 images/sec | - | 6.89 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
EfficientNet-B4 | 8 | 2,129 images/sec | 7 images/sec/watt | 3.76 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
EfficientNet-B4 | 15 | 2,314 images/sec | - | 6.91 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
EfficientNet-B4 | 128 | 2,624 images/sec | 9 images/sec/watt | 48.78 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
HF Swin Base | 8 | 1,410 samples/sec | 5 samples/sec/watt | 5.67 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | A40
HF Swin Base | 32 | 1,425 samples/sec | 5 samples/sec/watt | 22.46 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | A40
HF Swin Large | 8 | 802 samples/sec | 3 samples/sec/watt | 9.97 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
HF Swin Large | 32 | 819 samples/sec | 3 samples/sec/watt | 39.07 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
HF ViT Base | 8 | 2,129 samples/sec | 7 samples/sec/watt | 3.76 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | A40
HF ViT Base | 64 | 2,152 samples/sec | 7 samples/sec/watt | 29.74 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
HF ViT Large | 8 | 680 samples/sec | 2 samples/sec/watt | 11.77 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | A40
HF ViT Large | 64 | 702 samples/sec | 2 samples/sec/watt | 91.12 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
Megatron BERT Large QAT | 8 | 2,094 sequences/sec | 7 sequences/sec/watt | 3.82 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
Megatron BERT Large QAT | 128 | 2,655 sequences/sec | 9 sequences/sec/watt | 48.21 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
QuartzNet | 8 | 4,383 samples/sec | 18 samples/sec/watt | 1.83 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40
QuartzNet | 128 | 8,340 samples/sec | 28 samples/sec/watt | 15.35 | 1x A40 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A40

BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

 

A30 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
ResNet-50 | 8 | 10,394 images/sec | 74 images/sec/watt | 0.77 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
ResNet-50 | 116 | 17,040 images/sec | - | 6.87 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
ResNet-50 | 128 | 17,303 images/sec | 105 images/sec/watt | 7.4 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
ResNet-50v1.5 | 8 | 10,144 images/sec | 71 images/sec/watt | 0.79 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
ResNet-50v1.5 | 111 | 16,294 images/sec | - | 6.87 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
ResNet-50v1.5 | 128 | 16,680 images/sec | 101 images/sec/watt | 7.67 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
BERT-BASE | 1 | For batch size 1, refer to the Triton Inference Server page
BERT-BASE | 2 | For batch size 2, refer to the Triton Inference Server page
BERT-BASE | 8 | 4,309 sequences/sec | 26 sequences/sec/watt | 1.86 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
BERT-BASE | 128 | 5,763 sequences/sec | 35 sequences/sec/watt | 22.21 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
BERT-LARGE | 1 | For batch size 1, refer to the Triton Inference Server page
BERT-LARGE | 2 | For batch size 2, refer to the Triton Inference Server page
BERT-LARGE | 8 | 1,494 sequences/sec | 9 sequences/sec/watt | 5.36 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
BERT-LARGE | 128 | 2,057 sequences/sec | 13 sequences/sec/watt | 62.23 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
EfficientNet-B0 | 8 | 8,889 images/sec | 78 images/sec/watt | 0.9 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
EfficientNet-B0 | 116 | 16,979 images/sec | - | 6.89 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
EfficientNet-B0 | 128 | 17,101 images/sec | 104 images/sec/watt | 7.48 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
EfficientNet-B4 | 8 | 1,871 images/sec | 12 images/sec/watt | 4.28 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
EfficientNet-B4 | 14 | 2,099 images/sec | - | 7.15 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
EfficientNet-B4 | 128 | 2,389 images/sec | 15 images/sec/watt | 53.58 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
HF Swin Base | 8 | 1,339 samples/sec | 8 samples/sec/watt | 5.98 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
HF Swin Base | 32 | 1,462 samples/sec | 9 samples/sec/watt | 21.89 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
HF Swin Large | 8 | 762 samples/sec | 5 samples/sec/watt | 10.5 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
HF Swin Large | 32 | 786 samples/sec | 5 samples/sec/watt | 40.73 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | A30
HF ViT Base | 8 | 2,016 samples/sec | 12 samples/sec/watt | 3.97 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | A30
HF ViT Base | 64 | 2,177 samples/sec | 13 samples/sec/watt | 29.4 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
HF ViT Large | 8 | 645 samples/sec | 4 samples/sec/watt | 12.4 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
HF ViT Large | 64 | 692 samples/sec | 4 samples/sec/watt | 92.42 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | Mixed | Synthetic | TensorRT 8.6.1 | A30
Megatron BERT Large QAT | 8 | 1,805 sequences/sec | 13 sequences/sec/watt | 4.43 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
Megatron BERT Large QAT | 128 | 2,753 sequences/sec | 17 sequences/sec/watt | 46.49 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
QuartzNet | 8 | 3,409 samples/sec | 30 samples/sec/watt | 2.35 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30
QuartzNet | 128 | 9,875 samples/sec | 72 samples/sec/watt | 12.96 | 1x A30 | GIGABYTE G482-Z52-00 | 23.08-py3 | INT8 | Synthetic | TensorRT 8.6.1 | A30

512x512 image size, 50 denoising steps for Stable Diffusion
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
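Because the efficiency column is throughput per watt, dividing a row's throughput by its efficiency recovers the approximate board power draw during that run. A minimal sketch (the helper name is ours; the sample values come from the EfficientNet-B0 batch-128 row above):

```python
def implied_power_watts(throughput: float, efficiency: float) -> float:
    """Estimate average board power from a benchmark row.

    efficiency is throughput per watt, so power = throughput / efficiency.
    """
    return throughput / efficiency

# EfficientNet-B0, batch 128 on A30: 17,101 images/sec at 104 images/sec/watt
print(round(implied_power_watts(17_101, 104)))  # ~164 W
```

The same arithmetic applies to any row with a populated efficiency column; rows showing "-" were not power-measured.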

 

A30 1/4 MIG Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
ResNet-50 8 4,020 images/sec 47 images/sec/watt 1.99 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30
32 4,634 images/sec - images/sec/watt 7.12 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30
128 4,803 images/sec 53 images/sec/watt 26.65 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30
ResNet-50v1.5 8 3,877 images/sec 47 images/sec/watt 2.06 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30
31 4,487 images/sec - images/sec/watt 7.13 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30
128 4,624 images/sec 53 images/sec/watt 27.68 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30
BERT-BASE 8 1,580 sequences/sec 17 sequences/sec/watt 5.06 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30
128 1,711 sequences/sec 18 sequences/sec/watt 74.83 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30
BERT-LARGE 8 519 sequences/sec 6 sequences/sec/watt 15.4 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30
128 593 sequences/sec 6 sequences/sec/watt 215.95 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30

Sequence length=128 for BERT-BASE and BERT-LARGE

 

A30 4 MIG Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
ResNet-50 8 15,416 images/sec 94 images/sec/watt 2.08 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30
29 17,341 images/sec - images/sec/watt 6.94 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30
128 18,166 images/sec 111 images/sec/watt 28.26 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30
ResNet-50v1.5 8 14,883 images/sec 90 images/sec/watt 2.16 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30
28 16,630 images/sec - images/sec/watt 7 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30
128 17,528 images/sec 106 images/sec/watt 29.28 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30
BERT-BASE 8 5,728 sequences/sec 35 sequences/sec/watt 5.69 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30
128 6,025 sequences/sec 37 sequences/sec/watt 86.72 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30
BERT-LARGE 8 1,889 sequences/sec 11 sequences/sec/watt 17.06 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30
128 2,094 sequences/sec 13 sequences/sec/watt 245.59 1x A30 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A30

Sequence length=128 for BERT-BASE and BERT-LARGE
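Comparing the 1/4 MIG table with the 4 MIG table above shows how close MIG partitioning comes to linear scaling: four concurrently busy slices deliver nearly four times one slice's throughput. A quick sketch (function name is ours; values come from the ResNet-50 batch-128 rows in the two tables):

```python
def mig_scaling_efficiency(single_mig: float, all_migs: float, n: int = 4) -> float:
    """Aggregate throughput of n busy MIG slices vs. n x one slice's throughput."""
    return all_migs / (n * single_mig)

# ResNet-50, batch 128 on A30: 4,803 images/sec on one 1/4 MIG slice,
# 18,166 images/sec with all four slices running concurrently
print(round(mig_scaling_efficiency(4_803, 18_166), 3))  # ~0.946, i.e. ~95% of linear
```

The small gap from 1.0 reflects shared resources (memory bandwidth, power budget) across slices.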

 

A10 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
ResNet-50 8 8,784 images/sec 59 images/sec/watt 0.91 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
77 11,346 images/sec - images/sec/watt 6.87 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
128 11,331 images/sec 76 images/sec/watt 11.3 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
ResNet-50v1.5 8 8,387 images/sec 56 images/sec/watt 0.95 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
70 10,368 images/sec - images/sec/watt 7.04 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
128 10,766 images/sec 72 images/sec/watt 11.89 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
BERT-BASE 1 For Batch Size 1, please refer to the Triton Inference Server page
2 For Batch Size 2, please refer to the Triton Inference Server page
8 3,287 sequences/sec 22 sequences/sec/watt 2.43 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
128 3,842 sequences/sec 26 sequences/sec/watt 33.32 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
BERT-LARGE 1 For Batch Size 1, please refer to the Triton Inference Server page
2 For Batch Size 2, please refer to the Triton Inference Server page
8 1,136 sequences/sec 8 sequences/sec/watt 7.04 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
128 1,271 sequences/sec 9 sequences/sec/watt 100.72 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
EfficientNet-B0 8 9,401 images/sec 63 images/sec/watt 0.85 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
128 14,246 images/sec 97 images/sec/watt 8.98 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
EfficientNet-B4 8 1,601 images/sec 11 images/sec/watt 5 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
128 1,868 images/sec 12 images/sec/watt 68.53 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
HF Swin Base 8 1,029 samples/sec 7 samples/sec/watt 7.77 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
32 1,013 samples/sec 7 samples/sec/watt 31.57 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
HF Swin Large 8 546 samples/sec 4 samples/sec/watt 14.65 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
32 563 samples/sec 4 samples/sec/watt 56.83 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
HF ViT Base 8 1,382 samples/sec 9 samples/sec/watt 5.79 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
64 1,433 samples/sec 10 samples/sec/watt 44.68 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
HF ViT Large 8 463 samples/sec 3 samples/sec/watt 17.28 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
64 444 samples/sec 3 samples/sec/watt 144.07 1x A10 GIGABYTE G482-Z52-00 23.08-py3 Mixed Synthetic TensorRT 8.6.1 A10
Megatron BERT Large QAT 8 1,571 sequences/sec 11 sequences/sec/watt 5.09 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
128 1,839 sequences/sec 12 sequences/sec/watt 69.62 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
QuartzNet 8 3,968 samples/sec 27 samples/sec/watt 2.02 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10
128 5,828 samples/sec 39 samples/sec/watt 21.96 1x A10 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A10

512x512 image size, 50 denoising steps for Stable Diffusion
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

 

A2 Inference Performance

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
ResNet-50 8 2,741 images/sec 46 images/sec/watt 2.92 1x A2 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A2
20 3,012 images/sec - images/sec/watt 6.97 1x A2 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A2
128 3,137 images/sec 52 images/sec/watt 40.81 1x A2 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A2
ResNet-50v1.5 8 2,646 images/sec 44 images/sec/watt 3.02 1x A2 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A2
19 2,889 images/sec - images/sec/watt 6.97 1x A2 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A2
128 3,002 images/sec 50 images/sec/watt 42.64 1x A2 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A2
BERT-BASE 8 920 sequences/sec 15 sequences/sec/watt 8.69 1x A2 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A2
128 986 sequences/sec 16 sequences/sec/watt 129.85 1x A2 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A2
BERT-LARGE 8 297 sequences/sec 5 sequences/sec/watt 26.91 1x A2 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A2
128 318 sequences/sec 5 sequences/sec/watt 402.7 1x A2 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A2
EfficientNet-B0 8 3,189 images/sec 61 images/sec/watt 2.51 1x A2 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A2
25 3,714 images/sec - images/sec/watt 7 1x A2 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A2
128 3,917 images/sec 65 images/sec/watt 32.68 1x A2 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A2
EfficientNet-B4 8 480 images/sec 8 images/sec/watt 16.68 1x A2 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A2
128 521 images/sec 9 images/sec/watt 245.87 1x A2 GIGABYTE G482-Z52-00 23.08-py3 INT8 Synthetic TensorRT 8.6.1 A2

Sequence length=128 for BERT-BASE and BERT-LARGE
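The A2 table also illustrates how a low-power GPU saturates at small batch sizes: on BERT-LARGE, growing the batch 16x (8 to 128) buys only ~7% more throughput while average latency grows ~15x. A quick check using the two BERT-LARGE rows above (variable names are ours):

```python
# BERT-LARGE on A2, from the table above: batch 8 vs. batch 128
throughput_gain = 318 / 297    # sequences/sec ratio, batch 128 vs. batch 8
latency_cost = 402.7 / 26.91   # latency (ms) ratio, batch 128 vs. batch 8

print(round(throughput_gain, 2))  # ~1.07x more throughput
print(round(latency_cost, 1))     # ~15.0x more latency
```

For latency-sensitive serving on small GPUs, this ratio argues for modest batch sizes.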

NVIDIA Client Batch Size 1 and 2 Performance with Triton Inference Server

A100 Triton Inference Server Performance

Network Accelerator Model Format Framework Backend Precision Model Instances on Triton Client Batch Size Dynamic Batch Size (Triton) Number of Concurrent Client Requests Latency (ms) Throughput Sequence/Input Length Triton Container Version
BERT Large Inference A100-SXM4-40GB tensorrt TensorRT Mixed 2 1 1 24 32.212 745 inf/sec 384 23.07-py3
BERT Large Inference A100-SXM4-40GB tensorrt TensorRT Mixed 4 2 1 24 59.367 808 inf/sec 384 23.07-py3
BERT Base Inference A100-SXM4-80GB tensorrt TensorRT Mixed 4 1 1 20 3.464 5,772 inf/sec 128 23.07-py3
BERT Base Inference A100-SXM4-40GB tensorrt TensorRT Mixed 4 2 1 24 7.222 6,645 inf/sec 128 23.07-py3
DLRM Inference A100-SXM4-40GB ts-trace PyTorch Mixed 2 1 65,536 30 1.167 25,694 inf/sec - 23.07-py3
DLRM Inference A100-SXM4-80GB ts-trace PyTorch Mixed 2 2 65,536 30 1.212 49,484 inf/sec - 23.07-py3
ResNet-50 v1.5 A100-SXM4-80GB tensorrt PyTorch Mixed 4 1 128 512 31.198 16,400 inf/sec - 23.07-py3
ResNet-50 v1.5 A100-SXM4-80GB tensorrt PyTorch Mixed 4 2 128 384 43.656 17,591 inf/sec - 23.07-py3
BERT Large Inference A100-PCIE-80GB tensorrt TensorRT Mixed 2 1 1 24 36.777 652 inf/sec 384 23.07-py3
BERT Large Inference A100-PCIE-80GB tensorrt TensorRT Mixed 4 2 1 24 65.889 728 inf/sec 384 23.07-py3
BERT Base Inference A100-PCIE-80GB tensorrt TensorRT Mixed 2 1 1 24 4.559 5,262 inf/sec 128 23.07-py3
BERT Base Inference A100-PCIE-80GB tensorrt TensorRT Mixed 4 2 1 24 7.79 6,161 inf/sec 128 23.07-py3
DLRM Inference A100-PCIE-80GB ts-trace PyTorch Mixed 2 1 65,536 30 1.184 25,324 inf/sec - 23.07-py3
DLRM Inference A100-PCIE-80GB ts-trace PyTorch Mixed 2 2 65,536 30 1.112 53,917 inf/sec - 23.07-py3
ResNet-50 v1.5 A100-PCIE-80GB tensorrt PyTorch Mixed 4 1 128 512 32.524 15,729 inf/sec - 23.07-py3
ResNet-50 v1.5 A100-PCIE-80GB tensorrt PyTorch Mixed 2 2 128 512 57.479 17,795 inf/sec - 23.07-py3
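The Triton rows are internally consistent with Little's law: the number of requests in flight equals throughput times average latency. Checking the first BERT Large row recovers the listed client concurrency (the helper name is ours):

```python
def implied_concurrency(throughput_inf_per_s: float, latency_s: float) -> float:
    """Little's law: requests in flight = arrival rate x time in system."""
    return throughput_inf_per_s * latency_s

# BERT Large on A100-SXM4-40GB: 745 inf/sec at 32.212 ms average latency
print(round(implied_concurrency(745, 0.032212)))  # ~24 concurrent client requests
```

The same check works on any row in these tables, which is a useful sanity test when reproducing the numbers with your own load generator.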

A30 Triton Inference Server Performance

Network Accelerator Model Format Framework Backend Precision Model Instances on Triton Client Batch Size Dynamic Batch Size (Triton) Number of Concurrent Client Requests Latency (ms) Throughput Sequence/Input Length Triton Container Version
BERT Large Inference A30 tensorrt TensorRT Mixed 2 1 1 20 66.853 359 inf/sec 384 23.07-py3
BERT Large Inference A30 tensorrt TensorRT Mixed 2 2 1 20 105.347 380 inf/sec 384 23.07-py3
BERT Base Inference A30 tensorrt TensorRT Mixed 2 1 1 24 7.369 3,256 inf/sec 128 23.07-py3
BERT Base Inference A30 tensorrt TensorRT Mixed 4 2 1 24 13.428 3,574 inf/sec 128 23.07-py3
ResNet-50 v1.5 A30 tensorrt PyTorch Mixed 2 1 128 512 56.456 9,061 inf/sec - 23.07-py3
ResNet-50 v1.5 A30 tensorrt PyTorch Mixed 2 2 128 512 113.571 9,004 inf/sec - 23.07-py3

A10 Triton Inference Server Performance

Network Accelerator Model Format Framework Backend Precision Model Instances on Triton Client Batch Size Dynamic Batch Size (Triton) Number of Concurrent Client Requests Latency (ms) Throughput Sequence/Input Length Triton Container Version
BERT Large Inference A10 tensorrt TensorRT Mixed 4 1 1 24 101.066 237 inf/sec 384 23.07-py3
BERT Large Inference A10 tensorrt TensorRT Mixed 4 2 1 24 195.715 245 inf/sec 384 23.07-py3
BERT Base Inference A10 tensorrt TensorRT Mixed 2 1 1 24 10.726 2,237 inf/sec 128 23.07-py3
BERT Base Inference A10 tensorrt TensorRT Mixed 2 2 1 20 16.638 2,404 inf/sec 128 23.07-py3
ResNet-50 v1.5 A10 tensorrt PyTorch Mixed 2 1 128 512 87.367 5,855 inf/sec - 23.07-py3
ResNet-50 v1.5 A10 tensorrt PyTorch Mixed 2 2 128 384 131.146 5,850 inf/sec - 23.07-py3

Inference Performance of NVIDIA GPUs in the Cloud

A100 Inference Performance in the Cloud

Network Batch Size Throughput Efficiency Latency (ms) GPU Server Container Precision Dataset Framework GPU Version
ResNet-50v1.5 8 13,486 images/sec - images/sec/watt 0.59 1x A100 GCP A2-HIGHGPU-1G 23.05-py3 INT8 Synthetic - A100-SXM4-40GB
128 30,517 images/sec - images/sec/watt 4.19 1x A100 GCP A2-HIGHGPU-1G 23.05-py3 INT8 Synthetic - A100-SXM4-40GB
8 13,673 images/sec - images/sec/watt 0.59 1x A100 AWS EC2 p4d.24xlarge 23.07-py3 INT8 Synthetic - A100-SXM4-40GB
128 30,517 images/sec - images/sec/watt 4.19 1x A100 AWS EC2 p4d.24xlarge 23.07-py3 INT8 Synthetic - A100-SXM4-40GB
8 13,733 images/sec - images/sec/watt 0.58 1x A100 Azure Standard_ND96amsr_A100_v4 23.06-py3 INT8 Synthetic - A100-SXM4-80GB
128 32,513 images/sec - images/sec/watt 3.94 1x A100 Azure Standard_ND96amsr_A100_v4 23.06-py3 INT8 Synthetic - A100-SXM4-80GB
BERT-LARGE 8 2,326 sequences/sec - sequences/sec/watt 3.44 1x A100 GCP A2-HIGHGPU-1G 23.05-py3 INT8 Synthetic - A100-SXM4-40GB
128 4,055 sequences/sec - sequences/sec/watt 31.57 1x A100 GCP A2-HIGHGPU-1G 23.05-py3 INT8 Synthetic - A100-SXM4-40GB
8 2,349 sequences/sec - sequences/sec/watt 3.41 1x A100 Azure Standard_ND96amsr_A100_v4 23.06-py3 INT8 Synthetic - A100-SXM4-80GB
128 4,177 sequences/sec - sequences/sec/watt 30.64 1x A100 Azure Standard_ND96amsr_A100_v4 23.06-py3 INT8 Synthetic - A100-SXM4-80GB
8 2,318 sequences/sec - sequences/sec/watt 3.45 1x A100 AWS EC2 p4d.24xlarge 23.07-py3 INT8 Synthetic - A100-SXM4-80GB
128 4,037 sequences/sec - sequences/sec/watt 31.7 1x A100 AWS EC2 p4d.24xlarge 23.07-py3 INT8 Synthetic - A100-SXM4-80GB

BERT-Large: Sequence Length = 128
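One takeaway from the cloud table is how little the hosting provider matters for a given GPU: the two A100-SXM4-40GB instances (GCP and AWS) post identical ResNet-50v1.5 numbers, and the Azure 80GB part leads by only a few percent. A quick comparison using the batch-128 rows above (variable names are ours; note the Azure figure is the 80GB variant, so it is not a strict like-for-like comparison):

```python
# ResNet-50v1.5, batch 128, from the cloud table above
gcp_40gb = 30_517    # images/sec, GCP A2-HIGHGPU-1G (A100-SXM4-40GB)
aws_40gb = 30_517    # images/sec, AWS p4d.24xlarge (A100-SXM4-40GB)
azure_80gb = 32_513  # images/sec, Azure Standard_ND96amsr_A100_v4 (A100-SXM4-80GB)

assert gcp_40gb == aws_40gb  # same silicon, identical measured throughput
print(round(azure_80gb / gcp_40gb, 3))  # ~1.065: the 80GB part is ~6.5% ahead
```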

View More Performance Data

Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Training to convergence is the most rigorous test of whether an AI system is ready to be deployed in the field and deliver meaningful results.

Learn More

AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.

Learn More