AI Training
Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. Time to train to that accuracy is the best measure of whether an AI system is ready to be deployed in the field and deliver meaningful results.
NVIDIA Performance on MLPerf 3.1 Training Benchmarks
NVIDIA Performance on MLPerf 3.1’s AI Benchmarks: Single Node, Closed Division
Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
---|---|---|---|---|---|---|---|---|---
NeMo | Stable Diffusion | 46.8 | FID<=90 and CLIP>=0.15 | 8x H100 | XE9680x8H100-SXM-80GB | 3.1-2019 | Mixed | LAION-400M-filtered | H100-SXM5-80GB
MXNet | ResNet-50 v1.5 | 13.4 | 75.90% classification | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | ImageNet | H100-SXM5-80GB
 | 3D U-Net | 13.1 | 0.908 Mean DICE score | 8x H100 | AS-8125GS-TNHR | 3.1-2068 | Mixed | KiTS19 | H100-SXM5-80GB
PyTorch | BERT | 5.4 | 0.72 Mask-LM accuracy | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB
 | Mask R-CNN | 19.2 | 0.377 Box min AP and 0.339 Mask min AP | 8x H100 | Eos_n1 | 3.1-2048 | Mixed | COCO2017 | H100-SXM5-80GB
 | RNN-T | 16.2 | 0.058 Word Error Rate | 8x H100 | GIGABYTE G593-ZD2 | 3.1-2028 | Mixed | LibriSpeech | H100-SXM5-80GB
 | RetinaNet | 36.0 | 34.0% mAP | 8x H100 | ESC-N8-E11 | 3.1-2011 | Mixed | A subset of OpenImages | H100-SXM5-80GB
NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 3.9 | 0.80275 AUC | 8x H100 | Eos_n1 | 3.1-2047 | Mixed | Criteo 4TB | H100-SXM5-80GB
NVIDIA Performance on MLPerf 3.1’s AI Benchmarks: Multi Node, Closed Division
Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
---|---|---|---|---|---|---|---|---|---
NVIDIA NeMo | GPT3 | 58.3 | 2.69 log perplexity | 512x H100 | Eos_n64 | 3.1-2057 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
 | | 40.6 | 2.69 log perplexity | 768x H100 | Eos_n96 | 3.1-2065 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
 | | 8.6 | 2.69 log perplexity | 4,096x H100 | Eos-dfw_n512 | 3.1-2008 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
 | | 6.0 | 2.69 log perplexity | 6,144x H100 | Eos-dfw_n768 | 3.1-2009 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
 | | 4.9 | 2.69 log perplexity | 8,192x H100 | Eos-dfw_n1024 | 3.1-2005 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
 | | 4.1 | 2.69 log perplexity | 10,240x H100 | Eos-dfw_n1280 | 3.1-2006 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
 | | 3.9 | 2.69 log perplexity | 10,752x H100 | Eos-dfw_n1344 | 3.1-2007 | Mixed | c4/en/3.0.1 | H100-SXM5-80GB
 | Stable Diffusion | 10.0 | FID<=90 and CLIP>=0.15 | 64x H100 | Eos_n8 | 3.1-2060 | Mixed | LAION-400M-filtered | H100-SXM5-80GB
 | | 2.9 | FID<=90 and CLIP>=0.15 | 512x H100 | Eos_n64 | 3.1-2055 | Mixed | LAION-400M-filtered | H100-SXM5-80GB
 | | 2.5 | FID<=90 and CLIP>=0.15 | 1,024x H100 | Eos_n128 | 3.1-2050 | Mixed | LAION-400M-filtered | H100-SXM5-80GB
MXNet | ResNet-50 v1.5 | 2.5 | 75.90% classification | 64x H100 | Eos_n8 | 3.1-2058 | Mixed | ImageNet | H100-SXM5-80GB
 | | 0.2 | 75.90% classification | 3,584x H100 | coreweave_hgxh100_n448_ngc23.04_mxnet | 3.1-2010 | Mixed | ImageNet | H100-SXM5-80GB
 | 3D U-Net | 1.9 | 0.908 Mean DICE score | 72x H100 | Eos_n9 | 3.1-2063 | Mixed | KiTS19 | H100-SXM5-80GB
 | | 0.8 | 0.908 Mean DICE score | 768x H100 | Eos_n96 | 3.1-2064 | Mixed | KiTS19 | H100-SXM5-80GB
PyTorch | BERT | 0.9 | 0.72 Mask-LM accuracy | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB
 | | 0.1 | 0.72 Mask-LM accuracy | 3,472x H100 | Eos_n434 | 3.1-2053 | Mixed | Wikipedia 2020/01/01 | H100-SXM5-80GB
 | Mask R-CNN | 4.3 | 0.377 Box min AP and 0.339 Mask min AP | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | COCO2017 | H100-SXM5-80GB
 | | 1.5 | 0.377 Box min AP and 0.339 Mask min AP | 384x H100 | Eos_n48 | 3.1-2054 | Mixed | COCO2017 | H100-SXM5-80GB
 | RNN-T | 4.2 | 0.058 Word Error Rate | 64x H100 | Eos_n8 | 3.1-2061 | Mixed | LibriSpeech | H100-SXM5-80GB
 | | 1.7 | 0.058 Word Error Rate | 512x H100 | Eos_n64 | 3.1-2056 | Mixed | LibriSpeech | H100-SXM5-80GB
 | RetinaNet | 6.1 | 34.0% mAP | 64x H100 | Eos_n8 | 3.1-2062 | Mixed | A subset of OpenImages | H100-SXM5-80GB
 | | 0.9 | 34.0% mAP | 2,048x H100 | Eos_n256 | 3.1-2052 | Mixed | A subset of OpenImages | H100-SXM5-80GB
NVIDIA Merlin HugeCTR | DLRM-dcnv2 | 1.4 | 0.80275 AUC | 64x H100 | Eos_n8 | 3.1-2059 | Mixed | Criteo 4TB | H100-SXM5-80GB
 | | 1.0 | 0.80275 AUC | 128x H100 | Eos_n16 | 3.1-2051 | Mixed | Criteo 4TB | H100-SXM5-80GB
MLPerf™ v3.1 Training Closed: 3.1-2005, 3.1-2006, 3.1-2007, 3.1-2008, 3.1-2009, 3.1-2010, 3.1-2011, 3.1-2019, 3.1-2028, 3.1-2047, 3.1-2048, 3.1-2050, 3.1-2051, 3.1-2052, 3.1-2053, 3.1-2054, 3.1-2055, 3.1-2056, 3.1-2057, 3.1-2058, 3.1-2059, 3.1-2060, 3.1-2061, 3.1-2062, 3.1-2063, 3.1-2064, 3.1-2065, 3.1-2068 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
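The multi-node GPT3 rows also show how time to train scales with cluster size. As a back-of-the-envelope illustration (not an official MLPerf metric), the sketch below computes strong-scaling efficiency relative to the 512-GPU result, using only numbers from the table above:

```python
# Strong-scaling efficiency of the GPT3 time-to-train results,
# relative to the 512-GPU baseline (values taken from the table).
baseline_gpus, baseline_minutes = 512, 58.3

runs = [(768, 40.6), (4_096, 8.6), (6_144, 6.0),
        (8_192, 4.9), (10_240, 4.1), (10_752, 3.9)]

for gpus, minutes in runs:
    ideal_speedup = gpus / baseline_gpus
    actual_speedup = baseline_minutes / minutes
    efficiency = actual_speedup / ideal_speedup
    print(f"{gpus:>6} GPUs: {actual_speedup:5.2f}x vs ideal "
          f"{ideal_speedup:5.2f}x ({efficiency:.0%} efficiency)")
```

For example, growing from 512 to 8,192 GPUs (16x) cuts time to train by about 11.9x, i.e. roughly 74% scaling efficiency.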
NVIDIA Performance on MLPerf 3.0’s Training HPC Benchmarks: Closed Division
Framework | Network | Time to Train (mins) | MLPerf Quality Target | GPU | Server | MLPerf-ID | Precision | Dataset | GPU Version
---|---|---|---|---|---|---|---|---|---
PyTorch | CosmoFlow | 2.1 | Mean average error 0.124 | 512x H100 | eos | 3.0-8006 | Mixed | CosmoFlow N-body cosmological simulation data with 4 cosmological parameter targets | H100-SXM5-80GB
 | DeepCAM | 0.8 | IOU 0.82 | 2,048x H100 | eos | 3.0-8007 | Mixed | CAM5+TECA climate simulation with 3 target classes (atmospheric river, tropical cyclone, background) | H100-SXM5-80GB
 | OpenCatalyst | 10.7 | Forces mean absolute error 0.036 | 640x H100 | eos | 3.0-8008 | Mixed | Open Catalyst 2020 (OC20) S2EF 2M training split, ID validation set | H100-SXM5-80GB
 | OpenFold | 7.5 | Local Distance Difference Test (lDDT-Cα) >= 0.8 | 2,080x H100 | eos | 3.0-8009 | Mixed | OpenProteinSet and Protein Data Bank | H100-SXM5-80GB
MLPerf™ v3.0 Training HPC Closed: 3.0-8006, 3.0-8007, 3.0-8008, 3.0-8009 | MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
LLM Training Performance on NVIDIA Data Center Products
H100 Training Performance
Framework | Framework Version | Network | Time to Train (days) | Throughput per GPU | GPU | Server | Container | Sequence Length | TP | PP | Precision | Global Batch Size | GPU Version
---|---|---|---|---|---|---|---|---|---|---|---|---|---
NeMo | 1.23 | GPT3 5B | 0.5 | 23,574 tokens/sec | 64x H100 | Eos | nemo:24.03 | 2,048 | 1 | 1 | FP8 | 2,048 | H100 SXM5 80GB
 | 1.23 | GPT3 20B | 2 | 5,528 tokens/sec | 64x H100 | Eos | nemo:24.03 | 2,048 | 2 | 1 | FP8 | 256 | H100 SXM5 80GB
 | 1.23 | Llama2 7B | 0.7 | 16,290 tokens/sec | 8x H100 | Eos | nemo:24.03 | 4,096 | 1 | 1 | FP8 | 128 | H100 SXM5 80GB
 | 1.23 | Llama2 13B | 1.4 | 8,317 tokens/sec | 16x H100 | Eos | nemo:24.03 | 4,096 | 1 | 4 | FP8 | 128 | H100 SXM5 80GB
 | 1.23 | Llama2 70B | 6.6 | 1,725 tokens/sec | 64x H100 | Eos | nemo:24.03 | 4,096 | 4 | 4 | FP8 | 128 | H100 SXM5 80GB
TP: Tensor Parallelism
PP: Pipeline Parallelism
Time to Train is the estimated time to train on 1T tokens with 1K GPUs.
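The Time to Train estimate appears to follow directly from the per-GPU throughput column, assuming "1K GPUs" means 1,024 and perfectly linear scaling of the per-GPU token rate; a minimal sketch under those assumptions reproduces the table's values:

```python
# Reproduce the estimated "Time to Train (days)" from per-GPU throughput.
# Assumptions: "1K GPUs" = 1,024, and the per-GPU tokens/sec from the
# table scales linearly to the full cluster.
TOKENS = 1e12            # 1T-token training run
GPUS = 1024
SECONDS_PER_DAY = 86_400

per_gpu_throughput = {   # tokens/sec per GPU, from the table
    "GPT3 5B":    23_574,
    "GPT3 20B":    5_528,
    "Llama2 7B":  16_290,
    "Llama2 13B":  8_317,
    "Llama2 70B":  1_725,
}

for name, tps in per_gpu_throughput.items():
    days = TOKENS / (tps * GPUS * SECONDS_PER_DAY)
    print(f"{name:<11} ~{days:.1f} days")
# Matches the table: 0.5, 2.0, 0.7, 1.4, and 6.6 days respectively.
```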
Converged Training Performance on NVIDIA Data Center GPUs
H100 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
---|---|---|---|---|---|---|---|---|---|---|---|---
PyTorch | 2.3.0a0 | Tacotron2 | 67 | .56 Training Loss | 469,109 total output mels/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | LJSpeech 1.1 | H100 SXM5 80GB
 | 2.3.0a0 | WaveGlow | 119 | -5.8 Training Loss | 3,645,916 output samples/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 10 | LJSpeech 1.1 | H100 SXM5 80GB
 | 2.3.0a0 | GNMT v2 | 9 | 24.15 BLEU Score | 1,699,570 total tokens/sec | 8x H100 | DGX H100 | 23.12-py3 | Mixed | 128 | wmt16-en-de | H100 SXM5 80GB
 | 2.3.0a0 | NCF | 0.27 | .96 Hit Rate at 10 | 218,094,053 samples/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 131072 | MovieLens 20M | H100 SXM5 80GB
 | 2.3.0a0 | FastPitch | 75 | .17 Training Loss | 1,331,733 frames/sec | 8x H100 | DGX H100 | 24.02-py3 | TF32 | 32 | LJSpeech 1.1 | H100 SXM5 80GB
 | 2.3.0a0 | Transformer XL Large | 318 | 17.83 Perplexity | 262,462 total tokens/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 16 | WikiText-103 | H100 SXM5 80GB
 | 2.3.0a0 | Transformer XL Base | 141 | 21.61 Perplexity | 952,253 total tokens/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | WikiText-103 | H100 SXM5 80GB
 | 2.3.0a0 | EfficientNet-B4 | 1,667 | 82.02 Top 1 | 5,231 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | Imagenet2012 | H100 SXM5 80GB
 | 2.1.0a0 | EfficientDet-D0 | 325 | .33 BBOX mAP | 2,658 images/sec | 8x H100 | DGX H100 | 23.10-py3 | Mixed | 150 | COCO 2017 | H100 SXM5 80GB
 | 2.3.0a0 | EfficientNet-WideSE-B4 | 1,673 | 82.01 Top 1 | 5,218 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 128 | Imagenet2012 | H100 SXM5 80GB
 | 2.2.0a0 | TFT-Electricity | 2 | .03 Test P90 | 145,082 items/sec | 8x H100 | DGX H100 | 23.12-py3 | Mixed | 1024 | Electricity | H100 SXM5 80GB
 | 2.3.0a0 | HiFiGAN | 948 | 9.42 Training Loss | 115,461 total output mels/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 16 | LJSpeech-1.1 | H100 SXM5 80GB
 | 2.3.0a0 | GPUNet-0 | 1,052 | 78.91 Top 1 | 9,950 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 192 | Imagenet2012 | H100 SXM5 80GB
 | 2.3.0a0 | GPUNet-1 | 960 | 80.45 Top 1 | 10,946 images/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 192 | Imagenet2012 | H100 SXM5 80GB
 | 2.3.0a0 | MoFlow | 35 | 89.67 NUV | 46,451 molecules/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 512 | ZINC | H100 SXM5 80GB
TensorFlow | 2.13.0 | U-Net Medical | 1 | .89 DICE Score | 2,139 images/sec | 8x H100 | DGX H100 | 23.10-py3 | Mixed | 8 | EM segmentation challenge | H100 SXM5 80GB
 | 2.15.0 | Electra Fine Tuning | 2 | 92.59 F1 | 5,062 sequences/sec | 8x H100 | DGX H100 | 24.02-py3 | Mixed | 32 | SQuAD v1.1 | H100 SXM5 80GB
 | 2.13.0 | Wide and Deep | 4 | .66 MAP at 12 | 12,217,033 samples/sec | 8x H100 | DGX H100 | 23.10-py3 | Mixed | 16384 | Kaggle Outbrain Click Prediction | H100 SXM5 80GB
A30 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
---|---|---|---|---|---|---|---|---|---|---|---|---
PyTorch | 2.3.0a0 | Tacotron2 | 131 | .52 Training Loss | 232,954 total output mels/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 104 | LJSpeech 1.1 | A30
 | 2.3.0a0 | WaveGlow | 403 | . Training Loss | 1,042,579 output samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 10 | LJSpeech 1.1 | A30
 | 2.3.0a0 | GNMT v2 | 49 | 24.21 BLEU Score | 309,310 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | wmt16-en-de | A30
 | 2.3.0a0 | NCF | 1 | .96 Hit Rate at 10 | 41,848,626 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 131072 | MovieLens 20M | A30
 | 2.3.0a0 | FastPitch | 156 | .17 Training Loss | 545,724 frames/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | LJSpeech 1.1 | A30
 | 2.3.0a0 | Transformer XL Base | 198 | 22.87 Perplexity | 168,704 total tokens/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 32 | WikiText-103 | A30
 | 2.3.0a0 | EfficientNet-B0 | 793 | 77.13 Top 1 | 11,235 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A30
 | 2.3.0a0 | EfficientNet-WideSE-B0 | 820 | 77.21 Top 1 | 10,863 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A30
 | 2.2.0a0 | MoFlow | 100 | 87.86 NUV | 12,351 molecules/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 512 | ZINC | A30
TensorFlow | 2.13.0 | U-Net Medical | 4 | .89 DICE Score | 460 images/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.10-py3 | Mixed | 8 | EM segmentation challenge | A30
 | 2.15.0 | Electra Fine Tuning | 5 | 92.63 F1 | 1,024 sequences/sec | 8x A30 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | SQuAD v1.1 | A30
 | 2.14.0 | SIM | 1 | .81 AUC | 2,481,945 samples/sec | 8x A30 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 16384 | Amazon Reviews | A30
A10 Training Performance
Framework | Framework Version | Network | Time to Train (mins) | Accuracy | Throughput | GPU | Server | Container | Precision | Batch Size | Dataset | GPU Version
---|---|---|---|---|---|---|---|---|---|---|---|---
PyTorch | 2.3.0a0 | Tacotron2 | 144 | .53 Training Loss | 214,246 total output mels/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 104 | LJSpeech 1.1 | A10
 | 2.3.0a0 | WaveGlow | 541 | -5.73 Training Loss | 776,764 output samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 10 | LJSpeech 1.1 | A10
 | 2.3.0a0 | GNMT v2 | 53 | 24.2 BLEU Score | 282,447 total tokens/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | wmt16-en-de | A10
 | 2.3.0a0 | NCF | 2 | .96 Hit Rate at 10 | 32,920,397 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | TF32 | 131072 | MovieLens 20M | A10
 | 2.3.0a0 | FastPitch | 180 | .17 Training Loss | 460,415 frames/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | LJSpeech 1.1 | A10
 | 2.3.0a0 | EfficientNet-B0 | 1,045 | 77.11 Top 1 | 8,625 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A10
 | 2.3.0a0 | EfficientNet-WideSE-B0 | 1,076 | 77.31 Top 1 | 8,487 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 128 | Imagenet2012 | A10
 | 2.2.0a0 | MoFlow | 93 | 86.86 NUV | 13,184 molecules/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 512 | ZINC | A10
TensorFlow | 2.13.0 | U-Net Medical | 4 | .89 DICE Score | 352 images/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.10-py3 | Mixed | 8 | EM segmentation challenge | A10
 | 2.15.0 | Electra Fine Tuning | 5 | 92.52 F1 | 826 sequences/sec | 8x A10 | GIGABYTE G482-Z52-00 | 24.02-py3 | Mixed | 16 | SQuAD v1.1 | A10
 | 2.14.0 | SIM | 1 | .8 AUC | 2,346,013 samples/sec | 8x A10 | GIGABYTE G482-Z52-00 | 23.12-py3 | Mixed | 16384 | Amazon Reviews | A10
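A few networks appear in all three converged-training tables (H100 SXM5 80GB, A30, A10), which allows a rough cross-GPU comparison. The sketch below computes 8-GPU throughput ratios from the tables' numbers; note that batch sizes and container versions are not identical across GPUs, so these ratios are indicative only:

```python
# Relative 8-GPU training throughput for networks present in all three
# converged-training tables above (values copied from the tables).
throughput = {
    "GNMT v2 (tokens/sec)":       {"H100": 1_699_570, "A30": 309_310, "A10": 282_447},
    "NCF (samples/sec)":          {"H100": 218_094_053, "A30": 41_848_626, "A10": 32_920_397},
    "U-Net Medical (images/sec)": {"H100": 2_139, "A30": 460, "A10": 352},
    "Electra FT (sequences/sec)": {"H100": 5_062, "A30": 1_024, "A10": 826},
}

for net, by_gpu in throughput.items():
    h100 = by_gpu["H100"]
    ratios = ", ".join(f"{gpu} = {rate / h100:.2f}x"
                       for gpu, rate in by_gpu.items() if gpu != "H100")
    print(f"{net}: H100 = 1.00x, {ratios}")
```

For GNMT v2, for instance, the 8x H100 result is roughly 5.5x the 8x A30 throughput and 6.0x the 8x A10 throughput.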
AI Inference
Real-world inferencing demands high throughput and low latency with maximum efficiency across use cases. An industry-leading solution lets customers quickly deploy AI models into real-world production with the highest performance from data center to edge.
AI Pipeline
NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.