Supercharging the World’s Fastest AI Supercomputing Platform on NVIDIA HGX A100 80GB GPUs

Exploding model sizes in deep learning and AI, complex simulations in high-performance computing (HPC), and massive datasets in data analytics all continue to demand faster and more advanced GPUs and platforms.

At SC20, we announced the NVIDIA A100 80GB GPU, the latest addition to the NVIDIA Ampere family, to help developers, researchers, and scientists tackle their toughest challenges. A100 80GB uses faster, higher-capacity HBM2e GPU memory, which substantially increases both GPU memory size and bandwidth. Table 1 shows the key improvements of A100 80GB over the original A100 40GB.

	A100 40GB	A100 80GB	Comment
GPU Memory Size	40 GB HBM2	80 GB HBM2e	2X capacity
GPU Memory Bandwidth	1,555 GB/s	2,039 GB/s	>1.3X higher bandwidth, industry’s first over 2 TB/s
GPU Peak Compute	–	Same as A100 40GB	Seamless portability
Multi-Instance GPU	Up to 7 GPU instances with 5 GB each	Up to 7 GPU instances with 10 GB each	More versatile GPU instances
Node Max GPU Memory (16 GPU Nodes)	640 GB	1,280 GB	For largest datasets
Form Factor	NVIDIA HGX 4 or 8 GPU board	NVIDIA HGX 4 or 8 GPU board	Form-factor compatible

Table 1. A100 80GB and A100 40GB key comparison.
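The ratios quoted in the Comment column can be re-derived from the raw numbers in Table 1. The following is a minimal sketch in Python; the constant and function names are illustrative only, and the values are copied from the table rather than measured.

```python
# Values copied from Table 1; the names below are illustrative, not an NVIDIA API.
BANDWIDTH_GB_PER_S = {"A100 40GB": 1555, "A100 80GB": 2039}
MEMORY_GB = {"A100 40GB": 40, "A100 80GB": 80}
GPUS_PER_NODE = 16  # the "16 GPU Nodes" configuration in Table 1


def bandwidth_ratio() -> float:
    """Memory bandwidth of A100 80GB relative to A100 40GB."""
    return BANDWIDTH_GB_PER_S["A100 80GB"] / BANDWIDTH_GB_PER_S["A100 40GB"]


def node_memory_gb(gpu: str) -> int:
    """Aggregate GPU memory of a 16-GPU node built from the given GPU."""
    return MEMORY_GB[gpu] * GPUS_PER_NODE


if __name__ == "__main__":
    print(f"Bandwidth ratio: {bandwidth_ratio():.2f}x")
    for gpu in MEMORY_GB:
        print(f"{gpu}: {node_memory_gb(gpu)} GB per 16-GPU node")
```

The bandwidth ratio comes out at roughly 1.31, which is where both the ">1.3X" comment and the "over 30%" figure come from.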

Higher GPU memory bandwidth is one of the key enablers for faster compute application performance. With over 30% higher memory bandwidth, many compute applications will see a meaningful application-level gain with A100 80GB. Figure 1 shows the top application speedups enabled by the higher memory bandwidth.

The chart shows that A100 80GB is up to 25% faster than A100 40GB on key applications. — *Figure 1. A100 80GB vs A100 40GB deep learning and HPC applications speedup ratio.*

Putting the extra memory to work

Higher memory bandwidth is only the start. The extra memory can be used to train bigger AI models for better predictions, improve energy efficiency, and run jobs with much higher throughput. Here are a few examples.

Train bigger models to produce a higher-quality result

It is well known that bigger AI models enable higher-quality results. For example, OpenAI shows that few-shot translation quality across language pairs improves as model capacity increases (Figure 2).

Diagram shows translation quality improvements for multiple language pairs (French to English, English to French, and so on) as parameters in the model increase (in billions). — *Figure 2. Translation quality increases as model capacity increases. Source:* OpenAI GPT-3 paper.

To train such a large model with billions of parameters, GPU memory size is critical. On a single NVIDIA HGX A100 40GB 8-GPU machine, you can train a ~10B-parameter model. With the new HGX A100 80GB 8-GPU machine, the capacity doubles, so you can now train a ~20B-parameter model, which enables close to a 10% improvement in translation quality (BLEU). Advanced users can train even larger models with model parallelism across multiple machines.
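The model sizes above imply a rough memory budget per parameter. The sketch below backs that number out of the 40GB figures and applies it to the 80GB machine. It assumes that training memory scales linearly with parameter count, which is a simplification: the real footprint depends on the optimizer, precision, batch size, and activation memory. The function names are hypothetical.

```python
GB = 1e9  # decimal gigabytes, matching the GPU memory sizes quoted in the text


def bytes_per_parameter(num_gpus: int, gpu_memory_gb: float, num_parameters: float) -> float:
    """Memory budget per parameter implied by a (machine, model size) pair."""
    return num_gpus * gpu_memory_gb * GB / num_parameters


def max_parameters(num_gpus: int, gpu_memory_gb: float, bytes_per_param: float) -> float:
    """Largest model that fits, assuming memory grows linearly with parameters."""
    return num_gpus * gpu_memory_gb * GB / bytes_per_param


if __name__ == "__main__":
    # ~10B parameters on 8 x 40 GB, as stated in the text.
    budget = bytes_per_parameter(num_gpus=8, gpu_memory_gb=40, num_parameters=10e9)
    print(f"Implied budget: {budget:.0f} bytes per parameter")

    # Same budget applied to 8 x 80 GB.
    params = max_parameters(num_gpus=8, gpu_memory_gb=80, bytes_per_param=budget)
    print(f"Estimated capacity: {params / 1e9:.0f}B parameters")
```

Under this linear assumption, the implied budget of about 32 bytes per parameter (covering weights, gradients, optimizer state, and activations together) yields the ~20B-parameter figure for the 80GB machine.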

Improved energy efficiency

Larger GPU memory enables solving memory size–intensive compute problems with fewer server nodes. In addition to the reduction in compute hardware, the associated networking and storage infrastructure overhead also goes down. As a result, the data center becomes more energy-efficient.

As an example, the NVIDIA DGX SuperPOD system, based on HGX A100 80GB, captured the top spot on the recent Green500 list of most efficient supercomputers, achieving a new world record in power efficiency of 26.2 gigaflops per watt. The additional GPU memory enables A100 to execute the Green500 workload much more efficiently.

Run the same jobs with fewer nodes to substantially improve data center throughput

With larger GPU memory enabling fewer server nodes, the amount of internode communication is also dramatically reduced. More communication happens within nodes using the high-speed NVLink on the HGX platform. On a per-GPU basis, NVLink offers close to 10X higher bandwidth than even the fastest internode networking at 400 Gb/s.
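Network links are quoted in gigabits while GPU memory and NVLink bandwidth are usually quoted in gigabytes, so the comparison needs a unit conversion first. The sketch below reads the 400 Gb figure as gigabits per second and takes the "close to 10X" ratio at face value. The resulting NVLink number is only what those two statements imply together, not a published specification, and the function names are hypothetical.

```python
BITS_PER_BYTE = 8


def gbit_to_gbyte_per_s(gbit_per_s: float) -> float:
    """Convert a link speed from gigabits per second to gigabytes per second."""
    return gbit_per_s / BITS_PER_BYTE


def implied_intranode_gbyte_per_s(network_gbit_per_s: float, ratio: float) -> float:
    """Per-GPU intranode bandwidth implied by a network speed and a speedup ratio."""
    return gbit_to_gbyte_per_s(network_gbit_per_s) * ratio


if __name__ == "__main__":
    network = gbit_to_gbyte_per_s(400)
    print(f"400 Gb/s network link: {network:.0f} GB/s")

    # Assumption: "close to 10X" is treated as exactly 10 for this estimate.
    nvlink = implied_intranode_gbyte_per_s(400, ratio=10)
    print(f"Implied per-GPU NVLink bandwidth: about {nvlink:.0f} GB/s")
```

A 400 Gb/s link moves about 50 GB/s, so a roughly tenfold advantage puts intranode bandwidth in the hundreds of gigabytes per second per GPU. That gap is why keeping traffic inside the node matters for communication-bound jobs.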

Today, many large-dataset workloads are limited by internode network bandwidth. Because A100 80GB keeps more of that communication inside the node, the same compute problems can be solved with less hardware and faster performance, substantially improving overall data center throughput. Table 2 shows a few examples of how A100 80GB reduces the number of GPUs, and therefore nodes, required to run the same jobs. Figure 3 shows the resulting data center throughput improvement for the same use cases.

Use case	Application	# A100 40GB required to run	# A100 80GB required to run	Execution time comparison
Deep learning recommender training	DLRM, Criteo click log 1-TB dataset	16 GPU	8 GPU	8 A100 80GB close to 1.5x faster than 16 A100 40GB
Data analytics	Retail benchmark, 10-TB dataset	96 GPU	48 GPU	Close to same execution time
HPC (Material Science)	Quantum ESPRESSO (PRACE-large), 1.6-TB dataset	40 GPU	20 GPU	Close to same execution time

Table 2. A100 80GB runs the same jobs with fewer nodes.
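Table 2 can be turned into a per-GPU throughput gain, which is the quantity Figure 3 reports when the same number of GPUs is used in each case. The sketch below is one way to do that arithmetic. It treats "close to 1.5x faster" as exactly 1.5 and "close to same execution time" as exactly 1.0; both are approximations of the table's wording, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    name: str
    gpus_40gb: int   # A100 40GB GPUs required (Table 2)
    gpus_80gb: int   # A100 80GB GPUs required (Table 2)
    speedup: float   # 40GB job time / 80GB job time (approximated from Table 2)


TABLE_2 = [
    UseCase("DLRM recommender training", 16, 8, 1.5),
    UseCase("Retail data analytics benchmark", 96, 48, 1.0),
    UseCase("Quantum Espresso (PRACE-large)", 40, 20, 1.0),
]


def per_gpu_throughput_gain(case: UseCase) -> float:
    """Jobs per GPU per unit time on A100 80GB relative to A100 40GB."""
    return case.speedup * case.gpus_40gb / case.gpus_80gb


def total_memory_gb(num_gpus: int, gb_per_gpu: int) -> int:
    """Aggregate GPU memory across all GPUs used for a job."""
    return num_gpus * gb_per_gpu


if __name__ == "__main__":
    for case in TABLE_2:
        gain = per_gpu_throughput_gain(case)
        mem_40 = total_memory_gb(case.gpus_40gb, 40)
        mem_80 = total_memory_gb(case.gpus_80gb, 80)
        print(f"{case.name}: {gain:.1f}x per GPU ({mem_40} GB vs {mem_80} GB total)")
```

Under these approximations the gains are 3x for DLRM and 2x for the other two workloads, which is consistent with the "up to 3x" in Figure 3. In every row the total GPU memory is the same for both configurations, so the halved GPU count follows directly from the doubled memory per GPU.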

A100 80GB can boost throughput up to 3x compared to A100 40GB — *Figure 3. A100 80GB boosts throughput over A100 40GB. The same number of A100 80GB and A100 40GB GPUs are used in each case.*

Summary

The A100 80GB GPU doubles the memory capacity and increases memory bandwidth by over 30% when compared to the original A100 40GB GPU.

For memory size–intensive compute problems—such as natural language processing that needs large model capacity, deep learning recommender systems with large embedding tables, data analytics and HPC applications that use large datasets—the benefit of the A100 80GB GPU is especially apparent. You get faster performance, better result quality, a more energy-efficient data center, and substantially increased throughput.

The HGX A100 80GB platform is another powerful tool for developers, researchers, and scientists to take advantage of. We look forward to helping to advance the most important HPC and AI applications.