Researchers from Sony today announced a new speed record for training ImageNet/ResNet-50 in only 224 seconds (3 minutes and 44 seconds) with 75 percent accuracy, using 2,176 NVIDIA Tesla V100 Tensor Core GPUs. This is the fastest training time ever reported for ResNet-50.
The team also achieved over 90% GPU scaling efficiency with 1,088 NVIDIA Tesla V100 Tensor Core GPUs.
GPU scaling efficiency with ImageNet/ResNet-50 training
| | Processor | Interconnect | GPU scaling efficiency |
|---|---|---|---|
| Goyal et al. [1] | Tesla P100 x256 | 50Gbit Ethernet | ∼90% |
| Akiba et al. [5] | Tesla P100 x1024 | Infiniband FDR | 80% |
| Jia et al. [6] | Tesla P40 x2048 | 100Gbit Ethernet | 87.90% |
| This work | Tesla V100 x1088 | Infiniband EDR x2 | 91.62% |
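The article does not spell out how scaling efficiency is measured; the sketch below uses the standard definition (measured multi-GPU throughput divided by ideal linear scaling) as an assumption, not necessarily the paper's exact methodology.

```python
# Assumed (standard) definition of GPU scaling efficiency, for reference only;
# the paper's exact measurement methodology is not described in this article.
def scaling_efficiency(throughput_n_gpus: float, throughput_1_gpu: float, n_gpus: int) -> float:
    """Ratio of measured throughput to ideal linear scaling (1.0 = perfect scaling)."""
    return throughput_n_gpus / (throughput_1_gpu * n_gpus)

# Example: 91.62% efficiency on 1,088 GPUs means the cluster delivers roughly
# 0.9162 * 1088 ≈ 997 GPUs' worth of single-GPU throughput.
```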
“As the size of datasets and deep neural network (DNN) model for deep learning increase, the time required to train a model is also increasing,” the Sony team wrote in their paper.

Training time and top-1 validation accuracy with ImageNet/ResNet-50
| | Batch Size | Processor | DL Library | Time | Accuracy |
|---|---|---|---|---|---|
| He et al. | 256 | Tesla P100 x8 | Caffe | 29 hours | 75.30% |
| Goyal et al. | 8K | Tesla P100 x256 | Caffe2 | 1 hour | 76.30% |
| Smith et al. | 8K→16K | full TPU Pod | TensorFlow | 30 mins | 76.10% |
| Akiba et al. | 32K | Tesla P100 x1024 | Chainer | 15 mins | 74.90% |
| Jia et al. | 64K | Tesla P40 x2048 | TensorFlow | 6.6 mins | 75.80% |
| This work | 34K→68K | Tesla V100 x2176 | NNL | 224 secs | 75.03% |
To achieve the record, the researchers addressed two primary issues in large-scale distributed training: the instability of large mini-batch training and the communication overhead of gradient synchronization.
“We adopt a batch size control technique to address large mini-batch instability,” the researchers said. “We [also] develop a 2D-Torus all-reduce scheme to efficiently exchange gradients across GPUs.”
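The paper's exact batch-size schedule is not given in this article, so the following is only a minimal sketch of the idea behind batch size control: the global mini-batch is enlarged partway through training (the results table shows it growing from roughly 34K to 68K). The function name, epoch threshold, and growth factor here are illustrative assumptions.

```python
# Hypothetical sketch of batch-size control: the global mini-batch is enlarged
# partway through training to stabilize large mini-batch training. The results
# table shows the batch growing from ~34K to ~68K; the threshold epoch and
# growth factor below are illustrative assumptions, not the paper's schedule.
def batch_size_schedule(epoch: int, base_batch: int = 34_000,
                        boost_epoch: int = 30, factor: int = 2) -> int:
    """Return the global mini-batch size to use for the given epoch."""
    return base_batch * factor if epoch >= boost_epoch else base_batch

# The training loop rebuilds its data iterator whenever the batch size changes:
for epoch in range(90):
    global_batch = batch_size_schedule(epoch)
    # ... build the epoch's data iterator with `global_batch` samples per step and train ...
```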
The 2D-Torus serves as an efficient communication topology that reduces the communication overhead of the all-reduce collective operation: the GPUs are arranged in a grid, and gradients are reduce-scattered along one grid dimension, all-reduced along the other, and finally all-gathered along the first.
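Below is a minimal sketch of that three-step pattern using mpi4py collectives over a small process grid. It is not Sony's NCCL2 implementation; the grid shape, helper name, and buffer sizes are assumptions for illustration.

```python
# Minimal mpi4py sketch of a 2D-Torus style all-reduce (NOT Sony's NCCL2
# implementation): reduce-scatter along the rows of a process grid, all-reduce
# along the columns, then all-gather along the rows. Grid shape, function name,
# and buffer sizes are illustrative assumptions.
import numpy as np
from mpi4py import MPI

def torus_allreduce_2d(grad: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Sum `grad` (flat array, length divisible by `cols`) across a rows x cols grid."""
    world = MPI.COMM_WORLD
    assert world.Get_size() == rows * cols
    cart = world.Create_cart(dims=[rows, cols], periods=[True, True])
    row_comm = cart.Sub([False, True])   # ranks sharing a row of the grid
    col_comm = cart.Sub([True, False])   # ranks sharing a column of the grid

    chunk = np.empty(grad.size // cols, dtype=grad.dtype)

    # Step 1: reduce-scatter within the row; each rank keeps one summed chunk.
    row_comm.Reduce_scatter_block(grad, chunk, op=MPI.SUM)
    # Step 2: all-reduce that chunk down the column, completing the global sum.
    col_comm.Allreduce(MPI.IN_PLACE, chunk, op=MPI.SUM)
    # Step 3: all-gather within the row so every rank holds the full result.
    row_comm.Allgather(chunk, grad)
    return grad

if __name__ == "__main__":
    # Run with: mpirun -np 4 python torus_allreduce.py  (a 2 x 2 grid)
    g = np.ones(8, dtype=np.float32)
    torus_allreduce_2d(g, rows=2, cols=2)
    # Every rank now holds the element-wise sum over all 4 ranks (all values == 4.0).
```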
Software: “We used Neural Network Libraries (NNL) and its CUDA extension as a DNN training framework,” the team said. “We also used development branches based on NNL version 1.0.0. CUDA version 9.0 with cuDNN version 7.3.1 is employed to train DNN in GPUs.”
“We used NCCL version 2.3.5 and OpenMPI version 2.1.3 as communication libraries. The 2D-Torus all-reduce is implemented with NCCL2. The above software is packaged in a Singularity container. We used Singularity version 2.5.2 to run distributed DNN training,” the team wrote in their paper.
The work leveraged Neural Network Libraries, the core library developed by Sony, and the AI Bridging Cloud Infrastructure (ABCI) supercomputer, a world-class computing infrastructure constructed and operated by Japan’s National Institute of Advanced Industrial Science and Technology (AIST). The system is powered by over 4,300 NVIDIA Volta Tensor Core GPUs.