GTC Silicon Valley-2019 ID:S9146:Training ImageNet in Four Minutes
Haidong Rong(Tencent Technology (Beijing) Co., Ltd.),Yangzihao Wang(Tencent Technology (Beijing) Co., Ltd.)
We'll discuss how we build a highly scalable deep learning training system and training ImageNet in four minutes. For dense GPU clusters we optimize the training system by proposing a mixed-precision training method that significantly improves training throughput of a single GPU without losing accuracy. We also propose an optimization approach for extremely large mini-batch size (up to 64k) that can train CNN models on ImageNet dataset without losing accuracy. And we propose highly optimized all-reduce algorithms that achieve up to 3x and 11x speedup on AlexNet and ResNet-50 respectively than NCCL-based training on a cluster with 1024 Tesla P40 GPUs. Our training system can achieve 75.8% top-1 test accuracy in only 6.6 minutes using 2048 Tesla P40 GPUs. When training AlexNet with 95 epochs, our system can achieve 58.7% top-1 test accuracy within 4 minutes using 1024 Tesla P40 GPUs,which also outperforms all other existing systems.