Use Automatic Mixed Precision on Tensor Cores in Frameworks Today

NVIDIA Tensor Core GPU architecture now automatically and natively supported in TensorFlow, PyTorch and MXNet

NVIDIA CUDA X AI enables mixed precision AI training with just two lines of code, delivering up to 3x speedup

Mixed precision training uses half-precision arithmetic to speed up training while achieving the same accuracy as single-precision training with the same hyperparameters. Memory requirements are also reduced, allowing larger models and minibatches. Enabling mixed precision involves two steps: porting the model to use the half-precision data type where appropriate, and using loss scaling to preserve small gradient values. The automatic mixed precision feature is also available for PyTorch; read the developer blog for more information. We are working on bringing the automatic mixed precision feature to MXNet as well; learn more.
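To make the loss-scaling step concrete, here is a minimal sketch in plain Python. It is not how any framework implements AMP; it only shows why a small gradient vanishes in half precision and how scaling preserves it. The gradient value 1e-8 and the scale factor 1024 are illustrative assumptions, and the helper to_fp16 is a hypothetical name introduced for this example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Illustrative values only: a tiny gradient and a hand-picked loss scale.
small_grad = 1e-8
loss_scale = 1024.0

# Without scaling, the gradient is below the smallest half-precision
# subnormal (about 6e-8) and rounds to zero.
unscaled = to_fp16(small_grad)

# Multiplying the loss by a constant multiplies every gradient by the same
# constant (chain rule), lifting the value into the representable range.
scaled = to_fp16(small_grad * loss_scale)

# The scale is divided out again in higher precision before the weight update.
recovered = scaled / loss_scale

print("half precision, no scaling:", unscaled)
print("half precision, with loss scaling:", recovered)
```

In this sketch the unscaled gradient is lost entirely, while the scaled one survives with a small rounding error. Automatic mixed precision is meant to handle this bookkeeping so that it does not have to be written by hand.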

Today we introduce the Automatic Mixed Precision feature for TensorFlow, a feature that will greatly benefit deep learning researchers and engineers by automatically enabling mixed precision training. This feature makes all the required model and optimizer adjustments internally within TensorFlow. Speedups depend on model architecture; a set of models trained up to 3x faster with this feature. Enabling the Automatic Mixed Precision (AMP) feature in existing TensorFlow training scripts requires setting an environment variable or changing just a few lines of code. Read the developer blog showcasing a ResNet-50 example.

All models: https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow

Except ssd-rn50-fpn-640: https://github.com/tensorflow/models/tree/master/research/object_detection

All performance collected on 1xV100-16GB, except bert-squadqa on 1xV100-32GB

Batch sizes measured as follows. rn50: 128 for FP32, 256 for AMP+XLA; ssd-rn50-fpn-640: 8 for FP32, 16 for AMP+XLA; ncf: 1M for FP32 and AMP+XLA; bert-squadqa: 4 for FP32, 16 for AMP+XLA; gnmt: 128 for FP32, 192 for AMP.

You can find the example training scripts that we used to generate the performance charts above in the NVIDIA NGC model script registry, or on GitHub.

Enabling this feature for existing TensorFlow model scripts requires setting an environment variable or changing only a few lines of code, and delivers speedups of up to 3x. Today, the Automatic Mixed Precision feature is available in the TensorFlow container on the NVIDIA NGC container registry.

To enable this feature inside the container, simply set one environment variable:

export TF_ENABLE_AUTO_MIXED_PRECISION=1

As an alternative, the environment variable can be set inside the TensorFlow Python script:

import os
os.environ['TF_ENABLE_AUTO_MIXED_PRECISION'] = '1'

Once mixed precision is enabled, further speedups can be achieved by:

Enabling the TensorFlow XLA compiler, although note that Google still lists XLA as an experimental tool.
Increasing the minibatch size. Larger minibatches often lead to better GPU utilization, and mixed precision enables up to 2x larger minibatches.
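The two follow-up steps can be sketched together as a small configuration helper. This is an illustration under stated assumptions, not a recommended setup. The function names amp_xla_env and scaled_batch_size are hypothetical. The sketch assumes XLA auto-clustering is requested through the TF_XLA_FLAGS environment variable with --tf_xla_auto_jit=2, as described in the TensorFlow XLA documentation; check that this applies to your TensorFlow version. The 2x batch factor is the upper bound quoted above and may not fit in memory for every model.

```python
import os


def amp_xla_env(enable_xla: bool = True) -> dict:
    """Return environment settings for AMP and, optionally, XLA."""
    # This variable is the one named in the article.
    env = {"TF_ENABLE_AUTO_MIXED_PRECISION": "1"}
    if enable_xla:
        # Assumption: this flag enables XLA auto-clustering in your
        # TensorFlow build; it is not taken from the article.
        env["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=2"
    return env


def scaled_batch_size(fp32_batch: int, factor: float = 2.0) -> int:
    """Scale an FP32 batch size by a factor of up to 2x for mixed precision."""
    if not 1.0 <= factor <= 2.0:
        raise ValueError("factor is expected to be between 1.0 and 2.0")
    return int(fp32_batch * factor)


# Apply the settings; doing so before TensorFlow is imported is the
# conservative choice.
os.environ.update(amp_xla_env())

# Example: the rn50 measurement above used 128 for FP32 and 256 for AMP+XLA.
amp_batch = scaled_batch_size(128)
print(amp_batch)
```

Keeping the settings in one function makes it easy to toggle XLA off when comparing runs, which is useful given that XLA is still listed as experimental.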

Here are some of our customers who are already seeing benefits from the automatic mixed precision feature with NVIDIA Tensor Core GPUs:

“Automated mixed precision powered by NVIDIA Tensor Core GPUs on Alibaba allows us to instantly speedup AI models nearly 3X. Our researchers appreciated the ease of turning on this feature to instantly accelerate our AI.”

Wei Lin, Senior Director at Alibaba Computing Platform, Alibaba

“TensorFlow developers will greatly benefit from NVIDIA automatic mixed precision feature. This easy integration enables them to get up to 3X higher performance with mixed precision training on NVIDIA Tensor Core GPUs while maintaining model accuracy.”

Rajat Monga, Engineering Director, TensorFlow, Google

Availability

The automatic mixed precision feature is also available for PyTorch; read the developer blog for more information. Stay tuned: we are bringing the automatic mixed precision feature to MXNet as well. Learn more.

The automatic mixed precision feature is available in the NVIDIA-optimized TensorFlow 19.03 NGC container starting today. We are also working closely with the TensorFlow team at Google to merge this feature directly into the TensorFlow framework core.

Pull the NVIDIA-optimized TensorFlow container and experience the leap in performance. Feel free to leave feedback or questions for our team in our TensorFlow forum.