GTC Silicon Valley-2019: cuTENSOR: High-performance Tensor Operations in CUDA
Session ID: S9593
Andrew Kerr (NVIDIA), Paul Springer (NVIDIA)
We'll discuss cuTENSOR, a high-performance CUDA library for tensor operations, built for the high-dimensional arrays (i.e., tensors) that are ubiquitous in today's HPC and DL workloads. The library supports tensor contractions (a generalization of matrix-matrix multiplication), point-wise tensor operations such as tensor permutations, and tensor decompositions (a generalization of matrix decompositions). While providing high performance, cuTENSOR also lets users express their tensor equations in a straightforward way, hiding the complexity of these high-dimensional objects behind an easy-to-use API.

CUDA 10.1 enables CUDA programmers to use Tensor Cores directly through the new mma.sync instruction. In this presentation, we describe the functionality of mma.sync and present strategies for implementing efficient matrix multiply computations in CUDA that maximize performance on NVIDIA Volta GPUs. We then describe how CUTLASS 1.3 provides reusable components embodying these strategies. CUTLASS 1.3 demonstrates a median 44% speedup of CUDA kernels executing layers from real-world Deep Learning workloads.
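
To make concrete the abstract's description of a tensor contraction as a generalization of matrix-matrix multiplication, here is a minimal pure-Python reference sketch. It is not cuTENSOR's API or implementation: the function names `matmul`, `contract`, and `permute_021`, and the particular index layout C[m][n][p] = sum_k A[m][k] * B[k][n][p], are assumptions chosen only for illustration.

```python
def matmul(A, B):
    """Plain matrix product: C[m][n] = sum_k A[m][k] * B[k][n]."""
    M, K, N = len(A), len(B), len(B[0])
    return [[sum(A[m][k] * B[k][n] for k in range(K)) for n in range(N)]
            for m in range(M)]


def contract(A, B):
    """Contraction over the shared index k (hypothetical layout):
    C[m][n][p] = sum_k A[m][k] * B[k][n][p]."""
    M, K = len(A), len(B)
    N, P = len(B[0]), len(B[0][0])
    return [[[sum(A[m][k] * B[k][n][p] for k in range(K)) for p in range(P)]
             for n in range(N)]
            for m in range(M)]


def permute_021(T):
    """Tensor permutation that swaps the last two modes:
    out[m][p][n] = T[m][n][p]."""
    M, N, P = len(T), len(T[0]), len(T[0][0])
    return [[[T[m][n][p] for n in range(N)] for p in range(P)]
            for m in range(M)]


A = [[1, 2], [3, 4]]                       # shape (M=2, K=2)
B = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]   # shape (K=2, N=2, P=2)
C = contract(A, B)                         # shape (M=2, N=2, P=2)

# Flattening the (n, p) modes of B into a single column index turns the
# contraction into an ordinary matrix product with the same values.
B_flat = [[x for row in Bk for x in row] for Bk in B]   # shape (K=2, N*P=4)
C_flat = matmul(A, B_flat)
```

In this sketch, `C_flat` holds the same numbers as `C` with its last two modes flattened, which is one way to see why a contraction generalizes a matrix product. A GPU library would not run loops like these; the sketch only pins down the arithmetic being described.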