Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit on NVIDIA A100
Andrew Kerr, NVIDIA
The NVIDIA Ampere GPU architecture pushes the performance envelope by doubling the math throughput of Tensor Cores for mixed-precision computation, and it adds support for double precision, TensorFloat-32 (TF32), and bfloat16 data types. We'll describe how to implement high-performance CUDA kernels using Tensor Cores on A100, applying techniques such as register blocking, software pipelining, and carefully constructed memory layouts that avoid bank conflicts. Then we'll describe the abstractions for programming Tensor Cores available in CUTLASS, as well as other new features. This talk is intended for advanced CUDA C++ programmers who are eager to write kernels that push Tensor Cores to peak performance. We recommend reviewing previous presentations on this topic, such as the introduction to CUTLASS (GTC 2018) and Programming Volta Tensor Cores in CUTLASS (GTC 2019).
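To make "memory layouts that avoid bank conflicts" concrete, here is a minimal host-side C++ sketch. It is not code from the talk or from CUTLASS. It assumes shared memory organized as 32 banks of 4-byte words and a 32x32 tile of floats, and it counts how many lanes of a warp land in the same bank when each lane reads one element of a single tile column. The three layout functions (`linear_offset`, `padded_offset`, `swizzled_offset`) and the helper `max_bank_conflict` are hypothetical names introduced only for this illustration.

```cpp
#include <algorithm>
#include <array>

// Assumption: shared memory is modeled as 32 banks of 4-byte words,
// and a warp has 32 lanes. This is a host-side model, not a kernel.
constexpr int kBanks = 32;
constexpr int kWarpSize = 32;

// Hypothetical layouts mapping (row, col) of a 32x32 float tile to a word offset.
// Row-major with no padding: every element of a column falls in the same bank.
inline int linear_offset(int row, int col) { return row * 32 + col; }
// Row-major with one padding word per row: consecutive rows shift by one bank.
inline int padded_offset(int row, int col) { return row * 33 + col; }
// XOR swizzle: the column index is permuted by the row index, with no padding.
inline int swizzled_offset(int row, int col) { return row * 32 + (col ^ row); }

// Worst-case number of lanes that hit one bank when lane i reads element (i, col).
// A result of 1 means conflict-free; 32 means the access is fully serialized.
template <typename Layout>
int max_bank_conflict(Layout layout, int col) {
    std::array<int, kBanks> hits{};  // zero-initialized hit counter per bank
    for (int lane = 0; lane < kWarpSize; ++lane) {
        hits[layout(lane, col) % kBanks]++;
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, `max_bank_conflict(linear_offset, 0)` reports a 32-way conflict, while the padded and swizzled layouts both report 1. Padding costs extra shared memory per row, whereas the XOR permutation keeps the tile compact, which is one reason a swizzled layout can be attractive when shared memory capacity is tight.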