cuTENSOR
Tensor Linear Algebra on NVIDIA GPUs
NVIDIA cuTENSOR is a GPU-accelerated tensor linear algebra library for tensor contraction, reduction, and elementwise operations. Using cuTENSOR, applications can harness the specialized tensor cores on NVIDIA GPUs for high-performance tensor computations and accelerate deep learning training and inference, computer vision, quantum chemistry, and computational physics workloads.
cuTENSOR 2.0 Available Now
cuTENSOR 2.0 offers new features, such as just-in-time compiled kernels for tensor contraction, that significantly boost performance. The library's APIs have also been made uniform, so new features can be extended to all operations more easily.
cuTENSOR 2.0 is a more efficient and flexible library to accelerate your applications at the intersection of AI and HPC.
Read the cuTENSOR 2.0 migration guide
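The staged, plan-based workflow introduced in 2.0 is sketched below for a single-precision contraction with just-in-time compilation requested via the plan preference. This is a minimal illustration based on the cuTENSOR 2.x API as described in the migration guide, not an authoritative sample: error checking is omitted, and names such as CUTENSOR_JIT_MODE_DEFAULT and CUTENSOR_COMPUTE_DESC_32F should be verified against the headers shipped with your cuTENSOR version.

```cpp
// Sketch (assumed cuTENSOR 2.x API): plan-based contraction
// D[m,n] = alpha * A[m,k] * B[k,n] + beta * C[m,n], with JIT kernels requested.
// Error handling omitted; verify names against your installed cutensor.h.
#include <cutensor.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

int main()
{
    const int64_t m = 256, n = 256, k = 256;
    std::vector<int32_t> modeA{'m', 'k'}, modeB{'k', 'n'}, modeC{'m', 'n'};
    std::vector<int64_t> extA{m, k}, extB{k, n}, extC{m, n};

    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * m * k);
    cudaMalloc(&B, sizeof(float) * k * n);
    cudaMalloc(&C, sizeof(float) * m * n);

    cutensorHandle_t handle;
    cutensorCreate(&handle);

    const uint32_t kAlignment = 128;  // cudaMalloc'd memory satisfies this
    cutensorTensorDescriptor_t descA, descB, descC;
    cutensorCreateTensorDescriptor(handle, &descA, 2, extA.data(), nullptr, CUTENSOR_R_32F, kAlignment);
    cutensorCreateTensorDescriptor(handle, &descB, 2, extB.data(), nullptr, CUTENSOR_R_32F, kAlignment);
    cutensorCreateTensorDescriptor(handle, &descC, 2, extC.data(), nullptr, CUTENSOR_R_32F, kAlignment);

    // Stage 1: describe the operation (C also serves as the output D).
    cutensorOperationDescriptor_t op;
    cutensorCreateContraction(handle, &op,
                              descA, modeA.data(), CUTENSOR_OP_IDENTITY,
                              descB, modeB.data(), CUTENSOR_OP_IDENTITY,
                              descC, modeC.data(), CUTENSOR_OP_IDENTITY,
                              descC, modeC.data(),
                              CUTENSOR_COMPUTE_DESC_32F);

    // Stage 2: express preferences; this is where JIT compilation is requested.
    cutensorPlanPreference_t pref;
    cutensorCreatePlanPreference(handle, &pref, CUTENSOR_ALGO_DEFAULT, CUTENSOR_JIT_MODE_DEFAULT);

    // Stage 3: estimate workspace and build the plan (kernel selection/JIT happens here).
    uint64_t workspaceSize = 0;
    cutensorEstimateWorkspaceSize(handle, op, pref, CUTENSOR_WORKSPACE_DEFAULT, &workspaceSize);
    cutensorPlan_t plan;
    cutensorCreatePlan(handle, &plan, op, pref, workspaceSize);

    void* workspace = nullptr;
    if (workspaceSize > 0) cudaMalloc(&workspace, workspaceSize);

    // Stage 4: execute; the plan can be replayed many times.
    const float alpha = 1.0f, beta = 0.0f;
    cutensorContract(handle, plan, &alpha, A, B, &beta, C, C,
                     workspace, workspaceSize, /*stream=*/0);
    cudaDeviceSynchronize();

    cutensorDestroyPlan(plan);
    cutensorDestroyPlanPreference(pref);
    cutensorDestroyOperationDescriptor(op);
    cutensorDestroyTensorDescriptor(descA);
    cutensorDestroyTensorDescriptor(descB);
    cutensorDestroyTensorDescriptor(descC);
    cutensorDestroy(handle);
    cudaFree(workspace); cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Because the plan is created once and then executed repeatedly, the cost of kernel selection and any just-in-time compilation is paid up front rather than on every call.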
cuTENSOR Performance
The cuTENSOR library is highly optimized for performance on NVIDIA GPUs with support for DMMA, TF32, and now 3xTF32.
cuTENSOR 2.0 achieves significant performance gains over cuTENSOR 1.7, even before enabling just-in-time compiled kernels.
Just-in-time compiled kernels for tensor contraction enable additional speedups in tensor contraction benchmarks, including rand1000.
cuTENSOR Key Features
- Just-in-time compiled kernels for tensor contraction
- Plan-based multi-stage APIs for all operations (see the reduction sketch after this list)
- Support for tensor descriptors of arbitrary dimensionality
- Support for 3xTF32 compute type
- Support for int64 extents
- Tensor contraction, reduction, and elementwise operations
- Mixed precision support
- Expressive API allowing elementwise operation fusion
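To illustrate the uniform, plan-based API across operation types, the sketch below applies the same descriptor, preference, plan, and execute stages to a sum reduction over one mode. As with the contraction example above, this is a hedged sketch based on the cuTENSOR 2.x interface (cutensorCreateReduction / cutensorReduce); consult the official documentation for the authoritative signatures.

```cpp
// Sketch (assumed cuTENSOR 2.x API): plan-based reduction
// D[m] = alpha * sum_k A[m,k] + beta * C[m]. Error handling omitted.
#include <cutensor.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

int main()
{
    const int64_t m = 512, k = 1024;
    std::vector<int32_t> modeA{'m', 'k'}, modeC{'m'};
    std::vector<int64_t> extA{m, k}, extC{m};

    float *A, *C;
    cudaMalloc(&A, sizeof(float) * m * k);
    cudaMalloc(&C, sizeof(float) * m);

    cutensorHandle_t handle;
    cutensorCreate(&handle);

    cutensorTensorDescriptor_t descA, descC;
    cutensorCreateTensorDescriptor(handle, &descA, 2, extA.data(), nullptr, CUTENSOR_R_32F, 128);
    cutensorCreateTensorDescriptor(handle, &descC, 1, extC.data(), nullptr, CUTENSOR_R_32F, 128);

    // Modes present in A but absent from the output ('k') are reduced with CUTENSOR_OP_ADD.
    cutensorOperationDescriptor_t op;
    cutensorCreateReduction(handle, &op,
                            descA, modeA.data(), CUTENSOR_OP_IDENTITY,
                            descC, modeC.data(), CUTENSOR_OP_IDENTITY,
                            descC, modeC.data(),
                            CUTENSOR_OP_ADD,
                            CUTENSOR_COMPUTE_DESC_32F);

    // Same staging as for a contraction: preference, workspace estimate, plan.
    cutensorPlanPreference_t pref;
    cutensorCreatePlanPreference(handle, &pref, CUTENSOR_ALGO_DEFAULT, CUTENSOR_JIT_MODE_NONE);

    uint64_t workspaceSize = 0;
    cutensorEstimateWorkspaceSize(handle, op, pref, CUTENSOR_WORKSPACE_DEFAULT, &workspaceSize);
    cutensorPlan_t plan;
    cutensorCreatePlan(handle, &plan, op, pref, workspaceSize);

    void* workspace = nullptr;
    if (workspaceSize > 0) cudaMalloc(&workspace, workspaceSize);

    // Execute the reduction.
    const float alpha = 1.0f, beta = 0.0f;
    cutensorReduce(handle, plan, &alpha, A, &beta, C, C,
                   workspace, workspaceSize, /*stream=*/0);
    cudaDeviceSynchronize();

    cutensorDestroyPlan(plan);
    cutensorDestroyPlanPreference(pref);
    cutensorDestroyOperationDescriptor(op);
    cutensorDestroyTensorDescriptor(descA);
    cutensorDestroyTensorDescriptor(descC);
    cutensorDestroy(handle);
    cudaFree(workspace); cudaFree(A); cudaFree(C);
    return 0;
}
```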