Tensor Linear Algebra on NVIDIA GPUs

NVIDIA cuTENSOR is a GPU-accelerated tensor linear algebra library for tensor contraction, reduction, and elementwise operations. Using cuTENSOR, applications can harness the specialized tensor cores on NVIDIA GPUs for high-performance tensor computations and accelerate deep learning training and inference, computer vision, quantum chemistry, and computational physics workloads.
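To make the three operation classes concrete, the NumPy sketch below illustrates the mathematics they perform: a tensor contraction (a generalization of matrix multiplication to tensors with arbitrary modes), a reduction, and an elementwise operation. This is an illustration of the semantics only, not cuTENSOR's C API; the shapes and mode labels are arbitrary choices for the example.

```python
import numpy as np

# Contraction: D[m,u,n,v] = sum over h,k of A[m,h,k,n] * B[u,k,v,h]
# (matrix multiplication generalized to arbitrary tensor modes)
A = np.random.rand(4, 3, 5, 6)   # modes m, h, k, n
B = np.random.rand(2, 5, 7, 3)   # modes u, k, v, h
D = np.einsum("mhkn,ukvh->munv", A, B)   # shape (4, 2, 6, 7)

# Reduction: sum A over its h and k modes
R = A.sum(axis=(1, 2))           # shape (4, 6), modes m, n

# Elementwise operation: scaling combined with a unary operator
E = 1.5 * np.abs(A) + 0.5 * A    # same shape as A
```

On the GPU, cuTENSOR dispatches such expressions to kernels that exploit tensor cores, rather than evaluating them term by term as NumPy does here.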

cuTENSOR 2.0 Available Now

cuTENSOR 2.0 offers new features, such as just-in-time compiled kernels for tensor contraction, that significantly boost performance. The library's APIs have also been made uniform, so new features can be extended to all operations more easily.

cuTENSOR 2.0 is a more efficient and flexible library to accelerate your applications at the intersection of AI and HPC.

Read the cuTENSOR 2.0 migration guide

cuTENSOR Performance

The cuTENSOR library is highly optimized for NVIDIA GPUs, with support for the DMMA, TF32, and now 3xTF32 compute modes.

[Chart: cuTENSOR 2.0 performance gains over cuTENSOR 1.7]

cuTENSOR 2.0 achieves significant performance gains over cuTENSOR 1.7, even before enabling just-in-time compiled kernels.

[Chart: cuTENSOR performance gains with JIT kernels]

Just-in-time compiled kernels for tensor contraction enable speedups across tensor software benchmarks, including the rand1000 benchmark.

cuTENSOR Key Features

  • Just-in-time compiled kernels for tensor contraction
  • Plan-based multi-stage APIs for all operations
  • Support for tensor descriptors of arbitrary dimensionality
  • Support for 3xTF32 compute type
  • Support for int64 extents
  • Tensor contraction, reduction, and elementwise operations
  • Mixed precision support
  • Expressive API allowing elementwise operation fusion
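The last feature, elementwise operation fusion, means a scaled combination of tensors with unary operators applied can be evaluated as one fused expression instead of several passes over memory. The NumPy sketch below shows the general shape of such a fused expression; the function name and argument layout are illustrative assumptions for this example, not cuTENSOR's actual API.

```python
import numpy as np

# Sketch of a fused elementwise expression of the general form
#   D = phi(alpha * psiA(A), gamma * psiC(C))
# where psiA and psiC are unary operators, phi is a binary operator,
# and alpha, gamma are scalars. Hypothetical helper for illustration only.
def fused_elementwise(alpha, A, op_a, gamma, C, op_c, op_binary):
    return op_binary(alpha * op_a(A), gamma * op_c(C))

A = np.random.rand(3, 4)
C = np.random.rand(3, 4)

# D = 2.0 * tanh(A) + 0.5 * |C|, expressed as a single fused call
D = fused_elementwise(2.0, A, np.tanh, 0.5, C, np.abs, np.add)
```

In a GPU library, fusing these steps into one kernel avoids materializing the intermediate tensors, which matters because elementwise operations are typically memory-bandwidth bound.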

Ready to get started with cuTENSOR?