Programming Distributed Multi-GPU Tensor Operations with cuTENSOR v1.4

Today, NVIDIA is announcing the availability of cuTENSOR version 1.4, which supports up to 64-dimensional tensors, adds distributed multi-GPU tensor operations, and improves the tensor contraction performance model. The software is available now as a free download.

Download the cuTENSOR software.

What’s New?

Supports up to 64-dimensional tensors.
Supports distributed, multi-GPU tensor operations.
Improved tensor contraction performance model (i.e., algo CUTENSOR_ALGO_DEFAULT).
Improved performance for tensor contractions that have an overall large contracted dimension (i.e., a parallel reduction was added).
Improved performance for tensor contractions that have a tiny contracted dimension (<= 8).
Improved performance for outer-product-like tensor contractions (e.g., C[a,b,c,d] = A[b,d] * B[a,c]).
Additional bug fixes.
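
To make the index notation in the list above concrete, here is a minimal pure-Python sketch of what an outer-product-like contraction and a contraction with a contracted dimension compute. It only illustrates the math with nested lists and does not use the cuTENSOR API. The function names `outer_product_like` and `contract_over_k` are hypothetical and chosen for this illustration.

```python
# Reference semantics only: plain Python loops, not the cuTENSOR API.
# Function names are hypothetical and exist only for this illustration.

def outer_product_like(A, B):
    """C[a][b][c][d] = A[b][d] * B[a][c]; no index is summed over."""
    nb, nd = len(A), len(A[0])
    na, nc = len(B), len(B[0])
    return [[[[A[b][d] * B[a][c] for d in range(nd)]
              for c in range(nc)]
             for b in range(nb)]
            for a in range(na)]

def contract_over_k(A, B):
    """C[m][n] = sum over k of A[m][k] * B[k][n]; k is the contracted dimension."""
    nm, nk, nn = len(A), len(B), len(B[0])
    return [[sum(A[m][k] * B[k][n] for k in range(nk)) for n in range(nn)]
            for m in range(nm)]

A = [[1, 2], [3, 4]]          # read as A[b][d] or A[m][k]
B = [[5, 6, 7], [8, 9, 10]]   # read as B[a][c] or B[k][n]

C_outer = outer_product_like(A, B)   # extents: a=2, b=2, c=3, d=2
C_mat = contract_over_k(A, B)        # contracted extent k=2
```

In the first function every index of A and B survives into C, so the contracted dimension is empty. In the second, the extent of `k` is the contracted dimension that the release notes describe as either large or tiny.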

For more information, see the cuTENSOR Release Notes.

About cuTENSOR

cuTENSOR is a high-performance CUDA library for tensor primitives; its key features include:

Extensive mixed-precision support:
- FP64 inputs with FP32 compute.
- FP32 inputs with FP16, BF16, or TF32 compute.
- Complex-times-real operations.
- Conjugate (without transpose) support.

Support for up to 64-dimensional tensors.
Support for arbitrary data layouts.
Support for trivially serializable data structures.
Enhancements to main computational routines:
- Direct (i.e., transpose-free) tensor contractions.
- Tensor reductions (including partial reductions).
- Element-wise tensor operations:
  - Support for various activation functions.
  - Arbitrary tensor permutations.
  - Conversion between different data types.
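
As a rough illustration of two of the routines listed above, the following sketch shows an arbitrary tensor permutation and a partial reduction on a flat, row-major buffer addressed through strides. It is plain Python meant to show the semantics, not cuTENSOR code. The helper names (`row_major_strides`, `permute`, `partial_reduce_sum`) are hypothetical, and the row-major layout is an assumption made for the example.

```python
from itertools import product

# Reference semantics only; helper names are hypothetical, not cuTENSOR calls.

def row_major_strides(shape):
    """Element strides for a dense row-major tensor of the given shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides

def permute(data, shape, perm):
    """Reorder modes so that output mode i is input mode perm[i]."""
    in_strides = row_major_strides(shape)
    out_shape = [shape[p] for p in perm]
    out = []
    for idx in product(*(range(n) for n in out_shape)):
        # idx[i] is the coordinate along input mode perm[i].
        offset = sum(idx[i] * in_strides[perm[i]] for i in range(len(perm)))
        out.append(data[offset])
    return out, out_shape

def partial_reduce_sum(data, shape, mode):
    """Sum over one mode and keep the others (a partial reduction)."""
    strides = row_major_strides(shape)
    kept = [i for i in range(len(shape)) if i != mode]
    out_shape = [shape[i] for i in kept]
    out = []
    for idx in product(*(range(n) for n in out_shape)):
        base = sum(c * strides[m] for c, m in zip(idx, kept))
        out.append(sum(data[base + r * strides[mode]]
                       for r in range(shape[mode])))
    return out, out_shape

data = list(range(6))   # the 2x3 tensor [[0, 1, 2], [3, 4, 5]]
shape = [2, 3]

transposed, t_shape = permute(data, shape, [1, 0])
row_sums, r_shape = partial_reduce_sum(data, shape, mode=1)
col_sums, c_shape = partial_reduce_sum(data, shape, mode=0)
```

Because both helpers address elements only through strides, the same loops would work for other dense layouts by swapping in a different stride vector, which is the idea behind supporting arbitrary data layouts.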