Programming Distributed Multi-GPU Tensor Operations with cuTENSOR v1.4

Today, NVIDIA is announcing the availability of cuTENSOR version 1.4, which supports up to 64-dimensional tensors, adds distributed multi-GPU tensor operations, and improves the tensor contraction performance model. The software can be downloaded now free of charge.

What’s New?

  • Supports up to 64-dimensional tensors.
  • Supports distributed, multi-GPU tensor operations.
  • Improved tensor contraction performance model (i.e., algo CUTENSOR_ALGO_DEFAULT).
  • Improved performance for tensor contractions that have an overall large contracted dimension (i.e., a parallel reduction was added).
  • Improved performance for tensor contractions that have a tiny contracted dimension (<= 8).
  • Improved performance for outer-product-like tensor contractions (e.g., C[a,b,c,d] = A[b,d] * B[a,c]).
  • Additional bug fixes.
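
To make the terms "contracted dimension" and "outer-product-like" in the list above concrete, here is a minimal pure-Python reference sketch. It is not the cuTENSOR API and says nothing about how the library is implemented. The helper names `contract` and `full`, the dict-of-tuples tensor representation, and the extents are all hypothetical and chosen only for illustration. A mode that appears in both inputs but not in the output is summed over (contracted). When no such mode exists, the operation is outer-product-like.

```python
from itertools import product

def contract(A, a_modes, B, b_modes, c_modes, extent):
    """Naive reference contraction on tensors stored as {index_tuple: value}.

    Modes shared by A and B but absent from C are summed over (contracted).
    Illustrative only; this is not how cuTENSOR is called or implemented.
    """
    contracted = [m for m in a_modes if m in b_modes and m not in c_modes]
    loop_modes = list(c_modes) + contracted
    C = {}
    for idx in product(*(range(extent[m]) for m in loop_modes)):
        pos = dict(zip(loop_modes, idx))
        a = A[tuple(pos[m] for m in a_modes)]
        b = B[tuple(pos[m] for m in b_modes)]
        key = tuple(pos[m] for m in c_modes)
        C[key] = C.get(key, 0) + a * b
    return C, contracted

def full(modes, extent, fn):
    """Build a tensor over the given modes, filled by fn(index_tuple)."""
    return {idx: fn(idx) for idx in product(*(range(extent[m]) for m in modes))}

# Hypothetical extents; k = 4 is a "tiny" contracted dimension (<= 8).
extent = {"a": 2, "b": 3, "c": 2, "d": 2, "k": 4}

# Outer-product-like: C[a,b,c,d] = A[b,d] * B[a,c], no shared mode.
A = full("bd", extent, lambda i: i[0] + 10 * i[1])
B = full("ac", extent, lambda i: 1 + i[0] + 2 * i[1])
C, contracted_outer = contract(A, "bd", B, "ac", "abcd", extent)

# Matrix-multiply-like: C2[a,b] = sum over k of A2[a,k] * B2[k,b].
A2 = full("ak", extent, lambda i: 1)
B2 = full("kb", extent, lambda i: 1)
C2, contracted_mm = contract(A2, "ak", B2, "kb", "ab", extent)
```

In the first call the list of contracted modes is empty, so every output element is a single product. In the second call the only contracted mode is `k`, and each output element is a sum of `extent["k"]` products.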

About cuTENSOR

cuTENSOR is a high-performance CUDA library for tensor primitives; its key features include:

  • Extensive mixed-precision support:
    • FP64 inputs with FP32 compute.
    • FP32 inputs with FP16, BF16, or TF32 compute.
    • Complex-times-real operations.
    • Conjugate (without transpose) support.

