Basic Linear Algebra on NVIDIA GPUs


The cuBLAS Library provides a GPU-accelerated implementation of the basic linear algebra subroutines (BLAS). cuBLAS accelerates AI and HPC applications with drop-in, industry-standard BLAS APIs highly optimized for NVIDIA GPUs. The cuBLAS library contains extensions for batched operations, execution across multiple GPUs, and mixed- and low-precision execution. Using cuBLAS, applications automatically benefit from regular performance improvements and new GPU architectures.
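As a concrete illustration of the drop-in BLAS API, the sketch below runs a single-precision GEMM (C = alpha*A*B + beta*C) through cuBLAS. The matrix size and values are illustrative assumptions, and error handling is reduced to a single status check for brevity; the call takes the same arguments as the reference Fortran `sgemm`, with column-major storage.

```cuda
// Minimal sketch: single-precision GEMM with cuBLAS (compile with
// `nvcc example.cu -lcublas`; requires an NVIDIA GPU).
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 4;  // illustrative n x n matrices
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 1.0f), hC(n * n, 0.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // Same argument order as reference BLAS sgemm; cuBLAS assumes
    // column-major (Fortran-style) storage.
    cublasStatus_t st = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                    n, n, n, &alpha, dA, n, dB, n,
                                    &beta, dC, n);
    if (st != CUBLAS_STATUS_SUCCESS) printf("cublasSgemm failed: %d\n", st);

    cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", hC[0]);  // each entry of C equals n for all-ones inputs

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Because the routine names and argument orders mirror standard BLAS, porting an existing CPU BLAS call typically amounts to moving the buffers to device memory and adding a handle.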

cuBLAS Multi-GPU Extension

cuBLASMg provides state-of-the-art multi-GPU matrix-matrix multiplication in which each matrix can be distributed, in a 2D block-cyclic fashion, among multiple devices. cuBLASMg is currently a part of the CUDA Math Library Early Access Program. Apply for access today!


cuBLAS Performance

The cuBLAS library is highly optimized for performance on NVIDIA GPUs and leverages Tensor Cores to accelerate low- and mixed-precision matrix multiplication.
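A mixed-precision GEMM of the kind Tensor Cores accelerate can be expressed with `cublasGemmEx`: FP16 inputs, FP32 output, and FP32 accumulation via `CUBLAS_COMPUTE_32F`. This is a hedged sketch, not a tuned implementation; the size is an illustrative assumption (multiples of 8 tend to suit Tensor Core dispatch), and error checks are omitted for brevity.

```cuda
// Sketch: FP16-in / FP32-out GEMM with FP32 accumulation via cublasGemmEx.
// On Volta or newer GPUs, cuBLAS can dispatch this shape to Tensor Cores.
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 16;  // illustrative size; multiples of 8 favor Tensor Cores
    std::vector<__half> hA(n * n, __float2half(1.0f));
    std::vector<__half> hB(n * n, __float2half(1.0f));
    std::vector<float>  hC(n * n, 0.0f);

    __half *dA, *dB;
    float *dC;
    cudaMalloc(&dA, n * n * sizeof(__half));
    cudaMalloc(&dB, n * n * sizeof(__half));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(__half), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(__half), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Scalars are FP32 because the compute type is CUBLAS_COMPUTE_32F.
    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                 &alpha, dA, CUDA_R_16F, n,
                         dB, CUDA_R_16F, n,
                 &beta,  dC, CUDA_R_32F, n,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);

    cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", hC[0]);  // each entry equals n for all-ones inputs

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Accumulating in FP32 while storing inputs in FP16 is the usual trade-off: it roughly halves input bandwidth while keeping the reduction numerically stable.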

cuBLAS Key Features

  • Complete support for all 152 standard BLAS routines
  • Support for half-precision and integer matrix multiplication
  • GEMM and GEMM extensions optimized for Volta and Turing Tensor Cores
  • GEMM performance tuned for sizes used in various Deep Learning models
  • Support for CUDA streams for concurrent operations
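The stream support in the last bullet can be sketched as follows: binding the cuBLAS handle to different CUDA streams lets independent GEMMs be enqueued so the device may overlap them. Buffer contents are left uninitialized here because the launch pattern, not the math, is the point; sizes are illustrative.

```cuda
// Sketch: two independent GEMMs issued on separate CUDA streams.
// cublasSetStream binds the handle; subsequent calls enqueue on that stream.
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 256;  // illustrative size
    float *dA, *dB, *dC1, *dC2;
    cudaMalloc(&dA,  n * n * sizeof(float));
    cudaMalloc(&dB,  n * n * sizeof(float));
    cudaMalloc(&dC1, n * n * sizeof(float));
    cudaMalloc(&dC2, n * n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSetStream(handle, s1);  // this GEMM runs on stream s1
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC1, n);
    cublasSetStream(handle, s2);  // this GEMM runs on stream s2
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC2, n);

    cudaDeviceSynchronize();  // wait for both streams to finish
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC1); cudaFree(dC2);
    return 0;
}
```

Whether the two GEMMs actually overlap depends on GPU occupancy; small problems are the most likely to benefit from concurrent execution.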