Dense Linear Algebra on GPUs

The NVIDIA cuBLAS library is a fast GPU-accelerated implementation of the standard Basic Linear Algebra Subprograms (BLAS). Using cuBLAS APIs, you can speed up your applications by deploying compute-intensive operations to a single GPU, or by scaling up and distributing work efficiently across multi-GPU configurations.

NVBLAS is a GPU-accelerated version of BLAS that further accelerates BLAS Level-3 routines by dynamically routing BLAS calls to one or more NVIDIA GPUs as well as CPUs in the system through the cuBLAS-XT interface.

Researchers and scientists use cuBLAS to develop GPU-accelerated algorithms in areas including high-performance computing, image analysis, and machine learning.

cuBLAS performs up to 35X faster than the latest version of the MKL BLAS on common benchmarks

Key Features

  • Complete support for all 152 standard BLAS routines
  • Turing-optimized GEMMs and GEMM extensions for Tensor Cores
  • GEMM performance tuned for sizes used in various Deep Learning models
  • API and error logging for debug and traceability
  • Supports single, double, complex, and double complex data types
  • Supports half-precision (FP16) and integer (INT8) matrix multiplication operations
  • Support for multiple GPUs and concurrent kernels
  • Supports CUDA streams for concurrent operations
  • Fortran bindings
  • Batch processing APIs for high-performance GEMM operations, LU factorization, and matrix inverse operations
  • Device API that can be called from within your own CUDA kernels
  • Fast implementation of TRSV (Triangular solve)
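
Several of the features above center on GEMM, the general matrix-matrix multiply C = alpha * A * B + beta * C. As a rough illustration of what that operation computes, here is a minimal CPU sketch in C++ using the column-major layout and leading-dimension arguments of the BLAS convention, which cuBLAS also follows. The names reference_sgemm and demo_gemm_2x2 are hypothetical helpers written for this example and are not part of the cuBLAS API; the sketch covers only the non-transposed, single-precision case and makes no attempt at performance.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (NOT a cuBLAS function): CPU reference for
//   C = alpha * A * B + beta * C
// with no transposition. A is m x k, B is k x n, C is m x n.
// All matrices are column-major: element (i, j) of A is A[i + j * lda].
void reference_sgemm(int m, int n, int k, float alpha,
                     const float* A, int lda,
                     const float* B, int ldb,
                     float beta, float* C, int ldc) {
    // Leading dimensions must be at least the number of rows stored.
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[i + static_cast<std::size_t>(p) * lda] *
                       B[p + static_cast<std::size_t>(j) * ldb];
            }
            float& c = C[i + static_cast<std::size_t>(j) * ldc];
            // BLAS convention: when beta is zero, the input C is not read.
            c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * c;
        }
    }
}

// Small worked case: A = [1 3; 2 4], B = [5 7; 6 8], stored column by column.
std::vector<float> demo_gemm_2x2() {
    std::vector<float> A = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> B = {5.0f, 6.0f, 7.0f, 8.0f};
    std::vector<float> C(4, 0.0f);
    reference_sgemm(2, 2, 2, 1.0f, A.data(), 2, B.data(), 2,
                    0.0f, C.data(), 2);
    return C;  // column-major result of A * B
}
```

A reference like this is mainly useful for checking the results of a GPU GEMM call on small inputs; the argument order (dimensions, scalars, pointers with leading dimensions) mirrors the general shape of BLAS-style GEMM interfaces.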

Product Resources


The cuBLAS library is freely available as part of the CUDA Toolkit and OpenACC Toolkit.

Additional Resources