The NVIDIA CUDA Basic Linear Algebra Subroutines (cuBLAS) library is a GPU-accelerated version of the complete standard BLAS library that delivers 6x to 17x faster performance than the latest MKL BLAS.
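A minimal single-GPU GEMM call is a good way to see the library's shape. The sketch below (assuming CUDA and cuBLAS are installed; matrix size and fill values are illustrative) computes C = alpha·A·B + beta·C with `cublasSgemm`, using the column-major layout BLAS expects:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 4;                      // small square matrices for illustration
    const float alpha = 1.0f, beta = 0.0f;
    size_t bytes = n * n * sizeof(float);

    // Host matrices (column-major, as BLAS expects)
    float *hA = (float*)malloc(bytes), *hB = (float*)malloc(bytes), *hC = (float*)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Device buffers
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // cublasSetMatrix copies host -> device
    cublasSetMatrix(n, n, sizeof(float), hA, n, dA, n);
    cublasSetMatrix(n, n, sizeof(float), hB, n, dB, n);

    // C = alpha * A * B + beta * C
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    // Copy the result back and inspect one entry
    cublasGetMatrix(n, n, sizeof(float), dC, n, hC, n);
    printf("C[0][0] = %f\n", hC[0]);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

Note that the application owns the handle and the device buffers; cuBLAS only performs the computation, which is what lets it compose with CUDA streams and other kernels.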


cuBLAS-XT is a set of routines that further accelerates Level 3 BLAS calls by spreading work across multiple GPUs connected to the same motherboard, with near-perfect scaling as GPUs are added. Its streaming design automatically manages transfers across the PCI-Express bus, allowing input and output data to reside in host system memory. This provides out-of-core operation: the size of the operand data is limited only by system memory size, not by GPU on-board memory size. cuBLAS-XT is included with the CUDA 7 Toolkit and requires no additional license.
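Because cuBLAS-XT streams tiles over PCIe itself, the operands can be plain host allocations. A sketch of a multi-GPU DGEMM (assuming two GPUs with device IDs 0 and 1; the matrix size is illustrative) might look like this:

```cuda
#include <cstdlib>
#include <cublasXt.h>

int main() {
    const int n = 8192;                   // operands live entirely in host memory
    const double alpha = 1.0, beta = 0.0;
    size_t bytes = (size_t)n * n * sizeof(double);

    double *A = (double*)malloc(bytes);
    double *B = (double*)malloc(bytes);
    double *C = (double*)malloc(bytes);
    // ... fill A and B ...

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    // Use the first two GPUs; cuBLAS-XT tiles the problem across them
    int devices[2] = {0, 1};
    cublasXtDeviceSelect(handle, 2, devices);

    // Host pointers are passed directly; tiles are streamed over PCIe
    cublasXtDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                  &alpha, A, n, B, n, &beta, C, n);

    cublasXtDestroy(handle);
    free(A); free(B); free(C);
    return 0;
}
```

The call signature mirrors standard `dgemm`, so no explicit `cudaMalloc` or host-device copies appear in application code.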


NVBLAS is a CPU BLAS implementation that automatically accelerates eligible BLAS calls via cuBLAS-XT, and is included with the CUDA Toolkit. All versions of cuBLAS-XT work with NVBLAS.
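NVBLAS is configured through an `nvblas.conf` file and is typically interposed at link or load time, so existing binaries need no source changes. A minimal configuration (the CPU BLAS path below is illustrative; adjust it to your system) could be:

```
# nvblas.conf -- minimal example
NVBLAS_CPU_BLAS_LIB  /usr/lib/libopenblas.so   # CPU BLAS used for calls NVBLAS does not intercept
NVBLAS_GPU_LIST      ALL                       # spread work across every CUDA-capable GPU
NVBLAS_AUTOPIN_MEM_ENABLED                     # pin host memory to speed up PCIe transfers
```

An unmodified BLAS application can then be run with the library preloaded, for example `LD_PRELOAD=libnvblas.so ./your_blas_app`.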

Building on the GPU-accelerated BLAS routines in the cuBLAS library, heterogeneous LAPACK implementations such as CULA Tools and MAGMA are also available.


Key Features

  • Complete support for all 152 standard BLAS routines
  • Single, double, complex, and double complex data types
  • Supports half-precision (FP16) and integer (INT8) matrix multiplication operations
  • Support for CUDA streams
  • Fortran bindings
  • Support for multiple GPUs and concurrent kernels
  • Batched GEMM API
  • Device API that can be called from CUDA kernels
  • Batched LU factorization API
  • Batched matrix inverse API
  • New implementation of TRSV (triangular solve), up to 7x faster than the previous implementation
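The batched GEMM API in the list above is aimed at workloads with many small, independent multiplications, where per-call launch overhead would otherwise dominate. A hedged sketch (the helper function and pointer-array layout are illustrative, not part of the library):

```cuda
#include <cublas_v2.h>

// Sketch: multiply `count` independent n-by-n matrix pairs in one call.
// dA, dB, dC are device-resident arrays of device pointers, one entry per matrix.
void batched_gemm(cublasHandle_t handle, const float **dA, const float **dB,
                  float **dC, int n, int count) {
    const float alpha = 1.0f, beta = 0.0f;
    // One launch covers all `count` GEMMs, amortizing kernel-launch overhead
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &alpha, dA, n, dB, n, &beta, dC, n, count);
}
```

The batched LU factorization and batched matrix inverse APIs follow the same pattern: arrays of pointers to many small matrices processed in a single call.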


Review the latest CUDA performance report to learn how much you could accelerate your code.

cuBLAS on K40m, ECC on, input and output data on device. m=n=k=4096, transpose=no, side=right, fill=lower


cuBLAS on K40m, ECC on, input and output data on device. MKL 11.0.4 on Intel IvyBridge single-socket 12-core E5-2697 v2 @ 2.70GHz



Performance of cuBLAS-XT DGEMM for a 16,384-by-16,384 matrix.

Performance of cuBLAS-XT SGEMM for a 16,384-by-16,384 matrix.

Performance measured on 1-4 K40 cards with ECC enabled, connected via PCI-E Gen 3 to a dual-socket Intel(R) Xeon(R) CPU E5-2650 @ 2.00GHz.

A short presentation using a GNU/Octave implementation as an example of using cuBLAS


The cuBLAS library is freely available as part of the CUDA Toolkit.
For more information on cuBLAS and other CUDA math libraries: