The NVIDIA CUDA Basic Linear Algebra Subroutines (cuBLAS) library is a GPU-accelerated implementation of the complete standard BLAS library that delivers 6x to 17x faster performance than the latest MKL BLAS. New in CUDA 6.0 is multi-GPU support via cuBLAS-XT.
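As an illustration of the multi-GPU path, the cublasXt API accepts host-resident matrices and tiles the computation across the selected devices. The following is a minimal sketch, not a complete program: matrix contents and all error checking are omitted, and the device list is an assumption for a single-GPU machine.

```c
#include <stdlib.h>
#include <cublasXt.h>

int main(void)
{
    const int n = 1024;                 /* square matrices for simplicity */
    const float alpha = 1.0f, beta = 0.0f;

    /* cublasXt operates on plain host buffers; the library stages
       tiles to the GPUs behind the scenes. */
    float *A = malloc(sizeof(float) * n * n);
    float *B = malloc(sizeof(float) * n * n);
    float *C = malloc(sizeof(float) * n * n);
    /* ... fill A and B ... */

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    /* Select the devices to use -- here just device 0; pass more
       device IDs to spread the GEMM over several GPUs. */
    int devices[] = { 0 };
    cublasXtDeviceSelect(handle, 1, devices);

    /* C = alpha * A * B + beta * C (column-major, as in classic BLAS) */
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  n, n, n, &alpha, A, n, B, n, &beta, C, n);

    cublasXtDestroy(handle);
    free(A); free(B); free(C);
    return 0;
}
```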

Building on the GPU-accelerated BLAS routines in the cuBLAS library, heterogeneous LAPACK implementations such as CULA Tools and MAGMA are also available.


Key Features

  • Complete support for all 152 standard BLAS routines
  • Single, double, complex, and double complex data types
  • Support for CUDA streams
  • Fortran bindings
  • Support for multiple GPUs and concurrent kernels
  • Batched GEMM API
  • Device API that can be called from CUDA kernels
  • Batched LU factorization API
  • Batched matrix inverse API
  • New implementation of TRSV (triangular solve), up to 7x faster than the previous implementation
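To give a feel for the host-side API behind these features, here is a minimal single-GPU SGEMM sketch using the cublas_v2 interface. Matrix contents and error checking are omitted, and the dimensions are arbitrary; this is an illustration, not a benchmark-ready program.

```c
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 512;
    const float alpha = 1.0f, beta = 0.0f;
    const size_t bytes = sizeof(float) * n * n;

    float *hA = malloc(bytes), *hB = malloc(bytes), *hC = malloc(bytes);
    /* ... fill hA and hB ... */

    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* cuBLAS matrices are column-major, as in Fortran BLAS */
    cublasSetMatrix(n, n, sizeof(float), hA, n, dA, n);
    cublasSetMatrix(n, n, sizeof(float), hB, n, dB, n);

    /* C = alpha * A * B + beta * C; input and output stay on the device,
       so several calls can be chained without host round-trips */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cublasGetMatrix(n, n, sizeof(float), dC, n, hC, n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

The same handle can be bound to a CUDA stream with `cublasSetStream` to overlap transfers and kernels.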


Review the latest CUDA 6.5 performance report to learn how much you could accelerate your code.

Chart data: cuBLAS 6.5 on K40m, ECC on, input and output data on device; m = n = k = 4096, transpose = no, side = right, fill = lower.

Chart data: cuBLAS 6.5 on K40m, ECC on, input and output data on device; MKL 11.0.4 on single-socket 12-core Intel Ivy Bridge E5-2697 v2 @ 2.70 GHz.

A short presentation using a GNU/Octave implementation as an example of using cuBLAS is also available.


The cuBLAS library is freely available as part of the CUDA Toolkit.
For more information on cuBLAS and other CUDA math libraries: