cuBLAS-XT is a set of routines that accelerate Level 3 BLAS (Basic Linear Algebra Subprograms) calls by spreading the work across more than one GPU. Using a streaming design, cuBLAS-XT automatically manages transfers across the PCI-Express bus, which allows input and output data to reside in the host’s system memory. This provides out-of-core operation: the size of the operand data is limited only by system memory size, not by GPU on-board memory size.
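
To make the out-of-core idea concrete, here is a minimal host-only sketch in Python. It is not the cuBLAS-XT implementation or its API: the function name `tiled_gemm`, the `tile` and `num_devices` parameters, and the round-robin assignment of tiles to devices are all assumptions made for illustration. It only shows why tiling lets operands stay in host memory: each output tile needs just a row panel of A and a column panel of B, so only those pieces would have to be streamed to a GPU at any one time.

```python
def tiled_gemm(A, B, tile, num_devices):
    """Compute C = A * B one output tile at a time.

    Illustrative only: 'devices' are simulated, and the round-robin
    assignment below is an assumption, not cuBLAS-XT's actual scheduler.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    schedule = []  # (row_start, col_start, device_id) for each tile
    tile_index = 0
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            device = tile_index % num_devices  # hypothetical round-robin choice
            i1, j1 = min(i0 + tile, m), min(j0 + tile, n)
            # Only rows i0..i1 of A and columns j0..j1 of B are touched here;
            # in a real streaming design these panels would be copied to the GPU.
            for i in range(i0, i1):
                for j in range(j0, j1):
                    C[i][j] = sum(A[i][p] * B[p][j] for p in range(k))
            schedule.append((i0, j0, device))
            tile_index += 1
    return C, schedule


# Small usage example: a 4x3 matrix times a 3x2 matrix, 2x2 tiles, 2 simulated devices.
A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.0, 0.0, 1.0]]
B = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
C, schedule = tiled_gemm(A, B, tile=2, num_devices=2)
```

In this sketch, memory use per simulated device depends on the tile size rather than on the full matrix size, which is the property that makes out-of-core operation possible.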

Starting with CUDA 6.0, a free version of cuBLAS-XT is included in the CUDA Toolkit as part of the cuBLAS library. The free version supports operation on single GPUs and on dual-GPU cards such as the Tesla K10 or GeForce GTX 690.

The Premier version of cuBLAS-XT supports scaling across multiple GPUs connected to the same motherboard, with near-perfect scaling as more GPUs are added. A single system with four Tesla K40 GPUs can achieve over 4.5 TFLOPS of double-precision performance!


NVBLAS is a CPU BLAS implementation that automatically accelerates eligible BLAS calls via cuBLAS-XT, and it is included with the CUDA Toolkit. All versions of cuBLAS-XT work with NVBLAS.


Review the latest CUDA 6.5 performance report to learn how much you could accelerate your code.

Performance of cuBLAS-XT DGEMM for a 16,384-by-16,384 matrix.

Performance of cuBLAS-XT SGEMM for a 16,384-by-16,384 matrix.

Performance measured on 1-4 K40 cards with ECC enabled, connected via PCI-E Gen 3 to a dual-socket Intel(R) Xeon(R) CPU E5-2650 @ 2.00GHz.
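
For a sense of scale, the short Python calculation below estimates the work and memory involved in a 16,384-by-16,384 GEMM. It assumes the conventional 2n³ operation count for a square matrix multiply (n³ multiplications plus n³ additions) and counts three dense n-by-n operands (A, B and C). The helper names are hypothetical, and the timing line simply asks how long the multiply would take if the 4.5 TFLOPS figure quoted earlier on this page were sustained for the whole operation; it is not a measured result.

```python
def gemm_flops(n):
    """Conventional operation count for an n-by-n-by-n matrix multiply."""
    return 2 * n ** 3


def matrix_bytes(n, bytes_per_element):
    """Storage for one dense n-by-n matrix."""
    return n * n * bytes_per_element


n = 16384
flops = gemm_flops(n)                            # total floating-point operations
dgemm_gib = 3 * matrix_bytes(n, 8) / 2 ** 30     # A, B, C in double precision
sgemm_gib = 3 * matrix_bytes(n, 4) / 2 ** 30     # A, B, C in single precision
seconds_at_quoted_rate = flops / 4.5e12          # hypothetical sustained 4.5 TFLOPS

print(f"FLOPs: {flops:.3e}")
print(f"DGEMM operands: {dgemm_gib:.1f} GiB, SGEMM operands: {sgemm_gib:.1f} GiB")
print(f"Time at 4.5 TFLOPS: {seconds_at_quoted_rate:.2f} s")
```

Comparing the operand footprint with the on-board memory of a given GPU gives a rough indication of when the out-of-core capability starts to matter.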


The free version of cuBLAS-XT is included with the CUDA Toolkit in version 6.0 and later.

A free evaluation version of cuBLAS-XT Premier is now available to members of the CUDA Registered Developer Program.

cuBLAS-XT Premier will be available without license restrictions in an upcoming release of CUDA.

cuBLAS-XT is supported on all 64-bit platforms. cuBLAS-XT Premier is available only on Windows and Linux.