cuBLAS-XT – Accelerate BLAS calls with multiple GPUs!
cuBLAS-XT is a set of routines that accelerate Level 3 BLAS (Basic Linear Algebra Subprograms) calls by spreading work across more than one GPU. Using a streaming design, cuBLAS-XT automatically manages transfers across the PCI-Express bus, so input and output data can stay in the host’s system memory. This provides out-of-core operation: the size of operand data is limited only by system memory size, not by GPU on-board memory size.
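To make the out-of-core point concrete, the short Python sketch below estimates how much memory the three operands of a square double-precision GEMM occupy and compares that with GPU and host capacity. The helper name `gemm_operand_bytes` and both capacity figures (12 GiB per GPU, 64 GiB of host memory) are assumptions chosen for illustration, not values taken from this page.

```python
GIB = 1024 ** 3

# Illustrative capacities only; substitute the values for your own system.
GPU_MEMORY_GIB = 12    # assumed on-board memory of one GPU
HOST_MEMORY_GIB = 64   # assumed host system memory


def gemm_operand_bytes(m, n, k, bytes_per_element=8):
    """Bytes needed to hold A (m x k), B (k x n) and C (m x n) together."""
    return (m * k + k * n + m * n) * bytes_per_element


for size in (16384, 32768):
    total_gib = gemm_operand_bytes(size, size, size) / GIB
    fits_on_gpu = total_gib <= GPU_MEMORY_GIB
    fits_on_host = total_gib <= HOST_MEMORY_GIB
    print(f"n={size}: {total_gib:.0f} GiB of operands, "
          f"fits on one GPU: {fits_on_gpu}, fits in host memory: {fits_on_host}")
```

Under these assumptions, the 16,384-by-16,384 case needs 6 GiB and would fit on a single 12 GiB GPU, while the 32,768-by-32,768 case needs 24 GiB and only fits in host memory. The second case is where keeping operands on the host matters.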
Starting with CUDA 6.0, a free version of cuBLAS-XT is included in the CUDA Toolkit as part of the cuBLAS library. The free version supports operation on single GPUs and dual-GPU cards such as the Tesla K10 or GeForce GTX 690.
The premier version of cuBLAS-XT supports scaling across multiple GPUs connected to the same motherboard, with near-perfect scaling as more GPUs are added. A single system with four Tesla K40 GPUs can achieve over 4.5 TFLOPS of double-precision performance!
NVBLAS is a CPU BLAS implementation that automatically accelerates eligible BLAS calls via cuBLAS-XT, and is included with the CUDA Toolkit. All versions of cuBLAS-XT work with NVBLAS.
Review the latest CUDA 6.5 performance report to learn how much you could accelerate your code.
Performance of cuBLAS-XT DGEMM for a 16,384-by-16,384 matrix.
Performance of cuBLAS-XT SGEMM for a 16,384-by-16,384 matrix.
Performance measured on 1 to 4 K40 cards with ECC enabled, connected via PCI-E Gen 3 to a dual-socket Intel(R) Xeon(R) CPU E5-2650 @ 2.00GHz.
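As a rough sanity check on these figures, the sketch below converts a sustained FLOP rate into wall-clock time for one 16,384-by-16,384 DGEMM. It uses the conventional 2·m·n·k operation count for GEMM. The function names are hypothetical, and applying the 4.5 TFLOPS figure quoted above to this exact matrix size is an assumption, since the page does not state the size at which that rate was measured.

```python
def gemm_flops(m, n, k):
    """Conventional floating-point operation count for C = A * B (multiply plus add)."""
    return 2 * m * n * k


def seconds_at(flops, tflops):
    """Time in seconds to execute `flops` operations at a sustained rate of `tflops`."""
    return flops / (tflops * 1e12)


n = 16384
flops = gemm_flops(n, n, n)
# 4.5 TFLOPS is the double-precision figure quoted for four Tesla K40 GPUs;
# using it for this matrix size is an assumption for illustration.
elapsed = seconds_at(flops, 4.5)
print(f"{flops:.3e} floating-point operations, about {elapsed:.2f} s at 4.5 TFLOPS")
```

At that rate, a single multiplication of this size would complete in just under two seconds.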
The free version of cuBLAS-XT is included with the CUDA Toolkit, version 6.0 and later.
cuBLAS-XT Premier will be available without license restrictions in an upcoming release of CUDA.
cuBLAS-XT is supported on all 64-bit platforms. cuBLAS-XT Premier is available only on Windows and Linux.