The NVIDIA CUDA Basic Linear Algebra Subroutines (cuBLAS) library is a GPU-accelerated implementation of the complete standard BLAS library, delivering 6x to 17x faster performance than the latest MKL BLAS.
cuBLAS-XT is a set of routines that further accelerate Level 3 BLAS calls by spreading the work across multiple GPUs connected to the same motherboard, with near-perfect scaling as more GPUs are added. Using a streaming design, cuBLAS-XT manages transfers across the PCI-Express bus automatically, which allows input and output data to reside in the host's system memory. This provides out-of-core operation: the size of the operand data is limited only by system memory size, not by GPU on-board memory size. cuBLAS-XT is included in the CUDA 7 Toolkit, and no additional license is required.
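The out-of-core workflow above can be sketched with the cublasXt API: the operands are plain host allocations, and cuBLAS-XT tiles them and streams the tiles to the selected GPUs. The matrix size, the two-GPU device list, and the initialization stub below are illustrative assumptions, not part of the original text.

```cuda
// Hedged sketch of a cuBLAS-XT SGEMM with all operands in host memory.
#include <cublasXt.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const size_t n = 8192;                 /* illustrative size; operands live on the host */
    float *A = (float *)malloc(n * n * sizeof(float));
    float *B = (float *)malloc(n * n * sizeof(float));
    float *C = (float *)malloc(n * n * sizeof(float));
    /* ... fill A and B with application data ... */

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    int devices[2] = {0, 1};               /* assumed: spread the GEMM over two GPUs */
    cublasXtDeviceSelect(handle, 2, devices);

    const float alpha = 1.0f, beta = 0.0f;
    /* cuBLAS-XT tiles A, B, and C and streams the tiles over PCI-Express,
       so the full matrices never need to fit in GPU memory at once. */
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  n, n, n, &alpha, A, n, B, n, &beta, C, n);

    cublasXtDestroy(handle);
    free(A); free(B); free(C);
    return 0;
}
```

Because the pointers passed to cublasXtSgemm are ordinary host pointers, no explicit cudaMemcpy calls are needed; the library overlaps transfers with computation internally.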
NVBLAS is a CPU BLAS implementation that automatically accelerates eligible BLAS calls via cuBLAS-XT, and is included with the CUDA Toolkit. All versions of cuBLAS-XT work with NVBLAS.
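Because NVBLAS exports the standard BLAS symbols, an existing application can pick it up without recompilation by preloading the library and pointing it at a configuration file. A minimal nvblas.conf sketch is shown below; the library paths are assumptions for illustration.

```
# nvblas.conf -- paths are illustrative and site-specific
NVBLAS_LOGFILE       nvblas.log
NVBLAS_CPU_BLAS_LIB  /usr/lib/libopenblas.so   # CPU BLAS used for non-accelerated calls
NVBLAS_GPU_LIST      ALL                       # run eligible Level 3 calls on every visible GPU
NVBLAS_AUTOPIN_MEM_ENABLED
```

With the configuration in place, a program such as GNU Octave can be accelerated with `LD_PRELOAD=libnvblas.so octave`, leaving the application itself unchanged.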
Building on the GPU-accelerated BLAS routines in the cuBLAS library, heterogeneous LAPACK implementations such as CULA Tools and MAGMA are also available.
Review the latest CUDA performance report to learn how much you could accelerate your code.

cuBLAS on K40m, ECC on, input and output data on device. m=n=k=4096, transpose=no, side=right, fill=lower

cuBLAS on K40m, ECC on, input and output data on device. MKL 11.0.4 on Intel IvyBridge, single-socket 12-core E5-2697 v2 @ 2.70GHz

Performance of cuBLAS-XT DGEMM for a 16,384-by-16,384 matrix.

Performance of cuBLAS-XT SGEMM for a 16,384-by-16,384 matrix.
Performance measured on 1-4 K40 cards with ECC enabled, connected via PCI-E Gen 3 to a dual-socket Intel(R) Xeon(R) CPU E5-2650 @ 2.00GHz.
Short presentation using a GNU/Octave implementation as an example of using cuBLAS
The cuBLAS library is freely available as part of the CUDA Toolkit.
For more information on cuBLAS and other CUDA math libraries: