cuBLAS

Feb 01, 2023
New cuBLAS 12.0 Features and Matrix Multiplication Performance on NVIDIA Hopper GPUs
The NVIDIA H100 Tensor Core GPU, based on the NVIDIA Hopper architecture with the fourth generation of NVIDIA Tensor Cores, recently debuted delivering...
10 MIN READ

Dec 05, 2017
CUTLASS: Fast Linear Algebra in CUDA C++
Update May 21, 2018: CUTLASS 1.0 is now available as Open Source software at the CUTLASS repository. CUTLASS 1.0 has changed substantially from our preview...
25 MIN READ

May 11, 2017
CUDA 9 Features Revealed: Volta, Cooperative Groups and More
Figure 1: CUDA 9 provides a preview API for programming Tesla V100 Tensor Cores, providing a huge...
17 MIN READ

Feb 27, 2017
Pro Tip: cuBLAS Strided Batched Matrix Multiply
There’s a new computational workhorse in town. For decades, general matrix-matrix multiply—known as GEMM in Basic Linear Algebra Subroutines (BLAS)...
10 MIN READ

Feb 25, 2015
Deep Speech: Accurate Speech Recognition with GPU-Accelerated Deep Learning
Speech recognition is an established technology, but it tends to fail when we need it the most, such as in noisy or crowded environments, or when the speaker is...
9 MIN READ

Jun 05, 2014
Drop-in Acceleration of GNU Octave
cuBLAS is an implementation of the BLAS library that leverages the teraflops of performance provided by NVIDIA GPUs. However, cuBLAS cannot be used as a...
7 MIN READ

Mar 05, 2014
CUDA Pro Tip: How to Call Batched cuBLAS routines from CUDA Fortran
CUDA Fortran for Scientists and Engineers shows how high-performance application developers can...
7 MIN READ

Jul 02, 2012
Six Ways to SAXPY
For even more ways to SAXPY using the latest NVIDIA HPC SDK with standard language parallelism, see N Ways to SAXPY: Demonstrating the Breadth of GPU...
8 MIN READ

Jun 22, 2011
Accelerated Solution of Sparse Linear Systems
Fresh from the NVIDIA Numeric Libraries Team, a white paper illustrating the use of the CUSPARSE and CUBLAS libraries to achieve a 2x speedup of incomplete-LU-...
1 MIN READ