Technical Walkthrough 3

New cuBLAS 12.0 Features and Matrix Multiplication Performance on NVIDIA Hopper GPUs

The NVIDIA H100 Tensor Core GPU, based on the NVIDIA Hopper architecture with the fourth generation of NVIDIA Tensor Cores, recently debuted delivering... 10 MIN READ
Technical Walkthrough 8

CUTLASS: Fast Linear Algebra in CUDA C++

Update May 21, 2018: CUTLASS 1.0 is now available as Open Source software at the CUTLASS repository. CUTLASS 1.0 has changed substantially from our preview... 25 MIN READ
Technical Walkthrough 0

CUDA 9 Features Revealed: Volta, Cooperative Groups and More

Figure 1: CUDA 9 provides a preview API for programming Tesla V100 Tensor Cores, providing a huge... 17 MIN READ
Technical Walkthrough 0

Pro Tip: cuBLAS Strided Batched Matrix Multiply

There’s a new computational workhorse in town. For decades, general matrix-matrix multiply—known as GEMM in Basic Linear Algebra Subroutines (BLAS)... 10 MIN READ
Technical Walkthrough 0

Deep Speech: Accurate Speech Recognition with GPU-Accelerated Deep Learning

Speech recognition is an established technology, but it tends to fail when we need it the most, such as in noisy or crowded environments, or when the speaker is... 9 MIN READ
Technical Walkthrough 0

Drop-in Acceleration of GNU Octave

cuBLAS is an implementation of the BLAS library that leverages the teraflops of performance provided by NVIDIA GPUs. However, cuBLAS cannot be used as a... 7 MIN READ