Technical Walkthrough 9

CUTLASS: Fast Linear Algebra in CUDA C++

Update May 21, 2018: CUTLASS 1.0 is now available as Open Source software at the CUTLASS repository. CUTLASS 1.0 has changed substantially from our preview... 25 MIN READ
Technical Walkthrough 1

Programming Tensor Cores in CUDA 9

Tensor cores provide a huge boost to convolutions and matrix operations. Tensor cores are... 16 MIN READ
GPU Pro Tip
Technical Walkthrough 0

Pro Tip: cuBLAS Strided Batched Matrix Multiply

There’s a new computational workhorse in town. For decades, general matrix-matrix multiply, known as GEMM in Basic Linear Algebra Subprograms (BLAS)... 10 MIN READ
Technical Walkthrough 0

Graph Coloring: More Parallelism for Incomplete-LU Factorization

In this blog post I will briefly discuss the importance and simplicity of graph coloring and its application to one of the most common problems in sparse linear... 12 MIN READ
CUDA 7
Technical Walkthrough 1

Parallel Direct Solvers with cuSOLVER: Batched QR

[Note: Lung Sheng Chien from NVIDIA also contributed to this post.] A key bottleneck for most science and engineering simulations is the solution of sparse... 15 MIN READ
Technical Walkthrough 0

Optimizing the High Performance Conjugate Gradient Benchmark on GPUs

[This post was co-written by Everett Phillips and Massimiliano Fatica.] The High Performance Conjugate Gradient Benchmark (HPCG) is a new benchmark intended to... 11 MIN READ