Pro Tip
Aug 20, 2019
CUDA Pro Tip: The Fast Way to Query Device Properties
CUDA applications often need to know the maximum available shared memory per block or to query the number of multiprocessors in the active GPU. One way to do...
3 MIN READ
Apr 29, 2019
Pro Tip: Improved GLSL Syntax for Vulkan DescriptorSet Indexing
Sometimes the evolution of programming languages creates situations where "simple" tasks take a bit more complexity to express. Syntax annoyance slows down...
4 MIN READ
Nov 16, 2017
Pro Tip: Pinpointing Runtime Errors in CUDA Fortran
CUDA Fortran for Scientists and Engineers shows how high-performance application developers can...
4 MIN READ
Aug 16, 2017
Pro Tip: Linking OpenGL for Server-Side Rendering
Visualization is a great tool for understanding large amounts of data, but transferring the data from an HPC system or from the cloud to a local workstation for...
6 MIN READ
Feb 27, 2017
Pro Tip: cuBLAS Strided Batched Matrix Multiply
There’s a new computational workhorse in town. For decades, general matrix-matrix multiply—known as GEMM in Basic Linear Algebra Subroutines (BLAS)...
10 MIN READ
Sep 29, 2015
Customize CUDA Fortran Profiling with NVTX
The NVIDIA Tools Extension (NVTX) library lets developers annotate custom events and ranges within the profiling timelines generated using tools such as the...
5 MIN READ
Aug 06, 2015
Voting and Shuffling to Optimize Atomic Operations
Some years ago I started work on my first CUDA implementation of the Multiparticle Collision Dynamics (MPC) algorithm, a particle-in-cell code used to...
10 MIN READ
Jun 29, 2015
GPU Pro Tip: Fast Great-Circle Distance Calculation in CUDA C++
This post demonstrates the practical utility of CUDA’s sinpi() and cospi() functions in the context of distance calculations on earth. With the advent of...
3 MIN READ
Jun 10, 2015
GPU Pro Tip: Lerp Faster in C++
Linear interpolation is a simple and fundamental numerical calculation prevalent in many fields. It's so common in computer graphics that programmers often use...
2 MIN READ
Mar 17, 2015
GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell
Histograms are an important data representation with many applications in computer vision, data analytics and medical imaging. A histogram is a graphical...
9 MIN READ
Feb 11, 2015
GPU Pro Tip: Fast Dynamic Indexing of Private Arrays in CUDA
Sometimes you need to use small per-thread arrays in your GPU kernels. The performance of accessing elements in these arrays can vary depending on a number of...
12 MIN READ
Jan 22, 2015
GPU Pro Tip: CUDA 7 Streams Simplify Concurrency
Heterogeneous computing is about efficiently using all processors in the system, including CPUs and GPUs. To do this, applications must execute functions...
8 MIN READ
Oct 01, 2014
CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics
Note: This post has been updated (November 2017) for CUDA 9 and the latest GPUs. The NVCC compiler now performs warp aggregation for atomics automatically in...
14 MIN READ
Sep 24, 2014
CUDA Pro Tip: Use cuFFT Callbacks for Custom Data Processing
Digital signal processing (DSP) applications commonly transform input data before performing an FFT, or transform output data afterwards. For example, if the...
10 MIN READ
Sep 04, 2014
CUDA Pro Tip: Always Set the Current Device to Avoid Multithreading Bugs
We often say that to reach high performance on GPUs you should expose as much parallelism in your code as possible, and we don't mean just parallelism...
3 MIN READ
Aug 07, 2014
CUDA Pro Tip: Optimize for Pointer Aliasing
Often cited as the main reason that naïve C/C++ code cannot match FORTRAN performance, pointer aliasing is an important topic to understand when considering...
6 MIN READ