Pro Tip

Aug 20, 2019

CUDA Pro Tip: The Fast Way to Query Device Properties

CUDA applications often need to know the maximum available shared memory per block or to query the number of multiprocessors in the active GPU. One way to do...

3 MIN READ

Apr 29, 2019

Pro Tip: Improved GLSL Syntax for Vulkan DescriptorSet Indexing

Sometimes the evolution of programming languages creates situations where "simple" tasks take a bit more complexity to express. Syntax annoyance slows down...

4 MIN READ

Nov 16, 2017

Pro Tip: Pinpointing Runtime Errors in CUDA Fortran

CUDA Fortran for Scientists and Engineers shows how high-performance application developers can...

4 MIN READ

Aug 16, 2017

Pro Tip: Linking OpenGL for Server-Side Rendering

Visualization is a great tool for understanding large amounts of data, but transferring the data from an HPC system or from the cloud to a local workstation for...

6 MIN READ

Feb 27, 2017

Pro Tip: cuBLAS Strided Batched Matrix Multiply

There’s a new computational workhorse in town. For decades, general matrix-matrix multiply—known as GEMM in Basic Linear Algebra Subroutines (BLAS)...

10 MIN READ

Sep 29, 2015

Customize CUDA Fortran Profiling with NVTX

The NVIDIA Tools Extension (NVTX) library lets developers annotate custom events and ranges within the profiling timelines generated using tools such as the...

5 MIN READ

Aug 06, 2015

Voting and Shuffling to Optimize Atomic Operations

Some years ago I started work on my first CUDA implementation of the Multiparticle Collision Dynamics (MPC) algorithm, a particle-in-cell code used to...

10 MIN READ

Jun 29, 2015

GPU Pro Tip: Fast Great-Circle Distance Calculation in CUDA C++

This post demonstrates the practical utility of CUDA’s sinpi() and cospi() functions in the context of distance calculations on earth. With the advent of...

3 MIN READ

Jun 10, 2015

GPU Pro Tip: Lerp Faster in C++

Linear interpolation is a simple and fundamental numerical calculation prevalent in many fields. It's so common in computer graphics that programmers often use...

2 MIN READ

Mar 17, 2015

GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell

Histograms are an important data representation with many applications in computer vision, data analytics and medical imaging. A histogram is a graphical...

9 MIN READ

Feb 11, 2015

GPU Pro Tip: Fast Dynamic Indexing of Private Arrays in CUDA

Sometimes you need to use small per-thread arrays in your GPU kernels. The performance of accessing elements in these arrays can vary depending on a number of...

12 MIN READ

Jan 22, 2015

GPU Pro Tip: CUDA 7 Streams Simplify Concurrency

Heterogeneous computing is about efficiently using all processors in the system, including CPUs and GPUs. To do this, applications must execute functions...

8 MIN READ

Oct 01, 2014

CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics

Note: This post has been updated (November 2017) for CUDA 9 and the latest GPUs. The NVCC compiler now performs warp aggregation for atomics automatically in...

14 MIN READ

Sep 24, 2014

CUDA Pro Tip: Use cuFFT Callbacks for Custom Data Processing

Digital signal processing (DSP) applications commonly transform input data before performing an FFT, or transform output data afterwards. For example, if the...

10 MIN READ

Sep 04, 2014

CUDA Pro Tip: Always Set the Current Device to Avoid Multithreading Bugs

We often say that to reach high performance on GPUs you should expose as much parallelism in your code as possible, and we don't mean just parallelism...

3 MIN READ

Aug 07, 2014

CUDA Pro Tip: Optimize for Pointer Aliasing

Often cited as the main reason that naïve C/C++ code cannot match FORTRAN performance, pointer aliasing is an important topic to understand when considering...

6 MIN READ