Shared Memory

Jul 03, 2024
Just Released: cuDSS 0.3.0
cuDSS (Preview) is an accelerated direct sparse solver. It now supports multi-GPU, multi-node platforms and introduces a hybrid memory mode.
1 MIN READ

Mar 23, 2022
Boosting Application Performance with GPU Memory Prefetching
NVIDIA GPUs have enormous compute power and typically must be fed data at high speed to deploy that power. That is possible, in principle, because GPUs also...
10 MIN READ

Sep 22, 2020
Controlling Data Movement to Boost Performance on the NVIDIA Ampere Architecture
The NVIDIA Ampere architecture provides new mechanisms to control data movement within the GPU and CUDA 11.1 puts those controls into your hands. These...
9 MIN READ

Mar 17, 2015
GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell
Histograms are an important data representation with many applications in computer vision, data analytics and medical imaging. A histogram is a graphical...
9 MIN READ

Feb 03, 2014
CUDA Pro Tip: Do The Kepler Shuffle
When writing parallel programs, you will often need to communicate values between parallel threads. The typical way to do this in CUDA programming is to use...
2 MIN READ

Jan 01, 2014
Peer-to-Peer Multi-GPU Transpose in CUDA Fortran (Book Excerpt)
This post is an excerpt from Chapter 4 of the book CUDA Fortran for Scientists and Engineers, by Gregory Ruetsch and Massimiliano Fatica. In this excerpt we...
12 MIN READ

Apr 08, 2013
Finite Difference Methods in CUDA C++, Part 2
In the previous CUDA C++ post we dove into 3D finite difference computations in CUDA C/C++, demonstrating how to implement the x derivative part of the...
6 MIN READ

Apr 01, 2013
Finite Difference Methods in CUDA Fortran, Part 2
CUDA Fortran for Scientists and Engineers shows how high-performance application developers can...
6 MIN READ

Mar 04, 2013
Finite Difference Methods in CUDA C/C++, Part 1
In the previous CUDA C/C++ post we investigated how we can use shared memory to optimize a matrix transpose, achieving roughly an order of magnitude improvement...
9 MIN READ

Feb 26, 2013
Finite Difference Methods in CUDA Fortran, Part 1
CUDA Fortran for Scientists and Engineers shows how high-performance application developers can...
9 MIN READ

Feb 18, 2013
An Efficient Matrix Transpose in CUDA C/C++
My last CUDA C++ post covered the mechanics of using shared memory, including static and dynamic allocation. In this post I will show some of the performance...
8 MIN READ

Feb 07, 2013
An Efficient Matrix Transpose in CUDA Fortran
CUDA Fortran for Scientists and Engineers shows how high-performance application developers can...
8 MIN READ

Jan 28, 2013
Using Shared Memory in CUDA C/C++
In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride...
10 MIN READ

Jan 15, 2013
Using Shared Memory in CUDA Fortran
In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride...
11 MIN READ