Shared Memory

Jul 03, 2024

Just Released: cuDSS 0.3.0

cuDSS (Preview) is an accelerated direct sparse solver. It now supports multi-GPU, multi-node platforms and introduces a hybrid memory mode.

1 MIN READ

Mar 23, 2022

Boosting Application Performance with GPU Memory Prefetching

NVIDIA GPUs have enormous compute power and typically must be fed data at high speed to deploy that power. That is possible, in principle, because GPUs also...

10 MIN READ

Sep 22, 2020

Controlling Data Movement to Boost Performance on the NVIDIA Ampere Architecture

The NVIDIA Ampere architecture provides new mechanisms to control data movement within the GPU and CUDA 11.1 puts those controls into your hands. These...

9 MIN READ

Mar 17, 2015

GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell

Histograms are an important data representation with many applications in computer vision, data analytics and medical imaging. A histogram is a graphical...

9 MIN READ

Feb 03, 2014

CUDA Pro Tip: Do The Kepler Shuffle

When writing parallel programs, you will often need to communicate values between parallel threads. The typical way to do this in CUDA programming is to use...

2 MIN READ

Jan 01, 2014

Peer-to-Peer Multi-GPU Transpose in CUDA Fortran (Book Excerpt)

This post is an excerpt from Chapter 4 of the book CUDA Fortran for Scientists and Engineers, by Gregory Ruetsch and Massimiliano Fatica. In this excerpt we...

12 MIN READ

Apr 08, 2013

Finite Difference Methods in CUDA C++, Part 2

In the previous CUDA C++ post we dove into 3D finite difference computations in CUDA C/C++, demonstrating how to implement the x derivative part of the...

6 MIN READ

Apr 01, 2013

Finite Difference Methods in CUDA Fortran, Part 2

CUDA Fortran for Scientists and Engineers shows how high-performance application developers can...

6 MIN READ

Mar 04, 2013

Finite Difference Methods in CUDA C/C++, Part 1

In the previous CUDA C/C++ post we investigated how we can use shared memory to optimize a matrix transpose, achieving roughly an order of magnitude...

9 MIN READ

Feb 26, 2013

Finite Difference Methods in CUDA Fortran, Part 1

CUDA Fortran for Scientists and Engineers shows how high-performance application developers can...

9 MIN READ

Feb 18, 2013

An Efficient Matrix Transpose in CUDA C/C++

My last CUDA C++ post covered the mechanics of using shared memory, including static and dynamic allocation. In this post I will show some of the performance...

8 MIN READ

Feb 07, 2013

An Efficient Matrix Transpose in CUDA Fortran

CUDA Fortran for Scientists and Engineers shows how high-performance application developers can...

8 MIN READ

Jan 28, 2013

Using Shared Memory in CUDA C/C++

In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride...

10 MIN READ

Jan 15, 2013

Using Shared Memory in CUDA Fortran

In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and...

11 MIN READ