Memory

Jun 27, 2022
Boosting Application Performance with GPU Memory Access Tuning
NVIDIA GPUs have enormous compute power and typically must be fed data at high speed to deploy that power. That is possible, in principle, as GPUs also have...
13 MIN READ

Jul 27, 2021
Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 2
In part 1 of this series, we introduced new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be...
9 MIN READ

Jul 27, 2021
Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 1
Most CUDA developers are familiar with the cudaMalloc and cudaFree API functions to allocate GPU-accessible memory. However, there has long been an obstacle...
14 MIN READ

Jul 19, 2021
Reducing Acceleration Structure Memory with NVIDIA RTXMU
Acceleration structures spatially organize geometry to accelerate ray tracing traversal performance. When you create an acceleration structure, a conservative...
11 MIN READ

May 20, 2021
Tips: Acceleration Structure Compaction
In ray tracing, more geometry can reside in GPU memory than with rasterization, because rays may hit geometry outside the view frustum....
7 MIN READ

Jan 29, 2021
Managing Memory for Acceleration Structures in DirectX Raytracing
In Microsoft Direct3D, anything that uses memory is considered a resource: textures, vertex buffers, index buffers, render targets, constant buffers, structured...
6 MIN READ

Dec 18, 2020
Making Apache Spark More Concurrent
Apache Spark provides capabilities to program entire clusters with implicit data parallelism. With Spark 3.0 and the open source RAPIDS Accelerator for Spark,...
7 MIN READ

Dec 08, 2020
Fast, Flexible Allocation for NVIDIA CUDA with RAPIDS Memory Manager
When I joined the RAPIDS team in 2018, NVIDIA CUDA device memory allocation was a performance problem. RAPIDS cuDF allocates and deallocates memory at high...
24 MIN READ

Apr 15, 2020
Introducing Low-Level GPU Virtual Memory Management
There is a growing need among CUDA applications to manage memory as quickly and as efficiently as possible. Before CUDA 10.2, the number of options available to...
23 MIN READ

Aug 06, 2019
GPUDirect Storage: A Direct Path Between Storage and GPU Memory
As AI and HPC datasets continue to increase in size, the time spent loading data for a given application begins to place a strain on the total application’s...
17 MIN READ

Feb 21, 2019
Optimizing End-to-End Memory Networks Using SigOpt and GPUs
Natural language systems have become the go-between for humans and AI-assisted digital services. Digital assistants, chatbots, and automated HR systems all rely...
18 MIN READ

Nov 19, 2017
Maximizing Unified Memory Performance in CUDA
Many of today's applications process large volumes of data. While GPU architectures have very fast HBM or GDDR memory, they have limited capacity. Making the...
18 MIN READ

Mar 25, 2014
NVLink, Pascal and Stacked Memory: Feeding the Appetite for Big Data
For more recent info on NVLink, check out the post, "How NVLink Will Enable Faster, Easier Multi-GPU Computing". NVIDIA GPU accelerators have emerged in...
5 MIN READ

Dec 04, 2013
CUDA Pro Tip: Increase Performance with Vectorized Memory Access
Many CUDA kernels are bandwidth bound, and the increasing ratio of flops to bandwidth in new hardware results in more bandwidth-bound kernels. This makes it...
6 MIN READ

Nov 18, 2013
Unified Memory in CUDA 6
With CUDA 6, NVIDIA introduced one of the most dramatic programming model improvements in the history of the CUDA platform, Unified Memory. In a typical PC or...
12 MIN READ

Jan 28, 2013
Using Shared Memory in CUDA C/C++
In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride...
10 MIN READ