When a CUDA kernel requires more hardware registers than are available, the compiler is forced to move the excess variables into local memory, a process known as register spilling. Register spilling affects performance because the kernel must access local memory—physically located in global memory—to read and write the spilled data.
In CUDA Toolkit 13.0, NVIDIA introduced a new optimization feature in the compilation flow: shared memory register spilling for CUDA kernels. This post explains the new feature, highlights the motivation behind its addition, and details how it can be enabled. It also provides guidance on when to consider using it and how to evaluate its potential impact.
| Feature | Shared memory register spilling |
| --- | --- |
| Feature details | Enable shared memory as backing storage for register spills, giving priority to spilling high-cost registers to shared memory first. |
| Platforms affected | All platforms in PTXAS whole-program compilation mode (-rdc=false), which is the default PTXAS mode. |
| User impact | Reduces spill latency and L2 pressure for register-heavy kernels; shared memory usage increases. |
| Opt-in (CUDA 13.0+) | .pragma enable_smem_spilling inline assembly at the kernel definition. Default in CUDA 13.0: disabled (see the PTX documentation). |
How does shared memory register spilling optimize performance?
In CUDA 13.0, PTXAS adds support for spilling registers to shared memory for CUDA kernels. When this feature is enabled, the compiler prioritizes spilling registers into shared memory. If adequate shared memory is not available, any remaining spills fall back to local memory, consistent with previous behavior. This change introduces a performance optimization by taking advantage of the lower-latency, on-chip memory for storing spilled values when available.
Overview of the problem with an example
In toolkits prior to CUDA 13.0, all register spills were placed in local memory, which resides off-chip in device global memory. Although larger L1 cache sizes helped reduce spilling costs for many applications, spilled data could still be written to the L2 cache. This could lead to the eviction of important cache lines and negatively affect overall performance. The impact was particularly noticeable in performance-critical regions with high register pressure, such as loops and frequently executed sections of code.
In many workloads, a significant portion of shared memory often remained unused at runtime. This could happen when shared memory requirements per thread block were low, or when the kernel was not designed to maximize occupancy. For example, if the number of thread blocks per SM was limited by launch bounds or register pressure rather than shared memory usage, each block could end up with more shared memory allocated than it actually needed. Without a way to use this extra shared memory, much of it would go to waste.
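One way to see this imbalance for a particular kernel is to query the occupancy and shared memory numbers at runtime. Below is a minimal, self-contained sketch using the CUDA occupancy API; heavy_kernel is a hypothetical stand-in for a register-limited kernel such as the foo kernel shown next, and the exact figures will vary by device.
// occupancy_check.cu -- minimal sketch, not part of the original example.
// heavy_kernel is a hypothetical stand-in for a register-limited kernel.
#include <cuda_runtime.h>
#include <stdio.h>

extern "C" __launch_bounds__(256)
__global__ void heavy_kernel(float *out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;
}

int main() {
    int device = 0;
    cudaGetDevice(&device);

    // Shared memory physically available per SM on this device.
    int smem_per_sm = 0;
    cudaDeviceGetAttribute(&smem_per_sm,
                           cudaDevAttrMaxSharedMemoryPerMultiprocessor, device);

    // How many 256-thread blocks of heavy_kernel can be resident on one SM
    // (assuming 0 bytes of dynamic shared memory).
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, heavy_kernel, 256, 0);

    // Static shared memory the kernel was compiled to use per block.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, heavy_kernel);

    // Approximate idle shared memory per SM (ignores any small per-block
    // reservation the architecture may add).
    size_t used = (size_t)blocks_per_sm * attr.sharedSizeBytes;
    printf("blocks/SM: %d, static smem/block: %zu B, smem/SM: %d B, idle smem/SM: ~%zu B\n",
           blocks_per_sm, attr.sharedSizeBytes, smem_per_sm, (size_t)smem_per_sm - used);
    return 0;
}
If the number of resident blocks per SM is capped by registers or launch bounds while the static shared memory per block stays small, the difference is shared memory the SM never touches.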
Consider the following code example. It’s not critical to understand every line, but note that it is a kernel designed to use enough registers to cause spilling.
// main.cu
#include <cuda_runtime.h>
#include <stdio.h>

extern "C" __launch_bounds__(256)
__global__ void foo(float *output_tensor, int num_elements) {
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (thread_id >= num_elements) return;

    volatile float input_feature[89], weight_scaled[89], bias_added[89], pre_activation[89];
    volatile float activation_sin[89], activation_cos[89], output_accum[89];

    #pragma unroll
    for (int i = 0; i < 89; ++i) {
        input_feature[i] = (float)thread_id + i;
        weight_scaled[i] = input_feature[i] * 2.0f;
        bias_added[i] = 5 + weight_scaled[i];
        activation_sin[i] = __sinf(bias_added[i] * pre_activation[i]);
        activation_cos[i] = __cosf(activation_sin[i % 2] + pre_activation[i]);
        float product = input_feature[i] * weight_scaled[i];
        float squared = product * product;
        float biased = squared + bias_added[i % 4];
        float shifted_sin = __sinf(biased * 0.5f);
        float shifted_cos = __cosf(shifted_sin + 1.0f);
        float amplified = shifted_cos * bias_added[i % 5];
        float combined = amplified + activation_cos[i];
        output_accum[i] = combined;
    }

    volatile float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 89; ++i) {
        sum += input_feature[i] + weight_scaled[i] + bias_added[i] + pre_activation[i]
             + activation_sin[i] + activation_cos[i] + output_accum[i];
    }
    output_tensor[thread_id] = sum;
}

int main() {
    const int num_elements = 896;
    const int ARRAY_BYTES = num_elements * sizeof(float);
    float host_output[num_elements];
    float *device_output;

    cudaMalloc(&device_output, ARRAY_BYTES);

    const int blockSize = 256;
    const int gridSize = (num_elements + blockSize - 1) / blockSize;
    foo<<<gridSize, blockSize>>>(device_output, num_elements);
    cudaDeviceSynchronize();

    cudaMemcpy(host_output, device_output, ARRAY_BYTES, cudaMemcpyDeviceToHost);
    for (int i = 0; i < num_elements; ++i) {
        printf("host_output[%d] = %f\n", i, host_output[i]);
    }

    cudaFree(device_output);
    return 0;
}
Compile the program with:
nvcc -arch=sm_90 -Xptxas -v main.cu
When the program is compiled this way, without enabling shared memory register spilling, the output is as follows:
ptxas info : Compiling entry function 'foo' for 'sm_90'
ptxas info : Function properties for foo
176 bytes stack frame, 176 bytes spill stores, 176 bytes spill loads
ptxas info : Used 255 registers, used 0 barriers, 176 bytes cumulative stack size
Notice that the output reports spill stores and spill loads, which indicates that registers are being spilled to local memory.
Additionally, in this example, the compiled kernel does not utilize any shared memory, leaving the per-block shared memory allocation entirely unused.
What is an additional solution for register spills introduced in CUDA 13.0?
To improve performance in register-limited kernels, CUDA 13.0 introduces a new optimization that enables register spills to be redirected to shared memory instead of local memory. By utilizing on-chip shared memory, the compiler keeps spilled data closer to the streaming multiprocessor, significantly reducing access latency and alleviating pressure on the L2 cache. This enhancement leads to notable performance gains in scenarios where local memory spilling would typically become a bottleneck.
With the optimization enabled, the compiler first attempts to spill registers into available shared memory and falls back to local memory if there isn’t enough space, ensuring program correctness is maintained.
When the previous kernel is compiled to enable shared memory register spilling, the output is as follows:
ptxas info : Compiling entry function 'foo' for 'sm_90'
ptxas info : Function properties for foo
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 255 registers, used 0 barriers, 46080 bytes smem
Notice that, compared to the previous example, there is no spilling: the stack frame, spill stores, and spill loads are all 0 bytes. The use of shared memory is indicated by 46080 bytes smem.
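As a quick sanity check on that number (approximate, since the exact per-thread spill layout is chosen by PTXAS): 46080 bytes per block / 256 threads per block = 180 bytes per thread, which lines up with the 176 bytes of per-thread spill storage reported by the baseline build.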
Nsight Compute results with and without shared memory register spilling optimization
To show the value of this optimization, the CUDA kernel shown above was benchmarked both with and without the shared memory spilling feature enabled. Table 2 shows Nsight Compute results comparing kernel performance before and after enabling shared memory register spilling optimization. It highlights changes in three key metrics: duration, elapsed cycles, and SM active cycles, demonstrating the efficiency gains from improved register spill handling.
| Metrics | Value without optimization (baseline) | Value with optimization | Improvement |
| --- | --- | --- | --- |
| Duration [us] | 8.35 | 7.71 | 7.76% |
| Elapsed cycles [cycle] | 12477 | 11503 | 7.8% |
| SM active cycles [cycle] | 218.43 | 198.71 | 9.03% |
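For reference, metrics like these can be collected with the Nsight Compute CLI. One possible invocation, assuming ncu is on the PATH and the program was built as a.out with the command shown earlier:
ncu ./a.out
The default report includes the GPU Speed Of Light Throughput section, which lists Duration, Elapsed Cycles, and SM Active Cycles for each profiled kernel.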
How to opt in to shared memory register spilling
The shared memory register spilling feature was introduced in CUDA 13.0 and is not available in earlier toolkit versions. Developers targeting CUDA 13.0 and later must explicitly opt in by adding the PTX pragma enable_smem_spilling through inline assembly inside the function, directly after the function declaration:
#include <cuda_runtime.h>
#include <stdio.h>

extern "C" __launch_bounds__(256)
__global__ void foo(float *output_tensor, int num_elements) {
    asm volatile (".pragma \"enable_smem_spilling\";");
    int thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (thread_id >= num_elements) return;

    volatile float input_feature[89], weight_scaled[89], bias_added[89], pre_activation[89];
    volatile float activation_sin[89], activation_cos[89], output_accum[89];

    #pragma unroll
    for (int i = 0; i < 89; ++i) {
        input_feature[i] = (float)thread_id + i;
        weight_scaled[i] = input_feature[i] * 2.0f;
        bias_added[i] = 5 + weight_scaled[i];
        activation_sin[i] = __sinf(bias_added[i] * pre_activation[i]);
        activation_cos[i] = __cosf(activation_sin[i % 2] + pre_activation[i]);
        float product = input_feature[i] * weight_scaled[i];
        float squared = product * product;
        float biased = squared + bias_added[i % 4];
        float shifted_sin = __sinf(biased * 0.5f);
        float shifted_cos = __cosf(shifted_sin + 1.0f);
        float amplified = shifted_cos * bias_added[i % 5];
        float combined = amplified + activation_cos[i];
        output_accum[i] = combined;
    }

    volatile float sum = 0.0f;
    #pragma unroll
    for (int i = 0; i < 89; ++i) {
        sum += input_feature[i] + weight_scaled[i] + bias_added[i] + pre_activation[i]
             + activation_sin[i] + activation_cos[i] + output_accum[i];
    }
    output_tensor[thread_id] = sum;
}
What are the limitations of shared memory register spilling?
This optimization presents a performance opportunity for device code and the PTXAS compiler, but it comes with important constraints. The pragma is valid only at function scope, and using it in the following scenarios can lead to compilation errors:
- Per-function compilation mode, such as nvcc -rdc=true or ptxas -c, nvcc -G or ptxas -g, and nvcc -ewp or ptxas -ewp. Note that device-debug compilation modes (nvcc -G or ptxas -g) also imply per-function compilation.
- Kernels that use dynamically allocated shared memory.
- Kernels that perform dynamic reallocation of registers across warps.
If launch bounds are not explicitly specified, PTXAS assumes the maximum possible number of threads per thread block when estimating shared memory usage. If the kernel is launched with fewer threads than estimated, the actual shared memory allocated per block may exceed what is necessary, potentially limiting the number of thread blocks that can run concurrently on a streaming multiprocessor. This may lead to reduced occupancy and performance regressions. To ensure more predictable behavior and better performance, it’s recommended to use this feature only when launch bounds are explicitly defined.
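A minimal sketch of the recommended pattern, using a hypothetical kernel bar: explicit launch bounds plus the opt-in pragma, so PTXAS sizes the per-block spill area for a known maximum block size rather than for the hardware maximum.
extern "C" __launch_bounds__(128)   // this block will never exceed 128 threads
__global__ void bar(float *out) {
    // Opt in to shared memory register spilling; with the launch bounds above,
    // PTXAS can size the per-block spill area for at most 128 threads.
    asm volatile (".pragma \"enable_smem_spilling\";");
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0f;  // placeholder body
}
The kernel must then be launched with at most 128 threads per block, matching the declared bound.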
What performance gains are possible for real workloads?
This optimization was evaluated across a variety of CUDA kernels from the QUDA library, which is used for lattice QCD calculations on GPUs. As shown in the chart, the optimization led to performance gains typically in the range of 5-10%. These improvements stem from either a reduction or complete elimination of register spills to local memory, enabled by redirecting spills to shared memory.

Get started with shared memory register spilling optimization
CUDA 13.0 now includes a PTXAS optimization that enables register spills to be efficiently handled by shared memory, leading to enhanced performance in kernels experiencing high register pressure. If your CUDA kernel has well-defined launch bounds and consistent shared memory utilization, try shared memory spilling by opting in with the inline pragma enable_smem_spilling.
Acknowledgments
Thanks to the following NVIDIA contributors: Jerry Zheng, Kate Clark, Howard Chen, Neumann Hon, Jaewook Shin, Abhishek Patwardhan, and Yufan Cheng.