Redundancy and Latency in Structured Buffer Use

In a recent post, we discussed a simple pothole that developers can often hit in the use of structured buffers. This post dives into a much more subtle issue where shader structure impacts the efficiency of processing structured buffers.

Developers can gain substantial performance by paying attention to redundancy in structured buffer processing. It is quite common to see code like the following. The code certainly isn’t wrong, and it may even be the best solution. However, every thread executes the same loop and redundantly fetches the same data. For a large light list, this is potentially a large amount of redundant work.

StructuredBuffer<Light> LightBuf;

for (int i = 0; i < num_lights; i++)
{
    Light cur_light = LightBuf[i];

    // Do Lighting
}

The mechanisms that implement structured buffers are architected around good performance for divergent accesses, which means each fetch can carry a fair amount of latency. When all threads are completely coherent, the cache hit ratio is fantastic, but a high hit ratio does not remove that latency. In a case like the code above, having each thread fetch a different light in parallel would likely cost approximately the same latency as a single fetch, but more useful work would be accomplished. Batching the data into shared memory could be a win in this case. Below is a snippet of what you could do in a compute shader:


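StructuredBuffer<Light> LightBuf;
groupshared Light LocalLights[MAX_LIGHTS];

// Each thread loads one light; assumes num_lights <= MAX_LIGHTS
// and at least num_lights threads in the group.
if (ThreadIndex < num_lights)
{
    LocalLights[ThreadIndex] = LightBuf[ThreadIndex];
}

// Wait until every thread's shared memory write is visible to the group.
GroupMemoryBarrierWithGroupSync();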

for (int i = 0; i < num_lights; i++)
{
    Light cur_light = LocalLights[i];

    // Do Lighting
}
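For context, here is a fuller sketch of how the batching idea might look as a complete compute shader. It is an illustration under stated assumptions rather than a tuned implementation: the `Light` layout, `GROUP_SIZE`, `MAX_LIGHTS`, the `LightCB` constant buffer, the `Output` buffer, and the placeholder lighting math are all hypothetical. It also strides the load by the group size so that a light list longer than the thread group is still covered.

```hlsl
// Hypothetical names and sizes throughout; adjust to your own data.
#define GROUP_SIZE 64
#define MAX_LIGHTS 256   // assumed cap: 256 lights * 32 bytes = 8 KB of groupshared

struct Light             // assumed 32-byte layout
{
    float3 position;
    float  radius;
    float3 color;
    float  intensity;
};

cbuffer LightCB : register(b0)
{
    uint num_lights;
};

StructuredBuffer<Light>    LightBuf : register(t0);
RWStructuredBuffer<float4> Output   : register(u0);

groupshared Light LocalLights[MAX_LIGHTS];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint ThreadIndex : SV_GroupIndex, uint3 DTid : SV_DispatchThreadID)
{
    // Never read or write past the shared array.
    uint count = min(num_lights, (uint)MAX_LIGHTS);

    // Each thread fetches a different light, striding by the group size,
    // so the fetch latency is paid roughly once per round, not once per light.
    for (uint i = ThreadIndex; i < count; i += GROUP_SIZE)
    {
        LocalLights[i] = LightBuf[i];
    }

    // All shared memory writes must be visible before any thread reads them.
    GroupMemoryBarrierWithGroupSync();

    float3 result = float3(0.0f, 0.0f, 0.0f);
    for (uint j = 0; j < count; j++)
    {
        Light cur_light = LocalLights[j];

        // Placeholder for the real lighting computation.
        result += cur_light.color * cur_light.intensity;
    }

    Output[DTid.x] = float4(result, 1.0f);
}
```

The barrier sits outside the load loop, so every thread in the group reaches it regardless of how many lights it loaded. Whether this beats the plain loop depends on the light count, the group size, and how much shared memory the rest of the shader needs, so it is worth profiling both.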

An optimization like this adds complexity, and it may not always be a win due to issues like shared memory pressure or extra barrier instructions. The size of the structure also affects how efficiently this works. (For example, a structure that is 1024 bytes in size will lead to some inefficiency, as the stride between the addresses that neighboring threads access is quite large.) In some cases, using a simpler layout where you flatten things to an array of float or float4 and compute the index offsets manually can be a win. The code is a bit ugly, but this is often an inner loop, and the redundancy elimination may well be worth a couple of ugly macros. As with many things, your mileage will vary, but it is at least something to consider when working with structured buffers.
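As a rough illustration of the flattened layout, the sketch below stores each light as consecutive float4 values and uses macros to compute the offsets. Everything named here is an assumption made for the example: the two-float4 packing, the `LIGHT_*` macros, `LightData`, `LightCB`, `Output`, and the placeholder lighting math.

```hlsl
// Hypothetical packing: each light occupies LIGHT_STRIDE consecutive float4s.
//   [0] xyz = position, w = radius
//   [1] xyz = color,    w = intensity
#define LIGHT_STRIDE 2
#define LIGHT_POS_RADIUS(buf, i)      ((buf)[(i) * LIGHT_STRIDE + 0])
#define LIGHT_COLOR_INTENSITY(buf, i) ((buf)[(i) * LIGHT_STRIDE + 1])

#define GROUP_SIZE 64
#define MAX_LIGHTS 256   // assumed cap

cbuffer LightCB : register(b0)
{
    uint num_lights;
};

StructuredBuffer<float4>   LightData : register(t0);
RWStructuredBuffer<float4> Output    : register(u0);

groupshared float4 LocalLightData[MAX_LIGHTS * LIGHT_STRIDE];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint ThreadIndex : SV_GroupIndex, uint3 DTid : SV_DispatchThreadID)
{
    uint light_count  = min(num_lights, (uint)MAX_LIGHTS);
    uint float4_count = light_count * LIGHT_STRIDE;

    // Neighboring threads copy neighboring float4s (16-byte stride),
    // regardless of how large a single light is.
    for (uint k = ThreadIndex; k < float4_count; k += GROUP_SIZE)
    {
        LocalLightData[k] = LightData[k];
    }

    GroupMemoryBarrierWithGroupSync();

    float3 result = float3(0.0f, 0.0f, 0.0f);
    for (uint i = 0; i < light_count; i++)
    {
        float4 pos_radius      = LIGHT_POS_RADIUS(LocalLightData, i);   // unused by the placeholder math
        float4 color_intensity = LIGHT_COLOR_INTENSITY(LocalLightData, i);

        // Placeholder for the real lighting computation.
        result += color_intensity.rgb * color_intensity.a;
    }

    Output[DTid.x] = float4(result, 1.0f);
}
```

The trade-off is readability: the field names live only in the macros and the comment, so the packing has to be kept in sync by hand with whatever code fills the buffer on the CPU side.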

Experienced readers may be asking themselves how constant buffers compare with structured buffers on these issues. The answer is that they can actually be dramatically faster. I’ll demonstrate this in the final blog post in the series.