How About Constant Buffers?

Continuing from my previous two posts in the series, this post looks at how constant buffers can help you avoid the pitfalls you may encounter with structured buffers. As a developer, you should consider why you are using a structured buffer in the first place. If the data fits within the 64 KB limit of a constant buffer, then a constant buffer may be a good choice. This applies doubly when the code uses a coherent access pattern like the one I described in the last post. Coherent constant buffer accesses are quite efficient, with lower latency than structured buffer accesses. If your access pattern is largely incoherent, then a structured buffer really is the better idea.

Recently, I worked with a shader derived from real game code that experienced the exact pattern described above: structured buffers, 128-bit stride, coherent access patterns, and under 64 KB of data. Converting that structured buffer to a constant buffer resulted in a 25-33% improvement in the execution time of a single shader and a 10% improvement in overall framerate. I think everyone can agree that this is a pretty impressive improvement for simply changing the type of a single resource. The code snippet below shows the changes necessary:

 
struct Light
{
    float3 Position;
    float Radius;
    float4 Color;
    float3 AttenuationParams;
    uint Type;
};

// Original code
StructuredBuffer<Light> LightBuf;

for (int i = 0; i < num_lights; i++)
{
    Light cur_light = LightBuf[i];

    // Do Lighting
}

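// Revised code
// MAX_LIGHTS * sizeof(Light) must fit in 64 KB; Light is 48 bytes,
// so MAX_LIGHTS can be at most 1365
cbuffer LightCBuf
{
    Light LightBuf[MAX_LIGHTS];
};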
 
for (int i = 0; i < num_lights; i++)
{
    Light cur_light = LightBuf[i];

    // Do Lighting
}

As you can see, the body of the code doesn’t really need to change, as long as you take care to ensure that the layout of the struct packs the same way in both structured buffers and constant buffers. Achieving that consistency is actually quite simple and generally beneficial as well. 90% of the problem is covered by aligning your vectors to 128-bit boundaries, which was shown to improve performance in the first post in this series. The worst stumbling block is probably something like an array of scalar floats, where a constant buffer places each element in its own 16-byte register while a structured buffer packs the elements tightly. I’ve never personally seen that used in a structured buffer. As a warning, D3D 11 will unfortunately not allow you to create a resource with the bind flags to work as both a structured buffer and a constant buffer. If you need both, you must create a second resource and use CopyResource to move the data.
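To make the packing requirement concrete, here is a minimal C++ sketch that compares the offsets a struct's members would get under the two sets of rules. It is only an illustration under simplifying assumptions: it handles 32-bit scalars and vectors of up to four components (no arrays, matrices, or nested structs), and the names `structuredOffsets`, `constantBufferOffsets`, and `packsConsistently` are hypothetical helpers, not part of any D3D API.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Member sizes in bytes, in declaration order (float = 4, float3 = 12, ...).
using Sizes = std::vector<std::size_t>;

// Structured buffer rule (assumed here): members are packed tightly.
Sizes structuredOffsets(const Sizes& memberBytes) {
    Sizes offsets;
    std::size_t offset = 0;
    for (std::size_t bytes : memberBytes) {
        offsets.push_back(offset);
        offset += bytes;
    }
    return offsets;
}

// Constant buffer rule: a member may not straddle a 16-byte register,
// so it is bumped to the next register if it does not fit in the current one.
Sizes constantBufferOffsets(const Sizes& memberBytes) {
    Sizes offsets;
    std::size_t offset = 0;
    for (std::size_t bytes : memberBytes) {
        assert(bytes >= 4 && bytes <= 16 && bytes % 4 == 0);
        const std::size_t roomInRegister = 16 - offset % 16;
        if (bytes > roomInRegister) {
            offset += roomInRegister;
        }
        offsets.push_back(offset);
        offset += bytes;
    }
    return offsets;
}

// The struct can be shared if every member lands at the same offset and the
// total size is a multiple of 16 bytes, so array elements have the same stride.
bool packsConsistently(const Sizes& memberBytes) {
    const std::size_t stride =
        std::accumulate(memberBytes.begin(), memberBytes.end(), std::size_t{0});
    return stride % 16 == 0 &&
           structuredOffsets(memberBytes) == constantBufferOffsets(memberBytes);
}
```

For the Light struct above, `packsConsistently({12, 4, 16, 12, 4})` holds, while a struct that starts with a float2 followed by a float3, such as `{8, 12, 12}`, does not, because the float3 would be pushed to the next register in a constant buffer.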

On a more practical note, one of the most common uses for structured buffers is to drive tiled deferred lighting. Tiled deferred lighting can run into two limitations of constant buffers. First, the amount of data required may not fit within a constant buffer. Second, while the shading pass is coherent, the culling pass is typically completely divergent. Luckily, there are a couple of pretty simple solutions to both of these.

Supporting data beyond what fits within a single constant buffer can be as easy as dividing up your structure. If you require a 256-byte structure to represent your light sources and you need to support more than 256 lights, your data obviously won’t fit in a constant buffer, since 256 lights at 256 bytes each already fill the full 64 KB. However, you have two fairly simple solutions, demonstrated in the code below. You can split the struct into multiple structs, or you can leave the structured buffer as is and create a parallel constant buffer with just the most commonly used data.

 
struct Light // 144 bytes, only 455 lights possible w/ CB
{
    float3 Position;
    float Radius;
    float4 Color;
    float3 AttenuationParams;
    uint Type;
    float4 SpotDirectionAndAngle;
    float4 ShadowRect;
    float4x4 ShadowMatrix;
};

// Original structured buffer version
StructuredBuffer<Light> LightBuf;

/*
 * Two constant buffers, with lesser-used shadow data in second
 */
struct LightBase // 64 bytes, 1024 lights possible
{
    float3 Position;
    float Radius;
    float4 Color;
    float3 AttenuationParams;
    uint Type;
    float4 SpotDirectionAndAngle;
};

struct LightShadow // 80 bytes, 819 lights possible
{
    float4 ShadowRect;
    float4x4 ShadowMatrix;
};

// MAX_LIGHTS restricted to min of two structures
#define MAX_LIGHTS 819

cbuffer LightCBuf
{
    LightBase LightBufBase[MAX_LIGHTS];
};

cbuffer LightCBufShadow
{
    LightShadow LightBufShadow[MAX_LIGHTS];
};


/*
 * Alternative to the two constant buffers above (use one or the other):
 * one constant buffer for core parameters with structured buffer
 * for infrequently used parameters and divergent access
 */

// MAX_LIGHTS restricted by only the one structure
#define MAX_LIGHTS 1024

cbuffer LightCBuf
{
    LightBase LightBufBase[MAX_LIGHTS];
};

StructuredBuffer<Light> LightBuf;
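The light counts quoted in the comments follow directly from the 64 KB limit. As a quick sanity check, the arithmetic can be sketched in C++ as below; `kConstantBufferBytes`, `roundUp16`, and `maxElements` are hypothetical names, and the sketch assumes each array element is padded to a multiple of 16 bytes.

```cpp
#include <cstddef>

// A constant buffer holds at most 4096 float4 registers of 16 bytes each.
constexpr std::size_t kConstantBufferBytes = 4096 * 16;  // 64 KB

// Array elements in a constant buffer start on 16-byte boundaries.
constexpr std::size_t roundUp16(std::size_t bytes) {
    return (bytes + 15) / 16 * 16;
}

// Number of array elements of the given struct size that fit in one buffer.
constexpr std::size_t maxElements(std::size_t structBytes) {
    return kConstantBufferBytes / roundUp16(structBytes);
}

static_assert(maxElements(144) == 455, "full Light struct");
static_assert(maxElements(64) == 1024, "LightBase");
static_assert(maxElements(80) == 819, "LightShadow");
static_assert(maxElements(256) == 256, "256-byte struct from the text");
```

When a struct is split across several constant buffers that share one index, the usable light count is the smallest of the per-struct results, which is why the two-buffer variant is limited to 819.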

Finally, dealing with the divergence is even easier. As you probably noted from the comment on the structured buffer variant above, simply binding a copy of the data as a structured buffer to use during the culling phase solves this. Sure, you need to make a redundant copy of the data, but an extra 64-256 KB of data seems like a small price to pay for gains that can add up to 10% of your whole framerate.
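One way to produce that redundant copy is to build it on the CPU alongside the full light list. The sketch below is only an illustration of the idea, not code from the game in question: the C++ structs are assumed mirrors of the shader structs, nesting `LightBase` inside `Light` is my own choice to keep the first 64 bytes identical, and `extractBaseData` and `kMaxLights` are hypothetical names. Uploading the two arrays to their D3D resources is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU-side mirror of the frequently used light data (64 bytes).
struct LightBase {
    float Position[3];
    float Radius;
    float Color[4];
    float AttenuationParams[3];
    std::uint32_t Type;
    float SpotDirectionAndAngle[4];
};

// CPU-side mirror of the full light (144 bytes), as stored in the
// structured buffer. The first 64 bytes match LightBase.
struct Light {
    LightBase Base;
    float ShadowRect[4];
    float ShadowMatrix[16];
};

static_assert(sizeof(LightBase) == 64, "must match the shader-side LightBase");
static_assert(sizeof(Light) == 144, "must match the shader-side Light");

// 64 KB / 64 bytes per LightBase.
constexpr std::size_t kMaxLights = 1024;

// Copies the core parameters of each light into a separate array that can be
// uploaded to the constant buffer; the full array goes to the structured buffer.
std::vector<LightBase> extractBaseData(const std::vector<Light>& lights) {
    const std::size_t count = std::min(lights.size(), kMaxLights);
    std::vector<LightBase> base;
    base.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        base.push_back(lights[i].Base);
    }
    return base;
}
```

With 1024 lights, the extracted array is exactly 64 KB, which matches the low end of the extra memory mentioned above.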

To close, I want to thank a few people who helped bring this series together. Thanks to Alex Dunn for code that was used to test some of these findings. Thanks to Dmitry Zhdan and Holger Gruen for validating my results in additional cases. Finally, thanks to others in our Devtech, Applied Architecture, and Driver Performance teams for discussion, feedback, and edits.