Simulation / Modeling / Design

Managing Memory for Acceleration Structures in DirectX Raytracing

In Microsoft Direct3D, anything that uses memory is considered a resource: textures, vertex buffers, index buffers, render targets, constant buffers, structured buffers, and so on. It’s natural to think that each individual object, such as a texture, is always one resource. In this post, I discuss DXR’s Bottom Level Acceleration Structures (BLASes) and best practices with regard to managing them. Specifically, I talk about the problems caused by having one Direct3D resource for each BLAS, and why it isn’t actually necessary or desirable.

DirectX 12 introduced the concept of placed resources, which decouples the allocation of memory from the resource itself. That means that you can now manage the memory allocation of resources yourself, but you still have one ID3D12Resource for each object.

Ray tracing adds a new resource type: the acceleration structure. It’s natural to assume the same pattern applies here too. The DirectX Ray tracing (DXR) API leads you into thinking that you should create a Direct3D resource for every bottom-level acceleration structure (BLAS) that you want to create. There’s the D3D12_RESOURCE_STATE_RAYTRACING_ACCELERATION_STRUCTURE resource state that you must put the resources into, and you have to use resource barriers between updating the resources and using them.

Placed vs. committed resources

So one BLAS is one ID3D12Resource. Great, what’s the problem?

First, look at placed resources as opposed to committed resources. Using placed resources means that you manage the memory for the resources yourself. Why would you want to do that? A good reason for acceleration structures specifically relates to something called TLB (Translation Lookaside Buffer) thrashing. I’m not going to dive too much into the details for this post, but if you have a separate committed resource for each BLAS, then you have no control over where they live in GPU memory. They might be scattered over many different memory pages. The nature of ray tracing means that, depending on the distribution of rays being fired, BLAS structures can be accessed randomly. If that means accessing lots and lots of different memory pages,  it can be a performance bottleneck.

So you allocate large memory buffers and allocate your BLAS resources from within, using placed resources. Great. Problem solved?

Alignment

Well, a hint that something isn’t quite right might come when you notice that D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BYTE_ALIGNMENT is 256 bytes. A 256 B alignment is appealing because often BLAS structures can be quite small, and the waste due to aligning to 64 KiB can become significant if you have a lot of small BLASes. However, changing your BLAS placed-resource allocation strategy to use 256 B alignment results in the following error:

D3D12 ERROR: ID3D12Device::CreatePlacedResource: This resource cannot be created on this heap, due to unsatisfied resource alignment requirements. The resource must be aligned to 65536….

The problem is that the resource that you create has to be of dimension D3D12_RESOURCE_DIMENSION_BUFFER, which has an alignment requirement of 64 KiB. D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BYTE_ALIGNMENT is only 256 B, so how do you create a placed resource with 256 B alignment for your BLAS resources?

The answer is, you don’t. Instead, you create large container resources, and then allocate memory within them, using your favorite memory allocation scheme.

All of the Direct3D APIs that use BLASes use D3D12_GPU_VIRTUAL_ADDRESS, not ID3D12Resource; for example, D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_DESC. To build, update, or refer to a BLAS in a TLAS, you can use D3D12_GPU_VIRTUAL_ADDRESS. Surprisingly, you don’t need to create a Direct3D resource for each BLAS at all.

Managing your own memory means that each BLAS can now be aligned to 256 B instead of 64 KiB, so the amount of wasted memory on average is potentially much lower for small BLASes. Using BLAS compaction might now be more effective too. Imagine a case where an uncompacted BLAS is just smaller than 64 KiB. If you’re using 64 KiB alignment, then the BLAS uses 64 KiB, even if compaction reduces it to something tiny. However, using 256 B alignment means that you get to see the results of compaction in terms of actual memory used much more, at least for small BLASes.

Resource barriers

That leaves the question of resource barriers. You must insert UAV barriers between creating or updating a BLAS and using that BLAS. However, barriers refer to resources—they are called resource barriers, after all. So how do you use a resource barrier to indicate that you’ve updated or created a BLAS, when that BLAS is only a small part of a larger container resource?  There are a couple of options:

Null UAV barriers are normally considered lazy and are potentially suboptimal because there is no opportunity for the driver to optimize the barrier. A null UAV barrier is effectively saying to the driver, “At some point, we wrote something to a UAV, and we’re probably going to read from it soon, but we’re not going to tell you what it was.”  However, if you’re only going to use one or two null UAV barriers, then nobody is going to get too upset,  especially if you can use them on an async compute queue.

Results

This technique is used in Minecraft with RTX. This game has a lot of small BLASes, leading to good savings. Here are some figures from a reasonably complex world, after playing for some time to generate some fragmentation.

 Individual resource for each BLASBLASes packed within 8 MiB blocks
Number of BLASes1298212982
Total BLAS data392 MiB392 MiB
Number of allocations1298265 (One per 8 MiB block)
Total memory allocated962 MiB520 MiB
Alignment waste570 MiB1 MiB
Fragmentation waste0127 MiB
Total waste570 MiB128 MiB
Table 1. Example of savings in Minecraft.

In the packed scheme, there is still wasting a large amount of memory to fragmentation, but it is considerably less than the waste caused by 64 KiB alignment in this case. Better memory management techniques would mitigate the fragmentation. However, when you are allocating individual resources, the waste due to the 64 KiB alignment is greater than the amount of actual data in the BLASes.

Summary

By not creating a Direct3D resource for each BLAS, you can put many BLASes into a small number of memory pages, reducing TLB pressure. You can also reduce the alignment requirement from 64 KiB to 256 B, reducing the amount of memory wasted due to alignment overhead. That means more performance and less memory, which everyone likes.

Discuss (1)

Tags