Advanced API Performance: Memory and Resources

This post covers best practices for memory and resources on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips.

Memory

Optimal memory management in DirectX 12 is critical to a performant application. The following advice should be followed for the best performance while avoiding stuttering.

  • Video memory management:
    • Use IDXGIAdapter3::QueryVideoMemoryInfo to get accurate information about the available video memory. The foreground app isn’t necessarily allocated all, or even a high percentage, of video memory.
    • Respond to budget changes from OS using IDXGIAdapter3::RegisterVideoMemoryBudgetChangeNotificationEvent.
    • Use ID3D12Device1::SetResidencyPriority to tell the OS which heaps should stay in memory and which should be demoted first when video memory is limited.
      • The highest priority should be given to GPU-written resources such as render target, depth buffer, and UAVs.
    • Use MakeResident and Evict to stay within budget.
      • Drop mip levels of tiled resources as needed.
      • Consider using MakeResident or Evict before or after executing command lists when you are going over the video memory budget.
      • Applications must handle cases when MakeResident fails.
    • Batch up MakeResident calls to lower the overhead inside the driver and the GPU, but still expect a CPU and GPU cost for page table updates.
  • Resource creation:
    • Resource creation should be done off the critical path and in advance of when an allocation is required.
    • Use CreateHeap with CreatePlacedResource and CreateReservedResource to maximize memory reuse and reduce the number of memory allocation calls when the application is running.
    • Be aware that certain resource types have different alignment rules within a heap.
    • Check resource heap tier capabilities and plan for the differences between tiers within a device feature level. For example, tier 1 heaps can hold only a single category of resource.
    • Align any updates to placed or tiled resources at a 2-MB granularity for best performance.
  • Memory aliasing:
    • For best performance when aliasing resources, group resources that can be compressed on GPU hardware together, and keep them separate from resources that cannot be compressed.
      • Uncompressed resources:
        • Any resource with size less than 512×512 pixels
        • All block-compressed resources
        • All other textures and UAVs that do not meet the compressed resource requirements
      • Compressed resources:
        • All depth buffers
        • Render targets greater than 512×512 pixels
        • Textures and UAVs greater than 512×512 pixels
        • Formats which support arithmetic compression natively (8-8-8-8, 16-16, and so on)
        • 64-KB aligned buffers, which start out uncompressed.
    • Using committed resources over heaps is preferred for uncompressed resources. This removes the CPU cost of tracking these resources in case of any alignment with compressed resources.
      • If this is not possible, then all uncompressed textures should get their own heap, as they can alias among themselves.
    • If an application is low on memory, split a heap into uncompressed and compressed segments and map resources into the appropriate regions.
  • Reserved or tiled resources:
    • Move all UpdateTileMappings calls to an asynchronous copy queue to hide the OS scheduling and submit costs.
    • There is no need for explicit unmap calls as long as you remap the same tiles.
    • Use resources that can be compressed on GPU hardware: textures, render targets, and UAVs.
    • Align updates to tiled resources at a 2-MB granularity for optimal compression.
  • Don’t rely on the availability of tiled resources. Check cap bits.
    • You still need to think about different DX12 hardware classes.
  • Don’t rely on being able to allocate all GPU memory in one go.
  • Don’t expect an immediate cost for an Evict call. The cost might be deferred until another MakeResident call uses the memory.
  • Don’t use a pattern of creation and destruction of resources.
    • Make use of Evict and MakeResident where possible, as this saves the overhead of creating and destroying resources.
  • Avoid explicit unmap calls through UpdateTileMappings when pHeap is NULL or when D3D12_TILE_RANGE_FLAG_NULL is set. This forces the driver to iterate over all mappings and remove the tiles that are no longer mapped. Instead, switch to a different tile.
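
The budget and alignment advice above can be sketched in portable C++. This is only an illustration of the bookkeeping, not a Direct3D 12 implementation: VideoMemoryInfo, HeapRecord, chooseHeapsToEvict, and alignUp are hypothetical names, the integer priority is a stand-in for whatever was passed to SetResidencyPriority, and the sketch assumes the application tracks heap sizes itself. A real application would fill the budget and usage values from QueryVideoMemoryInfo and pass the chosen heaps to Evict in one batch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the Budget and CurrentUsage fields of
// DXGI_QUERY_VIDEO_MEMORY_INFO; a real application queries them from the OS.
struct VideoMemoryInfo {
    std::uint64_t budget;        // bytes the OS currently grants this process
    std::uint64_t currentUsage;  // bytes this process currently has resident
};

// Hypothetical application-side record for one heap.
struct HeapRecord {
    int id;                   // application handle for the heap
    std::uint64_t sizeBytes;  // size of the heap
    int priority;             // higher = keep longer (GPU-written resources highest)
    bool resident;            // whether the heap is currently resident
};

// Returns the ids of resident heaps to evict in a single batch, lowest
// priority first, until the projected usage fits within the budget.
std::vector<int> chooseHeapsToEvict(const VideoMemoryInfo& info,
                                    std::vector<HeapRecord> heaps) {
    std::vector<int> toEvict;
    if (info.currentUsage <= info.budget) {
        return toEvict;  // already within budget, nothing to do
    }
    std::stable_sort(heaps.begin(), heaps.end(),
                     [](const HeapRecord& a, const HeapRecord& b) {
                         return a.priority < b.priority;
                     });
    std::uint64_t usage = info.currentUsage;
    for (const HeapRecord& heap : heaps) {
        if (usage <= info.budget) break;
        if (!heap.resident) continue;  // evicting it would free nothing
        usage -= std::min(usage, heap.sizeBytes);
        toEvict.push_back(heap.id);
    }
    return toEvict;
}

// 2-MB granularity recommended above for updates to placed or tiled resources.
const std::uint64_t kTwoMB = 2ull * 1024 * 1024;

// Rounds a byte offset or size up to the next multiple of the granularity.
inline std::uint64_t alignUp(std::uint64_t bytes, std::uint64_t granularity = kTwoMB) {
    assert(granularity != 0 && "granularity must be non-zero");
    return (bytes + granularity - 1) / granularity * granularity;
}
```

Running the selection from the budget-change notification, rather than every frame, keeps the cost off the critical path. Because Evict costs may be deferred, the projected usage here is an estimate and the next budget query remains the source of truth.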

Resources

Select resource formats and types carefully, based on the application’s requirements and performance. These recommendations are unlikely to be universal, as they depend on workloads and limiters. For example, preferring a D24 depth format may not matter for a tiny, infrequently used buffer, but it may be critical for an 8K shadow map. The following advice should be combined with using NVIDIA Nsight to diagnose performance issues and verify improvements.

  • Use 32-bit color formats (such as DXGI_FORMAT_R11G11B10_FLOAT) over 64-bit color formats (such as DXGI_FORMAT_R16G16B16A16_FLOAT) to reduce the bandwidth required.
  • Use D24 or D16 depth formats for optimal performance.
  • Use 16-bit indices where the number of vertices permits.
  • When using CopyTextureRegion, take care when copying depth stencil textures, as copying only the depth part of the resource may hit a slow path.
  • Constant buffer and structured buffer performance is similar on modern GPUs, but constant buffers should only be used when the contents of the buffer are uniformly accessed.
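
To make the format advice above concrete, here is a small portable C++ sketch of the arithmetic behind two of the tips: choosing 16-bit indices when the vertex count permits, and comparing the footprint of 32-bit and 64-bit color formats. IndexFormat, chooseIndexFormat, bytesPerIndex, and surfaceBytes are hypothetical names for illustration; a real application would use DXGI_FORMAT_R16_UINT and DXGI_FORMAT_R32_UINT. The strip-cut parameter assumes the cut value 0xFFFF is reserved when strip cut is enabled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical local enum standing in for the two DXGI index formats.
enum class IndexFormat { R16Uint, R32Uint };

// Size in bytes of one index in the given format.
constexpr std::uint32_t bytesPerIndex(IndexFormat format) {
    return format == IndexFormat::R16Uint ? 2u : 4u;
}

// 16-bit indices can address vertices 0..65535. If strip cut is enabled with
// the value 0xFFFF (an assumption of this sketch), that index is reserved,
// leaving room for one vertex fewer.
constexpr IndexFormat chooseIndexFormat(std::uint32_t vertexCount, bool stripCutEnabled) {
    return vertexCount <= (stripCutEnabled ? 65535u : 65536u)
               ? IndexFormat::R16Uint
               : IndexFormat::R32Uint;
}

// Uncompressed size in bytes of a single-mip surface. Real memory traffic also
// depends on hardware compression, so treat this as an upper-level estimate.
inline std::uint64_t surfaceBytes(std::uint32_t width, std::uint32_t height,
                                  std::uint32_t bitsPerPixel) {
    assert(bitsPerPixel % 8 == 0 && "sketch handles whole-byte formats only");
    return std::uint64_t{width} * height * (bitsPerPixel / 8);
}
```

For a 3840×2160 render target, surfaceBytes gives 33,177,600 bytes at 32 bits per pixel and twice that at 64 bits per pixel, which is the bandwidth difference the first tip refers to. Whether it matters in practice is workload-dependent and worth confirming in Nsight.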

Acknowledgments 

Thanks to Patrick Neil, Dhiraj Kumar, Ivan Fedorov, and Juha Sjoholm for their advice and assistance. 
