Unlocking GPU Intrinsics in HLSL

Updated from the original 2016 post to add information about new intrinsics and cross-vendor APIs in DirectX and Vulkan.

There are some useful intrinsic functions in the NVIDIA GPU instruction set that are not included in standard graphics APIs.

For example, a shader can use warp shuffle instructions to exchange data between threads in a warp without going through shared memory, which is especially valuable in pixel shaders where there is no shared memory. Or a shader can perform atomic additions on half-precision floating-point numbers in global memory.

The Reading Between The Threads: Shader Intrinsics post showed you how the intrinsic instructions worked. Now, I take you into the machinery to make them work in DirectX.

None of these intrinsics are available in standard DirectX or OpenGL. [2023: This is no longer true. More information is shared later in this post.] However, they have been supported and well documented in CUDA for years. A mechanism to support them in DirectX has been available for a while but has not been widely documented. I happen to have an old NVAPI version 343 on my system from October 2014, and the intrinsics are supported in DirectX by that version and probably by earlier versions. This post explains the mechanism for using them in DirectX.

Unlike OpenGL or Vulkan, DirectX unfortunately doesn’t have a native mechanism for vendor-specific extensions. However, there is still a way to make all this functionality available in DirectX 11 or 12 through custom intrinsics. That mechanism is implemented in the graphics driver and accessible through the NVAPI library.

Extending HLSL shaders

To use the intrinsics, they have to be encoded as special sequences of regular HLSL instructions that the driver can recognize and turn into the intended operations. These special sequences are provided in one of the header files that comes with the NVAPI SDK: nvHLSLExtns.h.

One important thing about these instruction sequences is that they have to pass through the HLSL compiler without optimizations because the compiler does not understand their true meaning and therefore could modify them beyond recognition, change their order, or even completely remove them.

To prevent the compiler from doing that, the sequences use atomic operations on a UAV buffer. The HLSL compiler cannot optimize away these instructions because it cannot prove that nothing depends on them, even though nothing does. The UAV buffer is effectively a placeholder: the final shader does not use it after the NVIDIA GPU driver has processed it. However, the application still has to reserve a UAV slot for it and tell the driver which slot that is.

For example, the NvShfl function that implements warp shuffle looks like the following code example, as defined in nvHLSLExtns.h:

int NvShfl(int val, uint srcLane, int width = NV_WARP_SIZE)
{
     uint index = g_NvidiaExt.IncrementCounter();
     g_NvidiaExt[index].src0u.x = val;      // variable to be shuffled
     g_NvidiaExt[index].src0u.y = srcLane;  // source lane
     g_NvidiaExt[index].src0u.z = __NvGetShflMaskFromWidth(width);
     g_NvidiaExt[index].opcode  = NV_EXTN_OP_SHFL;

     // The result is returned as the return value of IncrementCounter on the fake UAV slot.
     return g_NvidiaExt.IncrementCounter();
}

A shader that uses this function would look something like the following code example:

// Declare which UAV slot the driver should use to encode the instruction sequences.
// This is a pixel shader with one output, so u0 is taken by the render target: use u1.
#define NV_SHADER_EXTN_SLOT u1

// On DirectX12 and Shader Model 5.1, you can also define the register space for that UAV.
#define NV_SHADER_EXTN_REGISTER_SPACE space0

// Include the header - note that the UAV slot has to be declared before including it.
#include "nvHLSLExtns.h"

Texture2D tex : register(t0);
SamplerState samp : register(s0);

float4 main(in float2 texCoord : UV) : SV_Target
{
     float4 color = tex.Sample(samp, texCoord);

     // Use NvShfl to distribute the color from lane 0 to all other lanes in the warp.
     // NvShfl accepts and returns 32-bit integer data, so use asuint/asfloat to pass float values.
     color.r = asfloat(NvShfl(asuint(color.r), 0));
     color.g = asfloat(NvShfl(asuint(color.g), 0));
     color.b = asfloat(NvShfl(asuint(color.b), 0));
     color.a = asfloat(NvShfl(asuint(color.a), 0));

     return color;
}

This example may look like it’s doing something meaningless, and it is. Realistic use cases of the intrinsics in graphics applications are usually complicated. For example, warp shuffle can be used to optimize memory access in algorithms like light culling. Floating-point atomics are used in VXGI to accumulate emittance during voxelization. However, those applications require a significant amount of shader and host code to work. This example, on the other hand, can be plugged into virtually any pixel shader, and the effect is obvious.
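To make the effect of the shader above concrete, here is a minimal CPU-side sketch of what an indexed shuffle does across a warp. It is a model, not NVIDIA's implementation: it assumes that NvShfl follows the semantics documented for CUDA's __shfl (each width-sized segment of the warp is treated independently, and srcLane wraps modulo width), and that all 32 lanes are active. The names ShuffleModel and kWarpSize are made up for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr unsigned kWarpSize = 32; // assumption: same value as NV_WARP_SIZE

// CPU model of an indexed shuffle. Every lane receives the value held by
// srcLane within its own width-sized segment of the warp.
// Assumption: semantics match CUDA's __shfl; all lanes are active.
std::array<uint32_t, kWarpSize> ShuffleModel(
    const std::array<uint32_t, kWarpSize>& values,
    unsigned srcLane,
    unsigned width = kWarpSize)
{
    // width must be a power of two that is no larger than the warp size.
    assert(width != 0 && width <= kWarpSize && (width & (width - 1)) == 0);

    std::array<uint32_t, kWarpSize> result{};
    for (unsigned lane = 0; lane < kWarpSize; ++lane)
    {
        const unsigned segmentBase = lane & ~(width - 1); // first lane of this segment
        const unsigned source = segmentBase + (srcLane % width);
        result[lane] = values[source];
    }
    return result;
}
```

With srcLane set to 0 and the default width, every lane ends up with lane 0's value, which is what the pixel shader above does to each color channel. With a width of 8, each group of eight lanes receives the value from the chosen lane of its own group.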

When you compile this shader, each call to NvShfl is expanded into this sequence, give or take the register names:

imm_atomic_alloc r1.x, u1
mov r3.yz, l(0,0,31,0)
mov r3.x, r2.z
store_structured u1.xyz, r1.x, l(76), r3.xyzx
store_structured u1.x, r1.x, l(0), l(1)
imm_atomic_alloc r0.y, u1

And when this shader passes through the driver’s JIT compiler, each NvShfl function maps to just one GPU instruction:

SHFL.IDX        PT, R3, R3, RZ, 0x1f;
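The round trip described in this section, where a recognizable run of ordinary instructions is collapsed into one real instruction, can be pictured as a peephole rewrite. The following toy sketch is purely illustrative: the instruction kinds, the Instr structure, and CollapseExtensionSequences are all invented for this example, and the post does not describe how the driver's JIT compiler actually detects the pattern. The only detail taken from the listing above is that the value 1 is stored at offset 0 for a shuffle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, heavily simplified instruction kinds (illustration only).
enum class Kind { AtomicAlloc, StoreOperands, StoreOpcode, Shuffle, Other };

struct Instr
{
    Kind kind;
    uint32_t value; // opcode for StoreOpcode, unused otherwise
};

constexpr uint32_t kOpShfl = 1; // value stored at offset 0 in the listing above

// Toy peephole pass: replace each run of
//   alloc, store operands, store opcode (shuffle), alloc
// with a single Shuffle instruction and copy everything else unchanged.
std::vector<Instr> CollapseExtensionSequences(const std::vector<Instr>& in)
{
    std::vector<Instr> out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const bool isSequence =
            i + 3 < in.size() &&
            in[i].kind == Kind::AtomicAlloc &&
            in[i + 1].kind == Kind::StoreOperands &&
            in[i + 2].kind == Kind::StoreOpcode &&
            in[i + 2].value == kOpShfl &&
            in[i + 3].kind == Kind::AtomicAlloc;

        if (isSequence)
        {
            out.push_back({Kind::Shuffle, 0});
            i += 4;
        }
        else
        {
            out.push_back(in[i]);
            ++i;
        }
    }
    assert(out.size() <= in.size()); // the pass never adds instructions
    return out;
}
```

The sketch also shows why the sequences must reach the driver intact: if the HLSL compiler reordered or removed any of the four instructions, a matcher like this would no longer find the run.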

Creating extended shaders in DirectX 11

To actually use this shader, its runtime object has to be created in a special way. A regular call to ID3D11Device::CreatePixelShader does not suffice because the driver must know that the shader intends to use intrinsics. It also has to know which UAV slot is used.

If you’re working with DirectX 11, use the NvAPI_D3D11_SetNvShaderExtnSlot function before and after calling CreatePixelShader:

// Do this one time during app initialization.
NvAPI_Initialize();

ID3D11PixelShader* pShader = nullptr;
HRESULT D3DResult = E_FAIL;

// First, enable compilation of intrinsics. 
// The second parameter is the UAV slot index that is used in the shader: u1.
NvAPI_Status NvapiStatus = NvAPI_D3D11_SetNvShaderExtnSlot(pDevice, 1);
if(NvapiStatus == NVAPI_OK)
{
     // Then create the shader as usual...
     D3DResult = pDevice->CreatePixelShader(pBytecode, BytecodeLength, nullptr, &pShader);

     // And disable again by telling the driver to use an invalid UAV slot.
     NvAPI_D3D11_SetNvShaderExtnSlot(pDevice, ~0u);
}

if(FAILED(D3DResult))
{
     // ...Handle the error...
}

This method works with any shader that can reference a UAV. So, in DirectX 11.0 it works with pixel and compute shaders. In DirectX 11.1 and later, it should work with all kinds of shaders.
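Because the slot has to be set before shader creation and reset afterwards, it can be convenient to wrap the pair of calls in a scope guard so that the reset also happens on early returns. The following is one possible sketch, not something provided by NVAPI: ScopedShaderExtnSlot and kInvalidExtnSlot are hypothetical names, and the actual NVAPI call is passed in as a callable so that the example compiles without the SDK.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Same "invalid slot" value that the DirectX 11 example above passes to disable the extension.
constexpr unsigned kInvalidExtnSlot = ~0u;

// Hypothetical helper, not part of NVAPI. The callable is expected to wrap
// NvAPI_D3D11_SetNvShaderExtnSlot(pDevice, slot) and return true on NVAPI_OK.
class ScopedShaderExtnSlot
{
public:
    ScopedShaderExtnSlot(std::function<bool(unsigned)> setSlot, unsigned slot)
        : m_setSlot(std::move(setSlot))
    {
        assert(m_setSlot && slot != kInvalidExtnSlot);
        m_enabled = m_setSlot(slot);
    }

    // Reset the slot only if enabling it succeeded in the first place.
    ~ScopedShaderExtnSlot()
    {
        if (m_enabled)
            m_setSlot(kInvalidExtnSlot);
    }

    ScopedShaderExtnSlot(const ScopedShaderExtnSlot&) = delete;
    ScopedShaderExtnSlot& operator=(const ScopedShaderExtnSlot&) = delete;

    bool Enabled() const { return m_enabled; }

private:
    std::function<bool(unsigned)> m_setSlot;
    bool m_enabled = false;
};
```

In an application, the callable would be a lambda that captures pDevice and compares the result of NvAPI_D3D11_SetNvShaderExtnSlot with NVAPI_OK. The shader is then created inside the guard's scope only when Enabled() returns true.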

Creating extended pipeline state objects in DirectX 12

If you’re working with DirectX 12, there are no individual shader objects. Instead, complete pipeline states (PSOs) are created.

There are various other NVIDIA-specific pipeline state extensions that can be accessed through NVAPI, so to avoid a combinatorial explosion of functions that create PSOs with various sets of extensions, NVIDIA made just two functions, one for graphics and one for compute, that accept a list of extensions to use:

NvAPI_D3D12_CreateGraphicsPipelineState
NvAPI_D3D12_CreateComputePipelineState

The HLSL extension is described by the NVAPI_D3D12_PSO_SET_SHADER_EXTENSION_SLOT_DESC structure. There’s only one for the whole pipeline state though, so if two or more shaders in the pipeline use intrinsics, they must use the same UAV slot for it.

// Do this one time during app initialization.
NvAPI_Initialize();

// Fill the PSO description structure.
// Zero-initialize it first so that fields you do not set have defined values.
D3D12_GRAPHICS_PIPELINE_STATE_DESC PsoDesc = {};
PsoDesc.VS = { pVSBytecode, VSBytecodeLength };
// ...And so on, as usual...

// Also fill the extension structure. 
// Use the same UAV slot index and register space that are declared in the shader.
NVAPI_D3D12_PSO_SET_SHADER_EXTENSION_SLOT_DESC ExtensionDesc;       
ExtensionDesc.baseVersion = NV_PSO_EXTENSION_DESC_VER;
ExtensionDesc.psoExtension = NV_PSO_SET_SHADER_EXTNENSION_SLOT_AND_SPACE;
ExtensionDesc.version = NV_SET_SHADER_EXTENSION_SLOT_DESC_VER;
ExtensionDesc.uavSlot = 1;
ExtensionDesc.registerSpace = 0;

// Put the pointer to the extension into an array. Multiple extensions can be enabled at one time.
// Other supported extensions are:
//  - Extended rasterizer state
//  - Pass-through geometry shader, implicit or explicit
//  - Depth bound test
const NVAPI_D3D12_PSO_EXTENSION_DESC* pExtensions[] = { &ExtensionDesc };

// Now create the PSO.
ID3D12PipelineState* pPSO = nullptr;
NvAPI_Status NvapiStatus = NvAPI_D3D12_CreateGraphicsPipelineState(pDevice, &PsoDesc, ARRAYSIZE(pExtensions), pExtensions, &pPSO);

if(NvapiStatus != NVAPI_OK)
{
     // ...Handle the error...
}

Querying GPU feature support

Finally, before trying to use the intrinsics, you’ll probably want to know whether the device that the app’s working with actually supports those intrinsics. There are two NVAPI functions that can tell you just that:

NvAPI_D3D11_IsNvShaderExtnOpCodeSupported
NvAPI_D3D12_IsNvShaderExtnOpCodeSupported

The opCode parameter identifies the specific operation that you’re interested in. Operation codes are defined in the nvShaderExtnEnums.h file supplied with the NVAPI SDK. For example, to test whether a DirectX 11 device supports warp shuffle, use the following code:

#include "nvShaderExtnEnums.h"

bool bSupported = false;
NvAPI_Status NvapiStatus = NvAPI_D3D11_IsNvShaderExtnOpCodeSupported(pDevice, NV_EXTN_OP_SHFL, &bSupported);

if(NvapiStatus == NVAPI_OK && bSupported)
{
     // Yay, the device is no older than 2012!
}
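If an application relies on several intrinsics, it may be simpler to query them all once at startup and cache the answers. The sketch below shows one way to do that. ExtnOpcodeSupport is a hypothetical helper, not an NVAPI type, and the query is injected as a callable so that the example stays independent of the SDK. Treating a failed query as "unsupported" is a design choice made here, not a rule stated by NVAPI.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <vector>

// Hypothetical helper, not part of NVAPI: caches the result of one support
// query per opcode.
class ExtnOpcodeSupport
{
public:
    // The callable is expected to wrap NvAPI_D3D11_IsNvShaderExtnOpCodeSupported
    // (or the D3D12 variant) and return true when the call itself succeeded.
    using QueryFn = std::function<bool(unsigned opcode, bool* supported)>;

    ExtnOpcodeSupport(const QueryFn& query, const std::vector<unsigned>& opcodes)
    {
        assert(query);
        for (unsigned opcode : opcodes)
        {
            bool supported = false;
            const bool ok = query(opcode, &supported);
            // Design choice: a failed query is treated as "not supported".
            m_supported[opcode] = ok && supported;
        }
    }

    // Opcodes that were never queried are reported as unsupported.
    bool IsSupported(unsigned opcode) const
    {
        const auto it = m_supported.find(opcode);
        return it != m_supported.end() && it->second;
    }

private:
    std::map<unsigned, bool> m_supported;
};
```

The application can then pick a shader permutation by calling IsSupported with values such as NV_EXTN_OP_SHFL, without calling into NVAPI again.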

Update 2023: New intrinsics and cross-vendor APIs

The intrinsics supported by NVIDIA GPUs are not limited to warp shuffle. In fact, warp shuffle and related functions are now available through cross-vendor intrinsics in both DirectX 12 and Vulkan, and there is no need to use NVAPI for them. For more information about DirectX 12 wave intrinsics, see Wave Intrinsics. For more information about Vulkan subgroup operations, see the Vulkan subgroup tutorial.

The complete list of intrinsics supported by NVIDIA GPUs can be found in the NVAPI header file called nvHLSLExtns.h, which is now available on GitHub. The functions declared in this file can be subdivided into a few general categories:

Older warp operations: shuffle, vote, ballot, lane index (NvShfl*, NvAny, NvAll, NvBallot, NvGetLaneId)
Newer warp operations: wave match (NvWaveMatch). NvWaveMatch returns a mask of active lanes in the warp that passed the same parameter value as the current lane.
Special register access (NvGetSpecial)
Extended atomic operations on FP16, FP32, and Uint64 variables (NvInterlocked*)
Variable rate shading (NvGetShadingRate, NvEvaluateAttribute*)
Texture footprint evaluation (NvFootprint*)
WaveMultiPrefix functions (NvWaveMultiPrefix*). These functions are just algorithms built on top of other intrinsics.
Ray tracing micromap extensions (NvRtMicroTriangle*, NvRtMicroVertex*)
Ray tracing shader execution reordering (NvHitObject, NvReorderThread)
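As an illustration of the wave match description in the list above, here is a small CPU-side model of the result each lane would see. It is based only on that one-sentence description; WaveMatchModel and kWarpSize are names invented for this sketch, and details such as the handling of inactive lanes are assumptions.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr unsigned kWarpSize = 32; // assumption: same value as NV_WARP_SIZE

// CPU model of a wave match: for every active lane, build a bit mask of the
// active lanes that passed the same value. Inactive lanes get a mask of 0
// (an assumption made for this model).
std::array<uint32_t, kWarpSize> WaveMatchModel(
    const std::array<uint32_t, kWarpSize>& values,
    uint32_t activeMask)
{
    std::array<uint32_t, kWarpSize> masks{};
    for (unsigned lane = 0; lane < kWarpSize; ++lane)
    {
        if (((activeMask >> lane) & 1u) == 0)
            continue;

        for (unsigned other = 0; other < kWarpSize; ++other)
        {
            const bool otherActive = ((activeMask >> other) & 1u) != 0;
            if (otherActive && values[other] == values[lane])
                masks[lane] |= (1u << other);
        }

        // An active lane always matches itself.
        assert(((masks[lane] >> lane) & 1u) != 0);
    }
    return masks;
}
```

For example, if even lanes pass 0 and odd lanes pass 1, every even lane sees a mask with the even bits set and every odd lane sees a mask with the odd bits set. This partitions the warp into groups that can then be processed together.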

Update: Compiling shaders with the correct options

Currently, there is a known issue in the NVIDIA GPU drivers that affects HLSL intrinsics. Specifically, the intrinsics do NOT work properly if the shader is compiled with the D3DCOMPILE_SKIP_OPTIMIZATION flag, or the /Od command line option passed to FXC. If you see that the intrinsics have no effect, please make sure that this flag is not specified.
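One way to guard against this issue is to check the compile flags before building any shader that uses intrinsics. The sketch below is a hypothetical helper, not part of NVAPI or the D3D compiler API. It redefines the flag value locally (1 << 2, the value of D3DCOMPILE_SKIP_OPTIMIZATION in d3dcompiler.h) only so that the example compiles without the Windows SDK. Real code should use the SDK constant.

```cpp
#include <cassert>
#include <cstdint>

// Same value as D3DCOMPILE_SKIP_OPTIMIZATION in d3dcompiler.h, redefined here
// only to keep the sketch self-contained.
constexpr uint32_t kSkipOptimization = 1u << 2;

// Returns true if the flags would disable optimization (the /Od case).
inline bool SkipsOptimization(uint32_t compileFlags)
{
    return (compileFlags & kSkipOptimization) != 0;
}

// Returns the same flags with the problematic bit cleared.
inline uint32_t FlagsForExtendedShader(uint32_t compileFlags)
{
    return compileFlags & ~kSkipOptimization;
}

// Debug-build check to call right before compiling a shader with intrinsics.
inline void CheckExtendedShaderFlags(uint32_t compileFlags)
{
    assert(!SkipsOptimization(compileFlags) && "intrinsics need an optimized compile");
}
```

A build system that normally compiles debug shaders without optimization could pass its flags through FlagsForExtendedShader for only those shaders that include nvHLSLExtns.h.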

Conclusion

For more information about NVAPI functions and structures, see the comments in NVAPI header files. For more use cases and examples of intrinsics, see the following resources: