Introduction

There are some very useful "intrinsic" functions in the NVIDIA GPU instruction set that are not included in standard graphics APIs. For example, a shader can use warp shuffle instructions to exchange data between threads in a warp without going through shared memory, which is especially valuable in pixel shaders where shared memory is not available. Or a shader can perform atomic additions on half-precision floating-point numbers in global memory. We showed how these intrinsics work in Mathias' API-neutral blog post; now we take you into the machinery that makes them work in DirectX.

None of these intrinsics are available in standard DirectX or OpenGL, but they have been supported and well documented in CUDA for years. A mechanism to support them in DirectX has also been available for a while, just not widely documented. I happen to have an old copy of NVAPI version 343 from October 2014 on my system, and the intrinsics are supported in DirectX by that version, and probably by earlier versions too. This blog post explains the mechanism for using them in DirectX.

Unlike OpenGL or Vulkan, DirectX unfortunately doesn't have a native mechanism for vendor-specific extensions. But there is still a way to make all this functionality available in DirectX 11 or 12 through custom intrinsics. That mechanism is implemented in our graphics driver and accessible through the NVAPI library.

Extending HLSL Shaders

In order to use the intrinsics, they have to be encoded as special sequences of regular HLSL instructions that the driver can recognize and turn into the intended operations. These special sequences are provided in one of the header files that comes with the NVAPI SDK: nvHLSLExtns.h.

One important thing about these instruction sequences is that they have to pass through the HLSL compiler unmodified, because the compiler does not understand their true meaning and could therefore optimize them beyond recognition, reorder them, or even remove them completely. To prevent that, the sequences use atomic operations on a UAV buffer: the HLSL compiler cannot optimize away these instructions because it is unaware of possible dependencies (even though there are none). That UAV buffer is basically a fake, and it will not be used by the actual shader once it has passed through the NVIDIA GPU driver. But applications still have to allocate a UAV slot for it and tell the driver which slot it is.
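To make the mechanism more concrete, this is roughly how the header declares the fake UAV. The sketch below is simplified; the real NvShaderExtnStruct in nvHLSLExtns.h contains more operand and result fields.

	// Simplified sketch of the declarations in nvHLSLExtns.h - not the complete struct.
	struct NvShaderExtnStruct
	{
	    uint  opcode;   // which intrinsic operation to perform
	    uint4 src0u;    // operands of the operation
	    // ... more operand and result fields in the actual header ...
	};

	// The fake UAV that carries the instruction sequences.
	// NV_SHADER_EXTN_SLOT must be #defined before including the header.
	RWStructuredBuffer<NvShaderExtnStruct> g_NvidiaExt : register(NV_SHADER_EXTN_SLOT);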

For example, the NvShfl function that implements warp shuffle looks like this, as defined in nvHLSLExtns.h:

	int NvShfl(int val, uint srcLane, int width = NV_WARP_SIZE)
	{
	    uint index = g_NvidiaExt.IncrementCounter();
	    g_NvidiaExt[index].src0u.x  =  val;                             // variable to be shuffled
	    g_NvidiaExt[index].src0u.y  =  srcLane;                         // source lane
	    g_NvidiaExt[index].src0u.z  =  __NvGetShflMaskFromWidth(width);
	    g_NvidiaExt[index].opcode   =  NV_EXTN_OP_SHFL;
	    
	    // result is returned as the return value of IncrementCounter on fake UAV slot
	    return g_NvidiaExt.IncrementCounter();
	}

A shader that uses this function would look something like this:

	// Declare that the driver should use UAV 0 to encode the instruction sequences.
	// It's a pixel shader with one output, so u0 is taken by the render target - use u1.
	#define NV_SHADER_EXTN_SLOT u1

	// On DirectX 12 with Shader Model 5.1, you can also define the register space for that UAV.
	#define NV_SHADER_EXTN_REGISTER_SPACE space0

	// Include the header - note that the UAV slot has to be declared before including it.
	#include "nvHLSLExtns.h"

	Texture2D tex : register(t0);
	SamplerState samp : register(s0);

	float4 main(in float2 texCoord : UV) : SV_Target
	{
		float4 color = tex.Sample(samp, texCoord);

		// Use NvShfl to distribute the color from lane 0 to all other lanes in the warp.
		// NvShfl operates on integer data, so use asuint/asfloat to pass float values through it.
		color.r = asfloat(NvShfl(asuint(color.r), 0));
		color.g = asfloat(NvShfl(asuint(color.g), 0));
		color.b = asfloat(NvShfl(asuint(color.b), 0));
		color.a = asfloat(NvShfl(asuint(color.a), 0));

		return color;
	}

This example may look like it's doing something meaningless, and it is. Realistic use cases of the intrinsics in graphics applications are usually complicated. For example, warp shuffle can be used to optimize memory access in algorithms like light culling. Floating-point atomics are used in VXGI to accumulate emittance during voxelization. But those applications require a significant amount of shader and host code to work. This example, on the other hand, can be plugged into virtually any pixel shader, and the effect will be obvious.
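Still, to give a flavor of a more realistic pattern without all that host code, here is a minimal sketch of a warp-wide sum built on NvShflXor, the shuffle-XOR variant that nvHLSLExtns.h also provides. It follows the classic butterfly reduction pattern known from CUDA and assumes all 32 lanes of the warp are active:

	// Minimal sketch: butterfly reduction leaving the warp-wide sum in every lane.
	// Assumes a full warp of 32 active lanes; inactive lanes would need extra care.
	float WarpSum(float value)
	{
	    for (uint offset = 16; offset > 0; offset >>= 1)
	    {
	        // Exchange values with the lane whose index differs by 'offset'.
	        value += asfloat(NvShflXor(asuint(value), offset));
	    }
	    return value;
	}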

When you compile the example pixel shader above, each call to NvShfl will be expanded into this sequence, give or take the register names:

	imm_atomic_alloc r1.x, u1
	mov r3.yz, l(0,0,31,0)
	mov r3.x, r2.z
	store_structured u1.xyz, r1.x, l(76), r3.xyzx
	store_structured u1.x, r1.x, l(0), l(1)
	imm_atomic_alloc r0.y, u1

And when this shader passes through the driver's JIT compiler, each NvShfl maps to just one GPU instruction:

	SHFL.IDX        PT, R3, R3, RZ, 0x1f;

Creating Extended Shaders in DirectX 11

In order to actually use this shader, its runtime object has to be created in a special way: a regular call to ID3D11Device::CreatePixelShader will not suffice, because the driver needs to know that the shader intends to use intrinsics, and which UAV slot it uses for them. If you're working with DirectX 11, call the NvAPI_D3D11_SetNvShaderExtnSlot function before creating the shader to enable the extension, and again afterwards to disable it:

	// Do this once during app initialization.
	NvAPI_Initialize();

	ID3D11PixelShader* pShader = nullptr;
	HRESULT D3DResult = E_FAIL;

	// First, enable compilation of intrinsics. 
	// The second parameter is the UAV slot index that is used in the shader: u1.
	NvAPI_Status NvapiStatus = NvAPI_D3D11_SetNvShaderExtnSlot(pDevice, 1);
	if(NvapiStatus == NVAPI_OK)
	{
		// Then create the shader as usual...
		D3DResult = pDevice->CreatePixelShader(pBytecode, BytecodeLength, nullptr, &pShader);

		// And disable again by telling the driver to use an invalid UAV slot.
		NvAPI_D3D11_SetNvShaderExtnSlot(pDevice, ~0u);
	}

	if(FAILED(D3DResult))
	{
		// ...Handle the error...
	}

This method works with any shader that can reference a UAV. So, in DirectX 11.0 it works with pixel and compute shaders; in DirectX 11.1 and above it should work with all kinds of shaders.
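For example, creating an extended compute shader follows exactly the same pattern; a minimal sketch, assuming pCSBytecode is bytecode that also declares NV_SHADER_EXTN_SLOT as u1:

	// Minimal sketch: creating a compute shader that uses the intrinsics.
	ID3D11ComputeShader* pComputeShader = nullptr;

	if(NvAPI_D3D11_SetNvShaderExtnSlot(pDevice, 1) == NVAPI_OK)
	{
		HRESULT hr = pDevice->CreateComputeShader(pCSBytecode, CSBytecodeLength, nullptr, &pComputeShader);

		// Disable the extension again so that regular shader creation is not affected.
		NvAPI_D3D11_SetNvShaderExtnSlot(pDevice, ~0u);
	}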

Creating Extended Pipeline State Objects in DirectX 12

If you're working with DirectX 12, there are no individual shader objects; complete pipeline state objects (PSOs) are created instead. There are various other NVIDIA-specific pipeline state extensions accessible through NVAPI, so to avoid a combinatorial explosion of functions that create PSOs with various sets of extensions, we made just two functions, one for graphics and one for compute, that accept a list of extensions to use: NvAPI_D3D12_CreateGraphicsPipelineState and NvAPI_D3D12_CreateComputePipelineState. The HLSL extension is described by the NVAPI_D3D12_PSO_SET_SHADER_EXTENSION_SLOT_DESC structure. Note that there is only one such structure for the whole pipeline state, so if two or more shaders in the pipeline use intrinsics, they have to use the same UAV slot for it.

	// Do this once during app initialization.
	NvAPI_Initialize();

	// Fill the PSO description structure.
	D3D12_GRAPHICS_PIPELINE_STATE_DESC PsoDesc = {};
	PsoDesc.VS = { pVSBytecode, VSBytecodeLength };
	// ...And so on, as usual...

	// Also fill the extension structure.
	// Use the same UAV slot index and register space that are declared in the shader.
	NVAPI_D3D12_PSO_SET_SHADER_EXTENSION_SLOT_DESC ExtensionDesc;
	ExtensionDesc.baseVersion = NV_PSO_EXTENSION_DESC_VER;
	ExtensionDesc.psoExtension = NV_PSO_SET_SHADER_EXTNENSION_SLOT_AND_SPACE; // "EXTNENSION" is how the identifier is spelled in the NVAPI header
	ExtensionDesc.version = NV_SET_SHADER_EXTENSION_SLOT_DESC_VER;
	ExtensionDesc.uavSlot = 1;
	ExtensionDesc.registerSpace = 0;

	// Put the pointer to the extension into an array - there can be multiple extensions enabled at once.
	// Other supported extensions are:
	//  - Extended rasterizer state
	//  - Pass-through geometry shader, implicit or explicit
	//  - Depth bound test
	const NVAPI_D3D12_PSO_EXTENSION_DESC* pExtensions[] = { &ExtensionDesc };

	// Now create the PSO.
	ID3D12PipelineState* pPSO = nullptr;
	NvAPI_Status NvapiStatus = NvAPI_D3D12_CreateGraphicsPipelineState(pDevice, &PsoDesc, ARRAYSIZE(pExtensions), pExtensions, &pPSO);

	if(NvapiStatus != NVAPI_OK)
	{
		// ...Handle the error...
	}
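The compute path is analogous; here is a minimal sketch that reuses the pExtensions array from above and assumes the compute shader declares the same extension slot (pRootSignature and pCSBytecode are placeholders):

	// Minimal sketch: creating a compute PSO with the same HLSL extension enabled.
	D3D12_COMPUTE_PIPELINE_STATE_DESC ComputePsoDesc = {};
	ComputePsoDesc.pRootSignature = pRootSignature;
	ComputePsoDesc.CS = { pCSBytecode, CSBytecodeLength };

	ID3D12PipelineState* pComputePSO = nullptr;
	NvAPI_Status Status = NvAPI_D3D12_CreateComputePipelineState(pDevice, &ComputePsoDesc, ARRAYSIZE(pExtensions), pExtensions, &pComputePSO);

	if(Status != NVAPI_OK)
	{
		// ...Handle the error...
	}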

Querying GPU Feature Support

Finally, before trying to use the intrinsics, you'll probably want to know whether the device the app is working with actually supports them. There are two NVAPI functions that can tell you just that: NvAPI_D3D11_IsNvShaderExtnOpCodeSupported and NvAPI_D3D12_IsNvShaderExtnOpCodeSupported. Their opCode parameter identifies the specific operation that you're interested in; the operation codes are defined in the nvShaderExtnEnums.h file supplied with the NVAPI SDK. For example, in order to test whether a DirectX 11 device supports warp shuffle, do this:

	#include "nvShaderExtnEnums.h"

	bool bSupported = false;
	NvAPI_Status NvapiStatus = NvAPI_D3D11_IsNvShaderExtnOpCodeSupported(pDevice, NV_EXTN_OP_SHFL, &bSupported);

	if(NvapiStatus == NVAPI_OK && bSupported)
	{
		// Yay, the device is no more than 4 years old!
	}

The intrinsics supported by NVIDIA GPUs are not limited to warp shuffle and ballot. Other supported operations include 32-bit and 16-bit floating-point atomics. Regular DirectX 11/12 only supports 32-bit integer atomics, accessible through InterlockedAdd and similar functions. With NVAPI, you can also use the following functions (a usage sketch follows the list):

	NvInterlockedAddFp32 (Kepler and newer)

	NvInterlockedAddFp16x2 (Maxwell-2 and newer)
	NvInterlockedMinFp16x2
	NvInterlockedMaxFp16x2

	NvInterlockedAddFp16x4 (Maxwell-2 and newer)
	NvInterlockedMinFp16x4
	NvInterlockedMaxFp16x4
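As an illustration, here is a minimal sketch of accumulating a pair of half-precision values with NvInterlockedAddFp16x2; the buffer name and addressing are hypothetical, and the packing uses the standard f32tof16 HLSL intrinsic:

	RWByteAddressBuffer g_Accumulator : register(u2);  // hypothetical accumulation buffer

	void AccumulateValue(uint elementIndex, float2 value)
	{
	    // Pack two floats into one uint holding two fp16 values in its low and high halves.
	    uint fp16x2 = f32tof16(value.x) | (f32tof16(value.y) << 16);

	    // Atomically add both halves with a single intrinsic; each element is 4 bytes wide.
	    NvInterlockedAddFp16x2(g_Accumulator, elementIndex * 4, fp16x2);
	}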

For additional documentation on all of the NVAPI functions and structures mentioned here, please refer to the comments in the NVAPI header files. More use cases and examples of intrinsics can be found in Mathias' blog post and in the CUDA/Parallel Forall references below.

Additional References


Update: Compiling Shaders with Correct Options

Currently, there is a known issue in the NVIDIA GPU drivers that affects the HLSL intrinsics: they will NOT work properly if the shader is compiled with the D3DCOMPILE_SKIP_OPTIMIZATION flag (or the /Od command-line option passed to FXC). If you see that the intrinsics have no effect, please make sure that this flag is not specified.
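For example, a debug-friendly compilation call that keeps the intrinsics working might look like the sketch below; D3DCompileFromFile comes from d3dcompiler.h, and the file name, entry point, and target profile are placeholders:

	#include <d3dcompiler.h>

	ID3DBlob* pBytecodeBlob = nullptr;
	ID3DBlob* pErrorBlob = nullptr;

	// D3DCOMPILE_DEBUG keeps debug information but does not disable optimizations.
	// Do NOT add D3DCOMPILE_SKIP_OPTIMIZATION (/Od): it breaks the driver's
	// recognition of the intrinsic instruction sequences.
	UINT Flags = D3DCOMPILE_DEBUG;

	HRESULT hr = D3DCompileFromFile(L"shader.hlsl", nullptr, D3D_COMPILE_STANDARD_FILE_INCLUDE,
		"main", "ps_5_0", Flags, 0, &pBytecodeBlob, &pErrorBlob);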