Intrinsics can be thought of as higher-level abstractions of specific hardware instructions. They offer direct access to low-level operations or hardware-specific features, enabling increased performance. Wave intrinsics in particular let operations be performed across the threads within a warp, also known as a wavefront.
Recommended
- Wave intrinsics can noticeably speed up your shaders.
- Many sorting or reduction algorithms can use much less or no shared memory with fewer memory barriers, providing a noticeable performance boost.
- Different types of shuffles and ballots can be useful.
- Use wave instructions with GroupSize or WorkGroup values larger than the warp or subgroup size (32 threads). Fewer memory barriers and shared memory accesses are needed.
  - For more information, see Reading Between The Threads: Shader Intrinsics and Unlocking GPU Intrinsics in HLSL.
- Use GroupSize and WorkGroup values that are a multiple of the warp size (32 * N); 64 is usually a sweet spot.
  - With intrinsics, a GroupSize and WorkGroup size equal to 32 could be a better choice to avoid shared memory usage.
- Use native HLSL code when vendor-specific extensions are not applicable or are hard to implement.
- Some instructions can be implemented with recent shader model versions.
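The following example implements a shuffle-XOR with Shader Model 6 (SM6) wave intrinsics:
float4 NvShflXor(float4 input, uint LaneMask)
{
    // Read input from the lane whose index is this lane's index XOR LaneMask.
    float4 output = WaveReadLaneAt(input, WaveGetLaneIndex() ^ LaneMask);
    return output;
}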
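The recommendation about reductions can be illustrated with a minimal sketch. It assumes Shader Model 6.0 or later, and it is not taken from the original article: the buffer names (gInput, gOutput), the register slots, the entry point name CSMain, and the use of unsigned integer data are hypothetical choices made for illustration. The sketch also assumes that gOutput[0] is cleared to zero before the dispatch and that the dispatch covers exactly the number of input elements.

```hlsl
// Sketch only: names, register slots, and data types are assumptions.
// Requires Shader Model 6.0 or later (for example, target cs_6_0).

StructuredBuffer<uint>   gInput  : register(t0); // hypothetical input values
RWStructuredBuffer<uint> gOutput : register(u0); // hypothetical; gOutput[0] assumed cleared to 0

[numthreads(64, 1, 1)] // a multiple of the 32-thread warp size
void CSMain(uint3 dispatchThreadId : SV_DispatchThreadID)
{
    // Each thread loads one value.
    uint value = gInput[dispatchThreadId.x];

    // Sum the value across all active lanes of this wave.
    // No groupshared memory and no group barrier is involved.
    uint waveSum = WaveActiveSum(value);

    // One lane per wave adds the wave's partial sum to the global total.
    if (WaveIsFirstLane())
    {
        InterlockedAdd(gOutput[0], waveSum);
    }
}
```

Compared with a classic groupshared tree reduction, this version issues one atomic per wave rather than several barrier-separated passes over shared memory. Whether it is actually faster depends on the hardware and on contention for the output location, so it is worth profiling both variants.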