Intrinsics can be thought of as higher-level abstractions of specific hardware instructions. They offer direct access to low-level operations or hardware-specific features, enabling increased performance. Wave intrinsics in particular let operations be performed across the threads within a warp, also known as a wavefront.
Recommended
- Wave intrinsics can noticeably speed up your shaders.
- Many sorting or reduction algorithms can use much less or no shared memory with fewer memory barriers, providing a noticeable performance boost.
 - Different types of shuffles and ballots can be useful.
 - Use wave instructions with GroupSize or WorkGroup values larger than the warp or subgroup size (32 threads). Fewer memory barriers and shared memory accesses are needed.
  - For more information, see Reading Between The Threads: Shader Intrinsics and Unlocking GPU Intrinsics in HLSL.
 - Use GroupSize and WorkGroup as a multiple of the warp size (32 * N); 64 is usually a sweet spot.
  - When using intrinsics, a GroupSize or WorkGroup size equal to the warp size (32) could be a better choice to avoid shared memory usage.
 - Use native HLSL code when vendor-specific extensions are not applicable or are hard to implement.
- Some instructions can be implemented with recent shader model versions.
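
As a rough illustration of the reduction guidance above, the sketch below sums one value per thread across a 64-thread group. Each wave first reduces its own values with WaveActiveSum, so shared memory holds only one partial sum per wave and a single barrier is needed. The resource names (gInput, gGroupSums), the register bindings, and the entry point name are hypothetical, and the code assumes that the lanes of a wave cover consecutive SV_GroupIndex values, which is typical for 1D groups but not guaranteed.

```hlsl
// Hypothetical resource names and register bindings, for illustration only.
StructuredBuffer<float>   gInput     : register(t0);
RWStructuredBuffer<float> gGroupSums : register(u0);

#define GROUP_SIZE 64                 // 32 * N; 64 per the guidance above
#define MAX_WAVES  (GROUP_SIZE / 4)   // SM6 allows waves as small as 4 lanes

// One slot per wave rather than one slot per thread.
groupshared float gsWaveSums[MAX_WAVES];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 dispatchId : SV_DispatchThreadID,
            uint3 groupId    : SV_GroupID,
            uint  groupIndex : SV_GroupIndex)
{
    float value = gInput[dispatchId.x];

    // Step 1: reduce inside each wave with no shared memory and no barrier.
    float waveSum = WaveActiveSum(value);

    // Assumption: lanes of a wave map to consecutive SV_GroupIndex values.
    uint laneCount = WaveGetLaneCount();
    uint waveIndex = groupIndex / laneCount;

    if (WaveIsFirstLane())
    {
        gsWaveSums[waveIndex] = waveSum;
    }

    // Step 2: a single barrier, then combine the per-wave partial sums.
    GroupMemoryBarrierWithGroupSync();

    if (groupIndex == 0)
    {
        uint  waveCount = (GROUP_SIZE + laneCount - 1) / laneCount;
        float total = 0.0f;
        for (uint w = 0; w < waveCount; ++w)
        {
            total += gsWaveSums[w];
        }
        gGroupSums[groupId.x] = total;
    }
}
```

A classic shared-memory tree reduction for the same group size would keep one slot per thread and synchronize after every halving step. With a group size equal to the wave size, the shared array and the barrier could be dropped entirely.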
 
 
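The following example implements an XOR shuffle with SM6 wave intrinsics:
// Each lane reads the value held by the lane whose index is (own index ^ LaneMask).
float4 NvShflXor(float4 input, uint LaneMask)
{
    float4 output = WaveReadLaneAt(input, WaveGetLaneIndex() ^ LaneMask);
    return output;
}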
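
To show where an XOR shuffle is useful, here is a minimal sketch of a butterfly sum built on it. The helper names ShuffleXor and WaveSumButterfly are hypothetical, and the sketch assumes that every lane in the wave is active, because reading from an inactive lane gives undefined results.

```hlsl
// Hypothetical scalar helper: XOR shuffle built from SM6 wave intrinsics.
float ShuffleXor(float value, uint laneMask)
{
    return WaveReadLaneAt(value, WaveGetLaneIndex() ^ laneMask);
}

// Butterfly sum: each step adds the value of the partner lane whose index
// differs in one bit, so after log2(lane count) steps every lane holds the
// sum over the whole wave.
// Assumes all lanes are active; SM6 lane counts are powers of two.
float WaveSumButterfly(float value)
{
    for (uint mask = WaveGetLaneCount() / 2; mask > 0; mask >>= 1)
    {
        value += ShuffleXor(value, mask);
    }
    return value;
}
```

For a plain sum, the built-in WaveActiveSum is the simpler choice. The explicit butterfly is mainly of interest when the combining step is something the built-in reductions do not cover.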