Advanced API Performance: Intrinsics

Intrinsics can be thought of as higher-level abstractions of specific hardware instructions. They offer direct access to low-level operations or hardware-specific features, enabling increased performance. Wave intrinsics in particular let operations be performed across the threads within a warp, also known as a wavefront or subgroup.

Wave intrinsics can noticeably speed up your shaders.
- Many sorting or reduction algorithms can use much less or no shared memory with fewer memory barriers, providing a noticeable performance boost.
- Different types of shuffles (exchanging values between lanes) and ballots (collecting a per-lane predicate into a bitmask) can be useful.
- Use wave instructions even when the GroupSize or WorkGroup size is larger than the warp or subgroup size (32 threads). Fewer memory barriers and shared memory accesses are needed.
- For more information, see "Reading Between The Threads: Shader Intrinsics" and "Unlocking GPU Intrinsics in HLSL".
Use a GroupSize or WorkGroup size that is a multiple of the warp size (32 * N); 64 is usually a sweet spot.
- When using intrinsics, a GroupSize or WorkGroup size equal to 32 could be a better choice, because it avoids shared memory usage.
Use native HLSL code when vendor-specific extensions are not applicable or are hard to implement.
- Some vendor-specific instructions can be implemented with the wave intrinsics available in recent shader model versions (SM 6.0 and later).
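
As a rough illustration of the reduction and group-size recommendations above, the following sketch sums one value per thread for a 64-thread group. It is not taken from the original article: the resource names (`gInput`, `gGroupSums`), the entry point `CSMain`, and the register slots are hypothetical, and it assumes 32-lane waves that map linearly onto `SV_GroupIndex`, as well as an input buffer holding at least one element per dispatched thread.

```hlsl
// Minimal sketch (SM 6.0+). All names below are illustrative assumptions.
#define GROUP_SIZE        64                               // 32 * N, with N = 2
#define ASSUMED_WAVE_SIZE 32                               // assumption: 32-lane waves
#define WAVES_PER_GROUP   (GROUP_SIZE / ASSUMED_WAVE_SIZE)

StructuredBuffer<float>   gInput     : register(t0);      // hypothetical input
RWStructuredBuffer<float> gGroupSums : register(u0);      // hypothetical output, one sum per group

// Only one float per wave lives in shared memory, not one per thread.
groupshared float sWaveSums[WAVES_PER_GROUP];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 dispatchId : SV_DispatchThreadID,
            uint3 groupId    : SV_GroupID,
            uint  groupIndex : SV_GroupIndex)
{
    float value = gInput[dispatchId.x];

    // Sum across the wave in registers: no shared memory, no barrier.
    float waveSum = WaveActiveSum(value);

    // Assumption: lanes are assigned to waves in SV_GroupIndex order.
    uint waveIndex = groupIndex / WaveGetLaneCount();

    // One lane per wave publishes the partial sum.
    if (WaveIsFirstLane())
    {
        sWaveSums[waveIndex] = waveSum;
    }

    // A single barrier for the whole group.
    GroupMemoryBarrierWithGroupSync();

    // One thread combines the per-wave partial sums.
    if (groupIndex == 0)
    {
        float groupSum = 0.0f;
        for (uint i = 0; i < WAVES_PER_GROUP; ++i)
        {
            groupSum += sWaveSums[i];
        }
        gGroupSums[groupId.x] = groupSum;
    }
}
```

With `GROUP_SIZE` set to 32 under the same assumptions, each group is a single wave, so `WaveActiveSum` already yields the group total and the shared array and barrier could be dropped entirely.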

The following example uses SM6 wave intrinsics to implement an XOR shuffle in native HLSL:

float4 NvShflXor(float4 input, uint LaneMask)
{
    // Read the value held by the lane whose index is this lane's index XOR LaneMask.
    float4 output = WaveReadLaneAt(input, WaveGetLaneIndex() ^ LaneMask);
    return output;
}
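
One possible use of such an XOR shuffle is a butterfly reduction, sketched below. The helper name `WaveButterflySum` is hypothetical and not part of the original article. The sketch assumes that every lane in the wave is active and, as the example above does, that `WaveReadLaneAt` accepts a per-lane index on the target hardware. `NvShflXor` is repeated so that the snippet compiles on its own.

```hlsl
// Repeated from the example above so this sketch is self-contained (SM 6.0+).
float4 NvShflXor(float4 input, uint LaneMask)
{
    return WaveReadLaneAt(input, WaveGetLaneIndex() ^ LaneMask);
}

// Hypothetical helper: butterfly (XOR) reduction across the wave.
// Assumes all lanes are active; reading an inactive lane is undefined.
float4 WaveButterflySum(float4 value)
{
    // Halve the exchange distance each step: laneCount/2, ..., 2, 1.
    for (uint mask = WaveGetLaneCount() / 2; mask > 0; mask >>= 1)
    {
        // Each lane adds the partial sum held by its XOR partner.
        value += NvShflXor(value, mask);
    }

    // After log2(laneCount) steps, every lane holds the full wave sum.
    return value;
}
```

In practice `WaveActiveSum` already provides this result directly; the loop is shown only to illustrate how a shuffle-based pattern written against a vendor extension might be expressed with native wave intrinsics.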
