Using Mesh Shaders for Professional Graphics

Mesh shaders were introduced with the Turing architecture and are shipping with Ampere as well. In this post, I take a detailed look at experiences with mesh shaders on these hardware architectures so far. The results come primarily from CAD and DCC viewport or VR-centric contexts. However, some of them may apply to games as well, as games continue to increase in geometric complexity. Games also have passes where geometric complexity can dominate, such as visibility buffer, depth pre-pass, or shadow-map passes.

For more general information, see Introduction to Turing Mesh Shaders or watch SIGGRAPH 2018: Turing – Mesh Shaders.

Keep in mind that mesh shaders are deliberately designed to expand existing capabilities and allow you to optimize certain use cases. They are not meant to replace the existing geometry pipeline completely. Due to the existing optimizations in the traditional VTG (vertex-, tessellation-, and geometry-shader) pipeline, do not expect mesh shader pipelines to always provide substantial wins in performance.

Meshlet data structure

At the core of mesh shading are meshlets, a data structure that represents a small mesh with a predefined upper limit of V vertices and P primitives. They are represented using a local primitive index list, combined with a set of unique vertices. At a minimum, the unique vertices could be regular vertex indices alone. There are no restrictions on connectivity.

Similar representations are already in use in some games when culling geometry using compute shaders.

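struct ExampleMeshlet
{
  uint  numPrimitives;       // at most P
  uint  numVertices;         // at most V
  uvecX primitiveIndices[P]; // into local vertex array
#if TYPICAL_INPUT
  uint  vertexIndices[V];    // indices into the regular vertex buffer
#else
  Vertex vertices[V];        // unique vertex data stored inline
#endif
};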

It is often beneficial to have a higher primitive count than vertex count due to vertex sharing in triangle meshes. Ideally, you should maximize the packing utilization, just as index buffers can be optimized for the vertex cache. Further strategies include using the minimal number of bits required for indices, as well as storing other values as deltas relative to a reference value or in locally quantized form.

For cluster culling, you may want to compute additional data in separate buffers for cache efficiency, such as bounding boxes, bounding spheres, and normal cones. In the following examples, I used a 128-bit header encoding this information and then packed the indices arrays separately.

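struct ExampleHeader
{
  // word 0: bounding box minimum and vertex count
  unsigned bboxMinX  : 8;
  unsigned bboxMinY  : 8;
  unsigned bboxMinZ  : 8;
  unsigned vertexMax : 8; // numVertices - 1

  // word 1: bounding box maximum and primitive count
  unsigned bboxMaxX : 8;
  unsigned bboxMaxY : 8;
  unsigned bboxMaxZ : 8;
  unsigned primMax  : 8; // numPrimitives - 1

  // word 2: normal cone and index packing mode
  signed   coneOctX   : 8;
  signed   coneOctY   : 8;
  signed   coneAngle  : 8;
  unsigned vertexPack : 8; // for example 16- or 32-bit indices

  // word 3: location of this meshlet's packed indices
  unsigned meshletIndicesBufferOffset : 32;
};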

For more information about a basic implementation, see the nvmeshlet_packbasic.hpp file.
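As a rough illustration of how the 8-bit bounding-box fields of such a header might be filled, here is a minimal C++ sketch. It is not taken from nvmeshlet_packbasic.hpp: the function names (quantizeBboxMin, quantizeBboxMax, packBytes), the choice of quantizing relative to the enclosing object's extent, and the byte order within a word are all assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helpers: quantize one bounding-box coordinate of a meshlet to
// 8 bits, relative to the extent [lo, hi] of the enclosing object (assumed).
// The minimum rounds down and the maximum rounds up, so the quantized box
// always encloses the original one and culling stays conservative.
inline uint8_t quantizeBboxMin(float v, float lo, float hi) {
  assert(hi > lo && v >= lo && v <= hi);
  return static_cast<uint8_t>(std::floor((v - lo) / (hi - lo) * 255.0f));
}

inline uint8_t quantizeBboxMax(float v, float lo, float hi) {
  assert(hi > lo && v >= lo && v <= hi);
  return static_cast<uint8_t>(std::ceil((v - lo) / (hi - lo) * 255.0f));
}

// Pack four 8-bit fields into one 32-bit header word, first field in the
// lowest byte. Explicit shifts are used because C++ bit-field layout is
// implementation-defined.
inline uint32_t packBytes(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
  return uint32_t(a) | (uint32_t(b) << 8) | (uint32_t(c) << 16) |
         (uint32_t(d) << 24);
}
```

A first header word could then be built as packBytes(minX, minY, minZ, numVertices - 1), mirroring the field order shown above.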

CAD rendering sample for OpenGL or Vulkan

The nvpro-samples/gl_vk_meshlet_cadscene GitHub example showcases many details and allows benchmarking and playing with settings such as meshlet configurations, task shader and primitive culling, and number of outputs per-vertex.

Use cases

In this post, I focus mostly on content with lots of triangles, as seen in CAD or DCC applications.

Avoiding index buffer bottleneck

Especially in CAD, 32-bit index buffers may be used to render large geometries. Mesh shaders using meshlet encodings can substantially decrease the amount of memory used compared to the original index buffer, while still using 32-bit indices. This bandwidth saving also results in much better performance when rendering such models. While you can also use a traditional 16-bit index buffer and chunk the model, meshlet encoding avoids doing such vertex chunking.

Example: Lucy statue scan

Here’s some example data for a dense triangle model: the 27M-triangle Lucy statue scan. To isolate the costs of the pre-rasterization stages, the model was rendered completely invisibly with no rasterization, and no culling logic was applied. Performance tests were run on an RTX 3090, both by using a single draw call using 32-bit indices, and by spatially splitting the model into multiple vertex and index buffers to use 16-bit indices.

| Render time | 32-bit indices, single | 16-bit indices, split |
|---|---|---|
| vertex shader | 1.8 ms | 1.1 ms |
| mesh shader | 0.9 ms | 0.8 ms |

Table 1. Preliminary performance from RTX 3090.

You can see the substantial gains in the 32-bit index test case, but even with 16-bit indices, mesh shaders can still improve performance.

Custom mesh representation

The flexibility of the mesh shader allows you to experiment with alternative data representations and encodings. In this section, I show the memory impact of some different data representations, using the same scene as before.

| Sizes | 32-bit indices, single | 16-bit indices, split |
|---|---|---|
| vertex buffer | 215 MB | 218 MB |
| index buffer | 320 MB | 160 MB |
| meshlet basic indices | 170 MB | 130 MB |
| meshlet delta indices bits | 123 MB | 116 MB |
| meshlet w. quantized positions | 160 MB | 161 MB |

Table 2. Geometry data sizes for Lucy statue model rendered via 32-bit or 16-bit indices.

  • Meshlet basic indices: This variant was used for the rendering times above, which nicely shows the memory and bandwidth savings over the regular index buffer. It stores vertex indices as either 16- or 32-bit values, depending on the highest value within the meshlet.
  • Meshlet delta indices: Here, each meshlet stores its first vertex index as a 32-bit reference value, along with the length of its delta bit mask, that is, the number of bits needed to record where the meshlet's indices differ from the reference. The remaining vertex indices are then stored compactly, each using (length of delta mask) bits.
  • Meshlet with quantized positions: This technique applies the previous delta-encoding technique to the vertex positions of each meshlet as well. The positions are quantized to use only a few bits per component and are stored relative to the bounding box of the meshlet. Be aware that such quantization can lead to cracks in the mesh. Strictly speaking, this position quantization doesn’t require a mesh shader, as programmable vertex fetching is also possible. However, that approach cannot benefit from sharing computations or decoding logic, and it most likely requires encoding the cluster and local vertex index into a single 32-bit traditional index value, which increases index buffer traffic.
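
To make the delta index variant more concrete, here is a minimal CPU-side C++ sketch of one possible reading of it. It is an assumption rather than the encoding used in the sample: each index is XORed with the first index of the meshlet, and the bit length is taken from the OR of all those differences. The names DeltaIndices, encodeDelta, and decodeDelta are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for one meshlet's delta-encoded vertex indices.
struct DeltaIndices {
  uint32_t reference = 0;       // first vertex index, stored in full
  uint32_t bitsPerIndex = 0;    // length of the delta bit mask
  std::vector<uint32_t> words;  // remaining indices, bitsPerIndex bits each
};

inline DeltaIndices encodeDelta(const std::vector<uint32_t>& indices) {
  assert(!indices.empty());
  DeltaIndices out;
  out.reference = indices[0];

  // Collect every bit in which any index differs from the reference.
  uint32_t mask = 0;
  for (uint32_t idx : indices) mask |= idx ^ out.reference;
  for (; mask != 0; mask >>= 1) ++out.bitsPerIndex;
  if (out.bitsPerIndex == 0) return out;  // all indices equal the reference

  const uint64_t totalBits = uint64_t(out.bitsPerIndex) * (indices.size() - 1);
  out.words.assign(size_t((totalBits + 31) / 32), 0u);

  uint64_t bitPos = 0;
  for (size_t i = 1; i < indices.size(); ++i) {
    const uint64_t delta = indices[i] ^ out.reference;
    const size_t word = size_t(bitPos / 32);
    const uint32_t shift = uint32_t(bitPos % 32);
    out.words[word] |= uint32_t(delta << shift);
    if (shift + out.bitsPerIndex > 32)  // value straddles two words
      out.words[word + 1] |= uint32_t(delta >> (32 - shift));
    bitPos += out.bitsPerIndex;
  }
  return out;
}

inline uint32_t decodeDelta(const DeltaIndices& enc, size_t i) {
  if (i == 0 || enc.bitsPerIndex == 0) return enc.reference;
  const uint64_t bitPos = uint64_t(enc.bitsPerIndex) * (i - 1);
  const size_t word = size_t(bitPos / 32);
  const uint32_t shift = uint32_t(bitPos % 32);
  uint64_t value = enc.words[word] >> shift;
  if (shift + enc.bitsPerIndex > 32)
    value |= uint64_t(enc.words[word + 1]) << (32 - shift);
  const uint64_t mask = (uint64_t(1) << enc.bitsPerIndex) - 1;
  return enc.reference ^ uint32_t(value & mask);
}
```

In this sketch, a meshlet whose indices all lie close to one another needs only a few bits per index, which is where the savings over 16- or 32-bit storage would come from.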

Micro instancing

When rendering lots of objects with few primitives—such as <500 triangles (this number varies depending on GPU)—your application can become front-end bound even when using instancing. CAD data often contains a mix of the following:

  • Highly pre-tessellated smooth surfaces
  • Lots of small auxiliary objects, such as nuts and bolts, stitch patterns, electronic parts, letters in geometric text, symbols
  • Surfaces composed of smaller instanced details, such as drilling holes

In DCC, vegetation is often made of heavily instanced geometry, such as groups of leaves and twigs and grass bushes, or detail objects for environments, such as greebles and rocks.

The task shader can serve as a distributed dispatcher for such small objects, which is more efficient than indirect drawing or hardware instancing.

Procedural geometry

Prefer mesh shaders over geometry shaders, as they give you more control over primitive topologies and allow for sharing computations. For example, you can compute a center position once and then add deltas for the left and right vertices. Triangle strips can be generated more efficiently across the warp, subgroup/wave intrinsics can be used to calculate distances along the strip, and so on.

These procedural representations can be used to draw annotation markers (arrows, special stipple patterns, and so on) as well as other shapes, especially in 2D scenarios. These meshes do not always need to describe the shape at full detail. They can serve as a proxy hull for the actual shape, which is generated in the fragment shader using signed distance field evaluations at a per-pixel level.

Example: Path distance

Here’s how to use subgroup intrinsics to compute a running distance along a path:

  vec3 posA = getStipplePos(pos);
  vec3 posB = subgroupShuffleUp(posA,1);
  if (SubgroupInvocation == 0){
    posB = posA;
  }
  float dist = distance(posA,posB);
  dist = subgroupInclusiveAdd(dist);

The liberal thread mapping also means that you can apply delta values to a common centerline for triangle strip vertices, rather than computing everything per-vertex.
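
For reference, the result of the shuffle-up and inclusive-add sequence in the path distance example can be reproduced on the CPU. The following C++ sketch emulates a single subgroup; the Vec3 type and the function names are assumptions made for the example, and the 32-invocation limit reflects the warp size on Turing and Ampere.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };  // hypothetical stand-in for GLSL vec3

inline float pointDistance(const Vec3& a, const Vec3& b) {
  const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
  return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// CPU reference for one subgroup: element i receives the path length from
// the first point up to point i. The subgroup version computes the same
// values in parallel with a shuffle-up followed by an inclusive add.
inline std::vector<float> runningDistance(const std::vector<Vec3>& pos) {
  assert(pos.size() <= 32 && "emulates a single 32-wide subgroup");
  std::vector<float> dist(pos.size());
  float sum = 0.0f;
  for (size_t i = 0; i < pos.size(); ++i) {
    // Invocation 0 has no predecessor, so it uses its own position.
    const Vec3& prev = (i == 0) ? pos[i] : pos[i - 1];
    sum += pointDistance(pos[i], prev);  // inclusive prefix sum
    dist[i] = sum;
  }
  return dist;
}
```

Paths longer than one subgroup would additionally need the total of each subgroup carried over to the next, which this sketch leaves out.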

In-pipeline level of detail

Instead of using ExecuteIndirect (DX12) or MultiDrawIndirect (GL/VK), a task shader can be used directly to choose the level of detail for an object. Depending on the setup, this may not be faster than using indirect drawing. It does, however, avoid a dedicated pass to compute the indirect data as well as the storage buffers for it. Draw indirect should still be used to manage draw call-level culling, because issuing a draw call just so that a single task shader invocation can decide whether to draw the object is not efficient.

Cluster culling

If there is still a decent amount of geometry that can be culled at finer granularity than what the existing culling solution delivers, then cluster culling in the task shader is beneficial. The task shader typically culls one meshlet per thread. Like compute shaders performing cluster culling, it is common to use a bounding shape for frustum or occlusion tests and a cone to approximate the triangle normals in the cluster for back-face culling. 
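
The normal cone test can be sketched as follows in C++. This is a simplified version that assumes a single view direction for the whole cluster, as in an orthographic projection. A perspective camera would also have to account for the view vector varying across the cluster's bounding shape. The names and the representation of the cone as a unit axis plus a half-angle are assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };  // hypothetical stand-in for GLSL vec3

inline float dot(const Vec3& a, const Vec3& b) {
  return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Returns true when every triangle of the cluster faces away from the viewer.
//   coneAxis      unit vector, roughly the average triangle normal
//   coneHalfAngle largest angle (radians) between the axis and any normal
//   viewDir       unit vector pointing from the camera toward the cluster
// A triangle with normal n is back-facing when dot(n, viewDir) > 0. That
// holds for all normals inside the cone when the angle between axis and view
// direction is below 90 degrees minus the half-angle, which is equivalent to
// dot(coneAxis, viewDir) > sin(coneHalfAngle).
inline bool coneIsBackfacing(const Vec3& coneAxis, float coneHalfAngle,
                             const Vec3& viewDir) {
  assert(coneHalfAngle >= 0.0f);
  const float halfPi = 1.57079632679f;
  if (coneHalfAngle >= halfPi) return false;  // cone too wide to ever cull
  return dot(coneAxis, viewDir) > std::sin(coneHalfAngle);
}
```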

Example: Typical task shader

Here’s an example of a typical task shader:

  // one meshlet per thread; doRender is true if the meshlet survives culling
  bool  doRender    = cullCluster(meshlets[GlobalInvocation]);
  uvec4 vote        = subgroupBallot(doRender);
  uint  numMeshlets = subgroupBallotBitCount(vote);

  if (LocalInvocation == 0) {
    // launch one mesh shader workgroup per surviving meshlet
    gl_TaskCountNV = numMeshlets;
  }

  // compacted output slot: number of surviving meshlets in lower invocations
  uint idxOffset = subgroupBallotExclusiveBitCount(vote);
  if (doRender)
  {
    OUT.meshletIDs[idxOffset] = GlobalInvocation;

    // in GLSL it is possible to use uint8_t to store just the relative
    // invocation to the Task Shader WorkGroup and therefore save more output space.

    // OUT.meshletBaseID = WorkGroupID; // write once per workgroup
    // OUT.meshletSubIDs[idxOffset] = uint8_t(LocalInvocation);
  }
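
The ballot-based compaction in this task shader can be checked against a small CPU emulation. The C++ sketch below models one 32-thread workgroup; TaskOutput and compactVisible are hypothetical names, and the per-meshlet visibility flags stand in for the result of the culling function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU mirror of the task shader outputs.
struct TaskOutput {
  uint32_t taskCount = 0;            // corresponds to gl_TaskCountNV
  std::vector<uint32_t> meshletIDs;  // compacted list of surviving meshlets
};

inline uint32_t bitCount(uint32_t v) {
  uint32_t n = 0;
  for (; v != 0; v &= v - 1) ++n;  // clears the lowest set bit each step
  return n;
}

// Emulates one 32-thread task shader workgroup: lane i handles meshlet
// firstMeshlet + i, and visible[i] is that lane's culling result.
inline TaskOutput compactVisible(uint32_t firstMeshlet,
                                 const std::vector<bool>& visible) {
  assert(visible.size() <= 32);

  uint32_t vote = 0;  // subgroupBallot: one bit per lane
  for (size_t lane = 0; lane < visible.size(); ++lane)
    if (visible[lane]) vote |= 1u << lane;

  TaskOutput out;
  out.taskCount = bitCount(vote);  // subgroupBallotBitCount
  out.meshletIDs.resize(out.taskCount);

  for (size_t lane = 0; lane < visible.size(); ++lane) {
    if (!visible[lane]) continue;
    // subgroupBallotExclusiveBitCount: set bits in lanes below this one
    const uint32_t lowerLanes = (1u << lane) - 1u;
    const uint32_t idxOffset = bitCount(vote & lowerLanes);
    out.meshletIDs[idxOffset] = firstMeshlet + uint32_t(lane);
  }
  return out;
}
```

Because each surviving lane derives its slot from the lanes below it, the output list is dense and keeps the original meshlet order without any atomics.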

Primitive culling

When the geometry contains a lot of small triangles, doing additional per-triangle culling in the mesh shader can be beneficial. CAD datasets often have many triangles that are smaller than a pixel, but this also can apply to film-quality or scanned assets.

Doing this level of culling adds additional complexity to the mesh shader, so it is not always a win. However, my research has shown that in scenarios where the mesh shaders are simple, there can be gains close to 2x. For example, this might be when you are rendering a z-buffer or visibility buffer, such as rendering per-primitive triangle IDs.

In the opposite case, when the mesh shader has a lot of computationally intensive outputs and a lot of triangles are invisible, per-triangle culling can help skip computations and data fetches for all those vertex outputs. However, this is one of the cases where it’s not clear in advance if it’s a win, due to the cost of the large output space. In these scenarios, it may be better to use hardware barycentrics (also a feature since Turing) and fetch and compute additional shading data in the pixel/fragment shader, keeping the mesh shader output size small.

Example: Rungholt model

In this test, I rendered the Rungholt scene multiple times using a near-orthographic projection. The scene is rendered at different zoom levels, with the number of instances increasing the more it is zoomed out. This makes pixels across the screen have roughly similar costs while decreasing the average triangle size. All meshlets used up to 64 vertices and 84 triangles.

The ids variant output primitive IDs like those used in visibility buffer scenarios. There were no vertex outputs. Other variants performed basic Phong shading and passed eight outputs from the vertex to the pixel shader. The model was split into chunks to leverage 16-bit indices. This reduced the benefit of mesh shaders slightly but is a more realistic scenario for games.

There was no cluster culling, therefore no task shader. All triangles were contained in the camera frustum.

| renderer | 4.5 pixel² | 1.7 pixel² | 1 pixel² | 0.46 pixel² | 0.26 pixel² |
|---|---|---|---|---|---|
| mesh | 1.05 | 1.05 | 1.04 | 1.04 | 1.04 |
| mesh tricull | 1.00 | 1.05 | 1.12 | 1.26 | 1.31 |
| mesh tricull “ids” | 1.11 | 1.40 | 1.64 | 1.83 | 1.78 |

Table 3. Performance factor relative to vertex shader (higher is better) on Quadro RTX 6000.

Tips and techniques

Mesh and task shaders have some new unique properties compared to existing shaders. Here are some tips from my colleagues and me about how to use mesh and task shaders well on hardware with the Turing or Ampere architectures.

Common tips

  • Mesh and task shaders are sensitive to their output sizes, as these influence warp occupancy and other in-pipeline on-chip data flow, much like shared memory in compute shaders.
  • Do not use configurations where a local work group size is smaller than a warp. For Turing and Ampere GPUs, design your code to use 32 threads for both mesh and task shaders. Higher numbers may be exposed in the APIs but can yield sub-optimal performance.

Task shader tips

  • Task shaders performing cluster culling likely provide the most benefits.
  • We recommend using small task shader outputs, ideally below 236 or 108 bytes. The driver reserves up to 20 bytes, so these limits keep the total within 256 or 128 bytes, that is, within one or a few cache lines. Use smaller data types such as uint8_t where they are sufficient and available.
  • Task shaders can come with some overhead. For instance, if a single draw call only generates a few meshlets (mesh shader invocations) at most, then adding the task shader to perform culling introduces latency not worth the improvements from culling. In this case, either don’t use a task shader or batch multiple such small draw calls into a big chunk of work where there are many task shader workgroups kicked off. For more information, see Task Shader Overhead.
  • If you expect the task shader to do a substantial amount of work, consider using a dedicated compute pass, and using its results later. The hardware can focus on one regime, rather than schedule between multiple different workloads—task, mesh, or pixel shading—and the cost of writing and reading intermediate data may be mitigated by the data cache.
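
To put the output size recommendation into numbers, the following C++ sketch compares two hypothetical payload layouts for a 32-thread task shader, mirroring the meshletIDs and meshletBaseID/meshletSubIDs options from the earlier task shader example. The 128-byte line size is inferred from the 108- and 236-byte figures plus the 20 reserved bytes. The struct and constant names are assumptions.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kTaskThreads = 32;      // one warp on Turing and Ampere
constexpr std::size_t kLineBytes = 128;       // inferred: 108 + 20 reserved
constexpr std::size_t kDriverReserved = 20;   // upper bound stated above

// Layout A (hypothetical): one full 32-bit meshlet ID per thread.
struct PayloadFullIDs {
  uint32_t meshletIDs[kTaskThreads];
};

// Layout B (hypothetical): one base ID per workgroup plus 8-bit offsets.
struct PayloadSubIDs {
  uint32_t meshletBaseID;
  uint8_t  meshletSubIDs[kTaskThreads];
};

static_assert(sizeof(PayloadFullIDs) == 128, "32 x 4 bytes");
static_assert(sizeof(PayloadSubIDs) == 36, "4 + 32 x 1 bytes");

// True if the payload plus the driver's reservation fits in `lines` lines.
constexpr bool fitsInLines(std::size_t payloadBytes, std::size_t lines) {
  return payloadBytes + kDriverReserved <= lines * kLineBytes;
}
```

Under these assumptions, layout A exceeds the 108-byte budget and spills into a second line, while layout B stays well within a single one.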

Mesh shader tips

While task shaders are typically straightforward to set up, mesh shaders have a few more intricacies. There are more settings to configure, and existing vertex shader code must often be ported.

Migration from vertex shaders

If you transition your vertex shading to be based on gl_VertexIndex / SV_VertexID and to fetch its vertex data manually, then you should be able to generate mesh shaders from such vertex shaders using DXIL/SPIR-V. Alternatively, you might be able to refactor your code a bit to express vertex shading in such a way that it can be easily used as a function in a skeleton vertex or mesh shader. This should make it straightforward to add mesh shaders, given that the meshlet decoding logic is typically the same.

WorkGroup and output configuration

  • As Turing and Ampere natively support a single warp (32 threads) per work group, DX12 must emulate work group sizes with a higher number of threads. Therefore, it can be beneficial to write some of this looping manually, as it may yield better code than the emulation. We still recommend using 64 vertices, rather than just 32, as it improves vertex reuse over the classic vertex shader pipeline, which batches at most 32 vertices.
  • When there are hardly any other per-primitive or per-vertex outputs, which are cost multipliers for the output space, you can use larger meshlet configurations with 96 or 128 vertices.
  • The number of primitives should typically be a bit higher than the number of vertices, due to vertex re-use in triangle topologies. Due to alignment padding, we recommend a maximum of 84 or 124 primitives.
  • Most of the time, a maximum of 64 vertices and 84 primitives with a 32-thread configuration and some unrollable loops to handle vertex/primitive processing works well on Turing and Ampere. Be aware that other hardware may prefer using a higher number of threads and keeping a fixed thread-to-primitive or thread-to-vertex mapping when writing outputs. Check with the appropriate vendors.
  • To allow the hardware to batch data loads in those unrollable loops, try to avoid too much conditional logic. For example, clamp load addresses to the last valid vertex instead of branching:
  vertexLoaded = min(thread + loopIteration * 32, meshlet.vertexMax);
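
To see what address clamping does in practice, the following C++ sketch lists which vertex each thread would load across the loop iterations of a 64-vertex configuration. Clamping means taking the minimum of the thread's slot and the last valid vertex, so out-of-range threads re-read that vertex rather than branching. The function name and parameters are assumptions for the example. A real shader would still skip writing outputs for slots beyond the meshlet's vertex count.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kWarpSize = 32;  // threads per workgroup on Turing/Ampere

// For one thread, returns the vertex slot it loads in each loop iteration.
//   maxVertices  compile-time meshlet limit, for example 64
//   vertexMax    numVertices - 1 of the current meshlet
inline std::vector<uint32_t> vertexLoadsForThread(uint32_t thread,
                                                  uint32_t maxVertices,
                                                  uint32_t vertexMax) {
  assert(thread < kWarpSize && maxVertices % kWarpSize == 0);
  std::vector<uint32_t> loads;
  for (uint32_t it = 0; it < maxVertices / kWarpSize; ++it) {
    // Every thread issues a load in every iteration; out-of-range slots are
    // clamped to the last valid vertex, so no branch is needed.
    loads.push_back(std::min(thread + it * kWarpSize, vertexMax));
  }
  return loads;
}
```

For a meshlet with 40 vertices (vertexMax = 39), thread 5 loads slots 5 and 37, while thread 10 loads slot 10 and then the clamped slot 39 instead of the out-of-range slot 42.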

Primitive culling

For per-primitive culling, use subgroup intrinsics to compact the output triangle indices. While it is possible to create degenerate triangles instead, we recommend using compaction of indices for NVIDIA hardware as it means that less data is processed by further stages.

  uint outTriangles = 0;
  for (uint loop = 0; loop < primitiveLoops; loop++)
  {
    uvec3 triIndices = getTriangleIndices(loop * SubgroupSize + SubgroupInvocation);
    bool  triVisible = cullTriangle(triIndices, ...); // true if the triangle survives culling

    uvec4 voteTris   = subgroupBallot(triVisible);
    uint  numTris    = subgroupBallotBitCount(voteTris);

    // output slot: triangles written so far + visible triangles in lower invocations
    uint  idx        = outTriangles + subgroupBallotExclusiveBitCount(voteTris);
    if (triVisible)
    {
      idx = idx * 3;
      gl_PrimitiveIndicesNV[idx + 0] = triIndices.x;
      gl_PrimitiveIndicesNV[idx + 1] = triIndices.y;
      gl_PrimitiveIndicesNV[idx + 2] = triIndices.z;
    }
    outTriangles += numTris;
  }
  gl_PrimitiveCountNV = outTriangles;

Conclusion

As you can see, the task and mesh shaders are quite versatile and can offer additional flexibility or performance over previous shaders.

Some details on the optimal mesh shader behavior—like the preferred number of threads in a work group or how to do primitive culling—are hardware-specific and may change over generations and vendors. However, with some minor code adjustments to account for these, the bulk of the work and shader setup should stay the same. 

I hope that these recommendations help you when geometric complexity becomes a bottleneck, and you are looking for alternatives that allow in-pipeline decisions and mesh allocation.

Acknowledgements

Special thanks to Ziyad Hakura, Neil Bickford, and Matthijs De Smedt for their contributions.