Today's games are visually more interesting and complex than ever before. Geometric complexity—how many objects are visible and how detailed each looks—is one of the dimensions in which games are making leaps and bounds.
Advances in technology are partly responsible for these leaps and bounds: CPUs, memory, and buses all have become faster, but specifically GPUs are undergoing significant change and are becoming ever more powerful—at a rate faster than Moore's Law.
These GPU changes include incorporating fixed-function processing for the vertex- and pixel-shading units, then generalizing those to be fully programmable. GPUs also have gained more units to process pixels and vertices in parallel: the GeForce 6800 Ultra, for example, incorporates 6 vertex shader units and 16 pixel pipelines.
Despite these performance advances, rendering complex scenes is still more difficult than simply dumping all geometry onto the GPU and forgetting about it. The simple approach tends to fail either because the generated GPU workload turns out to be excessive, or because the associated CPU overhead is prohibitive. This part of the book discusses the challenges today's games face in rendering complex geometric scenes.
Chapter 1, "Toward Photorealism in Virtual Botany" by David Whatley of Simutronics Corporation, provides a holistic view on how to render nature scenes. It explains the multitude of different techniques, from scene management and rendering various plant layers to post-processing effects, that Simutronics' Hero's Journey employs to generate complex and stunning visuals.
Rendering terrain is a good example of why simply dumping all available data to the GPU cannot work: the horizon represents a near-infinite amount of vertex data and thus workload. Arul Asirvatham and Hugues Hoppe of Microsoft Research use vertex texture fetches for a new highly efficient terrain-rendering algorithm. Their technique avoids overloading the GPU even as it shifts most work onto the GPU and away from the CPU, which too often is the bottleneck in modern games. Chapter 2, "Terrain Rendering Using GPU-Based Geometry Clipmaps," provides all the implementation details.
As already mentioned, another way to increase geometric complexity is to increase the number of visible objects in a scene. The straightforward solution of drawing each object independently of the others, however, quickly bogs down even a high-end system. It is much easier to efficiently draw ten objects that are one million triangles each, than it is to draw one million objects that are ten triangles each. Francesco Carucci of Lionhead Studios faces this very problem while developing Black & White 2, the sequel to Lionhead's critically acclaimed Black & White. Chapter 3, "Inside Geometry Instancing," describes his solution: a framework of instancing techniques that applies to legacy GPUs as well as to GPUs supporting DirectX 9's instancing API. Jon Olick of 2015 provides further optimizations to the instancing technique that prove beneficial for 2015's title Men of Valor: Vietnam. Jon describes his findings in Chapter 4, "Segment Buffering."
Also, as games incorporate more and more data—more complex scenes of more complex meshes rendered in multiple, disparate passes supporting the gamut of differing functionality from legacy to current high-end GPUs—managing this glut of data efficiently becomes paramount. Oliver Hoeller and Kurt Pelzer of Piranha Bytes are currently working on Piranha Bytes' Gothic III engine. They share their solutions in Chapter 5, "Optimizing Resource Management with Multistreaming."
The best way to render lots of geometry to create geometric complexity is to avoid rendering the occluded parts. Michael Wimmer and Jirí Bittner of the Vienna University of Technology explore how best to apply that idea in Chapter 6, "Hardware Occlusion Queries Made Useful." Occlusion queries are a GPU feature that provides high-latency feedback on whether an object is visible or not after it is rendered. Unlike earlier occlusion-query culling techniques, Michael and Jirí's algorithm is pixel-perfect. That is, it introduces no rendering artifacts, generates a near-optimal set of visible objects to render, does not put unnecessary load on the GPU, and has minimal CPU overhead.
Similarly, increasing geometric detail only where visible and simplifying it when and where it isn't visible is a good way to avoid excessive GPU loads. View-dependent and adaptive subdivision schemes are an appealing solution that the offline-rendering world already employs to render their highly detailed models to subpixel geometric accuracy. Subdivision surfaces have not yet found a place in today's real-time applications, partly because they are not directly supported in graphics hardware. Rendering subdivision surfaces thus seems out of reach for real-time applications. Not so, says Michael Bunnell of NVIDIA Corporation. In Chapter 7, Michael shows how his implementation of "Adaptive Tessellation of Subdivision Surfaces with Displacement Mapping" is already feasible on modern GPUs and results in movie-quality geometric detail at real-time rates.
Finally, faking geometric complexity with methods that are cheaper than actually rendering geometry allow for higher apparent complexity at faster speeds. Replacing geometry with textures that merely depict it used to be an acceptable trade-off—and in the case of grates and wire-mesh fences, often still is. Normal mapping is a more sophisticated fake that properly accounts for lighting information. Parallax mapping is the latest craze that attempts to also account for intra-object occlusions. William Donnelly of the University of Waterloo one-ups parallax mapping: he describes "Per-Pixel Displacement Mapping with Distance Functions" in Chapter 8. Displacement mapping provides correct intra-object occlusion information, yet minimally increases computation cost. His technique gives excellent results while taking full advantage of the latest programmable pixel-shading hardware. Even better, it is practical for applications today.
Matthias Wloka, NVIDIA Corporation
David Whatley
Simutronics Corporation
Rendering natural scenes in real time, while leaving enough CPU and GPU resources for other game-engine requirements, is a difficult proposition. Images of botany require a great deal of visual depth and detail to be convincing. This chapter describes strategies for rendering more photorealistic natural scenes in a manner that is friendly to real-time game engines. The methods presented here work together to create a convincing illusion of grassy fields teeming with plants and trees, while not overwhelming either the CPU or the GPU. These techniques are used in Simutronics' Hero's Journey, as shown in Figure 1-1.
Figure 1-1 Babbling Brook: A Nature Scene from Hero's Journey
We begin by describing the foundation for managing scene data in large outdoor environments. Next, we provide details on how to maximize throughput of the GPU to achieve the required visual density in grass. Then we expand on these techniques to add ground clutter and larger-scale botany, such as trees. Finally, we tie the visuals together with shadowing and environmental effects.
Game engines must manage their rendering techniques to match the scope of the environment they hope to visualize. Game levels that feature nature scenes, made up of thousands of trees and bushes and perhaps millions of blades of grass, present significant data management problems that must be solved to allow rendering at interactive frame rates.
Rendering a virtual nature scene convincingly is both an artistic and a technical challenge. We can approach the rendering of nature much like a painter: break down the elements into layers and treat each layer independently to ultimately create a unified whole. For example, a layer of grass, a layer of ground clutter, a layer of trees, and so on. All these layers share some common properties, which we can leverage to compress our data representation.
Our goal is to travel the game camera over long distances of convincing outdoor scenes without having to dedicate excessive memory resources to managing the task. With guided deterministic random-number generation, we have an algorithm that can "plant" all of the elements of nature in a reasonable manner while achieving the same visual results each time we revisit the same spot on the map. In an online game, everyone would see the same thing right down to the placement of a blade of grass without this placement being permanently stored in memory.
We establish a world-space fixed grid around the camera to manage the planting data for each layer of plants and other natural objects. Each grid cell contains all of the data to render its layer in the physical space it occupies. In particular, the cell data structure stores the corresponding vertex and index buffers, along with material information to represent what is drawn.
For each layer of botany, we establish a distance from the camera that the layer needs to generate visuals; this determines the size of our virtual grid. As the camera moves, the virtual grids travel with it. When a grid cell is no longer in the virtual grid, we discard it and add new cells where needed to maintain our full grid structure. As each cell is added, a planting algorithm is used to fill in the layer with the appropriate data for rendering. See Figure 1-2.
Figure 1-2 The Virtual Grid
For each cell that is filled with natural objects, we need to pick suitable spots on the ground where those objects are placed. The heuristic used to choose these spots depends on the type of object being placed. Generally, we would like to pick random spots, at some desired density, and then see if the corresponding points on the ground are acceptable for what we are planting. In our implementation, a ground polygon's material determines what layers are applicable.
The obvious approach is to randomly cast rays straight down at the ground within the volume of the cell. Each time we hit a polygon, we check to see if it is suitable (Can grass be planted here? Is the slope too severe?). If we succeed, then we have a planted point. We continue until we reach the proper density.
This approach yields good results but has significant problems. First, in grid cells where there are few suitable places to plant (for example, just the top of a polygon that is marked for grass), we can burn inordinate amounts of CPU time trying to randomly achieve our density requirement. So in the worst case, we must abandon our search if we reach some maximum limit of planting attempts. Second, we cannot handle overlapping terrain (such as a land bridge) with this approach.
A better approach is to collect all of the polygons that intersect the cell, discard all polygons inappropriate for planting, and then scan-convert them to find suitable spots for planting. This is similar to rasterizing a polygon for rendering, but instead each "pixel" of our traversal is a world-space potential planting point. We must be careful to keep the scan conversion rate appropriate to the density, while not exceeding the boundaries of the triangle. Further, at each planting point we select, it is important to offset along the plane of the polygon by some suitable random distance to eliminate repeating patterns. All of these values are adjustable coefficients that should be determined during design time. In our implementation, the designer can interactively tweak these values to achieve the desired result for the layer.
Finally, when scan-converting we also must take care to clip to the polygon edges (when offsetting) as well as to the cell's border, because the polygon may extend beyond it (and another cell is managing the planting there).
Planting in this manner can take place in real time or as part of offline level preprocessing. In the latter case, the grass planting spots should be stored in a highly compressed form; the data should be uncompressed at run time as each cell is added to the set of potentially visible cells by the moving camera.
If this planting operation is done in real time, care must be taken to ensure that planting is a fast operation. Collecting polygons in a grid cell can be done quickly by using an AABB tree or a similar data structure. Because many cells may need to be planted suddenly due to continuous camera movement, it is also effective to queue up this task so that we spend only a relatively fixed amount of CPU on the task for each frame. By extending the size of the grid, we can be reasonably sure that all the appropriate planting will take place before the contents of the cell come into view.
Achieving interactive frame rates for endless fields of grass requires a careful balance of GPU techniques and algorithms. The key challenge is to create a visual that has high apparent visual complexity at relatively low computational and rendering cost. Doing so creates a convincing volume of grass. Here we introduce a technique similar to the one presented by Pelzer (2004) in "Rendering Countless Blades of Waving Grass." Our technique yields higher-quality and more-robust results at a reduced GPU and CPU burden. Figure 1-3 shows a scene rendered with our technique.
Figure 1-3 A Convincing Grass Layer
Obviously, drawing each grass blade is out of the question. But we can create such an illusion with clumps of grass, which are best represented by camera-facing quads with a suitable grass texture. Billboards of this nature create the illusion of volume at a minimal cost. However, a large field of grass can still require an excessive number of draw calls, so we must carefully structure our usage of the GPU to achieve sufficient volume and density.
GPUs work best when they are presented with large batch sizes to draw at once. Therefore, our goal is to figure out how to draw fields of grass with a relatively small number of draw calls to the API. The naive approach is to draw one billboard at a time. Instead, what we want is to draw as many as is practical in one draw call.
To achieve this, we use a technique whereby we create a vertex and an index buffer and fill it with a large number of grass billboards. We then draw all these billboards in one call. This algorithm is similar to speeding up a CPU loop by unrolling it.
For our purposes, each layer of grass—that is, all grass that uses the same texture and other parameters—is represented by a vertex and an index buffer pair per grid cell, as shown in Figure 1-4. For each clump of grass (or billboard) we plant, we write its positions into the vertex buffer and update the index buffer accordingly. We need four vertices and six indices per billboard. For each vertex, we set the position to the point where we have planted the grass clump. This means that all four vertices of a billboard have an identical position, but we offset this position in the vertex shader to create the proper camera-facing quad shape. Alternatively, if the grass texture fits within a triangular shape, we can save processing one vertex each. Even better, at this point, indices become unnecessary and can be skipped altogether without loss of performance; no vertex is ever reused out of the post-transformation-and-lighting cache when rendering this sort of triangle soup.
Figure 1-4 Structures for Drawing Each Grid Cell
Once the vertex buffer is created and sent to video memory, we can draw each grid cell's worth of botany with a single draw call. On the GPU, we use a vertex shader to offset each of the vertices so that they form a screen-aligned quad. Since each vertex moves in a different direction, we have to identify which vertex forms what corner of the quad. To do this, we augment our vertex data with two additional floats that contain -1, 0, or 1. The first float is for the x direction on the screen, and the second is for the y. We multiply this factor by our scale in x and y to offset as necessary. Additionally, we can randomly set all -1 and 1 values to slightly different values (such as 0.98 or -1.2) to add size variety to each grass clump.
Though we intend to move the vertex in screen space, we do all our work in world space so that we get the perspective effect for free. To do this, we provide our vertex shader with a vector that points to the right of the camera and another that points up from the camera. Simple math moves the vertex into the correct position:
// For each vertex, we pass a -1, 0, 1 value for x, y, which determines // how it is moved by the right and up vectors of the camera. This // we pass in as a texture coordinate (inTexCoord2). Out.Pos = Input.inPos + (Input.inTexCoord2.x * RightVector) + (Input.inTexCoord2.y * UpVector);
Our approach differs from Pelzer's because we use camera-facing billboards instead of having three quads per clump and doing no screen alignment. The screen-facing billboards create a constant depth at all view angles (even when looking down), whereas the three-quad clump approach progressively breaks down as the camera looks more directly down at the grassy field. In a third-person camera view, typical of many types of games and simulations, this is not an uncommon camera angle.
When rendering grass, we want to use transparency to improve the visual blending and fade out at a distance near the boundary of our virtual grid. However, alpha-based transparency is far from ideal because it requires sorting, which is slow. Although there are techniques that take advantage of the chaotic nature of grass to minimize some of the sorting (Pelzer 2004), we can actually eliminate it altogether.
To do this, we adopt a dissolve effect (also called the screen-door effect) instead of alpha blending to simulate translucency. We accomplish this by modulating the alpha channel of our grass texture with a noise texture. Then we use alpha test to eliminate pixels from rendering. By sliding the alpha test value from 0 to 1, the texture appears to dissolve. The process is shown in Figure 1-5. SpeedTreeRT, a commercial package for real-time foliage creation, was the first to use this technique to blend between levels of detail with alpha testing enabled. Simutronics has licensed IDV's SpeedTreeRT for Hero's Journey, and techniques drawn or adapted from this commercial foliage toolkit are noted as such in this chapter.
Figure 1-5 Components of the Grass Texture
The benefit of this technique is that the alpha test is a fast operation and order-independent. We no longer need to sort, yet grass appears to fade out at a distance. Although a dissolve does not look nearly as good as true alpha translucency under normal circumstances, we can exploit the fractal properties of nature to completely mask any visual artifacts of the screen-door technique. Experimentally, we have found that if we use a Perlin noise texture (Perlin 2002) rather than a random noise texture, the dissolve effect matches the environment well enough to be nearly as good as alpha translucency.
One problem we have, however, is that the alpha test value we select needs to be based on the distance of each billboard from the camera. However, we can only use exactly one alpha test value for each batch (each grid cell) even though the batch is filled with grass clumps at varying distances; this is because we are rendering the entire grid cell in one draw call. Because we want each grass billboard to fade out precisely based on its distance from the camera, we select a fixed alpha test value and instead manipulate the alpha channel in the shaders, linearly interpolating them toward full alpha based on the distance from the camera and the maximum range at which we want grass to start being visible. We do this by adjusting the alpha component of the output color of the vertex shader, which then attenuates the alpha of the texel in the pixel shader (this can also be done with the fixed-function pipeline).
An additional pass can be made over grid cells nearest the camera, drawing with alpha blending to feather the edges of the grass blades into the scene, thus eliminating any harsh edges from blades of grass close to the camera. This can achieve marginally better compositing at the expense of another pass. But because alpha test is still used, relatively few pixels are written to the back buffer (just the edges of the blades, at most). Alpha blending and alpha testing work well together; it is often a good idea to experiment with both to achieve the best illusion of depth and volume.
To increase the realism of our grass, we want to introduce as much variety as we can without impeding frame rate. One approach is to use a variety of images for grass, but our batching approach limits us to one texture per draw call. Fortunately, we can use a larger texture with different variations of grass arranged on them. During the vertex building, we can adjust the UV coordinates to select different subregions of the texture (that is, we can build a texture atlas; see NVIDIA 2004 for more on this technique). It is easy to have, for instance, four different grass clump variations on one texture. Plus, as we unroll grass quads into the vertex buffer, we can randomly flip the U coordinate to mirror the image. Be sure to allow space between your images so that mipmapping does not introduce texel-bleed artifacts.
Each billboard can also carry along color information. This is very useful for tinting a grayscale texture or doing subtle color shifting in the vertex shader, if you also establish a color for each cluster when planting. We have found that Perlin noise works here as well. It is easy, for example, to tint grass from a healthy green to a dying brown to impart broad color variations and break up the repetitiveness of the grass. See Figure 1-6.
Figure 1-6 Using RGB Information to Increase Realism
Lighting plays an important role in how grass looks. For billboard grass, we want to make sure that our grass is lit like the ground underneath. Because the ground naturally undulates, and thus picks up different angles of sunlight, we want to simulate this by attenuating the brightness of the grass. To do so, we need to know the angle of the ground on which the grass is sitting. An easy solution is to pass along this information in the vertex definition as another vector. During planting, we determine the normal of the polygon on which we are planting grass and carry this along in our grass billboard definition. With this approach, the vertex shader can do the same lighting calculation as for the polygon underneath the grass and attenuate its color to match. On hilly terrain, this causes the grass to have the subtle angle-to-the-light cues that the ground has.
Unfortunately, this approach leads to a faceted shading of the grass even though the ground polygons are likely smooth shaded (such as with Gouraud shading). To get around this discrepancy, the normal that is passed through the vertex shader must itself be smoothly interpolated during the planting process.
If the sun angle is dynamic, a simplification is to assume that the ground normal is roughly straight up and then carry out the lighting based on this normal and the light angle. In this way, we do not have to compute or carry the ground polygon normal in the vertex definition. There is a quality trade-off here, but this approach was good enough for our application.
Grass comes alive when wind affects it. Offsetting the top two vertices of the grass quad each frame causes the quad to wave in the wind. We compute this offset by using a sum of sines approximation, similar to computing surface undulation for water (Finch 2004). The trick to this is to carry a blend weight in the vertex definition, where the top two vertices of the grass quad are set to 1 and the bottom two to 0. Then we multiply our wind scale factor by this value, so that the bottom vertices remain firmly planted in the ground. For additional variation, we can randomize the top two vertex weights somewhat during planting. This simulates some grass being more or less rigid.
Grass blades often change their orientation to the light as they wave in the wind, causing them to brighten and darken. We can use the wind term to augment the lighting to simulate this effect. This greatly improves the visual effect of the wind, even for grass clumps at a distance where the physical waving deformation becomes subpixel in size.
Note, however, that the wind factors should not be allowed to deform the grass quads too much. If they do, the resulting deformation will appear comical rather than realistic. Subtlety is the key.
Ground cover consists of more than just waving fields of grass. Twigs, small plants, rocks, and other debris complete the illusion of natural complexity. Some of these can be represented as billboards just as grass is, but the richness of the environment is enhanced when we mix in an assortment of geometric objects, as well.
Just as we did with grass billboards, we unroll our 3D mesh data into vertex and index buffers for each grid cell, which can then be drawn with a single call. We must group our ground clutter into layers that use the same textures and shaders. We can apply random transforms to vary their size and orientation as we pick our planting points, but the transforms must vary depending on the nature of the mesh: an upside-down rock is okay, but an upside-down bush is not. For additional variety, we can pass RGB information to tint the objects just as we did with the grass polygons.
The dissolve technique for handling order-independent transparency effects works exactly the same for 3D meshes as it does for billboards. We modulate the alpha channel of the texture by our Perlin noise texture and use our distance from the camera to attenuate. Then alpha test dissolves the 3D meshes the same way it did with the grass billboards.
Each vertex can be given a weighting value, which allows us to apply the same wind math to the 3D as we did with the billboards. Obviously, we want objects such as rocks and twigs to be rigid, but leafy plants can have artist-driven weights to achieve the proper effect. Figure 1-7 shows an example of a scene with ground clutter.
Figure 1-7 Using Ground Clutter to Add Dense Detail
The trunk and primary branches of a tree should be modeled as 3D meshes. Secondary branches can be simulated with billboards to add visual complexity. For trees that are leafy, we can use techniques similar to the one we used for grass to create clumps of leaves from camera-facing billboards.
The following approach to rendering trees and shrubs is based on SpeedTreeRT, which provides all of the real-time rendering information for tree assets and includes SpeedTreeCAD for parametric authoring of trees (IDV 2004). The actual rendering of trees is still the responsibility of the game engine, which gives developers a lot of flexibility in implementation.
Because trees need to maintain their basic volume over long distances but are expensive to render in great detail, a level-of-detail (LOD) strategy must be employed. As a tree recedes into the distance, larger but fewer leaf clump billboards can be used. For the larger billboards, we use a different texture that shows more but smaller leaves.
For efficiency, all of the textures related to a tree should be consolidated into a single texture, as shown in Figure 1-8. In fact, it is even preferable to have multiple tree textures packed into one texture, so we can draw more of them in one call.
Figure 1-8 Storing Multiple Leaf Cluster Images in a Single Texture
At some suitable distance, we eventually want to represent a tree with a camera-facing billboard. This can be difficult when the tree's profile is asymmetrical. To get around the problem, we can produce tree images for the billboard at various angles and then blend between them based on the angle of the tree and the camera.
If trees are placed manually, we have found that it is best to give the level designer fine control over LOD transition points for each instance. In areas where trees are dense, you can get away with LOD transitions fairly near the camera. But in other places, with long lines of sight but fewer trees, it's easier to maintain their high detail and avoid visual artifacts.
Shrubs and fronds can be handled as just another type of tree using many of the same techniques. For example, a leafy bush is simply a tree with a small or nonexistent trunk. Further, you can turn a tree upside down, turn off the leaves, and get an elaborate exposed root system to marry up with a normal tree.
Trees can be authored in a standard modeling package, but artists need some mechanism to specify where leaf points and branch billboards go. This can complicate exporting. The actual rendering of the trees is still the responsibility of the game engine, which gives developers a lot of flexibility in implementation.
Because we pass an RGB tint for grass and ground clutter, we can choose a dark tint for areas in shadow. This requires us to know whether or not each planted item is in shadow. To be effective in natural environments, this sort of shadowing needs to only grossly approximate the correct shadowing. See Figure 1-9.
Figure 1-9 Grass Shadowed by a Tree Root
One approach is to make the shadow determination when the planting occurs. Simply cast a shadow feeler ray from the planting position toward the dominant light source (the sun) and see if there is an intersection. If so, adjust the RGB values toward the scene's ambient color. Remember that a shadow feeler ray cast is concerned only about any intersection (not just the closest) so it can be much more efficient than a standard collision ray cast.
Soft shadows (technically called antialiased shadows in this context) can be achieved by casting more than one shadow feeler. Figure 1-10 shows how this works. By offsetting each ray start position slightly, three or five ray casts from a given spot can be performed. The fraction of hits is used to attenuate the light between the diffuse sun lighting and the scene's ambient lighting. Widening the offsets increases the softness of the shadowing.
Figure 1-10 Visibility Testing Using Ray Casts
These sorts of shadows are not dynamic, but they can be recomputed fairly quickly at intervals for slow-moving light sources (such as the traveling sun). In general, they provide sufficient visual cues to cause the scenery to seem more lifelike.
Special approximation techniques can be used to cause shadow feeler hits when casting through the bulk of a tree. Instead of looking for intersections with individual leaves, simply collide with the spherical volume of the leafy part of the tree or the cylinder of the trunk. For the leafy part, use a random function to determine if rays intersect based on the density of the leaves. Although this shadowing technique is crude, accurate solutions are not visually distinct.
If planting is precomputed as an offline process, then shadow fidelity can be greatly enhanced. One possibility beyond shadow feelers is to look at the texel of a light map to determine shadowing. This can be difficult in real time if the light map is not in system memory.
Natural environments react to sunlight in ways that are hard to simulate in real time. Like most interactive techniques, the goal is to find methods that achieve a reasonable, if not accurate, result. Using post-processing effects, we can tie the visuals together to achieve a superior environment.
Full-scene post-processing glow is useful for magical and high-tech effects in games, and extending this simple technique for natural effects has proven effective. Real-time glow techniques provide a way to simulate blooming—as well as create a natural softening effect (brought about by a Gaussian blur)(James and O'Rorke 2004). However, because glow is an additive effect, care must be taken to account for this when authoring textures. It is easy to get carried away and overbrighten a scene. As with all effects, just because a little is good, more isn't necessarily better.
A textured sky dome provides a rich opportunity to use glow to our advantage. We can set aside the alpha channel of the sky dome for the artist to define areas that are more luminous. This allows the artist to have a lot of control over how the sun and the clouds interact. When rendering the sky dome, simply apply the alpha layer to the glow channel. Figure 1-11 shows an example.
Figure 1-11 Sky Dome Diffuse Texture and Glow Component
A sky dome with a good amount of glow blooms around the delicate structure of tree leaves and branches. This yields a particularly realistic touch that is almost immediately noticeable, as shown in Figure 1-12. Because the sky glow is controlled by the artist, this technique is particularly effective and finely tuned at no additional cost.
Figure 1-12 Varying Amounts of Sky Dome Glow
Cinematography of natural environments is often enhanced by a technique in which diffusing gauze (cheesecloth) is placed over the camera lens. If post-processing glow is being used, we can simply clear the glow channel to an adjustable, nonzero value to create a full-scene glow. Because the glow is a simulation based on a Gaussian blur, this causes the whole scene to appear to be diffused; also, bright areas bloom slightly without having to resort to more expensive HDR effects. For natural outdoor scenes, this approach can greatly mitigate the harsh computer-generated look of polygons for no extra GPU cost (because full-scene glow processing is already occurring). See Figure 1-13.
Figure 1-13 The Effect of Full-Scene Glow
Rendering nature in a convincing way always adds visual drama to a game. Figure 1-14 illustrates how these techniques can be very effective at conveying a sense of the inherent complexity and grandeur of natural scenery.
Figure 1-14 Creating the Illusion of a Lush Landscape
Finch, Mark. 2004. "Effective Water Simulation from Physical Models." In GPU Gems, edited by Randima Fernando, pp. 5–29. Addison-Wesley.
IDV. 2004. SpeedTreeRT API and SpeedTreeCAD Windows application. http://www.idvinc.com/html/speedtreert.htm
James, Greg, and John O'Rorke. 2004. "Real-Time Glow." In GPU Gems, edited by Randima Fernando, pp. 343–362. Addison-Wesley.
NVIDIA Corporation. 2004. "Improve Batching Using Texture Atlases." SDK white paper. http://download.nvidia.com/developer/NVTextureSuite/Atlas_Tools/Texture_Atlas_Whitepaper.pdf
Pelzer, Kurt. 2004. "Rendering Countless Blades of Waving Grass." In GPU Gems, edited by Randima Fernando, pp. 107–121. Addison-Wesley.
Perlin, Ken. 2002. "Improving Noise." ACM Transactions on Graphics (Proceedings of SIGGRAPH 2002) 21(3), pp. 681–682. http://mrl.nyu.edu/~perlin/paper445.pdf
I would like to thank the entire Simutronics Hero's Journey art and programming team for their invaluable contributions to this effort, especially 3D artists Richard Amsinger and Kyle Knight, whose work is featured in this chapter. Additionally, I would like to thank Dave Dean and Bryan Cool for their programming wizardry, which contributed a great deal to the techniques presented in this chapter. Last, I want to thank our art director, Tracy Butler, for providing the illustrations that help clarify many of the concepts in this chapter.
Arul Asirvatham
Microsoft Research
Hugues Hoppe
Microsoft Research
The geometry clipmap introduced in Losasso and Hoppe 2004 is a new level-of-detail structure for rendering terrains. It caches terrain geometry in a set of nested regular grids, which are incrementally shifted as the viewer moves. The grid structure provides a number of benefits over previous irregular-mesh techniques: simplicity of data structures, smooth visual transitions, steady rendering rate, graceful degradation, efficient compression, and runtime detail synthesis. In this chapter, we describe a GPU-based implementation of geometry clipmaps, enabled by vertex textures. By processing terrain geometry as a set of images, we can perform nearly all computations on the GPU itself, thereby reducing CPU load. The technique is easy to implement, and allows interactive flight over a 20-billion-sample grid of the United States stored in just 355 MB of memory, at around 90 frames per second.
In large outdoor environments, the geometry of terrain landscapes can require significant storage and rendering bandwidth. Numerous level-of-detail techniques have been developed to adapt the triangulation of the terrain mesh as a function of the view. However, most such techniques involve runtime creation and modification of mesh structures (vertex and index buffers), which can prove expensive on current graphics architectures. Moreover, use of irregular meshes generally requires processing by the CPU, and many applications such as games are already CPU-limited.
The geometry clipmap framework (Losasso and Hoppe 2004) treats the terrain as a 2D elevation image, prefiltering it into a mipmap pyramid of L levels as illustrated in Figure 2-1. For complex terrains, the full pyramid is too large to fit in memory. The geometry clipmap structure caches a square window of nxn samples within each level, much like the texture clipmaps of Tanner et al. 1998. These windows correspond to a set of nested regular grids centered about the viewer, as shown in Figure 2-2. Note that the finer-level windows have smaller spatial extent than the coarser ones. The aim is to maintain triangles that are uniformly sized in screen space. With a clipmap size n = 255, the triangles are approximately 5 pixels wide in a 1024x768 window.
Figure 2-1 How Geometry Clipmaps Work
Figure 2-2 Terrain Rendering Using a Coarse Geometry Clipmap
Only the finest level is rendered as a complete grid square. In all other levels, we render a hollow "ring," which omits the interior region already rendered at finer resolutions.
As the viewer moves, the clipmap windows are shifted and updated with new data. To permit efficient incremental updates, the clipmap window in each level is accessed toroidally, that is, with 2D wraparound addressing (see Section 2.4).
One of the challenges with the clipmap structure is to hide the boundaries between successive resolution levels, while at the same time maintaining a watertight mesh and avoiding temporal popping artifacts. The nested grid structure of the geometry clipmap provides a simple solution. The key idea is to introduce a transition region near the outer perimeter of each level, whereby the geometry and textures are smoothly morphed to interpolate the next-coarser level (see Figure 2-3). These transitions are efficiently implemented in the vertex and pixel shaders, respectively.
Figure 2-3 Achieving Visual Continuity by Blending Within Transition Regions
The nested grid structure of the geometry clipmap also enables effective compression and synthesis. It allows the prediction of the elevation data for each level by upsampling the data from the coarser level. Thus, one need only store or synthesize the residual detail added to this predicted signal.
The original implementation of geometry clipmaps presented in Losasso and Hoppe 2004 represents each clipmap level as a traditional vertex buffer. Because the GPU currently lacks the ability to modify vertex buffers, that implementation required significant intervention by the CPU to both update and render the clipmap (see Table 2-1).
|
Our implementation of geometry clipmaps using vertex textures moves nearly all operations to the GPU. |
||
|
Original Implementation [1] |
GPU-Based Implementation |
|
|
Elevation Data |
In vertex buffer |
In 2D vertex texture |
|
Vertex Buffer |
Incrementally updated by CPU |
Constant! |
|
Index Buffer |
Generated every frame by CPU |
Constant! |
|
Upsampling |
CPU |
GPU |
|
Decompression |
CPU |
CPU |
|
Synthesis |
CPU |
GPU |
|
Adding Residuals |
CPU |
GPU |
|
Normal-Map Update |
CPU |
GPU |
|
Transition Blends |
GPU |
GPU |
In this chapter, we describe an implementation of geometry clipmaps using vertex textures. This is advantageous because the 2D grid data of each clipmap window is much more naturally stored as a 2D texture, rather than being artificially linearized into a 1D vertex buffer.
Recall that the clipmap has L levels, each containing a grid of nxn geometric samples. Our approach is to split the (x, y, z) geometry of the samples into two parts:
Because clipmap levels are uniform 2D grids, their (x, y) coordinates are regular, and thus constant up to a translation and scale. Therefore, we define a small set of read-only vertex and index buffers, which describe 2D "footprints," and repeatedly instance these footprints both within and across resolution levels, as described in Section 2.3.2.
The vertices obtain their z elevations by sampling the elevation map as a vertex texture. Accessing a texture in the vertex shader is a new feature in DirectX 9 Shader Model 3.0, and is supported by GPUs such as the NVIDIA GeForce 6 Series.
Storing the elevation data as a set of images allows direct processing using the GPU rasterization pipeline (as noted in Table 2-1). For the case of synthesized terrains, all runtime computations (elevation-map upsampling, terrain detail synthesis, normal-map computation, and rendering) are performed entirely within the graphics card, leaving the CPU basically idle. For compressed terrains, the CPU incrementally decompresses and uploads data to the graphics card (see Section 2.4).
To summarize, the main data structures are as follows. We predefine a small set of constant vertex and index buffers to encode the (x, y) geometry of the clipmap grids. And for each level 0 . . . L–1, we allocate an elevation map (a 1-channel floating-point 2D texture) and a normal map (a 4-channel 8-bit 2D texture). All these data structures reside in video memory.
Because the outer perimeter of each level must lie on the grid of the next-coarser level (as shown in Figure 2-4), the grid size n must be odd. Hardware may be optimized for texture sizes that are powers of 2, so we choose n = 2 k -1 (that is, 1 less than a power of 2) leaving 1 row and 1 column of the textures unused. Most of our examples use n = 255.
Figure 2-4 Two Successive Clipmap Rings
The choice of grid size n = 2 k -1 has the further advantage that the finer level is never exactly centered with respect to its parent next-coarser level. In other words, it is always offset by 1 grid unit either left or right, as well as either top or bottom (see Figure 2-4), depending on the position of the viewpoint. In fact, it is necessary to allow a finer level to shift while its next-coarser level stays fixed, and therefore the finer level must sometimes be off-center with respect to the next-coarser level. An alternative choice of grid size, such as n = 2 k -3, would provide the possibility for exact centering, but it would still require the handling of off-center cases and thus result in greater overall complexity.
Although we have allocated L levels for the clipmap, we often render (and update) only a set of active levels 0 . . . L'-1, where the number of levels L' L is based on the height of the viewpoint over the underlying terrain. The motivation is that when the viewer is sufficiently high, the finest clipmap levels are unnecessarily dense and in fact result in aliasing artifacts. Specifically, we deactivate levels for which the grid extent is smaller than 2.5h, where h is the viewer height above the terrain. Since all the terrain data resides in video memory, computing h involves reading back one point sample (immediately below the viewpoint) from the elevation map texture of the finest active level (L'-1). This readback incurs a small cost, so we perform it only every few frames. Of course, we render level L'-1 using a complete square rather than a hollow ring.
The implementation described in Losasso and Hoppe 2004 allows "cropping" of a level if its data was not fully updated during viewer motion. To simplify our GPU implementation, we have decided to forgo this feature. We instead assume that a level is either fully updated or declared inactive.
As explained earlier, we represent the grid (x, y) values as vertex data, while storing the z elevation values as a one-channel floating-point texture. A brute-force approach would be to define a single vertex buffer containing the (x, y) data for the entire ring within a level. To both reduce memory costs and enable view frustum culling, we instead break the ring into smaller 2D footprint pieces, as illustrated in Figure 2-5.
Figure 2-5 Partitioning a Clipmap Ring
Most of the ring is created using 12 blocks (shown in gray) of size mxm, where m = (n + 1)/4. Since the 2D grid is regular, the (x, y) geometries of all blocks within a level are identical up to a translation. Moreover, they are also identical across levels up to a uniform scale. Therefore, we can render the (x, y) geometries of all terrain blocks using a single read-only vertex buffer and index buffer, by letting the vertex shader scale and translate the block (x, y) geometry using a few parameters.
For our default clipmap size n = 255, this canonical block has 64x64 vertices. The index buffer encodes a set of indexed triangle strips whose lengths are chosen for optimal vertex caching. We use 16-bit indices for the index buffer, which allows a maximum block size of m = 256 and therefore a maximum clipmap size of n = 1023.
However, the union of the 12 blocks does not completely cover the ring. We fill the small remaining gaps using a few additional 2D footprints, as explained next. Note that in practice, these additional regions have a very small area compared to the regular mxm blocks, as revealed in Figure 2-6. First, there is a gap of (n -1) - ((m -1) x 4) = 2 quads at the middle of each ring side. We patch these gaps using four mx3 fix-up regions (shown in green in Figures 2-5 and 2-6). We encode these regions using one vertex and index buffer, and we reuse these buffers across all levels. Second, there is a gap of one quad on two sides of the interior ring perimeter, to accommodate the off-center finer level. This L-shaped strip (shown in blue) can lie at any of four possible locations (top-left, top-right, bottom-left, bottom-right), depending on the relative position of the fine level inside the coarse level. We define four vertex and one index buffer for this interior trim, and we reuse these across all levels.
Figure 2-6 Visualizing the Nested Grid Structure
Also, we render a string of degenerate triangles (shown in orange) on the outer perimeter. These zero-area triangles are necessary to avoid mesh T-junctions. Finally, for the finest level, we fill the ring interior with four additional blocks and one more L-shaped region.
Because the footprint (x, y) coordinates are local, these do not require 32-bit precision, so we pack them into a D3DDECLTYPE_SHORT2, requiring only 4 bytes per vertex. In the future, it may be possible to compute the (x, y) coordinates within the mxm block from the vertex index i itself as (fmod(i, m), floor(i/m)).
View frustum culling is done at the block level on the CPU. Each block is extruded by [z min, z max] and intersected with the view frustum in 3D. It is rendered only if this intersection is nonempty. Depending on the view direction, the rendering load is reduced by a factor of 2 to 3 for a 90-degree field of view, as shown in Figure 2-7.
Figure 2-7 View Frustum Culling
For each level, we make up to 14 DrawPrimitive (DP) calls: 12 for the blocks, 1 for the interior trim, and 1 for the remaining triangles. With view frustum culling, on average only 4 of the 12 blocks are rendered each frame. The finest level requires 5 more calls to fill the interior hole. Thus, overall we make 6L + 5 (71 for L = 11) DP calls per frame on average. This total number could be further reduced to 3L + 2 (35) by instancing all blocks rendered in each level using a single DP call.
The same vertex shader is applied to the rendering of all 2D footprints described previously. First, given the footprint (x, y) coordinates, the shader computes the world (x, y) coordinates using a simple scaling and translation. Next, it reads the z value from the elevation map stored in the vertex texture. No texture filtering is necessary because the vertices correspond one-to-one with the texture samples.
For smooth transitions, the vertex shader blends the geometry near the outer boundary of each level with that of the next-coarser level. It computes the blending parameter a based on the (x, y) location relative to the continuous viewer position (v x , v y ). Specifically, a = max(a x , a y ), with

and similarly for a y .
Here, all coordinates are expressed in the [0 . . . n-1] range of the clipmap grid, and w is the transition width (chosen to be n/10). The desired property is that a evaluates to 0 except in the transition region, where it ramps up linearly to reach 1 at the outer perimeter. Figure 2-3a (on page 29) shows the evaluation of the parameter a within the transition regions in purple.
For geometric blending, we linearly interpolate between the current (fine-level) elevation z f and the elevation z c at the same (x, y) location in the next-coarser level:
|
z' = (1 - a) zf + azc |
In the general case, the sample location lies on an edge of the coarser grid, and z c is obtained by averaging the two coarse samples on the edge endpoints. We could perform this computation at runtime, but it would require a total of three vertex-texture lookups (one for z f and two for z c = (z c 1 + z c 2)/2) and vertex-texture reads are currently expensive (Gerasimov et al. 2004).
Instead, we compute z c as part of the clipmap update, and pack both z f and z c into the same 1-channel floating-point texture. We achieve this by storing z f into the integer part of the float, and the difference z d = z c - z f (scaled) into the fractional part. The packing is done in the pixel shader of the upsampling stage (see Section 2.4.1).
Listing 2-1 is the High-Level Shader Language (HLSL) vertex program for clipmap rendering.
struct OUTPUT { vector pos : POSITION; float2 uv : TEXCOORD0; // coordinates for normal-map lookup float z : TEXCOORD1; // coordinates for elevation-map lookup float alpha : TEXCOORD2; // transition blend on normal map }; uniform float4 ScaleFactor, FineBlockOrig; uniform float2 ViewerPos, AlphaOffset, OneOverWidth; uniform float ZScaleFactor, ZTexScaleFactor; uniform matrix WorldViewProjMatrix; // Vertex shader for rendering the geometry clipmap OUTPUT RenderVS(float2 gridPos: TEXCOORD0) { OUTPUT output; // convert from grid xy to world xy coordinates // ScaleFactor.xy: grid spacing of current level // ScaleFactor.zw: origin of current block within world float2 worldPos = gridPos * ScaleFactor.xy + ScaleFactor.zw; // compute coordinates for vertex texture // FineBlockOrig.xy: 1/(w, h) of texture // FineBlockOrig.zw: origin of block in texture float2 uv = gridPos * FineBlockOrig.xy + FineBlockOrig.zw; // sample the vertex texture float zf_zd = tex2Dlod(ElevationSampler, float4(uv, 0, 1)); // unpack to obtain zf and zd = (zc - zf) // zf is elevation value in current (fine) level // zc is elevation value in coarser level float zf = floor(zf_zd); float zd = frac(zf_zd) * 512 - 256; // (zd = zc - zf) // compute alpha (transition parameter) and blend elevation float2 alpha = clamp((abs(worldPos - ViewerPos) – AlphaOffset) * OneOverWidth, 0, 1); alpha.x = max(alpha.x, alpha.y); float z = zf + alpha.x * zd; z = z * ZScaleFactor; output.pos = mul(float4(worldPos.x, worldPos.y, z, 1), WorldViewProjMatrix); output.uv = uv; output.z = z * ZTexScaleFactor; output.alpha = alpha.x; return output; }
The pixel shader accesses the normal map and shades the surface. We let the normal map have twice the resolution of the geometry, because one normal per vertex is too blurry. The normal map is incrementally computed from the geometry whenever the clipmap is updated (see Section 2.4.3).
For smooth shading transitions, the shader looks up the normals for the current level and the next-coarser level and blends them using the a value computed in the vertex shader. Normally this would require two texture lookups. Instead, we pack the two normals as (N x , N y , N cx , N cy ) in a single four-channel, 8-bit-per-channel texture, implicitly assuming N z = 1 and N cz = 1. We must therefore renormalize after unpacking.
Shading color is obtained using a lookup into a z-based 1D texture map.
Listing 2-2 shows the HLSL pixel program for clipmap rendering.
// Parameters uv, alpha, and z are interpolated from vertex shader. // Two texture samplers have min and mag filters set to linear: // NormalMapSampler: 2D texture containing normal map, // ZBasedColorSampler: 1D texture containing elevation-based color uniform float3 LightDirection; // Pixel shader for rendering the geometry clipmap float4 RenderPS(float2 uv : TEXCOORD0, float z : TEXCOORD1, float alpha : TEXCOORD2) : COLOR { float4 normal_fc = tex2D(NormalMapSampler, uv); // normal_fc.xy contains normal at current (fine) level // normal_fc.zw contains normal at coarser level // blend normals using alpha computed in vertex shader float3 normal = float3((1 - alpha) * normal_fc.xy + alpha * (normal_fc.zw), 1); // unpack coordinates from [0, 1] to [-1, +1] range, and renormalize normal = normalize(normal * 2 - 1); // compute simple diffuse lighting float s = clamp(dot(normal, LightDirection), 0, 1); // assign terrain color based on its elevation return s * tex1D(ZBasedColorSampler, z); }
As the viewer moves through the environment, each clipmap window translates within its pyramid level so as to remain centered about the viewer; thus, the window must be updated accordingly. Because viewer motion is coherent, generally only a small L-shaped region of the window needs to be incrementally processed each frame. Also, the relative motion of the viewer within the windows decreases exponentially at coarser levels, so coarse levels seldom require updating.
We update the active clipmap levels in coarse-to-fine order. Recall that each clipmap level stores two textures: a single-channel floating-point elevation map, and a four-channel, 8-bit-per-channel normal map. During the update step, we modify regions of these textures by rendering to them using a pixel shader.
To avoid having to translate the existing data, we perform all accesses toroidally as illustrated in Figure 2-8. That is, we use wraparound addressing to position the clipmap window within the texture image. Under this toroidal access, the L-shaped update region usually becomes a + shape, shown in red in Figure 2-8. We write to this region by rendering two quads. Note that three or four quads may be needed if the update region wraps across the texture boundary.
Figure 2-8 Processing a Toroidal Update
To facilitate terrain compression and synthesis, elevation data in the update region is first predicted by upsampling the coarser level, and a residual signal is then added to provide the detail. The next sections describe the three basic steps of the update process: upsampling the coarser-level elevation data, adding the residuals, and computing the normal map.
The finer-level geometry is predicted from the coarser one using an interpolatory subdivision scheme. We use the tensor-product version of the well-known four-point subdivision curve interpolant, which has mask weights (
)(Kobbelt 1996). This upsampling filter has the desirable property of being C1 smooth.
Depending on the position of the sample in the grid (even-even, even-odd, odd-even, odd-odd), a different mask is used. For an even-even pixel, only 1 texture lookup is needed because the sample is simply interpolated; for an odd-odd pixel, 4x4 texture lookups are needed; the remaining two cases require 4 texture lookups. On the CPU, the mask to be used can be easily deduced using an "if" statement. However, branch statements are expensive in the pixel shader. Hence, we always do 16 texture lookups but apply a different mask based on a lookup in a 2x2 texture.
The HLSL code for the upsampling shader appears in Listing 2-3. Code for the missing functions Upsample and ZCoarser is included on the accompanying CD.
// residualSampler: 2D texture containing residuals, which can be // either decompressed data or synthesized noise // p_uv: coordinates of the grid sample uniform float2 Scale; // Pixel shader for updating the elevation map float4 UpsamplePS(float2 p_uv : TEXCOORD0) : COLOR { float2 uv = floor(p_uv); // the Upsample() function samples the coarser elevation map // using a linear interpolatory filter with 4x4 taps // (depending on the even/odd configuration of location uv, // it applies 1 of 4 possible masks) float z_predicted = Upsample(uv); // details omitted here // add the residual to get the actual elevation float residual = tex2D(residualSampler, p_uv * Scale); float zf = z_predicted + residual; // zf should always be an integer, since it gets packed // into the integer component of the floating-point texture zf = floor(zf); // compute zc by linearly interpolating the vertices of the // coarse-grid edge on which the sample p_uv lies float zc = ZCoarser(uv); // details omitted here float zd = zc - zf; // pack the signed difference zd into the fractional component float zf_zd = zf + (zd + 256)/512; return float4(zf_zd, 0, 0, 0); }
The residuals added to the upsampled coarser data can come from either decompression or synthesis.
In the compressed representation, the residual data is encoded using image compression. We use a particular lossy image coder called PTC because it supports efficient region-of-interest decoding (Malvar 2000). The CPU performs the decompression and stores the result into a residual texture image.
Alternatively, we synthesize fractal detail by letting the residuals be uncorrelated Gaussian noise (Fournier et al. 1982). This synthesis is made extremely fast on the GPU by reading a small precomputed 2D texture containing Gaussian noise. We enable texture wrapping to extend the Gaussian texture infinitely, and apply a small magnification to the texture coordinates to break the regular periodicity. The superposition of noise across the many resolution levels gives rise to a terrain without any visible repetitive patterns, as Figure 2-9 shows.
Figure 2-9 On-the-Fly Terrain Synthesis
The shader that updates the normal map for a level takes as input the elevation map at the same level. It computes the current normal as the cross product of two grid-aligned tangent vectors. Additionally, it does a texture lookup to gather the normal from the coarser level. Finally, it packs (N x , N y , N cx , N cy ) into the four-channel texture. The code is included in the accompanying CD.
Our main data set is a 216,000x93,600 height map of the conterminous United States at 30-meter spacing and 1.0-meter vertical resolution. It compresses by a factor of more than 100 and therefore fits in only 355 MB of system memory. We render this terrain into a 1024x768 window on a PC with a 2.4 GHz Pentium 4, 1 GB system memory, and an NVIDIA GeForce 6800 GT.
Rendering rate: For L = 11 levels of size n = 255, we obtain 130 frames/sec with view frustum culling, at a rendering rate of 60 million triangles/sec. The use of a vertex-texture lookup is a bottleneck, since removing the lookup increases rendering rate to 185 frames/sec. (For n = 127, the rate with vertex textures is 298 frames/sec.)
Update rate: Table 2-2 shows the processing times for the various steps of the update process, for the extreme case of updating a whole level at once. Decompression on the CPU is clearly the bottleneck of the update process.
|
These are worst-case numbers because, in general, during smooth viewer motions, only small fractions of a full level need be updated per frame. |
||
|
Previous Implementation [*] |
GPU-Based Implementation |
|
|
Upsampling |
3 ms |
1.0 ms |
|
Decompression |
8 ms |
8 ms [**] |
|
Synthesis |
3 ms |
~0 ms |
|
Normal-Map Computation |
11 ms |
0.6 ms |
Final frame rate: For decompressed terrains, the system maintains a rate of around 87 frames/sec during viewer motion, and for synthesized terrains, the frame rate is about 120 frames/sec.
We have presented a GPU implementation of the geometry clipmap framework. The representation of geometry using regular grids allows nearly all computation to proceed on the graphics card, thereby offloading work from the CPU. The system supports interactive flight over a 20-billion-sample grid of the U.S. data set stored in just 355 MB of memory at around 90 frames/sec.
Geometry clipmaps use vertex textures in a very special, limited way: the texels are accessed essentially in raster-scan order, and in one-to-one correspondence with vertices. Thus, we can hope for future mechanisms that would allow increased rendering efficiency.
It should be possible to directly compute normals in the pixel shader by accessing four filtered samples of the elevation map. At present, vertex texture lookups require that the elevation maps be stored in 32-bit floating-point images, which do not support efficient bilinear filtering.
Fractal terrain synthesis is so fast on the GPU that we can consider regenerating the clipmap levels during every frame, thereby saving video memory. We allocate two textures, T 1 and T 2, and toggle between them as follows. Let T 1 initially contain the coarse geometry used to seed the synthesis process. First, we use T 1 as a source texture in the update pixel shader to upsample and synthesize the next-finer level into the destination texture T 2. Second, we use T 1 as a vertex texture to render its clipmap ring. Then, we swap the roles of the two textures and iterate the process until all levels are synthesized and rendered. Initial experiments are extremely promising, with frame rates of about 59 frames/sec using L = 9 levels.
Fournier, Alain, Don Fussell, and Loren Carpenter. 1982. "Computer Rendering of Stochastic Models." Communications of the ACM 25(6), June 1982, pp. 371–384.
Gerasimov, Philip, Randima Fernando, and Simon Green. 2004. "Shader Model 3.0: Using Vertex Textures." NVIDIA white paper DA-01373-001_v00, June 2004. Available online at http://developer.nvidia.com/object/using_vertex_textures.html.
Kobbelt, Leif. 1996. "Interpolatory Subdivision on Open Quadrilateral Nets with Arbitrary Topology." In Eurographics 1996, pp. 409–420.
Losasso, Frank, and Hugues Hoppe. 2004. "Geometry Clipmaps: Terrain Rendering Using Nested Regular Grids." ACM Transactions on Graphics (Proceedings of SIGGRAPH 2004) 23(3), pp. 769–776.
Malvar, Henrique. 2000. "Fast Progressive Image Coding without Wavelets." In Data Compression Conference (DCC '00), pp. 243–252.
Tanner, Christopher, Christopher Migdal, and Michael Jones. 1998. "The Clipmap: A Virtual Mipmap." In Computer Graphics (Proceedings of SIGGRAPH 98), pp. 151–158.
Francesco Carucci
Lionhead Studios
A great way to enrich the user experience in an interactive application is to present a credible world, full of small, interesting features and objects. From countless blades of grass, to trees, to generic clutter: it all improves the final perception and helps maintain the user's "suspension of disbelief." If the user can believe the world he is immersed in, he is more likely to be emotionally connected to that world—the Holy Grail of game development.
From a rendering point of view, achieving this richness translates into rendering many small objects, often each one similar to the next, with only small differences such as color, position, and orientation. For example, every tree in a big forest might be geometrically very similar, but all may be different in terms of their color or height. The user perceives a forest made of many unique trees and believes the world to be credible, enriching his or her game experience.
However, rendering a large number of small objects, each made from a few polygons, imposes a big strain on today's GPUs and rendering libraries. Graphics APIs such as Direct3D and OpenGL are not designed to efficiently render a small number of polygons thousands of times per frame (Wloka 2003).
This chapter addresses the problem of rendering many unique instances of the same geometry in Direct3D. Figure 3-1 shows an example from Black & White 2, a game in development at Lionhead Studios.
Figure 3-1 Geometry Instancing in Black & White 2
Submitting triangles to the GPU for rendering in Direct3D is a relatively slow operation. Wloka 2003 shows that a 1 GHz CPU can render only around 10,000 to 40,000 batches per second in Direct3D. On a more modern CPU, we can expect this number to be between 30,000 and 120,000 batches per second (around 1,000 to 4,000 batches per frame at 30 frames/sec). That's not much! It means that if we want to render a forest of trees, and we submit one tree per batch, we cannot render more than 4,000 trees, regardless of how many polygons are in a tree—with no CPU time left for the rest of the game. This situation is clearly not ideal. We would like to be able to minimize the number of state and texture changes in our application and render the same triangles multiple times within the same batch in a single call to Direct3D. Thus, we would minimize CPU time spent in submitting batches and free up that time for other systems, such as physics, artificial intelligence, and game logic.
We first define concepts behind geometry instancing, to make clear the kinds of objects that are part of the problem.
A geometry packet is a description of a packet of geometry to be instanced, a collection of vertices and indices. A geometry packet can be described in terms of vertices—with position, texture coordinates, normal, possibly tangent space and bones information for skinning, and per-vertex colors—and in terms of indices into the vertex stream. This kind of description directly maps to the most efficient way to submit geometry to the GPU.
A geometry packet is an abstract description of a piece of geometry, where the geometric entities are expressed in model space without any explicit reference to the context in which they will be rendered.
Here's a possible description in code of a geometry packet; it includes both the geometric information for the object as well as its bounding sphere:
struct GeometryPacket { Primitive mPrimType void* mVertices; unsigned int mVertexStride; unsigned short* mIndices; unsigned int mVertexCount; unsigned int mIndexCount; D3DXVECTOR3 mSphereCentre; float mSphereRadius; };
Typical attributes per instance are the model-to-world transformation matrix, the instance color, and an animation player providing the bones used to skin the geometry packet.
struct InstanceAttributes { D3DXMATRIX mModelMatrix; D3DCOLOR mInstanceColor; AnimationPlayer* mAnimationPlayer; unsigned int mLOD; };
A geometry instance is a geometry packet with the attributes specific to the instance. It directly connects a geometry packet and the instance attributes packet with which it's going to be rendered, giving an almost complete description of the entity ready to be submitted to the GPU.
Here's how a geometry instance structure might look in code:
struct GeometryInstance
{
GeometryPacket* mGeometryPacket;
InstanceAttributes mInstanceAttributes;
};
The render context is the current state of the GPU in terms of render states (such as alpha blending and testing states, active render target, and more). The texture context tracks the currently active textures. Render context and texture context are usually modeled with classes, as shown in Listing 3-1.
A batch is a collection of geometry instances and the render states and texture context to be used to render the collection. It always directly maps to a single DrawIndexedPrimitive() call, to simplify the design of the class.
Listing 3-2 is an abstract interface for a geometry batch class.
class RenderContext { public: // Begin the render context and make its render state active // void Begin(void); // End the render context and restore previous render state, // if necessary // void End(void); private: // Any description of the current render state and pixel // and vertex shaders. // D3DX Effect framework is particularly useful. // ID3DXEffect* mEffect; // Application-specific render states // . . . }; class TextureContext { public: // Set current textures to the appropriate texture stages // void Apply(void) const; private: Texture mDiffuseMap; Texture mLightMap; // . . . };
class GeometryBatch { public: // Remove all instances from the geometry batch // virtual void ClearInstances(void); // Add an instance to the collection and return its ID. // Return -1 if it can't accept more instances. // virtual int AddInstance(GeometryInstance* instance); // Commit all instances, to be called once before the // render loop begins and after every change to the // instances collection // virtual unsigned int Commit(void) = 0; // Update the geometry batch, eventually prepare GPU-specific // data ready to be submitted to the driver, fill vertex and // index buffers as necessary, to be called once per frame // virtual void Update(void) = 0; // Submit the batch to the driver, typically implemented // with a call to DrawIndexedPrimitive // virtual void Render(void) const = 0; private: GeometryInstancesCollection mInstances; // . . . };
The engine's renderer can see geometry instancing only in terms of the abstract interface of a GeometryBatch, which hides the specifics of the instancing implementation and provides services for managing instances, updating, and rendering a batch. This way, the engine can concentrate on sorting batches to minimize render state and texture changes, while the GeometryBatch takes care of the actual implementation and communication protocol with Direct3D.
Listing 3-3 shows a simple implementation of a render loop in pseudocode, sorting GeometryBatches first by render context and then by texture context, thus minimizing the number of render-state changes.
// Update phase Foreach GeometryBatch in ActiveBatchesList GeometryBatch.Update(); // Render phase Foreach RenderContext Begin RenderContext.BeginRendering(); RenderContext.CommitStates(); Foreach TextureContext Begin TextureContext.Apply(); Foreach GeometryBatch in the texture context GeometryBatch.Render(); End End
The update and render phases can be kept separate in case we want to update all batches once and render several times: this idea is extremely useful when implementing shadow mapping or water reflection and refraction, for example.
In this chapter, we discuss four implementations of GeometryBatch and analyze their performance characteristics, with a particular interest in the memory footprint and flexibility of each technique.
Here's a brief summary:
In static batching, we want to transform all instances once and copy them to a static vertex buffer. The most noticeable advantages of static batching are its rendering efficiency and its compatibility with any GPU on the market, regardless of age.
To implement static batching, we first create a vertex buffer object (with its associated index buffer) to fill with the transformed geometry. This object must be big enough to accept the maximum number of instances we decide to handle. Because we fill the buffer once and never touch it again, we can create it with the D3USAGE_WRITEONLY flag, giving a useful hint to the driver to put the buffer in the fastest memory available:
HRESULT res; res = lpDevice->CreateVertexBuffer( MAX_STATIC_BUFFER_SIZE, D3DUSAGE_WRITEONLY, 0, D3DPOOL_MANAGED, &mStaticVertexStream, 0); ENGINE_ASSERT(SUCCEEDED(res));
Whether to choose D3DPOOL_MANAGED or D3DPOOL_DEFAULT to create the buffer depends on the particular application and especially on the memory management strategy of the engine.
The next step is to implement the Commit() method to actually fill the vertex and index buffers with the transformed geometry that needs to be rendered.
Here is the Commit() method in pseudocode:
Foreach GeometryInstance in Instances
Begin
Transform geometry in mGeometryPacket to world space
with instance mModelMatrix
Apply other instance attributes (like instance color)
Copy transformed geometry to the Vertex Buffer
Copy indices (with the right offset) to the Index Buffer
Advance current pointer to the Vertex Buffer
Advance current pointer to the Index Buffer
End
From now on, it's just a matter of submitting the instances we have already prepared with a simple call to DrawIndexedPrimitive(). The actual implementation of the Update() and Render() methods is trivial.
Static batching is the fastest way to render many instances, and it supports instances of different geometry packets inside the same geometry batch, but it has a few serious limitations:
These limitations are addressed with the next technique, trading off rendering speed for additional flexibility.
Dynamic batching overcomes the limitations of static batching at the cost of reducing rendering efficiency. A major advantage of dynamic batching is, like static batching, that it can be implemented on any GPU without support for advanced programmable pipelines.
The first step is to create a vertex buffer (and its associated index buffer) as dynamic by using the D3DUSAGE_DYNAMIC and D3DPOOL_DEFAULT flags. These flags ensure that the buffer is placed in the best possible memory location, given that we intend to dynamically update its contents.
HRESULT res; res = lpDevice->CreateVertexBuffer( MAX_DYNAMIC_BUFFER_SIZE, D3DUSAGE_DYNAMIC | D3DUSAGE_WRITEONLY, 0, D3DPOOL_DEFAULT, &mDynamicVertexStream, 0);
Choosing the right value for MAX_DYNAMIC_BUFFER_SIZE is vital in this implementation. Two choices dictating the actual value are possible:
The first approach ensures a proper separation between updating and rendering a batch. Updating a batch means streaming all instances into the dynamic buffer; rendering just submits the geometry with a call to DrawIndexedPrimitive(). But this approach typically requires a large amount of graphics memory (that is, either video or AGP memory) and is less reliable in the worst case if we can't ensure that the buffer is always big enough during the life of the application.
The second solution requires interleaving between streaming geometry and rendering it: when the dynamic buffer is full, its geometry is submitted for rendering and the content of the buffer is discarded, ready for more instances to be streamed. For optimal performance, it is important to use the proper protocol, that is, locking the dynamic buffer with D3DLOCK_DISCARD at the beginning of a new batch of instances and with D3DLOCK_WRITEONLY for each new instance to be streamed. The drawback of this solution is that it requires locking the buffer and streaming geometry again each time the batch needs to be rendered, for example, when implementing shadow mapping.
The choice depends on the particular application and its requirements. In this chapter, we choose the first solution, for its simplicity and clarity, but we add a slight complication: dynamic batching naturally supports skinning, and we address this in our implementation.
The Update() method is very similar to the Commit() method we have already described in Section 3.3.1, but it is executed every frame. Here it is in pseudocode:
Foreach GeometryInstance in Instances
Begin
Transform geometry in mGeometryPacket to world space with
instance mModelMatrix
If instance needs skinning, request a set of bones from
mAnimationPlayer and skin geometry
Apply other instance attributes (like instance color)
Copy transformed geometry to the Vertex Buffer
Copy indices (with the right offset) to the Index Buffer
Advance current pointer to the Vertex Buffer
Advance current pointer to the Index Buffer
End
In this case, the Render() method is a trivial call to DrawIndexedPrimitive().
In vertex constants instancing, we take advantage of vertex constants to store instancing attributes. Vertex constants batching is very fast in terms of rendering performance, and it supports instances moving from frame to frame, but it pays in terms of flexibility.
The main limitations are these:
The first step is to prepare a static vertex buffer (with its associated index buffer) to store multiple copies of the same geometry packets, stored in model space, one for each instance in the batch.
The vertex buffer layout is illustrated in Figure 3-2.
Figure 3-2 Vertex Buffer Layout
The source vertex format must be updated to add an integer index to each vertex, which is constant per instance, indicating which instance the geometry packet belongs to. This is similar to palette skinning, in which each vertex contains an index to the bone (or multiple bones) affecting it.
The updated vertex format is this:
struct InstanceVertex { D3DXVECTOR3 mPosition; // Other vertex properties, such as normal and texture // coordinates WORD mInstanceIndex[4]; // Direct3D requires SHORT4 };
The Commit() method prepares the vertex buffer with the proper layout after all instances have been added to the geometry batch.
The next step is to upload attributes for each instance being rendered. Suppose we have only the model matrix, describing instance position and orientation, and the instance color.
When using a DirectX 9-class GPU, we can use a maximum of 256 vertex constants; we can reserve 200 for instancing attributes. In our example, each instance requires 4 constants for the model matrix and 1 constant for instance color, for a total of 5 constants per instance, leading to a maximum of 40 instances per batch.
Listing 3-4 shows the Update() method. The actual instancing is done in the vertex shader, shown in Listing 3-5.
D3DXVECTOR4 instancesData[MAX_NUMBER_OF_CONSTANTS]; unsigned int count = 0; for (unsigned int i = 0; i < GetInstancesCount(); ++i) { // write model matrix instancesData[count++] = *(D3DXVECTOR4*) &mInstances[i].mModelMatrix.m11; instancesData[count++] = *(D3DXVECTOR4*) &mInstances[i].mModelMatrix.m21; instancesData[count++] = *(D3DXVECTOR4*) &mInstances[i].mModelMatrix.m31; instancesData[count++] = *(D3DXVECTOR4*) &mInstances[i].mModelMatrix.m41; // write instance color instanceData[count++] = ConvertColorToVec4(mInstances[i].mColor)); } lpDevice->SetVertexConstants( INSTANCES_DATA_FIRST_CONSTANT, instancesData, count);
// vertex input declaration struct vsInput { float4 position : POSITION; float3 normal : NORMAL; // other vertex data int4 instance_index : BLENDINDICES; }; vsOutput VertexConstantsInstancingVS( in vsInput input) { // get the instance index; the index is premultiplied // by 5 to take account of the number of constants // used by each instance int instanceIndex = ((int[4]) (input.instance_index))[0]; // access each row of the instance model matrix float4 m0 = InstanceData[instanceIndex + 0]; float4 m1 = InstanceData[instanceIndex + 1]; float4 m2 = InstanceData[instanceIndex + 2]; float4 m3 = InstanceData[instanceIndex + 3]; // construct the model matrix float4x4 modelMatrix = { m0, m1, m2, m3 }; // get the instance color float4 instanceColor = InstanceData[instanceIndex + 4]; // transform input position and normal to world space with // the instance model matrix float4 worldPosition = mul(input.position, modelMatrix); float3 worldNormal = mul(input.normal, modelMatrix); // output position, normal, and color output.position = mul(worldPosition, ViewProjectionMatrix); output.normal = mul(worldNormal, ViewProjectionMatrix); output.color = instanceColor; // output other vertex data }
The Render() sets the view projection matrix and submits all instances with a DrawIndexedPrimitive() call.
A possible optimization in production code is to store the rotational part of the model matrix as a quaternion, saving two vertex constants and increasing the maximum number of instances to around 70. A uniform scaling value can be stored in the w component of the translation vector. The model matrix is then reconstructed in the vertex shader, increasing its complexity and execution time.
The last technique we present in this chapter is batching with the Geometry Instancing API exposed by DirectX 9 and fully implemented in hardware on GeForce 6 Series GPUs. As more hardware starts supporting the Geometry Instancing API, this technique will become more interesting, because it elegantly solves limitations imposed by vertex constants instancing, has a very limited memory footprint, and requires little intervention by the CPU. The only drawback of this technique is that it can only handle instances of the same geometry packet.
DirectX 9 provides the following call to access the Geometry Instancing API:
HRESULT SetStreamSourceFreq( UINT StreamNumber, UINT FrequencyParameter);
StreamNumber is the index of the source stream and FrequencyParameter indicates the number of instances each vertex contains.
We first create two vertex buffers: a static one to store geometry for the single geometry packet we want to instance multiple times, and a dynamic one to store instance data. The two vertex streams are shown in Figure 3-3.
Figure 3-3 Vertex Streams for Instancing with the Geometry Instancing API
The Commit() method ensures that all instances are using only one geometry packet and copies its geometry to the static buffer.
The Update() method simply copies all instance attributes into the second stream. Even though this method looks very similar to the Update() method of a dynamic batch, CPU intervention and graphics bus (AGP or PCI Express) bandwidth are minimal. Moreover, we can choose to allocate a vertex buffer big enough to accept attributes for all instances without worrying about graphics memory consumption, because each instance attribute uses only a fraction of the memory consumed by a whole geometry packet.
The Render() method sets up two streams with the correct stream frequency and issues a DrawIndexedPrimitive() call to render all instances in one batch, as shown in Listing 3-6.
The GPU takes care of virtually duplicating vertices of the first stream and packing them together with the second stream. The vertex shader sees as input a vertex with its position in model space and an additional instance attribute giving the model matrix to transform it to world space.
unsigned int instancesCount = GetInstancesCount(); // set up stream source frequency for the first stream // to render instancesCount instances // D3DSTREAMSOURCE_INDEXEDDATA tells Direct3D we'll use // indexed geometry for instancing lpDevice->SetStreamSourceFreq( 0, D3DSTREAMSOURCE_INDEXEDDATA | instancesCount); // set up first stream source with the vertex buffer // containing geometry for the geometry packet lpDevice->SetStreamSource( 0, mGeometryInstancingVB[0], 0, mGeometryPacketDecl); // set up stream source frequency for the second stream; // each set of instance attributes describes one instance // to be rendered lpDevice->SetStreamSourceFreq( 1, D3DSTREAMSOURCE_INSTANCEDATA | 1); // set up second stream source with the vertex buffer // containing all instances' attributes pd3dDevice->SetStreamSource( 1, mGeometryInstancingVB[0], 0, mInstancesDataVertexDecl);
The vertex shader code appears in Listing 3-7.
// vertex input declaration struct vsInput { // stream 0 float4 position : POSITION; float3 normal : NORMAL; // stream 1 float4 model_matrix0 : TEXCOORD0; float4 model_matrix1 : TEXCOORD1; float4 model_matrix2 : TEXCOORD2; float4 model_matrix3 : TEXCOORD3; float4 instance_color : D3DCOLOR; }; vsOutput GeometryInstancingVS( in vsInput input) { // construct the model matrix float4x4 modelMatrix = { input.model_matrix0, input.model_matrix1, input.model_matrix2, input.model_matrix3 }; // transform input position and normal to world space // with the instance model matrix float4 worldPosition = mul(input.position, modelMatrix); float3 worldNormal = mul(input.normal, modelMatrix); // output position, normal, and color output.position = mul(worldPosition, ViewProjectionMatrix); output.normal = mul(worldNormal, ViewProjectionMatrix); output.color = input.instance_color; // output other vertex data }
With minimal CPU overhead and memory footprint, this technique can efficiently render many copies of the same geometry and is therefore the ideal solution in many scenarios. The only drawback is that it requires special hardware capabilities and doesn't easily support skinning.
To implement skinning, it would be possible to store all bones for all instances into a texture and fetch the right bones of the right instance with the texture fetch available in Shader Model 3.0 (which is required by the Geometry Instancing API). If this solution seems attractive, the performance impact of using texture fetches with this kind of access pattern is uncertain and should be tested.
This chapter defined the concepts behind geometry instancing and described four different techniques to achieve the goal of efficiently rendering the same geometry multiple times. Each technique has pros and cons, and no single technique clearly represents the ideal choice for every scenario. The best choice depends mainly on the application and on the type of objects being rendered.
Some scenarios and proposed solutions:
Often the same application needs to rely on two or more techniques. Figures 3-4 and 3-5 show two examples. In such cases, hiding the implementation behind an abstract geometry batch concept makes the engine more modular and easier to maintain. As a result, the costs of implementing all the geometry-instancing techniques are greatly reduced for the entire application.
Figure 3-4 Combining Static Batching and Geometry Instancing in a Real Scene
Figure 3-5 Combining GPU and CPU Animation in a Real Scene
Cebenoyan, Cem. 2004. "Graphics Pipeline Performance." In GPU Gems, edited by Randima Fernando, pp. 473–486. Addison-Wesley. Available online at http://developer.nvidia.com/object/GPU_Gems_Samples.html
O'Rorke, John. 2004. "Managing Visibility for Per-Pixel Lighting." In GPU Gems, edited by Randima Fernando, pp. 245–257. Addison-Wesley.
Wloka, Matthias. 2003. "Batch, Batch, Batch: What Does It Really Mean?" Presentation at Game Developers Conference 2003. http://developer.nvidia.com/docs/IO/8230/BatchBatchBatch.pdf
Jon Olick
2015
As graphical requirements evolve to include more polygons, more clutter, and better lighting to create more realistic-looking scenes, we are forced to produce more art to meet those requirements. One technique that helps to conserve development time is an instancing approach. For example, a single chair is modeled and placed all around the game world (see Figure 4-1). Instancing helps immensely to shrink development time. It is, however, all too common for modern games to become limited by the number of draw calls (Wloka 2003). If each instance used in a scene leads to one more batch, it may not be possible to realize the benefits of instancing: batches become the bottleneck. Segment buffering is a technique that collects multiple instances that are close to each other in the scene and merges them into "über-instances," thus reducing the number of batches and providing a simple and elegant solution to the batch bottleneck.
Figure 4-1 A Scene with Many Instances of the Same Object
Figure 4-1 shows a scene containing many static objects of the same material. Most GPUs cannot efficiently render scenes like this for more than a thousand instances; each instance potentially requires a render-state change, such as transform changes, light map texture changes, or vertex stream changes. These render-state changes cause driver and thus CPU overhead, ultimately resulting in poor performance. Although we could work around this problem on the content side by merging multiple models to create new models to instance around the world (thereby reducing the total number of instances), such an approach would require making a fair amount of custom art and could cause the maps to look repetitious.
Segment buffering automates the merging of similar instances while maintaining most of the benefits of rendering separate instances. The primary benefits of segment buffering are thus a nonrepetitious look and the ability to not draw some of the original instances, as if they were removed from the visibility set.
Let's assume that we have a list of instances of a specific model. The next step is to spatially organize the instances such that in a 1D array, objects that are near each other spatially are also near each other in the array. With this new, organized list, the final step is to stitch all of the individual instances together as if they were one big instance, recording which parts in the new vertex/index buffers belong to the individual instances. We call each record in this list a segment. To render objects inside the segment buffer, simply generate a list of segments to be rendered while merging sequential segments in the list. This procedure thus generates a new list of optimally merged sections of the vertex/index buffer to render. Segments are rendered as static batches (see Chapter 3 of this book, "Inside Geometry Instancing").
The key to making segment buffering work and work well is in the spatial organization of the segments inside the segment buffer. For this, we can use any spatial organization structure. K-d trees (Samet 1989, 1990), where k is 3, perform well for 2015's game Men of Valor: Vietnam. Octrees (Suter 1999) or any other spatial data structures would probably work just as well.
The tree is generated from a list of points that represent the locations of the instances for which we are creating a segment buffer. Once we have this tree, we do a depth-first traversal, always branching left first at every interior tree node. When we hit a leaf, we add all the points contained in the leaf to the new ordered list. When the traversal is complete, we have a spatially ordered list of instances.
The segment buffers and the spatially ordered list of instances need to be generated only when a new list of instances is introduced. In the majority of cases, then, the segment buffers and ordered list of instances need to be created only at load time.
Given a spatially ordered list of instances, we iterate through the list and construct a single vertex/index buffer containing all of the instances. We make sure to transform each instance's vertex buffer components into world space before the instance is written into the big vertex buffer. Also, we need to record what parts of the big vertex/index buffer belong to each instance, for later reference during rendering. Again, this big vertex/index buffer needs to be created only in the event a new list of instances is introduced, typically at load time.
When rendering, instead of immediately drawing an instance that is segmented, we generate a list of all these instances that need to be rendered. Instances outside the view frustum, for example, need not be rendered and should be omitted. Then we translate this list of instances into the parts of the big vertex/index buffer we previously generated for segment buffering. While doing this, we can merge segments adjacent to one another in the big vertex/index buffer to achieve an optimal list of the pieces of those buffers to be rendered. The result is the exact same output as if the instance list was rendered individually without segment buffering.
To make segment buffering a viable technique for a game that uses light mapping extensively, we need to implement automatic texture-atlas generation (NVIDIA 2004). This is because we cannot segment instances together that require render-state changes in between the rendering of each instance. In scenes where every instance has its own light map, a texture change is required between the rendering of each instance. Automatic texture-atlas generation is fairly easy to implement in the context of segment buffering because every instance inside the big vertex buffer has a unique section. Thus, the texture coordinates can be directly modified without affecting any other instance. This is in contrast to the traditional method of modifying the texture coordinates inside a vertex or pixel shader.
In this chapter, we have described a technique to significantly reduce the number of batches rendered in a single displayed frame. In doing so, the technique enables us to create a much richer and more realistic-looking environment. In addition to segment buffering, we described enhancements that make it applicable in situations where it normally would not be appropriate.
NVIDIA Corporation. 2004. "Improve Batching Using Texture Atlases." SDK white paper. http://download.nvidia.com/developer/NVTextureSuite/Atlas_Tools/Texture_Atlas_Whitepaper.pdf
Samet, Hanan. 1989. Applications of Spatial Data Structures. Addison-Wesley.
Samet, Hanan. 1990. The Design and Analysis of Spatial Data Structures. Addison-Wesley.
Suter, Jaap. 1999. "Introduction to Octrees." Flipcode Web site. http://www.flipcode.com/tutorials/tut_octrees.shtml
Wloka, Matthias. 2003. "Batch, Batch, Batch: What Does It Really Mean?" Presentation at Game Developers Conference 2003. http://developer.nvidia.com/docs/IO/8230/BatchBatchBatch.pdf
Oliver Hoeller
Piranha Bytes
Kurt Pelzer
Piranha Bytes
One of the most difficult problems in modern real-time graphics applications is the massive amount of data that must be managed. Complex scenes with complex meshes combined with multiple render passes—as are needed for high-quality shadows and additional reflection and refraction effects, for example—are expensive to render. Finding ways to feed the GPU with optimally formatted data is crucial for successfully rendering complex scenes.
This chapter addresses this issue and focuses on a flexible model to resolve the problem of handling the gamut of graphics hardware found in current PCs. A PC game must be able to run and run well on the latest high-end GPU all the way down to last year's "value" GPU—and on everything in between. We introduce a solution that handles the massive amount of data and is careful to transmit only the currently needed vertex components for each pass. This system, which has been implemented in the Gothic III engine, combines two powerful techniques: several vertex buffers are combined via multistreaming (a feature introduced with Microsoft DirectX 8.0), and each vertex buffer is controlled by an optimized resource manager. We present both the abstract concept as well as its implementation based on DirectX 9.0c.
Each vertex in a mesh has multiple components associated with it (position, normal, tangent, texture coordinates, and so on), but all of these entities aren't always necessary when rendering the object. We want an automated system that uses only the currently needed vertex components. To handle the task, we use vertex buffer streams specialized for certain subtasks. Depending on the current rendering pass, subsets of these streams are combined to assemble the vertices.
We need four basic types of streams, as shown in Figure 5-1. Here are the streams and their subtasks:
Figure 5-1 Four Types of Vertex Streams
Subsets of these four streams combine to handle different tasks, as shown in Figure 5-2:
Render meshes without animation
Possible stream combinations: G or G + T
Render meshes with animation
Possible stream combinations: G + A or G + T + A
Render instanced meshes (optionally with animation)
Possible stream combinations: G + I or G + T + I
(Optional: G + A + I or G + T + A + I)
Render pure z-pass (optionally with or without instancing, with or without animation)
Possible stream combinations: G
(Optional: G + A or G + I or G + A + I)
Figure 5-2 Combining Currently Needed Streams
The next section presents an implementation of this model.
Now we show how to implement the abstract concept discussed in Section 5.1. First we examine some multistreaming example code based on DirectX 9.0c. Next we present the resource manager that handles the mesh resources. Finally, we show how to translate generic mesh data into the appropriate hardware structure.
Microsoft DirectX 8.0 introduced the notion of a stream to bind data to input registers. A stream is a uniform array of component data. Each component consists of one or more elements that represent a single entity, such as position, normal, color, texture coordinate set, and so on. With streams, GPUs are able to perform a direct memory access (DMA) from multiple vertex buffers in parallel. The Direct3D caps bit D3DCAPS9.MaxStreams defines how many streams a GPU supports; modern hardware supports up to 16 streams, while older, DirectX 8.0-compliant GPUs are limited to 8 streams. Because multistreaming has some minor performance implications, it's advisable to minimize the number of active streams. Because our implementation uses only 1 to 4 streams, all GPUs with multistreaming support can be targeted, with minimal loss of performance.
Listing 5-1 shows our multistreaming example code. Using DirectX 9.0 for this task, we need only three simple components:
// // Initializations: // // Here is a set of vertex definitions to support two streams. // Vertex format: // stream 0 -> position + normal + color 0 + color 1 // stream 1 -> 4 texture coordinate pairs struct VTXSTREAM_0 { float fPosX, fPosY, fPosZ; float fNormX, fNormY, fNormZ; DWORD dwColor0, dwColor1; }; struct VTXSTREAM_1 { float fTexU0, fTexV0; float fTexU1, fTexV1; float fTexU2, fTexV2; float fTexU3, fTexV3; }; // Vertex declaration D3DVERTEXELEMENT9 m_VtxDcl[] = { {0, 0, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0}, // stream 0, position {0, 12, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_NORMAL, 0}, // stream 0, normal {0, 24, D3DDECLTYPE_D3DCOLOR, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_COLOR, 0}, // stream 0, color 0 {0, 28, D3DDECLTYPE_D3DCOLOR, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_COLOR, 1}, // stream 0, color 1 {1, 0, D3DDECLUSAGE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 0}, // stream 1, tex coord set 0 {1, 8, D3DDECLUSAGE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 1}, // stream 1, tex coord set 1 {1, 16, D3DDECLUSAGE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 2}, // stream 1, tex coord set 2 {1, 24, D3DDECLUSAGE_FLOAT2, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_TEXCOORD, 3} // stream 1, tex coord set 3 } // Create a vertex declaration object LPDIRECT3DVERTEXDECLARATION9 m_pVtxDeclObject; m_pd3dDevice->CreateVertexDeclaration(m_VtxDcl,&m_pVtxDclObj);
In Listing 5-1, we present a set of vertex definitions and create a vertex declaration object m_pVtxDeclObject that describes the following vertex data layout:
Before calling a draw method, we must set this vertex declaration object together with the vertex buffers bound to the correct device data streams, as shown in Listing 5-2.
// // Each time we want to render primitives: // // Set the vertex declaration object m_pd3dDevice->SetVertexDeclaration(m_pVtxDclObj); // Bind vertex buffers to device data streams m_pd3dDevice->SetStreamSource(0, m_pVtxBuf0, 0, sizeof(VTXSTREAM_0)); m_pd3dDevice->SetStreamSource(1, m_pVtxBuf1, 0, sizeof(VTXSTREAM_1)); // Render .. m_pd3dDevice->DrawIndexedPrimitive( .. );
The actual references to the stream data do not occur until a drawing method, such as IDirect3DDevice9::DrawIndexedPrimitive(), is called.
In the Gothic III engine, we employ the concept of resource management to handle the administration of all data-intensive elements, such as vertex data (mesh information), textures, animation data (key frames and skinning data), sounds, and music. All these data structures use a uniform framework. It provides a simple interface so that other systems of the engine can access it.
All resources are packed into abstract data objects. We address these data objects by file name or by a globally unique identifier (GUID). The access of data containers is similarly organized like a database access. If you call a query, you get a referenceable data container with all necessary structures of the specified data.
The resource management system ensures the immediate and simultaneous availability of the data. It is also responsible for converting this proprietary data into the graphics hardware's preferred format.
As we mentioned, the resource management system makes it possible to store mesh data in specified file locations and to load these mesh objects only as needed. These structures must be stored in an efficient and flexible way, but vertex data layout can vary substantially from mesh to mesh.
The loaded resources should provide a reference counter, so that the same mesh can be used repeatedly in a scene. Geometry instancing is used to speed up rendering this type of mesh. (See Chapter 3 of this book, "Inside Geometry Instancing.")
The mesh information is generated by tools such as 3ds max or Maya. For this operation, a suitable importer has to be developed, as shown in Figure 5-3.
Figure 5-3 Structure of the Resource Framework
One of the database containers provided by the resource management system is a mesh resource, which has all relevant structures, all vertex and index data arrays, and a link to an abstract material object (which is also a referenceable resource).
Note that the vertex structure wrapped into a resource class (as shown in Figure 5-3) isn't adapted to any specific graphics API (such as DirectX or OpenGL). We can therefore implement the whole resource system independent of the graphics API used.
These structures consist of simple vertex streams (arrays) containing a single vertex component, such as position (an array of 3D vectors), normals (an array of 3D vectors), color (an array of DWORD values), or up to 16 texture coordinates (an array of 1D, 2D, 3D, or 4D vectors) and other customized data arrays such as weights for hardware skinning.
To complete this groundwork, we have developed a hardware-independent resource framework with a mesh class and its administrative interface. Furthermore, we have written a specified importer for the corresponding modeling tool.
The renderer takes over the task of changing the hardware-independent data into a format that is optimal for the graphics hardware.
First, the mesh resources are loaded into main memory at the start of the application. Alternatively, we use an on-demand approach to load the data into memory when needed, using a separate thread.
For the renderer, the mesh object is only a temporary container to build up the correct format. The generated hardware-dependent format is used by the application to draw the scene. The reason for this strict separation is the ability to tailor the data optimally for the GPU found in the PC on which the application is currently running. We use only the vertex data that are actually needed.
This separation becomes useful in the adaptation of needed texture coordinates: On less powerful hardware, many visual effects are turned off because the hardware is incapable of running them, or because the overall rendering load needs to be reduced. Because these effects are not rendered, the texture coordinates used by these effects need not be passed to the graphics hardware. They can simply be removed. So we reduce memory consumption and save processor power and bandwidth.
For this task, we provide a submodule, which we have named Vertexprocessor, that translates the generic mesh data into the specified hardware structure. The following steps are necessary for the data to be prepared for multistreaming:
Figure 5-4 Query of Single Vertex Streams Used by Vertexprocessor
The Vertexprocessor unit for geometry data seeks position, normals, and the two colors (if available) from a mesh object.
For illustration purposes, Listing 5-3 creates a vertex buffer that contains all this information and uses this specific vertex layout. The buffer should have a cache-friendly size of 32 bytes per entry, so the layout is fixed for this buffer type.
D3DVERTEXELEMENT9 m_VtxDcl[] = { {0, 0, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0}, // stream 0, position {0, 12, D3DDECLTYPE_FLOAT3, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_NORMAL, 0}, // stream 0, normal {0, 24, D3DDECLTYPE_D3DCOLOR, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_COLOR, 0}, // stream 0, color 0 {0, 28, D3DDECLTYPE_D3DCOLOR, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_COLOR, 0}, // stream 0, color 1 }; // Here comes the vertex definition. // Vertex format: // stream 0 -> position + normal + color 0 + color 1 struct sVTXStreamGeometry { float vecPos[ 3 ]; float vecNorm[ 3 ]; DWORD dwColor0, dwColor1; }; // Data container Resourcemesh supports GetVertexStreamArray here. // (See API methods above.) cResourceMesh* m_pResourceMesh; // Create the vertex declaration and create a specific vertex buffer. // For clarity, we calculate the size of the buffer in the number of // elements and the corresponding FVF flags. In our engine implementation, // the geometry stream has a size of 32 bytes (cache-friendly), so we // can easily create the sufficient memory size of the vertex buffer. CreateVertexDeclaration( m_VtxDcl ); m_pVtxBuf0 = CreateVertexBuffer(NumVertices, FVF_XYZ | FVF_NORMAL | FVF_DIFFUSE | FVF_SPECULAR); sVTXStreamGeometry* pBuffer = LockBuffer(m_pVtxBuf0); for each vertexIndex in m_pResourceMesh { pBuffer->m_vecPos = m_pResourceMesh->GetVertexStreamArray(enum_Position, vertexIndex); pBuffer->m_vecNorm = m_pResourceMesh->GetVertexStreamArray(enum_Normal, vertexIndex); pBuffer->m_dwColor0 = m_pResourceMesh->GetVertexStreamArray(enum_Diffuse, vertexIndex); pBuffer->m_dwColor1 = m_pResourceMesh->GetVertexStreamArray(enum_Specular, vertexIndex); pBuffer++; }
The Vertexprocessor unit for texture data fetches the available texture coordinate slots from the mesh object and builds up a flexible vertex buffer with all necessary texture coordinates. At most, eight texture coordinate groups (with a 1D, 2D, 3D, or 4D vector per texture coordinate) are possible.
In addition, one of the slots can be encoded as tangent vectors for tangent-space normal maps; otherwise, it would be necessary to use a separate vertex buffer stream for this information. We omit this in Listing 5-4 for clarity. It is, however, straightforward to add this feature.
If light maps are available, we should use a separate Vertexprocessor unit (that is, a separate vertex buffer per mesh instance), because every instance must have unique UV pairs for light map textures. Light maps must be individually mapped, and their mapping coordinates depend on their world-space position, and so they can't be instantiated. This can be a big problem for the memory footprint and performance of your application. Consequently, light maps are useful for big meshes, such as terrain in outdoor scenarios or irregular wall meshes in dungeons, but not for small meshes or multi-instantiated objects.
// Data container Resourcemesh supports GetVertexStreamArray here. // (See API methods above.) cResourceMesh* m_pResourceMesh; // This is a template array structure to allocate a flexible amount // of vtxDcl structures in one block Array<D3DVERTEXELEMENT9> arrDeclVtx; // We create a specific vertex buffer for textures. We use the FVF flag // from Resourcemesh and filter the specific texture flags to create // the specific buffer. // For clarity, we use here a function that calculates the size of the // buffer and the corresponding FVF flags. You can use an alternative // implementation to calculate the correct size of the vertex buffer. sVTXStreamTexture* pVtxBuf1 = CreateVertexBuffer(NumTextureEntries, m_pResourceMesh->m_FVFFlag & TextureFlags); // Lock buffer with unspecific byte pointer char* pBuffer = LockBuffer( pVtxBuf1 ); // Current buffer position (in bytes) in locked data of vertex buffer DWORD dwGlobalBufferPosition = 0; for each textureIndex in m_pResourceMesh { DWORD dwCurrentOffset = 0; // Current offset position for vtxDecl for (int i = 0; i<m_pResourceMesh->GetNumberOfTextureSlots(); i++) { // Array template allocates us a new structure at the end of the // structure block and returns a nonconstant reference of the block. // All structures allocated before remain unchanged (by realloc). D3DVERTEXELEMENT9& flexibleElement = arrDeclVtx.InsertNewEnd(); flexibleElement.Stream = 1; // Is at 2nd position after Geometry flexibleElement.Usage = D3DDECLUSAGE_TEXCOORD; // Current size in bytes of texture slots (u,v = 8 bytes) // r,s,t = 12 bytes, r,s,t,w = 16 bytes DWORD dwCurrentTextureSlotSize = m_pResourceMesh->GetTextureCoordinateSizeBySlot( i ); // For specified slot, we look for what size (in bytes) we need // and add it to our global offset counter dwCurrentOffset flexibleElement.Offset = dwCurrentOffset; // Calculate stream position in Resourcemesh for specified // texture slot int enumStreamArraySlot = m_pResourceMesh->GetTextureSlotEnum( i ); // Now copy out the vertex information from the Resourcemesh // stream into locked buffer memcpy(&pBuffer[ dwGlobalBufferPosition + dwCurrentOffset ], m_pResourceMesh->GetVertexStreamArray(enumStreamArraySlot, textureindex), dwCurrentTextureSlotSize); dwCurrentOffset += dwCurrentTextureSlotSize; // Calculate types of vtxDecl (D3DDECLTYPE_FLOAT2 = 8 bytes, // D3DDECLTYPE_FLOAT3 = 12 bytes, D3DDECLTYPE_FLOAT4 = 16 bytes) flexibleElement.Type = CalculateDataSizeType(dwCurrentTextureSlotSize); } // Add new position offset to global position in buffer dwGlobalBufferPosition += dwCurrentOffset; }
The Vertexprocessor unit for animation data builds up weights for animation skinning with bones in hardware. A varying number of weights per bone can be used, so it is not necessary to waste more memory than is really needed. In Gothic III, we use up to six weights per bone for our animation system.
An index buffer corresponding to the above vertex buffers is also built up. Indices are stored into a mesh object as a separate stream of information. The bit size of index entries (whether 16 or 32 bit) depends on the number of vertices. Having more than 65,535 entries necessitates using DWORDs as index entries. Otherwise, WORD (16-bit) indices suffice. Choosing the smallest possible data type helps reduce an application's memory footprint.
The Vertexprocessor has the responsibility to create all the previously described vertex buffers in an optimal manner for hardware (typically they are created as Pool_Managed and WriteOnly)(Ashida 2004).
Once all the streams are built up properly, they are available for rendering and drawing.
As an example of using streams separately, we execute the G-stream (that is, the geometry stream) as a separate z-pass, to achieve fast z-rejects in hardware and to use this buffer in conjunction with an occlusion query system implemented in the render system. (See Chapter 6 of this book, "Hardware Occlusion Queries Made Useful," for an example of such a system.)
It is possible to add the A-stream (the animation stream) in the calculation, but the effort isn't worthwhile in most cases. The number of pixels that differ because of the animation is typically small and thus out of proportion to the additional rendering cost of adding animation.
Individual streams are activated depending on the view (whether solid rendering or z-pass) by the renderer.
In this chapter, we have shown how current applications can overcome problems caused by the growing amount of geometry data in scenes. We've discussed a flexible model that gives the application more control over the data and drives the detected hardware optimally by combining two powerful techniques:
A nice side benefit: we efficiently handle the bandwidth for data, which sometimes can be limited because of data duplication/redundant data transmission across the bus from system memory to the GPU.
Ashida, Koji. 2004. "Optimising the Graphics Pipeline." NVIDIA presentation. http://developer.nvidia.com/docs/IO/10878/ChinaJoy2004_OptimizationAndTools.pdf
Cebenoyan, Cem. 2004. "Graphics Pipeline Performance." In GPU Gems, edited by Randima Fernando, pp. 473–486. Addison-Wesley. Available online at http://developer.nvidia.com/object/GPU_Gems_Samples.html
For additional information about programming one or more streams in DirectX, see the Microsoft Web site:
Michael Wimmer
Vienna University of Technology
Jirí Bittner
Vienna University of Technology
Hardware occlusion queries were one of the most eagerly awaited graphics hardware features in a long time. This feature makes it possible for an application to ask the 3D API (OpenGL or Direct3D) whether or not any pixels would be drawn if a particular object were rendered. With this feature, applications can check whether or not the bounding boxes of complex objects are visible; if the bounds are occluded, the application can skip drawing those objects. Hardware occlusion queries are appealing to today's games because they work in completely dynamic scenes.
However, although this feature has been available for more than two years (via the NV/ARB_occlusion_query OpenGL extension and by the IDirect3DQuery9 interface in Direct3D), there is not yet widespread adoption of this feature for solving visibility culling in rendering engines. This is due to two main problems related to naive usage of the occlusion query feature: the overhead caused by issuing the occlusion queries themselves (since each query adds an additional draw call), and the latency caused by waiting for the query results.
In this chapter, we present a simple but powerful algorithm that solves these problems (Bittner et al. 2004): it minimizes the number of issued queries and reduces the impact of delays due to the latency of query results.
To achieve this, the algorithm exploits the spatial and temporal coherence of visibility by reusing the results of occlusion queries from the previous frame in order to initiate and schedule the queries in the current frame. This is done by storing the scene in a hierarchical data structure (such as a k-d tree or octree [Cohen-Or et al. 2003]), processing nodes of the hierarchy in a front-to-back order, and interleaving occlusion queries with the rendering of certain previously visible nodes.
The algorithm is smart about the queries it actually needs to issue: most visible interior nodes of the spatial hierarchy and many small invisible nodes are not tested at all. Furthermore, previously visible nodes are rendered immediately without waiting for their query result, which allows filling the time needed to wait for query results with useful work.
Let's recap briefly for which scenes and situations the algorithm presented in this chapter is useful.
Occlusion queries work best for large scenes where only a small portion of the scene is visible each frame—for example, a walkthrough in a city. More generally, the algorithm presented here works if in each frame there are large occluded interior nodes in the hierarchy. These nodes should contain many objects, so that skipping them if they are occluded saves geometry, overdraw, and draw calls (which are a typical bottleneck in today's rendering systems [Wloka 2003]). An important advantage the algorithm has over most other techniques used nowadays in games (such as portal culling) is that it allows completely dynamic environments.
Some games—for example, shader-heavy ones—use a separate initial rendering pass that writes only to the depth buffer. Subsequent passes test against this depth buffer, and thus expensive shading is done only for visible pixels. If the scenes are complex, our occlusion-culling algorithm can be used to accelerate establishing the initial depth buffer and at the same time to obtain a complete visibility classification of the hierarchy. For the subsequent passes, we skip over objects or whole groups of objects that are completely invisible at virtually no cost.
Finally, if there is little or no occlusion in a scene—for example, a flyover of a city—there is no benefit to occlusion culling. Indeed, occlusion queries potentially introduce an overhead in these scenes and make rendering slower, no matter how the occlusion queries are used. If an application uses several modes of navigation, it will pay to switch off occlusion culling for modes in which there is no or only very little occlusion.
The term occlusion culling refers to a method that tries to reduce the rendering load on the graphics system by eliminating (that is, culling) objects from the rendering pipeline if they are hidden (that is, occluded) by other objects. There are several methods for doing this.
One way to do occlusion culling is as a preprocess: for any viewpoint (or regions of viewpoints), compute ahead of time what is and is not visible. This technique relies on most of the scene to be static (so visibility relationships do not change) and is thus not applicable for many interactive and dynamic applications (Cohen-Or et al. 2003).
Portal culling is a special case of occlusion culling; a scene is divided into cells (typically rooms) connected via portals (typically door and window openings). This structure is used to find which cells (and objects) are visible from which viewpoint—either in a preprocess or on the fly (Cohen-Or et al. 2003). Like general preprocess occlusion culling, it relies on a largely static scene description, and the room metaphor restricts it to mostly indoor environments.
Online occlusion culling is more general in that it works for fully dynamic scenes. Typical online occlusion-culling techniques work in image space to reduce computation overhead. Even then, however, CPUs are less efficient than, say, GPUs for performing rasterization, and thus CPU-based online occlusion techniques are typically expensive (Cohen-Or et al. 2003).
Fortunately, graphics hardware is very good at rasterization. Recently, an OpenGL extension called NV_occlusion_query, or now ARB_occlusion_query, and Direct3D's occlusion query (D3DQUERYTYPE_OCCLUSION) allow us to query the number of rasterized fragments for any sequence of rendering commands. Testing a single complex object for occlusion works like this (see also Sekulic 2004):
The approximation used in step 3 should be simple so as to speed up the rendering process, but it must cover at least as much screen-space area as the original object, so that the occlusion query does not erroneously classify an object as invisible when it actually is visible. The approximation should thus be much faster to render, and does not modify the frame buffer in any way.
This method works well if the tested object is really complex, but step 5 involves waiting until the result of the query actually becomes available. Since, for example, Direct3D allows a graphics driver to buffer up to three frames of rendering commands, waiting for a query results in potentially large delays. In the rest of this chapter, we refer to steps 1 through 4 as "issuing an occlusion query for an object."
If correctness of the rendered images is not important, a simple way to avoid the delays is to check for the results of the queries only in the following frame (Sekulic 2004). This obviously leads to visible objects being omitted from rendering, which we avoid in this chapter.
As a first attempt to use occlusion queries, we show a naive occlusion-culling algorithm and extend it to use a hierarchy. In the following section, we show our coherent hierarchical culling algorithm, which makes use of coherence and other tricks to be much more effective than the naive approach.
To understand why we want to use a hierarchical algorithm, let's take a look at the naive occlusion query algorithm first:
Although this algorithm calculates correct occlusion information for all of our objects, it is most likely slower than just rendering all the objects directly. The reason is that we have to issue a query and wait for its result for every object in the scene!
Now imagine, for example, a walkthrough of a city scene: We typically see a few hundred objects on screen (buildings, streetlights, cars, and more). But there might be tens of thousands of objects in the scene, most of which are hidden by nearby buildings. If each of these objects is not very complex, then issuing and waiting for queries for all of them is more expensive than just rendering them.
We need a mechanism to group objects close to one another so we can treat them as a single object for the occlusion query. That's just what a spatial hierarchy does. Examples of spatial hierarchies are k-d trees, BSP trees, and even standard bounding-volume hierarchies. They all have in common that they partition the scene recursively until the cells of the partition are "small" enough according to some criterion. The result is a tree structure with interior nodes that group other nodes, and leaf nodes that contain actual geometry.
The big advantage of using hierarchies for occlusion queries is that we can now test interior nodes, which contain much more geometry than the individual objects. In our city example, hundreds or even thousands of objects might be grouped into one individual node. If we issue an occlusion query for this node, we save the tests of all of these objects—a potentially huge gain!
If geometry is not the main rendering bottleneck, but rather the number of draw calls issued (Wloka 2003), then making additional draw calls to issue the occlusion queries is a performance loss. With hierarchies, though, interior nodes group a larger number of draw calls, which are all saved if the node is occluded using a single query. So we see that in some cases, a hierarchy is what makes it possible to gain anything at all by using occlusion queries.
The naive hierarchical occlusion-culling algorithm works like this—it is specified for a node within the hierarchy and initially called for the root node:
This algorithm can potentially be much faster than the basic naive algorithm, but it has two significant drawbacks.
This approach shares the first drawback with the naive algorithm. Whenever we issue an occlusion query for a node, we cannot continue our algorithm until we know the result. But waiting until the query result becomes available may be prohibitive. The query has to be sent to the GPU. There it sits in a command queue until all previous rendering commands have been issued (and modern GPUs can queue up more than one frame's worth of rendering commands!). Then the bounding box associated with the query must be rasterized, and finally the result of the query has to travel back over the bus to the driver.
During all this time, the CPU sits idle, and we have caused a CPU stall. But that's not all. While the CPU sits idle, it doesn't feed the GPU any new rendering commands. Now when the result of the occlusion query arrives, and the CPU has finally figured out what to draw and which commands to feed the GPU next, the command buffer of the GPU has emptied and it has become idle for some time as well. We call this GPU starvation. Obviously, we are not making good use of our resources. Figure 6-1 sums up these problems.
Figure 6-1 CPU Stalls and GPU Starvation Caused by Occlusion Queries
The second drawback of this algorithm runs contrary to the original intent of the hierarchical method. We wanted to reduce the overhead for the occlusion queries by grouping objects together. Unfortunately, this approach increases the number of queries (especially if many objects are visible): in addition to the queries for the original objects, we have to add queries for their groupings. So we have improved the good case (many objects are occluded), but the bad case (many objects are visible) is even slower than before.
The number of queries is not the only problem. The new queries are for interior nodes of the hierarchy, many of which, especially the root node and its children, will have bounding boxes covering almost the entire screen. In all likelihood, they are also visible. In the worst case, we might end up filling the entire screen tens of times just for rasterizing the bounding geometry of interior nodes.
As we have seen, the hierarchical stop-and-wait method does not make good use of occlusion queries. Let us try to improve on this algorithm now.
To solve problem 1, we have to find a way to avoid the latency of the occlusion queries. Let's assume that we could "guess" the outcome of a query. We could then react to our guess instead of the actual result, meaning we don't have to wait for the result, eliminating CPU stalls and GPU starvations.
Now where does our guess come from? The answer lies, as it so often does in computer graphics, in temporal coherence. This is just another way of expressing the fact that in real-time graphics, things don't move around too much from one frame to the next. For our occlusion-culling algorithm, this means that if we know what's visible and what's occluded in one frame, it is very likely that the same classification will be correct for most objects in the following frame as well. So our "guess" is simply the classification from the previous frame.
But wait—there are two ways in which our guess can go wrong: If we assume the node to be visible and it's actually occluded, we will have processed the node needlessly. This can cost some time, but the image will be correct. However, if we guess that a node is occluded and in reality it isn't, we won't process it and some objects will be missing from the image—something we need to avoid!
So if we want to have correct images, we need to verify our guess and rectify our choice in case it was wrong. In the first case (the node was actually occluded), we update the classification for the next frame. In the second case (the node was actually visible), we just process (that is, traverse or render) the node normally. The good news is that we can do all of this later, whenever the query result arrives. Note also that the accuracy of our guess is not critical, because we are going to verify it anyway. Figure 6-2 shows the different cases.
Figure 6-2 Processing Requirements for Various Nodes
To address problem 2, we need a way to reduce overhead caused by the occlusion queries for interior nodes.
Luckily, this is easy to solve. Using idea 1, we are already processing previously visible interior nodes without waiting for their query results anyway. It turns out that we don't even need to issue a query for these nodes, because at the end of the frame, this information can be easily deduced from the classification of its children, effectively "pulling up" visibility information in the tree. Therefore, we save a number of occlusion queries and a huge amount of fill rate.
On the other hand, occlusion queries for interior nodes that were occluded in the previous frame are essential. They are required to verify our choice not to process the node, and in case the choice was correct, we have saved rendering all the draw calls, geometry, and pixels that are below that node (that is, inside the bounding box of the node).
What this boils down to is that we issue occlusion queries only for previously visible leaf nodes and for the largest possible occluded nodes in the hierarchy (in other words, an occluded node is not tested if its parent is occluded as well). The number of queries issued is therefore linear in the number of visible objects. Figure 6-3 illustrates this approach.
Figure 6-3 Occlusion Query Requirements for Various Nodes
We apply these two ideas in an algorithm we call "coherent hierarchical culling." In addition to the hierarchical data structure for the front-to-back traversal we already know, we need a "query queue" that remembers the queries we issued previously. We can then come back later to this queue to check whether the result for a query is already available.
The algorithm is easy to incorporate into any engine as it consists of a simple main traversal loop. A queue data structure is part of the C++ standard template library (STL), and a hierarchical data structure is part of most rendering libraries anyway; for example, for collision detection.
The algorithm visits the hierarchy in a front-to-back order and immediately recurses through any previously visible interior node (idea 1). For all other nodes, it issues an occlusion query and stores it in the query queue. If the node was a previously visible leaf node, it also renders it immediately, without waiting for the query result.
During the traversal of the hierarchy, we want to make sure that we incorporate the query results as soon as possible, so that we can issue further queries if we encounter a change in visibility.
Therefore, after each visited node, we check the query queue to see if a result has already arrived. Any available query result is processed immediately. Queries that verify our guess to be correct are simply discarded and do not generate additional work. Queries that contradict our guess are handled as follows: Nodes that were previously visible and became occluded are marked as such. Nodes that were previously occluded and became visible are traversed recursively (interior nodes) or rendered (leaf nodes). Both of these cases cause visibility information to be pulled up through the tree. See Figure 6-4, which also depicts a "pull-down" situation: a previously occluded node has become visible and its children need to be traversed.
Figure 6-4 Visibility of Hierarchy Nodes in Two Consecutive Frames
The algorithm is easy to follow using the pseudocode, which we show in Listing 6-1. Let's discuss some of the details in the code.
To maintain the visibility classification for all nodes over consecutive frames, we use a visible flag and a frameID, which increments every frame. Nodes visible in the previous frame are easily identified through lastVisited = frameID - 1 and visible = true; all other nodes are assumed invisible. This way, we don't have to reset the visibility classification for each frame.
TraversalStack.Push(hierarchy.Root); while ( not TraversalStack.Empty() or not QueryQueue.Empty() ) { //--PART 1: process finished occlusion queries while ( not QueryQueue.Empty() and (ResultAvailable(QueryQueue.Front()) or TraversalStack.Empty()) ) { node = QueryQueue.Dequeue(); // wait if result not available visiblePixels = GetOcclusionQueryResult(node); if ( visiblePixels > VisibilityThreshold ) { PullUpVisibility(node); TraverseNode(node); } } //--PART 2: hierarchical traversal if ( not TraversalStack.Empty() ) { node = TraversalStack.Pop(); if ( InsideViewFrustum(node) ) { // identify previously visible nodes wasVisible = node.visible and (node.lastVisited == frameID - 1); // identify nodes that we cannot skip queries for leafOrWasInvisible = not wasVisible or IsLeaf(node); // reset node's visibility classification node.visible = false; // update node's visited flag node.lastVisited = frameID; // skip testing previously visible interior nodes if ( leafOrWasInvisible ) { IssueOcclusionQuery(node); QueryQueue.Enqueue(node); } // always traverse a node if it was visible if ( wasVisible ) TraverseNode(node); } } } TraverseNode(node) { if ( IsLeaf(node) ) Render(node); else TraversalStack.PushChildren(node); } PullUpVisibility(node) { while (not node.visible) { node.visible = true; node = node.parent; } }
To establish the new visibility classification for the current frame, we set all nodes we encounter during the traversal to invisible by default. Later, when a query identifies a node as visible, we "pull up" visibility; that is, we set all its ancestors to visible as well.
Interior nodes that were visible in the previous frame (that is, not leafOrWasInvisible) are nodes for which we skip occlusion queries altogether.
Traversing a node means rendering the node for leaf nodes or pushing its children onto the traversal stack so that they get rendered front to back.
We also assume that the Render() call renders an object only if it hasn't been rendered before (this can be checked using the frameID). This facilitates the code for previously visible objects: when we finally get the result of their visibility query, we can just process them again without introducing a special case or rendering them twice (since they have already been rendered immediately after their query was issued).
Another advantage is that we can reference objects in several nodes of the hierarchy, and they still get rendered only once if several of these nodes are visible.
Furthermore, we have to be careful with occlusion tests for nodes near the viewpoint. If the front faces of a bounding box get clipped by the near plane, no fragments are generated for the bounding box. Thus the node would be classified as occluded by the test, even though the node is most likely visible. To avoid this error for each bounding box that passes the view-frustum test, we should check whether any of its vertices is closer than the near plane. In such a case, the associated node should be set visible without issuing an occlusion query. Finally, note that for the occlusion queries, the actual node bounding boxes can be used instead of the (usually less tight) nodes of the spatial hierarchy.
Let's take a step back and see what we have achieved with our algorithm and why.
The coherent hierarchical culling algorithm does away with most of the inefficient waiting found in the hierarchical stop-and-wait method. It does so by interleaving the occlusion queries with normal rendering, and avoiding the need to wait for a query to finish in most cases.
If the viewpoint does not move, then the only point where we actually might have to wait for a query result is at the end of the frame, when all the visible geometry is already rendered. Previously visible nodes are rendered right away without waiting for the results.
If the viewpoint does move, the only dependency occurs if a previously invisible node becomes visible. We obviously need to wait for this to be known in order to decide whether to traverse the children of the node. However, this does not bother us too much: most likely, we have some other work to do during the traversal. The query queue allows us to check regularly whether the result is available while we are doing useful work in between.
Note that in this situation, we might also not detect some occlusion situations and unnecessarily draw some objects. For example, if child A occludes child B of a previously occluded interior node, but the query for B is issued before A is rendered, then B is still classified as visible for the current frame. This happens when there is not enough work to do between issuing the queries for A and B. (See also Section 6.5.7, where we show how a priority queue can be used to increase the chance that there is work to do between the queries for A and B.)
We have already seen that a hierarchical occlusion-culling algorithm can save a lot of occlusion queries if large parts of the scene are occluded. However, this is paid for by large costs for testing interior nodes.
The coherent hierarchical culling algorithm goes a step further by obviating the need for testing most interior nodes. The number of queries that need to be issued is only proportional to the number of visible objects, not to the total number of objects in the scene. In essence, the algorithm always tests the largest possible nodes for occluded regions.
Neither does the number of queries depend on the depth of the hierarchy, as in the hierarchical stop-and-wait method. This is a huge win, because the rasterization of large interior nodes can cost more than occlusion culling buys.
We haven't talked a lot about traversing the hierarchy up to now. In principle, the traversal depends on which hierarchy we use, and is usually implemented with a traversal stack, where we push child nodes in front-to-back order when a node is traversed. This basically boils down to a depth-first traversal of the hierarchy.
However, we can gain something by not adhering to a strict depth-first approach. When we find nodes that have become visible, we can insert these nodes into the traversal, which should be compatible with the already established front-to-back order in the hierarchy.
The solution is not to use a traversal stack, but a priority queue based on the distance to the observer. A priority queue makes the front-to-back traversal very simple. Whenever the children of a node should be traversed, they are simply inserted into the priority queue. The main loop of the algorithm now just checks the priority queue, instead of the stack, for the next node to be processed.
This approach makes it simple to work with arbitrary hierarchies. The hierarchy only needs to provide a method to extract its children from a node, and to compute the bounding box for any node. This way, it is easy to use a sophisticated hierarchy such as a k-d tree (Cohen-Or et al. 2003), or just the bounding volume hierarchy of the scene graph for the traversal, and it is easy to handle dynamic scenes.
Note also that an occlusion-culling algorithm can be only as accurate as the objects on which it operates. We cannot expect to get accurate results for individual triangles if we only test nodes of a coarse hierarchy for occlusion. Therefore, the choice of a hierarchy plays a critical role in occlusion culling.
We briefly cover a few optimizations that are beneficial in some (but not all) scenes.
First of all, a very simple optimization that is always useful concerns previously visible leaf nodes (Sekulic 2004). Because these will be rendered regardless of the outcome of the query, it doesn't make sense to use an approximation (that is, a bounding box) for the occlusion query. Instead, when issuing the query as described in Section 6.3, we omit step 2 and replace step 3 by rendering the actual geometry of the object. This saves rasterization costs and draw calls as well as state changes for previously visible leaf nodes.
For some scenes, using a z-only rendering pass can be advantageous for our algorithm. Although this entails twice the transformation cost and (up to) twice the draw calls for visible geometry, it provides a good separation between occlusion culling and rendering passes as far as rendering states are concerned. For the occlusion pass, the only state that needs to be changed between an occlusion query and the rendering of an actual object is depth writing. For the full-color pass, visibility information is already available, which means that the rendering engine can use any existing mechanism for optimizing state change overhead (such as ordering objects by rendering state).
We might be willing to accept certain image errors in exchange for better performance. This can be achieved by setting the VisibilityThreshold in the algorithm to a value greater than zero. This means that nodes where no more than, say, 10 or 20 pixels are visible are still considered occluded. Don't set this too high, though; otherwise the algorithm culls potential occluders and obtains the reverse effect.
This optimization works best for scenes with complex visible geometry, where each additional culled object means a big savings.
Another optimization makes even more use of temporal coherence. When an object is visible, the current algorithm assumes it will also be visible for the next frame. We can even go a step further and assume that it will be visible for a number of frames. If we do that, we save the occlusion queries for these frames (we assume it's visible anyway). For example, if we assume an object remains visible for three frames, we can cut the number of required occlusion queries by almost a factor of three! Note, however, that temporal coherence does not always hold, and we almost certainly render more objects than necessary.
This optimization works best for deep hierarchies with simple leaf geometry, where the number of occlusion queries is significant and represents significant overhead that can be reduced using this optimization.
Occlusion culling is an essential part of any rendering system that strives to display complex scenes. Although hardware occlusion queries seem to be the long-sought-after solution to occlusion culling, they need to be used carefully, because they can introduce stalls into the rendering pipeline.
We have shown an algorithm that practically eliminates any waiting time for occlusion query results on both the CPU and the GPU. This is achieved by exploiting temporal coherence, assuming that objects that have been visible in the previous frame remain visible in the current frame. The algorithm also reduces the number of costly occlusion queries by using a hierarchy to cull large occluded regions using a single test. At the same time, occlusion tests for most other interior nodes are avoided.
This algorithm should make hardware occlusion queries useful for any application that features a good amount of occlusion, and where accurate images are important. For example, Figure 6-5 shows the application of the algorithm to the walkthrough of a city model, with the visibility classification of hierarchy nodes below. The orange nodes were found visible; all the other depicted nodes are invisible. Note the increasing size of the occluded nodes with increasing distance from the visible set. For the viewpoint shown, the algorithm presented in this chapter provided a speedup of about 4 compared to rendering with view-frustum culling alone, and a speedup of 2.6 compared to the hierarchical stop-and-wait method.
Figure 6-5 Visualizing the Benefits of Occlusion Queries for a City Walkthrough
Bittner, J., M. Wimmer, H. Piringer, and W. Purgathofer. 2004. "Coherent Hierarchical Culling: Hardware Occlusion Queries Made Useful." Computer Graphics Forum (Proceedings of Eurographics 2004) 23(3), pp. 615–624.
Cohen-Or, D., Y. Chrysanthou, C. Silva, and F. Durand. 2003. "A Survey of Visibility for Walkthrough Applications." IEEE Transactions on Visualization and Computer Graphics 9(3), pp. 412–431.
NVIDIA Corporation. 2004. NVIDIA GPU Programming Guide. http://developer.nvidia.com/object/gpu_programming_guide.html
Sekulic, D. 2004. "Efficient Occlusion Culling." In GPU Gems, edited by Randima Fernando, pp. 487–503. Addison-Wesley.
Wloka, M. 2003. "Batch, Batch, Batch: What Does It Really Mean?" Presentation at Game Developers Conference 2003. http://developer.nvidia.com/docs/IO/8230/BatchBatchBatch.pdf
Michael Bunnell
NVIDIA Corporation
In this chapter we describe how to perform view-dependent, adaptive tessellation of Catmull-Clark subdivision surfaces with optional displacement mapping. We use the GPU to do the tessellation calculations, which saves graphics bus bandwidth and is many times faster than using the CPU. The first part of this chapter explains how to tessellate subdivision surfaces to polygons for rendering high-quality curved surfaces without visible polygon artifacts. The second part of this chapter describes how to add displacement-mapping support for rendering highly detailed models that can be animated in real time.
This chapter takes a repeated subdivision approach to tessellation, implemented by rendering into 2D textures. The subdivision, flatness test, and final vertex attribute calculations are done using fragment programs (pixel shaders). This method assumes that the vertex data of the subdivision surface control mesh are stored in a texture map. Intermediate results are also rendered to and read from texture maps, and the final tessellation results (position, normal, and so on) are rendered to a vertex array ready to be used by a render-primitives call such as glDrawElements().
Subdivision surfaces are arguably the most popular curved-surface representation used in computer graphics today. Specifically, Catmull-Clark subdivision surfaces are supported in practically every 3D modeling and animation application. Introduced to the big screen by Pixar with A Bug's Life, they have been used in all of Pixar's movies since, including Finding Nemo and The Incredibles. Subdivision surfaces are often combined with displacement mapping to add extra detail for computer-generated characters in live-action movies. The most striking example of a displacement-mapped subdivision surface model is probably the creature Gollum from the recent Lord of the Rings movies.
Subdivision surfaces are curved surfaces (or higher-order surfaces) described by a simple polygon control mesh and some subdivision rules. The subdivision rules define how to perform one subdivision step on the surface. Each subdivision step creates a new mesh with more geometry and a smoother appearance than the previous mesh. Figure 7-1 illustrates such rules applied to a cube.
Figure 7-1 Catmull-Clark Subdivision of a Cube
We use Catmull-Clark subdivision surfaces in this chapter, and so we use the Catmull-Clark subdivision rules (Catmull and Clark 1978). Catmull-Clark subdivision works on meshes of arbitrary manifold topology. However, it is considered a quad-based subdivision scheme because it works best on four-sided polygons (quads) and because all the polygons in the mesh after the first subdivision step are quads. Our implementation is limited to control meshes consisting only of quads. This implies that the source models have to be created as all-quad models, a task familiar to artists with a NURBS modeling background or experience using Lightwave SubPatches.
The Catmull-Clark rules to subdivide an all-quad mesh are simple. Split each face of the control mesh into four faces by adding a vertex in the middle of the face and a vertex at each edge. The positions of these new vertices and the new positions of the original vertices can be computed from the pre-subdivided mesh vertex positions using the weight masks in Figure 7-2. We show only the weight masks for calculating the new positions of original vertices of valence 3, 4, and 5. The weight masks for higher valences follow the same progression.
Figure 7-2 Catmull-Clark Subdivision Masks
If we examine a couple of subdivision steps of a single face, we notice that subdivision is very similar to scaling a 2D image. The only differences are the filtering values used and that the vertex position data is generally stored as floating point, while most images are stored as integer or fixed point. Figure 7-3 shows a single face of a mesh being subdivided two times. One thing to note about this figure is that we need a ring of vertices around the faces that are being subdivided to perform the subdivision. Also, the extraordinary point (the vertex of valence 5) at the upper-left corner of the face in the original mesh is present in the subdivided meshes too.
Figure 7-3 Subdividing a Face of a Subdivision Surface
One obvious way to tessellate subdivision surfaces is to apply a fixed number of subdivision steps on the original control mesh to create a list of quads to render. In fact, that is how pre-tessellated surfaces are created with most 3D-modeling packages. The disadvantage of this method is that parts of a model are over-tessellated and other parts under-tessellated. Also, it leaves the burden of selecting the number of subdivisions on the application program. Under-tessellation leads to visible polygon artifacts. Over-tessellation creates too many small triangles that can severely affect rendering performance. Too much geometry causes the GPU to become vertex-transform or triangle-setup limited. Worse, having many tiny triangles that are just a few pixels in size can significantly reduce rasterizer efficiency.
We use a flatness-test-based adaptive subdivision algorithm to avoid the problems with the fixed subdivision method. Instead of blindly subdividing a fixed number of steps, we test the surface for flatness first. In this way, the subdivision adapts to the surface curvature and the viewpoint. The closer the surface is or the more curved it is, the more it gets subdivided. See Figure 7-4. We measure flatness as the distance from the polygonal mesh to the limit surface measured in screen pixels. For example, a flatness value of 0.5 means the maximum distance from the polygon approximation of the surface to the limit surface is half a pixel.
Figure 7-4 Adaptive vs. Uniform Tessellation
We need to store the subdivision control mesh vertex data in a format that allows for efficient processing. We do this by breaking up the original control mesh into smaller meshes that can be represented as regular 2D grids and therefore stored in 2D arrays. See Figure 7-5. Now we can infer the mesh connectivity by the location of vertices in the array. Vertices next to each other are assumed to be connected by edges. Control points arranged this way, in a 2D array or matrix, are generally referred to as patches, so that is what we call them. Our patches are different in that we allow extraordinary points at the patches' corners.
Figure 7-5 Patching a Surface
The obvious way to divide the control mesh into patches is to create a patch for each face of the mesh. This method requires a lot of duplicated vertices because we need to include the vertices from neighboring faces so we can subdivide the patch. We can improve on this situation by creating patches with more faces when possible.
Patching a surface is not a trivial process, but fortunately it can be done in a preprocessing step. Even animated meshes generally do not change topology, so the vertex locations in each patch can be precalculated and stored in a 2D texture. The texture can be passed to the tessellation code along with the list of patch information, which includes the size of the patch and the valences of any extraordinary vertices. The demo program available on the book's Web site shows one possible implementation of patching an arbitrary quad model.
The tessellation algorithm works as follows. First, we copy the vertex data of the control mesh (positions) into the patch buffer texture. The flatness test assumes the vertex coordinates are in eye space. If they are not in eye space, they can be transformed before or during this copy. Next, we check each patch for flatness (see "The Flatness Test," later in this section). Then we compute vertex attribute data, such as positions and normals, for the patches that are flat and write them to the vertex array, writing the index information to a primitive index buffer. Patches that are not flat are subdivided into another patch buffer. Then we swap patch-buffer pointers and go back to the flatness test step. This section describes each step in the tessellation process in more detail.
We use the CPU to do buffer management and to create the list of vertex indices that form part of the tessellation result. The CPU does not have to touch the vertex data, but it does read the flatness test results (using glReadPixels) and decides which patches need further tessellation. It assigns patch locations in the buffers, and it uses a simple recursive procedure to build the primitive index list based on the edge flatness results computed by the shaders.
Patches are created by rendering a quad per patch at the appropriate location in the patch buffer, to copy the control mesh vertex position data from a texture. The program executed at the quad's fragments uses a precomputed texture to index the control mesh vertices discussed in Section 7.1.4. If there are more vertex attributes to subdivide, such as texture coordinates, they are copied into a different location in the patch buffer. Vertex attributes other than position are bilinearly interpolated for performance.
The flatness of each edge in the patch is calculated by rendering a quad of appropriate size into the flatness buffer using the FlatTest1.cg or the FlatTest2.cg shaders. (All of the shaders mentioned here can be found with the SubdViewer demo program.) Unlike the other buffers used for tessellation, this buffer is not floating point. It is an 8-bit-per-component buffer.
The two flat-test shaders measure the flatness at each edge of the control mesh. Both shaders treat the four control points along an edge as a b-spline curve and measure the flatness of the curve. One shader is used before the first subdivision step because it can handle s-type curves, which are curves whose first derivatives change sign. The second, faster test is used thereafter, because there are only c-type curves after the first subdivision. (C-type curves have first derivatives that do not change sign and thus are c-shaped.)
As shown in Figure 7-6, the first shader's flatness test calculates the distance from the control point 2 to the midpoint of control points 1 and 3, as well as the distance of control point 3 to the midpoint of control points 2 and 4. The second flatness test calculates the distance from the midpoint of control points 1 and 4 to the midpoint of control points 2 and 3. Both tests compare the calculated distance to the flatness value given in pixels, scaled appropriately. The flatness value is prescaled to account for the distance of the b-spline control points from the limit curve, and it is doubled so the midpoint calculations can be done with only an add. Because the control points are assumed to be in eye space, the flatness distance is scaled by the z value of the midpoint of control points 2 and 3 to handle perspective projection.
Figure 7-6 Comparing Flatness Tests
The flatness test also needs to handle z-scale values that are less than the near render plane to avoid over-tessellating meshes that cross the near plane. It does so by reflecting z values less than the near plane to the other side of the plane. This technique results in good tessellation for patches that cross the near plane without creating too much geometry that would be clipped later.
Results of the flatness test are copied to host memory, so that the CPU can decide which patches are flat. The results are also saved for each level of subdivision except the last (we know all edges are flat after the last subdivision), so they can be used to build the array of primitive indices. This data is packed to save bandwidth and averages about 200 times fewer bytes of data than the description of the patch itself.
Subdivision is performed by rendering a quad for each patch into the destination patch buffer using the Subdiv.cg shader shown in Listing 7-1. The render works similar to a 2 x scaled image blit, but it uses the weights for Catmull-Clark subdivision, as discussed in Section 7.1.2. The source coordinate for rendering is offset from the patch's origin by (0.75, 0.75), to account for the ring of extra vertices around the faces that are being subdivided and to create a similar ring in the destination. See Figure 7-7. The size of the result patch is twice that of the source patch, minus 3. The Subdiv.cg shader uses the fractional part of the source coordinates to determine which subdivision mask to use: the face mask, the edge mask, or the valence-4 vertex mask. Attributes other than position are interpolated bilinearly in Normal.cg.
Figure 7-7 Resampling a Patch
float4 main(float4 srcCoord, uniform samplerRECT srcTexMap : TEXUNIT0) : COLOR { float3 wh, wv; // weight values float3 s1, s2, s3; float2 f; // calculate weights based on fractional part of // source texture coordinate // // x == 0.0, y == 0.0 use face weights (1/4, 1/4, 0) // (1/4, 1.4, 0) // ( 0, 0, 0) // x == 0.5, y == 0.0 use edge weights (horizontal) // (1/16, 6/16, 1/16) // (1/16, 6/16, 1/16) // ( 0, 0, 0) // x == 0.0, y == 0.5 use edge weights (vertical) // (1/16, 1/16, 0) // (6/16, 6/16, 0) // (1/16, 1/16, 0) // x == 0.5, y == 0.5 use valence 4 vertex weights // (1/64, 6/64, 1/64) // (6/64, 36/64, 6/64) // (1/64, 6/64, 1/64) wh = float3 (1.0/8.0, 6.0/8.0, 1.0/8.0); f = frac (srcCoord.xy + 0.001) < 0.25; // account for finite precision if (f.x != 0.0) { wh = float3(0.5, 0.5, 0.0); srcCoord.x += 0.5; // fraction was zero -- move to texel center } wv = float3 (1.0/8.0, 6.0/8.0, 1.0/8.0); if (f.y != 0) { wv = float3 (0.5, 0.5, 0.0); srcCoord.y += 0.5; // fraction was zero -- need to move to texel center } // calculate the destination vertex position by using the weighted // sum of the 9 vertex positions centered at srcCoord s1 = texRECT (srcTexMap, srcCoord.xy + float2 (-1, -1)).xyz * wh.x + texRECT (srcTexMap, srcCoord.xy + float2 (0, -1)).xyz * wh.y + texRECT (srcTexMap, srcCoord.xy + float2 (1, -1)).xyz * wh.z; s2 = texRECT (srcTexMap, srcCoord.xy + float2 (-1, 0)).xyz * wh.x + texRECT (srcTexMap, srcCoord.xy + float2 (0, 0)).xyz * wh.y + texRECT (srcTexMap, srcCoord.xy + float2 (1, 0)).xyz * wh.z; s3 = texRECT (srcTexMap, srcCoord.xy + float2 (-1, 1)).xyz * wh.x + texRECT (srcTexMap, srcCoord.xy + float2 (0, 1)).xyz * wh.y + texRECT (srcTexMap, srcCoord.xy + float2 (1, 1)).xyz * wh.z; return float4 (s1 * wv.x + s2 * wv.y + s3 * wv.z, 0); }
If there are any extraordinary points at the corners of the patch, they are copied and adjusted using the extraordinary-point subdivision shader, EPSubdiv.cg. This shader does not really subdivide per se, because no more data is created. It only adjusts the positions based on the subdivision rules, using a table lookup to get the appropriate filter weights. It calculates the coordinates of the weights in the lookup table based on the valence of the extraordinary point and the relative position of the control point that it is calculating. This shader is much slower than the regular subdivision shader, but it runs at many fewer pixels. Each corner of the patch (that has an extraordinary point) is handled by rendering a separate, small quad. The extraordinary-point shader is run after the regular subdivision shader and overwrites the control points in the destination that were calculated incorrectly.
We use the limit shaders and normal shaders to create the vertex attributes that are written to the vertex array for rendering, including the attributes that are bilinearly interpolated, which are copied in the normal shader. The limit shader calculates the position on the limit surface corresponding to the vertices of the patch. It uses a weighted average of each vertex and its eight neighbors (Halstead et al. 1993). The control points themselves are often not close enough to the limit surface to be substituted for the limit surface points, especially for neighboring patches that are subdivided to different depths. The EPLimit.cg shader is used to calculate the limit position for extraordinary points.
The normal shader calculates two tangents for each control point and then uses their cross product to calculate the surface normal of the limit surface at that point. The EPNormal.cg shader does the same for extraordinary points. Both shaders calculate the tangents using a weighted average of the control vertex's neighbors. Because the normal shaders use the same control points as the limit shaders, it is a good idea to combine them in a single shader and use multiple render targets to write out the position and normal at the same time.
Watertight tessellation means that there are no gaps between polygons in the tessellation output. Even tiny gaps can lead to missing pixels when rendering the surface. One source of these gaps is the fact that floating-point operations performed in different orders do not always give the exact same result. Ordering shader calculations so that they are identical for neighboring patches costs a lot of performance. Therefore, we do not rely on positions or flatness data at the edge of a patch to match its neighbor's. The index list-building code ensures watertightness by using only one patch's data for shared corners and edges.
Another watertight tessellation problem that must be avoided is T-junctions. T-junctions often crop up with adaptive tessellation schemes. They occur if a patch is split even though one or more of its edges is flat. If the patch sharing the flat edges is not also split, then a T-junction is created. There are different methods for avoiding T-junctions during tessellation, some of which are quite complicated. Our tessellator has the advantage that index values of all the vertices of a patch are known when building the index list. The tessellator simply replaces the index of the vertex at the junction with one that is farther from the edge and fills the gap with a triangle, as illustrated in Figure 7-8. Moving the vertex at the T-junction avoids zero-area triangles, which can cause artifacts when rendering alpha-blended surfaces.
Figure 7-8 Handling T-Junctions
Displacement mapping is a method for adding geometric complexity to a simple, lightweight model. A displacement map is a texture map containing a scalar value per texel that describes how much the surface to which it is applied should be displaced in the direction of its normal. Unlike bump or normal maps, displacement maps change the actual geometry of the surface, not just the surface normal.
There are several reasons to add displacement mapping to subdivision surfaces. First, subdivision surfaces provide a nice, smooth surface from which to displace, even when animated. Second, they are very compact. The control mesh does not need any normal and tangent data, like polygon models, because they are derived during tessellation. Third, it is easy to modify our subdivision surface shaders to support adaptive displacement mapping.
The trick to efficient displacement mapping is adding geometry where it is really needed. We can modify our flatness test so that more geometry is added if the maximum displacement is greater than the flatness threshold. Because the flatness is tested per edge, we create a table using the displacement map containing the maximum displacement on either side of each edge. Figure 7-9 shows a 2D representation of how the maximum distance is calculated for each level of subdivision. Note that all the low-resolution mesh points are located on the displaced surface, because that is where the tessellated vertices are positioned. In the flatness-test shaders, we add the edge's maximum displacement value to the error distance before comparing it to the flatness threshold. All that is left to do geometry-wise is to add the displacement from the displacement map to the vertex position in the limit shaders.
Figure 7-9 Four Steps of the Maximum Displacement Calculation
We use normal mapping to take the displacement into account when shading the surface, instead of adjusting the vertex normals. See Figure 7-10 for an example. This technique has a couple of advantages. First, normal mapping is done per pixel, so it generally allows more detailed shading than is possible with vertex normals. More important, the surface does not appear to pop when animated, as the amount of geometry changes due to adaptive tessellation. Normal mapping requires some extra attributes from the tessellator: a tangent, and normal map texture coordinates.
Figure 7-10 An Example of Displacement Mapping
We have described a method to tessellate subdivision surfaces on the GPU using subdivision with a breadth-first recursion algorithm. We described the shaders necessary to perform flatness testing, to carry out the subdivision, and to compute the limit surface attributes. We also explained how to modify the shaders to add displacement-mapping support for increasing the geometric detail of subdivision surface models.
Catmull, E., and J. Clark. 1978. "Recursively Generated B-Spline Surfaces on Arbitrary Topology Meshes." Computer Aided Design 10(6), pp. 350–355.
DeRose, T., M. Kass, and T. Truong. 1998. "Subdivision Surfaces in Character Animation." In Computer Graphics (Proceedings of SIGGRAPH 98), pp. 85–94.
Halstead, M., M. Kass, and T. DeRose. 1993. "Efficient, Fair Interpolation Using Catmull-Clark Surfaces." In Computer Graphics (Proceedings of SIGGRAPH 93), pp. 35–44.
William Donnelly
University of Waterloo
In this chapter, we present distance mapping, a technique for adding small-scale displacement mapping to objects in a pixel shader. We treat displacement mapping as a ray-tracing problem, beginning with texture coordinates on the base surface and calculating texture coordinates where the viewing ray intersects the displaced surface. For this purpose, we precompute a three-dimensional distance map, which gives a measure of the distance between points in space and the displaced surface. This distance map gives us all the information necessary to quickly intersect a ray with the surface. Our algorithm significantly increases the perceived geometric complexity of a scene while maintaining real-time performance.
Cook (1984) introduced displacement mapping as a method for adding small-scale detail to surfaces. Unlike bump mapping, which affects only the shading of surfaces, displacement mapping adjusts the positions of surface elements. This leads to effects not possible with bump mapping, such as surface features that occlude each other and nonpolygonal silhouettes. Figure 8-1 shows a rendering of a stone surface in which occlusion between features contributes to the illusion of depth.
Figure 8-1 A Displaced Stone Surface
The usual implementation of displacement mapping iteratively tessellates a base surface, pushing vertices out along the normal of the base surface, and continuing until the polygons generated are close to the size of a single pixel. Michael Bunnell presents this approach in Chapter 7 of this book, "Adaptive Tessellation of Subdivision Surfaces with Displacement Mapping."
Although it seems most logical to implement displacement mapping in the vertex shading stage of the pipeline, this is not possible because of the vertex shader's inability to generate new vertices. This leads us to consider techniques that rely on the pixel shader. Using a pixel shader is a good idea for several reasons:
A disadvantage of using the pixel shader is that we cannot alter a pixel's screen coordinate within the shader. This means that unlike an approach based on tessellation, an approach that uses the pixel shader cannot be used for arbitrarily large displacements. This is not a severe limitation, however, because displacements are almost always bounded in practice.
Parallax mapping is a simple way to augment bump mapping to include parallax effects (Kaneko et al. 2001). Parallax mapping uses the information about the height of a surface at a single point to offset the texture coordinates either toward or away from the viewer. Although this is a crude approximation valid only for smoothly varying height fields, it gives surprisingly good results, particularly given that it can be implemented with only three extra shader instructions. Unfortunately, because of the inherent assumptions of the technique, it cannot handle large displacements, high-frequency features, or occlusion between features.
Relief mapping as presented by Policarpo (2004) uses a root-finding approach on a height map. It begins by transforming the viewing ray into texture space. It then performs a linear search to locate an intersection with the surface, followed by a binary search to find a precise intersection point. Unfortunately, linear search requires a fixed step size. This means that in order to resolve small details, it is necessary to increase the number of steps, forcing the user to trade accuracy for performance. Our algorithm does not have this trade-off: it automatically adjusts the step size as necessary to provide fine detail near the surface, while skipping large areas of empty space away from the surface.
View-dependent displacement mapping (Wang et al. 2003) and its extension, generalized displacement mapping (Wang et al. 2004), treat displacement mapping as a ray-tracing problem. They store the result of all possible ray intersection queries within a three-dimensional volume in a five-dimensional map indexed by three position coordinates and two angular coordinates. This map is then compressed by singular value decomposition, and decompressed in a pixel shader at runtime. Because the data is five-dimensional, it requires a significant amount of preprocessing and storage.
The algorithm for displacement rendering we present here is based on sphere tracing (Hart 1996), a technique developed to accelerate ray tracing of implicit surfaces. The concept is the same, but instead of applying the algorithm to ray tracing of implicit surfaces, we apply it to the rendering of displacement maps on the GPU.
Suppose we want to apply a displacement map to a plane. We can think of the surface as being bounded by an axis-aligned box. The conventional displacement algorithm would render the bottom face of the box and push vertices upward. Our algorithm instead renders the top plane of the box. Then within the shader, we find which point on the displaced surface the viewer would have really seen.
In a sense, we are computing the inverse problem to conventional displacement mapping. Displacement mapping asks, "For this piece of geometry, what pixel in the image does it map to?" Our algorithm asks, "For this pixel in the image, what piece of geometry do we see?" While the first approach is the one used by rasterization algorithms, the second approach is the one used by ray-tracing algorithms. So we approach distance mapping as a ray-tracing problem.
A common ray-tracing approach is to sample the height map at uniformly spaced locations to test whether the viewing ray has intersected the surface. Unfortunately, we encounter the following problem: as long as our samples are spaced farther apart than a single texel, we cannot guarantee that we have not missed an intersection that lies between our samples. This problem is illustrated in Figure 8-2. These "overshoots" can cause gaps in the rendered geometry and can result in severe aliasing. Thus we have two options: either we accept artifacts from undersampling, or we must take a sample for each texel along the viewing ray. Figure 8-3 shows that our algorithm can render objects with fine detail without any aliasing or gaps in geometry.
Figure 8-2 A Difficult Case for Displacement Algorithms
Figure 8-3 Rendering Displaced Text
We can draw two conclusions from the preceding example:
To solve these problems, we define the distance map of the surface. For any point p in texture space and a surface S, we define a function dist(p, S) = min{d(p, q) : q in S}. In other words, dist(p, S) is the shortest distance from p to the closest point on the surface S. The distance map for S is then simply a 3D texture that stores, for each point p, the value of dist(p, S). Figure 8-4 shows a sample one-dimensional height map and its corresponding distance map.
Figure 8-4 A Sample Distance Map in Two Dimensions
This distance map gives us exactly the information we need to choose our sampling positions. Suppose we have a ray with origin p 0 and direction vector d, where d is normalized to unit length. We define a new point p 1= p 0 + dist(p 0, S) x d. This point has the important property that if p 0 is outside the surface, then p 1 is also outside the surface. We then apply the same operation again by defining p 2 = p 1+ dist(p 1, S) xd, and so on. Each consecutive point is a little bit closer to the surface. Thus, if we take enough samples, our points converge toward the closest intersection of the ray with the surface. Figure 8-5 illustrates the effectiveness of this algorithm.
Figure 8-5 Sphere Tracing
It is worth noting that distance functions do not apply only to height fields. In fact, a distance map can represent arbitrary voxelized data. This means that it would be possible to render small-scale detail with complex topology. For example, chain mail could be rendered in this manner.
Up to this point, we have discussed applying distance mapping only to planes. We would like to apply distance mapping to general meshes. We do so by assuming the surface is locally planar. Based on this assumption, we can perform the same calculations as in the planar case, using the local tangent frame of the surface. We use the local surface tangents to transform the view vector into tangent space, just as we would transform the light vector for tangent space normal mapping. Once we have transformed the viewing ray into tangent space, the algorithm proceeds exactly as in the planar case.
Now that we know how to use a distance map for ray tracing, we need to know how to efficiently compute a distance map for an arbitrary height field.
Computing a distance transform is a well-studied problem. Danielsson (1980) has presented an algorithm that runs in O(n) time, where n is the number of pixels in the map. The idea is to create a 3D map where each pixel stores a 3D displacement vector to the nearest point on the surface. The algorithm then performs a small number of sequential sweeps over the 3D domain, updating each pixel's displacement vector based on neighboring displacement vectors. Once the displacements have been calculated, the distance for each pixel is computed as the magnitude of the displacement. An implementation of this algorithm is on the book's CD.
When the distance transform algorithm is finished, we have computed the distance from each pixel in the distance map to the closest point on the surface, measured in pixels. To make these distances lie in the range [0, 1], we divide each distance by the depth of the 3D texture in pixels.
The vertex shader for distance mapping, shown in Listing 8-1, is remarkably similar to a vertex shader for tangent-space normal mapping, with two notable differences. The first is that in addition to transforming the light vector into tangent space, we also transform the viewing direction into tangent space. This tangent-space eye vector is used in the vertex shader as the direction of the ray to be traced in texture space. The second difference is that we incorporate an additional factor inversely proportional to the perceived depth. This allows us to adjust the scale of the displacements interactively.
We now have all we need to begin our ray marching: we have a starting point (the base texture coordinate passed down by the vertex shader) and a direction (the tangent-space view vector).
The first thing the fragment shader has to do is to normalize the direction vector. Note that distances stored in the distance map are measured in units proportional to pixels; the direction vector is measured in units of texture coordinates. Generally, the distance map is much higher and wider than it is deep, and so these two measures of distance are quite different. To ensure that our vector is normalized with respect to the measure of distance used by the distance map, we first normalize the direction vector, and then multiply this normalized vector by an extra "normalization factor" of (depth/width, depth/height, 1).
v2fConnector distanceVertex(a2vConnector a2v, uniform float4x4 modelViewProj, uniform float3 eyeCoord, uniform float3 lightCoord, uniform float invBumpDepth) { v2fConnector v2f; // Project position into screen space // and pass through texture coordinate v2f.projCoord = mul(modelViewProj, float4 (a2v.objCoord, 1)); v2f.texCoord = float3 (a2v.texCoord, 1); // Transform the eye vector into tangent space. // Adjust the slope in tangent space based on bump depth float3 eyeVec = eyeCoord - a2v.objCoord; float3 tanEyeVec; tanEyeVec.x = dot(a2v.objTangent, eyeVec); tanEyeVec.y = dot(a2v.objBinormal, eyeVec); tanEyeVec.z = -invBumpDepth * dot(a2v.objNormal, eyeVec); v2f.tanEyeVec = tanEyeVec; // Transform the light vector into tangent space. // We will use this later for tangent-space normal mapping float3 lightVec = lightCoord - a2v.objCoord; float3 tanLightVec; tanLightVec.x = dot(a2v.objTangent, lightVec); tanLightVec.y = dot(a2v.objBinormal, lightVec); tanLightVec.z = dot(a2v.objNormal, lightVec); v2f.tanLightVec = tanLightVec; return v2f; }
We can now proceed to march our ray iteratively. We begin by querying the distance map, to obtain a conservative estimate of the distance we can march along the ray without intersecting the surface. We can then step our current position forward by that distance to obtain another point outside the surface. We can proceed in this way to generate a sequence of points that converges toward the displaced surface. Finally, once we have the texture coordinates of the intersection point, we compute normal-mapped lighting in tangent space. Listing 8-2 shows the fragment shader.
When sampling textures, we must be careful about how to specify derivatives for texture lookups. In general, the displaced texture coordinates have discontinuities (for example, due to sections of the texture that are occluded). When mipmapping or anisotropic filtering is enabled, the GPU needs information about the derivatives of the texture coordinates. Because the GPU approximates derivatives with finite differences, these derivatives have incorrect values at discontinuities. This leads to an incorrect choice of mipmap levels, which in turn leads to visible seams around discontinuities.
Instead of using the derivatives of the displaced texture coordinates, we substitute the derivatives of the base texture coordinates. This works because displaced texture coordinates are always continuous, and they vary at approximately the same rate as the base texture coordinates.
Because we do not use the GPU's mechanism for determining mipmap levels, it is possible to have texture aliasing. In practice, this is not a big problem, because the derivatives of the base texture coordinates are a good approximation of the derivatives of the displaced texture coordinates.
Note that the same argument about mipmap levels also applies to lookups into the distance map. Because the texture coordinates can be discontinuous around feature edges, the texture unit will access a mipmap level that is too coarse. This in turn results in incorrect distance values. Our solution is to filter the distance map linearly without any mipmaps. Because the distance map values are never visualized directly, aliasing does not result from the lack of mipmapping here.
We implemented the distance-mapping algorithm in OpenGL and Cg as well as in Sh (http://www.libsh.org). The images in this chapter were created with the Sh implementation, which is available on the accompanying CD. Because of the large number of dependent texture reads and the use of derivative instructions, the implementation works only on GeForce FX and GeForce 6 Series GPUs.
f2fConnector distanceFragment(v2fConnector v2f, uniform sampler2D colorTex, uniform sampler2D normalTex, uniform sampler3D distanceTex, uniform float3 normalizationFactor) { f2fConnector f2f; // Normalize the offset vector in texture space. // The normalization factor ensures we are normalized with respect // to a distance which is defined in terms of pixels. float3 offset = normalize(v2f.tanEyeVec); offset *= normalizationFactor; float3 texCoord = v2f.texCoord; // March a ray for (int i = 0; i < NUM_ITERATIONS; i++) { float distance = f1tex3D(distanceTex, texCoord); texCoord += distance * offset; } // Compute derivatives of unperturbed texcoords. // This is because the offset texcoords will have discontinuities // which lead to incorrect filtering. float2 dx = ddx(v2f.texCoord.xy); float2 dy = ddy(v2f.texCoord.xy); // Do bump-mapped lighting in tangent space. // 'normalTex' stores tangent-space normals remapped // into the range [0, 1]. float3 tanNormal = 2 * f3tex2D(normalTex, texCoord.xy, dx, dy) - 1; float3 tanLightVec = normalize(v2f.tanLightVec); float diffuse = dot(tanNormal, tanLightVec); // Multiply diffuse lighting by texture color f2f.COL.rgb = diffuse * f3tex2D(colorTex, texCoord.xy, dx, dy); f2f.COL.a = 1; return f2f; }
In all examples in this chapter, we have used a 3D texture size of 256x256x16, but this choice may be highly data-dependent. We also experimented with maps up to 512x512x32 for complex data sets. We set the value of NUM_ITERATIONS to 16 iterations for our examples. In most cases, this was more than sufficient, with 8 iterations sufficing for smoother data sets.
Our fragment shader, assuming 16 iterations, compiles to 48 instructions for the GeForce FX. Each iteration of the loop takes 2 instructions: a texture lookup followed by a multiply-add. GeForce FX and GeForce 6 Series GPUs can execute this pair of operations in a single clock cycle, meaning that each iteration takes only one clock cycle. It should be noted that each texture lookup depends on the previous one; on GPUs that limit the number of indirect texture operations, this shader needs to be broken up into multiple passes.
We tested our distance-mapping implementation on a GeForce FX 5950 Ultra and on a GeForce 6800 GT with several different data sets. One such data set is pictured in Figure 8-6. On the GeForce FX 5950 Ultra, we can shade every pixel at 1024x768 resolution at approximately 30 frames per second. If we reduce the number of iterations to 8, we obtain approximately 75 frames per second at the same resolution. On the GeForce 6800 GT, we were able to run at a consistent 70 frames per second, even shading every pixel at 1280x1024 resolution with 16 iterations.
Figure 8-6 A Grating Rendered with Displacement Mapping
We have presented distance mapping, a fast iterative technique for displacement mapping based on ray tracing of implicit surfaces. We show that the information contained in a distance function allows us to take larger steps when rays are farther from the surface, while ensuring that we never take a step so large that we create gaps in the rendered geometry. The resulting implementation is very efficient; it converges in a few iterations, and each iteration costs only a single cycle on GeForce FX and GeForce 6 Series GPUs.
In the future we would like to use the dynamic branching capabilities of Shader Model 3 GPUs to improve the performance of our technique by taking "early outs" for pixels that converge quickly. This would allow us to increase the quality of the technique by devoting more computing power to pixels that converge more slowly.
We would also like to adapt our technique to curved surfaces. Although our algorithm can be used on any model with appropriate tangents, it results in distortion in regions of high curvature. In particular, generalized displacement mapping (Wang et al. 2004) uses a type of tetrahedral projection to account for curved surfaces, and this method can also be applied to our algorithm. Such a tetrahedral projection can be accomplished in a vertex shader (Wylie et al. 2002).
Cook, Robert L. 1984. "Shade Trees." In Computer Graphics (Proceedings of SIGGRAPH 84) 18(3), pp. 223–231.
Danielsson, Per-Erik. 1980. "Euclidean Distance Mapping." Computer Graphics and Image Processing 14, pp. 227–248.
Hart, John C. 1996. "Sphere Tracing: A Geometric Method for the Antialiased Ray Tracing of Implicit Surfaces." The Visual Computer 12(10), pp. 527–545.
Kaneko, Tomomichi, Toshiyuki Takahei, Masahiko Inami, Naoki Kawakami, Yasuyuki Yanagida, Taro Maeda, and Susumu Tachi. 2001. "Detailed Shape Representation with Parallax Mapping." In Proceedings of the ICAT 2001 (The 11th International Conference on Artificial Reality and Telexistence), Tokyo, December 2001, pp. 205–208.
Policarpo, Fabio. 2004. "Relief Mapping in a Pixel Shader Using Binary Search." http://www.paralelo.com.br/arquivos/ReliefMapping.pdf
Wang, Lifeng, Xi Wang, Xin Tong, Stephen Lin, Shimin Hu, Baining Guo, and Heung-Yeung Shum. 2003. "View-Dependent Displacement Mapping." ACM Transactions on Graphics (Proceedings of SIGGRAPH 2003) 22(3), pp. 334–339.
Wang, Xi, Xin Tong, Stephen Lin, Shimin Hu, Baining Guo, and Heung-Yeung Shum. 2004. "Generalized Displacement Maps." In Eurographics Symposium on Rendering 2004, pp. 227–234.
Wylie, Brian, Kenneth Moreland, Lee Ann Fisk, and Patricia Crossno. 2002. "Tetrahedral Projection Using Vertex Shaders." In Proceedings of IEEE Volume Visualization and Graphics Symposium 2002, October 2002, pp. 7–12.
Special thanks to Stefanus Du Toit for his help implementing the distance transform and writing the Sh implementation.