
Last Updated: 06/23/2005
Newsletter Tip Archive
Table of Contents (arranged from most recent to least recent)
To enable Transparency Multisampling in Direct3D, first check whether it is supported:
(pd3d->CheckDeviceFormat(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL,
    D3DFMT_X8R8G8B8, 0, D3DRTYPE_SURFACE,
    (D3DFORMAT)MAKEFOURCC('A', 'T', 'O', 'C')) == S_OK);
When rendering alpha-tested fragments, multisampling can be turned on by setting:
pd3dDevice->SetRenderState(D3DRS_ADAPTIVETESS_Y,
    (D3DFORMAT)MAKEFOURCC('A', 'T', 'O', 'C'));
To enable Transparency Supersampling, first check whether it is supported:
(pd3d->CheckDeviceFormat(D3DADAPTER_DEFAULT, D3DDEVTYPE_HAL,
    D3DFMT_X8R8G8B8, 0, D3DRTYPE_SURFACE,
    (D3DFORMAT)MAKEFOURCC('S', 'S', 'A', 'A')) == S_OK);
When rendering alpha-tested fragments, supersampling can be turned on by setting:
pd3dDevice->SetRenderState(D3DRS_ADAPTIVETESS_Y,
    (D3DFORMAT)MAKEFOURCC('S', 'S', 'A', 'A'));
Both modes can be turned off by setting:
pd3dDevice->SetRenderState(D3DRS_ADAPTIVETESS_Y,
    D3DFMT_UNKNOWN);
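For reference, the two format codes used above are ordinary FOURCC values: four ASCII characters packed into one 32-bit integer, lowest byte first. The sketch below re-implements that packing in portable C++ so the resulting values can be inspected. The name makeFourCC and the two constants are stand-ins chosen for illustration, not the SDK's MAKEFOURCC macro itself.

```cpp
#include <cstdint>

// Illustrative stand-in for the MAKEFOURCC macro: packs four characters
// into a 32-bit value with the first character in the lowest byte.
constexpr std::uint32_t makeFourCC(char c0, char c1, char c2, char c3)
{
    return  static_cast<std::uint32_t>(static_cast<unsigned char>(c0))
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(c1)) << 8)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(c2)) << 16)
         | (static_cast<std::uint32_t>(static_cast<unsigned char>(c3)) << 24);
}

// Hypothetical constant names for the two codes used in this tip.
constexpr std::uint32_t kFourCcAtoc = makeFourCC('A', 'T', 'O', 'C'); // transparency multisampling
constexpr std::uint32_t kFourCcSsaa = makeFourCC('S', 'S', 'A', 'A'); // transparency supersampling
```

Printing these values in hexadecimal can make it easier to recognize them in a render-state dump or debugger.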
The order in which you allocate resources is vital for graphics performance: Allocate all render targets first, sort the order of allocation by pitch (that is, width * bytes per pixel), and sort these different pitch groups based on frequency of use. Surfaces that are rendered to most frequently should be allocated first. Then create vertex and pixel shaders, and finally create all textures and vertex and index buffers. Not following these guidelines can cost you between 30% and 100% of your frame rate.
[Figure: Comparing Filtering for 64-Sample Soft Shadows. Panels: Hardware Shadow Mapping; Ordinary Percentage-Closer Filtering.]
Since the GeForce3 processor was released in 2001, NVIDIA GPUs have supported hardware-accelerated shadow map filtering.
Hardware shadow mapping, as it is commonly known, allows shadow map queries to produce smoothly-varying gray-scale results that are superior to ordinary percentage-closer filtering (PCF) or nearest filtering. And because there are dedicated transistors on-chip, the smoother result comes at extremely low cost (often for free).
So how does "hardware shadow mapping" produce such a smooth result? Regular PCF is able to produce only as many shades of gray as the number of samples that are used. This results in marked blockiness as one zooms closer to shadowed regions unless the number of samples increases dramatically (which causes a corresponding load increase in the pixel shader).
In contrast, hardware shadow mapping uses a bilinear filter on top of the usual PCF step, resulting in smooth gray values even if the viewer zooms in on the shadowed regions. As you zoom in, hardware shadow mapping is therefore able to deliver arbitrarily higher quality without taking additional samples and while maintaining the same frame rate.
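The difference between the two filters can be illustrated on the CPU. The sketch below is only a model of the idea for a single 2x2 footprint, not the hardware's actual implementation; the ShadowMap type, the function names, and the texel-unit coordinate convention are all assumptions made for the example. Ordinary PCF averages the four depth-test results with equal weights, while the bilinear variant weights the same four results by the sample's position between texel centers.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// A tiny CPU-side stand-in for a depth (shadow) map.
struct ShadowMap {
    int width;
    int height;
    std::vector<float> depth;   // row-major, width * height entries

    // Clamp-to-edge fetch.
    float at(int x, int y) const {
        x = x < 0 ? 0 : (x >= width  ? width  - 1 : x);
        y = y < 0 ? 0 : (y >= height ? height - 1 : y);
        return depth[static_cast<std::size_t>(y * width + x)];
    }
};

// Depth test for one texel: 1 = lit, 0 = in shadow.
inline float litTest(const ShadowMap& m, int x, int y, float ref) {
    return ref <= m.at(x, y) ? 1.0f : 0.0f;
}

// Ordinary 2x2 PCF: unweighted average of four test results.
// (u, v) are in texel units; the result can only be 0, 0.25, 0.5, 0.75 or 1.
float pcf2x2(const ShadowMap& m, float u, float v, float ref) {
    const int x0 = static_cast<int>(std::floor(u - 0.5f));
    const int y0 = static_cast<int>(std::floor(v - 0.5f));
    return 0.25f * (litTest(m, x0,     y0,     ref) + litTest(m, x0 + 1, y0,     ref) +
                    litTest(m, x0,     y0 + 1, ref) + litTest(m, x0 + 1, y0 + 1, ref));
}

// The same four tests, blended with bilinear weights so the result varies
// smoothly as (u, v) moves between texel centers.
float pcfBilinear2x2(const ShadowMap& m, float u, float v, float ref) {
    const float fu = u - 0.5f;
    const float fv = v - 0.5f;
    const int x0 = static_cast<int>(std::floor(fu));
    const int y0 = static_cast<int>(std::floor(fv));
    const float wx = fu - static_cast<float>(x0);
    const float wy = fv - static_cast<float>(y0);
    const float top    = (1.0f - wx) * litTest(m, x0, y0,     ref) + wx * litTest(m, x0 + 1, y0,     ref);
    const float bottom = (1.0f - wx) * litTest(m, x0, y0 + 1, ref) + wx * litTest(m, x0 + 1, y0 + 1, ref);
    return (1.0f - wy) * top + wy * bottom;
}
```

Sweeping u across a shadow edge shows the effect: pcf2x2 stays on one gray level within a texel, while pcfBilinear2x2 ramps continuously.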
Learn more about hardware shadow mapping in Section 6.2 of our GPU Programming Guide.
In OpenGL, Pixel Buffer Objects (PBOs) enable the fast path for texture uploads/downloads in the driver. There is an excellent example with source code included in the NVIDIA SDK 8.5 (and newer versions) that uses ARB-approved extensions for PBOs. This example demonstrates how to optimally use PBOs for transferring textures to and from the GPU. Learn more by reading the new 2D and Video Programming chapter of our GPU Programming Guide.
Don't use the Read Time Stamp Counter (RDTSC) to do timing in your app. Here's why: it is unusable on mobile platforms because calculations assume a fixed clock speed. Mobile platforms (such as notebooks) will throttle back the CPU at unexpected times, making the tick count from RDTSC useless. Instead, use QueryPerformanceCounter, which ensures consistent, architecture-independent results.
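A minimal stopwatch built on QueryPerformanceCounter and QueryPerformanceFrequency might look like the sketch below. The Stopwatch class and its method names are invented for this example. The std::chrono branch is not part of the tip; it is included only so that the sketch also compiles on non-Windows systems.

```cpp
#include <cstdint>
#ifdef _WIN32
#  include <windows.h>
#else
#  include <chrono>   // fallback only, so the sketch builds off Windows
#endif

// Hypothetical helper: measures elapsed wall-clock time in seconds.
class Stopwatch {
public:
    Stopwatch() : ticksPerSecond_(queryFrequency()), start_(queryTicks()) {}

    void restart() { start_ = queryTicks(); }

    double elapsedSeconds() const {
        return static_cast<double>(queryTicks() - start_) /
               static_cast<double>(ticksPerSecond_);
    }

private:
    static std::int64_t queryFrequency() {
#ifdef _WIN32
        LARGE_INTEGER frequency;
        QueryPerformanceFrequency(&frequency);   // counter ticks per second
        return frequency.QuadPart;
#else
        return 1000000000LL;                     // the fallback counts nanoseconds
#endif
    }

    static std::int64_t queryTicks() {
#ifdef _WIN32
        LARGE_INTEGER ticks;
        QueryPerformanceCounter(&ticks);
        return ticks.QuadPart;
#else
        using namespace std::chrono;
        return duration_cast<nanoseconds>(
                   steady_clock::now().time_since_epoch()).count();
#endif
    }

    std::int64_t ticksPerSecond_;
    std::int64_t start_;
};
```

Typical use is to construct a Stopwatch before the code being measured and read elapsedSeconds() afterwards.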
Compile as many shaders as you can up front at load time. Doing dynamic compiles is bad, especially as shaders get longer and more complex. If you MUST do dynamic compiles, try to space them out over many frames. Don't turn a corner to compile 150 new shaders and expect your frame rate to be steady -- it won't be. Just as you "stream" in texture resources in anticipation of future need, look for similar strategies for your shader resources.
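One way to space dynamic compiles over many frames is a small job queue that runs only a few compile jobs per frame. The sketch below is an illustration under assumptions: DeferredCompileQueue and its methods are hypothetical names, and each job stands for whatever call your application uses to compile one shader.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical helper: holds pending shader-compile jobs and runs only a
// few of them per frame so that no single frame pays for all of them.
class DeferredCompileQueue {
public:
    void enqueue(std::function<void()> compileJob) {
        pending_.push_back(std::move(compileJob));
    }

    // Call once per frame; runs at most maxJobs jobs and returns how many ran.
    std::size_t runSome(std::size_t maxJobs) {
        std::size_t ran = 0;
        while (ran < maxJobs && !pending_.empty()) {
            std::function<void()> job = std::move(pending_.front());
            pending_.pop_front();
            job();
            ++ran;
        }
        return ran;
    }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    std::deque<std::function<void()>> pending_;
};
```

Queue the jobs as soon as the need can be anticipated (for example, when the player approaches a new area) and call runSome with a small budget each frame.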
When you're trying to debug your application and find potential bottlenecks, there are some simple steps you should take to ensure accurate measurements:
- Verify that the application runs cleanly. For example, when the application runs with Microsoft’s DirectX Debug runtime, it should not generate any errors or warnings.
- Ensure that the test environment is valid. That is, make sure you are running release versions of the application and its DLLs, as well as the release runtime of the latest version of DirectX.
- Use release versions (not debug builds) for all software.
- Make sure all display settings are set correctly. Typically, this means that they are at their default values. Anisotropic filtering and antialiasing settings particularly influence performance.
- Disable vertical sync. This ensures that your frame rate is not limited by your monitor’s refresh rate.
- Run on the target hardware. If you’re trying to find out if a particular hardware configuration will perform sufficiently, make sure you’re running on the correct CPU, GPU, and with the right amount of memory on the system. Bottlenecks can change significantly as you move from a low-end system to a high-end system.
Learn more by reading our GPU Programming Guide.
There are many ways to generate mipmaps. Mipmaps control how your objects are going to look at a distance, so getting them to look as good as possible is important. You can use the following tips with our Adobe Photoshop plug-in.
Filtering. What filtering method should you use? Many artists just use the box filter, but other filters give better results. The Mitchell filter, for example, gives very high quality results. Other filters, such as Catrom, Gaussian, Bessel, and more, are also provided for experimentation. Filters can be scaled artificially to blur or sharpen them.
Sharpening. Sharpening a mipmap after it has been generated can dramatically improve the look of your characters. This feature is especially useful for filters that tend to over-blur as they create the smaller mipmap levels.
Compression. Compression can improve the look of your character too, allowing you to squeeze greater amounts of detail into the same area.
Fading. Fading mipmaps to a specific color can create the look of fog.
Dithering/Error Diffusion. Adding dithering and/or error diffusion to textures with smooth gradients can help the look of compressed textures.
NVPerfHUD is a powerful performance analysis tool that helps you understand the internal functions of your application. To ensure that unauthorized third parties do not analyze your application without your permission, you must make a minor modification to enable NVPerfHUD analysis.
One of the first things you do when setting up your graphics pipeline is call the Direct3D CreateDevice() function to create your display device. In your application it probably looks something like this:
HRESULT Res;
Res = g_pD3D->CreateDevice( D3DADAPTER_DEFAULT,
                            D3DDEVTYPE_HAL,
                            hWnd,
                            D3DCREATE_HARDWARE_VERTEXPROCESSING,
                            &d3dpp,
                            &g_pd3dDevice );
When your application is launched by NVPerfHUD, a special “NVIDIA NVPerfHUD” adapter is created. Your application can give NVPerfHUD permission to analyze it by selecting this adapter. In addition, since some applications might select the “NVIDIA NVPerfHUD” adapter ID unintentionally and expose themselves to unauthorized analysis, you must select “D3DDEVTYPE_REF” as the device type. Your application will not actually use the reference rasterizer as long as you have selected the NVPerfHUD adapter.
A minimal code change that will enable NVPerfHUD analysis in your application would be something like this:
HRESULT Res;
Res = g_pD3D->CreateDevice( g_pD3D->GetAdapterCount()-1,
                            D3DDEVTYPE_REF,
                            hWnd,
                            D3DCREATE_HARDWARE_VERTEXPROCESSING,
                            &d3dpp,
                            &g_pd3dDevice );
Using the last adapter (by calling GetAdapterCount()-1 as shown above) assumes that the “NVIDIA NVPerfHUD” adapter identifier created by NVPerfHUD will be the last in the list.
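If you would rather not rely on the adapter being last in the list, one alternative is to search the adapter descriptions for the NVPerfHUD name. The sketch below shows only the selection logic in portable C++ and assumes the description string contains "NVPerfHUD", as the adapter name quoted above suggests. The function name findPerfHudAdapter is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the index of the first adapter whose description mentions
// "NVPerfHUD", or -1 if there is none. In a Direct3D 9 application the
// descriptions would come from IDirect3D9::GetAdapterIdentifier (the
// Description field of D3DADAPTER_IDENTIFIER9), one per adapter index.
int findPerfHudAdapter(const std::vector<std::string>& adapterDescriptions)
{
    for (std::size_t i = 0; i < adapterDescriptions.size(); ++i) {
        if (adapterDescriptions[i].find("NVPerfHUD") != std::string::npos)
            return static_cast<int>(i);
    }
    return -1;
}
```

When the function returns -1, the application would create its device as usual with D3DADAPTER_DEFAULT and D3DDEVTYPE_HAL; otherwise it would pass the returned index together with D3DDEVTYPE_REF.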
Learn more about NVPerfHUD by reading the User Guide that accompanies it. (Chapter 2 of the User Guide expands on the preceding discussion.)
Because the GeForce 6 Series architecture unconditionally supports non-power-of-two (NP2) textures, the D3DPTEXTURECAPS_NONPOW2CONDITIONAL caps-bit (which indicates "conditional" textures that only support CLAMP addressing and don't support mipmaps or compressed textures) is NOT exposed. Thus, applications querying for NP2 support must take care to detect both conditional NP2 support for pre-GeForce 6 architectures and unconditional NP2 support, as provided by the GeForce 6 architecture.
Instead of simply testing for D3DPTEXTURECAPS_NONPOW2CONDITIONAL, i.e.,
// Incorrect test for any kind of NP2 support:
if ((pCaps->TextureCaps &
     D3DPTEXTURECAPS_NONPOW2CONDITIONAL) == 0)
{
    MessageBox(NULL, "Device does not support NP2 textures!",
               "ERROR", MB_OK|MB_SETFOREGROUND|MB_TOPMOST);
    return E_FAIL;
}
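an application must test for conditional as well as unconditional NP2 support. When the D3DPTEXTURECAPS_POW2 caps-bit is clear, NP2 textures are supported unconditionally; when it is set, NP2 textures are supported only if D3DPTEXTURECAPS_NONPOW2CONDITIONAL is also set, i.e.,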
// Correct test for any kind of NP2 support:
// If both unconditional and conditional support is
// unavailable then fail.
if ( ((pCaps->TextureCaps & D3DPTEXTURECAPS_POW2) != 0) &&
     ((pCaps->TextureCaps &
       D3DPTEXTURECAPS_NONPOW2CONDITIONAL) == 0))
{
    MessageBox(NULL, "Device does not support NP2 textures!",
               "ERROR", MB_OK|MB_SETFOREGROUND|MB_TOPMOST);
    return E_FAIL;
}
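The same test can be written as a three-way classification, which is convenient when the application wants to pick different texture paths for conditional and unconditional support. This is a sketch only: the enum, the function name, and the two constants are local stand-ins, and the bit values are reproduced from memory of d3d9caps.h, so treat them as assumptions and use the SDK header's constants in real code.

```cpp
#include <cstdint>

// Local stand-ins for the SDK constants (values assumed, see note above).
constexpr std::uint32_t kCapsPow2               = 0x00000002u; // D3DPTEXTURECAPS_POW2
constexpr std::uint32_t kCapsNonPow2Conditional = 0x00000100u; // D3DPTEXTURECAPS_NONPOW2CONDITIONAL

enum class Np2Support { None, Conditional, Unconditional };

// Classifies the NP2 support described by a TextureCaps bit field.
Np2Support classifyNp2Support(std::uint32_t textureCaps)
{
    if ((textureCaps & kCapsPow2) == 0)
        return Np2Support::Unconditional;   // no power-of-two restriction at all
    if ((textureCaps & kCapsNonPow2Conditional) != 0)
        return Np2Support::Conditional;     // NP2 allowed, with restrictions
    return Np2Support::None;                // power-of-two textures only
}
```

Only the None result corresponds to the failure case in the test above.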
To minimize the chance of your application thrashing video memory, the best way to allocate shaders and render targets is:
1. Allocate render targets first
- Sort the order of allocation by pitch (width * bpp).
- Sort the different pitch groups based on frequency of use. The surfaces that are rendered to most frequently should be allocated first.
2. Create vertex and pixel shaders
3. Load remaining textures
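The render-target ordering in step 1 can be computed ahead of time. The sketch below takes one possible reading of the rule: group surfaces by pitch, rank each pitch group by its most frequently used member, and put the busiest surface first within each group. The RenderTargetDesc type, the usesPerFrame field, and the function name are assumptions made for the example.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Hypothetical description of a render target that is about to be created.
struct RenderTargetDesc {
    std::string name;
    int width;
    int bytesPerPixel;
    int usesPerFrame;   // how often the surface is rendered to (assumed >= 0)

    int pitch() const { return width * bytesPerPixel; }
};

// Returns the targets in the order in which they would be allocated.
std::vector<RenderTargetDesc> allocationOrder(std::vector<RenderTargetDesc> targets)
{
    // pitch -> highest usesPerFrame found in that pitch group
    std::map<int, int> groupUse;
    for (const RenderTargetDesc& t : targets) {
        int& best = groupUse[t.pitch()];
        best = std::max(best, t.usesPerFrame);
    }

    std::stable_sort(targets.begin(), targets.end(),
        [&groupUse](const RenderTargetDesc& a, const RenderTargetDesc& b) {
            const int ga = groupUse.at(a.pitch());
            const int gb = groupUse.at(b.pitch());
            if (ga != gb) return ga > gb;                              // busiest pitch group first
            if (a.pitch() != b.pitch()) return a.pitch() > b.pitch();  // keep each group contiguous
            return a.usesPerFrame > b.usesPerFrame;                    // busiest surface first
        });
    return targets;
}
```

The application would then create the surfaces in the returned order before creating any shaders or loading the remaining textures.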
In the never-ending quest for greater realism in games, there is one factor that most game developers overlook: people see the real world with two eyes. While (artificial) stereoscopic viewing, that is, on a screen versus in real life, does not have a huge following, it nevertheless has a following. Many gamers enjoy the extra sense of presence obtained by playing games with inexpensive shutter glasses along with the stereo override driver from NVIDIA.
That aside, there is plenty of value in viewing your game in stereo during development in order to pick up on things that will look "fake" even when not viewing in stereo. Keep in mind that motion parallax gives visual cues similar to stereo; the difference is that a stereo viewer perceives instantaneously what users see when they move around and obtain depth information via motion parallax.
An old example is a tree built from two vertical textures placed perpendicular to each other. In stereo this effect does not sell. It would be better to have a more varied approach, which many developers are now implementing.
In a sense, using stereoscopic viewing while developing a game puts you ahead of the pack because you will see and correct these visual defects before they even come out in a game. This, of course, will also enhance the experience of those who play your game in stereo.
For more stereo tips, check out our 3D Stereoscopic Development Guide. In addition, we'll have live stereo demos at GDC 2004 so that you can experience state-of-the-art stereo viewing first-hand.
The DirectX API allows graphics drivers to buffer up to three frames in the command queue of the GPU. Such a large buffer enables CPU and GPU to work in parallel even as workload on CPU and GPU varies. If there was no buffer then the GPU would become idle as soon as the CPU reduced its graphics command output (for example, because it was solving physics equations) and conversely the CPU would become idle whenever it wanted to send another graphics command and the GPU was still busy rendering a previously submitted graphics command.
On the other hand, allowing the driver to buffer three frames worth of data also means that lag (the time between a user giving input and seeing its effect on-screen) increases by up to three frames.
Several solutions exist to limit lag, in case lag becomes problematic. Locking the back-buffer is a solution, but it is a particularly bad one; see our tip in Developer Newsletter #7 for why it is inadvisable.
1. For games that mainly interact via a cursor, such as real-time strategy games, it is often sufficient to simply reduce the lag of the cursor. GPUs have specialized hardware-supported cursors that can be updated independent of (that is, more frequently than) the rendered scene. For more details, see the DirectX documentation for the methods:
IDirect3DDevice9::ShowCursor, IDirect3DDevice9::SetCursorPosition, and IDirect3DDevice9::SetCursorProperties.
2. Another solution is to use DirectX event queries: DirectX allows the insertion of tokens, called "events," into the command buffer and then lets you check whether the event has been processed. For example, at start-up time create an event query via
IDirect3DQuery9 *pQuery;
device->CreateQuery(D3DQUERYTYPE_EVENT, &pQuery);
Then just before calling Present(), insert the event into the command buffer:
pQuery->Issue(D3DISSUE_END);
To limit the number of buffered frames to at most one, check at the end of the next frame that the query has been processed. If it has not, spin until it has:
BOOL data;
while (pQuery->GetData(&data, sizeof(data), D3DGETDATA_FLUSH) == S_FALSE);
Because we can track multiple events in parallel and because we can insert and query these events from anywhere in the frame, we can finely regulate the maximum number of buffered frames: it is possible to buffer anything from fractional frames (a buffer that is at most half a frame) to 1, 2, or 2.5 frames. The main disadvantage of this technique is that the application is actively spinning while waiting for an event to be processed (see the while loop above). Spinning like this can waste precious CPU cycles.
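Tracking several events at once can be organized as a ring of queries, one per frame that is allowed to be in flight. The sketch below is written against a generic Query type so that it stays portable; with Direct3D 9, issue() would wrap IDirect3DQuery9::Issue(D3DISSUE_END) and done() would return true once GetData with D3DGETDATA_FLUSH stops returning S_FALSE. The class names and the PolledQuery stand-in are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in query for exercising the limiter without a GPU: after each
// issue() it reports "not done" for the first notDonePolls polls.
struct PolledQuery {
    int notDonePolls;
    int pollsLeft;
    int timesIssued;

    void issue() { pollsLeft = notDonePolls; ++timesIssued; }
    bool done()  { if (pollsLeft > 0) { --pollsLeft; return false; } return true; }
};

// Limits the number of buffered frames to the number of queries supplied.
template <typename Query>
class FrameLatencyLimiter {
public:
    explicit FrameLatencyLimiter(std::vector<Query> queries)
        : queries_(std::move(queries)), issued_(queries_.size(), false), next_(0)
    {
        assert(!queries_.empty());
    }

    // Call just before Present().
    void endFrame() {
        // The slot about to be reused belongs to the frame submitted
        // queries_.size() frames ago; wait until the GPU has finished it.
        if (issued_[next_]) {
            while (!queries_[next_].done()) { /* spin */ }
        }
        queries_[next_].issue();
        issued_[next_] = true;
        next_ = (next_ + 1) % queries_.size();
    }

    const Query& query(std::size_t slot) const { return queries_[slot]; }

private:
    std::vector<Query> queries_;
    std::vector<bool>  issued_;
    std::size_t        next_;
};
```

With a single query this reduces to the one-frame limit described above; the spin loop carries the same CPU cost the tip warns about.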
Microsoft's HLSL compiler (fxc.exe) adds chip-specific optimizations based on the profile that you're compiling for. If your shaders require ps_2_0 or higher, you should use the ps_2_a profile, which is a superset of ps_2_0 functionality that directly corresponds to the GeForce FX family. Compiling to the ps_2_a profile, in conjunction with our latest ForceWare drivers, will probably give you better performance than compiling to the generic ps_2_0 profile. Please note that the ps_2_a profile was only available starting with the DirectX 9.0a release this past summer.
Microsoft's IDxDiagContainer interface provides a convenient means for an application to obtain hardware and software information about a user's system. It is a COM interface easily run from within a C++ application. The interface can be queried for a specific property, or it can enumerate all properties available. For example, you can query for the graphics driver version using GetProp("szDriverVersion", &result). Also, you can query for the amount of physical video memory on the graphics card using szDisplayMemoryEnglish, which will return a string like "128 MB." The NVIDIA Developer Website will soon feature a convenient C++ class encapsulating this interface, making it even easier to add to your application.
Achieving high performance is all about removing bottlenecks, which really means that you have to balance every piece of the pipeline: the CPU, the AGP bus, and the pieces of the graphics pipeline in the GPU. The decision to use a vertex shader or a pixel shader depends on a few factors:
- How tessellated are your objects? You may want to lighten the load on the vertex shader if you have millions of vertices in each frame. This is especially true if you're using a multipass algorithm.
- What resolution are you targeting? If you expect your application to be run at higher resolutions, the pixel shader is more likely to become the bottleneck. So, you may want to push more computations to the vertex shader.
- How long are your pixel shaders? If you're doing complex shading, the pixel shader will probably be your bottleneck. If your pixel shaders compile to more than 20 instructions (on average) and occupy more than half the screen, your application will likely be pixel shader-bound. So, look for opportunities to move calculations to the vertex shader. For example, you can move from world space to light space for attenuation. Or, if you're doing bump mapping, you can make the move into tangent space per-vertex, unless you're doing per-pixel reflection into a cube map.
Are you taking advantage of everything your GPU can do? Our nView multi-display technology provides a revolutionary way to multi-task and process information more easily. Instead of stacking window upon window within the confines of a single display, imagine spreading your work across multiple displays. Financial analysts can have a monitor for tracking each data stream. Graphic artists can use an entire display for palettes, and another for editing. The possibilities are endless.
nView is seamlessly integrated within the Microsoft Windows environment, helping users to maximize productivity through advanced desktop and application management. nView also provides increased efficiency on a single monitor by enabling multiple Windows desktops, quicker access to hidden windows with transparency and window rollups, and hotkeys for access to all nView functions. nView provides a quick and easy way for you to manage multiple Windows desktops, thereby increasing your efficiency and enabling you to see what you've been missing.
Learn more about how nView can help you at http://www.nvidia.com/view.asp?IO=feature_nview.
Take advantage of the NVIDIA Texture Tool Suite to simplify your life when dealing with texture manipulation. Here are some tips to help you as you use the compression plug-in, library, and standalone nvDXT.exe:
- If your texture is mostly grey scale, select the Grey Scale button. Compression favors the green color more than red and blue. Selecting this option weights all the color channels equally.
- Dithering can improve the look of many textures when using compression.
- Gamma space filtering can improve your mipmaps. This creates mipmaps that take gamma correction into account so your mipmaps look better. In the MIP Map Generation (Filter) area, select Kaiser (Gamma Space). Please note that this filter may take a while to run on large textures.
- After mipmaps have been generated, you can apply a sharpening filter to reduce the amount of blurriness in your mipmaps. In the Sharpen After Filtering area, select Sharpen. Press Sharpen Settings to change the sharpening parameters.
  - Edge Radius. Width of edge detection.
  - Lambda. Strength of sharpening.
  - Clamping Value. Values that stop texels from drifting. Keep this greater than zero.
  - Theta. Used in anisotropic filtering. Blend value from source to destination image.
  - Non-Maximal Suppression. Add clamping to filter.
  - Two Components. Create two images and process them.
  - Anisotropic. Alternate version of warp sharp.
  - SharpBlur. Used in anisotropic filtering. Sharpen the warped image.
- The sharpening filter is described in the paper Enhancement by Image-Dependent Warping, by Nur Arad and Craig Gotsman.
Did you know that our Cg Toolkit comes with more than 80 sample shaders? Here are some interesting shaders to check out:
- check3d.fx shows procedural anti-aliasing
- uberCTB2.fx shows how to do multiply-lit multitextured BRDF shading
- MrWiggle.fx and blobCT.fx show how to do vertex animation
- wood.fx shows a classic procedural texture
- screentest.fx presents a workbench for image processing
- durer.fx shows a procedural and anti-aliased NPR (non-photorealistic) technique
- ghostly2.fx shows a technique that's great for x-ray and similar effects
This issue's coding tip is more of an informational tip. Because storing normal maps is an issue that any developer will encounter when implementing per-pixel lighting, we thought it would be useful to briefly survey model-space normal maps and tangent-space normal maps, as well as some of the formats available for storing them.
Per-pixel lighting computes the lighting equation individually for each pixel. Each pixel therefore needs access to its individual inputs for the lighting equation. In particular, per-pixel surface normals are needed, which are stored in textures.
When storing normals in a texture you can choose between several texture formats. For example, you can use the red, green, and blue channels of a R8G8B8A8 format to store the three components of a normal, or you can use the specialized HILO formats. HILO8 requires only 16 bits of storage per normal because it stores only the x and y components of each normal, in 8 bit precision. The GPU computes the z component on the fly--since a normal is of length one, z = sqrt(1 - x^2 - y^2). Similarly, HILO16 requires only 32 bits of storage per normal. It stores the x and y components of each normal with 16 bit precision and computes the z component on the fly. Yet another alternative is to use a format like R5G6B5, which uses 5 bits for red, 6 bits for green, and 5 bits for blue, to conserve storage.
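The z reconstruction used by the two-component formats can be written out directly. The sketch below assumes x and y have already been decoded to the range [-1, 1]; the function name is hypothetical, and the clamp is an addition to guard against quantization error rather than something the text prescribes.

```cpp
#include <algorithm>
#include <cmath>

// Rebuilds the z component of a unit-length normal from its x and y
// components, taking the non-negative root: z = sqrt(1 - x^2 - y^2).
float reconstructNormalZ(float x, float y)
{
    // Quantized inputs can push x*x + y*y slightly above 1; clamp so the
    // square root stays defined.
    const float zSquared = std::max(0.0f, 1.0f - x * x - y * y);
    return std::sqrt(zSquared);
}
```

Because only the non-negative root is recovered, this scheme suits normals whose z component never points backwards.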
In addition to choosing a storage format for the normal, you also have a choice of which coordinate system to store the normal in. The choice of coordinate system influences how efficiently and how accurately you can compute the lighting equation. Obvious choices are tangent space and model space.
Tangent-space normals use the coordinate system of the model's surface. This implies that the z-component of each normal is always positive (because a surface normal always points away from the model's surface). Because each normal is in tangent space, it must be transformed into the space of the other vectors used in the lighting equation before use. Alternatively, the other vectors may be transformed to tangent space. Transforming a tangent-space normal into model, world, or light space requires a 3x3 matrix multiplication per pixel. It may be more efficient to compute lighting in tangent space. Because the normal is already in tangent space, only the light vectors need to be transformed to tangent space. Since the light vectors do not change per-pixel and the tangent-space transformation only changes significantly per vertex, it is sufficient to pretransform all light vectors to tangent space in the vertex shader. This transformation is accomplished by a 3x3 matrix multiplication for every light vector.
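The per-vertex 3x3 multiplication amounts to three dot products against the tangent-space basis. The sketch below assumes the basis vectors (tangent, binormal, normal) are orthonormal and expressed in the same space as the light vector; the Vec3 type and function names are made up for the example.

```cpp
struct Vec3 {
    float x, y, z;
};

inline float dot(const Vec3& a, const Vec3& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Expresses v in the tangent space spanned by (tangent, binormal, normal).
// The three basis vectors form the rows of the 3x3 matrix, so the matrix
// multiplication reduces to three dot products.
Vec3 toTangentSpace(const Vec3& tangent, const Vec3& binormal,
                    const Vec3& normal, const Vec3& v)
{
    return Vec3{ dot(tangent, v), dot(binormal, v), dot(normal, v) };
}
```

In a vertex shader the same three dot products would be applied to each light vector, and the result passed on to the pixel shader.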
However, if the vertex shader performs matrix palette skinning, then each bone requires its own pre-transformed version of the light-vector. A four bone vertex shader, for example, requires that a light vector be transformed into four different tangent spaces before it can be averaged into a single tangent space light vector to be passed to the pixel shader.
Model-space normals use the coordinate system of the model. Their z-component therefore is free to range arbitrarily from -1 to 1. It becomes nearly impossible to tile model-space normal maps across a model, because the model surface is unlikely to be constant in model space. Model space, however, has the advantage that it is invariant per-pixel and per-vertex, meaning that model-space light vectors can be precomputed. Therefore, computing the lighting equation per-pixel in model space is efficient because it doesn't require coordinate system conversions for any vectors. This computational saving can be substantial: 3 per-pixel instructions or up to 12 per-vertex instructions if doing 4 bone matrix palette skinning. Model-space normal maps however cannot take advantage of the HILO formats.
In summary, tangent-space normal maps can save storage (because they can be tiled, and because they can be stored in specialized formats), while model-space normal maps may be computationally more efficient. The choice you make for your application should depend on your content creation tools, storage requirements, and performance goals.
Use 3ds max's viewport background capability to place a screenshot of your game behind any CgFX model you are working with inside of MAX. This should help to ensure that your shader matches the overall look of your game. A second idea is to place a specially prepared bitmap in your viewport background that contains color and tone swatches that you would like to match with your shaders. This is a simple and practical way to ensure that shaders and texture maps fall within specific color and tonal ranges; great for creating consistent art assets.
You may have heard that reading buffers (such as the color back-buffer or a render-target texture) is a bad idea if you want to get optimal performance. There are actually three problems.
First, reading back any buffer forces the driver to finish up the current list of rendering commands and to stall the graphics pipeline, preventing new commands from executing. This is necessary to ensure the buffer is in a consistent and valid state when it is returned to the application.
Second, reading the returned buffer requires that the data transfers from AGP to system-memory. The AGP-bus connecting these two types of memory is fast when data is flowing from system- to AGP-memory, but unfortunately much, much slower for transferring data the other way. (This problem will be addressed by the PCI Express bus.)
Third, and most significant, when the driver is waiting for the graphics pipeline to finish (see above), it idles the CPU! This is because the CPU has to wait to get the results of the read back before it can continue processing.
The bottom line is simple: look for ways to avoid read backs whenever possible -- perhaps by using render-to-texture or reworking your algorithms. In return, you'll get better performance.
When rendering using the hardware transform-and-lighting (TnL) pipeline or vertex-shaders, the GPU intermittently caches transformed and lit vertices. Storing these post-transform and lighting (post-TnL) vertices avoids recomputing the same values whenever a vertex is shared between multiple triangles and thus saves time. The post-TnL cache increases rendering performance by up to 2x.
Because the post-TnL cache is limited in size, taking maximum advantage of it requires rearranging triangle rendering-order. The easiest way to rearrange triangle-order for the post-TnL cache is to use the NVTriStrip library (.LIBs and source are available for free: http://developer.nvidia.com/view.asp?IO=nvtristrip_library).
If you are interested in exploring the performance characteristics of the post-TnL cache yourself, here are the details: The post-TnL cache is a strict First-In-First-Out buffer, and varies in size from effectively 10 (actual 16) vertices on GeForce 256, GeForce 2, and GeForce 4 MX chipsets to effectively 18 (actual 24) on GeForce 3 and GeForce 4 Ti chipsets. Non-indexed draw-calls cannot take advantage of the cache, as it is then impossible for the GPU to know which vertices are shared.
The following example explores how these restrictions translate into optimally rendering a 100x100 vertex mesh. The mesh needs to be submitted in a single draw-call to optimize batch-size. The draw-call must be with an indexed primitive-type (see above), either strips or lists -- the performance difference between strips and lists is negligible when taking advantage of the post-TnL cache. For illustration purposes, our example uses strips.
Rendering the mesh as 99 strips running along each row of triangles and stitching them together into a single strip with degenerate triangles only marginally takes advantage of the post-TnL cache. Only two vertices in each triangle hit the post-TnL cache (i.e., one vertex is transformed and lit per triangle).
Let's limit the length of each row-strip to at most 2*16 = 32 triangles. The mesh thus separates into ceil(99/16) = 7 columns of strips, each no longer than 32 triangles. Rendering all row-strips in column 0 first, from top to bottom and connected via degenerate triangles, allows an 18 entry post-TnL cache to store not only the last two vertices for each triangle but also the whole top row of vertices for each row-strip. Thus, only 1 vertex for every 2 triangles needs to be computed. For the vertices to be in just the right order such that the top row of vertices is in the post-TnL cache, the top-row of vertices of each column should be sent as a list of degenerate triangles.
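To compare triangle orderings yourself, it can help to simulate the cache on the CPU. The sketch below models a strict FIFO as described above and counts how many vertices would have to be transformed for a given index list. It is a simplified model under assumptions (for example, a hit does not refresh an entry's position), and the function name is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <vector>

// Counts the vertices that miss a strict FIFO post-TnL cache of the given
// size, that is, the vertices that must be transformed and lit.
std::size_t countVertexTransforms(const std::vector<unsigned>& indices,
                                  std::size_t cacheSize)
{
    std::deque<unsigned> fifo;
    std::size_t transforms = 0;
    for (unsigned index : indices) {
        if (std::find(fifo.begin(), fifo.end(), index) != fifo.end())
            continue;                 // cache hit: the FIFO order is unchanged
        ++transforms;                 // cache miss: the vertex is transformed
        fifo.push_back(index);
        if (fifo.size() > cacheSize)
            fifo.pop_front();         // evict the oldest entry
    }
    return transforms;
}
```

Running a mesh's index buffer through this function with cache sizes of 10 and 18 (the effective sizes quoted above) gives a rough estimate of how well a given ordering uses the cache.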
The alpha blending capabilities of Direct3D and OpenGL provide you with the ability to perform additional multiplications that could save you from having to add a second pass to achieve certain effects. This can be particularly useful with older hardware such as GeForce 256, which does not support pixel shaders. For example, for per-pixel lighting, it would be impossible to do something as simple as
((L dot N) + Ambient) * DiffuseTexture
in Direct3D. So just adding a scalar ambient term to your lighting calculations would seem to cost you an extra pass. However, you can do it in one pass, using the following technique:
- Use the first color combiner to calculate (L dot N)
- Use the second color combiner to set DiffuseTexture
- Use the second alpha combiner to take the result of the previous stage, (L dot N), and add a constant (Ambient)
Now the output to the alpha blending stage of the pipeline is:
- SourceColor: an RGB color coming from the DiffuseTexture
- SourceAlpha: an alpha value equal to (L dot N) + Ambient
Next you can configure the blending parameters to perform the final multiplication. By setting the source blend factor to SRCALPHA and the destination blend factor to ZERO, the blend operation will calculate SourceColor*SourceAlpha + DestinationColor*0, which gives us:
SourceColor * SourceAlpha
which, in our case, is:
((L dot N) + Ambient) * DiffuseTexture
This is what we wanted in the first place.
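The arithmetic of the blend stage can be checked with a small CPU model. The sketch below only mirrors the equations in this tip; it makes no Direct3D or OpenGL calls. The Rgb type and function names are hypothetical, and the clamp of the alpha value to [0, 1] is an assumption about how the fixed-function pipeline saturates combiner output.

```cpp
#include <algorithm>

struct Rgb {
    float r, g, b;
};

// Models the alpha combiner: (L dot N) + Ambient, saturated to [0, 1].
float lightingAlpha(float nDotL, float ambient)
{
    return std::min(1.0f, std::max(0.0f, nDotL + ambient));
}

// Models the blend with source factor SRCALPHA and destination factor ZERO:
// result = SourceColor * SourceAlpha + DestinationColor * 0.
Rgb blendSrcAlphaZero(const Rgb& source, float sourceAlpha, const Rgb& destination)
{
    return Rgb{ source.r * sourceAlpha + destination.r * 0.0f,
                source.g * sourceAlpha + destination.g * 0.0f,
                source.b * sourceAlpha + destination.b * 0.0f };
}
```

Feeding the diffuse texel in as the source color and lightingAlpha(...) as the source alpha reproduces ((L dot N) + Ambient) * DiffuseTexture, with the destination contributing nothing.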
When using NV_occlusion_query, don't request the results immediately after making the query. You must structure your application code to hide the latency of the query. Here's how you could do this:
- initialize the depth buffer with objects that were not fully occluded in the previous frame
- do occlusion queries on these objects
- try to sort this rendering to get the best occlusion information
- mask depth buffer writes and draw objects that were fully occluded in the previous frame
- do occlusion queries on these objects
- do color rendering (shading, shadows, whatever) for all objects that were not fully occluded in the previous frame
This approach allows you to hide the latency of the query over a frame, but it means that sometimes you don't draw the first frame for an object that becomes visible. More advanced approaches can be used to eliminate this problem, but they are beyond the scope of this tip.
Loosely render from front to back as much as possible. For example, you might render your characters first, then the terrain they stand on, and finally the sky. This reduces pixel shading computation time for architectures that perform pre-z-culling (GeForce3 and above) and saves memory bandwidth because of fewer color and z writes to the frame buffer! Don't worry about explicitly sorting each piece of geometry from front to back -- a loose sort is all that's needed.