While developing and playing PC games on Windows/WDDM, it is common for stuttering (uneven frame times) to start happening when enabling graphics features or increasing the screen resolution. There are many possible root causes for stuttering; one of the most common is video-memory overcommitment, which happens when an application uses more video memory than is physically available on the GPU. In this article, I describe a method we have been using at NVIDIA to determine whether video-memory overcommitment is happening and is causing stuttering on Windows Vista, 7, 8 or 8.1. (The method described in this article may not apply to Windows 10/WDDMv2, which has a different memory residency model.)

I assume that you already have a way to monitor the CPU & GPU frame times in your game engine (and/or you are using FRAPS to measure the CPU frame times), and you have identified that your game is stuttering badly at some location, on a certain GPU, with specific graphics settings. In a blog article and GDC 2015 talk, Iain Cantlay discussed how one can quantify stuttering by using frame-time percentiles. Various possible causes & fixes for stuttering were discussed by Cem Cebenoyan at GDC China 2012, John McDonald at GDC 2014, and Iain Cantlay at GDC 2015. Video-memory overcommittement is one of the most common causes of stuttering.
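
For reference, computing such frame-time percentiles from a buffer of measured frame times is straightforward. Here is a minimal sketch (the function name and interface are made up for illustration and are not taken from any particular engine or from the talks mentioned above):

#include <algorithm>
#include <vector>

// Returns the given percentile (e.g. 50.0 or 99.0) of a set of measured
// frame times, in milliseconds.
double GetFrameTimePercentileInMs(std::vector<double> FrameTimesInMs, double Percentile)
{
    if (FrameTimesInMs.empty())
        return 0.0;

    std::sort(FrameTimesInMs.begin(), FrameTimesInMs.end());
    size_t Index = static_cast<size_t>(Percentile / 100.0 * (FrameTimesInMs.size() - 1));
    return FrameTimesInMs[Index];
}

A 99th-percentile frame time that is much higher than the median frame time is a good indicator of stuttering.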

Now, you may suspect that some stutter is caused by video-memory overcommitment, but you may not be sure how best to prove this hypothesis. Monitoring the amount of used video memory per GPU with a tool such as GPU-Z, reducing the texture quality (to lower the video-memory footprint), or testing on a GPU with more video memory can give you hints. However, for various reasons, the GPU-Z “Memory Used” counter may stay below the amount of available dedicated video memory while the application is actually still overcommitting video memory. So in general, looking at the used video memory alone is not enough to know whether you are overcommitting.

Note that in this context, the term “committed” is equivalent to “referenced”, that is, a resource “used” (or bound) in any render call (clear, copy, draw or dispatch call). Only the resources that are referenced in an actual render call are considered to be placed in (dedicated) video memory. Resources that are just created but never referenced in any render call will never be committed to video memory.
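
As a minimal D3D11 sketch of this distinction (pDevice and pContext are assumed to be an existing device and immediate context):

#include <d3d11.h>

void CreateAndReferenceTexture(ID3D11Device* pDevice, ID3D11DeviceContext* pContext)
{
    // Creating a resource does not, by itself, commit it to video memory.
    D3D11_TEXTURE2D_DESC Desc = {};
    Desc.Width = 1024;
    Desc.Height = 1024;
    Desc.MipLevels = 1;
    Desc.ArraySize = 1;
    Desc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    Desc.SampleDesc.Count = 1;
    Desc.Usage = D3D11_USAGE_DEFAULT;
    Desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;

    ID3D11Texture2D* pTexture = nullptr;
    pDevice->CreateTexture2D(&Desc, nullptr, &pTexture);

    ID3D11ShaderResourceView* pSRV = nullptr;
    pDevice->CreateShaderResourceView(pTexture, nullptr, &pSRV);

    // Only once the resource is bound and used in an actual render call is it
    // considered referenced, and therefore eligible for placement in video memory.
    pContext->PSSetShaderResources(0, 1, &pSRV);
    pContext->Draw(3, 0);
}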

At NVIDIA, we have developed a method for determining whether or not any Windows/WDDM graphics application is overcommitting video memory, by using GPUView. GPUView is a free tool provided by Microsoft as part of the Windows Performance Toolkit (which ships with the Windows 8.1 SDK). It can be used with all OpenGL and Direct3D applications on Windows Vista and later. If you are not familiar with the tool yet, introductions to GPUView are available on Matthew Fisher’s website and in Jon Story’s GDC 2012 presentation. The tool also ships with a help file which I recommend checking out.

How To Detect Video Memory Overcommitment

To determine whether a Windows application is running out of video memory or not, the first thing I do is capture a GPUView trace (see Appendix) from a run where stuttering is happening consistently. I then open up the trace in GPUView and:

  1. Check the GPU Hardware Queues
  2. Check the CPU Context Queues
  3. Check for any EvictAllocation Events

In this article, I am going to take the example of a DX11 application running on a GeForce GTX 680 (with 2 GB of dedicated video memory) on Windows 8 and I am going to use the GPUView build from the Win 8.1 SDK. I have captured 3 GPUView traces from the same in-game location, with different screen resolutions and super-sampling settings: 1920x1200 & 2560x1600 with no super-sampling and 2560x1600 + 120% super-sampling (that is, an effective 3072x1920 resolution).

Step 1: Check the GPU Hardware Queues

Here is how the GPU hardware queues look near the end of the GPUView traces, in time intervals containing six Present packets. In both cases, the top hardware queue is the Graphics queue for the application — which contains the hashed Present packets and other graphics packets — and the bottom-most hardware queue is the Copy Engine queue which contains the red Paging packets.

Figure 1. GPU Queues in the 1920x1200 trace. No stuttering.

Figure 2. GPU Queues in the 3072x1920 trace. Heavy stuttering.

The times and percentages highlighted in the red boxes below each hardware queue are, respectively, the total amount of time the queue was busy and the fraction of the elapsed time that the queue was not empty, within the current time interval.

In the 1920x1200 trace, the Graphics queue was occupied for 100% of the time, so there was no problem at the queue level.

In the 3072x1920 trace, the Graphics queue was occupied only 54.1% of the time and there were large gaps of variable length in the queue. In practice, each of these gaps in the GPU Graphics queue results in a GPU frame-time stutter.

Note that you can have large gaps in the GPU Graphics queue without any stuttering, if the gaps have a consistent length. The reason these gaps cause stuttering here is that their lengths vary.

Let’s zoom in to the first frame from Figure 2 — after the first Present packet and up to the next Present packet. In this frame, the Graphics queue is occupied for only 22.9% of the time. If we select the largest Graphics-queue gap in this frame, we see it is taking 66 ms, as displayed in the bottom-right corner of the GPUView window:

Figure 3. Large 66 ms gap in the GPU Graphics Queue, in the 3072x1920 trace.

Step 2: Check the CPU Context Queues

If we scroll down in GPUView and look at the CPU Graphics queue from the application (the one with the hashed Present packets and the correct process name), we see that the CPU queue always contains between one and two Present packets. (A maximum of two is expected in this case because this application limits the number of queued frames to at most two by using event queries. We happen to know this from talking to the application developer; the event-query usage is not explicitly discernible from the GPUView trace itself.) The point is that the CPU Graphics queue is never empty; if it were empty, that would indicate a problem.

Figure 4. CPU & GPU queues for one GPU frame in the 3072x1920 trace.

We now know that the large gaps in the GPU Graphics queue are not caused by any CPU-GPU sync point or by the application being CPU bound. Otherwise, the CPU queue would go empty at some point.
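
As an aside, a common way to implement the two-frame queue limit mentioned above with event queries looks roughly like the following sketch (a generic illustration, not necessarily what this particular application does; pDevice, pContext and pSwapChain are assumed to exist):

#include <d3d11.h>

static const int kMaxQueuedFrames = 2;
static ID3D11Query* g_FrameQueries[kMaxQueuedFrames] = {};
static int g_FrameIndex = 0;

void PresentWithFrameQueueLimit(ID3D11Device* pDevice, ID3D11DeviceContext* pContext, IDXGISwapChain* pSwapChain)
{
    pSwapChain->Present(0, 0);

    // Insert an event query that gets signaled when the GPU finishes this frame.
    if (!g_FrameQueries[g_FrameIndex])
    {
        D3D11_QUERY_DESC Desc = { D3D11_QUERY_EVENT, 0 };
        pDevice->CreateQuery(&Desc, &g_FrameQueries[g_FrameIndex]);
    }
    pContext->End(g_FrameQueries[g_FrameIndex]);

    // Before building the next frame, wait for the query inserted
    // kMaxQueuedFrames-1 frames ago, so the CPU never gets more than
    // kMaxQueuedFrames frames ahead of the GPU.
    int OldestIndex = (g_FrameIndex + 1) % kMaxQueuedFrames;
    if (g_FrameQueries[OldestIndex])
    {
        BOOL Done = FALSE;
        while (pContext->GetData(g_FrameQueries[OldestIndex], &Done, sizeof(Done), 0) != S_OK)
        {
            // Busy-wait; a real engine would typically yield or do other work here.
        }
    }
    g_FrameIndex = OldestIndex;
}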

Step 3: Check for any EvictAllocation Events

The EvictAllocation events can be listed by going to Tools -> Event List and selecting the “DxgKrnl EvictAllocation” events in the GUID List.

The 1920x1200 trace contains no EvictAllocation events during gameplay. In contrast, the 3072x1920 trace has 244 of those events in the current time interval:

Figure 5. EvictAllocation events confirming video-memory overcommitment.

In GPUView, each of these selected events is visualized with a red vertical bar on the time line. As shown in Figure 5, there are multiple of these events per frame and the events are correlated with the gaps in the Graphics GPU queue and with activity in dxgmms1.sys — the OS video memory manager.

If we zoom out and look at a 40s interval from this trace, we see that there are EvictAllocation events all over:

Figure 6. EvictAllocation events in the 3072x1920 trace over a 40 second interval.

Finally, in the 2560x1600 trace with no super-sampling, there are still some gaps in the GPU Graphics queue, along with EvictAllocation events correlated with those gaps. So the application is overcommitting the 2 GB of video memory not only at 3072x1920 but also, to a lesser extent, at 2560x1600.

Figure 7. EvictAllocation events in the 2560x1600 trace causing a stutter.

Summary

Overall, you know video-memory overcommitment is causing stuttering if:

  1. There are large gaps (>1ms) of variable length in the GPU Graphics queue.
  2. There are no gaps in the game’s CPU Graphics queue.
  3. There are DxgKrnl EvictAllocation events happening during gameplay.
  4. The EvictAllocation events are correlated with the GPU Graphics queue gaps.

TIP: The first thing you can do when opening a GPUView trace is to enable all DxgKrnl EvictAllocation events in the Event List. If you don’t see any such events, then you know you don’t have any video-memory overcommitment.

Appendix: Using the GPUView Reference Chart

Whenever a resource is created, our driver fills in an array of “Preferred Segments”, which the OS Video Memory Manager uses to decide whether a resource should be promoted to, or evicted from, video memory at any given time.

GPUView lets us visualize the percentage of referenced resources that are currently in their preferred segment (also known as “P0”), as well as their fallback segments (P1, P2, etc.). To enable this visualization, you can go to the Charts menu and click on “Toggle Reference Charts”. This adds a Reference Chart below each CPU context queue, for example:

Figure 8. Reference Chart for the CPU Graphics queue in the 3072x1920 trace.

TIP: In practice, if you see any non-zero P2 percentage in any reference charts for your application, then you know you are overcommitting video memory.

Note that this is not a necessary condition, though: even if there are no P2 references, you may still be overcommitting video memory, so you can check for GPU idle gaps and EvictAllocation events to make sure. Still, it’s nice to be able to confirm that video-memory overcommitment is happening using another data point.

Appendix: Monitoring Available Video Memory

Note that the amount of video memory that is currently available to any application in the whole system can be queried by using the NvAPI_GPU_GetMemoryInfo function from NVAPI.

Here is an example helper class that does it:


#include <stdlib.h>
#include <assert.h>
#include "nvapi.h"

class NvVideoMemoryMonitor
{
public:
    struct MemInfo
    {
        unsigned int DedicatedVideoMemoryInMB;
        unsigned int AvailableDedicatedVideoMemoryInMB;
        unsigned int CurrentAvailableDedicatedVideoMemoryInMB;
    };

    NvVideoMemoryMonitor()
        : m_GpuHandle(0)
    {
    }

    void Init()
    {
        NvAPI_Status Status = NvAPI_Initialize();
        assert(Status == NVAPI_OK);

        // Enumerate the physical GPUs and monitor the first one.
        NvPhysicalGpuHandle NvGpuHandles[NVAPI_MAX_PHYSICAL_GPUS] = { 0 };
        NvU32 NvGpuCount = 0;
        Status = NvAPI_EnumPhysicalGPUs(NvGpuHandles, &NvGpuCount);
        assert(Status == NVAPI_OK);
        assert(NvGpuCount != 0);
        m_GpuHandle = NvGpuHandles[0];
    }

    void GetVideoMemoryInfo(MemInfo* pInfo)
    {
        NV_DISPLAY_DRIVER_MEMORY_INFO_V2 MemInfo = { 0 };
        MemInfo.version = NV_DISPLAY_DRIVER_MEMORY_INFO_VER_2;
        NvAPI_Status Status = NvAPI_GPU_GetMemoryInfo(m_GpuHandle, &MemInfo);
        assert(Status == NVAPI_OK);

        // NVAPI reports these sizes in KB, so convert them to MB.
        pInfo->DedicatedVideoMemoryInMB = MemInfo.dedicatedVideoMemory / 1024;
        pInfo->AvailableDedicatedVideoMemoryInMB = MemInfo.availableDedicatedVideoMemory / 1024;
        pInfo->CurrentAvailableDedicatedVideoMemoryInMB = MemInfo.curAvailableDedicatedVideoMemory / 1024;
    }

private:
    NvPhysicalGpuHandle m_GpuHandle;
};

The dedicatedVideoMemory and availableDedicatedVideoMemory counters are constant; note that availableDedicatedVideoMemory may be lower than dedicatedVideoMemory.

As for the curAvailableDedicatedVideoMemory counter, it varies over time and depends on the graphics applications currently running on the system. If you see this counter drop below 50 MB, you are likely overcommitting video memory (or about to), and you can verify whether that is the case by using GPUView. You may want to query it every frame and display it on screen in QA builds, as sketched below.
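
For example, a per-frame check in a QA build could look like the following sketch (the surrounding function name is hypothetical, and g_VideoMemoryMonitor.Init() is assumed to have been called once at startup):

NvVideoMemoryMonitor g_VideoMemoryMonitor;

void DisplayVideoMemoryWarningIfNeeded()
{
    NvVideoMemoryMonitor::MemInfo Info = { 0 };
    g_VideoMemoryMonitor.GetVideoMemoryInfo(&Info);

    // Flag likely video-memory overcommitment when the currently available
    // dedicated video memory drops below ~50 MB.
    if (Info.CurrentAvailableDedicatedVideoMemoryInMB < 50)
    {
        // Display a warning on screen and/or log it for QA.
    }
}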

Appendix: How To Capture a GPUView Trace

I am assuming that your system has enough RAM (typically 16 GB is enough) that capturing the trace in memory does not significantly affect system performance. In that case, the trace should reflect the underlying problems well.

Using a single computer

For convenience, I use a batch file containing this single line:

cmd /K "cd C:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\gpuview"

And I launch this batch file with Right Click -> “Run as Administrator”. (If you don’t launch this command prompt as Administrator, the logging may fail.)

To capture a trace, I normally do the following:

  1. Type “log light” in the GPUView command line (to start capturing the trace)
  2. Launch the game in full-screen mode and reach the repro location
  3. Wait for 20 seconds (ideally without moving the camera to get more stable results)
  4. ALT-Tab back to the GPUView cmd line and type “log” (to stop capturing)
  5. Run “GPUView.exe Merged.etl” to open up the GPUView trace

Note that it is possible to start the capture after the game has been launched. However, we have found that the CPU queues do not get captured correctly when doing so on Windows 8. Starting the capture before launching the game (which is what I recommend) does not have this problem. It has the drawback of generating larger traces, but it avoids any potential doubts or confusion.

Using two computers

This method assumes that you have a second computer available and connected to the same network as the main computer you want to capture a GPUView trace from. On the second computer, you can download the PsExec package and launch this command line:

PsExec \\YourComputerName -s -u YourDomain\YourUserName cmd /K "cd C:\Program Files (x86)\Windows Kits\8.1\Windows Performance Toolkit\gpuview"

This opens a command prompt on the second computer that executes its commands remotely on the main computer. You can then type “log light” to start logging and “log” to stop logging.

The “-s” PsExec argument runs the remote command prompt under the System account, which provides the Administrator rights needed for logging.