
CUDA Pro Tip: Improve NVIDIA Visual Profiler Loading of Large Profiles


Some applications launch many tiny kernels, which makes them prone to very large nvprof timeline dumps (hundreds of megabytes or more), even for runs lasting only a few seconds.

Such nvprof files may fail to even load when you try to import them into the NVIDIA Visual Profiler (NVVP). One symptom of this problem is that when you click “Finish” on the import screen, NVVP “thinks” for a minute or so, but then just goes right back to the import screen asking you to click Finish again. In other cases, attempting to load a large file can result in NVVP “thinking” about it for many hours.

It turns out that this problem is caused by the Java maximum heap size setting in the libnvvp/nvvp.ini file of the CUDA Toolkit installation: the profiler configures the Java VM to cap the heap at 1GB so that it works even on systems with minimal physical memory. While 1GB is already an improvement over the 512MB setting used in earlier CUDA versions, it is still not enough for some applications, because the profiler’s memory footprint can be at least four to five times the input file size.
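To make the arithmetic concrete, here is a minimal sketch that turns the four-to-five-times footprint estimate into a rough upper bound on the profile size that fits under a given heap cap. The function name `max_profile_mb` is hypothetical, and the footprint factors are only the rough estimates quoted above, not measured values.

```python
def max_profile_mb(heap_cap_mb, footprint_factor):
    """Rough upper bound (in MB) on the profile size that fits in the heap.

    footprint_factor is the assumed ratio of profiler memory footprint
    to input file size (the text estimates at least 4 to 5).
    """
    return heap_cap_mb / footprint_factor


# Compare the old 512MB cap with the current 1GB (1024MB) cap.
for cap_mb in (512, 1024):
    low = max_profile_mb(cap_mb, 5.0)   # pessimistic: footprint is 5x the file
    high = max_profile_mb(cap_mb, 4.0)  # optimistic: footprint is 4x the file
    print(f"{cap_mb} MB heap: profiles up to about {low:.0f}-{high:.0f} MB")
```

Under these assumptions, a 1GB heap tops out somewhere around 200 to 250 MB of profile data, which is consistent with profiles of hundreds of megabytes failing to load.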

Given that many modern workstations have far more than 1GB of physical memory, we can tailor this setting to our needs and to our system’s physical memory size, improving the Visual Profiler’s ability to import larger data files. In the out-of-the-box nvvp.ini configuration file, the relevant line is the maximum heap size setting, -Xmx1024m.


Our primary goal, then, is to change the “1024m” in the -Xmx setting to something bigger. Exactly which size you pick depends on your situation. For example, my workstation has 24GB of system memory, and I happen to know that I won’t need to run any other memory-intensive applications at the same time as the Visual Profiler, so it’s okay for NVVP to take up the vast majority of that space. So I might pick, say, 22GB as the maximum heap size, leaving a couple of gigabytes for the OS, GUI, and any other programs that might be running.

While I’m at it, I can make a few other configuration tweaks as well:

  • Increase the initial heap size (the one Java starts up with) to, say, 2GB. (-Xms)
  • Tell Java to run in 64-bit mode instead of the default 32-bit mode (only works on 64-bit systems); this is required if you want heap sizes >4GB. (-d64)
  • Enable Java’s concurrent garbage collector, which helps both to reduce the memory required for a given input size and to handle out-of-memory errors more gracefully. (-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode)

So I changed my nvvp.ini accordingly, raising -Xmx and adding the options listed above. Tailor the -Xmx setting to the available system memory and input size as mentioned above, but aim for at least 5-6x the input file size, which leaves some margin over the four-to-five-times footprint. (Note: most CUDA installations require administrator/root-level access to modify this file.)
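As a rough illustration of the settings described above, the following sketch builds the JVM option lines from the system memory size and checks the 5-6x rule of thumb. The helper names `suggest_vmargs` and `heap_is_enough` are hypothetical, the 2GB reserve is just the headroom used in my example, and placing each option on its own line in nvvp.ini is an assumption about the file layout rather than something shown here.

```python
def suggest_vmargs(system_gb, reserve_gb=2, initial_gb=2, use_64bit=True):
    """Return JVM option lines sized for a machine with system_gb of RAM.

    reserve_gb is memory left for the OS and other programs (an assumption;
    adjust it to your own workload).
    """
    max_gb = system_gb - reserve_gb
    if max_gb < initial_gb:
        raise ValueError("not enough memory left for the requested heap")
    lines = [f"-Xmx{max_gb}g", f"-Xms{initial_gb}g"]
    if use_64bit:
        # Per the text: only works on 64-bit systems, needed for heaps >4GB.
        lines.append("-d64")
    lines += ["-XX:+UseConcMarkSweepGC", "-XX:+CMSIncrementalMode"]
    return lines


def heap_is_enough(profile_mb, max_heap_gb, factor=6):
    """Check the rule of thumb: heap should be at least factor x file size."""
    return max_heap_gb * 1024 >= profile_mb * factor


# Example: the 24GB workstation from the text and a 500MB profile.
for line in suggest_vmargs(24):
    print(line)
print("22GB heap enough for 500MB profile:", heap_is_enough(500, 22))
```

For the 24GB workstation this yields a 22GB maximum heap and a 2GB initial heap, matching the values I picked by hand.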


And voila, I can load profiles of hundreds of megabytes in file size in seconds instead of hours.
