The VampirTrace performance monitor gives detailed insight into the runtime behavior of accelerators. This enables extensive performance analysis and optimization of hybrid programs written in CUDA, OpenACC, OpenCL, and PyCUDA.

VampirTrace can trace GPU-accelerated applications and generates exact
time stamps for all GPU-related events. The information can be used to
generate quick profiles or analyzed graphically with Vampir. Vampir allows
interactive navigation (zooming, moving) through the execution timelines
of a parallel application, annotated with statistics such as time
consumed, number of invocations, and message statistics, together with
performance counter data. The latest addition also allows GPU performance
counters to be captured.

Key features

  • Easy integration into the build process by supplied compiler wrappers
  • GPU performance counter support via CUDA Performance Tools Interface (CUPTI)
  • Powerful parallel analysis engine to support interactive trace analysis
  • A powerful tool to locate load imbalances and understand what your application is actually doing

Most projects claim a performance increase of at least 20% after isolating
serial or unbalanced portions of their code and optimizing them.
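
As a rough illustration of why unbalanced portions matter, the sketch below estimates how much of a phase's wall time is lost to load imbalance, assuming the phase ends only when the slowest process finishes. The per-process times and the function name `imbalance_summary` are hypothetical and are not part of VampirTrace or Vampir; in practice the times would come from a profile or trace.

```python
# Hypothetical per-process busy times (seconds) for one program phase.
# These numbers are illustrative only, not measurements from any trace.
busy_seconds = [8.0, 8.5, 12.0, 7.5]


def imbalance_summary(times):
    """Return (mean, maximum, fraction of wall time lost to imbalance)."""
    mean_time = sum(times) / len(times)
    max_time = max(times)
    # Assumption: the phase ends when the slowest process finishes, so the
    # gap between the maximum and the mean is time spent waiting on average.
    lost_fraction = (max_time - mean_time) / max_time
    return mean_time, max_time, lost_fraction


mean_time, max_time, lost = imbalance_summary(busy_seconds)
print(f"mean {mean_time:.2f} s, max {max_time:.2f} s, "
      f"potential saving {lost:.0%}")
```

With these made-up numbers, one slow process stretches the phase from a 9-second average to 12 seconds, so perfectly rebalancing the work would cut about a quarter of the phase's wall time.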

For more information visit these links: