
TAU Performance System® is a profiling and tracing toolkit for performance analysis of hybrid parallel programs written in CUDA C, CUDA C++, or OpenCL, or built with pyCUDA or HMPP. TAU gathers performance data from GPU computations and integrates it with the rest of the application's performance data. It instruments functions, methods, basic blocks, and statements to capture a complete picture of the application's execution.
To support high-level GPU programming, TAU can be integrated with the CAPS HMPP and PGI Accelerator compilers. TAU intercepts the runtime library routines and automatically inserts calls to its measurement interfaces in the compiler's runtime system and in the generated code.
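The interception approach described above can be illustrated with a minimal Python sketch. It is a generic illustration of wrapping runtime routines with timing probes, not TAU's actual API or implementation: `FakeRuntime`, `memcpy_htod`, `launch_kernel`, `instrument`, `intercept`, and the `profile` table are all hypothetical names chosen for this example. Each intercepted routine is replaced by a wrapper that starts a timer, calls the original routine, then records the call count and inclusive time.

```python
import functools
import time
from collections import defaultdict

# Hypothetical measurement interface: per-routine call counts and inclusive time.
# These names are illustrative only; they are not TAU's actual API.
profile = defaultdict(lambda: {"calls": 0, "seconds": 0.0})

def instrument(name, func):
    """Return a wrapper that times every call to func under the given name."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()          # "start timer" probe
        try:
            return func(*args, **kwargs)
        finally:
            entry = profile[name]            # "stop timer" probe
            entry["calls"] += 1
            entry["seconds"] += time.perf_counter() - start
    return wrapper

class FakeRuntime:
    """Hypothetical stand-in for an accelerator runtime library."""
    def memcpy_htod(self, data):
        return list(data)                    # pretend host-to-device copy

    def launch_kernel(self, data):
        return [x * 2 for x in data]         # pretend GPU kernel

def intercept(runtime, routine_names):
    """Replace each named runtime routine with an instrumented wrapper."""
    for name in routine_names:
        setattr(runtime, name, instrument(name, getattr(runtime, name)))

rt = FakeRuntime()
intercept(rt, ["memcpy_htod", "launch_kernel"])

# The application calls the runtime as usual; measurement happens transparently.
buf = rt.memcpy_htod([1, 2, 3])
out = rt.launch_kernel(buf)
out = rt.launch_kernel(out)

for name, entry in profile.items():
    print(f"{name}: {entry['calls']} call(s), {entry['seconds']:.6f} s")
```

The application code never changes: only the routines it calls are swapped for instrumented versions, which is the property that lets a tool collect GPU runtime measurements without the programmer editing source. A real tool would typically do this at the library or compiler level (for example, through library interposition or compiler-inserted probes) rather than by patching Python attributes.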
Example: TAU profile of GPU Accelerated NAMD



