TAU Performance System® is a profiling and tracing toolkit for performance analysis of hybrid parallel programs written in CUDA C, CUDA C++, or OpenCL, or using PyCUDA or OpenACC. TAU gathers performance information about GPU computations and integrates it with other application performance data. It instruments functions, methods, basic blocks, and statements to capture a performance picture of the application's execution.
To support high-level programming models, TAU can be integrated with CUDA and OpenACC compilers. TAU intercepts runtime library routines and automatically inserts calls to the TAU measurement interface, both in the runtime system and in compiler-generated code.
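The interception idea can be illustrated with a minimal Python sketch. This is not TAU's implementation or API: the `intercept` decorator, the `profile` table, and the `launch_kernel` routine are hypothetical names chosen for illustration, and the "runtime routine" is an ordinary Python function standing in for a real GPU runtime call. The sketch only shows the general pattern of wrapping a routine so that measurement calls run around it without changing the caller's code.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory profile: routine name -> [call count, inclusive seconds].
profile = defaultdict(lambda: [0, 0.0])

def intercept(func):
    """Wrap func so every call is timed and recorded under its name."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Record the call even if the wrapped routine raises.
            entry = profile[func.__name__]
            entry[0] += 1
            entry[1] += time.perf_counter() - start
    return wrapper

# Stand-in for a runtime library routine (not a real CUDA call).
def launch_kernel(n):
    return sum(i * i for i in range(n))

# Interception: rebind the name to the instrumented wrapper,
# so existing callers are measured without being modified.
launch_kernel = intercept(launch_kernel)

result = launch_kernel(1000)
launch_kernel(10)

for name, (calls, seconds) in profile.items():
    print(f"{name}: {calls} calls, {seconds:.6f} s")
```

The wrapper returns the original result unchanged, so the instrumented program behaves as before while the call counts and timings accumulate on the side.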
Example: TAU profile of GPU-accelerated NAMD