The NVIDIA Tools Extension (NVTX) library lets developers annotate custom events and ranges within the profiling timelines generated by tools such as the NVIDIA Visual Profiler (NVVP) and Nsight. In my own optimization work, I rely heavily on NVTX to better understand internal as well as customer codes and to spot opportunities for better interaction between the CPU and the GPU.
Two previous Pro Tip posts on Parallel Forall showed how to use NVTX in CUDA C++ and MPI codes. In this post, I’ll show how to use NVTX to annotate the profiles of Fortran codes (with either CUDA Fortran or OpenACC).
NVTX has a lot of features, but here I’ll focus on using it to annotate the profiler output with timeline markers using nvtxRangePush() and nvtxRangePop(). I’ll show you how to insert markers with custom labels and colors.
To make it easy, I’ve written a Fortran module to instrument CUDA/OpenACC Fortran codes that works like the macro that Jiri Kraus wrote about in his post on NVTX. The nvtx module is simple to use. After importing the module with "use nvtx", just call nvtxStartRange() / nvtxEndRange() to insert markers in the timeline. Calls to nvtxStartRange() with a single argument generate green markers, or you can select one of seven predefined colors by passing an optional second integer argument.
The following test code generates a green labeled range encompassing the whole run, and 14 custom ranges labeled with the iteration number that cycle through the predefined set of seven colors.
program main
  use nvtx
  character(len=4) :: itcount

  ! First range with standard color
  call nvtxStartRange("First label")

  do n=1,14
    ! Create custom label for each marker
    write(itcount,'(i4)') n

    ! Range with custom color
    call nvtxStartRange("Label "//itcount,n)

    ! Add sleep to make markers big
    call sleep(1)

    call nvtxEndRange
  end do

  call nvtxEndRange
end program main
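To see which color each iteration receives, the following minimal sketch reproduces the index arithmetic that the nvtx module (listed later in this post) applies to the second argument, col(mod(id,7)+1). The module name nvtx_color_demo and the function color_slot are hypothetical names used only for illustration; they are not part of NVTX or of the nvtx module.

```fortran
module nvtx_color_demo
  implicit none
  ! Number of predefined colors in the article's palette
  integer, parameter :: ncolors = 7
contains
  ! Hypothetical helper: returns the 1-based palette slot that the
  ! article's expression col(mod(id,7)+1) would select for a given id.
  pure integer function color_slot(id)
    integer, intent(in) :: id
    color_slot = mod(id, ncolors) + 1
  end function color_slot
end module nvtx_color_demo
```

With this arithmetic, ids 1 through 6 select slots 2 through 7, and ids 7 and 14 wrap around to slot 1 (green), so the 14 iterations of the test program run through the palette twice. Note that Fortran's mod() returns a negative result for a negative id, which would produce an out-of-range slot; if negative ids are possible in your code, modulo() may be a safer choice.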
To compile the code and generate an executable, pass the location of libnvToolsExt (usually /usr/local/cuda/lib on 32-bit systems or /usr/local/cuda/lib64 on 64-bit systems) with -L and the library name with -l, as follows.
$ pgf90 nvtx.cuf -L/usr/local/cuda/lib -lnvToolsExt
You can quickly generate profiler output by running nvprof and saving it to a file with the “-o” flag:
$ nvprof -o profiler.output ./a.out
==10653== NVPROF is profiling process 10653, command: ./a.out
==10653== Generated result file: /Users/mfatica/profiler.output
In versions before CUDA 7.5, you had to import the file from inside NVVP: select “File->Import”, then select “Nvprof”, click “Next”, select “Single process”, and then browse to the output of nvprof. With CUDA 7.5, you can open the output in NVVP straight from the command line:
$ nvvp -o profiler.output
NVVP generates the timeline in Figure 1.

The example had no GPU kernels, so the timeline isn’t very interesting. But you can use the same methodology to generate more complex traces with CPU and GPU markers, as Figure 2 shows.

NVTX Fortran Module Code
Following is the code for the nvtx module. The code uses the Fortran ISO C Binding module to create an interface to the NVTX C functions. It also uses the “optional” keyword to handle the custom color parameter.
module nvtx
  use iso_c_binding
  implicit none

  ! Palette (00RRGGBB): green, blue, yellow, magenta, cyan, red, white
  integer,private :: col(7) = [ Z'0000ff00', Z'000000ff', Z'00ffff00', Z'00ff00ff', Z'0000ffff', Z'00ff0000', Z'00ffffff']
  ! Buffer for the null-terminated label; c_loc() requires the target attribute
  character(len=256),private,target :: tempName

  type, bind(C):: nvtxEventAttributes
    integer(C_INT16_T):: version=1
    integer(C_INT16_T):: size=48 ! size of this structure in bytes
    integer(C_INT):: category=0
    integer(C_INT):: colorType=1 ! NVTX_COLOR_ARGB = 1
    integer(C_INT):: color
    integer(C_INT):: payloadType=0 ! NVTX_PAYLOAD_UNKNOWN = 0
    integer(C_INT):: reserved0
    integer(C_INT64_T):: payload ! union uint,int,double
    integer(C_INT):: messageType=1 ! NVTX_MESSAGE_TYPE_ASCII = 1
    type(C_PTR):: message ! ascii char
  end type

  interface nvtxRangePush
    ! push range with custom label and standard color
    subroutine nvtxRangePushA(name) bind(C, name='nvtxRangePushA')
      use iso_c_binding
      character(kind=C_CHAR,len=*) :: name
    end subroutine

    ! push range with custom label and custom color
    subroutine nvtxRangePushEx(event) bind(C, name='nvtxRangePushEx')
      use iso_c_binding
      import:: nvtxEventAttributes
      type(nvtxEventAttributes):: event
    end subroutine
  end interface

  interface nvtxRangePop
    subroutine nvtxRangePop() bind(C, name='nvtxRangePop')
    end subroutine
  end interface

contains

  subroutine nvtxStartRange(name,id)
    character(kind=c_char,len=*) :: name
    integer, optional:: id
    type(nvtxEventAttributes):: event

    ! C expects a null-terminated string
    tempName=trim(name)//c_null_char

    if ( .not. present(id)) then
      call nvtxRangePush(tempName)
    else
      ! Cycle through the palette based on id
      event%color=col(mod(id,7)+1)
      event%message=c_loc(tempName)
      call nvtxRangePushEx(event)
    end if
  end subroutine

  subroutine nvtxEndRange
    call nvtxRangePop
  end subroutine

end module nvtx
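The size=48 field in the derived type above is the byte size of the C structure that the type mirrors. As a hedged sanity check, the sketch below copies the same component layout into a separate module and asks the compiler for its C size with c_sizeof() from iso_c_binding. The module name nvtx_layout_check and the function event_size_bytes are hypothetical, and the expected value of 48 assumes a typical 64-bit platform with a 4-byte C int and 8-byte pointers.

```fortran
module nvtx_layout_check
  use iso_c_binding
  implicit none

  ! Same component order and kinds as the article's nvtxEventAttributes
  ! type (default initializers omitted; they do not affect the size).
  type, bind(C) :: nvtxEventAttributes
    integer(C_INT16_T) :: version
    integer(C_INT16_T) :: size
    integer(C_INT)     :: category
    integer(C_INT)     :: colorType
    integer(C_INT)     :: color
    integer(C_INT)     :: payloadType
    integer(C_INT)     :: reserved0
    integer(C_INT64_T) :: payload
    integer(C_INT)     :: messageType
    type(C_PTR)        :: message
  end type

contains

  ! Hypothetical helper: size in bytes of the interoperable structure,
  ! including any padding the C companion compiler would insert.
  integer function event_size_bytes()
    type(nvtxEventAttributes) :: event
    event_size_bytes = int(c_sizeof(event))
  end function event_size_bytes

end module nvtx_layout_check
```

If event_size_bytes() returns something other than 48 on your platform, the hard-coded size in the nvtx module would likely need to be adjusted to match.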
You can customize the interface to your needs. For example, you could add more colors, add a variant that includes a call to cudaDeviceSynchronize() to better mark GPU kernels, or add more functions from NVTX. You now have the power to use NVTX in Fortran code to better understand the hot spots in your applications.
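As one possible starting point for adding more colors, here is a minimal sketch of a helper that packs red, green, and blue components into the same 00RRGGBB integer layout used by the col array in the nvtx module. The module name nvtx_palette_demo and the function rgb_color are hypothetical and not part of NVTX; the top byte is left at zero to match the values in the article's palette.

```fortran
module nvtx_palette_demo
  implicit none
contains
  ! Hypothetical helper: pack 8-bit r, g, b components (0..255) into an
  ! integer laid out as 00RRGGBB, like the entries of the col array.
  pure integer function rgb_color(r, g, b)
    integer, intent(in) :: r, g, b
    rgb_color = ior(ishft(iand(r, 255), 16), &
                ior(ishft(iand(g, 255), 8), iand(b, 255)))
  end function rgb_color
end module nvtx_palette_demo
```

For example, rgb_color(0,255,0) reproduces the article's green entry Z'0000ff00', and rgb_color(255,165,0) gives an orange equivalent to Z'00ffa500'. Because col is initialized with constants in its declaration, you would either write the new entry as a hexadecimal literal or fill the array at run time, and in both cases enlarge the array and change the divisor in mod(id,7) to the new palette size.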