
Customize CUDA Fortran Profiling with NVTX

The NVIDIA Tools Extension (NVTX) library lets developers annotate custom events and ranges within the profiling timelines generated by tools such as the NVIDIA Visual Profiler (NVVP) and Nsight. In my own optimization work, I rely heavily on NVTX to better understand internal and customer codes and to spot opportunities for better interaction between the CPU and the GPU.

Two previous Pro Tip posts on Parallel Forall showed how to use NVTX in CUDA C++ and MPI codes. In this post, I’ll show how to use NVTX to annotate the profiles of Fortran codes (with either CUDA Fortran or OpenACC).

NVTX has a lot of features, but here I’ll focus on using it to annotate the profiler output with timeline markers using nvtxRangePush() and nvtxRangePop(). I’ll show you how to insert markers with custom labels and colors.

To make it easy, I’ve written a Fortran module to instrument CUDA/OpenACC Fortran codes. It works like the macro that Jiri Kraus wrote about in his post on NVTX. The nvtx module is simple to use. After importing the module with use nvtx, just call nvtxStartRange() / nvtxEndRange() to insert markers in the timeline. Calls to nvtxStartRange() with a single argument generate green markers, or you can select one of seven available colors with an optional second integer parameter.

The following test code generates a green range labeled “First label” that encompasses the whole run, and 14 custom ranges, each labeled with its iteration number, that cycle through the predefined set of seven colors.

program main
  use nvtx
  character(len=4) :: itcount

  ! First range with standard color
  call nvtxStartRange("First label")

  do n=1,14
    ! Create custom label for each marker
    write(itcount,'(i4)') n

    ! Range with custom  color
    call nvtxStartRange("Label "//itcount,n)

    ! Add sleep to make markers big 
    call sleep(1)

    call nvtxEndRange
  end do

  call nvtxEndRange
end program main

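To compile the code and generate an executable, pass the compiler the directory that contains libnvToolsExt with -L (usually /usr/local/cuda/lib on 32-bit systems or /usr/local/cuda/lib64 on 64-bit systems) and the library name with -l, as follows. The command assumes that nvtx.cuf holds both the nvtx module and the test program.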

$ pgf90 nvtx.cuf -L/usr/local/cuda/lib -lnvToolsExt

You can quickly generate profiler output by running nvprof and saving it to a file with the “-o” flag:

$ nvprof -o profiler.output ./a.out
 ==10653== NVPROF is profiling process 10653, command: ./a.out
 ==10653== Generated result file: /Users/mfatica/profiler.output

In versions before CUDA 7.5, you need to select “File->Import”, select “Nvprof”, click “Next”, select “Single process”, and then browse to the nvprof output file. With CUDA 7.5, you can open the output in NVVP straight from the command line:

$ nvvp -o profiler.output

NVVP generates the timeline in Figure 1.


Figure 1: The NVIDIA Visual Profiler (NVVP) profile timeline showing custom ranges and labels inserted using NVTX.

The example had no GPU kernels, so the timeline isn’t very interesting. But you can use the same methodology to generate more complex traces with CPU and GPU markers, as Figure 2 shows.

Figure 2: Profiler timeline for a more complex application showing CUDA Fortran kernels, API calls, and custom ranges inserted using NVTX.
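
As a rough illustration of mixing CPU and GPU work in one trace, the sketch below wraps an OpenACC loop in ranges. It assumes the nvtx module from this post is available and that OpenACC is enabled at compile time; the program name, array size, and labels are made up for the example.

```fortran
program acc_ranges
  use nvtx                            ! nvtx module from this post
  implicit none
  integer, parameter :: n = 1000000   ! arbitrary size chosen for this sketch
  real, allocatable :: x(:), y(:)
  integer :: i

  allocate(x(n), y(n))

  ! CPU-only range around host initialization (default color)
  call nvtxStartRange("init")
  x = 1.0
  y = 2.0
  call nvtxEndRange

  ! Range around GPU work; the second argument selects one of the colors
  call nvtxStartRange("saxpy", 2)
  !$acc parallel loop copyin(x) copy(y)
  do i = 1, n
    y(i) = 2.0*x(i) + y(i)
  end do
  call nvtxEndRange

  print *, y(1)
end program acc_ranges
```

Because the OpenACC region has no async clause, the host waits for it to finish, so the "saxpy" range should span both the data transfers and the kernel in the timeline.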

NVTX Fortran Module Code

Following is the code for the nvtx module. The code uses the Fortran ISO C Binding module to create an interface to the NVTX C functions. It also uses the “optional” keyword to handle the custom color parameter.

module nvtx

use iso_c_binding
implicit none

integer,private :: col(7) = [ Z'0000ff00', Z'000000ff', Z'00ffff00', Z'00ff00ff', Z'0000ffff', Z'00ff0000', Z'00ffffff']
character(len=256),private,target :: tempName

type, bind(C):: nvtxEventAttributes
  integer(C_INT16_T):: version=1
  integer(C_INT16_T):: size=48 !
  integer(C_INT):: category=0
  integer(C_INT):: colorType=1 ! NVTX_COLOR_ARGB = 1
  integer(C_INT):: color
  integer(C_INT):: payloadType=0 ! NVTX_PAYLOAD_UNKNOWN = 0
  integer(C_INT):: reserved0
  integer(C_INT64_T):: payload   ! union uint,int,double
  integer(C_INT):: messageType=1  ! NVTX_MESSAGE_TYPE_ASCII     = 1 
  type(C_PTR):: message  ! ascii char
end type

interface nvtxRangePush
  ! push range with custom label and standard color
  subroutine nvtxRangePushA(name) bind(C, name='nvtxRangePushA')
  use iso_c_binding
  character(kind=C_CHAR,len=*) :: name
  end subroutine

  ! push range with custom label and custom color
  subroutine nvtxRangePushEx(event) bind(C, name='nvtxRangePushEx')
  use iso_c_binding
  import:: nvtxEventAttributes
  type(nvtxEventAttributes):: event
  end subroutine
end interface

interface nvtxRangePop
  subroutine nvtxRangePop() bind(C, name='nvtxRangePop')
  end subroutine
end interface

contains

subroutine nvtxStartRange(name,id)
  character(kind=c_char,len=*) :: name
  integer, optional:: id
  type(nvtxEventAttributes):: event

  ! NVTX expects a null-terminated C string
  tempName=trim(name)//c_null_char

  if ( .not. present(id)) then
    ! No id given: push a plain range with the default color
    call nvtxRangePush(tempName)
  else
    ! Cycle through the seven predefined colors
    event%color=col(mod(id,7)+1)
    event%message=c_loc(tempName)
    call nvtxRangePushEx(event)
  end if
end subroutine

subroutine nvtxEndRange
  call nvtxRangePop
end subroutine

end module nvtx
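
To see the language features the module leans on in isolation, here is a small sketch that does not need NVTX at all. It reuses the same derived-type layout to check the hard-coded size of 48, binds to the C library's strlen as a stand-in for an NVTX entry point, and shows optional/present() on a toy function. The names binding_demo, attrs, c_strlen, and pick are hypothetical and exist only for this example.

```fortran
module binding_demo
  use iso_c_binding
  implicit none

  ! Same component layout as nvtxEventAttributes in the module above
  type, bind(C) :: attrs
    integer(C_INT16_T) :: version = 1
    integer(C_INT16_T) :: size = 48
    integer(C_INT)     :: category = 0
    integer(C_INT)     :: colorType = 1
    integer(C_INT)     :: color
    integer(C_INT)     :: payloadType = 0
    integer(C_INT)     :: reserved0
    integer(C_INT64_T) :: payload
    integer(C_INT)     :: messageType = 1
    type(C_PTR)        :: message
  end type

  interface
    ! C library strlen, used here only as a stand-in for an NVTX function
    function c_strlen(s) bind(C, name='strlen') result(n)
      import :: C_CHAR, C_SIZE_T
      character(kind=C_CHAR), intent(in) :: s(*)
      integer(C_SIZE_T) :: n
    end function
  end interface

contains

  ! Toy function: returns id when it is passed, otherwise 0
  integer function pick(id)
    integer, optional, intent(in) :: id
    if (present(id)) then
      pick = id
    else
      pick = 0
    end if
  end function

end module binding_demo

program main
  use binding_demo
  implicit none
  type(attrs) :: event
  character(kind=C_CHAR, len=16) :: label

  ! C functions need an explicit terminating null character
  label = "Label"//C_NULL_CHAR

  print *, c_sizeof(event)   ! expected to be 48 with 8-byte pointers
  print *, c_strlen(label)   ! counts characters up to the null terminator
  print *, pick(), pick(3)   ! optional argument absent, then present
end program main
```

If c_sizeof reports something other than 48 on your platform, the size component passed to NVTX would need to change accordingly.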

You can customize the interface to your needs. For example, you could add more colors, add a variant that includes a call to cudaDeviceSynchronize() to better mark GPU kernels, or add more functions from NVTX. You now have the power to use NVTX in Fortran code to better understand the hot spots in your applications.
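
As one possible shape for the synchronizing variant mentioned above, the sketch below adds a wrapper that waits for the GPU before closing a range. It assumes the nvtx module from this post and the cudafor module that ships with CUDA Fortran; the names nvtx_sync and nvtxEndRangeSync are hypothetical.

```fortran
module nvtx_sync
  use nvtx      ! nvtx module from this post
  use cudafor   ! CUDA Fortran runtime API, provides cudaDeviceSynchronize
  implicit none

contains

  ! Hypothetical variant of nvtxEndRange: block until all outstanding
  ! GPU work has finished, then close the current range
  subroutine nvtxEndRangeSync
    integer :: istat
    istat = cudaDeviceSynchronize()
    call nvtxEndRange
  end subroutine

end module nvtx_sync
```

Kernel launches are asynchronous, so without the synchronization a range can close while the kernel is still running. The trade-off is that the extra synchronization can change the timing of the code being profiled, so a variant like this is probably best kept for profiling runs.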
