We’ve all been there. Your CUDA Fortran code is humming along and suddenly you get a runtime error: copyin
, copyout
, usually accompanied by FAILED
in all caps. In many cases, the error message gives you enough information to find where the problem is in your source code: you have a copyin
runtime error and you only perform a few host-to-device transfers, or your code ran fine before you added that block of code earlier today—either way, you know where to look.
Then there are the cases where you run across a runtime error in a huge code with many device arrays, and many transfers between host and device occur when staging MPI transfers. How do you track down the line of source code that caused the error in this case? Using cuda-gdb
becomes unwieldy with large MPI codes. There is always the print statement approach, but do you really want to litter your code with print statements (only to forget to remove one when you’ve resolved the error)?
Fortunately there is a simple method to pinpoint the line of source code that causes a runtime error in CUDA Fortran that involves no code modification, only recompilation with -g
if using optimization of -O2
or higher. We describe this method in this Pro Tip.
Example CUDA Fortran Code with Runtime Error
First we need a CUDA Fortran code that generates a runtime error, such as the following.
module m Contains attributes(global) subroutine increment(a, b) implicit none integer :: a(:), b(:) integer :: i, n i = blockDim%x*(blockIdx%x-1) + threadIdx%x n = size(a) if (i <= n) a(i) = a(i)+b(i) end subroutine increment end module m program main use cudafor use m implicit none integer, parameter :: n = 1024*1024 integer, allocatable :: a(:), b(:) integer, device, allocatable :: a_d(:), b_d(:) integer :: tPB = 256 allocate(a(n), b(n), a_d(n)) a = 1 b = 3 a_d = a b_d = b call increment<<<ceiling(real(n)/tPB),tPB>>>(a_d, b_d) a = a_d if (any(a /= 4)) then write(*,*) '**** Program Failed ****' else write(*,*) 'Program Passed' endif deallocate(a, b, a_d) end program main
A quick inspection of this simple code reveals that the allocatable device array b_d
is not allocated before use, and sure enough the code compiles fine but generate a runtime error.
% pgf90 -fast -g unalloc.cuf % ./a.out 0: copyin Memcpy (dev=0x(nil), host=0x0x7f7e8986e230, size=5678900328) FAILED: 11(invalid argument)
We can track down the source line for this error in two simple steps: (1) generate a backtrace and (2) translate an address from the backtrace to a line number in a source code file.
Generating a Backtrace
You can generate a backtrace for code compiled with the PGI compilers by setting the environment variable PGI_TERM
to trace
before running the code. Doing this with the example code results in the following output.
% export PGI_TERM='trace' % ./a.out 0: copyin Memcpy (dev=0x(nil), host=0x0x7f7406519230, size=18446744066755845160) FAILED: 11(invalid argument) /opt/pgi/linux86-64/17.10/lib/libpgf90_rpm1.so(__fort_abortx+0x17) [0x7f7408b6ea87] /opt/pgi/linux86-64/17.10/lib/libpgf90.so(__fort_abort+0x5f) [0x7f7408f4d64f] /opt/pgi/linux86-64/17.10/lib/libcudafor.so(+0x5dd8e) [0x7f741109fd8e] /opt/pgi/linux86-64/17.10/lib/libcudafor.so(pgf90_dev_copyin+0x53) [0x7f74110a0012] ./a.out() [0x403b1a] ./a.out() [0x4036d4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f74074d7830] ./a.out() [0x403589]
The line just below the line with the copyin call from the libcudafor.so
library is the statement from the executable a.out
that generated the runtime error, which has the address 0x403b1a
. Now all you need to do is convert this address to the corresponding location in the source code.
Converting an Address to Line Information
The utility addr2line
can be used to convert the address from the backtrace to the corresponding source file and line number. Just provide the executable and the address as follows.
% addr2line -e a.out 0x403b1a /home/gruetsch/./unalloc.cuf:30
Here you can see that line 30 in the file unalloc.cuf
is indeed the host-to-device transfer using the unallocated device array.
Summary
There are a number of tools available for debugging CUDA Fortran code, including cuda-gdb
and cuda-memcheck
, as well as this technique for locating the source runtime errors. Which tool you use depends on your particular circumstances. The technique in this Pro Tip is attractive because it is simple and requires no source code modification. I hope you find it useful in resolving runtime errors encounter in your CUDA Fortran projects.