We’ve all been there. Your CUDA Fortran code is humming along and suddenly you get a runtime error:
copyout, usually accompanied by
FAILED in all caps. In many cases, the error message gives you enough information to find where the problem is in your source code: you have a
copyin runtime error and you only perform a few host-to-device transfers, or your code ran fine before you added that block of code earlier today—either way, you know where to look.
Then there are the cases where you run across a runtime error in a huge code with many device arrays, and many transfers between host and device occur when staging MPI transfers. How do you track down the line of source code that caused the error in this case? Using
cuda-gdb becomes unwieldy with large MPI codes. There is always the print statement approach, but do you really want to litter your code with print statements (only to forget to remove one when you’ve resolved the error)?
Fortunately there is a simple method to pinpoint the line of source code that causes a runtime error in CUDA Fortran that involves no code modification, only recompilation with
-g if using optimization of
-O2 or higher. We describe this method in this Pro Tip.
Example CUDA Fortran Code with Runtime Error
First we need a CUDA Fortran code that generates a runtime error, such as the following.
module m Contains attributes(global) subroutine increment(a, b) implicit none integer :: a(:), b(:) integer :: i, n i = blockDim%x*(blockIdx%x-1) + threadIdx%x n = size(a) if (i <= n) a(i) = a(i)+b(i) end subroutine increment end module m program main use cudafor use m implicit none integer, parameter :: n = 1024*1024 integer, allocatable :: a(:), b(:) integer, device, allocatable :: a_d(:), b_d(:) integer :: tPB = 256 allocate(a(n), b(n), a_d(n)) a = 1 b = 3 a_d = a b_d = b call increment<<<ceiling(real(n)/tPB),tPB>>>(a_d, b_d) a = a_d if (any(a /= 4)) then write(*,*) '**** Program Failed ****' else write(*,*) 'Program Passed' endif deallocate(a, b, a_d) end program main
A quick inspection of this simple code reveals that the allocatable device array
b_d is not allocated before use, and sure enough the code compiles fine but generate a runtime error.
% pgf90 -fast -g unalloc.cuf % ./a.out 0: copyin Memcpy (dev=0x(nil), host=0x0x7f7e8986e230, size=5678900328) FAILED: 11(invalid argument)
We can track down the source line for this error in two simple steps: (1) generate a backtrace and (2) translate an address from the backtrace to a line number in a source code file.
Generating a Backtrace
You can generate a backtrace for code compiled with the PGI compilers by setting the environment variable
trace before running the code. Doing this with the example code results in the following output.
% export PGI_TERM='trace' % ./a.out 0: copyin Memcpy (dev=0x(nil), host=0x0x7f7406519230, size=18446744066755845160) FAILED: 11(invalid argument) /opt/pgi/linux86-64/17.10/lib/libpgf90_rpm1.so(__fort_abortx+0x17) [0x7f7408b6ea87] /opt/pgi/linux86-64/17.10/lib/libpgf90.so(__fort_abort+0x5f) [0x7f7408f4d64f] /opt/pgi/linux86-64/17.10/lib/libcudafor.so(+0x5dd8e) [0x7f741109fd8e] /opt/pgi/linux86-64/17.10/lib/libcudafor.so(pgf90_dev_copyin+0x53) [0x7f74110a0012] ./a.out() [0x403b1a] ./a.out() [0x4036d4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f74074d7830] ./a.out() [0x403589]
The line just below the line with the copyin call from the
libcudafor.so library is the statement from the executable
a.out that generated the runtime error, which has the address
0x403b1a. Now all you need to do is convert this address to the corresponding location in the source code.
Converting an Address to Line Information
addr2line can be used to convert the address from the backtrace to the corresponding source file and line number. Just provide the executable and the address as follows.
% addr2line -e a.out 0x403b1a /home/gruetsch/./unalloc.cuf:30
Here you can see that line 30 in the file
unalloc.cuf is indeed the host-to-device transfer using the unallocated device array.
There are a number of tools available for debugging CUDA Fortran code, including
cuda-memcheck, as well as this technique for locating the source runtime errors. Which tool you use depends on your particular circumstances. The technique in this Pro Tip is attractive because it is simple and requires no source code modification. I hope you find it useful in resolving runtime errors encounter in your CUDA Fortran projects.