
When dealing with small arrays and matrices, one method of exposing parallelism on the GPU is to execute the same cuBLAS call on multiple independent systems simultaneously. While you can do this manually by launching cuBLAS calls in multiple CUDA streams, batched cuBLAS routines enable such parallelism automatically for certain operations (GEMM, GETRF, GETRI, and TRSM). In this post I’ll show you how to leverage these batched routines from CUDA Fortran.
The batched cuBLAS functions in the C API take an array of pointers as one of their arguments, where each pointer in the array points to an independent matrix. This poses a problem for Fortran, which does not allow arrays of pointers. To accommodate this argument, we can make use of the c_devptr type, CUDA Fortran's device-pointer analog of the c_ptr type declared in the ISO_C_BINDING module. Let’s illustrate this with a code that calls the batched SGETRF cuBLAS routine.
Writing Interfaces to Batched cuBLAS Routines
At the time of writing this post, the batched cuBLAS routines are not in the CUDA Fortran cublas module, so we first need to define the interface to the cublasSgetrfBatched() call:
interface
  integer(c_int) function &
      cublasSgetrfBatched(h,n,Aarray,lda,ipvt,info,batchSize) &
      bind(c,name='cublasSgetrfBatched')
    use iso_c_binding
    use cublas
    type(cublasHandle), value :: h
    integer(c_int), value :: n
    type(c_devptr), device :: Aarray(*)
    integer(c_int), value :: lda
    integer(c_int), device :: ipvt(*)
    integer(c_int), device :: info(*)
    integer(c_int), value :: batchSize
  end function cublasSgetrfBatched
end interface
The arguments of cublasSgetrfBatched() are:

- h, the cuBLAS handle obtained from cublasCreate();
- n, the size of the n×n matrices;
- Aarray, the array of pointers to the matrices;
- lda, the leading dimension of the matrices;
- ipvt, an output array containing pivot information;
- info, another output array containing factorization information;
- and batchSize, the number of independent n×n matrices on which to perform the factorization.
Of particular importance in this interface is the use of the variable attributes device and value. The cuBLAS handle h, the size of the matrices n, the leading dimension lda, and the number of matrices batchSize are all scalar parameters declared in host memory and are therefore passed by value. The integer arrays ipvt and info are output arrays allocated on the device using the device attribute. Of special note in this interface is the declaration of the array of pointers Aarray:
type(c_devptr), device :: Aarray(*)
which is a device array of device pointers. Contrast this to a declaration such as:
type(c_devptr) :: Aarray(*)
which is a host array of device pointers. The batched cuBLAS routines require the former declaration because the array of pointers itself must reside in device memory, where the cuBLAS kernel threads dereference it to locate the individual matrices.
Calling Batched cuBLAS from Host Code
Having defined our interface to cublasSgetrfBatched(), we can now turn to the host code that calls this function. The complete code follows.
program testgetrfBatched
  use cudafor
  use cublas
  use iso_c_binding
  implicit none
  integer, parameter :: n=2, nbatch=3, lda=n
  real :: a(n,n,nbatch)
  real, device :: a_d(n,n,nbatch)
  type(c_devptr) :: devPtrA(nbatch)
  type(c_devptr), device :: devPtrA_d(nbatch)
  type(cublasHandle) :: h1
  integer :: ipvt(n*nbatch), info(nbatch)
  integer, device :: ipvt_d(n*nbatch), info_d(nbatch)
  integer :: i, k, istat

  interface
    integer(c_int) function cublasSgetrfBatched(h, n, Aarray, lda, ipvt, info, batchSize) &
        bind(c,name='cublasSgetrfBatched')
      use iso_c_binding
      use cublas
      type(cublasHandle), value :: h
      integer(c_int), value :: n
      type(c_devptr), device :: Aarray(*)
      integer(c_int), value :: lda
      integer(c_int), device :: ipvt(*)
      integer(c_int), device :: info(*)
      integer(c_int), value :: batchSize
    end function cublasSgetrfBatched
  end interface

  ! initialize arrays and transfer to device
  do k = 1, nbatch
    a(1,1,k) = 6.0*k
    a(2,1,k) = 4.0*k
    a(1,2,k) = 3.0*k
    a(2,2,k) = 3.0*k
  end do
  a_d = a

  write(*,"(/,'Input:')")
  do k = 1, nbatch
    write(*,"(2x,'Matrix: ', i0)") k
    do i = 1, n
      write(*,*) a(i,:,k)
    enddo
  enddo

  ! build an array of pointers
  do k = 1, nbatch
    devPtrA(k) = c_devloc(a_d(1,1,k))
  end do
  devPtrA_d = devPtrA

  ! create handle, call cublasSgetrfBatched, and destroy handle
  istat = cublasCreate(h1)
  if (istat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasCreate failed'
  istat = cublasSgetrfBatched(h1, n, devPtrA_d, lda, ipvt_d, info_d, nbatch)
  if (istat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasSgetrfBatched failed: ', istat
  istat = cublasDestroy(h1)
  if (istat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasDestroy failed'

  a = a_d

  write(*,"(/, 'LU Factorization:')")
  do k = 1, nbatch
    write(*,"(2x,'Matrix: ', i0)") k
    do i = 1, n
      write(*,*) a(i,:,k)
    enddo
  enddo

end program testgetrfBatched
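As a quick sanity check on the output, consider the first input matrix, which has rows (6, 3) and (4, 3). Partial pivoting leaves row 1 in place since |6| > |4|, the elimination multiplier is 4/6 ≈ 0.667, and the trailing entry becomes 3 − (4/6)·3 = 1. Because getrf overwrites each matrix with its L and U factors (the unit diagonal of L is not stored), the factorization printed for matrix 1 should be approximately rows (6, 3) and (0.667, 1); the other matrices, being scaled copies, factor analogously.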
While most of this code is straightforward, the code that assembles the device array of device pointers, devPtrA_d, merits some discussion.
do k = 1, nbatch
  devPtrA(k) = c_devloc(a_d(1,1,k))
end do
devPtrA_d = devPtrA
The host function c_devloc() determines the device address of each of the nbatch matrices stored in the three-dimensional array a_d. In general the matrices can be distributed across multiple variables, but we use a single three-dimensional array for convenience. The results from c_devloc() are first stored in devPtrA, a host array of device pointers, and then transferred to the device array devPtrA_d, which is used a few lines later in the call to cublasSgetrfBatched():
istat = cublasSgetrfBatched(h1, n, devPtrA_d, lda, ipvt_d, info_d, nbatch)
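After the call returns, the device output arrays can be copied back and inspected on the host: a zero entry of info indicates that the factorization of the corresponding matrix succeeded, and the pivot indices are stored contiguously in ipvt, n per matrix. A minimal check, reusing the host arrays info and ipvt already declared in the program above, might look like this:

! check the per-matrix status returned by cublasSgetrfBatched
info = info_d
ipvt = ipvt_d
do k = 1, nbatch
  if (info(k) /= 0) then
    write(*,*) 'Factorization of matrix ', k, ' returned info = ', info(k)
  else
    write(*,*) 'Matrix ', k, ' pivots: ', ipvt((k-1)*n+1:k*n)
  end if
end do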
Other Batched Routines
We can use a similar approach for the other batched cuBLAS routines: cublas*getriBatched(), cublas*gemmBatched(), and cublas*trsmBatched(). Note that in cublas*gemmBatched() and cublas*trsmBatched() the scalar parameters alpha and beta (trsm takes only alpha) are passed by reference and can reside on either the host or the device, depending on the cuBLAS pointer mode. In such cases it is helpful to write a generic interface such as the following.
interface cublasSgemmBatched

  integer(c_int) function &
      cublasSgemmBatched_hpm(h, transa, transb, &
      m, n, k, alpha, Aarray, lda, Barray, ldb, &
      beta, Carray, ldc, batchCount) &
      bind(c,name='cublasSgemmBatched')
    use iso_c_binding
    use cublas
    type(cublasHandle), value :: h
    integer, value :: transa
    integer, value :: transb
    integer(c_int), value :: m, n, k
    real :: alpha
    type(c_devptr), device :: Aarray(*)
    integer(c_int), value :: lda
    type(c_devptr), device :: Barray(*)
    integer(c_int), value :: ldb
    real :: beta
    type(c_devptr), device :: Carray(*)
    integer(c_int), value :: ldc
    integer(c_int), value :: batchCount
  end function cublasSgemmBatched_hpm

  integer(c_int) function &
      cublasSgemmBatched_dpm(h, transa, transb, &
      m, n, k, alpha, Aarray, lda, Barray, ldb, &
      beta, Carray, ldc, batchCount) &
      bind(c,name='cublasSgemmBatched')
    use iso_c_binding
    use cublas
    type(cublasHandle), value :: h
    integer, value :: transa
    integer, value :: transb
    integer(c_int), value :: m, n, k
    real, device :: alpha
    type(c_devptr), device :: Aarray(*)
    integer(c_int), value :: lda
    type(c_devptr), device :: Barray(*)
    integer(c_int), value :: ldb
    real, device :: beta
    type(c_devptr), device :: Carray(*)
    integer(c_int), value :: ldc
    integer(c_int), value :: batchCount
  end function cublasSgemmBatched_dpm

end interface cublasSgemmBatched
In this interface the *_hpm and *_dpm routines are for host and device pointer modes, respectively; their interfaces differ only by the absence or presence of the device variable attribute on the alpha and beta arguments. You could further write a wrapper function that calls cublasSetPointerMode() appropriately before executing the batched cuBLAS routine, as in the sketch below.
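As a rough sketch of what such a wrapper might look like for the host-pointer-mode case, the following hypothetical function forces host pointer mode, calls the batched GEMM through the generic interface above, and then restores the caller's pointer mode. It assumes the generic cublasSgemmBatched interface is accessible (for example, from a module) and that the cublasGetPointerMode(), cublasSetPointerMode(), and CUBLAS_POINTER_MODE_HOST definitions are available from the cublas module.

! Hypothetical wrapper: forces host pointer mode for alpha/beta, calls the
! batched GEMM, then restores the caller's pointer mode. Error handling is
! kept to a minimum for brevity.
integer function sgemmBatchedHost(h, transa, transb, m, n, k, alpha, &
    Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount) result(istat)
  use cudafor
  use cublas
  use iso_c_binding
  implicit none
  type(cublasHandle) :: h
  integer :: transa, transb
  integer :: m, n, k, lda, ldb, ldc, batchCount
  real :: alpha, beta
  type(c_devptr), device :: Aarray(*), Barray(*), Carray(*)
  integer :: oldMode, tmp

  ! remember the current pointer mode and switch to host pointers
  tmp = cublasGetPointerMode(h, oldMode)
  tmp = cublasSetPointerMode(h, CUBLAS_POINTER_MODE_HOST)

  ! the generic interface resolves to the host-pointer-mode specific (_hpm)
  istat = cublasSgemmBatched(h, transa, transb, m, n, k, alpha, &
      Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount)

  ! restore whatever mode the caller had set
  tmp = cublasSetPointerMode(h, oldMode)
end function sgemmBatchedHost

A device-pointer-mode variant would differ only in declaring alpha and beta with the device attribute and setting CUBLAS_POINTER_MODE_DEVICE instead.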
Check out more on CUDA Fortran as well as more CUDA Pro Tips at the NVIDIA Developer Blog.