Greg Ruetsch

Greg Ruetsch is a senior applied engineer at NVIDIA, where he works on CUDA Fortran and performance optimization of HPC codes. He holds a bachelor’s degree in mechanical and aerospace engineering from Rutgers University and a Ph.D. in applied mathematics from Brown University. Prior to joining NVIDIA, he held research positions at Stanford University’s Center for Turbulence Research and Sun Microsystems Laboratories.

Posts by Greg Ruetsch

Technical Walkthrough

Using Tensor Cores in CUDA Fortran

This post describes a CUDA Fortran interface to Tensor Core functionality, focusing on the third-generation Tensor Cores of the Ampere architecture. 28 MIN READ
CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran.
Technical Walkthrough

Pro Tip: Pinpointing Runtime Errors in CUDA Fortran

We’ve all been there. Your CUDA Fortran code is humming along and suddenly you get a runtime error, usually accompanied by a message in all caps. In many cases… 4 MIN READ
GPU Pro Tip
Technical Walkthrough

CUDA Pro Tip: How to Call Batched cuBLAS routines from CUDA Fortran

When dealing with small arrays and matrices, one method of exposing parallelism on the GPU is to execute the same cuBLAS call on multiple independent systems… 7 MIN READ
Technical Walkthrough

Peer-to-Peer Multi-GPU Transpose in CUDA Fortran (Book Excerpt)

This post is an excerpt from Chapter 4 of the book CUDA Fortran for Scientists and Engineers, by Gregory Ruetsch and Massimiliano Fatica. In this excerpt we… 12 MIN READ
Technical Walkthrough

Finite Difference Methods in CUDA Fortran, Part 2

In the last CUDA Fortran post we dove into 3D finite difference computations in CUDA Fortran, demonstrating how to implement the x derivative part of the… 6 MIN READ
Technical Walkthrough

Finite Difference Methods in CUDA Fortran, Part 1

In the last CUDA Fortran post we investigated how shared memory can be used to optimize a matrix transpose, achieving roughly an order of magnitude improvement… 9 MIN READ