Linear interpolation is a simple and fundamental numerical calculation prevalent in many fields. It’s so common in computer graphics that programmers often use the verb “lerp” to refer to linear interpolation, a function that’s built into all modern graphics hardware (often in multiple hardware units).
You can enable linear interpolation (also known as linear filtering) on texture fetches in CUDA kernels. This hardware filtering uses a low-precision interpolant, so for this and other reasons it’s common to lerp in software.
The standard way to lerp is:
(1-t)*v0 + t*v1
Here’s a generic host/device function that performs a lerp:
template <typename T> __host__ __device__ inline T lerp(T v0, T v1, T t) { return (1-t)*v0 + t*v1; }
But we can do better. Compiled as-is this maps to three floating-point operations. The compiler won’t re-arrange floating-point computation if the transformation does not preserve numerical equality, with the exception of the straightforward merging of an FADD followed by a dependent FMUL into an FMA.
To maximize performance, you may want to manually re-arrange the terms in the computation above so it can be reduced to two FMAs:
fma(t, v1, fma(-t, v0, v0))
This FMA-optimized version also provides slightly better accuracy overall. Here’s the full function:
template <typename T> __host__ __device__ inline T lerp(T v0, T v1, T t) { return fma(t, v1, fma(-t, v0, v0)); }
For one use case (seismic processing CUDA code) we have seen performance improve by 5% just by optimizing the linear interpolation function as shown above. Not bad for a few minutes of work.
Thanks to Norbert Juffa for providing this Pro Tip.