One of the most exciting periods when working in NVIDIA is during the launch of a new GPU architecture.  So I was really thrilled when the first Maxwell board landed on my desk and the very first thing I wanted to do was to investigate how PhysX GPU particles can benefit from the new GPU architecture.  In order to do that, I need to get the performance profiles of Maxwell vs Kelper to determine how well the existing Kepler optimizations scale in Maxwell.  There are several tools available (NSight, NVIDIA Visual Profiler, PhysX Visual Debugger, etc) which can be used to profile CUDA workloads.  My personal favorite is the PhysX Visual Debugger (link) because it is a tool specifically designed to debug and profile PhysX simulations.

The chart below shows the relative performance of the GPUs across the 500 to 900 series.  The GTX 750 Ti (GM107) was the first Maxwell GPU available and designated as the replacement part for GTX 650 Ti (GK107).  From the chart, we can see that the GTX 750 Ti outperforms the GTX 650 Ti by a factor of 1.25x on 3DMark Vantage.

However, profiling the PhysX CUDA compute performance of Maxwell vs Kepler shows a slightly different picture.  For the sake of comparison, I have added in a high performance GTX 680.  GTX 750 Ti generally outperforms GTX 650 Ti by about 1.7x and in some cases, more than 2x. This performance gain is achieved out of the box without the need to tweak any parameters or code modifications. Simply adding in sm50 to the kernel compilation is all that is needed to achieve this performance gain.

On Maxwell, true shared memory atomics are implemented. This increases the throughput to 32 threads/clock in the best case and 1 thread/2 clocks in the worst case. This is a big improvement compared to Kepler which is 16 threads/clock in the best case and 1 thread/30 clocks in the worst case.

The particle system pipeline has a kernel (Let’s just call it kernel x) which makes heavy use of both global and shared memory atomic operations.  Below are the performance comparisons between GTX 650 Ti, GTX 750 Ti and GTX 680 (Lower simulation time is better).  GTX 680 is a high performance Kepler GPU which outperforms GTX 750 Ti on all kernels except this one.  Due to the implementation of native shared memory atomics, GTX 750Ti might be able to outperform a higher spec Kepler such as GTX 680 on kernels which makes heavy use of shared memory atomics.

One potential use of shared atomics is replacing parallel prefix sum (scan).  The table below shows a simple example of prefix sums:

input numbers 1 2 3 4 5 6 ...
prefix sums 1 3 6 10 15 21 ...


On Maxwell, there is a particular kernel (Let’s call it kernel y) which is used to calculate a small set of prefix sums and the ordering of the prefix sums is not mandatory.  The current implementation uses the scan algorithm but since shared atomic operations are so fast in Maxwell compared to Kepler, I switched it to using simple atomic add operations for Maxwell.  The result is pretty good as shown below:

We can see from the above profile that using shared atomics results in a 2x performance improvement.  However, using atomics comes with caveats.  Due to the nature of atomics, execution becomes non-deterministic. Using the parallel prefix sum example, we can see that the sequence of the prefix sum is in increasing order or in the general case, it has a predictable order.  Unfortunately, when using atomics, the ordering is no longer guaranteed because it depends on scheduling and how conflicts are resolved by the GPU.  So depending on the use case and algorithm, use of atomics could lead to non-deterministic ordering of the final results and this could make debugging much more difficult. The recommendation is to have a deterministic version to fall back on for debugging or to sort the final results so that ordering can be restored.

GK110 was the only Kepler class GPU that adds the ability for read-only data in global memory to be loaded through the same cache used by the texture pipeline via a standard pointer without the need to bind a texture beforehand and without the sizing limitations of standard textures.  In cases where more explicit control over the read-only data cache mechanism is desired than the const __restrict__ qualifiers provide, or where the code is sufficiently complex that the compiler is unable to detect that the read-only data cache is safe to use, the __ldg() intrinsic can be used in place of a normal pointer dereference to force the load to go through the read-only data cache.

__ldg() intrinsic, on the other hand, is supported on all Maxwell class GPUs and makes it really easy to take advantage of the texture cache.  In PhysX GPU particles, we wanted to provide an experimental feature to improve triangle mesh collisions and that is to cache the triangles chunks.   I wanted to quickly put together a prototype to see how much performance gain I could achieve by using a cache.  Using the __ldg() intrinsic , I can experiment with caching different permutations of the triangle chunk data without the need to first setup and bind them to textures.  By caching the triangle chunks in a cache, triangle collisions of PhysX GPU particles show a speedup of up to 30%.

Another huge advantage of __ldg() intrinsic is that you can experiment with other data structures such as double buffering on a single buffer or circular buffer and get the benefits of the texture cache which you previously can’t with pre-assigned texture bindings.  The code snippet below shows how easy it is to use the __ldg() intrinsic.

  #if __CUDA_ARCH__ >= 350
  // ldg() intrinsic
  #define TEX_READ(result,name,index) result = __ldg(name+index)
  #elif __CUDA_ARCH__ == 300
  // texture read (bindless texture)
  #define TEX_READ(result,name,index) tex1Dfetch(&result, name, index)
  // texture read (texture name appended with Tex)
   #define TEX_READ(result,name,index) result = tex1Dfetch(name##Tex, index)
  #define TEX_READ(result,name,index) result = name[index]
template<typename T>
__device__ int texReadInt(T input, int index)
  int result;
  TEX_READ(result, input, index);
  return result;
extern "C" __global__
void example(
#if __CUDA_ARCH__ == 300
  CUtexObject input,
  const int* input,
  int* output)
  int index = (blockIdx.x * blockDim.x) + threadIdx.x;
  output[index] = texReadInt(input, index);

In conclusion, Maxwell is not only a significant leap in performance per watt compared to Kepler but more crucially, from a CUDA developer point of view is that the performance increase comes out of the box demonstrating strong scaling in most scenarios.  With some tweaking, better performance can be achieved by making use of shared atomic operations and the __ldg() intrinsic.  The Maxwell tuning guide is an invaluable resource for anyone seeking to maximize performance (link).