CUDA By Example - Errata Page
This page lists errors and corrections to "CUDA by Example: An Introduction to General-Purpose GPU Programming."
If you find new errors or corrections, please send e-mail to cuda@nvidia.com with the subject "Errata for CUDA by Example".
List of Errors and Corrections:
-
p.23,25 - The #includes for this example are incorrectly shown as #include <iostream> and #include "book.h". They should read #include <stdio.h> and #include "../common/book.h". This has been corrected in the downloadable code package.
-
p.56 - The text incorrectly claims that, "When you run the application, you should see an animating visualization of the Julia Set." The Julia Set visualization is a static image, not an animation.
-
p.63 - We incorrectly state that the hardware limits the number of blocks in a single launch to 65,535. We correctly mention in other places in the book that the hardware limits the number of blocks in a single dimension of a launch to 65,535.
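Because the limit applies per dimension, a launch that needs more than 65,535 blocks can spread them over a second grid dimension. The following is a minimal sketch of that idea, not code from the book: makeGrid, touchBlocks, and MAX_BLOCKS_PER_DIM are hypothetical names introduced here for illustration.

```cuda
#include <cuda_runtime.h>

// Per-dimension block limit discussed in this erratum. Devices report their
// actual limits in cudaDeviceProp::maxGridSize, so treat this as a default.
const unsigned int MAX_BLOCKS_PER_DIM = 65535;

// Hypothetical helper: returns a 2D grid with at least `blocksNeeded` blocks.
// Assumes 1 <= blocksNeeded <= MAX_BLOCKS_PER_DIM * MAX_BLOCKS_PER_DIM.
dim3 makeGrid( unsigned int blocksNeeded ) {
    if (blocksNeeded <= MAX_BLOCKS_PER_DIM)
        return dim3( blocksNeeded, 1 );
    // Round up so the last, partially filled row is included.
    unsigned int rows = (blocksNeeded - 1) / MAX_BLOCKS_PER_DIM + 1;
    return dim3( MAX_BLOCKS_PER_DIM, rows );
}

// Each block recovers a linear index; surplus blocks in the last row do nothing.
__global__ void touchBlocks( int *flags, unsigned int blocksNeeded ) {
    unsigned int block = blockIdx.x + blockIdx.y * gridDim.x;
    if (block < blocksNeeded && threadIdx.x == 0)
        flags[block] = 1;
}
```

A launch would then look like touchBlocks<<<makeGrid( n ), 128>>>( dev_flags, n ), where dev_flags is assumed to be a device array of at least n ints.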
-
p.102 - The kernel is missing the update of maxz. At the end of the if (t > maxz) block, the value of maxz should be updated: maxz = t; This line has been added in the downloadable code package.
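To see why the update matters, consider a reduced version of the loop. This is a simplified sketch rather than the kernel from the book: closestHit is a hypothetical function, and the depth array stands in for the values returned by each sphere's hit test.

```cuda
#include <cuda_runtime.h>

#define INF 2e10f

// Hypothetical reduction of the ray-tracing loop. depth[i] stands in for the
// value returned by sphere i's hit test, with -INF meaning "no hit".
// Returns the index of the hit with the largest depth value, or -1 if none.
__host__ __device__ int closestHit( const float *depth, int count ) {
    int   best = -1;
    float maxz = -INF;
    for (int i = 0; i < count; i++) {
        float t = depth[i];
        if (t > maxz) {
            best = i;
            maxz = t;   // the update this erratum adds
        }
    }
    return best;
}
```

Without the maxz = t; line, maxz stays at -INF, so every sphere that is hit passes the comparison and the last one tested wins regardless of its depth.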
-
p.123-124 - "Since the for() loop leaves the input and output swapped, we first swap the input and output buffers so that the output actually contains the output of the 90th time step." This incorrectly describes what's happening in the code. It should read, "Since the for() loop leaves the input and output swapped, we pass the input buffer to the next kernel, which actually contains the output of the 90th time step."
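The buffer bookkeeping can be sketched with a host-only stand-in. The names step and runSteps are hypothetical, and the "simulation" here simply adds one per time step; the point is only which pointer holds the newest data once the loop ends.

```cuda
#include <utility>

// Hypothetical stand-in for one simulation step: reads `in`, writes `out`.
static void step( float *out, const float *in ) {
    out[0] = in[0] + 1.0f;
}

// Runs `steps` iterations and returns the buffer holding the newest result.
float *runSteps( float *inSrc, float *outSrc, int steps ) {
    for (int i = 0; i < steps; i++) {
        step( outSrc, inSrc );
        std::swap( inSrc, outSrc );   // newest data is now behind inSrc
    }
    // Because the loop ends with a swap, inSrc (not outSrc) holds the
    // output of the final time step and is what the next stage should read.
    return inSrc;
}
```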
-
p.174,177 - Both occurrences of sizeof( long ) should be sizeof( int ) instead. These lines have been corrected in the downloadable code package.
-
p.251-254 - The text presents a lock implementation that is used in both a dot product program and a hash table program. As documented in the CUDA programming guide, GPUs implement weak memory ordering, which means other threads may observe stale values unless memory fence instructions are used. The lock implementation given in the text did not account for this and requires __threadfence() instructions in both the lock() and unlock() functions to ensure stale values are not read. These functions have been corrected in the downloadable code package.
Additionally, on Fermi architectures, whose L1 caches are not coherent, this code should be compiled to target the L2 cache using the command-line option -Xptxas -dlcm=cg.
-
p.253 - The unlock() function should contain the line, atomicExch( mutex, 0 ); and not atomicExch( mutex, 1 ); In addition, subsequent references to this in the text incorrectly refer to "1" as the argument to atomicExch(). This line has been corrected in the downloadable code package.
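Combining this correction with the memory-fence correction for p.251-254, a lock along these lines might look as follows. This is a sketch under stated assumptions, not necessarily the exact code in the downloadable package: the fence placement shown is one reasonable reading of the erratum, countBlocks and countWithLock are hypothetical names, and CUDA error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Sketch of a spin lock. `mutex` points to an int in device memory that is
// 0 when the lock is free and 1 when it is held.
struct Lock {
    int *mutex;

    __device__ void lock( void ) {
        while (atomicCAS( mutex, 0, 1 ) != 0);   // spin until 0 -> 1 succeeds
        __threadfence();   // order later reads after acquiring the lock
    }
    __device__ void unlock( void ) {
        __threadfence();   // make protected writes visible before release
        atomicExch( mutex, 0 );   // 0, not 1, marks the lock as free
    }
};

// One thread per block enters the critical section and bumps the counter.
__global__ void countBlocks( Lock lock, int *counter ) {
    if (threadIdx.x == 0) {
        lock.lock();
        *counter += 1;
        lock.unlock();
    }
}

// Hypothetical host driver: returns the counter after `blocks` increments.
int countWithLock( int blocks ) {
    Lock lock;
    int *counter;
    int result = -1;
    cudaMalloc( (void**)&lock.mutex, sizeof(int) );
    cudaMalloc( (void**)&counter, sizeof(int) );
    cudaMemset( lock.mutex, 0, sizeof(int) );   // start unlocked
    cudaMemset( counter, 0, sizeof(int) );
    countBlocks<<<blocks, 32>>>( lock, counter );
    cudaMemcpy( &result, counter, sizeof(int), cudaMemcpyDeviceToHost );
    cudaFree( counter );
    cudaFree( lock.mutex );
    return result;
}
```

Only one thread per block takes the lock here, which avoids threads of the same warp spinning against each other.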
-
p.269 - We declare a function __device__ __host__ size_t hash( unsigned int value, size_t count ). We include __device__ functions several times in the book, but unfortunately we never explain what a __host__ function is. In short, we can add a __host__ keyword to __device__ functions to instruct the NVIDIA compiler to generate both device and host versions of the function. The device version runs on the device and can only be called from device code. Likewise, the host version runs on the host and can only be called from host code. This is a convenient mechanism for writing one implementation of a function that you intend to use in both device and host code.
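As an illustration, the sketch below declares such a function once and calls it from both sides. The modulo body and the kernel hashOnDevice are assumptions made for this example.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Compiled twice: once for the host and once for the device.
__host__ __device__ size_t hash( unsigned int value, size_t count ) {
    return value % count;   // assumed body, for illustration only
}

// Device code calls the device version of hash().
__global__ void hashOnDevice( const unsigned int *keys, size_t *slots,
                              int n, size_t count ) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        slots[i] = hash( keys[i], count );
}
```

Host code can call hash( 10u, 4 ) directly, for example to verify a table that a kernel built, and it runs the same logic the kernel used.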