After clicking “Watch Now” you will be prompted to login or join.
Click “Watch Now” to login or join the NVIDIA Developer Program.
CUDA on NVIDIA Ampere GPU Architecture: Taking Your Algorithms to the Next Level of Performance
Carter Edwards, NVIDIA
NVIDIA Ampere GPU Architecture delivers exciting new capabilities to take your algorithms to the next level of performance. Learn how to load shared memory at the speed of light, exert control over cache residency, and configure flexible synchronization patterns. Be delighted by how easily shared memory is prefetched while computing. If you can cudaMemcpyAsync host to device memory, then you will know how to memcpy_async device to shared memory. To make the most of the new 48Mb cache, you can prioritize what data should persist in cache for frequent fast access, or what data should perturb cache as little as possible as it streams from device memory. For algorithms that have been limited by only having __syncthreads() and __syncwarp(), there is the new fully configurable barrier type. Configure your own groups of threads to coordinate through your own barrier objects. Coordinate groups into producer/consumer patterns. Will you combine these building blocks to craft a persistent systolic array kernel?