DLI GPU Teaching Kit - Accelerated Computing Syllabus
This page is the syllabus for the NVIDIA Deep Learning Institute (DLI) Accelerated Computing Teaching Kit, outlining each module's organization in the downloaded Teaching Kit .zip file. It shows the content for every module as well as a link to the suggested online DLI course for each module where applicable.
Here you will also find links to all of the lecture videos.
Module 1: Course Introduction
In this module we review the course goals and syllabus and introduce the concepts of heterogeneous and parallel programming.
Lecture Slides
1.1 - Course Introduction and Overview
1.2 - Introduction to Heterogeneous Parallel Computing
1.3 - Portability and Scalability in Heterogeneous Parallel Computing
3rd Ed. Book Chapters
- Chapter 1 - Introduction
Module 2: Introduction to CUDA C
In this module we cover the basic API functions in CUDA host code and introduce CUDA threads, the main mechanism for exploiting data parallelism.
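For a concrete preview of these API functions, here is a minimal vector-addition sketch; the names (`vecAdd`, `hostVecAdd`) are illustrative and not taken from the kit's lab code.

```cuda
#include <cuda_runtime.h>

// Each thread computes one element of the output vector.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void hostVecAdd(const float *h_a, const float *h_b, float *h_c, int n) {
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);                              // allocate device memory
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // host -> device
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    int block = 256;
    int grid = (n + block - 1) / block;                   // enough blocks to cover n
    vecAdd<<<grid, block>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}
```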
Lecture Slides
2.1 - CUDA C vs. Thrust vs. CUDA Libraries
2.2 - Memory Allocation and Data Movement API Functions
2.3 - Threads and Kernel Functions
2.4 - Introduction to the CUDA Toolkit
2.5 - Nsight Compute and Nsight Systems
2.6 - Unified Memory
DLI Online Courses
- An Even Easier Introduction to CUDA
- Optimizing CUDA Machine Learning Codes With Nsight Profiling Tools
Labs
- Device Query
- CUDA Toolkit
Quiz
- Module 2 Quiz
3rd Ed. Book Chapters
- Chapter 2 - Data Parallel Computing
Module 3: CUDA Parallelism Model
In this module we introduce the CUDA kernel, efficient memory access patterns, and thread scheduling.
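As a sketch of the multidimensional configuration and the color-to-grayscale example covered in lectures 3.2 and 3.3 (the luminance weights follow the textbook example; the kernel name is illustrative):

```cuda
// One thread per pixel; a 2D grid covers an arbitrary width x height image.
__global__ void colorToGray(const unsigned char *rgb, unsigned char *gray,
                            int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        int i = row * width + col;
        // Standard luminance weights used in the lecture example.
        gray[i] = 0.21f * rgb[3*i] + 0.71f * rgb[3*i + 1] + 0.07f * rgb[3*i + 2];
    }
}

// Launch configuration: 16x16 threads per block, grid rounded up.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// colorToGray<<<grid, block>>>(d_rgb, d_gray, width, height);
```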
Lecture Slides
3.1 - Kernel-Based SPMD Parallel Programming
3.2 - Multidimensional Kernel Configuration
3.3 - Color-to-Grayscale Image Processing Example
3.4 - Image Blur Example
3.5 - Thread Scheduling
Labs
- CUDA Image Blur
- CUDA Image Color to Grayscale
- CUDA Thrust Vector Add
- CUDA Vector Add
Quiz
- Module 3 Quiz
3rd Ed. Book Chapters
- Chapter 3 - Scalable Parallel Execution
Module 4: Memory and Data Locality
In this module we introduce the CUDA memory types and explore their effective use in tiled parallel algorithms.
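A minimal sketch of the tiled matrix multiplication kernel developed in lectures 4.3 and 4.4, assuming the matrix width is a multiple of the tile width (lecture 4.5 covers arbitrary sizes; names here are illustrative):

```cuda
#define TILE_WIDTH 16

// Each block stages TILE_WIDTH x TILE_WIDTH sub-tiles of M and N in shared
// memory, so every global load is reused TILE_WIDTH times.
__global__ void tiledMatMul(const float *M, const float *N, float *P, int width) {
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];
    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float val = 0.0f;
    for (int t = 0; t < width / TILE_WIDTH; ++t) {
        Ms[threadIdx.y][threadIdx.x] = M[row * width + t * TILE_WIDTH + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = N[(t * TILE_WIDTH + threadIdx.y) * width + col];
        __syncthreads();                 // tile fully loaded before use
        for (int k = 0; k < TILE_WIDTH; ++k)
            val += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();                 // done with tile before overwriting it
    }
    P[row * width + col] = val;
}
```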
Lecture Slides
4.1 - CUDA Memories
4.2 - Tiled Parallel Algorithms
4.3 - Tiled Matrix Multiplication
4.4 - Tiled Matrix Multiplication Kernel
4.5 - Handling Arbitrary Matrix Sizes in Tiled Algorithms
Labs
- Basic Matrix Multiplication
- CUDA Tiled Matrix Multiplication
Quiz
- Module 4 Quiz
3rd Ed. Book Chapters
- Chapter 4 - Memory and Data Locality
Module 5: Thread Execution Efficiency
In this module we explore how CUDA threads execute on SIMD hardware and how to analyze the performance impact of control divergence.
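A small illustrative kernel contrasting a divergent branch with a warp-uniform one (the kernel name and operations are placeholders):

```cuda
// Each warp (32 consecutive threads) executes in lockstep on SIMD hardware,
// so a branch that splits a warp forces both paths to run serially.
__global__ void divergenceDemo(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Divergent: even and odd lanes of the same warp take different paths.
    if (threadIdx.x % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;

    // Warp-uniform: the condition is constant within each 32-thread warp,
    // so no warp executes both paths and there is no divergence penalty.
    if ((threadIdx.x / 32) % 2 == 0)
        data[i] -= 0.5f;
    else
        data[i] += 0.5f;
}
```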
Lecture Slides
5.1 - Warps and SIMD Hardware
5.2 - Performance Impact of Control Divergence
Quiz
- Module 5 Quiz
3rd Ed. Book Chapters
- Chapter 5 - Performance Considerations
Module 6: Memory Access Performance
In this module we explore the significance of memory coalescing to effectively utilize memory bandwidth in CUDA.
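Two illustrative access patterns that make the point (kernel names are placeholders):

```cuda
// Coalesced: consecutive threads in a warp read consecutive addresses, so
// each warp's accesses combine into a few wide DRAM transactions.
__global__ void coalescedRead(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads touch addresses `stride` elements apart,
// forcing many separate transactions and wasting most of each DRAM burst.
__global__ void stridedRead(const float *in, float *out, int n, int stride) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int i = t * stride;
    if (i < n) out[t] = in[i];
}
```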
Lecture Slides
6.1 - DRAM Bandwidth
6.2 - Memory Coalescing in CUDA
Quiz
- Module 6 Quiz
3rd Ed. Book Chapters
- Chapter 5 - Performance Considerations
Module 7: Parallel Computation Patterns (Histogram)
In this module we introduce the parallel histogram computation pattern and learn to write a high-performance kernel by privatizing outputs.
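A sketch of the privatization technique from lecture 7.5, assuming 256 bins over unsigned char input (names are illustrative):

```cuda
#define NUM_BINS 256

// Each block accumulates into a private shared-memory histogram using fast
// shared-memory atomics, then merges it into the global histogram once.
__global__ void histogramPrivatized(const unsigned char *data, long n,
                                    unsigned int *histo) {
    __shared__ unsigned int local[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local[b] = 0;
    __syncthreads();

    // Grid-stride loop over the input.
    for (long i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += (long)blockDim.x * gridDim.x)
        atomicAdd(&local[data[i]], 1u);
    __syncthreads();

    // One global atomic per bin per block to commit the private copy.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&histo[b], local[b]);
}
```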
Lecture Slides
7.1 - Histogramming
7.2 - Introduction to Data Races
7.3 - Atomic Operations in CUDA
7.4 - Atomic Operation Performance
7.5 - Privatization Technique for Improved Throughput
Labs
- Histogram
- Text Histogram
- Thrust Histogram Sort
Quiz
- Module 7 Quiz
3rd Ed. Book Chapters
- Chapter 9 - Parallel Patterns: Parallel Histogram Computation
Module 8: Parallel Computation Patterns (Stencil)
In this module we introduce the tiled convolution pattern. We will learn to analyze the cost and benefit of tiled parallel convolution algorithms.
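A sketch of the basic 1D convolution kernel from lecture 8.1, with ghost elements outside the array treated as zero; the tiled shared-memory version in lecture 8.2 builds on this (mask size and names are illustrative):

```cuda
#define MASK_WIDTH 5
__constant__ float mask[MASK_WIDTH];   // small, read-only filter in constant memory

// Each thread computes one output element of the convolution.
__global__ void convolution1D(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float sum = 0.0f;
    int start = i - MASK_WIDTH / 2;
    for (int j = 0; j < MASK_WIDTH; ++j) {
        int k = start + j;
        if (k >= 0 && k < n)           // skip ghost elements at the boundaries
            sum += in[k] * mask[j];
    }
    out[i] = sum;
}
```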
Lecture Slides
8.1 - Convolution
8.2 - Tiled Convolution
8.3 - Tile Boundary Conditions
8.4 - Analyzing Data Reuse in Tiled Convolution
Labs
- Convolution
- Stencil
Quiz
- Module 8 Quiz
3rd Ed. Book Chapters
- Chapter 7 - Parallel Patterns: Convolution
Module 9: Parallel Computation Patterns (Reduction)
In this module we introduce the parallel reduction pattern.
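A sketch in the spirit of the improved kernel from lecture 9.3, where the stride starts large and halves so that active threads stay contiguous and divergence is confined to the final steps (assumes a block size of 256; names are illustrative):

```cuda
// Sum reduction: each block reduces 2 * blockDim.x input elements to one
// partial sum; a second pass (or host loop) combines the partials.
__global__ void blockSum(const float *in, float *partial, int n) {
    __shared__ float sdata[256];        // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x * 2 + tid;
    float v = 0.0f;
    if (i < n)              v += in[i];
    if (i + blockDim.x < n) v += in[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];   // one partial sum per block
}
```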
Lecture Slides
9.1 - Parallel Reduction
9.2 - A Basic Reduction Kernel
9.3 - A Better Reduction Kernel
Labs
- Reduction
- Thrust Reduction
Quiz
- Module 9 Quiz
3rd Ed. Book Chapters
- Chapter 5 - Performance Considerations
Module 10: Parallel Computation Patterns (Scan)
In this module we introduce the parallel scan (prefix sum) pattern.
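A sketch of the work-inefficient (Kogge-Stone) inclusive scan from lecture 10.2; the work-efficient version in lecture 10.3 reduces the operation count (assumes blockDim.x equals SECTION_SIZE; names are illustrative):

```cuda
#define SECTION_SIZE 256

// Inclusive prefix sum over one block's section of the input.
__global__ void inclusiveScan(const float *in, float *out, int n) {
    __shared__ float temp[SECTION_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    temp[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        __syncthreads();
        float v = 0.0f;
        if (threadIdx.x >= stride) v = temp[threadIdx.x - stride];
        __syncthreads();               // read everywhere before writing anywhere
        if (threadIdx.x >= stride) temp[threadIdx.x] += v;
    }
    if (i < n) out[i] = temp[threadIdx.x];
}
```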
Lecture Slides
10.1 - Prefix Sum
10.2 - A Work-inefficient Scan Kernel
10.3 - A Work-Efficient Parallel Scan Kernel
10.4 - More on Parallel Scan
Labs
- List Scan
- Thrust List Reduction
Quiz
- Module 10 Quiz
3rd Ed. Book Chapters
- Chapter 8 - Parallel Patterns: Prefix Sum
Module 11: Breadth-First Search (BFS) Queue
In this module we cover the breadth-first search (BFS) queue.
Labs
- Breadth-First Search Queue
Module 12: Floating-Point Considerations
In this module we introduce the fundamentals of floating-point representation.
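A small host-side example of why these lectures matter: floating-point addition is not associative, because a small term added to a large one first can be absorbed entirely.

```c
#include <stdio.h>

// With float (24-bit significand), 1.0f is below half an ulp of 1.0e8f,
// so it vanishes when added to the large value first.
int main(void) {
    float big = 1.0e8f, tiny = 1.0f;
    float a = (big + tiny) - big;   // tiny is absorbed: result is 0.0
    float b = (big - big) + tiny;   // reordered: result is 1.0
    printf("%f %f\n", a, b);        // prints 0.000000 1.000000
    return 0;
}
```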
Lecture Slides
12.1 - Floating-Point Precision and Accuracy
12.2 - Numerical Stability
3rd Ed. Book Chapters
- Chapter 6 - Numerical Considerations
Module 13: GPU as Part of the PC Architecture
In this module we introduce how GPUs fit in the PC architecture.
Lecture Slides
13.1 - GPU as Part of the PC Architecture
3rd Ed. Book Chapters
- Chapter 18 - Programming a Heterogeneous Computing Cluster
Module 14: Efficient Host-Device Data Transfer
In this module we discuss important concepts involved in copying (transferring) data between host and device.
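A sketch of the transfer/compute overlap covered in lectures 14.1-14.3, assuming `h_in` and `h_out` point to pinned memory allocated with cudaMallocHost (pageable buffers would serialize the copies); all names are illustrative:

```cuda
#include <cuda_runtime.h>

__global__ void process(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];      // stand-in computation
}

// Split the work into chunks and cycle them through two streams so the
// host-to-device copy of one chunk overlaps the kernel and device-to-host
// copy of another.
void pipelined(float *h_in, float *h_out, float *d_in, float *d_out,
               int n, int chunk) {
    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);
    for (int off = 0, c = 0; off < n; off += chunk, ++c) {
        cudaStream_t st = streams[c % 2];
        int m = (n - off < chunk) ? n - off : chunk;
        cudaMemcpyAsync(d_in + off, h_in + off, m * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        process<<<(m + 255) / 256, 256, 0, st>>>(d_in + off, d_out + off, m);
        cudaMemcpyAsync(h_out + off, d_out + off, m * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();               // drain both streams
    for (int s = 0; s < 2; ++s) cudaStreamDestroy(streams[s]);
}
```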
Lecture Slides
14.1 - Pinned Host Memory
14.2 - Task Parallelism in CUDA
14.3 - Overlapping Data Transfer with Computation
14.4 - CUDA Unified Memory
DLI Online Courses
- Getting Started with Accelerated Computing in Modern CUDA C/C++, Sections 2 and 3: Unlocking the GPU's Full Potential: Harnessing Asynchrony with CUDA Streams and Implementing New Algorithms with CUDA Kernels
- Accelerating CUDA C++ Applications with Concurrent Streams
Labs
- Vector Addition Using CUDA Streams
- Vector Addition Using Pinned Memory
- CUDA Unified Memory Matrix Multiplication
Quiz
- Module 14 Quiz
3rd Ed. Book Chapters
- Chapter 18 - Programming a Heterogeneous Computing Cluster
- Chapter 20 - More on CUDA and Graphics Processing Unit Computing
Module 15: Application Case Study: Advanced MRI Reconstruction
In this module we introduce the MRI Reconstruction case study.
Lecture Slides
15.1 - Advanced MRI Reconstruction
15.2 - Kernel Optimizations
3rd Ed. Book Chapters
- Chapter 14 - Application Case Study - Non-Cartesian Magnetic Resonance Imaging
Module 16: Application Case Study: Electrostatic Potential Calculation
In this module we introduce the Electrostatic Potential Calculation case study.
Lecture Slides
16.1 - Electrostatic Potential Calculation - Part 1
16.2 - Electrostatic Potential Calculation - Part 2
Module 17: Computational Thinking for Parallel Programming
In this module we provide a framework for thinking about the problems of parallel programming.
Lecture Slides
17.1 - Introduction to Computational Thinking
3rd Ed. Book Chapters
- Chapter 17 - Parallel Programming and Computational Thinking
Module 18: Related Programming Models: MPI
In this module we introduce the MPI programming model.
Lecture Slides
18.1 - Introduction to Heterogeneous Supercomputing and MPI
3rd Ed. Book Chapters
- Chapter 18 - Programming a Heterogeneous Computing Cluster
Module 19: CUDA Python using Numba
In this module we introduce CUDA Python using Numba.
Module 20: Related Programming Models: OpenCL
In this module we introduce the OpenCL programming model.
Lecture Slides
20.1 - OpenCL Data Parallelism Model
20.2 - OpenCL Device Architecture
20.3 - OpenCL Host Code
Labs
- OpenCL Vector Addition
Quiz
- Module 20 Quiz
3rd Ed. Book Chapters
- Appendix - An Introduction to OpenCL
Module 21: Related Programming Models: OpenACC
In this module we introduce the OpenACC programming model.
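A minimal OpenACC sketch of the directive-based style introduced in lecture 21.1: a single pragma offloads the loop, with explicit data clauses (compile with an OpenACC compiler such as nvc with -acc; the function name is illustrative):

```c
// OpenACC offloads this loop to the GPU; copyin/copy make the host-device
// data movement explicit.
void saxpy(int n, float a, const float *x, float *y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```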
Lecture Slides
21.1 - Introduction to OpenACC
21.2 - OpenACC Subtleties
Labs
- OpenACC CUDA Vector Add
Quiz
- Module 21 Quiz
3rd Ed. Book Chapters
- Chapter 19 - Parallel Programming with OpenACC
Module 22: Related Programming Models: OpenGL
In this module we introduce the OpenGL programming model.
Module scheduled for a future release of the teaching kit
Module 23: Dynamic Parallelism
In this module we introduce dynamic parallelism.
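A sketch of the core idea: a kernel can launch further kernels, sizing each child grid from data known only on the device (requires separate compilation with -rdc=true; names are illustrative):

```cuda
__global__ void child(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;            // stand-in per-task work
}

// Each parent thread launches one child grid for its task, using a task
// size computed on the device rather than decided ahead of time on the host.
__global__ void parent(float *data, const int *offset, const int *count,
                       int numTasks) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < numTasks) {
        int n = count[t];
        if (n > 0)
            child<<<(n + 255) / 256, 256>>>(data + offset[t], n);
    }
}
```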
Lecture Slides
23.1 - Dynamic Parallelism
Labs
- Dynamic Parallelism
3rd Ed. Book Chapters
- Chapter 13 - CUDA Dynamic Parallelism
Module 24: Multi-GPU
In this module we discuss programming with multiple GPUs.
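A minimal sketch of the basic multi-GPU pattern: each device gets its own buffer and launch, with cudaSetDevice routing subsequent calls (the `step` kernel stands in for a real per-device update such as the heat-equation lab; names are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void step(float *g, int n) {    // placeholder per-device update
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) g[i] += 1.0f;
}

// d_grid[d] is a buffer previously allocated on device d.
void multiGpuStep(float **d_grid, int perDevice) {
    int devCount = 0;
    cudaGetDeviceCount(&devCount);
    for (int d = 0; d < devCount; ++d) {
        cudaSetDevice(d);                  // route the launch to device d
        step<<<(perDevice + 255) / 256, 256>>>(d_grid[d], perDevice);
    }
    for (int d = 0; d < devCount; ++d) {   // wait for every device
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }
}
```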
Lecture Slides
24.1 - OpenMP
24.2 - Multi-GPU Introduction I
24.3 - Multi-GPU Introduction II
24.4 - OpenMP and Cooperative Groups
24.5 - Multi-GPU Heat Equation
Labs
- Multi-GPU Heat Equation
Quiz
- Module 24 Quiz
Module 25: Using CUDA Libraries
In this module we introduce the effective use of CUDA libraries.
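A sketch of a single cuBLAS call replacing a hand-written matrix multiplication kernel; note that cuBLAS assumes column-major storage, so the leading dimensions are the matrix heights (the wrapper name is illustrative):

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

// C = alpha * A * B + beta * C, with A (m x k), B (k x n), C (m x n)
// already resident on the device.
void gemm(const float *d_A, const float *d_B, float *d_C,
          int m, int n, int k) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, d_A, m,    // lda = m (column-major height of A)
                        d_B, k,    // ldb = k
                &beta,  d_C, m);   // ldc = m
    cublasDestroy(handle);
}
```

In practice the handle would be created once and reused across calls rather than per invocation.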
Lecture Slides
25.1 - cuBLAS
25.2 - cuSOLVER
25.3 - cuFFT
25.4 - Thrust
DLI Online Courses
- GPU Acceleration with the C++ Standard Library
- Scaling GPU-Accelerated Applications with the C++ Standard Library
Labs
- Equation with NVIDIA Libraries
Quiz
- Module 25 Quiz
3rd Ed. Book Chapters
- Appendix - Thrust: A Productivity-Oriented Library for CUDA
Module 26: Advanced Thrust
In this module we discuss advanced Thrust topics.
Module scheduled for a future release of the teaching kit