DLI GPU Teaching Kit - Accelerated Computing Syllabus

This page is the syllabus for the NVIDIA Deep Learning Institute (DLI) Accelerated Computing Teaching Kit. It outlines how each module is organized in the downloaded Teaching Kit .zip file, lists the content of every module, and links to the suggested online DLI course for each module where applicable.
Links to all of the lecture videos are also provided here.

Module 1: Course Introduction

In this module we review the course goals and syllabus and introduce the concepts of heterogeneous and parallel programming.

Lecture Slides

1.1 - Course Introduction and Overview
1.2 - Introduction to Heterogeneous Parallel Computing
1.3 - Portability and Scalability in Heterogeneous Parallel Computing

Lecture Videos

3rd Ed. Book Chapters

Chapter 1 - Introduction

Module 2: Introduction to CUDA C

In this module we cover the basic API functions in CUDA host code and introduce CUDA threads, the main mechanism for exploiting data parallelism.

Lecture Slides

2.1 - CUDA C vs. Thrust vs. CUDA Libraries
2.2 - Memory Allocation and Data Movement API Functions
2.3 - Threads and Kernel Functions
2.4 - Introduction to the CUDA Toolkit
2.5 - Nsight Compute and Nsight Systems
2.6 - Unified Memory

Lecture Videos

DLI Online Courses

Labs

Device Query
CUDA Toolkit

Quiz

Module 2 Quiz

3rd Ed. Book Chapters

Chapter 2 - Data Parallel Computing

Module 3: CUDA Parallelism Model

In this module we introduce the CUDA kernel, efficient memory access patterns, and thread scheduling.

Lecture Slides

3.1 - Kernel-Based SPMD Parallel Programming
3.2 - Multidimensional Kernel Configuration
3.3 - Color-to-Grayscale Image Processing Example
3.4 - Image Blur Example
3.5 - Thread Scheduling

Lecture Videos

DLI Online Courses

Getting Started with Accelerated Computing in Modern CUDA C/C++, Section 1: CUDA Made Easy: Accelerating Applications with Parallel Algorithms

Labs

CUDA Image Blur
CUDA Image Color to Grayscale
CUDA Thrust Vector Add
CUDA Vector Add

Quiz

Module 3 Quiz

3rd Ed. Book Chapters

Chapter 3 - Scalable Parallel Execution

Module 4: Memory and Data Locality

In this module we introduce the CUDA memory types and explore their effective use in tiled parallel algorithms.

Lecture Slides

4.1 - CUDA Memories
4.2 - Tiled Parallel Algorithms
4.3 - Tiled Matrix Multiplication
4.4 - Tiled Matrix Multiplication Kernel
4.5 - Handling Arbitrary Matrix Sizes in Tiled Algorithms

Lecture Videos

Labs

Basic Matrix Multiplication
CUDA Tiled Matrix Multiplication

Quiz

Module 4 Quiz

3rd Ed. Book Chapters

Chapter 4 - Memory and Data Locality

Module 5: Thread Execution Efficiency

In this module we explore how CUDA threads execute on SIMD hardware and how to analyze the performance impact of control divergence.

Lecture Slides

5.1 - Warps and SIMD Hardware
5.2 - Performance Impact of Control Divergence

Lecture Videos

Quiz

Module 5 Quiz

3rd Ed. Book Chapters

Chapter 5 - Performance Considerations

Module 6: Memory Access Performance

In this module we explore the significance of memory coalescing to effectively utilize memory bandwidth in CUDA.

Lecture Slides

6.1 - DRAM Bandwidth
6.2 - Memory Coalescing in CUDA

Lecture Videos

Quiz

Module 6 Quiz

3rd Ed. Book Chapters

Chapter 5 - Performance Considerations

Module 7: Parallel Computation Patterns (Histogram)

In this module we introduce the parallel histogram computation pattern and learn to write a high-performance kernel by privatizing outputs.
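The idea behind privatizing outputs can be sketched on the CPU. The following is a minimal, sequential C++ illustration and is not taken from the Teaching Kit: the function name `letter_histogram_privatized`, the 26-bin lowercase-letter layout, and the `num_blocks` parameter are all assumptions made for this example. Each simulated "block" counts into its own private histogram and only then merges into the shared result, which in a CUDA kernel would typically correspond to a shared-memory histogram merged into global memory with atomic adds.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>

constexpr std::size_t kBins = 26;  // hypothetical layout: one bin per lowercase letter

// Sequential emulation of privatization: the input is split into num_blocks
// contiguous chunks, each chunk is counted into a private histogram, and the
// private copies are merged into the global histogram afterwards.
std::array<unsigned, kBins> letter_histogram_privatized(const std::string& text,
                                                        std::size_t num_blocks) {
    std::array<unsigned, kBins> global{};
    if (num_blocks == 0) num_blocks = 1;
    const std::size_t chunk = (text.size() + num_blocks - 1) / num_blocks;

    for (std::size_t b = 0; b < num_blocks; ++b) {
        std::array<unsigned, kBins> local{};  // private copy for this "block"
        const std::size_t begin = b * chunk;
        const std::size_t end = std::min(text.size(), begin + chunk);
        for (std::size_t i = begin; i < end; ++i) {
            const char c = text[i];
            if (c >= 'a' && c <= 'z') ++local[static_cast<std::size_t>(c - 'a')];
        }
        // Merge step: on a GPU this is where contended atomic updates to the
        // shared histogram would happen, once per bin instead of once per input.
        for (std::size_t k = 0; k < kBins; ++k) global[k] += local[k];
    }
    return global;
}
```

Calling `letter_histogram_privatized("hello world", 3)` gives the same counts as calling it with a single block; only the number of updates that touch the shared histogram changes.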

Lecture Slides

7.1 - Histogramming
7.2 - Introduction to Data Races
7.3 - Atomic Operations in CUDA
7.4 - Atomic Operation Performance
7.5 - Privatization Technique for Improved Throughput

Lecture Videos

Labs

Histogram
Text Histogram
Thrust Histogram Sort

Quiz

Module 7 Quiz

3rd Ed. Book Chapters

Chapter 9 - Parallel Patterns: Parallel Histogram Computation

Module 8: Parallel Computation Patterns (Stencil)

In this module we introduce the tiled convolution pattern. We will learn to analyze the cost and benefit of tiled parallel convolution algorithms.

Lecture Slides

8.1 - Convolution
8.2 - Tiled Convolution
8.3 - Tile Boundary Conditions
8.4 - Analyzing Data Reuse in Tiled Convolution

Lecture Videos

Labs

Convolution
Stencil

Quiz

Module 8 Quiz

3rd Ed. Book Chapters

Chapter 7 - Parallel Patterns: Convolution

Module 9: Parallel Computation Patterns (Reduction)

In this module we introduce the parallel reduction pattern.

Lecture Slides

9.1 - Parallel Reduction
9.2 - A Basic Reduction Kernel
9.3 - A Better Reduction Kernel

Lecture Videos

Labs

Reduction
Thrust Reduction

Quiz

Module 9 Quiz

3rd Ed. Book Chapters

Chapter 5 - Performance Considerations

Module 10: Parallel Computation Patterns (Scan)

In this module we introduce the parallel scan (prefix sum) pattern.
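As a point of reference for what a scan computes, here is a small C++ sketch written for this page and not drawn from the Teaching Kit materials. The function names are hypothetical. The first function is the sequential definition of an inclusive prefix sum; the second emulates, step by step on the CPU, a doubling-stride scheme of the kind commonly used to introduce a work-inefficient parallel scan, under the assumption that every element in a step is updated from the previous step's values.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sequential reference: out[i] = in[0] + in[1] + ... + in[i].
std::vector<int> inclusive_scan_reference(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += in[i];
        out[i] = running;
    }
    return out;
}

// CPU emulation of a doubling-stride scan. In each step, element i adds the
// value that sat `stride` positions to its left in the previous step. A second
// buffer stands in for the synchronization a parallel version would need.
std::vector<int> inclusive_scan_doubling(std::vector<int> data) {
    for (std::size_t stride = 1; stride < data.size(); stride *= 2) {
        std::vector<int> next = data;
        for (std::size_t i = stride; i < data.size(); ++i) {
            next[i] = data[i] + data[i - stride];
        }
        data = std::move(next);
    }
    return data;
}
```

For the input `{3, 1, 7, 0, 4, 1, 6, 3}` both functions return `{3, 4, 11, 11, 15, 16, 22, 25}`. The doubling version performs on the order of n log n additions in total, compared with n - 1 for the sequential loop, which is the sense in which such a scheme is called work-inefficient.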

Lecture Slides

10.1 - Prefix Sum
10.2 - A Work-inefficient Scan Kernel
10.3 - A Work-Efficient Parallel Scan Kernel
10.4 - More on Parallel Scan

Lecture Videos

Labs

List Scan
Thrust List Reduction

Quiz

Module 10 Quiz

3rd Ed. Book Chapters

Chapter 8 - Parallel Patterns: Prefix Sum

Module 11: Breadth-First Search (BFS) Queue

In this module we cover the Breadth-First Search (BFS) queue.

Labs

Breadth-First Search Queue

Module 12: Floating-Point Considerations

In this module we introduce the fundamentals of floating-point representation.

Lecture Slides

12.1 - Floating-Point Precision and Accuracy
12.2 - Numerical Stability

Lecture Videos

3rd Ed. Book Chapters

Chapter 6 - Numerical Considerations

Module 13: GPU as Part of the PC Architecture

In this module we introduce how GPUs fit in the PC architecture.

Lecture Slides

13.1 - GPU as Part of the PC Architecture

Lecture Videos

3rd Ed. Book Chapters

Chapter 18 - Programming a Heterogeneous Computing Cluster

Module 14: Efficient Host-Device Data Transfer

In this module we discuss important concepts involved in copying (transferring) data between host and device.

Lecture Slides

14.1 - Pinned Host Memory
14.2 - Task Parallelism in CUDA
14.3 - Overlapping Data Transfer with Computation
14.4 - CUDA Unified Memory

Lecture Videos

DLI Online Courses

Labs

Vector Addition Using CUDA Streams
Vector Addition Using Pinned Memory
CUDA Unified Memory Matrix Multiplication

Quiz

Module 14 Quiz

3rd Ed. Book Chapters

Chapter 18 - Programming a Heterogeneous Computing Cluster
Chapter 20 - More on CUDA and Graphics Processing Unit Computing

Module 15: Application Case Study: Advanced MRI Reconstruction

In this module we introduce the MRI Reconstruction case study.

Lecture Slides

15.1 - Advanced MRI Reconstruction
15.2 - Kernel Optimizations

Lecture Videos

3rd Ed. Book Chapters

Chapter 14 - Application Case Study - Non-Cartesian Magnetic Resonance Imaging

Module 16: Application Case Study: Electrostatic Potential Calculation

In this module we introduce the Electrostatic Potential Calculation case study.

Lecture Slides

16.1 - Electrostatic Potential Calculation - Part 1
16.2 - Electrostatic Potential Calculation - Part 2

Lecture Videos

Module 17: Computational Thinking for Parallel Programming

In this module we provide a framework for thinking about the problems of parallel programming.

Lecture Slides

17.1 - Introduction to Computational Thinking

3rd Ed. Book Chapters

Chapter 17 - Parallel Programming and Computational Thinking

Module 18: Related Programming Models: MPI

In this module we introduce the MPI programming model.

Lecture Slides

18.1 - Introduction to Heterogeneous Supercomputing and MPI

3rd Ed. Book Chapters

Chapter 18 - Programming a Heterogeneous Computing Cluster

Module 19: CUDA Python using Numba

In this module we introduce CUDA Python using Numba.

DLI Online Courses

Fundamentals of Accelerated Computing with CUDA Python

Module 20: Related Programming Models: OpenCL

In this module we introduce the OpenCL programming model.

Lecture Slides

20.1 - OpenCL Data Parallelism Model
20.2 - OpenCL Device Architecture
20.3 - OpenCL Host Code

Labs

OpenCL Vector Addition

Quiz

Module 20 Quiz

3rd Ed. Book Chapters

Appendix - An Introduction to OpenCL

Module 21: Related Programming Models: OpenACC

In this module we introduce the OpenACC programming model.

Lecture Slides

21.1 - Introduction to OpenACC
21.2 - OpenACC Subtleties

Lecture Videos

DLI Online Courses

Fundamentals of Accelerated Computing with OpenACC

Labs

OpenACC CUDA Vector Add

Quiz

Module 21 Quiz

3rd Ed. Book Chapters

Chapter 19 - Parallel Programming with OpenACC

Module 22: Related Programming Models: OpenGL

In this module we introduce the OpenGL programming model.

This module is scheduled for a future release of the Teaching Kit.

Module 23: Dynamic Parallelism

In this module we introduce dynamic parallelism.

Lecture Slides

23.1 - Dynamic Parallelism

Lecture Videos

Labs

Dynamic Parallelism

3rd Ed. Book Chapters

Chapter 13 - CUDA Dynamic Parallelism

Module 24: Multi-GPU

In this module we discuss programming with multiple GPUs.

Lecture Slides

24.1 - OpenMP
24.2 - Multi-GPU Introduction I
24.3 - Multi-GPU Introduction II
24.4 - OpenMP and Cooperative Groups
24.5 - Multi-GPU Heat Equation

Lecture Videos

DLI Online Courses

Scaling Workloads Across Multiple GPUs with CUDA C++

Labs

Multi-GPU Heat Equation

Quiz

Module 24 Quiz

Module 25: Using CUDA Libraries

In this module we introduce the effective use of CUDA libraries.

Lecture Slides

25.1 - cuBLAS
25.2 - cuSOLVER
25.3 - cuFFT
25.4 - Thrust

Lecture Videos

DLI Online Courses

Labs

Equation with NVIDIA Libraries

Quiz

Module 25 Quiz

3rd Ed. Book Chapters

Appendix - THRUST: a Productivity-oriented Library for CUDA

Module 26: Advanced Thrust

In this module we discuss advanced Thrust topics.

This module is scheduled for a future release of the Teaching Kit.