DLI GPU Teaching Kit - Accelerated Computing Syllabus

This page is the syllabus for the NVIDIA Deep Learning Institute (DLI) Accelerated Computing Teaching Kit. It outlines how each module is organized in the downloaded Teaching Kit .zip file, lists the content of every module, and, where applicable, links to the suggested online DLI course for that module.
Here you will also find links to all of the lecture videos.

Module 1: Course Introduction

In this module we review the course goals and syllabus and introduce the concepts of heterogeneous and parallel programming.

Lecture Slides

  • 1.1 - Course Introduction and Overview

  • 1.2 - Introduction to Heterogeneous Parallel Computing

  • 1.3 - Portability and Scalability in Heterogeneous Parallel Computing

Lecture Videos

3rd Ed. Book Chapters

  • Chapter 1 - Introduction

Module 2: Introduction to CUDA C

In this module we cover the basic API functions in CUDA host code and introduce CUDA threads, the main mechanism for exploiting data parallelism.

Lecture Slides

  • 2.1 - CUDA C vs. Thrust vs. CUDA Libraries

  • 2.2 - Memory Allocation and Data Movement API Functions

  • 2.3 - Threads and Kernel Functions

  • 2.4 - Introduction to the CUDA Toolkit

  • 2.5 - Nsight Compute and Nsight Systems

  • 2.6 - Unified Memory

Lecture Videos

DLI Online Courses

Labs

  • Device Query
  • CUDA Toolkit

Quiz

  • Module 2 Quiz

3rd Ed. Book Chapters

  • Chapter 2 - Data Parallel Computing

Module 3: CUDA Parallelism Model

In this module we introduce the CUDA kernel, efficient memory access patterns, and thread scheduling.

Lecture Slides

  • 3.1 - Kernel-Based SPMD Parallel Programming

  • 3.2 - Multidimensional Kernel Configuration

  • 3.3 - Color-to-Grayscale Image Processing Example

  • 3.4 - Image Blur Example

  • 3.5 - Thread Scheduling

Lecture Videos

DLI Online Courses

Labs

  • CUDA Image Blur
  • CUDA Image Color to Grayscale
  • CUDA Thrust Vector Add
  • CUDA Vector Add

Quiz

  • Module 3 Quiz

3rd Ed. Book Chapters

  • Chapter 3 - Scalable Parallel Execution

Module 4: Memory and Data Locality

In this module we introduce the CUDA memory types and explore their effective use in tiled parallel algorithms.
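
As a rough illustration of the tiling idea, the following Python sketch multiplies two matrices one tile at a time on the CPU. It is not taken from the Teaching Kit: the function name `tiled_matmul` and the `tile` parameter are hypothetical, and the inner loops only stand in for the data a CUDA thread block would stage in shared memory. The `min` bounds are one possible way to handle matrix sizes that are not multiples of the tile width.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), visiting the data one tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):            # tile row of the output
        for j0 in range(0, m, tile):        # tile column of the output
            for p0 in range(0, k, tile):    # one phase per tile along k
                # In a CUDA kernel, the a and b elements touched below
                # would first be loaded into shared memory by the block.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

Calling `tiled_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])` returns the same product for any positive tile width. Only the order in which elements are visited changes.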

Lecture Slides

  • 4.1 - CUDA Memories

  • 4.2 - Tiled Parallel Algorithms

  • 4.3 - Tiled Matrix Multiplication

  • 4.4 - Tiled Matrix Multiplication Kernel

  • 4.5 - Handling Arbitrary Matrix Sizes in Tiled Algorithms

Lecture Videos

Labs

  • Basic Matrix Multiplication
  • CUDA Tiled Matrix Multiplication

Quiz

  • Module 4 Quiz

3rd Ed. Book Chapters

  • Chapter 4 - Memory and Data Locality

Module 5: Thread Execution Efficiency

In this module we explore how CUDA threads execute on SIMD hardware and how to analyze the performance impact of control divergence.
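
A small worked example can make the cost of control divergence concrete. The Python sketch below assumes a one-dimensional kernel guarded by a boundary check `i < n`, a block size that is a multiple of the warp size, and a warp size of 32. The function name `divergent_warps` is hypothetical. Under those assumptions, only the warp that straddles `n` has threads taking both branches.

```python
WARP_SIZE = 32  # assumed warp size


def divergent_warps(n, block_size):
    """Return (divergent warps, total warps) for a 1D kernel guarded by `i < n`."""
    if block_size % WARP_SIZE != 0:
        raise ValueError("this sketch assumes block_size is a multiple of 32")
    blocks = (n + block_size - 1) // block_size   # ceiling division
    total_warps = blocks * block_size // WARP_SIZE
    # A warp diverges only when n falls strictly inside it, so that some
    # of its threads pass the check and the rest fail it.
    diverging = 1 if n % WARP_SIZE != 0 else 0
    return diverging, total_warps
```

For `n = 1000` and 256-thread blocks, the function reports 1 divergent warp out of 32 launched, so the boundary check affects only a small fraction of the warps.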

Lecture Slides

  • 5.1 - Warps and SIMD Hardware

  • 5.2 - Performance Impact of Control Divergence

Lecture Videos

Quiz

  • Module 5 Quiz

3rd Ed. Book Chapters

  • Chapter 5 - Performance Considerations

Module 6: Memory Access Performance

In this module we explore the significance of memory coalescing to effectively utilize memory bandwidth in CUDA.

Lecture Slides

  • 6.1 - DRAM Bandwidth

  • 6.2 - Memory Coalescing in CUDA

Lecture Videos

Quiz

  • Module 6 Quiz

3rd Ed. Book Chapters

  • Chapter 5 - Performance Considerations

Module 7: Parallel Computation Patterns (Histogram)

In this module we introduce the parallel histogram computation pattern and learn to write a high-performance kernel by privatizing outputs.
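
The privatization idea can be sketched sequentially in Python. This is only an illustration and not the Teaching Kit's kernel: `privatized_histogram` and `block_size` are hypothetical names, each chunk of the input plays the role of one thread block, and the input values are assumed to be valid bin indices.

```python
def privatized_histogram(data, num_bins, block_size=4):
    """Build a histogram from per-block private copies merged at the end."""
    result = [0] * num_bins
    for start in range(0, len(data), block_size):
        # Private copy for this "block"; on a GPU it would typically
        # live in shared memory.
        private = [0] * num_bins
        for value in data[start:start + block_size]:
            private[value] += 1        # would be an atomic add on the private copy
        for b in range(num_bins):
            result[b] += private[b]    # would be an atomic add on the global copy
    return result
```

The expected benefit is that most updates hit the private copy, where contention is lower, and only `num_bins` updates per block reach the shared result.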

Lecture Slides

  • 7.1 - Histogramming

  • 7.2 - Introduction to Data Races

  • 7.3 - Atomic Operations in CUDA

  • 7.4 - Atomic Operation Performance

  • 7.5 - Privatization Technique for Improved Throughput

Lecture Videos

Labs

  • Histogram
  • Text Histogram
  • Thrust Histogram Sort

Quiz

  • Module 7 Quiz

3rd Ed. Book Chapters

  • Chapter 9 - Parallel Patterns: Parallel Histogram Computation

Module 8: Parallel Computation Patterns (Stencil)

In this module we introduce the tiled convolution pattern. We will learn to analyze the cost and benefit of tiled parallel convolution algorithms.

Lecture Slides

  • 8.1 - Convolution

  • 8.2 - Tiled Convolution

  • 8.3 - Tile Boundary Conditions

  • 8.4 - Analyzing Data Reuse in Tiled Convolution

Lecture Videos

Labs

  • Convolution
  • Stencil

Quiz

  • Module 8 Quiz

3rd Ed. Book Chapters

  • Chapter 7 - Parallel Patterns: Convolution

Module 9: Parallel Computation Patterns (Reduction)

In this module we introduce the parallel reduction pattern.
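
As a minimal sketch of the pattern, the Python function below sums a list with a tree of pairwise additions whose stride halves at every step. It runs sequentially, and the name `tree_reduce_sum` is hypothetical. On a GPU, each iteration of the inner loop would be handled by a separate thread, with a barrier between steps. The kernels presented in the lectures may order the additions differently.

```python
def tree_reduce_sum(values):
    """Sum values with a halving-stride reduction tree (0 for empty input)."""
    data = list(values)
    n = len(data)
    if n == 0:
        return 0
    # Start from the largest power of two that is smaller than n
    # (or 1 when n is 1 or 2).
    stride = 1
    while stride * 2 < n:
        stride *= 2
    while stride >= 1:
        # The iterations of this loop are independent of each other.
        for i in range(stride):
            if i + stride < n:
                data[i] += data[i + stride]
        stride //= 2
    return data[0]
```

For an input of n elements this takes about log2(n) steps, compared with n - 1 dependent additions in a simple sequential loop.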

Lecture Slides

  • 9.1 - Parallel Reduction

  • 9.2 - A Basic Reduction Kernel

  • 9.3 - A Better Reduction Kernel

Lecture Videos

Labs

  • Reduction
  • Thrust Reduction

Quiz

  • Module 9 Quiz

3rd Ed. Book Chapters

  • Chapter 5 - Performance Considerations

Module 10: Parallel Computation Patterns (Scan)

In this module we introduce the parallel scan (prefix sum) pattern.
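
For readers new to the pattern, here is a minimal Python sketch of an inclusive scan that uses a doubling stride. It is a sequential stand-in, and the name `inclusive_scan` is hypothetical. The scheme performs more additions than a sequential scan, which is why approaches like it are usually described as work-inefficient. It is not necessarily the exact kernel presented in the lectures.

```python
def inclusive_scan(values):
    """Inclusive prefix sum using a doubling stride."""
    data = list(values)
    stride = 1
    while stride < len(data):
        prev = data[:]   # read from a snapshot so every update sees old values
        for i in range(stride, len(data)):
            data[i] = prev[i] + prev[i - stride]
        stride *= 2
    return data
```

For example, `inclusive_scan([3, 1, 7, 0, 4, 1, 6, 3])` returns `[3, 4, 11, 11, 15, 16, 22, 25]`.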

Lecture Slides

  • 10.1 - Prefix Sum

  • 10.2 - A Work-Inefficient Scan Kernel

  • 10.3 - A Work-Efficient Parallel Scan Kernel

  • 10.4 - More on Parallel Scan

Lecture Videos

Labs

  • List Scan
  • Thrust List Reduction

Quiz

  • Module 10 Quiz

3rd Ed. Book Chapters

  • Chapter 8 - Parallel Patterns: Prefix Sum

Module 11: Breadth-First Search (BFS) Queue

In this module we cover the breadth-first search (BFS) queue.

Labs

  • Breadth-First Search Queue

Module 12: Floating-Point Considerations

In this module we introduce the fundamentals of floating-point representation.
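
One consequence of finite precision is that floating-point addition is not associative, which matters when a parallel algorithm reorders a sum. The Python sketch below uses double precision, and the helper name `naive_sum` is hypothetical. It adds the same three numbers in two different orders and compares the results with `math.fsum` from the standard library, which tracks the rounding error.

```python
import math


def naive_sum(values):
    """Add values left to right, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


# At magnitude 1e17, neighbouring doubles are 16 apart, so adding 1.0
# to 1e17 is rounded away.
absorbed = naive_sum([1e17, 1.0, -1e17])    # 1.0 is lost: result is 0.0
preserved = naive_sum([1e17, -1e17, 1.0])   # large terms cancel first: 1.0
accurate = math.fsum([1e17, 1.0, -1e17])    # error-tracking sum: 1.0
```

The two orderings give 0.0 and 1.0 respectively, although the mathematical sum is 1 in both cases.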

Lecture Slides

  • 12.1 - Floating-Point Precision and Accuracy

  • 12.2 - Numerical Stability

Lecture Videos

3rd Ed. Book Chapters

  • Chapter 6 - Numerical Considerations

Module 13: GPU as Part of the PC Architecture

In this module we introduce how GPUs fit in the PC architecture.

Lecture Slides

  • 13.1 - GPU as Part of the PC Architecture

Lecture Videos

3rd Ed. Book Chapters

  • Chapter 18 - Programming a Heterogeneous Computing Cluster

Module 14: Efficient Host-Device Data Transfer

In this module we discuss important concepts involved in copying (transferring) data between host and device.

Lecture Slides

  • 14.1 - Pinned Host Memory

  • 14.2 - Task Parallelism in CUDA

  • 14.3 - Overlapping Data Transfer with Computation

  • 14.4 - CUDA Unified Memory

Lecture Videos

DLI Online Courses

Labs

  • Vector Addition Using CUDA Streams
  • Vector Addition Using Pinned Memory
  • CUDA Unified Memory Matrix Multiplication

Quiz

  • Module 14 Quiz

3rd Ed. Book Chapters

  • Chapter 18 - Programming a Heterogeneous Computing Cluster
  • Chapter 20 - More on CUDA and Graphics Processing Unit Computing

Module 15: Application Case Study: Advanced MRI Reconstruction

In this module we introduce the MRI Reconstruction case study.

Lecture Slides

  • 15.1 - Advanced MRI Reconstruction

  • 15.2 - Kernel Optimizations

Lecture Videos

3rd Ed. Book Chapters

  • Chapter 14 - Application Case Study - Non-Cartesian Magnetic Resonance Imaging

Module 16: Application Case Study: Electrostatic Potential Calculation

In this module we introduce the Electrostatic Potential Calculation case study.

Lecture Slides

  • 16.1 - Electrostatic Potential Calculation - Part 1
  • 16.2 - Electrostatic Potential Calculation - Part 2

Lecture Videos

Module 17: Computational Thinking for Parallel Programming

In this module we provide a framework for thinking about the problems of parallel programming.

Lecture Slides

  • 17.1 - Introduction to Computational Thinking

3rd Ed. Book Chapters

  • Chapter 17 - Parallel Programming and Computational Thinking

Module 18: Related Programming Models: MPI

In this module we introduce the MPI programming model.

Lecture Slides

  • 18.1 - Introduction to Heterogeneous Supercomputing and MPI

3rd Ed. Book Chapters

  • Chapter 18 - Programming a Heterogeneous Computing Cluster

Module 19: CUDA Python using Numba

In this module we introduce CUDA Python using Numba.

DLI Online Courses

Module 20: Related Programming Models: OpenCL

In this module we introduce the OpenCL programming model.

Lecture Slides

  • 20.1 - OpenCL Data Parallelism Model

  • 20.2 - OpenCL Device Architecture

  • 20.3 - OpenCL Host Code

Labs

  • OpenCL Vector Addition

Quiz

  • Module 20 Quiz

3rd Ed. Book Chapters

  • Appendix - An Introduction to OpenCL

Module 21: Related Programming Models: OpenACC

In this module we introduce the OpenACC programming model.

Lecture Slides

  • 21.1 - Introduction to OpenACC

  • 21.2 - OpenACC Subtleties

Lecture Videos

DLI Online Courses

Labs

  • OpenACC CUDA Vector Add

Quiz

  • Module 21 Quiz

3rd Ed. Book Chapters

  • Chapter 19 - Parallel Programming with OpenACC

Module 22: Related Programming Models: OpenGL

In this module we introduce the OpenGL programming model.

This module is scheduled for a future release of the Teaching Kit.

Module 23: Dynamic Parallelism

In this module we introduce dynamic parallelism.

Lecture Slides

  • 23.1 - Dynamic Parallelism

Lecture Videos

Labs

  • Dynamic Parallelism

3rd Ed. Book Chapters

  • Chapter 13 - CUDA Dynamic Parallelism

Module 24: Multi-GPU

In this module we discuss programming with multiple GPUs.

Lecture Slides

  • 24.1 - OpenMP

  • 24.2 - Multi-GPU Introduction I

  • 24.3 - Multi-GPU Introduction II

  • 24.4 - OpenMP and Cooperative Groups

  • 24.5 - Multi-GPU Heat Equation

Lecture Videos

DLI Online Courses

Labs

  • Multi-GPU Heat Equation

Quiz

  • Module 24 Quiz

Module 25: Using CUDA Libraries

In this module we introduce the effective use of CUDA libraries.

Lecture Slides

  • 25.1 - cuBLAS

  • 25.2 - cuSOLVER

  • 25.3 - cuFFT

  • 25.4 - Thrust

Lecture Videos

DLI Online Courses

Labs

  • Equation with NVIDIA Libraries

Quiz

  • Module 25 Quiz

3rd Ed. Book Chapters

  • Appendix - THRUST: a Productivity-oriented Library for CUDA

Module 26: Advanced Thrust

In this module we discuss advanced Thrust topics.

This module is scheduled for a future release of the Teaching Kit.