Learning CUDA 10 Programming

Harness the power of GPUs to speed up your applications
Bestseller
4.6 (23 ratings)
190 students enrolled
Created by Packt Publishing
Last updated 12/2019
English
This course includes
  • 2.5 hours on-demand video
  • 1 downloadable resource
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What you'll learn
  • Use CUDA to speed up applications in machine learning, image processing, linear algebra, and more
  • Learn to debug CUDA programs and handle errors
  • Use optimization techniques to get the maximum performance from your CUDA programs
  • Master the fundamentals of concurrency and parallel algorithms on GPUs
  • Learn about the wide range of GPU-accelerated libraries included with CUDA
  • Learn the next steps you can take to continue building your CUDA skills
Course content
33 lectures 02:27:25
+ Introduction to CUDA
5 lectures 14:02

This video will give you an overview of the course.

Preview 02:21

Many programmers are interested in CUDA, but do not know much about it. This video will explain what CUDA is, and the kinds of applications it might be useful for.

   •  Explain GPU architectures, compare them to CPUs, and show the potential performance benefits

   •  Discuss how CUDA makes it easy to program GPUs

   •  See the tools included with the CUDA Toolkit

Overview of CUDA
03:12

The CUDA Toolkit must be installed to build and run CUDA programs. This video will walk through the steps for Windows users.

   •  Check if the prerequisites are already installed

   •  Run the installer

   •  Check the installation, by building and running a sample program

Installing the CUDA Toolkit on Windows
01:25

The CUDA Toolkit must be installed to build and run CUDA programs. This video will walk through the steps for Linux users.

   •  See if the prerequisites are already installed

   •  Run the installer

   •  Check the installation by building and running a sample program

Installing the CUDA Toolkit on Linux
02:15

Students need to learn the basic structure of CUDA programs and how to compile them. We will look at a simple example program, walking through the code and explaining what each piece does.

   •  Walk through the "main" function, explaining each piece

   •  Illustrate the typical pattern used by CUDA programs (load data on host, process on device, save or display on host)

   •  Explain the kernel function in detail

Your First CUDA Program
04:49
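
For readers who want a concrete picture of this pattern before enrolling, here is a minimal, self-contained vector-addition sketch. It is illustrative only, not the course's own example; the kernel name and sizes are made up.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Kernel: each thread adds one pair of elements.
    __global__ void addVectors(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        // Load data on the host.
        float* ha = (float*)malloc(bytes);
        float* hb = (float*)malloc(bytes);
        float* hc = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

        // Process on the device.
        float *da, *db, *dc;
        cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);
        addVectors<<<(n + 255) / 256, 256>>>(da, db, dc, n);
        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

        // Save or display on the host.
        printf("c[0] = %f\n", hc[0]);
        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }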
Test Your Knowledge
5 questions
+ Programming with CUDA
5 lectures 22:09

To use CUDA effectively, we must understand the fundamental hardware and software architecture. This video will explain these concepts.

   •  Explain the driver and runtime APIs, and how they interact

   •  Look at threads, blocks, and grids

   •  Explore the CUDA hardware architecture, SIMT programming, and hardware multithreading

Preview 06:30

When running CUDA programs, we need to set thread and block layouts appropriate to our problem. This video will illustrate 1D and 2D configurations, along with examples.

   •  Demonstrate kernels with 1D and 2D grids, in several configurations

   •  Discuss heuristics for determining block size

   •  Learn the CUDA occupancy API for easily achieving maximum occupancy

Kernel Execution Configurations
06:36
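
As a hedged illustration of these ideas (not the course's code), the sketch below launches a 1D kernel with a hand-picked configuration, then with a block size suggested by cudaOccupancyMaxPotentialBlockSize, and shows the dim3 form used for 2D grids.

    #include <cuda_runtime.h>

    __global__ void kernel1D(float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = (float)i;
    }

    int main() {
        const int n = 1 << 20;
        float* d; cudaMalloc(&d, n * sizeof(float));

        // Fixed 1D configuration: round the grid up to cover all n elements.
        kernel1D<<<(n + 255) / 256, 256>>>(d, n);

        // Let the occupancy API suggest a block size for this kernel.
        int minGridSize = 0, blockSize = 0;
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, kernel1D, 0, 0);
        kernel1D<<<(n + blockSize - 1) / blockSize, blockSize>>>(d, n);

        // A 2D configuration uses dim3 for both grid and block
        // (someKernel2D is hypothetical here).
        dim3 block(16, 16);
        dim3 grid((1024 + block.x - 1) / block.x, (1024 + block.y - 1) / block.y);
        // someKernel2D<<<grid, block>>>(...);

        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }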

Kernels will sometimes crash, and normal debuggers cannot locate errors in device code. By using the Nsight debugger on Windows, we can quickly track down these errors.

   •  Look at single-stepping and examining variables in device code

   •  Show how to move to different threads and blocks

   •  Show how to use the CUDA memory checker to quickly find errors

Debugging with NVIDIA Nsight on Windows
02:29

Kernels will sometimes crash, and normal debuggers cannot locate errors in the device code. By using the cuda-gdb debugger on Linux, we can quickly track down errors.

   •  Learn how to run the debugger and step through device code

   •  Show how to move to different threads and blocks

   •  See how to use the CUDA memory checker to quickly find errors

Debugging with cuda-gdb on Linux
02:50

Errors can occur in kernels or CUDA API calls. A robust program needs to check for errors and handle them appropriately. We will learn how this is accomplished in CUDA.

   •  Show how to get error messages and continue after errors

   •  Demonstrate error checking for asynchronous kernel launches

   •  Understand how to simplify error handling with a macro

Handling Errors
03:44
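
A common shape for such a macro, sketched from the standard pattern rather than taken from the course:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Check an API call's return code; report file and line on failure.
    #define CUDA_CHECK(call)                                            \
        do {                                                            \
            cudaError_t err = (call);                                   \
            if (err != cudaSuccess) {                                   \
                fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                        cudaGetErrorString(err), __FILE__, __LINE__);   \
                exit(EXIT_FAILURE);                                     \
            }                                                           \
        } while (0)

    __global__ void kernel() {}

    int main() {
        CUDA_CHECK(cudaSetDevice(0));
        kernel<<<1, 1>>>();
        // Kernel launches are asynchronous: check the launch itself,
        // then synchronize to surface errors raised during execution.
        CUDA_CHECK(cudaGetLastError());
        CUDA_CHECK(cudaDeviceSynchronize());
        return 0;
    }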
Test Your Knowledge
6 questions
+ Performance Optimizations
5 lectures 28:05

We need to identify performance bottlenecks and possible fixes in order to optimize kernel performance. With the Visual Profiler, we can find opportunities for speeding up kernels.

   •  Learn how to set up the profiler

   •  Show an example program, which we will profile

   •  Profile the example program and examine the results

Preview 04:49

For maximum performance, we need to access memory in specific patterns. This video will explain the CUDA memory hierarchy, and show how to access memory efficiently.

   •  Explain the CUDA memory hierarchy and coalescing of memory access

   •  Learn about strided access and how it degrades performance

   •  Demonstrate a fix for the example program

Using Memory Efficiently
06:09
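
To make the difference concrete, compare these two kernel-only sketches (illustrative, not the course's example). On most GPUs the strided version is markedly slower, because each warp's accesses span many memory transactions instead of one.

    // Coalesced: consecutive threads in a warp touch consecutive addresses.
    __global__ void copyCoalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided: consecutive threads touch addresses `stride` elements apart,
    // so a single warp's loads scatter across many transactions.
    __global__ void copyStrided(const float* in, float* out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }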

2D and 3D grids require specific memory layouts to achieve optimal performance. In this video, we will demonstrate this, and show how to use 2D and 3D memory.

   •  Show an example with a 2D grid, and profile it to identify performance problems

   •  Explain alignment requirements, and how they impact performance, with the help of an example

   •  Demonstrate the use of 2D memory, to improve performance

Working with 2D and 3D Memory Layouts
06:18
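
A minimal sketch of pitched 2D memory (the scale2D kernel name is made up): cudaMallocPitch pads each row so that rows start on aligned boundaries, keeping row accesses coalesced.

    #include <cuda_runtime.h>

    __global__ void scale2D(float* data, size_t pitch, int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            // Each row starts on an aligned pitch boundary.
            float* row = (float*)((char*)data + y * pitch);
            row[x] *= 2.0f;
        }
    }

    int main() {
        int width = 1000, height = 1000;  // deliberately not a round multiple
        float* d; size_t pitch;
        // The runtime pads each row; pitch is the padded row size in bytes.
        cudaMallocPitch((void**)&d, &pitch, width * sizeof(float), height);

        dim3 block(16, 16);
        dim3 grid((width + 15) / 16, (height + 15) / 16);
        scale2D<<<grid, block>>>(d, pitch, width, height);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }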

Some workloads will not generate coalesced memory access. We will show how to use specialized caches to improve performance in some of these cases.

   •  Show an example using constant memory, and demonstrate performance benefits

   •  Explore an example using texture memory, and demonstrate performance benefits

   •  Review the benefits and uses of texture and constant memory

Texture and Constant Memory
06:37
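
A small sketch of the constant-memory side of this (illustrative names; texture objects are omitted for brevity). Constant memory is cached and broadcast efficiently when all threads in a warp read the same address.

    #include <cuda_runtime.h>

    // Constant memory lives in a dedicated cached bank on the device.
    __constant__ float coeffs[16];

    __global__ void applyFilter(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Every thread reads the same coeffs entries: a broadcast hit.
        if (i < n) out[i] = in[i] * coeffs[0] + coeffs[1];
    }

    int main() {
        float h[16] = {2.0f, 1.0f};
        // Copy from host into the constant bank before launching.
        cudaMemcpyToSymbol(coeffs, h, sizeof(h));
        // ... allocate buffers and launch applyFilter as usual ...
        return 0;
    }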

After memory access is optimized, some additional techniques can be used to increase performance further. We will learn some of these techniques, and demonstrate them on previous examples.

   •  Improve ILP by computing multiple outputs per thread

   •  Improve compute efficiency by using fast intrinsic functions

   •  Explore various techniques for instruction and control flow optimization

Instruction and Control Flow Optimizations
04:12
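
One possible shape for these techniques, sketched with made-up names: each thread computes four outputs for extra instruction-level parallelism, and fast intrinsics replace the precise math functions.

    // One thread computes several outputs (here 4), increasing the
    // independent instructions in flight per thread.
    __global__ void compute4(const float* in, float* out, int n) {
        int i = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
        #pragma unroll
        for (int k = 0; k < 4; ++k)
            if (i + k < n)
                // __sinf/__expf are fast intrinsics: less precise than
                // sinf/expf but much cheaper (nvcc's --use_fast_math
                // enables this substitution globally).
                out[i + k] = __expf(__sinf(in[i + k]));
    }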
+ Parallel Algorithms
4 lectures 26:22

Many algorithms require inter-thread communication. We will learn to use shared memory and synchronization to communicate within blocks.

   •  Show a naive matrix transpose with poor performance

   •  Review memory hierarchy and explain properties of shared memory

   •  Demonstrate the use of shared memory to improve transpose performance

Introduction to Shared Memory
07:39
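
A sketch of the shared-memory transpose idea (the classic pattern, not necessarily the course's exact code): stage a tile in shared memory so that both the global read and the global write are coalesced.

    #define TILE 32

    // Launch with block = dim3(TILE, TILE) and
    // grid = dim3((width + TILE - 1) / TILE, (height + TILE - 1) / TILE).
    __global__ void transpose(const float* in, float* out, int width, int height) {
        // The +1 padding avoids shared-memory bank conflicts.
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];

        __syncthreads();  // all writes to the tile must finish first

        x = blockIdx.y * TILE + threadIdx.x;  // transposed block origin
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }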

Reduction is a fundamental building block of parallel algorithms, but the theoretical algorithm does not map directly to CUDA hardware. We will learn how to implement reduction in CUDA by splitting the input across blocks and then combining the per-block results.

   •  Explain the basic reduction algorithm

   •  Show how to implement reduction for a single block

   •  Look at ways to combine reductions from multiple blocks

Reduction
07:50
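
A sketch of a per-block sum reduction, assuming a power-of-two block size; each block writes one partial sum, to be combined by a second launch or on the host.

    // Launch: reduceSum<<<blocks, threads, threads * sizeof(float)>>>(in, partial, n);
    __global__ void reduceSum(const float* in, float* partial, int n) {
        extern __shared__ float sdata[];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        sdata[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Tree reduction: halve the number of active threads each step.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0) partial[blockIdx.x] = sdata[0];
    }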

Prefix sum/scan presents additional challenges because it involves multiple steps, with full synchronization after each step. We will show how to perform this synchronization, by writing intermediate results to global memory, and launching separate kernels for each step.

   •  Explain the basic scan algorithm

   •  Give an overview of the CUDA scan implementation

   •  Look at the implementation in detail

Prefix Sum
07:13
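
For intuition, here is a single-block inclusive scan sketch in the Hillis-Steele style; the full multi-block implementation described above scans each block, scans the block sums, and then adds them back.

    // Inclusive scan for one block; n must be <= blockDim.x.
    // Launch: scanBlock<<<1, threads, threads * sizeof(float)>>>(in, out, n);
    __global__ void scanBlock(const float* in, float* out, int n) {
        extern __shared__ float temp[];
        int tid = threadIdx.x;
        temp[tid] = (tid < n) ? in[tid] : 0.0f;
        __syncthreads();

        for (int offset = 1; offset < blockDim.x; offset *= 2) {
            float v = (tid >= offset) ? temp[tid - offset] : 0.0f;
            __syncthreads();  // all reads must finish before any write
            temp[tid] += v;
            __syncthreads();
        }
        if (tid < n) out[tid] = temp[tid];
    }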

Filtering/stream compaction is a useful operation, but parallel implementation is not obvious, because the position of each output element depends on many inputs. We will learn how scan can be used as a building block to create an efficient implementation.

   •  Explain the basic approach to filtering in CUDA

   •  Demonstrate the CUDA implementation

   •  Discuss techniques for improving performance

Filtering
03:40
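
The scan-based method taught in this lecture preserves element order and scales well. For contrast, here is a much simpler (but unordered) sketch that claims output slots with an atomic counter.

    // count must be zero-initialized (e.g., with cudaMemset) before launch,
    // and out must be large enough to hold every possible match.
    __global__ void filterPositive(const float* in, float* out, int* count, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && in[i] > 0.0f) {
            int pos = atomicAdd(count, 1);  // claim a unique output slot
            out[pos] = in[i];
        }
    }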
Test Your Knowledge
5 questions
+ GPU Accelerated Libraries
4 lectures 13:52

CUDA includes several libraries for deep learning. We will explain each library briefly, so you will understand what they can do, and whether they can be useful to you.

   •  Give an overview of Tensor Cores

   •  Join the developer program and download deep learning libraries

   •  Briefly explain cuDNN, TensorRT, and DeepStream SDK

Deep Learning
04:25

CUDA includes several libraries for signal, image, and video processing. We will explain each library briefly, so you will understand what they can do, and whether they can be useful to you.

   •  Overview of cuFFT

   •  Learn about NPP

   •  Look at the Codec SDK

Signal, Image, and Video
02:39

CUDA includes several libraries for linear algebra and math. We will explain each library briefly, so you will understand what they can do, and whether they can be useful to you.

   •  Explore different linear algebra libraries

   •  Learn about cuRAND

   •  See the CUDA Math Library

Linear Algebra and Math
02:23

CUDA includes several libraries for parallel algorithms. We will explain each library briefly, so you will understand what they can do, and whether they can be useful to you.

   •  Learn about Thrust

   •  Overview of CUB

   •  Study nvGRAPH and NCCL

Parallel Algorithms
04:25
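
As a taste of Thrust, this sketch (a generic illustration, not the course's example) fills a device vector and runs a parallel reduce and sort with STL-like calls; it compiles with nvcc.

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/sort.h>
    #include <cstdio>

    int main() {
        // A device_vector lives in GPU memory; algorithms run on the device.
        thrust::device_vector<int> v(1 << 20, 1);
        int sum = thrust::reduce(v.begin(), v.end());  // parallel sum
        thrust::sort(v.begin(), v.end());              // parallel sort
        printf("sum = %d\n", sum);
        return 0;
    }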
Test Your Knowledge
6 questions
+ Advanced CUDA Topics
7 lectures 32:10

In order to fully utilize the GPU, it is sometimes necessary to run multiple kernels concurrently. We will learn how to use streams and synchronization functions to manage concurrency and improve utilization.

   •  Demonstrate a reduction kernel that does not fully utilize the GPU

   •  Run multiple kernels in separate streams to increase utilization

   •  Overview of synchronization and events

Concurrency and Streams
07:10
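
A sketch of the idea, with illustrative sizes: several small kernels are queued in separate streams so the hardware can run them concurrently when each one alone leaves the GPU underutilized.

    #include <cuda_runtime.h>

    __global__ void smallKernel(float* d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] += 1.0f;
    }

    int main() {
        const int n = 1 << 16, kStreams = 4;
        float* d[kStreams];
        cudaStream_t streams[kStreams];

        for (int s = 0; s < kStreams; ++s) {
            cudaMalloc(&d[s], n * sizeof(float));
            cudaStreamCreate(&streams[s]);
            // Kernels in different streams may execute concurrently.
            smallKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d[s], n);
        }
        cudaDeviceSynchronize();  // wait for all streams

        for (int s = 0; s < kStreams; ++s) {
            cudaStreamDestroy(streams[s]);
            cudaFree(d[s]);
        }
        return 0;
    }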

Memory transfers often add significant overhead to CUDA programs. By overlapping transfers with host or device execution, we can improve performance.

   •  Show a program with significant transfer overhead

   •  Use page-locked memory to enable asynchronous transfers

   •  Observe concurrency in the profiler

Overlapping Transfers and Computation
04:35
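
A minimal sketch of the pattern (illustrative names): page-locked host memory plus cudaMemcpyAsync lets the copies and the kernel queue on one stream while the host thread keeps working.

    #include <cuda_runtime.h>

    __global__ void process(float* d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] += 1.0f;
    }

    int main() {
        const int n = 1 << 22;
        float *h, *d;
        // Page-locked (pinned) host memory is required for the copy to
        // actually overlap with execution.
        cudaMallocHost((void**)&h, n * sizeof(float));
        cudaMalloc(&d, n * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // All three operations queue on the stream; the host thread
        // returns immediately and can do other work here.
        cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        process<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
        cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
        cudaFreeHost(h); cudaFree(d);
        return 0;
    }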

While CUDA abstracts away many hardware details, sometimes it is useful to have information about the available devices. In this video, we will learn about functions for querying devices, and setting the current device.

   •  Show functions for listing devices

   •  Understand how to get device properties

   •  Look at the functions for setting the current device

Device Management
02:28
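
The core calls look roughly like this (a sketch that prints a few of the many fields in cudaDeviceProp):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("Device %d: %s, compute capability %d.%d, %zu MB\n",
                   i, prop.name, prop.major, prop.minor,
                   prop.totalGlobalMem >> 20);
        }
        cudaSetDevice(0);  // make device 0 current for subsequent calls
        return 0;
    }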

When multiple CUDA devices are available, work can be spread between them for increased performance. However, this requires some changes to the programs. In this video, we will modify our concurrent kernel example to use multiple devices.

   •  Overview of multi-device programming

   •  Demonstrate kernel execution on multiple devices

   •  Observe performance improvement from using multiple devices

Programming with Multiple GPUs
03:39
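
A sketch of the multi-GPU pattern: because launches are asynchronous, one loop can queue work on every device before any of it finishes, so the GPUs run in parallel.

    #include <cuda_runtime.h>

    __global__ void work(float* d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        if (count > 8) count = 8;  // cap for this sketch's fixed array
        const int n = 1 << 20;
        float* d[8];

        for (int dev = 0; dev < count; ++dev) {
            cudaSetDevice(dev);  // all following calls target this device
            cudaMalloc(&d[dev], n * sizeof(float));
            work<<<(n + 255) / 256, 256>>>(d[dev], n);
        }
        for (int dev = 0; dev < count; ++dev) {
            cudaSetDevice(dev);
            cudaDeviceSynchronize();
            cudaFree(d[dev]);
        }
        return 0;
    }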

Keeping track of host and device memory requires some bookkeeping and can complicate CUDA programs. We will learn how the unified address space allows CUDA to automatically track memory type and location, simplifying our programs.

   •  Overview of the unified address space

   •  Modify the multi-device example to use the unified address space

   •  Show how to get and print pointer information

The Unified Address Space
02:10
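
A small sketch using cudaPointerGetAttributes (the `type` field shown here is the CUDA 10 form of this struct): in the unified address space, the runtime can tell from the pointer value alone where the memory lives.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        float* d;
        cudaMalloc(&d, 1024 * sizeof(float));

        // Query where a pointer's memory resides and on which device.
        cudaPointerAttributes attr;
        cudaPointerGetAttributes(&attr, d);
        printf("device pointer? %s (device %d)\n",
               attr.type == cudaMemoryTypeDevice ? "yes" : "no", attr.device);

        cudaFree(d);
        return 0;
    }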

Allocating device memory with host APIs limits the flexibility of kernels. In this video, we will show how to allocate global memory from within kernels.

   •  Overview of allocation from device code

   •  Demonstrate use of dynamic allocation in kernels

   •  Explain technical details and limitations

Dynamic Global Memory Allocation
04:43
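
A sketch of in-kernel allocation, with illustrative sizes. Two details worth noting: the device heap is small by default, and memory allocated in a kernel must be freed by device code, not by cudaFree.

    #include <cuda_runtime.h>

    // Each thread allocates its own scratch buffer from the device heap.
    __global__ void scratchKernel(int m) {
        float* buf = (float*)malloc(m * sizeof(float));
        if (buf == NULL) return;  // the device heap can run out
        for (int k = 0; k < m; ++k) buf[k] = (float)k;
        free(buf);  // device allocations are freed in device code
    }

    int main() {
        // Enlarge the device heap before launching (default is small).
        cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);
        scratchKernel<<<4, 64>>>(256);
        cudaDeviceSynchronize();
        return 0;
    }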

For some problems, a static grid layout is inefficient or inconvenient. By using dynamic parallelism, we can launch nested kernels, to adjust the grid structure on the fly.

   •  Demonstrate the use of nested kernels with dynamic parallelism

   •  Discuss hardware and compilation requirements

   •  Explain technical details of dynamic parallelism

Dynamic Parallelism
07:25
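
A minimal dynamic-parallelism sketch (illustrative names): a parent kernel launches a child kernel from device code, deciding the nested grid size at runtime. This requires compute capability 3.5+ and relocatable device code.

    #include <cstdio>

    __global__ void child(int parent) {
        printf("child of block %d, thread %d\n", parent, threadIdx.x);
    }

    __global__ void parent() {
        // One thread per block launches nested work at runtime.
        if (threadIdx.x == 0)
            child<<<1, 4>>>(blockIdx.x);
    }

    int main() {
        parent<<<2, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }
    // Build with relocatable device code and the device runtime:
    //   nvcc -arch=sm_35 -rdc=true nested.cu -lcudadevrt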
Test Your Knowledge
6 questions
+ Summary and Next Steps
3 lectures 10:45

Review the course material to aid retention.

   •  Review the programming basics and optimization

   •  Look at the parallel algorithms

   •  Skim through the libraries and advanced topics

What We Have Learned
06:35

Hands-on experience is needed to master the concepts presented in this course. This video will suggest some next steps to be taken.

   •  Use CUDA in real-world projects

   •  Experiment with example programs

   •  Review the course material as needed

Next Steps
01:59

There is still more to learn about CUDA. This video will suggest some resources for further study.

   •  Look at the CUDA Toolkit documentation and examples

   •  Explore the NVIDIA developer website

   •  Learn about the GPU Gems book series

Resources to Explore
02:11
Requirements
  • A good understanding of programming in modern C++ (C++17) is required in order to implement the concepts in this course.
Description

Do you want to write GPU-accelerated applications, but don't know how to get started? With CUDA 10, you can easily add GPU processing to your C and C++ projects. CUDA 10 is the de facto framework used to develop high-performance, GPU-accelerated applications.

In this course, you will be introduced to CUDA programming through hands-on examples. CUDA provides a general-purpose programming model which gives you access to the tremendous computational power of modern GPUs, as well as powerful libraries for machine learning, image processing, linear algebra, and parallel algorithms.

After working through this course, you will understand the fundamentals of CUDA programming and be able to start using it in your applications right away.

About the Author

Nathan Weston is a software developer based in the Boston area of the USA. He has worked in the visual effects industry, where he made extensive use of CUDA, and he also has experience with software engineering research and scientific applications. He now works as a consultant with local and remote clients.

Who this course is for:
  • If you want to learn how to use parallel and high-performance computing techniques to develop modern applications using GPUs and CUDA, then this course is for you.