CUDA programming Masterclass

Learn parallel programming on GPUs with CUDA, from basic concepts to advanced algorithm implementations.
4.2 (437 ratings)
2,854 students enrolled
Created by Kasun Liyanage
Last updated 12/2018
English
Current price: $83.99 Original price: $119.99 Discount: 30% off
30-Day Money-Back Guarantee
This course includes
  • 11 hours on-demand video
  • 62 downloadable resources
  • Full lifetime access
  • Access on mobile and TV
  • Assignments
  • Certificate of Completion
What you'll learn
  • All the basic knowledge about CUDA programming
  • Ability to design and implement optimized parallel algorithms
  • Basic workflow of parallel algorithm design
Course content
83 lectures 10:47:51
+ Introduction to CUDA programming and CUDA programming model
18 lectures 02:06:49
The purpose of this quick assignment is to give you some general background on parallel programming.
Let's investigate some background.
3 questions
How to install CUDA toolkit and first look at CUDA program
06:12
Basic elements of CUDA program
16:50
Organization of threads in a CUDA program - threadIdx
08:38
Organization of threads in a CUDA program - blockIdx, blockDim, gridDim
06:14
Programming exercise 1
00:29
Unique index calculation using threadIdx, blockIdx and blockDim
09:20
Unique index calculation for 2D grid 1
05:53
Unique index calculation for 2D grid 2
05:09
Memory transfer between host and device
11:13
Programming exercise 2
01:04
Sum array example with validity check
09:13
Sum array example with error handling
04:32
Sum array example with timing
08:18
In this assignment you have to implement array summation on the GPU that can sum 3 arrays. You have to use error handling and timing mechanisms as well, and then measure the execution time of your GPU implementation. (A minimal illustrative sketch follows this assignment entry.)
Extend sum array implementation to sum up 3 arrays
1 question
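As a rough guide to what such a program involves (this is a minimal illustrative sketch, not the course's reference solution), the example below sums three arrays on the GPU, wraps every CUDA runtime call in a simple error-checking macro, and times the kernel with CUDA events. The names CHECK and sum_three_arrays, the array size, and the block size are assumptions made for illustration only.

// Minimal sketch for the assignment idea above: sum three arrays on the GPU
// with error handling and event-based timing. All names here are illustrative.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Simple error-handling macro: abort on any failed CUDA runtime call.
#define CHECK(call)                                                         \
    do {                                                                    \
        cudaError_t err__ = (call);                                         \
        if (err__ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                     \
                    cudaGetErrorString(err__), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Each thread adds one element from each of the three input arrays.
__global__ void sum_three_arrays(const int *a, const int *b, const int *c,
                                 int *out, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        out[gid] = a[gid] + b[gid] + c[gid];
}

int main(void)
{
    const int n = 1 << 20;                 // 1M elements (arbitrary choice)
    const size_t bytes = n * sizeof(int);

    // Host allocations and initialization.
    int *h_a = (int *)malloc(bytes), *h_b = (int *)malloc(bytes);
    int *h_c = (int *)malloc(bytes), *h_out = (int *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2 * i; h_c[i] = 3 * i; }

    // Device allocations and host-to-device transfers.
    int *d_a, *d_b, *d_c, *d_out;
    CHECK(cudaMalloc((void **)&d_a, bytes));
    CHECK(cudaMalloc((void **)&d_b, bytes));
    CHECK(cudaMalloc((void **)&d_c, bytes));
    CHECK(cudaMalloc((void **)&d_out, bytes));
    CHECK(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_c, h_c, bytes, cudaMemcpyHostToDevice));

    // Time the kernel with CUDA events.
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    CHECK(cudaEventRecord(start));
    sum_three_arrays<<<grid, block>>>(d_a, d_b, d_c, d_out, n);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("Kernel time: %.3f ms\n", ms);

    // Copy the result back and validate it against the CPU result.
    CHECK(cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost));
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != h_a[i] + h_b[i] + h_c[i]) {
            fprintf(stderr, "Mismatch at index %d\n", i);
            return EXIT_FAILURE;
        }
    }
    printf("Result verified\n");

    CHECK(cudaFree(d_a)); CHECK(cudaFree(d_b));
    CHECK(cudaFree(d_c)); CHECK(cudaFree(d_out));
    free(h_a); free(h_b); free(h_c); free(h_out);
    return 0;
}

Compile with nvcc (for example, nvcc -o sum3 sum3.cu) and run the resulting binary to see the measured kernel time and the validation result.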
Device properties
05:30
Summary
04:17
+ CUDA Execution model
16 lectures 02:23:22
All about warps
09:43
Warp divergence
12:28
Resource partitioning and latency hiding 1
05:35
Resource partitioning and latency hiding 2
10:41
Occupancy
11:16
Profile driven optimization with nvprof
12:04
Parallel reduction as synchronization example
19:08
Parallel reduction as warp divergence example
10:11
Parallel reduction with loop unrolling
07:03
Parallel reduction with warp unrolling
06:48
Reduction with complete unrolling
04:09
Performance comparison of reduction kernels
05:18
CUDA Dynamic parallelism
10:03
Reduction with dynamic parallelism
05:33
Summary
04:36
+ CUDA memory model
12 lectures 01:36:58
CUDA memory model
06:49
Different memory types in CUDA
09:04
Memory management and pinned memory
07:19
Zero copy memory
08:45
Unified memory
04:39
Global memory access patterns
12:55
Global memory writes
03:53
AOS vs SOA
06:03
Matrix transpose
19:34
Matrix transpose with unrolling
06:21
Matrix transpose with diagonal coordinate system
08:36
Summary
03:00
+ CUDA Shared memory and constant memory
13 lectures 01:37:15
Introduction to CUDA shared memory
09:04
Shared memory access modes and memory banks
09:06
Row major and Column major access to shared memory
08:51
Static and Dynamic shared memory
04:19
Shared memory padding
05:44
Parallel reduction with shared memory
04:44
Synchronization in CUDA
03:38
Matrix transpose with shared memory
11:53
CUDA constant memory
13:10
Matrix transpose with Shared memory padding
05:47
CUDA warp shuffle instructions
14:59
Parallel reduction with warp shuffle instructions
03:50
Summary
02:10
+ CUDA Streams
8 lectures 49:28
How to use CUDA asynchronous functions
07:10
How to use CUDA streams
10:28
Overlapping memory transfer and kernel execution
05:23
Stream synchronization and blocking behaviour of the NULL stream
06:57
Explicit and implicit synchronization
02:31
CUDA events and timing with CUDA events
06:03
Creating inter stream dependencies with events
04:31
+ Parallel Patterns and Applications
6 lectures 43:57
Scan algorithm introduction
05:38
Simple parallel scan
08:24
Work efficient parallel exclusive scan
09:33
Work efficient parallel inclusive scan
07:41
Parallel scan for large data sets
04:52
Parallel Compact algorithm
07:49
+ Bonus: Introduction to Image processing with CUDA
6 lectures 01:02:24
Introduction part 1
08:04
Introduction part 2
11:41
Digital image processing
09:39
Digital image fundamentals: Human perception
11:10
Digital image fundamentals: Image formation
15:22
OpenCV installation
06:28
Requirements
  • Basic C or C++ programming knowledge
  • How to use the Visual Studio IDE
  • CUDA toolkit
  • Nvidia GPU
Description

This course is all about CUDA programming. We will start our discussion by looking at basic concepts including the CUDA programming model, execution model, and memory model. Then we will show you how to implement advanced algorithms using CUDA. CUDA programming is all about performance, so throughout this course you will learn multiple optimization techniques and how to use them when implementing algorithms. We will also extensively discuss profiling techniques and some of the tools in the CUDA toolkit, including nvprof, nvvp, CUDA Memcheck, and CUDA-GDB. This course contains the following sections.

  • Introduction to CUDA programming and CUDA programming model
  • CUDA Execution model
  • CUDA memory model - Global memory
  • CUDA memory model - Shared and Constant memory
  • CUDA streams
  • Tuning CUDA instruction level primitives
  • Algorithm implementation with CUDA
  • CUDA tools
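
For readers who have never seen CUDA, the following is a minimal sketch (not taken from the course material) of the basic programming model covered in the first section: a kernel launched over a grid of thread blocks, with each thread identified by its blockIdx and threadIdx.

// Minimal illustrative CUDA program: every thread prints its own indices.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void hello_from_gpu(void)
{
    // Global thread index computed from block index, block size and thread index.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d, global id %d\n", blockIdx.x, threadIdx.x, gid);
}

int main(void)
{
    hello_from_gpu<<<2, 4>>>();   // launch 2 blocks of 4 threads each
    cudaDeviceSynchronize();      // wait for the kernel (and its printf output) to finish
    return 0;
}

The sections listed above start from kernels like this one and build up to optimized reductions, matrix transposes, scans and image-processing examples.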

This course also includes lots of programming exercises and quizzes. Answering them will help you digest the concepts we discuss here.

This course is the first course of the CUDA Masterclass series we are currently working on, so the knowledge you gain here is essential for following those courses as well.

Who this course is for:
  • Anyone who wants to learn CUDA programming from scratch to an intermediate level