CUDA programming Masterclass

Learn parallel programming on GPUs with CUDA, from basic concepts to advanced algorithm implementations.
4.2 (437 ratings)
2,854 students enrolled
Created by Kasun Liyanage
Last updated 12/2018
English
Current price: $83.99 Original price: $119.99 Discount: 30% off
30-Day Money-Back Guarantee
This course includes
  • 11 hours on-demand video
  • 62 downloadable resources
  • Full lifetime access
  • Access on mobile and TV
  • Assignments
  • Certificate of Completion
What you'll learn
  • All the basic knowledge about CUDA programming
  • Ability to design and implement optimized parallel algorithms
  • Basic workflow of parallel algorithm design
Course content
83 lectures 10:47:51
+ Introduction to CUDA programming and CUDA programming model
18 lectures 02:06:49
The purpose of this quick assignment is to give you some general background on parallel programming.
Let's investigate some background.
3 questions
How to install CUDA toolkit and first look at CUDA program
06:12
Basic elements of CUDA program
16:50
Organization of threads in a CUDA program - threadIdx
08:38
Organization of threads in a CUDA program - blockIdx, blockDim, gridDim
06:14
Programming exercise 1
00:29
Unique index calculation using threadIdx, blockIdx and blockDim
09:20
Unique index calculation for 2D grid 1
05:53
Unique index calculation for 2D grid 2
05:09
Memory transfer between host and device
11:13
Programming exercise 2
01:04
Sum array example with validity check
09:13
Sum array example with error handling
04:32
Sum array example with timing
08:18
In this assignment you have to implement array summation on the GPU that can sum 3 arrays. You have to use error handling and timing mechanisms as well, and then measure the execution time of your GPU implementation. (A minimal illustrative sketch follows this assignment entry.)
Extend sum array implementation to sum up 3 arrays
1 question
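As a rough guide to what such a program involves (this is a minimal illustrative sketch, not the course's reference solution), the example below sums three arrays on the GPU, wraps every CUDA runtime call in a simple error-checking macro, and times the kernel with CUDA events. The names CHECK and sum_three_arrays, the array size, and the block size are assumptions made for illustration only.

// Minimal sketch for the assignment idea above: sum three arrays on the GPU
// with error handling and event-based timing. All names here are illustrative.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Simple error-handling macro: abort on any failed CUDA runtime call.
#define CHECK(call)                                                         \
    do {                                                                    \
        cudaError_t err__ = (call);                                         \
        if (err__ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                     \
                    cudaGetErrorString(err__), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Each thread adds one element from each of the three input arrays.
__global__ void sum_three_arrays(const int *a, const int *b, const int *c,
                                 int *out, int n)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        out[gid] = a[gid] + b[gid] + c[gid];
}

int main(void)
{
    const int n = 1 << 20;                 // 1M elements (arbitrary choice)
    const size_t bytes = n * sizeof(int);

    // Host allocations and initialization.
    int *h_a = (int *)malloc(bytes), *h_b = (int *)malloc(bytes);
    int *h_c = (int *)malloc(bytes), *h_out = (int *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2 * i; h_c[i] = 3 * i; }

    // Device allocations and host-to-device transfers.
    int *d_a, *d_b, *d_c, *d_out;
    CHECK(cudaMalloc((void **)&d_a, bytes));
    CHECK(cudaMalloc((void **)&d_b, bytes));
    CHECK(cudaMalloc((void **)&d_c, bytes));
    CHECK(cudaMalloc((void **)&d_out, bytes));
    CHECK(cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_c, h_c, bytes, cudaMemcpyHostToDevice));

    // Time the kernel with CUDA events.
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);

    CHECK(cudaEventRecord(start));
    sum_three_arrays<<<grid, block>>>(d_a, d_b, d_c, d_out, n);
    CHECK(cudaGetLastError());
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("Kernel time: %.3f ms\n", ms);

    // Copy the result back and validate it against the CPU result.
    CHECK(cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost));
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != h_a[i] + h_b[i] + h_c[i]) {
            fprintf(stderr, "Mismatch at index %d\n", i);
            return EXIT_FAILURE;
        }
    }
    printf("Result verified\n");

    CHECK(cudaFree(d_a)); CHECK(cudaFree(d_b));
    CHECK(cudaFree(d_c)); CHECK(cudaFree(d_out));
    free(h_a); free(h_b); free(h_c); free(h_out);
    return 0;
}

Compile with nvcc (for example, nvcc -o sum3 sum3.cu) and run the resulting binary to see the measured kernel time and the validation result.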
Device properties
05:30
Summary
04:17
+ CUDA Execution model
16 lectures 02:23:22
All about warps
09:43
Warp divergence
12:28
Resource partitioning and latency hiding 1
05:35
Resource partitioning and latency hiding 2
10:41
Occupancy
11:16
Profile driven optimization with nvprof
12:04
Parallel reduction as synchronization example
19:08
Parallel reduction as warp divergence example
10:11
Parallel reduction with loop unrolling
07:03
Parallel reduction with warp unrolling
06:48
Reduction with complete unrolling
04:09
Performance comparison of reduction kernels
05:18
CUDA Dynamic parallelism
10:03
Reduction with dynamic parallelism
05:33
Summary
04:36
+ CUDA memory model
12 lectures 01:36:58
CUDA memory model
06:49
Different memory types in CUDA
09:04
Memory management and pinned memory
07:19
Zero copy memory
08:45
Unified memory
04:39
Global memory access patterns
12:55
Global memory writes
03:53
AOS vs SOA
06:03
Matrix transpose
19:34
Matrix transpose with unrolling
06:21
Matrix transpose with diagonal coordinate system
08:36
Summary
03:00
+ CUDA Shared memory and constant memory
13 lectures 01:37:15
Introduction to CUDA shared memory
09:04
Shared memory access modes and memory banks
09:06
Row major and Column major access to shared memory
08:51
Static and Dynamic shared memory
04:19
Shared memory padding
05:44
Parallel reduction with shared memory
04:44
Synchronization in CUDA
03:38
Matrix transpose with shared memory
11:53
CUDA constant memory
13:10
Matrix transpose with Shared memory padding
05:47
CUDA warp shuffle instructions
14:59
Parallel reduction with warp shuffle instructions
03:50
Summary
02:10
+ CUDA Streams
8 lectures 49:28
How to use CUDA asynchronous functions
07:10
How to use CUDA streams
10:28
Overlapping memory transfer and kernel execution
05:23
Stream synchronization and blocking behaviour of the NULL stream
06:57
Explicit and implicit synchronization
02:31
CUDA events and timing with CUDA events
06:03
Creating inter stream dependencies with events
04:31
+ Parallel Patterns and Applications
6 lectures 43:57
Scan algorithm introduction
05:38
Simple parallel scan
08:24
Work efficient parallel exclusive scan
09:33
Work efficient parallel inclusive scan
07:41
Parallel scan for large data sets
04:52
Parallel Compact algorithm
07:49
+ Bonus: Introduction to Image processing with CUDA
6 lectures 01:02:24
Introduction part 1
08:04
Introduction part 2
11:41
Digital image processing
09:39
Digital image fundamentals: Human perception
11:10
Digital image fundamentals: Image formation
15:22
OpenCV installation
06:28
Requirements
  • Basic C or C++ programming knowledge
  • How to use the Visual Studio IDE
  • CUDA toolkit
  • Nvidia GPU
Description

This course is all about CUDA programming. We will start our discussion by looking at basic concepts including the CUDA programming model, execution model, and memory model. Then we will show you how to implement advanced algorithms using CUDA. CUDA programming is all about performance, so throughout this course you will learn multiple optimization techniques and how to use them when implementing algorithms. We will also extensively discuss profiling techniques and some of the tools in the CUDA toolkit, including nvprof, nvvp, CUDA Memcheck, and CUDA-GDB. This course contains the following sections.

  • Introduction to CUDA programming and CUDA programming model
  • CUDA Execution model
  • CUDA memory model - Global memory
  • CUDA memory model - Shared and Constant memory
  • CUDA streams
  • Tuning CUDA instruction level primitives
  • Algorithm implementation with CUDA
  • CUDA tools
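
For readers who have never seen CUDA, the following is a minimal sketch (not taken from the course material) of the basic programming model covered in the first section: a kernel launched over a grid of thread blocks, with each thread identified by its blockIdx and threadIdx.

// Minimal illustrative CUDA program: every thread prints its own indices.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void hello_from_gpu(void)
{
    // Global thread index computed from block index, block size and thread index.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d, global id %d\n", blockIdx.x, threadIdx.x, gid);
}

int main(void)
{
    hello_from_gpu<<<2, 4>>>();   // launch 2 blocks of 4 threads each
    cudaDeviceSynchronize();      // wait for the kernel (and its printf output) to finish
    return 0;
}

The sections listed above start from kernels like this one and build up to optimized reductions, matrix transposes, scans and image-processing examples.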

This course also includes lots of programming exercises and quizzes. Answering them will help you digest the concepts we discuss here.

This course is the first course of the CUDA Masterclass series we are currently working on, so the knowledge you gain here is essential for following those courses as well.

Who this course is for:
  • Anyone who wants to learn CUDA programming from scratch to an intermediate level