Learning CUDA 10 Programming
- 2.5 hours on-demand video
- 1 downloadable resource
- Full lifetime access
- Access on mobile and TV
- Certificate of Completion
- Use CUDA to speed up your applications using machine learning, image processing, linear algebra, and more
- Learn to debug CUDA programs and handle errors
- Use optimization techniques to get the maximum performance from your CUDA programs
- Master the fundamentals of concurrency and parallel algorithms on GPUs
- Learn about the wide range of GPU-accelerated libraries included with CUDA
- Learn the next steps you can take to continue building your CUDA skills
Many programmers are interested in CUDA, but do not know much about it. This video will explain what CUDA is, and the kinds of applications it might be useful for.
• Explain GPU architectures and compare them to CPUs, showing the potential performance benefits
• Discuss how CUDA makes it easy to program GPUs
• See the tools which are included with the CUDA toolkit
The CUDA Toolkit must be installed to build and run CUDA programs. This video will walk through the steps for Windows users.
• Check if the prerequisites are already installed
• Run the installer
• Check the installation, by building and running a sample program
Students need to learn the basic structure of CUDA programs, and how to compile them. We will look at a simple example program, walking through the code, and explaining what each piece does.
• Walk through the "main" function, explaining each piece
• Illustrate the typical pattern used by CUDA programs (load data on host, process on device, save or display on host)
• Explain the kernel function in detail
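The pattern described above can be sketched in a minimal program (not the course's own example; the kernel name is illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread handles one element of the array.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1024;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = (float)i;   // load data on host

    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    addOne<<<(n + 255) / 256, 256>>>(dev, n);         // process on device

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    printf("host[0] = %f\n", host[0]);                // display on host
    return 0;
}
```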
To use CUDA effectively, we must understand the fundamental hardware and software architecture. This video will explain these concepts.
• Explain the driver and runtime APIs, and how they interact
• Look at threads, blocks, and grids
• Explore the CUDA hardware architecture, SIMT programming, and hardware multithreading
When running CUDA programs we need to set appropriate thread and block layouts for our problem. This video will illustrate 1D and 2D configurations, along with examples.
• Demonstrate kernels with 1D and 2D grids, in several configurations
• Discuss heuristics for determining block size
• Learn the CUDA occupancy API, for easily achieving maximum occupancy
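The occupancy API mentioned above can be used along these lines (`myKernel` and `launchWithOccupancy` are hypothetical names for illustration):

```cuda
__global__ void myKernel(float *data, int n);  // any kernel of this shape

void launchWithOccupancy(float *data, int n) {
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for a block size that maximizes occupancy.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
    int gridSize = (n + blockSize - 1) / blockSize;  // round up to cover all n
    myKernel<<<gridSize, blockSize>>>(data, n);
}
```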
Kernels will sometimes crash, and normal debuggers cannot locate errors in the device code. By using the Nsight debugger on Windows, we can quickly track down these errors.
• Look at single-stepping and examining variables in device code
• Show how to move to different threads and blocks
• Show how to use the CUDA memory checker, to quickly find the errors
Kernels will sometimes crash, and normal debuggers cannot locate errors in the device code. By using the cuda-gdb debugger on Linux, we can quickly track down errors.
• Learn how to run the debugger and step through device code
• Show how to move to different threads and blocks
• See how to use the CUDA memory checker, to quickly find errors
Errors can occur in kernels or CUDA API calls. A robust program needs to check for errors and handle them appropriately. We will learn how this is accomplished in CUDA.
• Show how to get error messages and continue after errors
• Demonstrate error checking for asynchronous kernel launches
• Understand how to simplify error handling with a macro
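A common shape for such a macro looks like this (`CUDA_CHECK` is a hypothetical name; the course's own macro may differ):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA API call; report and exit on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)
```

Because kernel launches are asynchronous, they are typically checked in two steps: `CUDA_CHECK(cudaGetLastError())` for launch-time errors, then `CUDA_CHECK(cudaDeviceSynchronize())` for errors raised while the kernel runs.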
We need to identify performance bottlenecks and possible fixes, in order to optimize kernel performance. With the Visual Profiler, we can identify opportunities for speeding up kernels.
• Learn how to set up the profiler
• Show an example program, which we will profile
• Profile the example program and examine the results
For maximum performance, we need to access memory in specific patterns. This video will explain the CUDA memory hierarchy, and show how to access memory efficiently.
• Explain the CUDA memory hierarchy and coalescing of memory access
• Learn about strided access, and how it degrades performance
• Demonstrate a fix for the example program
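The contrast between coalesced and strided access can be sketched with two copy kernels (assuming `in` holds at least `n * stride` elements for the strided case):

```cuda
// Coalesced: thread i reads element i, so a warp's 32 loads fall on
// adjacent addresses and combine into few memory transactions.
__global__ void copyCoalesced(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i*stride, scattering the warp's
// loads across memory and multiplying the number of transactions.
__global__ void copyStrided(float *out, const float *in, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i * stride];
}
```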
2D and 3D grids require specific memory layouts to achieve optimal performance. In this video, we will demonstrate this, and show how to use 2D and 3D memory.
• Show an example with a 2D grid, and profile it to identify performance problems
• Explain alignment requirements, and how they impact performance, with the help of an example
• Demonstrate the use of 2D memory, to improve performance
Some workloads will not generate coalesced memory access. We shall show how to use specialized caches, to improve performance in some of these cases.
• Show an example using constant memory, and demonstrate performance benefits
• Explore an example using texture memory, and demonstrate performance benefits
• Review the benefits and uses of texture and constant memory
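As an illustration of the constant-memory case (a small convolution filter; names and sizes are hypothetical):

```cuda
// Constant memory is cached and broadcast efficiently when every
// thread in a warp reads the same address, as with filter weights.
__constant__ float coeffs[9];

__global__ void filter9(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 4 || i >= n - 4) return;           // skip the borders
    float sum = 0.0f;
    for (int k = 0; k < 9; ++k)
        sum += coeffs[k] * in[i + k - 4];      // same coeff across the warp
    out[i] = sum;
}

// Host side, before launching:
//   cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(coeffs));
```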
After memory access is optimized, some additional techniques can be used to increase performance further. We will learn some of these techniques, and demonstrate them on previous examples.
• Improve ILP by computing multiple outputs per thread
• Improve compute efficiency by using fast intrinsic functions
• Explore various techniques for instruction and control flow optimization
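Two of these techniques can be combined in one small kernel (a sketch, assuming single-precision accuracy requirements allow the fast intrinsic):

```cuda
// More ILP: each thread computes four independent outputs, which
// helps hide instruction latency. __expf is the fast, lower-accuracy
// intrinsic alternative to expf.
__global__ void expFour(float *out, const float *in, int n) {
    int i = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    for (int k = 0; k < 4; ++k)
        if (i + k < n) out[i + k] = __expf(in[i + k]);
}
```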
Many algorithms require inter-thread communication. We will learn to use shared memory and synchronization to communicate within blocks.
• Show a naive matrix transpose with poor performance
• Review memory hierarchy and explain properties of shared memory
• Demonstrate the use of shared memory to improve transpose performance
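A standard shared-memory transpose follows this shape (a sketch of the well-known tiled pattern, launched with 32x32 thread blocks, not necessarily the course's exact code):

```cuda
#define TILE 32

// Read a 32x32 tile with coalesced loads, transpose it in shared
// memory, then write it back with coalesced stores. The +1 column of
// padding avoids shared-memory bank conflicts.
__global__ void transpose(float *out, const float *in, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();                     // wait for the whole tile

    x = blockIdx.y * TILE + threadIdx.x; // transposed block origin
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```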
Reduction is a fundamental building block of parallel algorithms, but the theoretical algorithm does not map directly to CUDA hardware. We will learn how to implement reduction in CUDA, by splitting the algorithm into blocks, and then combining those results.
• Explain the basic reduction algorithm
• Show how to implement reduction for a single block
• Look at the ways to combine reductions from multiple blocks
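The single-block step can be sketched as a shared-memory tree reduction (assuming a power-of-two block size and a launch with `blockDim.x * sizeof(float)` bytes of dynamic shared memory):

```cuda
// Each block reduces its slice to one partial sum; a second launch
// (or an atomicAdd) then combines the per-block results.
__global__ void reduceSum(const float *in, float *partial, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];  // pairwise tree step
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];   // one sum per block
}
```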
Prefix sum/scan presents additional challenges because it involves multiple steps, with full synchronization after each step. We will show how to perform this synchronization, by writing intermediate results to global memory, and launching separate kernels for each step.
• Explain the basic scan algorithm
• Give an overview of the CUDA scan implementation
• Look at the implementation in detail
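The per-block building block can be sketched with the naive Hillis–Steele scan (single block only; a full scan combines per-block results with the extra kernel launches described above):

```cuda
// Inclusive scan of up to blockDim.x elements in shared memory.
__global__ void scanBlock(const float *in, float *out, int n) {
    extern __shared__ float temp[];
    int tid = threadIdx.x;
    temp[tid] = (tid < n) ? in[tid] : 0.0f;
    __syncthreads();

    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        float v = (tid >= offset) ? temp[tid - offset] : 0.0f;
        __syncthreads();      // everyone reads before anyone writes
        temp[tid] += v;
        __syncthreads();
    }
    if (tid < n) out[tid] = temp[tid];
}
```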
Filtering/stream compaction is a useful operation, but parallel implementation is not obvious, because the position of each output element depends on many inputs. We will learn how scan can be used as a building block to create an efficient implementation.
• Explain the basic approach to filtering in CUDA
• Demonstrate the CUDA implementation
• Discuss techniques for improving performance
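In outline, scan-based compaction needs only two extra kernels around the scan (the predicate and names here are illustrative; `pos` is the exclusive scan of `flags`):

```cuda
// Step 1: mark which elements pass the filter.
__global__ void markKept(const float *in, int *flags, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) flags[i] = (in[i] > 0.0f) ? 1 : 0;   // example predicate
}

// Step 2 (scan of flags into pos) runs between these launches.

// Step 3: scatter kept elements to their scanned positions.
__global__ void scatter(const float *in, const int *flags,
                        const int *pos, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) out[pos[i]] = in[i];
}
```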
CUDA includes several libraries for deep learning. We will explain each library briefly, so you will understand what it can do, and whether it can be useful to you.
• Give an overview of Tensor Cores
• Join the developer program and download the deep learning libraries
• Briefly explain cuDNN, TensorRT, and DeepStream SDK
In order to fully utilize the GPU, it is sometimes necessary to run multiple kernels concurrently. We shall learn how to use streams and synchronization functions, to manage concurrency and improve utilization.
• Demonstrate a reduce kernel, which does not fully utilize the GPU
• Run multiple kernels in separate streams to increase utilization
• Overview of synchronization and events
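The stream pattern looks like this in outline (kernel name, `grid`, `block`, and the `chunks` array are illustrative):

```cuda
// Each stream gets its own slice of the work; kernels in different
// streams may execute concurrently when resources allow.
const int kStreams = 4;
cudaStream_t streams[kStreams];
for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

for (int s = 0; s < kStreams; ++s)
    reduceChunk<<<grid, block, 0, streams[s]>>>(chunks[s]); // 4th arg = stream

for (int s = 0; s < kStreams; ++s) {
    cudaStreamSynchronize(streams[s]);  // wait for this stream's work
    cudaStreamDestroy(streams[s]);
}
```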
Memory transfers often add significant overhead to CUDA programs. By overlapping transfers with host or device execution, we can improve performance.
• Show a program with significant transfer overhead
• Use page-locked memory, to enable asynchronous transfers
• Observe concurrency in the profiler
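A sketch of the overlap pattern (buffer names, sizes, and the `process` kernel are illustrative):

```cuda
// Page-locked (pinned) host memory enables truly asynchronous copies,
// so transfers can overlap kernels running in other streams.
float *hostBuf = nullptr;
cudaMallocHost(&hostBuf, bytes);               // pinned allocation

cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice, stream);
process<<<grid, block, 0, stream>>>(devBuf);   // queued behind the copy
cudaMemcpyAsync(hostBuf, devBuf, bytes, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);                 // wait for the whole chain

cudaFreeHost(hostBuf);
cudaStreamDestroy(stream);
```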
While CUDA abstracts away many hardware details, sometimes it is useful to have information about the available devices. In this video, we will learn about functions for querying devices, and setting the current device.
• Show functions for listing devices
• Understand how to get device properties
• Look at the functions for setting the current device
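A minimal query program along these lines (the properties printed are a small sample of what `cudaDeviceProp` exposes):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);                // how many CUDA devices
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, compute %d.%d, %d SMs\n",
               d, prop.name, prop.major, prop.minor,
               prop.multiProcessorCount);
    }
    cudaSetDevice(0);   // make device 0 current for subsequent calls
    return 0;
}
```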
When multiple CUDA devices are available, work can be spread between them for increased performance. However, this requires some changes to the programs. In this video, we will modify our concurrent kernel example to use multiple devices.
• Overview of multi-device programming
• Demonstrate kernel execution on multiple devices
• Observe performance improvement from using multiple devices
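In outline, spreading work across devices means repeating the allocate-and-launch sequence per device (`devBufs`, `chunkBytes`, and the kernel are illustrative):

```cuda
// cudaSetDevice makes one device current; allocations and launches
// that follow it target that device.
int count = 0;
cudaGetDeviceCount(&count);
for (int d = 0; d < count; ++d) {
    cudaSetDevice(d);
    cudaMalloc(&devBufs[d], chunkBytes);       // buffer on device d
    kernel<<<grid, block>>>(devBufs[d]);       // runs on device d
}
for (int d = 0; d < count; ++d) {
    cudaSetDevice(d);
    cudaDeviceSynchronize();                   // wait for each device
}
```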
Keeping track of host and device memory can require some bookkeeping, and it may also complicate CUDA programs. We shall learn how the unified address space allows CUDA to automatically track memory type and location, simplifying our programs.
• Overview of the unified address space
• Modify the multi-device example to use the unified address space
• Show how to get and print pointer information
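The managed-memory version of the pattern, in sketch form (`scale`, `grid`, and `block` are illustrative):

```cuda
// With managed (unified) memory, one pointer is valid on both host
// and device; the runtime migrates data as needed.
float *data = nullptr;
cudaMallocManaged(&data, n * sizeof(float));
for (int i = 0; i < n; ++i) data[i] = 1.0f;    // host writes directly

scale<<<grid, block>>>(data, n);               // device uses same pointer
cudaDeviceSynchronize();                       // sync before host reads

cudaPointerAttributes attr;
cudaPointerGetAttributes(&attr, data);         // query type and location
cudaFree(data);
```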
Allocating device memory with host APIs limits the flexibility of kernels. In this video, we shall show how to allocate global memory from within kernels.
• Overview of allocation from device code
• Demonstrate use of dynamic allocation in kernels
• Explain technical details and limitations
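Device-side allocation uses `malloc`/`free` inside the kernel, drawing from a separate device heap (a sketch; the heap-size value is an arbitrary example):

```cuda
// Memory allocated by device-side malloc lives in a dedicated device
// heap and must also be freed by device code.
__global__ void useDeviceHeap(int n) {
    float *scratch = (float *)malloc(n * sizeof(float));
    if (scratch) {                       // malloc can fail: heap is small
        for (int k = 0; k < n; ++k) scratch[k] = 0.0f;
        free(scratch);
    }
}

// Host side, before launch, to enlarge the default device heap:
//   cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);
```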
For some problems, a static grid layout is inefficient or inconvenient. By using dynamic parallelism, we can launch nested kernels, to adjust the grid structure on the fly.
• Demonstrate the use of nested kernels with dynamic parallelism
• Discuss hardware and compilation requirements
• Explain technical details of dynamic parallelism
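A nested launch looks like an ordinary launch, but written inside a kernel (requires a device of compute capability 3.5+ and compilation with relocatable device code, `-rdc=true`):

```cuda
__global__ void child(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void parent(float *data, int n) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        // The grid shape is chosen on the device, from runtime data.
        child<<<(n + 255) / 256, 256>>>(data, n);
    }
}
```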
- A good understanding of programming in modern C++ (C++17) is required in order to implement the concepts in this course.
Do you want to write GPU-accelerated applications, but don't know how to get started? With CUDA 10, you can easily add GPU processing to your C and C++ projects. CUDA 10 is the de facto framework for developing high-performance, GPU-accelerated applications.
In this course, you will be introduced to CUDA programming through hands-on examples. CUDA provides a general-purpose programming model which gives you access to the tremendous computational power of modern GPUs, as well as powerful libraries for machine learning, image processing, linear algebra, and parallel algorithms.
After working through this course, you will understand the fundamentals of CUDA programming and be able to start using it in your applications right away.
About the Author
Nathan Weston is a software developer based in the Boston, USA area. He has worked in the visual effects industry, where he made extensive use of CUDA, and also has experience in software engineering research and scientific applications. He now works as a consultant with local and remote clients.
- If you want to learn how to use parallel and high-performance computing techniques to develop modern applications using GPUs and CUDA, then this course is for you.