Accelerating Applications with CUDA

Three-day course

Our most popular CUDA training course is a full three-day course. The target audience is reasonably competent programmers: we assume familiarity with basic C, but no other knowledge. In particular, we do not assume any experience of parallel programming or OpenMP. The course is organised as follows:

Day 1, Lecture 1 (1.5 hrs)
Introductory lecture: a simple but non-trivial Monte Carlo application is taken from a serial C code to a fully working CUDA-accelerated program. By the end of the lecture, attendees will understand the essential hardware and software aspects of CUDA programming.
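To give a flavour of the serial-to-CUDA journey this lecture walks through, here is a minimal sketch (not the course's Monte Carlo example itself) showing the standard pattern: allocate device memory, copy data up, launch a kernel with one thread per element, and copy the result back.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// The serial loop body becomes a kernel: one thread handles one element.
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)           // guard threads launched past the end of the array
        x[i] *= a;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h_x[i] = 1.0f;

    float *d_x;
    cudaMalloc(&d_x, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

    // Round the grid size up so every element is covered.
    scale<<<(n + 255) / 256, 256>>>(d_x, 2.0f, n);

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
    printf("x[0] = %f\n", h_x[0]);

    cudaFree(d_x);
    free(h_x);
    return 0;
}
```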

Day 1, Practical 1 (4 hrs)
Attendees apply what was learnt in Lecture 1. Since this is often their first encounter with CUDA, the practical session allows ample time, and the exercises build on each other, starting from a very simple level.

Day 1, Lecture 2 (1.5 hrs)
The lecture focuses on the memory hierarchy: caching, coalesced transfers, data layout, and our experiences using the different memory types. Finally, we address error handling, explain the pitfalls, and present the solution adopted in the NAG GPU routines.
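The specific error-handling scheme used in the NAG GPU routines is presented in the lecture; as background, a widely used pattern is to wrap every runtime call in a checking macro, roughly as follows (a sketch, not the NAG implementation).

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Report failures at the point they occur, with file and line information.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)
```

One of the pitfalls the lecture covers is that kernel launches return no status and execute asynchronously, so after a launch one typically checks `cudaGetLastError()` for launch errors and `cudaDeviceSynchronize()` for execution errors.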

Day 2, Lecture 3 (1 hr)
Performance issues. We explore warp divergence, occupancy, and the practical issues of tuning register counts. Next, we consider memory-bandwidth-bound applications and strategies for when increasing occupancy doesn't help. Lastly, we examine atomics and the advanced caching features accessible through inline PTX.
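As an example of the kind of atomic operation whose performance is examined here, a histogram kernel (a hypothetical illustration, not taken from the course materials) lets many threads increment shared counters safely.

```cuda
#include <cuda_runtime.h>

// Each thread classifies one input byte and bumps the matching bin.
// atomicAdd serialises conflicting updates, but on modern hardware the
// cost is often lighter than intuition suggests.
__global__ void histogram256(const unsigned char *data, int n,
                             unsigned int *bins)   // bins[256], zero-initialised
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);
}
```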

Day 2, Practical 2 (4 hrs)
We demonstrate many of the issues covered in Lecture 3, in particular the performance impact of atomics (often surprisingly light!) and the fact that occupancy can, at times, be a red herring.

Day 2, Lecture 4 (1.5 hrs)
Concurrency. The goal is to understand the hardware capabilities, software limitations and the CUDA API well enough to achieve three-way overlap of data upload to the GPU, data download from the GPU, and kernel execution. Several caveats which can easily trip up the unwary are covered. Finally, we discuss mapped memory.
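The three-way overlap described above is usually built from CUDA streams and asynchronous copies. The sketch below shows the basic shape, assuming pinned host buffers (allocated with `cudaHostAlloc`), a device supporting concurrent copy and execution, and hypothetical names (`process`, `off`, `nChunks`, chunk sizes) standing in for application specifics.

```cuda
// Work is split into chunks; alternating streams lets chunk c's kernel
// run while chunk c+1 uploads and chunk c-1 downloads.
cudaStream_t s[2];
for (int i = 0; i < 2; i++)
    cudaStreamCreate(&s[i]);

for (int c = 0; c < nChunks; c++) {
    cudaStream_t st = s[c % 2];
    cudaMemcpyAsync(d_in + off(c), h_in + off(c), chunkBytes,
                    cudaMemcpyHostToDevice, st);          // upload
    process<<<blocks, threads, 0, st>>>(d_in + off(c),
                                        d_out + off(c));  // compute
    cudaMemcpyAsync(h_out + off(c), d_out + off(c), chunkBytes,
                    cudaMemcpyDeviceToHost, st);          // download
}
cudaDeviceSynchronize();   // wait for all streams to drain
```

One of the caveats the lecture covers: the asynchronous copies fall back to synchronous behaviour if the host buffers are pageable rather than pinned, silently destroying the overlap.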

Day 2/3, Practical 3 (4 hrs)
Attendees write a program which overlaps copies to and from the GPU with kernel execution. We also explore mapped memory and examine the performance characteristics of both techniques, and of PCIe data transfers in general.
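For the mapped-memory part of the practical, the essential API sequence looks roughly like this (a sketch; `kernel`, `blocks`, `threads` and `n` are hypothetical placeholders).

```cuda
// Mapped (zero-copy) memory: the kernel dereferences host memory directly
// over PCIe, so no explicit cudaMemcpy is needed.
float *h_buf, *d_alias;
cudaSetDeviceFlags(cudaDeviceMapHost);              // must precede context creation
cudaHostAlloc(&h_buf, n * sizeof(float), cudaHostAllocMapped);
cudaHostGetDevicePointer(&d_alias, h_buf, 0);       // device-side view of h_buf
kernel<<<blocks, threads>>>(d_alias, n);
cudaDeviceSynchronize();                            // results now visible in h_buf
cudaFreeHost(h_buf);
```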

Day 3, Lecture 5 (1 hr)
We examine multi-GPU programming, the Unified Address Space and GPUDirect (direct copies and peer access).
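The "direct copy and peer access" features mentioned here reduce, at their simplest, to a short API sequence (a sketch assuming two devices, 0 and 1, with hypothetical device buffers `d_buf0` and `d_buf1`).

```cuda
// Peer-to-peer: with the unified address space, one GPU can copy from,
// or directly dereference, another GPU's memory if peer access is enabled.
int canAccess = 0;
cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can device 0 reach device 1?
if (canAccess) {
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);        // flags argument must be 0
    cudaMemcpyPeer(d_buf0, 0,                // dst pointer, dst device
                   d_buf1, 1,                // src pointer, src device
                   bytes);                   // direct GPU-to-GPU transfer
}
```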

Day 3, Practical 4 (3 hrs)
Attendees write a multi-GPU version of a simple Monte Carlo simulation. They also experiment with several features of the UAS and GPUDirect, and examine their system's performance on these more advanced PCIe operations.