CUDA Essentials II - Kernel launching

Slides: 2.2.2 CUDA Essentials - Kernel Launching-2.pdf

Transcription of the video lecture

Slide 2 – Four Key Points

There are four main points I want to emphasize.

  • First, we use the term kernel launching for the invocation of a function to be run on the GPU.
  • Second, CUDA follows the SIMT model, in which each GPU thread executes the kernel function.
  • Third, threads in CUDA are organized in thread blocks, so we need to decide the number of thread blocks and the number of threads per block (the so-called execution configuration).
  • Finally, when we define a kernel in our code, we need to prepend a function qualifier to the function.

 

Slide 3 –  CUDA Workflow

We saw in the previous lecture that the CUDA workflow consists of three phases and provides a guide for developing applications that run on GPUs. In the previous lecture, we focused on the first and third phases of the CUDA workflow. In this lecture and the next, we focus on the second phase of the CUDA workflow: we will learn how to define and execute a function that runs on the GPU cores.

 

Slide 4 - CUDA Code Execution on GPU

In the GPU architectures slides, we learnt that CUDA follows the SIMT paradigm. In the SIMT model, each CUDA thread executes the same function (what CUDA calls a kernel) on the GPU cores.

In order to run a kernel on the CUDA threads, we need two things. First, in the main() function of the program, we call the function to be executed by each thread on the GPU. This invocation is called a kernel launch, and with it we need to provide the number of threads and their grouping. Second, we need a way to indicate that a function is to be executed on the GPU. In the following slides, we will focus on these two aspects.

 

Slide 5 - Kernel Launching

Once we have completed phase 1 of the CUDA workflow, that is, once the data is in GPU memory, we are ready to call a function to be run on the GPU. Function invocation, also called kernel launching, is done simply by using the function name with its arguments between the parentheses. The difference from a normal function invocation is that we now provide two additional arguments between triple angle brackets. These two additional arguments are the number of thread blocks (Dg) and the number of threads per block (Db). Together, these two values constitute the execution configuration.
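As a minimal sketch of this launch syntax (the kernel name scale, its arguments, and the particular Dg and Db values are made up for illustration; the device pointer is assumed to have been prepared in phase 1):

    #include <cuda_runtime.h>

    // Hypothetical kernel: the name and arguments are placeholders.
    __global__ void scale(float *d_data, float factor)
    {
        // Body omitted; writing the kernel itself is covered in the next lecture.
    }

    int main(void)
    {
        float *d_data;
        cudaMalloc(&d_data, 1024 * sizeof(float));   // phase 1: data in GPU memory

        int Dg = 4;     // number of thread blocks
        int Db = 256;   // number of threads per block

        // The execution configuration <<<Dg, Db>>> goes right after the
        // function name, followed by the usual argument list.
        scale<<<Dg, Db>>>(d_data, 2.0f);

        cudaDeviceSynchronize();   // wait for the kernel to finish
        cudaFree(d_data);
        return 0;
    }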

 

Slide 6 - Thread Organization in CUDA

But what are Dg and Db? In practice, these two values describe how CUDA groups threads on a GPU. Threads are divided into blocks, called thread blocks. The number of these blocks is Dg, and each block has Db threads. There is a maximum number of threads per block, equal to 1024. To get the total number of threads, you multiply the number of blocks (Dg) by the number of threads per block (Db).

In addition, thread blocks are organized in grids that can be one-, two- or three-dimensional. In the case of two- and three-dimensional grids, we need to provide the number of blocks in each additional direction. In the lab assignments and project, we will mainly use one-dimensional grids, so we don’t have to worry too much about 2D and 3D. However, two- and three-dimensional grids can be handy when you are doing image and video processing, respectively. Just to give an example of thread organization, look at the figure at the bottom of this page. In this case, we have a one-dimensional grid with 7 thread blocks, so Dg will be 7. Each block includes 256 threads, so Db will be 256. In total, we will have Dg*Db = 1,792 threads running on the GPU.

 

Slide 7 - Thread Organization in CUDA II

I think an example will make this a little clearer. Let’s focus on the one-dimensional case, where the number of thread blocks and the number of threads per block are simply two scalar values, Dg and Db. We first initialize Dg and Db to 7 and 256, respectively, and then place them between the triple angle brackets.

A one-dimensional grid can also be seen as a three-dimensional grid: a 3D grid can be defined with the CUDA dim3 type using only 1 grid point in the y and z directions. For this reason, we can also use the equivalent form dim3(Dg,1,1), dim3(Db,1,1) in the execution configuration, as in the sketch below.
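A short sketch of the two equivalent launch forms from this slide (the kernel name is again a placeholder):

    #include <cuda_runtime.h>

    // Hypothetical empty kernel, used only to show the launch forms.
    __global__ void my_kernel(void) { }

    int main(void)
    {
        int Dg = 7;     // thread blocks in the one-dimensional grid
        int Db = 256;   // threads per block (at most 1024)

        // Scalar form: Dg blocks of Db threads, 7 * 256 = 1,792 threads in total.
        my_kernel<<<Dg, Db>>>();

        // Equivalent dim3 form: the y and z extents are simply set to 1.
        my_kernel<<<dim3(Dg, 1, 1), dim3(Db, 1, 1)>>>();

        cudaDeviceSynchronize();
        return 0;
    }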

 

Slide 8 - Work ≠ Resources

When I started using CUDA, I was always confused by trying to map the number of threads to the architecture in use. For instance, I thought that if I have two SMs, I need to use two thread blocks, and that if I have 8 SPs per SM, I have to use 8 threads per block. But I was dead wrong. In fact, thread blocks and grids are abstractions to express the application’s work. When we launch a kernel, we specify the work we want to do; the GPU then uses its resources (SMs, SPs, ...) to perform that work. In general, we provide much more work than the available resources, because we use oversubscription to hide latency: the number of blocks is higher than the number of SMs, and the number of threads per block is higher than the number of SPs per SM.
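As a hedged sketch of sizing the launch by the work rather than by the hardware (the kernel name, the problem size N, and the choice of 256 threads per block are assumptions), the grid is derived from the number of elements to process, and it typically ends up much larger than the number of SMs the device reports:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical kernel: it only stands in for "the work we want to do".
    __global__ void process(float *d_data, int n) { }

    int main(void)
    {
        const int N = 1 << 20;            // assumed problem size: about one million elements
        float *d_data;
        cudaMalloc(&d_data, N * sizeof(float));

        int Db = 256;                     // threads per block, chosen for the work
        int Dg = (N + Db - 1) / Db;       // enough blocks to cover all N elements (4096 here)

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        // The grid is sized by the work, not by the hardware: Dg is usually far
        // larger than the number of SMs, which is what lets the GPU hide latency.
        printf("blocks launched: %d, SMs on this device: %d\n", Dg, prop.multiProcessorCount);

        process<<<Dg, Db>>>(d_data, N);
        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }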

 

Slide 10 - How do you Define a Kernel Function?

We will now learn how to indicate that a function needs to run on the GPU. CUDA distinguishes between functions depending on whether the CPU or the GPU calls them, and whether they run on the GPU or the CPU. CUDA makes this distinction by prepending special function type qualifiers. The one we will likely use the most is __global__, which is the qualifier for kernels. If we prepend __device__, then the function is called from the device and executes on the device; if you have a function that is called from a kernel, it needs the __device__ qualifier in its declaration. Finally, there is the boring default, __host__, for a function called from the host and executed on the host.
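A short sketch of the three qualifiers (the function names and bodies are made up for illustration):

    // __device__: called from the device, executes on the device.
    __device__ float helper(float x)
    {
        return 2.0f * x;
    }

    // __global__: called from the host, executes on the device; this is a kernel.
    __global__ void my_kernel(float *d_data)
    {
        d_data[0] = helper(d_data[0]);   // a kernel may call a __device__ function
    }

    // __host__: called from the host, executes on the host (the default, so usually omitted).
    __host__ float on_cpu(float x)
    {
        return 2.0f * x;
    }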

 

Slide 11 - Question: which qualifier do you use before the function you call from the CPU and want to run on the GPU?

So now a question for us: which qualifier goes before a function that we call from the CPU to run on the GPU? The answer is __global__.

 

Slide 12 - Kernel Limitations

When I introduced kernels as special functions to be run on the GPU, I omitted two important aspects about kernels. The first point is that kernels execute on the GPU and, in general, do not have access to data stored on the host side. The second important point is that kernels cannot return a value, so the return type is always void. How do I get the results from my kernel? We need a cudaMemcpy() call to move the data from the GPU back to the CPU, as in the sketch below.
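As a hedged sketch of both points (the kernel, the array size, and the use of threadIdx.x are assumptions for illustration), the kernel returns void, writes its result into device memory, and the host copies it back with cudaMemcpy():

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical kernel: note the void return type and that it only touches device memory.
    __global__ void fill(float *d_out)
    {
        d_out[threadIdx.x] = 1.0f;   // each thread writes one element
    }

    int main(void)
    {
        const int N = 256;
        float h_out[N];
        float *d_out;
        cudaMalloc(&d_out, N * sizeof(float));

        fill<<<1, N>>>(d_out);       // the kernel cannot return a value

        // Phase 3 of the workflow: copy the result back to the host to use it.
        cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);

        printf("h_out[0] = %f\n", h_out[0]);
        cudaFree(d_out);
        return 0;
    }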

 

Slide 13 – To Summarize

In this lecture, we looked at how to run a function on GPUs. We focused on four main points. First, we use the term kernel launching for the invocation of a function to be run on the GPU. Second, CUDA follows the SIMT model, in which each CUDA thread executes the kernel function. Third, threads in CUDA are organized in thread blocks, so we need to decide the number of thread blocks and the number of threads per block, the so-called execution configuration. Finally, when we define a kernel in our code, we need to prepend the __global__ or __device__ qualifier before the function name.