CUDA Essentials III - Kernel execution, indexing and vector types

Slides: 2.2.3 CUDA Essentials - Kernel Execution, Indexing and Vector Types-1.pdf

Transcription of the video lecture

Slide 2 – Three Key Points

This lecture has three key points.

  • First, in a kernel, to determine which part of the array to work on, or to choose among different execution paths, we need to obtain the thread ID. For that, we have to retrieve the ID of the block and the ID of the thread within the block, and then do some simple math called thread indexing.
  • Second, we need to keep in mind that kernel launches, unlike CUDA data movement functions, are asynchronous, so we might need synchronization to ensure correct execution.
  • Third, CUDA provides vector types, similar to C structures with 2, 3, and 4 members, to improve memory bandwidth.

 

Slide 3 –  Kernel Execution

As we saw in previous lectures, CUDA uses the Single Instruction, Multiple Threads (SIMT) paradigm. In this approach, every thread executes the same kernel function. However, we want a way for a kernel to operate on different elements of an input array, or to execute a different series of operations. The way to do that is to associate an ID with each thread and use that ID so that threads access different elements of the input array. For instance, if we have an input array with 20 elements, we might want 20 threads to operate on the 20 array elements in parallel: thread 0 takes element 0, thread 1 takes element 1, thread 2 takes element 2, and so on. For this reason, we need a way to determine each thread's ID number. The problem of determining the thread ID is called thread indexing.

 

Slide 4 - CUDA Built-in Variables: Dimension and Index Variables

How do we find the thread ID? Within a kernel, CUDA lets us read two index variables, blockIdx.x and threadIdx.x, to determine the ID of the block and the ID of the thread within the block. Knowing these, can we compute the thread ID, or are we missing something? We are: within the kernel, we don't yet know how many threads a block has. We can get this information from a so-called dimension variable, blockDim.x, which gives us the number of threads per block.
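As a minimal illustration (the kernel name readBuiltins() is mine, not from the slides), inside a kernel these built-ins can be read like ordinary variables:

    __global__ void readBuiltins()
    {
        int block           = blockIdx.x;   // ID of this block in the grid
        int threadInBlock   = threadIdx.x;  // ID of this thread within its block
        int threadsPerBlock = blockDim.x;   // number of threads per block
        // The next slide combines these three values into a global thread ID.
    }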

 

Slide 5 - What is my thread ID?

Let’s look at the practical example of finding the ID of the thread shown in blue in this diagram. In the main() code, we launch a kernel using an execution configuration of 7 blocks with 256 threads each. We see this in the triple angle brackets. When we write our kernel k(), each thread needs to determine its own thread ID. The way to do that is first to calculate the offset to the first thread ID in the block by multiplying the block ID by the number of threads per block. Once we have the offset, we simply add the thread's ID within the block, and voilà, we have the thread ID. It was easy, right?
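A short sketch of this computation (the kernel name k() follows the slide; the output array and its use are my own additions for illustration):

    __global__ void k(int *out)
    {
        int offset = blockIdx.x * blockDim.x;  // first thread ID in this block
        int tid    = offset + threadIdx.x;     // this thread's global ID
        out[tid] = tid;                        // e.g., record the ID
    }

    int main()
    {
        int *d_out;
        cudaMalloc(&d_out, 7 * 256 * sizeof(int));
        k<<<7, 256>>>(d_out);                  // 7 blocks of 256 threads each
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }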

 

Slide 6 - Asynchronous Execution

An important aspect of kernel launches is that they execute asynchronously.

Asynchronous execution means that a kernel launch returns control to the CPU immediately after starting up the GPU work, before the kernel has finished executing. To synchronize all threads on the GPU, or just the threads within a block, CUDA provides two functions: first, cudaDeviceSynchronize(), which makes the CPU wait until all threads on the GPU have finished; second, __syncthreads(), which synchronizes threads only within a block.
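As a sketch of __syncthreads() (the kernel and array names are illustrative, not from the slides), here each thread stages a value in shared memory, and the barrier guarantees all writes are visible before any thread reads a neighbor's entry:

    __global__ void reverseInBlock(int *data)   // launch with 256 threads per block
    {
        __shared__ int tile[256];               // one entry per thread in the block
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        tile[tid] = data[i];
        __syncthreads();                        // wait until every thread has written

        data[i] = tile[blockDim.x - 1 - tid];   // reverse within the block
    }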

The traditional example where we need cudaDeviceSynchronize() is when we use printf in a kernel. We launch a kernel that does a calculation and then prints the result. I don't want to disappoint you, but if you haven't tried it yet: you won't see any output on the display. The reason is the asynchronous nature of kernel launching. Because the launch is asynchronous, control returns to the CPU immediately, before the kernel has finished executing. So what does the CPU do next, after launching the kernel? It exits the application. At application exit, the GPU's ability to send output to the standard output is terminated by the OS. Thus, the output that is generated later by the kernel has nowhere to go, and we won't see it. On the other hand, if we use cudaDeviceSynchronize(), then the kernel is guaranteed to finish, and its output will find a waiting standard output queue before the application is allowed to exit.
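A minimal sketch of this scenario (the kernel name hello() is illustrative):

    #include <cstdio>

    __global__ void hello()
    {
        printf("Hello from thread %d\n", threadIdx.x);
    }

    int main()
    {
        hello<<<1, 4>>>();
        cudaDeviceSynchronize();  // without this line, main() may exit
                                  // before the kernel prints anything
        return 0;
    }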

 

Slide 7 – CUDA Data Movement Functions are Synchronous

By default, CUDA data movement functions, like cudaMalloc() and cudaMemcpy(), are synchronous. In other words, these functions do not return until the data transfer is complete; for instance, cudaMemcpy() finishes execution before the CPU can move on to other operations.
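A small sketch of this behavior (buffer names and sizes are illustrative): because both copies below are synchronous, the host can safely read h_out as soon as the second cudaMemcpy() returns.

    #include <cstdio>

    int main()
    {
        const int N = 4;
        float h_in[N] = {1.f, 2.f, 3.f, 4.f}, h_out[N];
        float *d_buf;

        cudaMalloc(&d_buf, N * sizeof(float));
        cudaMemcpy(d_buf, h_in, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(h_out, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);

        printf("%f\n", h_out[0]);  // valid: both copies have already completed
        cudaFree(d_buf);
        return 0;
    }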

 

Slide 8 - CUDA Vector Types

A nice feature of CUDA is its support for vector types, which extend the standard C data types to lengths of up to 4. For instance, if we want a vector with four elements of type float, we can use the type float4. If we want a vector with two elements of type int, we can use the CUDA vector type int2. We can access individual components using the suffixes .x, .y, .z, and .w, as if accessing members of a structure. Accessing components beyond those declared for the vector type is an error.
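For example (a host-side sketch; make_float4() and make_int2() are CUDA's standard constructor helpers):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        float4 p = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
        int2   q = make_int2(10, 20);

        printf("%f %d\n", p.x + p.y + p.z + p.w, q.x + q.y);
        // q.z would not compile: int2 only declares members .x and .y
        return 0;
    }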

 

Slide 9 - Why is it good to use CUDA vector types?

CUDA vector types really look like standard C structures, so it is legitimate to ask ourselves what the benefit could be. It turns out that vector types enable vectorized loads and stores in CUDA C/C++ and help increase bandwidth utilization. For this reason, we should use them whenever possible.
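As a sketch of what this enables (assuming the element count is a multiple of 4 and the pointers are 16-byte aligned, which cudaMalloc() guarantees), a copy kernel written in terms of float4 moves 16 bytes per thread per load and store, which the compiler can turn into wide 128-bit memory instructions:

    __global__ void copyVec4(float4 *dst, const float4 *src, int n4)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4)
            dst[i] = src[i];  // one 128-bit load and one 128-bit store
    }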

 

Slide 10 – CUDA Data types for Index and Dimension Variables

At this point, it might not be a surprise to you that the index variables, blockIdx and threadIdx, are indeed CUDA vector types. Index variables are three-element vectors of type unsigned int, which corresponds to the CUDA data type uint3. In addition, when I introduced CUDA grids, I briefly mentioned the type dim3. CUDA uses the vector type dim3 for the dimension variables, gridDim and blockDim. The dim3 type is a CUDA type equivalent to uint3, with unspecified entries set to 1. dim3 variables are used to specify the execution configuration.
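A short sketch of dim3 in an execution configuration (the empty kernel k() is just to show the launch syntax):

    __global__ void k() { }

    int main()
    {
        dim3 grid(7);     // unspecified components default to 1: (7, 1, 1)
        dim3 block(256);  // (256, 1, 1)

        k<<<grid, block>>>();  // equivalent to k<<<7, 256>>>()
        cudaDeviceSynchronize();
        return 0;
    }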

 

Slide 11 – To Summarize

This lecture concludes the CUDA Essentials series. In this lecture, we focused on three points. First, in a kernel we need to obtain the thread ID. For that, we have to retrieve the ID of the block and the ID of the thread within the block, and do thread indexing. Second, we need to keep in mind that kernel launches, unlike CUDA data movement functions, are asynchronous, so we might need synchronization to ensure correct execution, for example when printing from the device. Third, CUDA provides vector types, similar to C structures with 2, 3, and 4 members, to improve memory bandwidth.