Our first CUDA program

Slides: 2.3 Our First CUDA Program-1.pdf

Code: kernel.cu

Transcription of the video lecture

 Slide 2 – One main point

We write our first program by following the CUDA workflow in a few simple steps: first, we create a CUDA source file, starting from a C code and modifying it slightly. Second, we allocate memory on the GPU and launch the kernel, providing the execution configuration. Third, we define the kernels to be run on the GPU. Finally, we retrieve the thread ID from the index and dimension variables inside the kernel.

 

Slide 3 –  Problem: dist_v1

The problem I would like to solve is the following: I want to calculate an array of distances from a reference point, set to 0.5, to each of N points uniformly spaced between 0 and 1 along a line segment. In this exercise, there are two computations. First, I want to assign a position between 0 and 1 to each of the N points; we will call this the scale operation. Second, I want to calculate the distance between each element of the array and the reference point at 0.5; we will call this the distance operation.

 

Slide 4 - Serial C Code

This slide presents a simple serial C code to perform the scale and distance operations. The code has a single loop that scales the loop index to create an input location, and then computes and stores the distance from the reference location. We will see that the CUDA implementation eliminates the for loop by using a grid of blocks.
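As a reference, here is a minimal sketch of what such a serial version might look like. The names scale(), distance(), out, and the value N = 64 are assumptions made here for illustration, not necessarily the exact contents of main.cpp:

    #include <math.h>
    #include <stdlib.h>

    #define N 64

    // Map index i in [0, n-1] to a position in [0, 1] (scale operation).
    float scale(int i, int n) { return ((float)i) / (n - 1); }

    // Distance between two points on the line (distance operation).
    float distance(float x1, float x2) { return sqrtf((x2 - x1) * (x2 - x1)); }

    int main() {
        const float ref = 0.5f;
        float *out = (float *)calloc(N, sizeof(float));
        for (int i = 0; i < N; ++i) {
            const float x = scale(i, N);   // input location for this index
            out[i] = distance(x, ref);     // distance from the reference point
        }
        free(out);
        return 0;
    }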

 

Slide 5 - Overall Strategy for Porting to GPUs

This slide presents a basic strategy for using CUDA to solve this problem. We have N points to process, so we will launch N threads, each carrying out the scale and distance operations in parallel. In CUDA, we provide the number of blocks and the number of threads per block to set the total number of threads. Typically, the number of threads per block, TPB, is fixed to a multiple of 32. So we fix TPB and calculate the number of blocks as N divided by TPB, since we want one thread per grid point. Once the kernel is launched, each thread computes the scale and distance operations on its own grid point and stores the result in GPU memory as an element of d_out.
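As a worked example (with N = 64 and TPB = 32, values assumed here for illustration): we would launch 64/32 = 2 blocks of 32 threads, i.e. 64 threads in total, one per grid point.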

 

Slide 6 - 1. Create the CUDA Source File

The first step is to create a source file kernel.cu; CUDA source files have the extension .cu. Once we have done that, we can copy and paste the content of main.cpp into kernel.cu. Question: is this a CUDA code? Yes, every regular C code is also a CUDA code, one that runs only on the CPU.

 

Slide 7 - 2.1 Modify kernel.cu

As a second step, we start modifying kernel.cu. We can delete #include <math.h>, because the CUDA internal headers already include math.h, and insert #include <stdio.h> to enable printing the output. We also set the number of threads per block to 32 with #define TPB 32, as discussed in the previous slides.
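The top of kernel.cu would then look roughly like this (a sketch; N = 64 is an assumed value for the number of points):

    #include <stdio.h>   // for printf, including printing from the kernel

    #define N 64         // number of grid points (assumed value)
    #define TPB 32       // threads per block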

 

Slide 8 – 2.2 Modify kernel.cu

A more fundamental modification is the elimination of the for loop of the serial C code. Instead of performing the scale and distance operations serially inside the loop, we calculate them in parallel using N threads. We move the loop body outside main() into a distanceKernel() function that calls the scale() and distance() functions. We then replace the for loop with a kernel launch using the execution configuration N/TPB, TPB. Note that we want one thread for each grid point, so the total number of threads should be N.
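In code, the replacement looks roughly like this (a sketch; the exact argument list of distanceKernel() is an assumption):

    // Serial version: loop over the grid points.
    // for (int i = 0; i < N; ++i) { out[i] = distance(scale(i, N), ref); }

    // CUDA version: one thread per grid point, N/TPB blocks of TPB threads each.
    distanceKernel<<<N/TPB, TPB>>>(d_out, ref, N);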

 

Slide 9 –  3.1 Create Results Array (d_out) on the GPU

We then need to create an array in GPU memory to store the result of the scale and distance operations for each thread. To do that, we declare a pointer d_out and pass a pointer to this pointer as the first argument of cudaMalloc(), the function that allocates memory on the GPU. Here it is easy to forget the ampersand in the first argument. Also note that the size is given in bytes.
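A minimal sketch of the allocation (assuming d_out holds N floats):

    float *d_out = 0;
    cudaMalloc(&d_out, N * sizeof(float));  // size in bytes; note the & on d_out
    // ... use d_out in the kernel ...
    cudaFree(d_out);                        // release the device memory when done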

 

Slide 10 – 4.1 Create Kernel Definition

We now have three functions, and we need to use function type qualifiers to indicate where each function should be executed. distanceKernel() is a kernel launched by the host to be run on the GPU, so the correct answer is __global__.
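So the kernel definition starts like this (the argument list is an assumption):

    __global__ void distanceKernel(float *d_out, float ref, int len)
    {
        // body shown in the following slides
    }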

 

Slide 11 - 4.2 Create Kernel Definition

Let’s look at the scale function. In this case, the function is called from the distanceKernel kernel, i.e. from device code, so the correct answer is __device__.

 

Slide 12 - 4.3 Create Kernel Definition

What about the distance function? Which qualifier? Well, this is like the scale function, so it is again __device__.
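A sketch of the two device functions (the bodies follow the serial versions sketched earlier and are assumptions about the actual kernel.cu):

    __device__ float scale(int i, int n)
    {
        return ((float)i) / (n - 1);          // map i to a position in [0, 1]
    }

    __device__ float distance(float x1, float x2)
    {
        return sqrtf((x2 - x1) * (x2 - x1));  // distance between two points on the line
    }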

 

Slide 13 – 5 Get my Thread ID in the Kernel

We now work inside the kernel function. At this point, we want to associate one grid point with each thread. To calculate the global thread ID, we take the block index (blockIdx.x) multiplied by the number of threads per block (blockDim.x) and add the thread index within the block (threadIdx.x).
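Inside the kernel this becomes (a sketch, using the same assumed argument list as before; the printf illustrates the printing from the kernel mentioned in the lecture):

    __global__ void distanceKernel(float *d_out, float ref, int len)
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
        const float x = scale(i, len);                        // this thread's grid point
        d_out[i] = distance(x, ref);                          // store the result on the GPU
        printf("i = %2d: distance from %f to %f is %f\n", i, ref, x, d_out[i]);
    }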

 

Slide 14 - Putting Everything Together

We are now ready to put everything together. We start from the main. We first allocate memory on the GPU with cudaMalloc(), and we launch N threads, divided into N over TPB blocks, to run the distanceKernel kernel in parallel on the GPU. Each thread executes distanceKernel on a different grid point. distanceKernel has the __global__ qualifier, while scale() and distance() have the __device__ function type qualifier. Is anything missing? Yes, the call to cudaDeviceSynchronize(); otherwise we won’t see any output on the display when printing from a kernel.
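Putting it all together, main() might look like this (a sketch under the same assumptions as above; error checking is omitted for brevity):

    int main()
    {
        const float ref = 0.5f;

        float *d_out = 0;
        cudaMalloc(&d_out, N * sizeof(float));          // results array on the GPU

        distanceKernel<<<N/TPB, TPB>>>(d_out, ref, N);  // N threads: N/TPB blocks of TPB threads
        cudaDeviceSynchronize();                        // wait for the kernel and its printf output

        cudaFree(d_out);
        return 0;
    }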

 

Slide 15 – Running on Tegner

This code can be run on Tegner simply by compiling it with nvcc, requesting a brief interactive session with the Slurm salloc command, and running it with the Slurm srun command.

 

Slide 16 - Where is my Data: Host or Device Memory?

A first point I want to mention is that kernels cannot return a value to the host. In addition, a kernel has access to device memory, but generally not to host memory. In fact, we skipped the last part of the CUDA workflow, which is to move data from GPU memory back to CPU memory. How do we do that? We will need to call cudaMemcpy().
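A minimal sketch of that copy (assuming a host-side array out of N floats):

    float out[N];  // host-side array for the results
    cudaMemcpy(out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);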

 

Slide 17 - Careful with Integer Arithmetic!

A second point I want to raise is to pay attention to the execution configuration. In our code, the kernel execution configuration is specified so that each block has TPB threads and there are N/TPB blocks. We might have a problem if N is not a multiple of 32, let’s say 65. In this case, the integer division 65/32 gives 2, so we would launch 2 blocks of 32 threads, 64 threads in total. The last entry in the array would not get computed, because there is no thread with the corresponding index. How do we fix this problem? By rounding up the number of blocks. The simple trick is to compute the number of blocks as (N+TPB-1)/TPB, which ensures that the division is rounded up.
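For example, with N = 65 and TPB = 32, (65 + 32 - 1)/32 = 96/32 = 3 blocks, i.e. 96 threads, enough to cover all 65 points. A common companion to this trick (not shown on the slide, but worth noting) is a bounds check in the kernel so that the extra threads do nothing:

    const int blocks = (N + TPB - 1) / TPB;         // round up: covers all N points
    distanceKernel<<<blocks, TPB>>>(d_out, ref, N);

    // Inside the kernel, guard against the extra threads:
    // const int i = blockIdx.x * blockDim.x + threadIdx.x;
    // if (i >= len) return;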

 

Slide 18 - How do I choose TPB or Execution Configuration?

One legitimate question you might have is how to choose TPB. In fact, finding the specific execution configuration that produces the best performance involves both art and science. Choosing some multiple of 32 is reasonable, since it matches up with the number of CUDA cores in an SM. There are limits, though: a single block cannot contain more than 1,024 threads. For large problems, reasonable values to test are 128, 256 and 512.

 

Slide 19 - GPU Occupancy – choose TPB

If you want a less empirical approach to determining TPB, you might want to use the CUDA Occupancy Calculator to compute the multiprocessor occupancy of a GPU for a given CUDA kernel. You can download the Excel file from the link posted on this slide.

 

Slide 20 – To Summarize

In summary, we created our first CUDA program! Hurrah. We did so by following a few simple steps: we allocated memory on the GPU, defined the kernels, and launched them.