CUDA Essentials I - Workflow and data movement

 

Slides: 2.2.1 CUDA Essentials - CUDA Workflow and Data Movement-3.pdf

Transcription of the video lecture

Slide 2 – Four Key Points

In this lecture, I introduce CUDA terminology and workflow, and describe how to move data between GPU and CPU memories. I would like to stress four main points.

  • First, CUDA has its own particular terms to express traditional concepts.
  • Second, to develop an application that runs on a GPU, we follow the CUDA workflow, which consists of three phases: move data to the GPU, compute on the GPU, and move the results back to the CPU.
  • The third point is that we use cudaMalloc() to allocate memory on the GPU.
  • Finally, moving data between the GPU and the CPU, in either direction, is achieved by copying from one memory location to the other with cudaMemcpy().

 

Slide 3 –  CUDA Jargon

When starting with CUDA, it is important to realize that NVIDIA introduced specific terms to express traditional concepts that have different names in other contexts. Big companies, like NVIDIA and Google, enjoy introducing new terms for old stuff. I am not really sure I share the enthusiasm these big companies have for renaming things. However, it is important to know these terms, as they are now widely used in the literature. First, we have already met the term host, which means the CPU. Second, the GPU is called the device. Third, you might have heard the term kernel, which is a function executed on the GPU by each thread. Fourth, we use the term launch to express the fact that the CPU instructs the GPU to execute a kernel. Finally, an execution configuration is the definition of how many threads to run on the GPU, and how to group them.
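
To make these terms concrete, here is a minimal sketch of my own (the kernel name and the configuration below are only illustrative, not taken from the slides): the host launches a kernel on the device with an execution configuration of, say, one block of 32 threads.

    #include <cuda_runtime.h>

    __global__ void my_kernel(void)   // a kernel: a function that runs on the device (GPU)
    {
        // each thread launched executes this function body
    }

    int main(void)
    {
        // the host (CPU) launches the kernel on the device (GPU);
        // <<<1, 32>>> is the execution configuration: 1 block of 32 threads
        my_kernel<<<1, 32>>>();
        cudaDeviceSynchronize();      // wait for the device to finish
        return 0;
    }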

 

Slide 4 - CUDA Workflow

When developing a new application, or when porting an application from CPU to GPU systems, it is convenient to follow the CUDA workflow. The CUDA workflow is the series of steps we follow during application development. In the first step, the CPU allocates memory on the GPU and moves the input data from the CPU memory to the GPU memory. In the second step, the CPU instructs the GPU to execute code in parallel. In the third and last step, we move the results of the computation back from the GPU memory to the CPU memory.
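
As a rough sketch of these three steps in code (the kernel name, variable names, and configuration below are placeholders of my own, not taken from the slides), the skeleton of a typical CUDA program looks like this:

    // Step 1: allocate GPU memory and copy the input data from CPU to GPU
    cudaMalloc(&d_a, size_in_bytes);
    cudaMemcpy(d_a, a, size_in_bytes, cudaMemcpyHostToDevice);

    // Step 2: the CPU launches a kernel, i.e. instructs the GPU to run code in parallel
    my_kernel<<<blocks, threads>>>(d_a);

    // Step 3: copy the results back from the GPU memory to the CPU memory
    cudaMemcpy(a, d_a, size_in_bytes, cudaMemcpyDeviceToHost);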

 

Slide 5 - Allocate Memory on GPU

As we saw in the previous slide, the first step of the CUDA workflow is to allocate variables in the GPU memory. This is achieved with the CUDA function cudaMalloc(), which takes two arguments: a pointer to a pointer, and the size in bytes of the linear memory to allocate in the GPU memory. I want to emphasize two points about cudaMalloc(). First, we use a pointer to a pointer to allocate memory on the GPU. That is different from the C malloc() function, which returns a pointer to the allocated memory. I am not going into detail about why cudaMalloc() uses this pointer to a pointer, but I do wonder whether the NVIDIA computer scientists could have come up with a more elegant design. The second point I want to make is that in CUDA, when we provide the size of an array, we provide the size in bytes. If C is not your mother tongue, you might find this an unnecessary complication as well. So how do you do the memory allocation in practice? Let's take the example of allocating an array called d_a of type double with 20 elements. We first declare a pointer d_a, and then we call cudaMalloc() with the address of the pointer (remember the pointer-to-pointer thing) and the size in bytes of the 20-element array (that would be 20 times 8 bytes, 8 bytes being the typical size of a double). When I use cudaMalloc(), I often forget the ampersand (&) before the name of the pointer, so be careful with that!
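
In code, the allocation described above might look like the following sketch (the error check is my own addition and is not mentioned in the lecture):

    #include <cuda_runtime.h>
    #include <cstdio>

    int main(void)
    {
        double *d_a = NULL;                      // will hold a device (GPU) address
        size_t bytes = 20 * sizeof(double);      // size in bytes: 20 elements of 8 bytes each

        // note the ampersand: cudaMalloc() needs the address of the pointer (pointer to pointer)
        cudaError_t err = cudaMalloc(&d_a, bytes);
        if (err != cudaSuccess)
            printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));

        // ... use d_a on the GPU, then release it with cudaFree(), as shown on the next slide ...
        return 0;
    }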

 

Slide 6 - Deallocate Memory on GPU

As is typical in C and C++, CUDA does not have a garbage collector that takes care of memory management, so we need to remember to deallocate memory once we no longer need it. We do this with cudaFree(), which frees device memory that is no longer in use. This really works like the free() function in C, so I have no word of caution about this one.
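
Continuing the allocation example from Slide 5, deallocation is a single call:

    cudaFree(d_a);   // release the device memory previously allocated with cudaMalloc()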

 

Slide 7 - Move Data from CPU to GPU Memory

When we talk about moving data from the CPU to the GPU, we are really talking about copying from the CPU to the GPU. To copy data from the CPU memory to the GPU memory we use the cudaMemcpy() function. The first argument of cudaMemcpy() is a pointer to the destination memory, the second argument is a pointer to the source memory, then we have the size, again in bytes, and finally the value cudaMemcpyHostToDevice. In the example on this slide, we first allocate a on the CPU, we fill it, and finally we copy it to the GPU memory with cudaMemcpy(). Here, remember to allocate d_a first with cudaMalloc()!
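
A sketch of the copy described above, reusing the array names a and d_a from the lecture (the exact code on the slide may differ slightly):

    double a[20];                            // host (CPU) array
    double *d_a = NULL;                      // device (GPU) pointer
    size_t bytes = 20 * sizeof(double);

    for (int i = 0; i < 20; i++)             // fill the host array
        a[i] = (double)i;

    cudaMalloc(&d_a, bytes);                 // remember: allocate d_a first!

    // arguments: destination, source, size in bytes, direction
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);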

 

Slide 8 - How do we move back data from GPU to CPU mem.?

In the last part of the CUDA workflow, we move the results of the computation back from the GPU memory to the CPU memory. I think you have already realized that to do this we can use cudaMemcpy() again, and just change the value of the last argument to cudaMemcpyDeviceToHost.
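
Continuing the same sketch, copying the results back only swaps the destination and source pointers and changes the direction flag:

    // destination is now the host array a, source is the device array d_a
    cudaMemcpy(a, d_a, bytes, cudaMemcpyDeviceToHost);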

 

Slide 9 – To Summarize

We have reached the end of the first CUDA lecture. I first introduced some terms of the CUDA jargon that are useful when reading the GPU literature. I then described the CUDA workflow, the series of steps we follow when designing an application for a GPU. I then focused on the first and third steps of the CUDA workflow, which involve data movement between the GPU and CPU memories, in both directions. We allocate memory on the device with the cudaMalloc() function, and we move data with the cudaMemcpy() function.