CUDA Memories, Shared and Atomics

Slides: 2.4 CUDA Memories, Shared and Atomics-1.pdf

Code: kernel-1.cu, kernel.h, main.cpp

Transcription of the video lecture

Slide 2 – Three Key-Points

In this lecture, I want to focus on three main points.

  • First, the GPU has three kinds of memory: register, global, and shared.
  • Second, shared memory is local to the SM, faster than global memory, and can be used for performance optimization.
  • Third, when using global memory, we might need to use atomic operations to avoid race conditions.

 

Slide 3 –  Two types of GPU Memory so far ...

So far, we have used two types of memory without really talking about them. The first type of memory is Register memory. Register memory is where the local variables of each thread are stored. Register memory is as close to the SM as possible, so it is the fastest, but its scope is local to a single thread. That means that threads cannot access the register memory of other threads.

The second kind of memory we saw is Global memory. Global memory is the GPU memory that has been allocated with cudaMalloc(). It provides the bulk of the device memory capacity, but it is far from the GPU, so it is slower than register memory. However, this memory is accessible by all threads.

 

Slide 4 - Question: is x a register or global variable?

I have a question for you. Here is the code from our first CUDA program. If we look inside distanceKernel(), is x a register or a global variable? And the answer is … register.
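As a reminder, here is a minimal sketch of what such a kernel might look like (the actual code is in kernel-1.cu; the kernel body shown here is only illustrative). The local variable x lives in a register, private to its thread, while d_out, which was allocated with cudaMalloc(), lives in global memory.

    __global__ void distanceKernel(float *d_out, float ref, int len)
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= len) return;

        // x is a local variable: it is stored in a register, private to this thread.
        const float x = (float)i / (len - 1);

        // d_out was allocated with cudaMalloc(): it lives in global memory
        // and is visible to every thread in the grid.
        d_out[i] = sqrtf((x - ref) * (x - ref));
    }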

 

Slide 5 - Global Memory is very Convenient. However, …

Global memory is the most convenient memory to use because it is accessible by all the threads. But there are three problems. Two of them are related to the performance of data access to GPU global memory. First, we might have a problem with memory traffic: when a large grid is launched and an array resides in global memory, there may be millions of threads trying to read and write values to and from the same array. Second, access to global memory is relatively slow because it is far from the GPU. Finally, we might also have a problem with correctness: concurrent updates of global memory might result in data corruption (a race condition). In this lecture, we are going to see how to solve these problems.

 

Slide 6 - 1. A third Kind of GPU Memory: Shared Memory

GPUs also have shared memory, which aims to bridge the gap in speed between global memory and registers. Its use leads to a performance increase in most cases. Shared memory resides adjacent to the SM and provides up to 48 KB of storage that can be accessed efficiently by all threads in a block.

 

 Slide 7 - How to Declare Shared Variables (Fixed Size)

Shared variables are declared in the kernel function. We declare shared arrays using the __shared__ qualifier. If you create your shared array with a fixed size, the array can be declared simply by prepending __shared__ to the type of the array, with the size of the array inside the square brackets. This is rather simple, but a warning: you might be tempted to allocate a dynamic shared array by simply putting a size that is only known at runtime inside the square brackets. This is not possible; for a dynamic allocation you need to follow the instructions in the next slide.
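A minimal sketch of a fixed-size declaration (the kernel name and the constant TPB are illustrative, not taken from the slide):

    #define TPB 64  // threads per block, known at compile time

    __global__ void staticSharedKernel(const float *d_in, float *d_out)
    {
        // Fixed-size shared array: the size must be a compile-time constant.
        __shared__ float s_in[TPB];

        // Assumes the grid exactly covers the arrays (no bounds check for brevity).
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        s_in[threadIdx.x] = d_in[i];   // each thread copies one element into shared memory
        d_out[i] = 2.0f * s_in[threadIdx.x];
    }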

 

Slide 8 – How to Declare Shared Variables (Dynamic All.)

If we allocate the array dynamically, things are a little bit more complicated. The declaration requires the keyword extern before __shared__, and no element count between the square brackets. So there is no information about the size of our shared array s_in. Where do we specify the size? We will see the answer to this question in the next slide.
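A minimal sketch of such a dynamic declaration (the kernel name is illustrative):

    __global__ void dynamicSharedKernel(const float *d_in, float *d_out)
    {
        // Dynamically sized shared array: extern, with empty square brackets.
        // The actual size is supplied at launch time (see the next slide).
        extern __shared__ float s_in[];

        // Assumes the grid exactly covers the arrays (no bounds check for brevity).
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        s_in[threadIdx.x] = d_in[i];
        d_out[i] = 2.0f * s_in[threadIdx.x];
    }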

 

Slide 9 –  Launching a Kernel Using a Shared Variable

If we are using a dynamically allocated shared array in the kernel, the kernel launch requires a third argument within the triple angle brackets. This third argument of the execution configuration specifies the size of the shared memory allocation in bytes. The example in this slide shows that we first calculate the size in bytes of the shared array, and then pass it to the kernel as the third argument of the execution configuration.
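A sketch of such a launch, assuming the dynamicSharedKernel from the previous slide and illustrative values for N and TPB (d_in and d_out are assumed to have already been allocated with cudaMalloc()):

    void launchExample(float *d_in, float *d_out)
    {
        const int N   = 1 << 20;   // total number of elements (illustrative)
        const int TPB = 64;        // threads per block (illustrative)

        // Size in bytes of the dynamically allocated shared array s_in.
        const size_t smemSize = TPB * sizeof(float);

        // The third execution-configuration argument is the shared memory size in bytes.
        dynamicSharedKernel<<<N / TPB, TPB, smemSize>>>(d_in, d_out);
    }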

 

Slide 10 – Reduction Operation

We now look at a very common type of problem where we can use shared variables: the reduction operation. In a reduction, the elements of an input array are combined to obtain a single output. When do you need reductions? Pretty much all the time. Typical examples are the calculation of dot products, image similarity measures, integral properties and binning operations.

 

Slide 11 - Parallel Reduction: Parallel dot Product

We now want to calculate the dot product, which is a reduction operation, with CUDA. We can start from serial CPU code that calculates the dot product of two arrays, a and b, of size N. Here is the simple serial implementation.
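The serial code itself is on the slide; a minimal sketch of what it looks like (function and variable names are illustrative):

    float dot(const float *a, const float *b, int N)
    {
        float res = 0.0f;
        for (int i = 0; i < N; ++i) {
            res += a[i] * b[i];   // accumulate the products one element at a time
        }
        return res;
    }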

 

Slide 12 - What would be one Easy Implementation for GPU?

A basic strategy to use CUDA could be the following: first, we move a and b to GPU memory with cudaMalloc() and cudaMemcpy(). Second, we create a global array on the GPU to hold the results of the element-by-element multiplication. Third, we copy the global array back to an array in CPU memory. Finally, we sum up all the elements of this array serially on the CPU.
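A minimal sketch of this naive approach (the names multiplyKernel and dotNaive are illustrative, and n is assumed to be a multiple of the block size):

    #include <stdlib.h>

    #define TPB 64  // threads per block (illustrative)

    __global__ void multiplyKernel(const float *d_a, const float *d_b, float *d_prod)
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        d_prod[i] = d_a[i] * d_b[i];   // one product per thread, stored in global memory
    }

    float dotNaive(const float *a, const float *b, int n)
    {
        float *d_a, *d_b, *d_prod;
        cudaMalloc(&d_a, n * sizeof(float));
        cudaMalloc(&d_b, n * sizeof(float));
        cudaMalloc(&d_prod, n * sizeof(float));
        cudaMemcpy(d_a, a, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, n * sizeof(float), cudaMemcpyHostToDevice);

        multiplyKernel<<<n / TPB, TPB>>>(d_a, d_b, d_prod);

        // Copy the products back and finish the reduction serially on the CPU.
        float *prod = (float *)malloc(n * sizeof(float));
        cudaMemcpy(prod, d_prod, n * sizeof(float), cudaMemcpyDeviceToHost);
        float res = 0.0f;
        for (int i = 0; i < n; ++i) res += prod[i];

        free(prod);
        cudaFree(d_a); cudaFree(d_b); cudaFree(d_prod);
        return res;
    }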

 

 Slide 13 – Possible solution: tiles and shared

However, the implementation proposed above might suffer from performance problems because of its use of global memory. The global memory traffic can be reduced by using a so-called tile approach: we break the large input vectors up into N / (number of blocks) pieces, which is our number of threads per block. Each tile then has a threads_per_block-sized shared array to store the results of the multiplication.

 

Slide 14 - The Tile Approach I

To follow the tile approach, we create a shared memory array at the beginning of the kernel to store the products of corresponding entries in the tiles of the input arrays, as in the code on this slide.

 

Slide 15 – Synchronization of Shared Arrays

One thing we learnt in the CUDA essentials is that execution on the GPU is asynchronous: we cannot assume that all the input data has been loaded into the shared memory array before some thread executes a statement that uses the shared array. To ensure that all the data has been properly stored, we employ the CUDA function __syncthreads() just after the assignment of values to the shared array. __syncthreads() forces all the threads in the block to complete the previous statements before any thread in the block proceeds.
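Putting the last two slides together, a sketch of the start of such a kernel (the names are illustrative; the complete kernel appears on Slide 16):

    #define TPB 64  // threads per block (illustrative)

    __global__ void dotKernelPartial(const float *d_a, const float *d_b)
    {
        // Assumes the grid exactly covers the input (no bounds check for brevity).
        const int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Shared array holding one product per thread of this block (the tile).
        __shared__ float s_prod[TPB];
        s_prod[threadIdx.x] = d_a[i] * d_b[i];

        // Wait until every thread in the block has stored its product
        // before any thread reads from s_prod.
        __syncthreads();

        // ... the reduction of s_prod comes next (Slide 16) ...
    }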

 

Slide 16 – Parallel dot

We now put everything together to calculate the dot product in parallel on the GPU. For the sake of clarity, I have omitted the main function and include only the dotLauncher function, which receives a and b as input. The dotLauncher function does the data-movement part with cudaMalloc and cudaMemcpy and launches the kernel that uses the shared array. The question is: why don't we include a third argument with the size of the shared array in the execution configuration? The reason is that the shared array is allocated statically within the kernel.

The kernel dotKernel is quite interesting: we allocate a shared array s_prod with TPB (threads per block) elements on each block. Each element of s_prod stores the product of the corresponding elements of d_a and d_b. Before proceeding to the next operation, we need to ensure that the update is complete by using __syncthreads(). Then we ask only one thread per block (the thread with threadIdx equal to 0) to sum up the shared array and add the block sum to the global variable d_res. What could be the problem? The problem is that thread 0 of different blocks might update d_res at the same time.
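As a reconstruction of the code described above (the actual version is in kernel-1.cu; this sketch assumes n is a multiple of TPB and still contains the race condition discussed next):

    #define TPB 64  // threads per block

    __global__ void dotKernel(float *d_res, const float *d_a, const float *d_b)
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Statically allocated shared array: one product per thread in the block.
        __shared__ float s_prod[TPB];
        s_prod[threadIdx.x] = d_a[i] * d_b[i];
        __syncthreads();   // make sure every product of this block has been stored

        // Thread 0 of each block sums the block's products ...
        if (threadIdx.x == 0) {
            float blockSum = 0.0f;
            for (int j = 0; j < TPB; ++j) {
                blockSum += s_prod[j];
            }
            // ... and adds the block sum to the global result.
            // WARNING: this plain read-modify-write is a race condition, because
            // thread 0 of different blocks may update d_res at the same time.
            *d_res += blockSum;
        }
    }

    void dotLauncher(float *res, const float *a, const float *b, int n)
    {
        float *d_res, *d_a, *d_b;
        cudaMalloc(&d_res, sizeof(float));
        cudaMalloc(&d_a, n * sizeof(float));
        cudaMalloc(&d_b, n * sizeof(float));

        cudaMemset(d_res, 0, sizeof(float));
        cudaMemcpy(d_a, a, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b, n * sizeof(float), cudaMemcpyHostToDevice);

        // No third execution-configuration argument: s_prod is allocated statically.
        dotKernel<<<n / TPB, TPB>>>(d_res, d_a, d_b);

        cudaMemcpy(res, d_res, sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_res);
        cudaFree(d_a);
        cudaFree(d_b);
    }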

 

Slide 17 – Code Correctness

In fact, thread 0 in each block reads the value of d_res from global memory, adds its blockSum to it, and stores the result back into the memory location where d_res is stored. In this situation, the final outcome of these operations depends on the sequence in which they are performed by each thread.

 

Slide 18 -  Race Condition

This situation, in which the outcome depends on an order of operations that we cannot control, is called a race condition. A race condition results in undefined behavior and, most of the time, in data corruption.

 

Slide 19 – Cuda Atomic Functions

The solution to this problem is to use CUDA atomic functions. Atom in ancient Greek means uncuttable or indivisible. An atomic function performs a read-modify-write sequence of operations as a single indivisible unit.

 

Slide 20 – Using atomicAdd() to solve the race condition

atomicAdd() is a CUDA function that allows us to perform an atomic addition to a global variable. To avoid the race condition, in our code we replace the plain addition to the global variable with a CUDA atomic addition.
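Concretely, in the thread 0 branch of the dotKernel sketched above, the only change is to replace the plain addition with a call to atomicAdd():

    if (threadIdx.x == 0) {
        float blockSum = 0.0f;
        for (int j = 0; j < TPB; ++j) {
            blockSum += s_prod[j];
        }
        // Instead of the racy "*d_res += blockSum;", the read-modify-write
        // on d_res is now performed as one indivisible operation.
        atomicAdd(d_res, blockSum);
    }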

 

Slide 21 - Other CUDA Atomic Operations

Together with atomicAdd(), CUDA offers 10 other atomic functions that can be used when performing reduction operations.
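As an illustration (a minimal sketch, not taken from the slides), here is a kernel that uses a few of these other atomic functions to compute the maximum, the minimum, and a count over an integer array:

    __global__ void statsKernel(const int *d_in, int *d_max, int *d_min, int *d_count, int n)
    {
        const int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        atomicMax(d_max, d_in[i]);   // running maximum across all threads
        atomicMin(d_min, d_in[i]);   // running minimum across all threads
        atomicAdd(d_count, 1);       // count the processed elements
    }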

 

Slide 22 – To Summarize

In summary, this lecture focused on three main points. First, the GPU has three kinds of memory: register, global, and shared, the last of which is accessible only by the threads running on the same SM. Second, shared memory is local to the SM, faster than global memory, and can be used for performance optimization. Third, when using global memory, we might need to use atomic operations to avoid race conditions.

 

Slide 23 - CUDA Advanced Features

We have now reached the end of our introduction to CUDA. We have not had time to cover other, more advanced CUDA features, such as CUDA Unified Memory, CUDA Dynamic Parallelism and CUDA Streams.