CUDA Unified Memory
Slides: 3.2.1 CUDA Unified Memory.pdf
Transcription of the Video Lecture
Slide 2 – Two Key Points
In this lecture, we have two main points. The first point is that CUDA Unified Memory is a single memory space for the host and device memories. Second, by using CUDA Unified Memory, we eliminate the need to explicitly move data from the CPU to the GPU and vice versa, since the CUDA runtime automatically takes care of data migration for us.
Slide 3 – What is Unified Memory?
With CUDA 6.0, a new feature called Unified Memory was introduced to simplify memory management in CUDA. Unified Memory creates a pool of managed memory, where each allocation from this memory pool is accessible on both the CPU and GPU with the same memory address. The underlying system automatically migrates data in the unified memory space between the host and device. This data movement is transparent to the application, simplifying the application code.
Slide 4 – Allocating Unified Memory
Managed memory can be allocated statically or dynamically. We can statically declare a device variable as a managed variable by adding the __managed__ annotation to its declaration. We can also allocate managed memory dynamically using the CUDA runtime function cudaMallocManaged(). This function allocates size bytes of managed memory and returns a pointer that is valid on all devices and on the host. A program that uses managed memory can take advantage of automatic data migration and duplicate pointer elimination.
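As a small illustration of the two allocation forms described above, here is a minimal sketch (the variable names and the kernel are made up for this sketch; they are not taken from the slides):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Statically declared managed variable: accessible from host and device code
// through the same symbol.
__managed__ int counter = 0;

__global__ void incrementCounter() {
    atomicAdd(&counter, 1);   // device code updates the managed variable directly
}

int main() {
    // Dynamically allocated managed memory: one pointer, valid on the host and on all devices.
    const size_t n = 1 << 20;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // the host writes through the same pointer

    incrementCounter<<<1, 32>>>();
    cudaDeviceSynchronize();                          // wait for the GPU before reading on the host

    printf("counter = %d, data[0] = %f\n", counter, data[0]);
    cudaFree(data);
    return 0;
}
```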
Slide 5 – Code Example of CUDA Unified Memory
This is a code example taken from the NVIDIA blog on CUDA Unified Memory. There are a few things that I would like to highlight. First, x and y are accessible from both the CPU and the GPU, and we no longer need the two separate GPU copies of x and y that we used to call d_x and d_y. Second, cudaMemcpy() disappeared, as the CUDA driver takes care of the data movement automatically. One last important thing is that the kernel launch is asynchronous with respect to the host, so we need to explicitly synchronize on the host side before directly accessing the output of the kernel.
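The exact listing from the slide is not reproduced in this transcription; the following is a sketch in the spirit of the NVIDIA blog example (the kernel name add and the launch configuration are assumptions), showing the three points above: a single pair of managed pointers, no cudaMemcpy(), and an explicit synchronization before the host reads the result.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Element-wise addition of two arrays using a grid-stride loop.
__global__ void add(int n, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main() {
    int n = 1 << 20;
    float *x, *y;                                   // single pointers, no separate d_x / d_y copies
    cudaMallocManaged(&x, n * sizeof(float));       // managed: same address valid on CPU and GPU
    cudaMallocManaged(&y, n * sizeof(float));

    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }   // initialize on the host

    add<<<(n + 255) / 256, 256>>>(n, x, y);         // no cudaMemcpy() before or after the launch
    cudaDeviceSynchronize();                        // launch is asynchronous: sync before reading y

    float maxError = 0.0f;                          // all values should be 3.0f
    for (int i = 0; i < n; ++i) maxError = fmaxf(maxError, fabsf(y[i] - 3.0f));
    printf("Max error: %f\n", maxError);

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```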
Slide 7 – How Does it Work?
How CUDA Unified Memory works and performs strongly depends on whether the GPU we are using is a pre-Pascal or a post-Pascal GPU. In particular, post-Pascal GPUs support on-demand paging, which allows achieving much higher performance in automatic data movement.
On pre-Pascal GPUs, cudaMallocManaged() allocates managed memory on the GPU. Internally, the driver also sets up page table entries for all pages covered by the allocation. Upon launching a kernel, the CUDA runtime must migrate back to GPU memory all pages that were previously migrated to host memory.
By contrast, on post-Pascal GPUs, managed memory may not be physically allocated when cudaMallocManaged() returns; it may only be populated on access. Pages and page table entries may not be created until they are accessed by the GPU or the CPU.
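The slides do not include code for this part; the sketch below is only an annotated illustration of the behaviour just described. It queries the concurrent-managed-access device attribute, which is non-zero on GPUs that support hardware page faulting of managed memory (Pascal and later, on supported platforms), and the comments mark where migration happens in the pre- and post-Pascal cases. The kernel and variable names are made up for this sketch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *a, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= s;
}

int main() {
    int device = 0, onDemand = 0;
    cudaGetDevice(&device);
    // Non-zero when the device can page managed memory on demand (post-Pascal behaviour).
    cudaDeviceGetAttribute(&onDemand, cudaDevAttrConcurrentManagedAccess, device);
    printf("On-demand paging of managed memory: %s\n", onDemand ? "yes" : "no");

    const int n = 1 << 20;
    float *a = nullptr;
    cudaMallocManaged(&a, n * sizeof(float));   // post-Pascal: pages may not be physically allocated yet

    for (int i = 0; i < n; ++i) a[i] = 1.0f;    // post-Pascal: first touch creates the pages in host memory

    // Pre-Pascal: at launch, the runtime migrates back to the GPU every page previously moved to the host.
    // Post-Pascal: pages are migrated on demand as the kernel faults on them.
    scale<<<(n + 255) / 256, 256>>>(a, n, 2.0f);
    cudaDeviceSynchronize();

    printf("a[0] = %f\n", a[0]);                // host access migrates pages back as needed
    cudaFree(a);
    return 0;
}
```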
Slide 8 – To Summarize
In this lecture, we looked together at CUDA Unified Memory. There are two main points we need to remember. First, what is CUDA Unified Memory? It is a single memory space for the host and device memories. The second point is that, by using CUDA Unified Memory, we eliminate the need to explicitly move data from the CPU to the GPU and vice versa, since the CUDA runtime automatically does the job for us.