Optimizing Host-Device Data Communication I - Pinned Host Memory

Slides: Download 3.1.1 Pinned Host Memory.pdf

Transcription of the video lecture

Slide 2 – Four Key Points

In this lecture, I would like to emphasize four main points. First, communication between the host and the device is the slowest link of data movement involved in GPU computing, so it is important that we optimize these transfers. Second, we can use pinned host memory to avoid intermediate copies. Third, to use pinned host memory we call cudaHostAlloc instead of the usual malloc or new. Finally, one further technique to optimize data transfer when using pinned memory is to batch many small transfers into one large transfer.

 

Slide 3 – CPU-GPU Data Communication with DMA

Modern computer systems use a specialized hardware mechanism called direct memory access - or DMA for short - to transfer data between an I/O device and the system DRAM. When a program requests an I/O operation, say reading from a disk drive, the OS sets up a DMA operation defined by the starting address of the data in the I/O device buffer memory, the starting address in system DRAM, the number of bytes to be copied, and the direction of the copy.

Using a specialized hardware mechanism to copy data between I/O devices and system DRAM has two major advantages.

First, the CPU is not burdened with the chore of copying data. So, while the DMA hardware is copying data, the CPU can execute programs that do not depend on the I/O data.  

The second advantage is that the hardware is designed specifically for copying: it is simple and efficient, and there is no overhead of fetching and decoding instructions while performing the copy. As a result, the copy can be done at a higher speed than most processors could achieve. As we will see in this lecture, DMA is used in data copy operations between a CPU and a GPU. It requires pinned memory in DRAM and has subtle implications for how applications should allocate memory.

 

Slide 4 - Virtual Memory Management

In order to understand the concept of pinned memory, I need to present a little more background on memory management in operating systems. The operating system manages a virtual memory space for applications. Each application has access to a large, consecutive address space. In reality, the system has a limited amount of physical memory that needs to be shared among all running applications. This sharing is performed by partitioning the virtual memory space into pages and mapping only the actively used pages into physical memory. When there is high demand for memory, the operating system needs to “page out” some of the pages from physical memory to mass storage such as disks. Therefore, an application may have its data paged out at any time during its execution.

 

 

Slide 5 - Data Transfer and Virtual Memory

When cudaMemcpy() is called to copy data between the host and device memories, its implementation uses a DMA device to complete the task. On the host memory side, the DMA hardware operates on physical addresses; that is, the operating system needs to give a translated physical address to the DMA device. However, there is a chance that the data may be paged out before the DMA operation is complete, and the physical memory locations may be reassigned to pages of another virtual address. In this case, the DMA operation can be corrupted, since its data can be overwritten by the paging activity.

 

Slide 6 - Pinned Memory and DMA Data Transfer

Pinned memory consists of virtual memory pages that are specially marked so that they cannot be paged out. They are allocated with special system API function calls. The important point for us is that CPU memory that serves as the source or destination of a DMA transfer must be allocated as pinned memory.

 

Slide 7 - CUDA Data Transfer uses Pinned Memory

If the source or destination of a cudaMemcpy() in host memory is not allocated in pinned memory, it first needs to be copied to a pinned memory buffer, which causes extra overhead. When we allocate and use pinned memory, we avoid this extra step and its overhead. Therefore, cudaMemcpy() is faster if the host memory source or destination is allocated in pinned memory, since no extra copy is needed.

 

Slide 8 - Allocate/Free Pinned Memory

Pinned memory is allocated with the special cudaHostAlloc() function. This function ensures that the allocated memory is pinned, or page-locked, and exempt from paging activity. The cudaHostAlloc() function takes three parameters. The first two are the same as those of cudaMalloc(). The third specifies options for more advanced users; for most basic use cases, we can simply use the default value cudaHostAllocDefault. As usual, we also need to remember to free the pinned memory at the end of our CUDA application, with the cudaFreeHost() function.
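As a minimal sketch of this call pattern (the buffer name and size are mine, not taken from the slide), the allocation and release look like this:

```cuda
#include <cuda_runtime.h>

int main() {
    float *h_buf = nullptr;
    size_t bytes = 1024 * sizeof(float);   // size chosen only for illustration

    // First two parameters mirror cudaMalloc(): pointer-to-pointer and size in bytes.
    // The third parameter holds option flags; cudaHostAllocDefault is fine for basic use.
    cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault);

    // ... use h_buf as the source or destination of cudaMemcpy() ...

    // Pinned memory must be released with cudaFreeHost(), not free().
    cudaFreeHost(h_buf);
    return 0;
}
```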

 

Slide 9 - Allocate/Free Pinned Memory

This is an example of allocating three arrays on the host side in pinned memory. We use the cudaHostAlloc() function instead of the usual malloc or new. The three array names have an h_ prefix instead of the usual d_ because these arrays are allocated in host memory.
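The transcript does not reproduce the slide's code; a sketch along the same lines, assuming three float arrays named h_A, h_B, and h_C of N elements each, could look like this:

```cuda
#include <cuda_runtime.h>

int main() {
    const int N = 1 << 20;              // assumed array length
    float *h_A, *h_B, *h_C;             // h_ prefix: these live in host memory, not device memory

    // Allocate all three host arrays as pinned (page-locked) memory.
    cudaHostAlloc((void**)&h_A, N * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_B, N * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_C, N * sizeof(float), cudaHostAllocDefault);

    // ... fill h_A and h_B, copy to the device, launch kernels, copy results back into h_C ...

    cudaFreeHost(h_A);
    cudaFreeHost(h_B);
    cudaFreeHost(h_C);
    return 0;
}
```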

 

Slide 10 - Using Pinned Memory in CUDA

The host memory allocation is done with the cudaHostAlloc() function rather than the standard malloc() function. The difference is that cudaHostAlloc() allocates a pinned memory buffer, sometimes also referred to as a page-locked memory buffer. One important point is that host pinned memory is a limited resource, so a pinned memory allocation might fail; we should therefore always check for errors when allocating pinned memory.
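Since the allocation can fail, a sketch of the error check might look like the following (the error-handling style and buffer size are assumptions, not taken from the slide):

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main() {
    float *h_buf = nullptr;
    size_t bytes = 256 * 1024 * 1024;   // a large request, chosen to make failure plausible

    // cudaHostAlloc() returns an error code; pinned memory is limited, so check it.
    cudaError_t err = cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // ... use the pinned buffer ...

    cudaFreeHost(h_buf);
    return 0;
}
```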

 

Slide 11 - Performance Advantages

So how much performance improvement can we get from pinned host memory? The answer is that it depends on how much data we are moving. Pinned memory is much more expensive to allocate and deallocate, but it provides higher transfer throughput for large memory transfers. Batching many smaller transfers into one larger transfer further improves the performance of pinned memory.
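One way to see the difference, not shown on the slide, is to time the same host-to-device copy from a pageable buffer and from a pinned buffer using CUDA events; a hedged sketch, with a size chosen only for illustration:

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Time one host-to-device copy of 'bytes' bytes from 'src' using CUDA events.
static float timeCopy(void *dst, const void *src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 64 * 1024 * 1024;   // 64 MB, large enough to show the gap
    float *d_buf, *h_pageable, *h_pinned;
    cudaMalloc(&d_buf, bytes);
    h_pageable = (float*)malloc(bytes);                               // pageable host memory
    cudaHostAlloc((void**)&h_pinned, bytes, cudaHostAllocDefault);    // pinned host memory

    printf("pageable: %.3f ms\n", timeCopy(d_buf, h_pageable, bytes));
    printf("pinned:   %.3f ms\n", timeCopy(d_buf, h_pinned, bytes));

    free(h_pageable);
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}
```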

 

Slide 12 - Batching Small Size Transfers

Due to the overhead associated with each transfer and with the allocation of pinned memory, it is preferable to batch many small transfers together into a single transfer. This is easy to do by using a temporary array and packing it with the data to be transferred. In the example on this slide, we pack three arrays a, b, and c into one temporary array pack_abc and pin this temporary array using cudaHostAlloc.
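The slide's exact code is not reproduced in the transcript, but a sketch of the idea, assuming three small float arrays a, b, and c of n elements each, might look like this:

```cuda
#include <cuda_runtime.h>
#include <cstring>

int main() {
    const int n = 1024;                     // assumed per-array length
    float a[n], b[n], c[n];                 // three small host arrays to be transferred

    // Allocate one pinned staging buffer large enough to hold all three arrays.
    float *pack_abc;
    cudaHostAlloc((void**)&pack_abc, 3 * n * sizeof(float), cudaHostAllocDefault);

    // Pack the three arrays back to back into the staging buffer.
    memcpy(pack_abc,         a, n * sizeof(float));
    memcpy(pack_abc + n,     b, n * sizeof(float));
    memcpy(pack_abc + 2 * n, c, n * sizeof(float));

    // One large transfer instead of three small ones.
    float *d_pack;
    cudaMalloc(&d_pack, 3 * n * sizeof(float));
    cudaMemcpy(d_pack, pack_abc, 3 * n * sizeof(float), cudaMemcpyHostToDevice);

    cudaFree(d_pack);
    cudaFreeHost(pack_abc);
    return 0;
}
```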

 

Slide 13 – To Summarize

We have reached the end of the first lecture on how to optimize host-device data transfer. In this lecture I have stressed four points. The first point is that it is important that we optimize transfers. Second, we can use pinned host memory to avoid intermediate copies when using DMA. Third, we call cudaHostAlloc instead of the usual malloc or new to pin host memory. Finally, one further optimization technique is to batch many small transfers into one large data transfer.