Optimizing Host-Device Data Communication III – Code Examples
Slides: 3.1.3 Code Examples.pdf
The code discussed in this video lecture is from the NVIDIA developer blog repository, available at https://github.com/NVIDIA-developer-blog/code-samples/blob/master/series/cuda-cpp/overlap-data-transfers/async.cu
Transcription of the Video Lecture
Slide 2 – One Key Point
In this lecture, we are going to look at a practical example of how to use CUDA streams and asynchronous memory copies issued to streams.
Slide 3 – Problem
This is the problem statement of this lecture: we want to overlap communication and computation in the following code. As we discussed in the previous lecture, this code performs the memory copies and the GPU computation sequentially, with no overlap between communication and computation.
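As a reference point, here is a minimal sketch of the kind of sequential version being discussed (the names a, d_a, N, blockSize and kernel follow the linked async.cu example; error checking is omitted):

cudaMemcpy(d_a, a, N * sizeof(float), cudaMemcpyHostToDevice);   // blocking host-to-device copy
kernel<<<N / blockSize, blockSize>>>(d_a, 0);                    // kernel processes the whole array
cudaMemcpy(a, d_a, N * sizeof(float), cudaMemcpyDeviceToHost);   // blocking device-to-host copy

Each call here waits for the previous one to finish, so no transfer can overlap with the kernel execution.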
Slide 4 - Solution – CUDA Streams
We address this problem by using CUDA streams and asynchronous memory copies. The basic strategy is the following. We first break up the array of N elements into chunks of streamSize elements. The number of non-default streams used is therefore nStreams = N/streamSize, so each stream takes care of only a smaller part of the array. A sketch of this setup is shown below. There are two main approaches to batching operations with CUDA streams, and we are going to look at them in the next two slides.
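A rough sketch of this setup follows; the concrete sizes are illustrative assumptions, not values taken from the slides:

const int blockSize  = 256;                       // assumed thread-block size
const int N          = 4 * 1024 * 1024;           // assumed total number of elements
const int nStreams   = 4;                         // number of non-default streams
const int streamSize = N / nStreams;              // elements handled by each stream
const int streamBytes = streamSize * sizeof(float);

cudaStream_t stream[nStreams];
for (int i = 0; i < nStreams; ++i)
  cudaStreamCreate(&stream[i]);

// Host memory must be pinned (page-locked) for cudaMemcpyAsync to overlap
// with kernel execution, hence cudaMallocHost instead of malloc.
float *a, *d_a;
cudaMallocHost((void**)&a, N * sizeof(float));
cudaMalloc((void**)&d_a, N * sizeof(float));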
Slide 5 - 1st Approach
One approach is the following. We break up our a and d_a arrays into chunks of streamSize elements. We then have a loop iterating over the number of streams that we are using. In each iteration, a stream performs an asynchronous memory copy to d_a, executes the kernel on the GPU, and asynchronously moves the result back to a, as in the sketch below.
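Following the structure of the linked async.cu example, this first approach looks roughly like this (streamBytes and blockSize are the assumed helper constants from the sketch above):

for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;   // first element of this stream's chunk
  // asynchronous host-to-device copy of this chunk
  cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes,
                  cudaMemcpyHostToDevice, stream[i]);
  // kernel launched in the same stream works only on this chunk
  kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
  // asynchronous device-to-host copy of the result
  cudaMemcpyAsync(&a[offset], &d_a[offset], streamBytes,
                  cudaMemcpyDeviceToHost, stream[i]);
}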
Slide 6 – 2nd Approach
The second approach is slightly different: we batch similar operations together. We issue all the host-to-device asynchronous transfers first. These are followed by all the kernel launches on the different streams. Finally, we loop over all the streams and perform all the device-to-host transfers, as in the sketch below.
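Again following the structure of the async.cu example, the second approach batches the same operations into three separate loops:

// issue all host-to-device transfers first
for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes,
                  cudaMemcpyHostToDevice, stream[i]);
}
// then launch all kernels, one per stream
for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  kernel<<<streamSize / blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
}
// finally issue all device-to-host transfers
for (int i = 0; i < nStreams; ++i) {
  int offset = i * streamSize;
  cudaMemcpyAsync(&a[offset], &d_a[offset], streamBytes,
                  cudaMemcpyDeviceToHost, stream[i]);
}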
Slide 7 - Which Version Performs Better?
The question is: which approach performs better? And, as usual, the answer is that it depends. The two approaches might perform differently on some of the older GPUs because of differences in the hardware. CUDA devices contain engines for various tasks, which queue up operations as they are issued. Dependencies between tasks in different engines are maintained, but within any engine all external dependencies are lost: tasks in each engine’s queue are executed in the order they are issued.
Slide 8 – GPU with 1 Copy Engine and 1 Kernel Engine, e.g. Tesla C1060
Let’s look at an example of a GPU with only one copy engine, like the Tesla C1060. For the first asynchronous approach of our code, the order of execution in the copy engine is: H2D stream(1), D2H stream(1), H2D stream(2), D2H stream(2), and so forth. Tasks are issued to the copy engine in an order that precludes any overlap of kernel execution and data transfer, which is why we do not see any speed-up when using the first asynchronous approach on this architecture. For approach two, however, where all the host-to-device transfers are issued before any of the device-to-host transfers, overlap is possible, as indicated by the lower execution time.
Slide 9 – GPU with 2 Copy Engines and 1 Kernel Engine, e.g. Tesla C2050
Things are different when the GPU has two copy engines, like the Tesla C2050. Having two copy engines explains why asynchronous approach 1 now achieves a good speed-up: the device-to-host transfer of data in stream[i] does not block the host-to-device transfer of data in stream[i+1] as it did on the previous GPU, because there is a separate engine for each copy direction.
But what about the performance degradation in asynchronous version 2? This is related to the C2050’s ability to run multiple kernels concurrently. When multiple kernels are issued back-to-back in different (non-default) streams, the scheduler tries to enable concurrent execution of these kernels and, as a result, delays a signal that normally occurs after each kernel completion until all kernels complete. This signal is what allows the corresponding device-to-host transfer to start. So, while there is overlap between host-to-device transfers and kernel execution in the second approach of our asynchronous code, there is no overlap between kernel execution and device-to-host transfers.
Slide 10 – Hyper-Q
The difference in the performance of the two codes might be scary. The good news is that for devices with compute capability 3.5 (the K20 series), the Hyper-Q feature eliminates the need to tailor the launch order, so either approach achieves the same performance. So, if you are using a modern GPU, both approaches will perform well, while on older GPUs you might want to check the performance of the different approaches and the number of copy and kernel engines.
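If you want to check what your own GPU supports, a small sketch like the following uses the standard CUDA runtime device-property query to print the relevant information (the choice of device 0 is an assumption):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);                                    // query device 0
  printf("Compute capability: %d.%d\n", prop.major, prop.minor);        // 3.5 or higher has Hyper-Q
  printf("Copy engines (asyncEngineCount): %d\n", prop.asyncEngineCount);
  printf("Concurrent kernels supported: %d\n", prop.concurrentKernels);
  return 0;
}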
Slide 13 – To Summarize
In summary, in this lecture we looked at an example of how to use streams and asynchronous memory copies in practice to overlap communication and computation in GPU applications.