Optimizing Host-Device Data Communication II - CUDA Streams & Asynchronous Copy
Slides: 3.1.2 CUDA Streams & Asynchronous Copy.pdf
Transcription of the Video Lecture
Slide 2 – Three Key-Points
In this lecture, I would like to emphasize three main points. The first point is that CUDA supports parallel execution of kernels and memory copies by using streams. A CUDA stream is nothing more than an ordered sequence of operations. The second point is that we can create and destroy CUDA streams with cudaStreamCreate()/cudaStreamDestroy(). In addition, we use cudaMemcpyAsync() to copy asynchronously on a stream. The third point is that we use the stream identifier as the fourth parameter of the execution configuration when launching a kernel on a stream.
Slide 3 – Serialized Data Communication and Computation
Until now, data communication and computation have been done the way presented in this slide. We allocate our arrays, copy them to GPU memory, compute, and copy the result back to CPU memory. From the perspective of the host, the data transfers are blocking, or synchronous, while the kernel launch is asynchronous. Since the host-to-device data transfer on the first two lines is synchronous, the CPU thread will not reach the kernel call on the third line until the host-to-device transfer is complete. Once the kernel is issued, the CPU thread moves to the fourth line, but the transfer on that line cannot begin until the kernel has finished, because of the device-side order of execution. So communication and computation are performed sequentially. This is inefficient because we cannot use all the available hardware: during the communication we do not use the GPU, and during the computation on the GPU we do not use the bus for communication.
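As a reference, the serialized pattern described above corresponds roughly to the following fragment. The kernel name vecAdd and the size N are placeholder names not taken from the slide, and d_a, d_b, d_c are assumed to be device arrays already allocated with cudaMalloc():

cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);  // blocks the host thread
cudaMemcpy(d_b, h_b, N * sizeof(float), cudaMemcpyHostToDevice);  // blocks the host thread
vecAdd<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);               // asynchronous launch, returns immediately
cudaMemcpy(h_c, d_c, N * sizeof(float), cudaMemcpyDeviceToHost);  // waits for the kernel to finish, then blocks the host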
Slide 4 - Can we do better? Pipeline Communication & Compute
So, the question is can we do better and overlap communication and computation? And, of course, the answer is yes. The way to do that is to break the arrays into two or more pieces and to pipeline the memory copies and the computation on each piece along different pipeline lanes. As in the previous slide, we want to perform c = a + b, where a, b and c are arrays. Differently from the previous slide, we break the three arrays into three parts and assign the corresponding pieces to different pipeline lanes. In this case, we have three lanes. Each lane performs three operations: copying in its piece of the input arrays, computing the sum, and copying the result back. Because communication and computation on different pieces are now performed concurrently, the hardware is utilized much more efficiently.
Slide 5 - How do we do that? CUDA Streams
How do we do that with CUDA? Basically, each pipeline lane of the previous slide is associated with a CUDA stream. Each stream is a queue of device operations, such as kernel launches and memory copy operations. Operations within a stream execute in the order they are issued, while operations on different CUDA streams can execute in parallel.
Slide 6 – The default stream
All device operations (kernels and data transfers) in CUDA run in a stream. When no stream is specified, the default stream is used. So without knowing it, we were using streams already. The default stream is different from other streams because it is a synchronizing stream with respect to operations on the device: no operation in the default stream will begin until all previously issued operations in any stream on the device have completed, and an operation in the default stream must complete before any other operation (in any stream on the device) will begin.
Slide 7 - Non-Default CUDA Streams
When we want to overlap communication and computation, we create additional non-default streams and associate the computation and memory copies with them. The code in this slide declares one variable of the CUDA built-in type cudaStream_t. Recall that the CUDA built-in types are declared in cuda.h. This variable is then used in calling the cudaStreamCreate() function. Each call to cudaStreamCreate() creates a new stream and deposits a handle to it into its parameter. We use cudaStreamDestroy() to destroy the stream once it is not needed anymore.
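A minimal sketch of creating and destroying a non-default stream (error checking omitted; stream1 is a placeholder name):

cudaStream_t stream1;
cudaStreamCreate(&stream1);   // create a new non-default stream
// ... issue asynchronous copies and kernel launches on stream1 here ...
cudaStreamDestroy(stream1);   // destroy the stream once it is no longer needed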
Slide 8 - Data Transfer to a Non-Default Stream
To issue a data transfer to a non-default stream we use the cudaMemcpyAsync() function, which is similar to the cudaMemcpy() function discussed in the previous lectures. The main difference is that it takes a stream identifier as a fifth argument. cudaMemcpyAsync() is non-blocking on the host, so control returns to the host thread immediately after the transfer is issued.
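For example, an asynchronous host-to-device copy issued on stream1 might look like this fragment (d_a, h_a, and bytes are placeholder names):

cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream1);  // fifth argument selects the stream; control returns to the host immediately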
Slide 9 - Kernel on a Non-Default Stream
To issue a kernel to a non-default stream we specify the stream identifier as a fourth execution configuration parameter. The third execution configuration parameter specifies the amount of dynamically allocated shared memory per block, which in this case is 0.
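A kernel launch on stream1 would then look like this fragment (vecAdd, grid, and block are placeholder names):

vecAdd<<<grid, block, 0, stream1>>>(d_a, d_b, d_c, N);  // fourth execution configuration parameter selects the stream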
Slide 10 - Overlapping Kernel Execution and Data Transfers
Our main goal is to overlap kernel execution with data transfers. There are two requirements for this to happen. First, the kernel execution and the data transfer to be overlapped must both occur in different, non-default streams. Second, the host memory involved in the data transfer must be pinned memory. We saw in the previous lecture how to allocate host pinned memory.
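Putting the pieces together, a pipelined version of the vector addition might look like the following sketch. The names nStreams, streamSize, vecAdd, h_a/h_b/h_c, and d_a/d_b/d_c are placeholders; the device arrays are assumed to be allocated with cudaMalloc(), the host arrays with cudaMallocHost() so that they are pinned, and N is assumed to be divisible by nStreams:

const int nStreams = 3;
const int streamSize = N / nStreams;
const int streamBytes = streamSize * sizeof(float);

cudaStream_t streams[nStreams];
for (int i = 0; i < nStreams; ++i)
    cudaStreamCreate(&streams[i]);

for (int i = 0; i < nStreams; ++i) {
    int offset = i * streamSize;
    // Stage 1: copy this piece of a and b to the device on stream i
    cudaMemcpyAsync(&d_a[offset], &h_a[offset], streamBytes, cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync(&d_b[offset], &h_b[offset], streamBytes, cudaMemcpyHostToDevice, streams[i]);
    // Stage 2: compute the sum of this piece on stream i
    vecAdd<<<(streamSize + 255) / 256, 256, 0, streams[i]>>>(&d_a[offset], &d_b[offset], &d_c[offset], streamSize);
    // Stage 3: copy the result piece back to pinned host memory on stream i
    cudaMemcpyAsync(&h_c[offset], &d_c[offset], streamBytes, cudaMemcpyDeviceToHost, streams[i]);
}

for (int i = 0; i < nStreams; ++i) {
    cudaStreamSynchronize(streams[i]);  // wait for all work queued on stream i (synchronization is discussed below)
    cudaStreamDestroy(streams[i]);
}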
Slide 11 - Synchronization of Streams
Since all operations in non-default streams are non-blocking with respect to the host code, we will have situations where we need to synchronize the host code with operations in a stream. There are several ways to do this. The “heavy hammer” way is to use cudaDeviceSynchronize(). This blocks the host code until all previously issued operations on the device have completed. In most cases this is overkill. It can really hurt performance due to stalling the entire device and host thread.
The CUDA stream API has multiple less severe methods of synchronizing the host with a stream. The most commonly used one is the function cudaStreamSynchronize(stream), which blocks the host thread until all previously issued operations in the specified stream have completed. cudaStreamSynchronize() is called in the host code and takes one parameter, the stream identifier.
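For instance, using stream1 from the earlier sketches:

cudaStreamSynchronize(stream1);  // block the host thread until everything queued on stream1 has completed
// The heavier alternative, cudaDeviceSynchronize(), waits for all previously issued work in every stream on the device.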
Slide 12 - CUDA 7 Streams – Per-Thread Default Stream
Before CUDA 7, released in 2015, each device had a single default stream used by all host threads, which caused implicit synchronization: two commands from different streams cannot run concurrently if the host thread issues any CUDA command to the default stream between them.
CUDA 7 introduces a new option, the per-thread default stream, that has two effects. First, it gives each host thread its own default stream. This means that commands issued to the default stream by different host threads can run concurrently. Second, these default streams are regular streams. This means that commands in the default stream may run concurrently with commands in non-default streams. So, remember to use the flag --default-stream per-thread when compiling our CUDA code that uses streams.
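As a sketch, assuming a source file named stream_example.cu, the compilation command with the per-thread default stream enabled would look like this:

nvcc --default-stream per-thread stream_example.cu -o stream_example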
Slide 13 – To Summarize
In summary, in this lecture we looked at three points. We looked first at CUDA streams, which support parallel execution of kernels and memory copy operations. We can create and destroy CUDA streams with cudaStreamCreate() and cudaStreamDestroy(). We use cudaMemcpyAsync() to copy asynchronously on a stream. Finally, we use the stream identifier as the fourth parameter of the execution configuration when launching a kernel on a stream.