Introduction to OpenACC - parallel and loop Constructs
The slides for this lecture are available for download here.
Lecture Transcription
Slide 2 – Three Key Points
In this lecture, we have three main points. First, the parallel construct is the most important construct for us: it allows us to specify which code should be executed on the GPU. Second, OpenACC uses the fork-join model for parallelism. Third, the loop construct allows the use of work-sharing in for loops.
Slide 3 – parallel construct
The parallel construct is a construct to specify which part of the program is to be executed on the accelerator. When a program encounters a parallel construct, the execution of the code within the structured block of the construct (also called a parallel region) is moved to the accelerator.
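Here is a minimal sketch in C of what this looks like (the variable name and the copy clause are illustrative, not taken from the slide): the code inside the structured block runs on the accelerator, and execution returns to the host afterwards.

```c
#include <stdio.h>

int main(void)
{
    int a = 0;

    /* Sequential host code runs up to this point. */

    /* Parallel region: the structured block is executed on the accelerator.
       The copy clause makes the updated value of a visible on the host. */
    #pragma acc parallel copy(a)
    {
        a = 23;
    }

    /* Execution continues sequentially on the host. */
    printf("a = %d\n", a);
    return 0;
}
```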
Slide 4 – Fork-Join Parallelism
OpenACC uses fork-join parallelism. What is it? If you took a course in high-performance computing or multithreading in Java, you are probably already familiar with this concept; OpenACC and OpenMP use the same model of parallelism. If you are not familiar with it, let me quickly explain. The fork-join model is a way of structuring and executing parallel programs. At designated points in the program, execution forks from a sequential part of the code into multiple threads that run in parallel. The threads are then joined, or synchronized, at a subsequent point, and execution resumes sequentially. Parallel sections may fork in a nested way until a certain task granularity is reached.
Slide 5 – How Many GPU Threads? Gangs of Workers
We saw in CUDA that we set the number of threads by deciding the number of blocks and the number of threads per block. In OpenACC we can leave the number of threads unspecified, and the compiler will pick a number for us. However, there is also a way to specify the number of threads by setting two parameters: the number of gangs and the number of workers. The number of gangs corresponds to the number of blocks, while the number of workers corresponds to the number of threads per block. So, how many threads do we have in this example?
A total of 1,024 x 32 = 32,768 workers are created. The a = 23 statement will be executed in parallel and redundantly by 1,024 gang leads.
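A sketch of the directive described above (num_gangs and num_workers are the standard OpenACC clauses; the surrounding declarations are assumed):

```c
int a = 0;

/* 1,024 gangs x 32 workers each = 32,768 workers in total.
   With no loop directive inside the region, the assignment is
   executed redundantly rather than shared among them. */
#pragma acc parallel num_gangs(1024) num_workers(32) copy(a)
{
    a = 23;
}
```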
Slide 6 – What is the problem?
Let’s look at an example of how the parallel construct works. In this case, we open a parallel region to indicate which code we want to run on the GPU. We choose 1,024 gangs and leave the number of workers unspecified. So, we fork a certain number of threads, and each one executes the code inside the parallel region. In this case, each thread executes the whole loop, assigning the value of 23 to a. This is not exactly what we wanted to achieve. What we wanted is to have each thread execute one of the iterations of the loop, assigning 23 to a, instead of each thread executing the entire loop. We do that by enabling work-sharing with the loop construct.
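The pattern described above might look like the following sketch (the array a and the 2,048 iterations come from the lecture; the exact declarations are assumptions):

```c
#define N 2048
int a[N];

/* Every gang created for this region executes the ENTIRE loop,
   so the 2,048 iterations are repeated redundantly instead of
   being distributed among the threads. */
#pragma acc parallel num_gangs(1024) copyout(a)
{
    for (int i = 0; i < N; i++)
        a[i] = 23;
}
```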
Slide 7 – loop Construct for Work-Sharing
To get speedup, we need to distribute the 2,048 iterations among the threads. To do that, we add #pragma acc loop before the loop to enable work-sharing. This is equivalent to a more concise formulation where we merge the parallel and loop constructs into one line. So the merged parallel loop means that we open a parallel region and then do work-sharing with loop.
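Both forms might be sketched as follows, reusing the array a from the previous example:

```c
/* Form 1: a parallel region plus a separate loop directive. */
#pragma acc parallel
{
    #pragma acc loop
    for (int i = 0; i < N; i++)
        a[i] = 23;
}

/* Form 2: the equivalent combined construct on a single line. */
#pragma acc parallel loop
for (int i = 0; i < N; i++)
    a[i] = 23;
```

With either form, the 2,048 iterations are distributed among the threads instead of being executed redundantly by each one.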
Slide 8 – Scheduling Work-Sharing
There are three levels of parallelism that can be expressed in OpenACC when scheduling work-sharing: gang, worker, and vector. We can place them after the loop construct to guide the compiler. In practice, however, this often won't make much difference, and we are usually best off letting the compiler decide.
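As an illustration of the syntax only (this particular placement of the clauses is an assumption; the slide simply lists the three levels), the clauses can be attached to loop directives like this:

```c
int b[2048][128];

/* Gang-level work-sharing on the outer loop, vector-level on the
   inner loop; worker could be used in between. In practice the
   compiler's own mapping is usually just as good. */
#pragma acc parallel loop gang copyout(b)
for (int i = 0; i < 2048; i++) {
    #pragma acc loop vector
    for (int j = 0; j < 128; j++)
        b[i][j] = i + j;
}
```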
Slide 9 – To summarize
We have reached the end of this lecture. We looked at three main points. First, the parallel construct is the construct that allows us to offload a computation in OpenACC. Second, we saw that OpenACC uses fork-join parallelism, which forks and joins threads on the GPU. Third, the loop construct placed before a for loop allows the use of work-sharing in for loops.