Introduction to OpenACC - Our First OpenACC Program

The lecture slides are available for download.

Lecture Transcripts

Slide 2 – Two Key Points

In this lecture, we have two main points. The first point is what OpenACC is: OpenACC is a compiler technology and consists of a set of compiler directives, library routines, and environment variables for programming GPUs. The second point is that we are going to implement our first program in OpenACC by simply annotating a serial C code with a series of pragmas, or compiler directives.

Slide 3 – What is OpenACC?

OpenACC is a compiler technology, which means that we need a compiler that supports OpenACC. We annotate our C code with so-called pragmas, or compiler directives. The compiler catches these directives and translates the annotated code into code for the GPU. The OpenACC framework consists of a number of compiler directives, library functions, and environment variables. OpenACC supports code written in C and Fortran; it also supports C++, but using OpenACC in C++ code is a little more cumbersome. The specs defining the OpenACC pragmas were originally developed by the Portland Group Inc., that is, the PGI company. PGI was bought by Nvidia in 2013. For this reason, the PGI implementation of the OpenACC standard is built on CUDA.

Slide 4 – History of OpenACC

Around 2010, compiler vendors started to look at directives for GPUs, to simplify GPU programming through the compiler. At its inception, OpenACC was intended to target all different kinds of accelerators. However, it ended up supporting only Nvidia GPUs. OpenMP, which is an approach similar to OpenACC, has the goal of supporting all different kinds of accelerators. The OpenACC standard was released in November 2011 at the Supercomputing conference. The standard is mainly supported by Cray and PGI. In the future, OpenACC will be replaced by OpenMP: the OpenMP 4 standard has support for accelerators. However, currently few compilers implement the OpenMP 4 support for accelerators. The good news for us is that if we know OpenACC, it will be very easy for us to use OpenMP.

Slide 5 – OpenACC code

One big difference between OpenACC and CUDA C is the use of compiler directives in OpenACC. To understand what a compiler directive is and the advantages of using compiler directives, let’s take a look at our first OpenACC program in this slide. The code performs matrix multiplication.

The code in the slide is almost identical to the sequential version, except for the two lines with #pragma. In C, the #pragma directive is the mechanism for providing the compiler with information that is not specified in the standard language. OpenACC uses the compiler directive mechanism to extend the base C language.
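
Since the transcript does not reproduce the slide’s code, here is a minimal sketch of what it might look like, assuming square n-by-n matrices stored in row-major order; the function name and signature are illustrative, not necessarily the slide’s exact code.

    // Sketch (assumed layout): multiply two n-by-n row-major matrices.
    // Only the two #pragma lines differ from the serial version.
    void matmul(const float *M, const float *N, float *P, int n)
    {
        // Offload the i loop; copy M and N to the device on entry,
        // copy P back to the host on exit (see Slides 6 and 7).
        #pragma acc parallel loop copyin(M[0:n*n], N[0:n*n]) copyout(P[0:n*n])
        for (int i = 0; i < n; i++) {
            // Map the j loop to a second level of parallelism (Slide 8).
            #pragma acc loop
            for (int j = 0; j < n; j++) {
                float sum = 0.0f;
                for (int k = 0; k < n; k++)
                    sum += M[i * n + k] * N[k * n + j];
                P[i * n + j] = sum;
            }
        }
    }

Removing, or ignoring, the two pragma lines leaves a plain serial C function.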

Slide 6 – Loop executed on GPU - #pragma acc

In this example, the #pragma tells the compiler to generate code for the i loop block so that the loop iterations are executed in parallel on the accelerator.

Slide 7 – Data Movement with OpenACC – copyin and copyout

The copyin clause and the copyout clause specify how the matrix data should be transferred between the host and the accelerator. In particular, we are telling the compiler to copy the arrays M and N to the accelerator at the start of the parallel region and to copy P back to the host at its end.
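
Because M, N, and P are bare pointers in C, the data clauses must also spell out how many elements to move, using OpenACC array sections of the form ptr[start:count]. Here is a small self-contained example with hypothetical names a, b, and len:

    // Illustrative only: 'a' is read on the device (copyin); 'b' is
    // written on the device and needed back on the host (copyout).
    // The section a[0:len] means: len elements starting at index 0.
    void scale(const float *a, float *b, int len)
    {
        #pragma acc parallel loop copyin(a[0:len]) copyout(b[0:len])
        for (int i = 0; i < len; i++)
            b[i] = 2.0f * a[i];
    }

A copy clause, not used here, would transfer the data in both directions.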

Slide 8 – Work-Sharing

This #pragma instructs the compiler to map the inner j loop to the second level of parallelism on the accelerator. Remember that we are already in a parallel region on the GPU because we are inside the i loop block. So, this is nested parallelism.
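
OpenACC also lets the programmer name the levels explicitly with the gang and vector clauses; in PGI’s CUDA-based implementation, gangs roughly correspond to CUDA thread blocks and vector lanes to threads. One possible explicit mapping of the loop nest from the earlier sketch (an assumption; the slide may simply rely on the compiler’s default mapping):

    // Explicit two-level mapping (illustrative):
    //   gang   - distribute the i iterations across gangs (thread blocks)
    //   vector - run the j iterations on vector lanes (threads) within a gang
    #pragma acc parallel loop gang copyin(M[0:n*n], N[0:n*n]) copyout(P[0:n*n])
    for (int i = 0; i < n; i++) {
        #pragma acc loop vector
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += M[i * n + k] * N[k * n + j];
            P[i * n + j] = sum;
        }
    }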

Slide 9 – Compiling OpenACC

Not all compilers support OpenACC. The PGI compiler supports OpenACC, and we are going to use it for our assignment. If you are using the PGI compiler on your laptop, you need to call the PGI C compiler, called pgcc. On the supercomputer, remember to swap the compilation environment from the default one to the PGI compiler framework; then you call the cc wrapper, which invokes the PGI compiler. The most important thing is: don’t forget the -acc flag. If you don’t provide the flag, the pragmas will be ignored, and the code will run as serial code.
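
As a sketch, the two workflows might look like this (the module names vary from system to system, so check your site’s documentation; -Minfo=accel is an optional PGI flag that reports what the compiler offloaded):

    # On a laptop with the PGI toolchain:
    pgcc -acc -Minfo=accel matmul.c -o matmul

    # On a Cray supercomputer: swap to the PGI environment, then use the cc wrapper.
    module swap PrgEnv-cray PrgEnv-pgi
    cc -acc matmul.c -o matmul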

Slide 10 – OpenACC Pros/Cons

So, what are the advantages and disadvantages of OpenACC? The first advantage is that we don’t need to completely rewrite an existing code that we want to port to the GPU. Some existing scientific applications are large, and their developers don’t want to rewrite them for accelerators. OpenACC lets these developers keep their applications looking like normal C or Fortran code, and they can go in and put the directives into the code where they are needed, one place at a time. Second, by using the compiler directive approach, OpenACC allows a programmer to write programs in such a way that when the directives are ignored, the program still runs sequentially and gives the same result as when it is run in parallel. A disadvantage of OpenACC is its performance when compared to the performance of pure CUDA code. In addition, the performance of an OpenACC program depends heavily on the capability of the OpenACC compiler used.

Slide 11 – To Summarize

In summary, in this lecture we looked first at what OpenACC is. We then added pragmas, or compiler directives, to offload loops of computation to the GPU. In the next lecture, we are going to look in more detail at how to use the parallel and loop constructs to offload computation to the GPU with OpenACC.