7.4 GPU device code using OpenACC pragmas with the pgcc compiler¶
In this section we finally see how to use OpenACC pragmas and pgcc compiler options to indicate that we want to execute the addition of the two arrays on the GPU device.
Same command line and helper functions as before¶
Let’s examine the GPU OpenACC code below by starting with the compiler arguments shown at the bottom of the code. Here’s what each one means:
-fast : Use the highest level of code optimization.
-acc=gpu : Compile code for the GPU when encountering ‘#pragma acc’ lines.
-gpu=managed : Use managed memory between the host and the device. For more details, see the CUDA chapters in our PDC for Beginners book, at learnpdc.org. For most cases it is best to use this setting.
-Minfo=accel : Display information about how the code was compiled for the GPU. This helps us see whether the code generated is parallelized and how.
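Putting these flags together, the compile command looks something like the following (the source and executable file names here are placeholders for illustration, not necessarily the names used in this chapter's code):

pgcc -fast -acc=gpu -gpu=managed -Minfo=accel -o gpu_add gpu_add.c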
Next, let’s look at the OpenACC pragmas used on lines 10 and 11 in the code below.
The kernels directive on line 10 indicates that the next block of code should be executed on the GPU (if you are familiar with CUDA kernel functions, this name will make some sense).
The loop directive indicates that the following for loop should be decomposed across as many GPU threads as possible; the compiler sets up the underlying CUDA blocks of threads for this task.
The independent clause is needed because, when using the kernels directive, the compiler is very conservative and will not parallelize the loop unless you indicate that the computations inside the loop are independent (that is, they contain no data dependencies and can therefore be executed in any order).
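As a concrete illustration, here is a minimal sketch of how these two pragmas are placed around the loop in a GPUadd-style function. The array names x and y and the size parameter n are assumptions chosen to match the description above; this is not necessarily the exact code below.

void GPUadd(int n, float *x, float *y) {
    // kernels: ask the compiler to generate GPU code for the block that follows
    #pragma acc kernels
    // loop independent: the iterations have no data dependencies,
    // so each can be mapped to its own GPU thread
    #pragma acc loop independent
    for (int i = 0; i < n; i++) {
        y[i] = x[i] + y[i];    // element-wise addition of the two arrays
    }
}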
Note
With a GPU device containing thousands of cores, the programming model is different: we consider that there are enough threads to work on each data element of the array independently. This isn't strictly true when our arrays are extremely large, but the GPU system manages which threads in which thread blocks map onto the updates of the array elements in the loop.
In the main program, code executes on the host CPU until the function GPUadd() is called; execution then moves to the GPU, with array memory managed between the device and the host. When GPUadd() completes and the device memory is copied back to the host, execution resumes on the host CPU.
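A simplified sketch of that flow in main might look like the following. The array size, initial values, and printed check are illustrative assumptions; with -gpu=managed, the data movement between host and device is handled for us.

#include <stdio.h>
#include <stdlib.h>

void GPUadd(int n, float *x, float *y);     // the GPU version sketched above

int main(int argc, char **argv) {
    int n = 1048576;                         // illustrative default size
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));

    for (int i = 0; i < n; i++) {            // runs on the host CPU
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    GPUadd(n, x, y);                         // runs on the GPU; managed memory
                                             // moves x and y as needed

    printf("y[0] = %f\n", y[0]);             // back on the host CPU

    free(x);
    free(y);
    return 0;
}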
Exercises
Run as is to see that the output still looks the same with small arrays.
- Note the compiler output after the ===== STANDARD ERROR ===== line. Be careful to notice that the compiler reports two important things:
That the data in the array called x is being copied in from the host and the data in the array called y is being copied in and back out, as indicated by the keyword ‘copy’.
That the compiler is parallelizing the loop for the GPU and in this case is setting up gangs (equivalent to CUDA blocks) of 128 threads.
Remove ‘-n’, ‘10’ from the square brackets in the command arguments and run again with the default size.
Explore the need for the independent clause: try using [‘-n’, ‘8192’] for the command line arguments and eliminating the word ‘independent’ from the second pragma in the GPUadd function. Carefully observe the compiler output. When you see output like this, be aware that the compiler is choosing not to run the loop in parallel. This is the result of the compiler being conservative: as the developer, you need to tell it that the calculations are independent, or it often will not choose to set up the parallelism.
Important point¶
As with other examples in this chapter, we are using OpenMP functions to time our code: specifically, how long it takes to copy data to the device, compute the addition of each of the elements, and copy the result back. The main point to see here is that this version with the GPUadd function runs slower than the previous CPU versions; the computation needs to involve more work before running functions on the GPU device becomes worthwhile. The reason is that there is a cost for the data movement between the host and the GPU device, so the amount of time for computations must be high enough that the data movement time is an insignificant portion of the overall time. We will see examples where this is the case in the next couple of chapters.
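As a rough sketch of that timing pattern (the wrapper function here is an illustrative assumption, not the chapter's exact code; compiling with pgcc's -mp flag makes the OpenMP runtime functions available), omp_get_wtime() can be placed around the call so the measurement includes both the data movement and the computation:

#include <omp.h>

void GPUadd(int n, float *x, float *y);   // the GPU version sketched earlier

// Returns elapsed wall-clock seconds for one call to GPUadd, which
// includes host-device data movement and the GPU computation.
double time_GPUadd(int n, float *x, float *y) {
    double start = omp_get_wtime();
    GPUadd(n, x, y);
    double end = omp_get_wtime();
    return end - start;
}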