8.1 The collapse() pragma directive on parallel for loop¶
The previous chapter ended with our first example of OpenACC pragmas for the parallel for loop pattern run on the GPU. We continue with a few more code examples that use additional pragmas that also have counterparts in OpenMP shared memory computing.
Note
Central facet of these examples: ‘flattened’ 2-dimensional matrices
The important aspect of these examples is that they illustrate the declaration and manipulation of elements in 2D matrices. A key aspect of running code on GPUs is that the architecture works best if your code is accessing n-dimensional data as flattened 1-dimensional arrays that are allocated in ‘heap memory’ by the malloc function in C (or ‘new’ in C++). This differs from declaring arrays in a function’s stack memory, which is not suitable for moving or sharing between the host and the GPU device. In the cases we show here, we are using one or more 2-D matrices to illustrate this point.
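To make the idea of a flattened matrix concrete, here is a minimal sketch in C. The helper names alloc_flat_matrix, flat_index, and fill_with_position are hypothetical and are not part of the book's example program; the sketch only assumes the usual row-major layout, where element (row, col) lives at index row * cols + col.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate a rows x cols matrix of floats as one contiguous heap block.
   Returns NULL if the allocation fails. (Hypothetical helper.) */
float *alloc_flat_matrix(size_t rows, size_t cols) {
    return (float *)malloc(rows * cols * sizeof(float));
}

/* Map a 2D (row, col) position to its index in the flattened array.
   Row-major order: all of row 0, then all of row 1, and so on. */
size_t flat_index(size_t row, size_t col, size_t cols) {
    assert(col < cols);   /* a column outside the row would alias another row */
    return row * cols + col;
}

/* Fill every cell with a value that encodes its position, for illustration. */
void fill_with_position(float *M, size_t rows, size_t cols) {
    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            M[flat_index(i, j, cols)] = (float)(10 * i + j);
        }
    }
}
```

For a 3 x 4 matrix, flat_index(2, 1, 4) is 2 * 4 + 1 = 9, so the element in row 2, column 1 is stored at M[9]. The caller is responsible for calling free() on the block when it is no longer needed.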
Example: Update cells of each matrix¶
The code below has these key areas to focus on:
In main(), a square 2D matrix called A is allocated in heap memory by calling the malloc function, and the matrix is ‘flattened’ into a 1D array.
The function called MatrixUpdate() updates the value of every element of A in parallel on the GPU. Following the CUDA manycore computing model, we think of a single GPU thread working on each data element of the array separately.
The new pragma used in MatrixUpdate() adds collapse(2). This tells the compiler to treat the two nested for loops that follow the pragma as a single loop over all of their combined iterations. This suits a flattened array well: every (row, column) pair becomes one independent iteration, so we can think of each element as being handled by its own thread.
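As a stand-alone illustration of the collapse(2) idea, separate from the book's full program, the sketch below applies it to a pair of nested loops over a flattened square matrix. The function name double_all_cells and the doubling computation are assumptions made for this sketch; the book's MatrixUpdate() performs its own computation. The explicit copy(A[0:size*size]) clause is also an assumption, added so the sketch does not rely on managed memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical update: double every element of a size x size matrix
   stored as a flattened 1D array. This is NOT the book's MatrixUpdate(). */
void double_all_cells(int size, float * __restrict__ A) {
    assert(size > 0 && A != NULL);

    /* collapse(2) asks the compiler to treat the two nested loops as one
       iteration space of size*size independent iterations. */
    #pragma acc parallel loop collapse(2) copy(A[0:size*size])
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            /* row-major index into the flattened array */
            A[(size_t)i * size + j] *= 2.0f;
        }
    }
}
```

A compiler without OpenACC support simply ignores the pragma and runs the two loops sequentially, so the function produces the same values either way.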
Look for these parts of the code below. After the code, we discuss the command line arguments and how you can explore this example.
This code takes 3 optional command line arguments, in this order:
1. The size of one side of the square matrix being used.
2. Whether to print the arrays after the manipulation (the default of zero means don’t print; non-zero means print). Use this only with very small matrix sizes, since this book doesn’t return large print buffers and large output is hard to read.
3. Whether to check if the results are correct. The particular contrived computation we chose is easy to check.
Exercises
1. Use the command line arguments above to see the result of the manipulation of the data elements and to confirm that the data check passes.
2. After running the default, try larger matrix sizes to take advantage of the GPU. Try [‘5000’], [‘10000’], [‘20000’], and [‘40000’] in the command line arguments. Jot down the time for each one.
3. Each time we double the size of one side of the matrix like this, how many times more calculations are we doing than in the previous trial? (Hint: count the elements for 2x2, then 4x4, then 8x8, then 16x16, dividing each count by the preceding one.) Comparing this ratio with the times you recorded in Exercise 2 can give you some sense of the scalability of this GPU solution.
4. Try commenting out this line in the MatrixUpdate() function and observe that the compiler can still generate code that ensures the data is moved correctly (the ‘-gpu=managed’ flag used here lets memory be managed automatically between the host and the device).
#pragma acc data copy(A) // not strictly necessary
5. Recall from the previous chapter that we can change the compiler flags to create a multicore version of this code. Replace the compiler arguments ‘-acc=gpu’ and ‘-gpu=managed’ with ‘-acc=multicore’. Then take some measurements with this version of the code.
6. If you wish to experiment further on your own hardware, you could get the code we have provided, build a multicore version, and compare it to this GPU version for different problem sizes.
Note
Here’s a spoiler alert for exercises 5 and 6: the GPU version is faster for larger problem sizes, likely because we are now doing enough computation on each data element to make the cost of managing and moving data between the host and the device worthwhile. Your results will certainly vary, but we have found that with high-quality GPU cards this type of computation pays off at even smaller problem sizes.
Note also that we could compile a sequential version for the host, in which the pgcc compiler ignores the pragmas and generates code that computes each new matrix value one at a time (look back at the previous chapter for how to do this). In this book, running the code this way will time out at sizes over 10000 (demonstrating the advantages of scaling with parallelism!).
Some ‘best practices’ in this code¶
As we introduce more examples, we are also introducing some new coding practices along the way, while continuing to use some that appeared earlier in this book and in the PDC for Beginners book. What follows are a few that we included in the code above.
restrict keyword¶
The OpenACC organization published a Guide for OpenACC programming online, dated 2015. Some of the information pertains to previous versions of OpenACC and older GPU cards, but some information in there is worthwhile. For example, in chapter 3, p. 17, they state:
Best Practice: C programmers should use the restrict keyword (or the __restrict decorator in C++) whenever possible to inform the compiler that the pointers are not aliased, which will frequently give the compiler enough information to then parallelize loops that it would not have otherwise. In addition to the restrict keyword, declaring constant variables using the const keyword may allow the compiler to use a read-only memory for that variable if such a memory exists on the accelerator. Use of const and restrict is a good programming practice in general, as it gives the compiler additional information that can be used when optimizing the code.
Note that we followed this practice in the MatrixUpdate function definition, using the __restrict__ spelling of the keyword, like this:
void MatrixUpdate(int size, float * __restrict__ A) {
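The benefit of restrict is easiest to see when a function takes more than one pointer. The following sketch is a hypothetical helper (scale_into is not part of the book's code) that combines const and __restrict__ as the guide recommends. The data clauses are assumptions added so that the sketch does not depend on managed memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: out[i] = factor * in[i] for i in [0, n).
   __restrict__ promises the compiler that in and out never overlap;
   const promises that the data behind in is only read. */
void scale_into(size_t n, float factor,
                const float * __restrict__ in, float * __restrict__ out) {
    assert(in != NULL && out != NULL);

    /* Without the no-alias promise, the compiler would have to assume that
       writing out[i] might change some in[k], which can block parallelization. */
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:n])
    for (size_t i = 0; i < n; i++) {
        out[i] = factor * in[i];
    }
}
```

The promise is the programmer's responsibility: calling scale_into with overlapping in and out arrays would break it, and the compiler is not required to detect that.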
Declaring sizes for malloc¶
Note in the code that we use a variable of type size_t as the input to malloc, like this:
size_t num_elements = size * size * sizeof(float);
A = (float *)malloc(num_elements);
The reason for doing this is that size_t is an unsigned integer type specifically designed to hold the size, in bytes, of objects in memory on a particular machine. Its width depends on the machine you are compiling for (typically 32 bits on a 32-bit machine and 64 bits on a 64-bit machine), so it is wide enough to describe any object that can be allocated there. Using this data type makes our code more portable (between 32-bit and 64-bit machines, for example) and less prone to error, since it cannot hold a negative number. Note that, despite its name, num_elements holds the number of bytes to allocate, not the number of matrix elements.
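One detail worth keeping in mind: if size is declared as an int (as the MatrixUpdate signature suggests), the expression size * size is evaluated in int arithmetic before the result is widened to size_t, so it can overflow for sides of roughly 46,341 or more on machines with a 32-bit int. The sketch below is one possible defensive variant, not the book's code; the helper names matrix_bytes and alloc_square_matrix are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Number of bytes needed for a size x size matrix of floats.
   Returns 0 if the byte count would not fit in a size_t.
   (Hypothetical helper; the caller must pass a positive size.) */
size_t matrix_bytes(int size) {
    assert(size > 0);
    size_t n = (size_t)size;                 /* widen BEFORE multiplying */
    if (n > SIZE_MAX / n) return 0;          /* n * n would overflow */
    size_t cells = n * n;
    if (cells > SIZE_MAX / sizeof(float)) return 0;
    return cells * sizeof(float);
}

/* Allocate the flattened matrix; NULL means the request was too large
   or malloc itself failed. */
float *alloc_square_matrix(int size) {
    size_t bytes = matrix_bytes(size);
    if (bytes == 0) return NULL;
    return (float *)malloc(bytes);
}
```

Checking the pointer returned by alloc_square_matrix against NULL before using it covers both failure cases with a single test.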
Check results for correctness¶
Also note that, as in other examples in this book, we include a mechanism for determining whether the result is what we expect. This is always a good practice. We have also used a command line argument to choose whether to run this test. We were able to do this fairly easily because of the data manipulation we are doing on initial values of Pi; for other problems this can be more difficult, but you should still try to do it. One possible approach is to compare the parallel result to that of the sequential version, which you know is correct.
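The idea of comparing against a trusted sequential result can be sketched as a small comparison routine. The function name count_mismatches and the use of an absolute tolerance are assumptions for this sketch; the book's program uses its own check tailored to its computation.

```c
#include <assert.h>
#include <stddef.h>

/* Compare a computed array against a reference array, element by element.
   Returns how many elements differ by more than tol.
   (Hypothetical helper, not the book's check.) */
size_t count_mismatches(const float *result, const float *reference,
                        size_t n, float tol) {
    assert(result != NULL && reference != NULL);
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        float d = result[i] - reference[i];
        if (d < 0.0f) d = -d;            /* absolute difference */
        /* Written as !(d <= tol) so that a NaN also counts as a mismatch. */
        if (!(d <= tol)) bad++;
    }
    return bad;
}
```

A tolerance is used instead of exact equality because floating point results computed on the GPU and on the host can differ slightly in their last bits. A return value of zero means every element matched within the tolerance.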
Setting default values and overriding with command line arguments¶
Notice that at the beginning of main() we initialize the variables called size, verbose, and check to default values that enable the code to run properly. Then we read in command line arguments that can override those defaults. If no arguments are supplied, the program still runs to completion using the defaults. This is a good practice. There are additional practices that are often used in larger applications, such as checking that the size of the matrix is not too large, or that verbose printing wasn’t chosen when the matrix is large (we introduced a similar technique in the previous chapter).
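The pattern of defaults followed by optional overrides might look roughly like the sketch below. The Options struct, the parse_options function, the default size of 1024, and the printing limit of 16 are all placeholders chosen for this sketch, not the values used in the book's program.

```c
#include <assert.h>
#include <stdlib.h>

/* Run-time options. All names and default values here are hypothetical. */
typedef struct {
    int size;     /* length of one side of the square matrix */
    int verbose;  /* non-zero: print the matrix */
    int check;    /* non-zero: verify the results */
} Options;

/* Start from defaults, then let any supplied arguments override them,
   in the order: size, verbose, check. */
Options parse_options(int argc, char **argv) {
    assert(argv != NULL);
    Options opt = { 1024, 0, 0 };            /* defaults: run with no arguments */

    if (argc > 1) opt.size = atoi(argv[1]);
    if (argc > 2) opt.verbose = atoi(argv[2]);
    if (argc > 3) opt.check = atoi(argv[3]);

    if (opt.size <= 0) opt.size = 1024;      /* fall back on unusable sizes */
    if (opt.size > 16) opt.verbose = 0;      /* never print a large matrix */
    return opt;
}
```

In main(int argc, char **argv), a single call such as Options opt = parse_options(argc, argv); would then provide all three settings. A larger application would more likely use strtol so that malformed input can be reported rather than silently read as zero.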