8.3 A classic linear algebra example: matrix multiply
Let’s now introduce an operation that takes two matrices as input and creates a third as output: matrix multiply.
If you are unfamiliar with how matrix multiplication code works, you should read over our explanation in Chapter 6, section 2 of our PDC for Beginners book, where we describe the problem and introduce solutions, including sequential, OpenMP, and CUDA versions.
Here we are essentially able to start from the sequential version and add OpenACC pragmas that enable the pgcc/nvc compiler to generate a GPU version whose performance is similar to the CUDA version we introduced in PDC for Beginners. As we saw there, this example works extremely well on GPUs and enables us to work on much larger matrices without waiting very long for the operations to complete.
The key features to note in this code are:
The matrices are initialized differently from those in the prior sections' examples: each cell is set to a float equal to the value of its row index. Look for that in the output when the matrices are printed (only one of the initial matrices is printed).
The work to be done on the GPU is found in the function called MatrixMult().
In the MatrixMult() function, we follow the same programming-model concept for manycore machines as before: each thread will update one element of the resulting matrix. In this case it makes sense to parallelize over the two outer loops in this function and let each thread compute the innermost loop itself. That innermost loop is essentially a dot product calculation and should be completed by a single thread. Therefore, we have added a new clause to the loop pragma: seq. This indicates that the innermost loop runs sequentially within each thread created by the two outer loop directives. A sketch of this loop structure follows below.
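As a concrete illustration, here is a minimal sketch of how such a function can be structured, assuming square N x N matrices stored as row-major 1-D float arrays; the exact names, data clauses, and details in the book's code may differ.

```c
// Sketch of a MatrixMult() with OpenACC directives: the two outer loops
// are parallelized so each GPU thread computes one element of C, while
// the innermost dot-product loop is marked seq to run within a thread.
void MatrixMult(int N, const float * restrict A,
                const float * restrict B, float * restrict C)
{
    #pragma acc parallel loop copyin(A[0:N*N], B[0:N*N]) copyout(C[0:N*N])
    for (int i = 0; i < N; i++) {
        #pragma acc loop
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            #pragma acc loop seq
            for (int k = 0; k < N; k++) {
                sum += A[i*N + k] * B[k*N + j];
            }
            C[i*N + j] = sum;
        }
    }
}
```

Compiling with the NVIDIA compiler and the -acc flag (for example, nvc -acc -Minfo=accel) reports how each of these loops was mapped onto the GPU.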
As in prior examples in this chapter, the command line arguments are the same, in this order:
The size of one side of the square matrix being used.
Whether to print out the arrays after the manipulation (the default of zero means don't print; non-zero means print). This should be used only with very small matrix sizes, since the book's code runner does not return large print buffers and the output is hard to read.
Whether to check if the results are correct. The particular contrived computation we chose is easy to check.
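To see why the contrived computation is easy to check: assuming every element of row i in both input matrices holds the value i, as described above, every element of row i of the product equals i * (0 + 1 + ... + (N-1)) = i * N * (N-1) / 2. A hypothetical check function (the name checkResult and the tolerance are our own choices, not necessarily the book's) could look like this:

```c
#include <math.h>
#include <stdio.h>

// Hypothetical correctness check for the contrived initialization:
// every element of row i of the result should be i * N * (N-1) / 2.
int checkResult(int N, const float *C)
{
    double rowSum = (double)N * (N - 1) / 2.0;   // 0 + 1 + ... + (N-1)
    for (int i = 0; i < N; i++) {
        double expected = i * rowSum;
        for (int j = 0; j < N; j++) {
            // Allow a small relative tolerance for float rounding
            // accumulated over the N-term dot product.
            if (fabs(C[i*N + j] - expected) > 1e-3 * (expected + 1.0)) {
                printf("Mismatch at (%d, %d): got %f, expected %f\n",
                       i, j, C[i*N + j], expected);
                return 0;
            }
        }
    }
    return 1;   // all elements matched within tolerance
}
```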
Exercises
You can try running the following problem sizes for the side of each square matrix. What do you observe about the changes in the running times? You can refer to the detailed explanation in Chapter 6, section 2 of our PDC for Beginners book to get a better sense of the big-Oh order of this algorithm and why the times scale the way they do as you double the size of one side of each matrix (see the brief note after the list below).
1024
2048
4096
8192
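As a rough guide to what to expect: the standard triple-loop algorithm performs on the order of n^3 multiply-add operations for n x n matrices, so doubling the side length should multiply the total work, and ideally the sequential running time, by about (2n)^3 / n^3 = 8. On a GPU the observed ratio can differ because of memory traffic and how fully the device is occupied at each problem size.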
Visit Chapter 6, section 2 of our PDC for Beginners book and scroll to the complete code and the section labeled 'Experimenting with the programs'. There you will find a sequential and an OpenMP version of the code for this problem. Try collecting times for the sequential version in the first tab and the OpenMP version with 8 cores in the second tab, and compare them to the OpenACC GPU version here. How many times faster is the OpenACC version than either of those two for 1024x1024? Note that this is how we compare GPU device versions, which use many cores that are individually slower than a CPU core, to versions run on CPUs.
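One common way to express this comparison is the speedup ratio: divide the running time of the sequential (or OpenMP) version by the running time of the OpenACC GPU version for the same matrix size. For example, with purely illustrative numbers, 8 seconds for a sequential run and 0.1 seconds for the GPU run would give a speedup of 8 / 0.1 = 80 times.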
Final thought: a way of work
In this chapter we have used examples that stay true to a general ‘way of work’ for creating parallel versions of code.
First, have a method for verifying the correctness of your solution.
Next, run experiments to determine how well your program scales. From these examples, you can see that knowing how the algorithm works and what its sequential performance is in terms of big-Oh often helps you to see and explain the improvements in the manycore version.
For manycore GPU versions of algorithms, we usually examine their performance by considering how many times faster they run than a sequential or multicore version on the same problem size.
As we look at other examples, we will also see that determining where to focus our parallelization effort within a larger program is another step in this way of work.