8.2 Reduction with a parallel for loop
This example adds one feature to the example from the previous section: after updating each cell, we add the newly computed value to a running sum of all the values in the matrix.
As we have seen in other examples of this kind, many threads update this sum at the same time, so we need to make it part of a reduction clause to ensure that it is computed correctly (this clause has the same syntax as the one we use for OpenMP). Look for this in the function called matrixSum() below, which has pragmas for running it on the GPU.
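To make the idea concrete, here is a minimal sketch of a loop nest with a reduction. It is not the book's matrixSum(): the function name `matrix_sum_sketch`, the update rule (adding 1.0 to each cell), and the choice of OpenACC-style directives are all assumptions made for illustration. A compiler without accelerator support ignores the pragma and runs the loops sequentially, which gives the same sum.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for matrixSum(); the "+ 1.0" update rule is an
   assumption, not the computation used in the book's example. */
double matrix_sum_sketch(double *a, int n) {
    assert(a != NULL && n > 0);
    double sum = 0.0;

    /* reduction(+:sum) gives each thread a private partial sum and
       combines the partial sums once the loop finishes, so no two
       threads ever write to the shared variable at the same time. */
    #pragma acc parallel loop collapse(2) copy(a[0:n*n]) reduction(+:sum)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            size_t idx = (size_t)i * (size_t)n + (size_t)j;
            a[idx] += 1.0;   /* update the cell */
            sum += a[idx];   /* add the new value to the running sum */
        }
    }
    return sum;
}
```

With this assumed update rule, a 4x4 matrix whose cells all start at 0.0 returns a sum of 16.0, which is the kind of easily predicted value that makes a correctness check simple.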
Just like the previous example, this code takes 3 optional command line arguments, in this order:
- The size of one side of the square matrix being used. 
- Whether to print out the arrays after the manipulation (the default of zero means don’t print; non-zero means print). Use this only with very small values for the size of a side of the matrix, since this book doesn’t return large print buffers and large output is hard to read. 
- Whether to check if the results are correct. The particular contrived computation we chose is easy to check. 
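The three optional arguments could be read along the following lines. This is only a sketch: the `options_t` struct, the `parse_options` function, the default size of 1000, and the default of not checking are hypothetical. Only the print default of zero comes from the description above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical container for the three optional arguments. */
typedef struct {
    int size;   /* length of one side of the square matrix */
    int print;  /* 0 = don't print, non-zero = print */
    int check;  /* 0 = skip the correctness check, non-zero = run it */
} options_t;

/* Read the arguments in order; any that are missing keep their default.
   The size and check defaults are assumptions made for this sketch. */
options_t parse_options(int argc, char **argv) {
    assert(argc >= 1 && argv != NULL);
    options_t opt = {1000, 0, 0};

    if (argc > 1) opt.size  = atoi(argv[1]);
    if (argc > 2) opt.print = atoi(argv[2]);
    if (argc > 3) opt.check = atoi(argv[3]);

    /* Fall back to the assumed default if the size is not usable. */
    if (opt.size <= 0) opt.size = 1000;
    return opt;
}
```

Because the arguments are positional, asking for the check also means supplying the size and the print flag before it.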
Exercises
These exercises are very similar to the previous section’s example.
- Use the command line arguments above to see the result of the manipulation of the data elements and to confirm that the data check passes, including whether the sum has the right value. 
- After running the default, try matrix sizes that are larger to take advantage of the GPU. Try [‘5000’], [‘10000’], [‘20000’], and [‘40000’] in the command line arguments. Jot down times for each one. 
- How many times more calculations than the previous trial are we doing when we double the size of one side of the matrix like this? (Hint: try with 2x2, then 4x4, then 8x8, then 16x16, dividing the current one by the preceding one.) Comparing this with the times you jotted down in the previous exercise can give you some sense of the scalability of this GPU solution. 
- You could try creating a multicore CPU version and test it for correctness and timing. 
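The hint about doubling the side length can also be worked through mechanically. The sketch below assumes one calculation per cell; the helper names `cell_count` and `print_growth` are made up for this illustration and are not part of the book's code.

```c
#include <assert.h>
#include <stdio.h>

/* Number of cells in an n x n matrix, assuming one calculation per cell. */
unsigned long long cell_count(unsigned long long n) {
    return n * n;
}

/* Print the cell count for each side length and, from the second entry
   on, how many times larger it is than the preceding entry. */
void print_growth(const unsigned long long *sides, int count) {
    assert(sides != NULL && count > 0);
    for (int k = 0; k < count; k++) {
        unsigned long long cells = cell_count(sides[k]);
        if (k == 0) {
            printf("%llu x %llu: %llu cells\n", sides[k], sides[k], cells);
        } else {
            double ratio = (double)cells / (double)cell_count(sides[k - 1]);
            printf("%llu x %llu: %llu cells, %.1f times the previous\n",
                   sides[k], sides[k], cells, ratio);
        }
    }
}
```

Calling `print_growth` with the side lengths 2, 4, 8, and 16 prints the table described in the hint; the same call with 5000, 10000, 20000, and 40000 shows how the work grows across the timing runs.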
