8.3 A classic linear algebra example: matrix multiply
Let’s now introduce an operation that takes two matrices as input and creates a third as output: matrix multiply.
If you are unfamiliar with how matrix multiplication code works, you should read over our explanation in Chapter 6, section 2 of our PDC for Beginners book, where we describe the problem and introduce solutions, including sequential, OpenMP, and CUDA versions.
Here we are essentially able to start from the sequential version and add OpenACC pragmas that enable the pgcc/nvc compiler to generate a GPU version whose performance is similar to the CUDA version we introduced in PDC for Beginners. As we saw there, this example works extremely well on GPUs and enables us to work on much larger matrices without waiting very long for the operations to complete.
The key features to note in this code are:
The matrices are initialized differently from those in the prior sections' examples: each cell is set to a float equal to the value of its row index. Look for that in the output when the matrices are printed (only one of the initial matrices is printed).
The work to be done on the GPU is found in the function called MatrixMult().
In the MatrixMult() function, we follow the same programming-model concept for manycore machines as before: each thread will update one element of the resulting matrix. In this case it makes sense to parallelize over the two outer loops in this function and let each thread compute the innermost loop itself. That innermost loop is essentially a dot product calculation and should be completed by a single thread. Therefore, we have added a new clause to the loop pragma: seq. This indicates that the innermost loop runs sequentially within each thread created by the two outer loop directives. A sketch of this loop structure follows below.
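As a concrete illustration, here is a minimal sketch of how such a function can be structured, assuming square N x N matrices stored as row-major 1-D float arrays; the exact names, data clauses, and details in the book's code may differ.

```c
// Sketch of a MatrixMult() with OpenACC directives: the two outer loops
// are parallelized so each GPU thread computes one element of C, while
// the innermost dot-product loop is marked seq to run within a thread.
void MatrixMult(int N, const float * restrict A,
                const float * restrict B, float * restrict C)
{
    #pragma acc parallel loop copyin(A[0:N*N], B[0:N*N]) copyout(C[0:N*N])
    for (int i = 0; i < N; i++) {
        #pragma acc loop
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            #pragma acc loop seq
            for (int k = 0; k < N; k++) {
                sum += A[i*N + k] * B[k*N + j];
            }
            C[i*N + j] = sum;
        }
    }
}
```

Compiling with the NVIDIA compiler and the -acc flag (for example, nvc -acc -Minfo=accel) reports how each of these loops was mapped onto the GPU.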
As in prior examples in this chapter, the command line arguments are the same, in this order:
The size of one side of the square matrix being used.
Whether to print out the arrays after the manipulation (the default of zero means don't print; non-zero means print). This should be used only with very small matrix sizes, since the book's code runner does not return large print buffers and the output is hard to read.
Whether to check if the results are correct. The particular contrived computation we chose is easy to check.
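To see why the contrived computation is easy to check: assuming every element of row i in both input matrices holds the value i, as described above, every element of row i of the product equals i * (0 + 1 + ... + (N-1)) = i * N * (N-1) / 2. A hypothetical check function (the name checkResult and the tolerance are our own choices, not necessarily the book's) could look like this:

```c
#include <math.h>
#include <stdio.h>

// Hypothetical correctness check for the contrived initialization:
// every element of row i of the result should be i * N * (N-1) / 2.
int checkResult(int N, const float *C)
{
    double rowSum = (double)N * (N - 1) / 2.0;   // 0 + 1 + ... + (N-1)
    for (int i = 0; i < N; i++) {
        double expected = i * rowSum;
        for (int j = 0; j < N; j++) {
            // Allow a small relative tolerance for float rounding
            // accumulated over the N-term dot product.
            if (fabs(C[i*N + j] - expected) > 1e-3 * (expected + 1.0)) {
                printf("Mismatch at (%d, %d): got %f, expected %f\n",
                       i, j, C[i*N + j], expected);
                return 0;
            }
        }
    }
    return 1;   // all elements matched within tolerance
}
```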
Exercises
You can try running the following problem sizes for the side of each square matrix. What do you observe about the changes in the running times? You can refer to the detailed explanation in Chapter 6, section 2 of our PDC for Beginners book to get a better sense of the big-Oh order of this algorithm and why the times scale the way they do as you double the size of one side of each matrix (see the brief note after the list below).
1024
2048
4096
8192
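As a rough guide to what to expect: the standard triple-loop algorithm performs on the order of n^3 multiply-add operations for n x n matrices, so doubling the side length should multiply the total work, and ideally the sequential running time, by about (2n)^3 / n^3 = 8. On a GPU the observed ratio can differ because of memory traffic and how fully the device is occupied at each problem size.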
Visit Chapter 6, section 2 of our PDC for Beginners book and scroll to the complete code and the section labeled 'Experimenting with the programs'. There you will find a sequential and an OpenMP version of the code for this problem. Try collecting times for the sequential version in the first tab and the OpenMP version with 8 cores in the second tab, and compare them to the OpenACC GPU version here. How many times faster is the OpenACC version than either of those two for 1024x1024? Note that this is how we compare GPU device versions, which use many cores that are individually slower than a CPU core, to versions run on CPUs.
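One common way to express this comparison is the speedup ratio: divide the running time of the sequential (or OpenMP) version by the running time of the OpenACC GPU version for the same matrix size. For example, with purely illustrative numbers, 8 seconds for a sequential run and 0.1 seconds for the GPU run would give a speedup of 8 / 0.1 = 80 times.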
Final thought: a way of work
In this chapter we have used examples that stay true to a general ‘way of work’ for creating parallel versions of code.
First, have a method for verifying the correctness of your solution.
Next, run experiments to determine how well your program scales. From these examples, you can see that knowing how the algorithm works and what its sequential performance is in terms of big-Oh often helps you to see and explain the improvements in the manycore version.
For manycore GPU versions of algorithms, we usually examine their performance by considering how many times faster they run than a sequential or multicore version on the same problem size.
As we look at other examples, we will also see that determining where to focus our parallelization effort within a larger program is another step in this way of work.