5.2 Broadcast and Data Decomposition with Parallel for Loop
We now expand upon data decomposition using the parallel-for loop with equal-sized chunks to incorporate broadcast and gather. We begin by filling an array with values and broadcasting this array to all processes. Each process then works on its portion of the array, which is determined by the equal-sized chunks data decomposition pattern. Lastly, the completed portions from all processes are gathered into an array containing the final result. Below is a diagram of the code executing with 4 processes. The diagram assumes that the filled array has already been broadcast to all processes.
Note that we chose to keep the original array, array, intact. Each process allocates memory, myChunk, to store its completed portion of the array. Later, the completed portions from all processes are gathered into a final result array, gatherArray. Working on a copy like this is useful when we still want access to the initial array after the computation.
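If you would like to see the whole pattern in one place before running the program, here is a minimal sketch of the steps just described. The names array, myChunk, gatherArray, MAX, and CONDUCTOR follow the description above, but the details (the simple "work" of adding 1 to each element, and a MAX of 8) are assumptions for illustration, not necessarily what the example program does:

#include <mpi.h>
#include <stdio.h>

#define MAX 8        // assumed array size, a multiple of the number of processes
#define CONDUCTOR 0  // rank of the conducting process

int main(int argc, char **argv) {
    int id, numProcs;
    int array[MAX], gatherArray[MAX];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    // the conductor fills the original array
    if (id == CONDUCTOR) {
        for (int i = 0; i < MAX; i++) array[i] = i;
    }

    // every process receives a copy of the whole array
    MPI_Bcast(array, MAX, MPI_INT, CONDUCTOR, MPI_COMM_WORLD);

    // equal-sized chunks decomposition: each process works on its own slice
    int chunkSize = MAX / numProcs;
    int start = id * chunkSize;
    int myChunk[MAX];                       // only the first chunkSize entries are used
    for (int i = 0; i < chunkSize; i++) {
        myChunk[i] = array[start + i] + 1;  // assumed "work": add 1 to each element
    }

    // collect every process's chunk, in rank order, into gatherArray on the conductor
    MPI_Gather(myChunk, chunkSize, MPI_INT,
               gatherArray, chunkSize, MPI_INT,
               CONDUCTOR, MPI_COMM_WORLD);

    if (id == CONDUCTOR) {
        for (int i = 0; i < MAX; i++) printf("%d ", gatherArray[i]);
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}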
Exercise:
Run using 2, 4, and 8 processes
Use source code to trace execution and output
Why do you observe different output when you run it several times?
Explain behavior/effect of MPI_Bcast(), MPI_Gather().
Verify that the original array on each process has not changed by uncommenting the print() call in main
Optional: change MAX to be another multiple of 8, such as 16
5.3 Scatter, Data Decomposition with Parallel for Loop, then Gather
Recall this image from the previous example:
In that example, the conductor process broadcast the entire array to all processes. In this next example, we instead illustrate how to scatter the original array so that every process receives only a portion of it. Each process then performs the same computation on its portion, and the portions are gathered back together onto the conductor process.
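As a rough sketch of the change, assuming the same style of names as before and an ARRAY_SIZE that divides evenly among the processes (the names chunkArray and the "add 1" work are illustrative assumptions), the broadcast is replaced by a scatter, so each process only ever holds its own chunk:

#include <mpi.h>
#include <stdio.h>

#define ARRAY_SIZE 8   // assumed size, evenly divisible by the number of processes
#define CONDUCTOR 0

int main(int argc, char **argv) {
    int id, numProcs;
    int array[ARRAY_SIZE], gatherArray[ARRAY_SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &numProcs);

    // only the conductor fills the original array
    if (id == CONDUCTOR) {
        for (int i = 0; i < ARRAY_SIZE; i++) array[i] = i;
    }

    int chunkSize = ARRAY_SIZE / numProcs;
    int chunkArray[ARRAY_SIZE];             // holds only this process's portion

    // the conductor sends a different chunk of array to each process (including itself)
    MPI_Scatter(array, chunkSize, MPI_INT,
                chunkArray, chunkSize, MPI_INT,
                CONDUCTOR, MPI_COMM_WORLD);

    // assumed "work": each process updates its own chunk in place
    for (int i = 0; i < chunkSize; i++) chunkArray[i] += 1;

    // the conductor collects the chunks, in rank order, into gatherArray
    MPI_Gather(chunkArray, chunkSize, MPI_INT,
               gatherArray, chunkSize, MPI_INT,
               CONDUCTOR, MPI_COMM_WORLD);

    if (id == CONDUCTOR) {
        for (int i = 0; i < ARRAY_SIZE; i++) printf("%d ", gatherArray[i]);
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}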
Exercise:
Run, using 1, 2, 4, and 8 processes
Use source code to trace execution and output
Explain behavior/effect of MPI_Scatter(), MPI_Gather().
Optional: change ARRAY_SIZE to be another multiple of 8, such as 16
Optional: eliminate the calls to print() that display each array at each step, keeping only the print of the final gatherArray
5.4 Scatter and Gather with any size array and odd or even number of processes
In the previous two examples, we needed to ensure that the array size was divisible by the number of processes. Since this is often not the case in a real application, MPI provides functions that enable us to scatter and gather variable-sized 'chunks' of our arrays. We still need to ensure that the number of processes does not exceed the array size.
The functions we will use for this are called MPI_Scatterv and MPI_Gatherv. We will split the array into nearly equal-sized chunks using the technique demonstrated in example 02 of the program structure section of the previous chapter.
In the code below, the call to MPI_Scatterv looks like this:
MPI_Scatterv(scatterArray, chunkSizeArray, offsetArray, MPI_INT,
             chunkArray, chunkSize, MPI_INT,
             CONDUCTOR, MPI_COMM_WORLD);
Note
These new functions take second and third arguments that are arrays of integers describing how to split the original data array. The conductor process uses these arrays to send a portion to each process. As with all coordination functions, every process must call this function.
The second and third arguments are arrays whose size equals the number of processes: the values at index 0 are for process 0, those at index 1 for process 1, and so on. The second argument holds the number of elements to be scattered to each process, and the third argument holds the offset into the original array at which each process's chunk begins.
The code for setting up these arrays is very similar to how we set up nearly equal sized chunks for decomposition using the for-loop pattern, and looks like this:
// determine the two possible chunk sizes
int chunkSize1 = (int)ceil(((double)ARRAY_SIZE) / numProcs);
int chunkSize2 = chunkSize1 - 1;
int remainder = ARRAY_SIZE % numProcs;

// compute the chunkSize and offset array entries for each process
for (int i = 0; i < numProcs; ++i) {
    if (remainder == 0 || (remainder != 0 && i < remainder)) {
        // the first 'remainder' processes (or all of them, if the array
        // divides evenly) get the larger chunk
        chunkSizeArray[i] = chunkSize1;
        offsetArray[i] = chunkSize1 * i;
    } else {
        // the remaining processes get the smaller chunk,
        // offset past all of the larger chunks
        chunkSizeArray[i] = chunkSize2;
        offsetArray[i] = (remainder * chunkSize1) + (chunkSize2 * (i - remainder));
    }
}
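If you want to check this arithmetic before involving MPI at all, here is a small standalone sketch (an illustration only, not part of the example program) that runs the same loop for a 10-element array and 3 pretend processes and prints the two arrays; it should report chunk sizes of 4, 3, and 3 with offsets of 0, 4, and 7 (compile with -lm if your compiler requires it for ceil):

#include <stdio.h>
#include <math.h>

#define ARRAY_SIZE 10

int main(void) {
    int numProcs = 3;                       // pretend we have 3 processes
    int chunkSizeArray[3], offsetArray[3];

    // same computation as in the example above
    int chunkSize1 = (int)ceil(((double)ARRAY_SIZE) / numProcs);
    int chunkSize2 = chunkSize1 - 1;
    int remainder = ARRAY_SIZE % numProcs;

    for (int i = 0; i < numProcs; ++i) {
        if (remainder == 0 || (remainder != 0 && i < remainder)) {
            chunkSizeArray[i] = chunkSize1;
            offsetArray[i] = chunkSize1 * i;
        } else {
            chunkSizeArray[i] = chunkSize2;
            offsetArray[i] = (remainder * chunkSize1) + (chunkSize2 * (i - remainder));
        }
    }

    for (int i = 0; i < numProcs; ++i) {
        printf("process %d: chunk size %d, offset %d\n",
               i, chunkSizeArray[i], offsetArray[i]);
    }
    return 0;
}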
Suppose that the data array is set to have 10 elements and we use 3 processes. After the conductor initializes the data and the above code executes, the state of these arrays at the point where we call MPI_Scatterv looks like this:
Then after MPI_Scatterv has completed, the arrays on each process would look like this:
The complete code is below so that you can run it. It also contains a call to MPI_Gatherv that enables the conductor process to gather all of the computed values back into a separate array.
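As a preview of that gathering step, a sketch of an MPI_Gatherv call might look like the following; it assumes the names from the MPI_Scatterv call above and a result array gatherArray on the conductor, and it reuses the same per-process counts and offsets so each chunk lands back in its original position (the full program below shows the actual call it uses):

// each process contributes chunkSize elements from chunkArray;
// the conductor places them into gatherArray using the same
// per-process counts and offsets that were used to scatter
MPI_Gatherv(chunkArray, chunkSize, MPI_INT,
            gatherArray, chunkSizeArray, offsetArray, MPI_INT,
            CONDUCTOR, MPI_COMM_WORLD);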
Exercise:
Compile and run, using 1, 2, 3, 4, 5, and 10 processes
Use source code to trace execution and output
Explain behavior/effect of MPI_Scatterv(), MPI_Gatherv().
Optional: change ARRAY_SIZE