
6.3 MPI: the Message Passing Interface

Many early-generation commercial parallel computers were based on the message-passing architecture due to its lower cost relative to shared-address-space architectures. Since message-passing is the natural programming paradigm for these machines, many different message-passing libraries were developed. In fact, message-passing became the modern-age form of assembly language, in which every hardware vendor provided its own library that performed very well on its own hardware but was incompatible with the parallel computers offered by other vendors. Many of the differences between the various vendor-specific message-passing libraries were only syntactic; however, often enough there were serious semantic differences that required significant re-engineering to port a message-passing program from one library to another.

The message-passing interface, or MPI as it is commonly known, was created to essentially solve this problem. MPI defines a standard library for message-passing that can be used to develop portable message-passing programs using either C or Fortran. The MPI standard defines both the syntax as well as the semantics of a core set of library routines that are very useful in writing message-passing programs. MPI was developed by a group of researchers from academia and industry, and has enjoyed wide support by almost all the hardware vendors. Vendor implementations of MPI are available on almost all commercial parallel computers.

The MPI library contains over 125 routines, but the number of key concepts is much smaller. In fact, it is possible to write fully-functional message-passing programs by using only the six routines shown in Table 6.1. These routines are used to initialize and terminate the MPI library, to get information about the parallel computing environment, and to send and receive messages.

In this section we describe these routines as well as some basic concepts that are essential in writing correct and efficient message-passing programs using MPI.

Table 6.1. The minimal set of MPI routines.

MPI_Init         Initializes MPI.
MPI_Finalize     Terminates MPI.
MPI_Comm_size    Determines the number of processes.
MPI_Comm_rank    Determines the label of the calling process.
MPI_Send         Sends a message.
MPI_Recv         Receives a message.

6.3.1 Starting and Terminating the MPI Library

MPI_Init is called prior to any calls to other MPI routines. Its purpose is to initialize the MPI environment. Calling MPI_Init more than once during the execution of a program will lead to an error. MPI_Finalize is called at the end of the computation, and it performs various clean-up tasks to terminate the MPI environment. No MPI calls may be performed after MPI_Finalize has been called, not even MPI_Init. Both MPI_Init and MPI_Finalize must be called by all the processes; otherwise, MPI's behavior will be undefined. The exact calling sequences of these two routines for C are as follows:

int MPI_Init(int *argc, char ***argv) 
int MPI_Finalize() 

The arguments argc and argv of MPI_Init are the command-line arguments of the C program. An MPI implementation is expected to remove from the argv array any command-line arguments that should be processed by the implementation before returning to the program, and to decrement argc accordingly. Thus, command-line processing should be performed only after MPI_Init has been called. Upon successful execution, MPI_Init and MPI_Finalize return MPI_SUCCESS; otherwise they return an implementation-defined error code.

The bindings and calling sequences of these two functions are illustrative of the naming practices and argument conventions followed by MPI. All MPI routines, data-types, and constants are prefixed by "MPI_". The return code for successful completion is MPI_SUCCESS. This and other MPI constants and data-structures are defined for C in the file "mpi.h". This header file must be included in each MPI program.
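For instance, the following minimal sketch of an MPI program simply starts up and shuts down the library, checking the return code of MPI_Init against MPI_SUCCESS:

#include <mpi.h> 
#include <stdio.h> 

int main(int argc, char *argv[]) 
{ 
  /* Initialize the MPI environment and check the return code. */ 
  if (MPI_Init(&argc, &argv) != MPI_SUCCESS) { 
    fprintf(stderr, "MPI_Init failed\n"); 
    return 1; 
  } 

  /* ... the rest of the MPI program goes here ... */ 

  MPI_Finalize();   /* No MPI calls are allowed after this point. */ 
  return 0; 
} 

Note that the default behavior of most MPI implementations is to abort the program when an error is encountered, so in practice such explicit checks are often omitted.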

6.3.2 Communicators

A key concept used throughout MPI is that of the communication domain. A communication domain is a set of processes that are allowed to communicate with each other. Information about communication domains is stored in variables of type MPI_Comm, called communicators. These communicators are used as arguments to all of MPI's message-transfer routines and they uniquely identify the processes participating in a message-transfer operation. Note that each process can belong to many different (possibly overlapping) communication domains.

A communicator thus defines the set of processes that may exchange messages, that is, a communication domain. In general, all the processes may need to communicate with each other. For this reason, MPI defines a default communicator called MPI_COMM_WORLD, which includes all the processes involved in the parallel execution. However, in many cases we want to perform communication only within (possibly overlapping) groups of processes. By using a different communicator for each such group, we can ensure that no messages will ever interfere with messages destined for any other group. How to create and use such communicators is described at a later point in this chapter. For now, it suffices to use MPI_COMM_WORLD as the communicator argument to all the MPI functions that require a communicator.

6.3.3 Getting Information

The MPI_Comm_size and MPI_Comm_rank functions are used to determine the number of processes and the label of the calling process, respectively. The calling sequences of these routines are as follows:

int MPI_Comm_size(MPI_Comm comm, int *size) 
int MPI_Comm_rank(MPI_Comm comm, int *rank) 

The function MPI_Comm_size returns in the variable size the number of processes that belong to the communicator comm. So, when there is a single process per processor, the call MPI_Comm_size(MPI_COMM_WORLD, &size) will return in size the number of processors used by the program. Every process that belongs to a communicator is uniquely identified by its rank. The rank of a process is an integer that ranges from zero up to the size of the communicator minus one. A process can determine its rank in a communicator by using the MPI_Comm_rank function, which takes two arguments: the communicator and an integer variable rank. Upon return, the variable rank stores the rank of the process. Note that each process that calls either one of these functions must belong to the supplied communicator; otherwise an error will occur.

Example 6.2 Hello World

We can use the four MPI functions just described to write a program that prints out a "Hello World" message from each processor.

#include <mpi.h> 
#include <stdio.h> 

int main(int argc, char *argv[]) 
{ 
  int npes, myrank; 

  MPI_Init(&argc, &argv); 
  MPI_Comm_size(MPI_COMM_WORLD, &npes); 
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank); 
  printf("From process %d out of %d, Hello World!\n", 
         myrank, npes); 
  MPI_Finalize(); 
  return 0; 
} 


6.3.4 Sending and Receiving Messages

The basic functions for sending and receiving messages in MPI are MPI_Send and MPI_Recv, respectively. The calling sequences of these routines are as follows:

int MPI_Send(void *buf, int count, MPI_Datatype datatype, 
        int dest, int tag, MPI_Comm comm) 
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, 
        int source, int tag, MPI_Comm comm, MPI_Status *status) 

MPI_Send sends the data stored in the buffer pointed to by buf. This buffer consists of consecutive entries of the type specified by the parameter datatype. The number of entries in the buffer is given by the parameter count. The correspondence between MPI datatypes and those provided by C is shown in Table 6.2. Note that for all C datatypes, an equivalent MPI datatype is provided. In addition, MPI provides two datatypes that are not part of the C language. These are MPI_BYTE and MPI_PACKED.

MPI_BYTE corresponds to a byte (8 bits) and MPI_PACKED corresponds to a collection of data items that has been created by packing non-contiguous data. Note that the length of the message in MPI_Send, as well as in other MPI routines, is specified in terms of the number of entries being sent and not in terms of the number of bytes. Specifying the length in terms of the number of entries has the advantage of making the MPI code portable, since the number of bytes used to store various datatypes can be different for different architectures.

The destination of the message sent by MPI_Send is uniquely specified by the dest and comm arguments. The dest argument is the rank of the destination process in the communication domain specified by the communicator comm. Each message has an integer-valued tag associated with it. This is used to distinguish different types of messages. The message-tag can take values ranging from zero up to the MPI defined constant MPI_TAG_UB. Even though the value of MPI_TAG_UB is implementation specific, it is at least 32,767.

MPI_Recv receives a message sent by a process whose rank is given by the source argument in the communication domain specified by the comm argument. The tag of the sent message must match the one specified by the tag argument. If there are multiple messages with the same tag from the same process, then any one of these messages is received. MPI allows wildcard arguments for both source and tag. If source is set to MPI_ANY_SOURCE, then any process of the communication domain can be the source of the message. Similarly, if tag is set to MPI_ANY_TAG, then messages with any tag are accepted. The received message is stored in contiguous locations in the buffer pointed to by buf. The count and datatype arguments of MPI_Recv specify the length of the supplied buffer. The received message must be of length equal to or less than this length; consequently, the receiving process does not need to know the exact size of the message in advance. If the received message is larger than the supplied buffer, an overflow error occurs and the routine returns the error MPI_ERR_TRUNCATE.
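As a concrete illustration, the following sketch (which assumes the program is run with at least two processes) has process 0 send an array of 100 doubles to process 1. Note that the count argument specifies 100 entries of type MPI_DOUBLE, not a number of bytes:

#include <mpi.h> 
#include <stdio.h> 

int main(int argc, char *argv[]) 
{ 
  double a[100]; 
  int i, myrank; 
  MPI_Status status; 

  MPI_Init(&argc, &argv); 
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank); 
  if (myrank == 0) { 
    for (i=0; i<100; i++) 
      a[i] = i; 
    /* Send 100 entries of type MPI_DOUBLE to process 1, using tag 7. */ 
    MPI_Send(a, 100, MPI_DOUBLE, 1, 7, MPI_COMM_WORLD); 
  } 
  else if (myrank == 1) { 
    /* Receive the 100 doubles sent by process 0 with tag 7. */ 
    MPI_Recv(a, 100, MPI_DOUBLE, 0, 7, MPI_COMM_WORLD, &status); 
    printf("Process 1 received %f ... %f\n", a[0], a[99]); 
  } 
  MPI_Finalize(); 
  return 0; 
} 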

Table 6.2. Correspondence between the datatypes supported by MPI and those supported by C.

MPI Datatype           C Datatype

MPI_CHAR               signed char
MPI_SHORT              signed short int
MPI_INT                signed int
MPI_LONG               signed long int
MPI_UNSIGNED_CHAR      unsigned char
MPI_UNSIGNED_SHORT     unsigned short int
MPI_UNSIGNED           unsigned int
MPI_UNSIGNED_LONG      unsigned long int
MPI_FLOAT              float
MPI_DOUBLE             double
MPI_LONG_DOUBLE        long double
MPI_BYTE               (no corresponding C datatype)
MPI_PACKED             (no corresponding C datatype)

After a message has been received, the status variable can be used to get information about the MPI_Recv operation. In C, status is stored using the MPI_Status data structure, which contains (at least) the following three fields:

typedef struct MPI_Status { 
  int MPI_SOURCE; 
  int MPI_TAG; 
  int MPI_ERROR; 
} MPI_Status; 

MPI_SOURCE and MPI_TAG store the source and the tag of the received message. They are particularly useful when MPI_ANY_SOURCE and MPI_ANY_TAG are used for the source and tag arguments. MPI_ERROR stores the error-code of the received message.

The status argument also returns information about the length of the received message. This information is not directly accessible from the status variable, but it can be retrieved by calling the MPI_Get_count function. The calling sequence of this function is as follows:

int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, 
        int *count) 

MPI_Get_count takes as arguments the status returned by MPI_Recv and the type of the received data in datatype, and returns the number of entries that were actually received in the count variable.
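The following sketch puts these pieces together (it assumes that the program is run with at most 100 processes, so that the receive buffer is large enough). Every process other than 0 sends a message whose length depends on its rank, and process 0 receives the messages using the wildcards and then uses the status fields and MPI_Get_count to determine the source, tag, and length of each message:

#include <mpi.h> 
#include <stdio.h> 

int main(int argc, char *argv[]) 
{ 
  int buf[100], i, npes, myrank, count; 
  MPI_Status status; 

  MPI_Init(&argc, &argv); 
  MPI_Comm_size(MPI_COMM_WORLD, &npes); 
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank); 

  if (myrank != 0) { 
    /* Each process other than 0 sends myrank integers to process 0, 
       using its own rank as the message tag. */ 
    for (i=0; i<myrank; i++) 
      buf[i] = myrank; 
    MPI_Send(buf, myrank, MPI_INT, 0, myrank, MPI_COMM_WORLD); 
  } 
  else { 
    /* Process 0 receives one message from each of the other processes, 
       in whatever order they arrive. */ 
    for (i=1; i<npes; i++) { 
      MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, 
               MPI_COMM_WORLD, &status); 
      MPI_Get_count(&status, MPI_INT, &count); 
      printf("Received %d entries from process %d with tag %d\n", 
             count, status.MPI_SOURCE, status.MPI_TAG); 
    } 
  } 
  MPI_Finalize(); 
  return 0; 
} 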

MPI_Recv returns only after the requested message has been received and copied into the buffer. That is, MPI_Recv is a blocking receive operation. However, MPI allows two different implementations for MPI_Send. In the first implementation, MPI_Send returns only after the corresponding MPI_Recv has been issued and the message has been sent to the receiver. In the second implementation, MPI_Send first copies the message into a buffer and then returns, without waiting for the corresponding MPI_Recv to be executed. In either case, once MPI_Send returns, the buffer pointed to by the buf argument can be safely reused and overwritten. MPI programs must run correctly regardless of which of the two methods is used to implement MPI_Send. Such programs are called safe. In writing safe MPI programs, it is sometimes helpful to ignore the buffered implementation of MPI_Send and simply treat it as a blocking send operation.
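A practical way to check whether a program depends on the buffered implementation is to replace each call to MPI_Send with a call to MPI_Ssend, MPI's synchronous send, which always blocks until the matching receive has been posted. MPI_Ssend has the same calling sequence as MPI_Send:

int MPI_Ssend(void *buf, int count, MPI_Datatype datatype, 
        int dest, int tag, MPI_Comm comm) 

A program that runs correctly after this substitution is safe in the above sense.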

Avoiding Deadlocks The semantics of MPI_Send and MPI_Recv place some restrictions on how we can mix and match send and receive operations. For example, consider the following piece of code in which process 0 sends two messages with different tags to process 1, and process 1 receives them in the reverse order.

int a[10], b[10], myrank; 
MPI_Status status; 
... 
MPI_Comm_rank(MPI_COMM_WORLD, &myrank); 
if (myrank == 0) { 
  MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD); 
  MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD); 
} 
else if (myrank == 1) { 
  MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status); 
  MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status); 
} 
... 

If MPI_Send is implemented using buffering, then this code will run correctly provided that sufficient buffer space is available. However, if MPI_Send is implemented by blocking until the matching receive has been issued, then neither of the two processes will be able to proceed. This is because process zero (i.e., myrank == 0) will wait until process one issues the matching MPI_Recv (i.e., the one with tag equal to 1), and at the same time process one will wait until process zero performs the matching MPI_Send (i.e., the one with tag equal to 2). This code fragment is not safe, as its behavior is implementation dependent. It is up to the programmer to ensure that his or her program will run correctly on any MPI implementation. The problem in this program can be corrected by matching the order in which the send and receive operations are issued. Similar deadlock situations can also occur when a process sends a message to itself. Even though this is legal, its behavior is implementation dependent and must be avoided.
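For instance, the fragment above becomes safe if process 1 receives the two messages in the same order in which process 0 sends them:

int a[10], b[10], myrank; 
MPI_Status status; 
... 
MPI_Comm_rank(MPI_COMM_WORLD, &myrank); 
if (myrank == 0) { 
  MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD); 
  MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD); 
} 
else if (myrank == 1) { 
  MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status); 
  MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status); 
} 
... 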

Improper use of MPI_Send and MPI_Recv can also lead to deadlocks in situations in which each process needs to both send and receive a message in a circular fashion. Consider the following piece of code, in which process i sends a message to process i + 1 (modulo the number of processes) and receives a message from process i - 1 (modulo the number of processes).

int a[10], b[10], npes, myrank; 
MPI_Status status; 
... 
MPI_Comm_size(MPI_COMM_WORLD, &npes); 
MPI_Comm_rank(MPI_COMM_WORLD, &myrank); 
MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD); 
MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, 
         &status); 
... 

When MPI_Send is implemented using buffering, the program will work correctly, since every call to MPI_Send will get buffered, allowing the subsequent call to MPI_Recv to be performed, which will transfer the required data. However, if MPI_Send blocks until the matching receive has been issued, all processes will enter an infinite wait state, each waiting for its neighboring process to issue an MPI_Recv operation. Note that the deadlock remains even when there are only two processes. Thus, when pairs of processes need to exchange data, the above method leads to an unsafe program. The example can be made safe by rewriting it as follows:

int a[10], b[10], npes, myrank; 
MPI_Status status; 
... 
MPI_Comm_size(MPI_COMM_WORLD, &npes); 
MPI_Comm_rank(MPI_COMM_WORLD, &myrank); 
if (myrank%2 == 1) { 
  MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD); 
  MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, 
           &status); 
} 
else { 
  MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, 
           &status); 
  MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD); 
} 
... 

This new implementation partitions the processes into two groups: the odd-numbered processes and the even-numbered processes. The odd-numbered processes perform a send followed by a receive, and the even-numbered processes perform a receive followed by a send. Thus, when an odd-numbered process calls MPI_Send, the target process (which has an even number) calls MPI_Recv to receive that message before attempting to send its own message.

Sending and Receiving Messages Simultaneously This communication pattern appears frequently in message-passing programs, and for this reason MPI provides the MPI_Sendrecv function, which both sends and receives a message.

MPI_Sendrecv does not suffer from the circular deadlock problems of MPI_Send and MPI_Recv. You can think of MPI_Sendrecv as allowing data to travel for both send and receive simultaneously. The calling sequence of MPI_Sendrecv is the following:

int MPI_Sendrecv(void *sendbuf, int sendcount, 
        MPI_Datatype senddatatype, int dest, int sendtag, 
        void *recvbuf, int recvcount, MPI_Datatype recvdatatype, 
        int source, int recvtag, MPI_Comm comm, 
        MPI_Status *status) 

The arguments of MPI_Sendrecv are essentially the combination of the arguments of MPI_Send and MPI_Recv. The send and receive buffers must be disjoint, and the source and destination of the messages can be the same or different. The safe version of our earlier example using MPI_Sendrecv is as follows.

int a[10], b[10], npes, myrank; 
MPI_Status status; 
... 
MPI_Comm_size(MPI_COMM_WORLD, &npes); 
MPI_Comm_rank(MPI_COMM_WORLD, &myrank); 
MPI_Sendrecv(a, 10, MPI_INT, (myrank+1)%npes, 1, 
             b, 10, MPI_INT, (myrank-1+npes)%npes, 1, 
             MPI_COMM_WORLD, &status); 
... 

In many programs, the requirement that the send and receive buffers of MPI_Sendrecv be disjoint may force us to use a temporary buffer. This increases the amount of memory required by the program and also increases the overall run time due to the extra copy. This problem can be avoided by using the MPI_Sendrecv_replace function. This function performs a blocking send and receive, but it uses a single buffer for both operations. That is, the received data replaces the data that was sent out of the buffer. The calling sequence of this function is the following:

int MPI_Sendrecv_replace(void *buf, int count, 
        MPI_Datatype datatype, int dest, int sendtag, 
        int source, int recvtag, MPI_Comm comm, 
        MPI_Status *status) 

Note that both the send and receive operations must transfer data of the same datatype.
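For instance, the ring exchange shown earlier can be written with a single buffer as follows; upon return, each process's array a has been overwritten with the data it received from process (myrank-1+npes)%npes:

int a[10], npes, myrank; 
MPI_Status status; 
... 
MPI_Comm_size(MPI_COMM_WORLD, &npes); 
MPI_Comm_rank(MPI_COMM_WORLD, &myrank); 
MPI_Sendrecv_replace(a, 10, MPI_INT, (myrank+1)%npes, 1, 
        (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status); 
... 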

6.3.5 Example: Odd-Even Sort

We will now use the MPI functions described in the previous sections to write a complete message-passing program that sorts a list of numbers using the odd-even sorting algorithm. Recall from Section 9.3.1 that the odd-even sorting algorithm sorts a sequence of n elements using p processes in a total of p phases. During each of these phases, the odd- or even-numbered processes perform a compare-split step with their right neighbors. The MPI program for performing the odd-even sort in parallel is shown in Program 6.1. To simplify the presentation, this program assumes that n is divisible by p.

Program 6.1 Odd-Even Sorting
  #include <stdlib.h> 
  #include <mpi.h> /* Include MPI's header file */ 

  /* Forward declarations of the routines defined below */ 
  void CompareSplit(int nlocal, int *elmnts, int *relmnts, int *wspace, 
                    int keepsmall); 
  int IncOrder(const void *e1, const void *e2); 

  int main(int argc, char *argv[]) 
  { 
    int n;         /* The total number of elements to be sorted */ 
    int npes;      /* The total number of processes */ 
    int myrank;    /* The rank of the calling process */ 
    int nlocal;    /* The number of elements stored locally */ 
    int *elmnts;   /* The array that stores the local elements */ 
    int *relmnts;  /* The array that stores the received elements */ 
    int oddrank;   /* The rank of the partner process during odd-phase communication */ 
    int evenrank;  /* The rank of the partner process during even-phase communication */ 
    int *wspace;   /* Working space during the compare-split operation */ 
    int i; 
    MPI_Status status; 

    /* Initialize MPI and get system information */ 
    MPI_Init(&argc, &argv); 
    MPI_Comm_size(MPI_COMM_WORLD, &npes); 
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank); 

    n = atoi(argv[1]); 
    nlocal = n/npes; /* Compute the number of elements to be stored locally. */ 

    /* Allocate memory for the various arrays */ 
    elmnts  = (int *)malloc(nlocal*sizeof(int)); 
    relmnts = (int *)malloc(nlocal*sizeof(int)); 
    wspace  = (int *)malloc(nlocal*sizeof(int)); 

    /* Fill-in the elmnts array with random elements */ 
    srandom(myrank); 
    for (i=0; i<nlocal; i++) 
      elmnts[i] = random(); 

    /* Sort the local elements using the built-in quicksort routine */ 
    qsort(elmnts, nlocal, sizeof(int), IncOrder); 

    /* Determine the ranks of the processes that myrank needs to communicate 
       with during the odd and even phases of the algorithm */ 
    if (myrank%2 == 0) { 
      oddrank  = myrank-1; 
      evenrank = myrank+1; 
    } 
    else { 
      oddrank  = myrank+1; 
      evenrank = myrank-1; 
    } 

    /* Set the ranks of the processes at the ends of the linear array */ 
    if (oddrank == -1 || oddrank == npes) 
      oddrank = MPI_PROC_NULL; 
    if (evenrank == -1 || evenrank == npes) 
      evenrank = MPI_PROC_NULL; 

    /* Get into the main loop of the odd-even sorting algorithm */ 
    for (i=0; i<npes; i++) { 
      if (i%2 == 1) /* Odd phase */ 
        MPI_Sendrecv(elmnts, nlocal, MPI_INT, oddrank, 1, relmnts, 
            nlocal, MPI_INT, oddrank, 1, MPI_COMM_WORLD, &status); 
      else /* Even phase */ 
        MPI_Sendrecv(elmnts, nlocal, MPI_INT, evenrank, 1, relmnts, 
            nlocal, MPI_INT, evenrank, 1, MPI_COMM_WORLD, &status); 

      /* Processes at the ends of the linear array have no partner during 
         some phases; they simply keep their elements for that phase. */ 
      if (status.MPI_SOURCE != MPI_PROC_NULL) 
        CompareSplit(nlocal, elmnts, relmnts, wspace, 
                     myrank < status.MPI_SOURCE); 
    } 

    free(elmnts); free(relmnts); free(wspace); 
    MPI_Finalize(); 

    return 0; 
  } 

  /* This is the CompareSplit function */ 
  void CompareSplit(int nlocal, int *elmnts, int *relmnts, int *wspace, 
                    int keepsmall) 
  { 
    int i, j, k; 

    for (i=0; i<nlocal; i++) 
      wspace[i] = elmnts[i]; /* Copy the elmnts array into the wspace array */ 

    if (keepsmall) { /* Keep the nlocal smaller elements */ 
      for (i=j=k=0; k<nlocal; k++) { 
        if (j == nlocal || (i < nlocal && wspace[i] < relmnts[j])) 
          elmnts[k] = wspace[i++]; 
        else 
          elmnts[k] = relmnts[j++]; 
      } 
    } 
    else { /* Keep the nlocal larger elements */ 
      for (i=k=nlocal-1, j=nlocal-1; k>=0; k--) { 
        if (j < 0 || (i >= 0 && wspace[i] >= relmnts[j])) 
          elmnts[k] = wspace[i--]; 
        else 
          elmnts[k] = relmnts[j--]; 
      } 
    } 
  } 

  /* The IncOrder function that is called by qsort is defined as follows */ 
  int IncOrder(const void *e1, const void *e2) 
  { 
    return (*((int *)e1) - *((int *)e2)); 
  } 
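The way in which an MPI program is compiled and executed depends on the particular installation; a typical sequence (assuming the source file is named odd_even_sort.c and that the installation provides the usual mpicc and mpiexec commands) is:

mpicc -o odd_even_sort odd_even_sort.c 
mpiexec -n 4 ./odd_even_sort 16000 

which runs the program with four processes to sort 16,000 elements (a number divisible by the number of processes, as the program assumes).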