The Performance of MPI Derived Types on an SGI Origin 2000, a Cray T3E-900, a Myrinet Linux Cluster and an Ethernet Linux Cluster

Glenn R. Luecke, Silvia Spanoyannis, James Coyle
[email protected],
[email protected],
[email protected] Iowa State University Ames, Iowa 50011-2251 USA
October 17, 2001
Abstract
This paper compares the performance of MPI derived types with user packing on an SGI Origin 2000, a Cray T3E-900, a Myrinet Linux cluster and an Ethernet Linux cluster. Four communication tests using MPI derived type routines and packing/unpacking techniques are run for a variety of message sizes using four processors on these machines. Except for one test, MPI derived types outperform user packing. Relative performance between machines varied with the test and the message size used. All tests showed that employing MPI derived types is an easy-to-use and elegant way to send non-contiguous data.
Keywords: Parallel Computers; MPI Derived Types; SGI Origin 2000; Myrinet Linux Cluster; Ethernet Linux Cluster; Cray T3E-900; MPI Library.
1 Introduction

MPI is a standard for passing messages in Fortran, C and C++ programs for distributed memory parallel computers [18, 14]. To send non-contiguous data between processors with MPI, one must either first pack the data into a contiguous buffer or use MPI derived types. The primary purpose of this paper is to compare the performance of MPI derived types with user packing on an SGI Origin 2000, a Cray T3E-900, a Myrinet Linux cluster and an Ethernet Linux cluster. It is also the purpose of this paper to demonstrate that using MPI derived types is an easy-to-use, elegant method for sending non-contiguous data. Measuring the performance of MPI derived types is also being done at the University of Karlsruhe in Germany, where MPI derived type performance tests have been added to the SKaMPI (Special Karlsruher MPI) benchmark [19, 17]. The design of the SKaMPI benchmark for MPI derived types is significantly different from the tests in this paper, see [19, 17]. Moreover, the tests in this paper employ cache flushing techniques (see Section 2), whereas the SKaMPI tests do not. Four communication tests were chosen to represent commonly-used operations involving sending non-contiguous data: sending row blocks of a matrix, sending the lower triangular portion of
a matrix, sending a sparse matrix, and sending an array of mixed data types. These tests are written in Fortran 90 and use mpi_type_vector, mpi_type_indexed, and mpi_type_struct [18].

The Origin 2000 used [8, 9, 3] was a 256 processor (128 nodes) machine located in Eagan, Minnesota, with MIPS R12000 processors running at 300 MHz. Each node consists of two processors sharing a common memory. There are two levels of cache: a 32 × 1024-byte first-level instruction cache, a 32 × 1024-byte first-level data cache, and a unified 8 × 1024 × 1024-byte second-level cache for both data and instructions. The communication network is a hypercube for up to 16 nodes and is called a "fat bristled hypercube" for more than 16 nodes, since multiple hypercubes are interconnected via a CrayLink Interconnect. Notice that throughout this paper 1 Kbyte means 10^3 bytes and 1 Mbyte means 10^6 bytes. For all tests, the IRIX 6.5 operating system, the Fortran 90 compiler version 7.3 with the -O3 -64 compiler options, and the MPI library version 1.4.0.1 were used.

The Cray T3E-900 used [6, 5, 7] was a 48 processor machine located in Chippewa Falls, Wisconsin. Each processor is a DEC Alpha EV5 microprocessor running at 450 MHz with a peak theoretical performance of 900 Mflop/s. There are two levels of cache: 8 × 1024-byte first-level instruction and data caches and a 96 × 1024-byte second-level cache for both data and instructions. The communication network is a three-dimensional bi-directional torus. For all tests, the UNICOS/mk 2.0.5 operating system, the Fortran 90 compiler version 3.4.0.0 with the -O3 compiler option, and the MPI library version 1.4.0.0.2 were used.

The Linux cluster of PCs used [4] was a 128 processor (64 nodes) machine located at the Albuquerque High Performance Computing Center in Albuquerque, New Mexico. This cluster was purchased from Alta Technology Corporation [2], with each node consisting of a dual processor Intel 450 MHz Pentium II. There are two levels of cache: a 16 × 1024-byte first-level instruction cache, a 16 × 1024-byte first-level data cache and a unified 512 × 1024-byte second-level cache. Nodes are interconnected via a Myrinet network [15] and also by a 100 Mbit/s Ethernet network [20]. The system was running Linux 2.2.19 and the GM message-passing system version 1.4.1pre14 for the Myrinet network developed by Myricom. For all tests, version 3.1-2 of the Portland Group Fortran 90 compiler with the -O3 compiler option, the MPICH-GM library version 5 for the Myrinet communication network and the MPICH-ETH library version 1.2.1 for the Ethernet communication network were used.

Tests are executed on machines dedicated to running only these tests. On the Myrinet Linux cluster and on the Ethernet Linux cluster, tests are executed using the PBS scheduler, allocating two MPI processes per node, i.e. one MPI process per processor. Default MPI environment settings are used for all machines except the Origin 2000; there the default number of data types is too small to run the tests, so the authors increased the value of MPI_TYPE_MAX.

Section 2 introduces the timing methodology employed, and Section 3 presents each of the tests and their performance results. Conclusions are discussed in Section 4.
2 Timing Methodology

Timings are done by first flushing the cache [13, 10] on all processors, by changing the values in the real array flush(1:ncache), prior to timing the desired operation. The value ncache is chosen so that the size of flush equals the size of the secondary cache: 8 × 1024 × 1024 bytes for the Origin 2000, 512 × 1024 bytes for both Linux clusters and 96 × 1024 bytes for the T3E-900. The time spent to create the MPI derived types is not included in the timing. All MPI derived types are timed using the code listed below. This code collects ping-pong times between processor 0 and processor j, for j = 1, 2, 3.

   integer, parameter :: n=1 !or 2, or 500, 1000, depending on the test
   real*8 :: A(n,n), B(n,n)

   ! Creation of the MPI derived type row (for more details see Section 3.1)
   call mpi_type_vector(n,m,n,mpi_real8,row,ierror)
   call mpi_type_commit(row,ierror)

   do j = 1, 3
      do k = 1, ntrial
         flush(1:ncache) = flush(1:ncache) + 0.1
         call mpi_barrier(mpi_comm_world,ierror)
         if (rank == 0) then
            t = mpi_wtime()
            call mpi_send(A,1,row,j,1,mpi_comm_world,ierror)
            call mpi_recv(B,1,row,j,1,mpi_comm_world,status,ierror)
            array_time(k,j) = 0.5*(mpi_wtime() - t)
         else if (rank == j) then
            call mpi_recv(A,1,row,0,1,mpi_comm_world,status,ierror)
            call mpi_send(B,1,row,0,1,mpi_comm_world,ierror)
         endif
         call mpi_barrier(mpi_comm_world,ierror)
         A(1,1) = flush(1)
      end do
   end do
   print *,flush(1) + A(1,1)
Notice that by flushing the cache between trials, the data loaded into the cache during the previous trial cannot be used to optimize the communication for the next trial. On the T3E-900, all remote memory references use the E-registers, which bypass the cache; thus, cache flushing has no effect on the performance of these tests on this machine, but it does make a difference on all the other machines. However, the same code is run on each parallel computer. The first call to mpi_barrier guarantees that all processors reach this point before they each call the wall-clock timer, mpi_wtime. The second barrier ensures that no processor starts the next iteration (flushing the cache) until all processors have completed executing the MPI code being timed. To prevent the compiler's optimizer from splitting the cache flushing out of the timed k-loop, the line A(1,1) = flush(1) is added, where A is an array involved in the communication. The print statement is added to ensure that the compiler does not consider the timing loop to be "dead code" and remove it. Each test is executed ntrial times, and the measured times for each participating processor are stored in the columns of the array_time matrix. The times are halved because a "round trip" between processor 0 and processor j is timed. Notice that all timings
are performed on the rank 0 processor and that two different buffers, A and B, are used instead of only one buffer. On the Origin 2000 and on the Myrinet and Ethernet Linux clusters, when processor j receives A, all or part of A will be in its secondary cache. If processor j then sent A back to processor 0, this send might be faster because at least part of A would already be in its secondary cache; therefore processor j sends B to processor 0. Similarly, after processor 0 sends A to processor j, at least part of A will be in the secondary cache on processor 0 on the Origin 2000 and on the Myrinet and Ethernet Linux clusters; thus, processor 0 receives into B instead of into A. Similarly, the code used to time the equivalent user-packing version is as follows:

   integer, parameter :: n=1 !or 2, or 500, 1000, depending on the test
   real*8 :: temp(m,n), temp1(m,n), A(n,n), B(n,n)
   call random_number(A)
   call random_number(B)
   do j = 1, 3
      do k = 1, ntrial
         flush(1:ncache) = flush(1:ncache) + 0.1
         call mpi_barrier(mpi_comm_world,ierror)
         if (rank == 0) then
            time1 = mpi_wtime()
            temp(1:m,1:n) = A(1:m,1:n)
            call mpi_send(temp,n*m,mpi_real8,j,1,mpi_comm_world,ierror)
            call mpi_recv(temp1,n*m,mpi_real8,j,1,mpi_comm_world,status,ierror)
            B(1:m,1:n) = temp1(1:m,1:n)
            array_time(k,j) = 0.5*(mpi_wtime() - time1)
         else if (rank == j) then
            call mpi_recv(temp,n*m,mpi_real8,0,1,mpi_comm_world,status,ierror)
            A(1:m,1:n) = temp(1:m,1:n)
            temp1(1:m,1:n) = B(1:m,1:n)
            call mpi_send(temp1,n*m,mpi_real8,0,1,mpi_comm_world,ierror)
         endif
         call mpi_barrier(mpi_comm_world,ierror)
         temp(1,1) = flush(1)
      end do
   end do
   print *,flush(1) + temp(1,1)
Figure 1 shows the performance data on the Origin 2000 for 51 executions of test 1 with m = 500. The first measured time was usually larger than the subsequent times; this is likely due to start-up overhead. Timings for a number of trials were taken to view the variation among trials and the consistency of the timing data. Taking ntrial = 51 (the first timing was always thrown away) provided enough trials to do this for this study. Some timed trials would be significantly larger than others and would significantly affect the average over all trials (see [11, 12] and Figure 2). In Figure 2, the dotted horizontal line at 0.061 msec shows the average over 51 trials. It was the authors' opinion that such "spikes" should be removed so that the data reflect the times one will usually obtain.

Figure 1: Timing data on the SGI Origin 2000 for 51 executions of test 1 with m = 500. [Plot omitted: time (ms) versus trial number for the user-packing (UP) version.]

All data in this report have been filtered by the following process [13, 10]. The first timing is removed. The median of the remaining data is computed and all times greater than 1.8 times this median are removed. There were a few cases where this procedure would remove more than 10% of the data. The authors felt that removing more than 10% of the data would not be appropriate, so in these cases only the largest 10% of the times were removed. Thus, with ntrial = 51, at most 5 of the largest timed trials are removed. An average was calculated from the filtered data, and this is what is presented for each test in this paper. For the data shown in Figure 2, this filtering process gives a time of 0.047 msec, whereas the average over all data is 0.061 msec (50% increase!). Over all tests, the filtering process led to the removal of 2% of the performance data for the Cray T3E-900, 6% for the Origin 2000, 8% for the Myrinet Linux cluster and 4% for the Ethernet Linux cluster.
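As an illustration only, the following Fortran sketch (ours, not the authors' code; the function name filtered_average and its interface are assumptions) implements the filtering rule just described: discard the first timing, drop times larger than 1.8 times the median of the rest, never drop more than 10% of the trials, and average what remains.

   function filtered_average(t, ntrial) result(avg)
      implicit none
      integer, intent(in) :: ntrial
      real*8,  intent(in) :: t(ntrial)          ! raw times; t(1) is always discarded
      real*8  :: avg, work(ntrial-1), key, med
      integer :: i, j, n, nkeep, maxdrop

      n = ntrial - 1
      work = t(2:ntrial)

      ! insertion sort (ascending) so the median and the largest times are easy to find
      do i = 2, n
         key = work(i)
         j = i - 1
         do
            if (j < 1) exit
            if (work(j) <= key) exit
            work(j+1) = work(j)
            j = j - 1
         end do
         work(j+1) = key
      end do

      med = 0.5d0*(work((n+1)/2) + work(n/2+1))   ! median of the remaining n times
      nkeep = count(work <= 1.8d0*med)            ! times kept by the 1.8*median rule
      maxdrop = n/10                              ! never remove more than 10% of the trials
      if (n - nkeep > maxdrop) nkeep = n - maxdrop
      avg = sum(work(1:nkeep))/nkeep
   end function filtered_average

With ntrial = 51 this removes at most 5 of the largest trials, matching the description above.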
3 Communication Tests and Performance Results

3.1 Test 1: Sending m Rows of an Array A(n,n)
The purpose of this test is to compare the performance of MPI derived types with user packing when processor 0 sends m rows of a real*8 A(n,n) array to processor j. Since this test is written in Fortran, A is stored in memory by columns; hence, this test involves sending non-contiguous data. The parameter n was arbitrarily set to 1250, and m was chosen to be 1 to measure "latency", 1000 to measure "bandwidth", and 500 to measure the performance of medium-sized messages. (Sending m rows corresponds to m × 1250 × 8 bytes, i.e. 10 Kbytes, 5 Mbytes and 10 Mbytes for m = 1, 500 and 1000, respectively.) Examination of the performance data shows that it does not matter which m rows of A are sent, so the results presented in this paper are those for sending the first m rows of A. The MPI derived type row was created and committed as follows, see [18]:
   call mpi_type_vector(n,m,n,mpi_real8,row,ierror)
   call mpi_type_commit(row,ierror)

Figure 2: Timing data on the Myrinet Linux cluster for 51 executions of test 2 with n = 2. [Plot omitted: time (ms) versus trial number for the derived-type (DT) version, with the averages with and without spikes marked.]
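For readers less familiar with mpi_type_vector: the call that creates row describes n blocks, each of m contiguous real*8 elements, with the starts of consecutive blocks n elements apart; laid over the column-major A(n,n) this selects exactly A(1:m,1:n), i.e. the first m rows of A (for example, with n = 4 and m = 2 it picks out A(1:2,1), A(1:2,2), A(1:2,3) and A(1:2,4)). A small sanity check of this interpretation (ours, not part of the paper's tests) can be placed after the commit:

   integer :: nbytes
   call mpi_type_size(row, nbytes, ierror)     ! mpi_type_size is standard MPI
   if (nbytes /= 8*n*m) print *, 'unexpected payload size:', nbytes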
Thus, the code used to time sending m rows of A using row is (see Section 2):

   do j = 1, 3
      do k = 1, ntrial
         flush(1:ncache) = flush(1:ncache) + 0.1
         call mpi_barrier(mpi_comm_world,ierror)
         if (rank == 0) then
            call mpi_send(A,1,row,j,1,mpi_comm_world,ierror)
            call mpi_recv(B,1,row,j,1,mpi_comm_world,status,ierror)
         else if (rank == j) then
            call mpi_recv(A,1,row,0,1,mpi_comm_world,status,ierror)
            call mpi_send(B,1,row,0,1,mpi_comm_world,ierror)
         end if
         call mpi_barrier(mpi_comm_world,ierror)
         A(1,1) = flush(1)
      end do
   end do
The code used to time the packing, sending, receiving and unpacking is (see Section 2):

   do j = 1, 3
      do k = 1, ntrial
         flush(1:ncache) = flush(1:ncache) + 0.1
         call mpi_barrier(mpi_comm_world,ierror)
         if (rank == 0) then
            temp(1:m,1:n) = A(1:m,1:n)
            call mpi_send(temp,n*m,mpi_real8,j,1,mpi_comm_world,ierror)
            call mpi_recv(temp1,n*m,mpi_real8,j,1,mpi_comm_world,status,ierror)
            B(1:m,1:n) = temp1(1:m,1:n)
         else if (rank == j) then
            call mpi_recv(temp,n*m,mpi_real8,0,1,mpi_comm_world,status,ierror)
            A(1:m,1:n) = temp(1:m,1:n)
            temp1(1:m,1:n) = B(1:m,1:n)
            call mpi_send(temp1,n*m,mpi_real8,0,1,mpi_comm_world,ierror)
         end if
         call mpi_barrier(mpi_comm_world,ierror)
         temp(1,1) = flush(1)
      end do
   end do
Notice that to send/receive the non-contiguous data, the user has to copy the information into contiguous temporary arrays declared as real*8 temp(m,n) and temp1(m,n). Observe how much simpler it is to use MPI derived types than to pack and unpack the data by hand; this simplicity also makes the program less error-prone and easier to read. Figure 3 presents the performance data for this test. With m = 1, for the T3E-900, using MPI derived types is roughly 5 times faster than employing user packing; for the Origin 2000, the Myrinet Linux cluster and the Ethernet Linux cluster, using MPI derived types gives slightly better performance than employing user packing. With m = 500, for the T3E-900 and the Origin 2000, using MPI derived types is 3 times faster than employing user packing; for the Myrinet Linux cluster, using MPI derived types is slightly slower than employing user packing; and for the Ethernet Linux cluster, using MPI derived types gives slightly better performance than employing user packing. With m = 1000, for the T3E-900 and the Origin 2000, using MPI derived types is about 3 times faster than employing user packing; for the Myrinet Linux cluster, employing user packing is slightly faster than using MPI derived types; and for the Ethernet Linux cluster, using MPI derived types gives slightly better performance than employing user packing. Thus, for all machines, using MPI derived types performs at least as well as user packing, and in some cases it performs much better. The T3E-900 shows the best performance on this test; the Origin 2000 is next best, followed by the Myrinet and Ethernet Linux clusters. For all message sizes, the Myrinet Linux cluster is more than 2 times faster than the Ethernet Linux cluster. Table 1 summarizes the time ratios for test 1, with m = 1, 500 and 1000.
Figure 3: Test 1 with n = 1250 and m = 1, 500 and 1000. [Plots omitted: time (ms) versus processor rank for the derived-type (DT) and user-packing (UP) versions on each machine; one panel per value of m.]
                                  MPI Derived Type   User Packing
m = 1      Origin/T3E                    11               2.5
           Myrinet Linux/T3E              4.4              1
           Ethernet Linux/T3E             9                2
m = 500    Origin/T3E                     2.4              2.3
           Myrinet Linux/T3E             11                3.4
           Ethernet Linux/T3E            27               10
m = 1000   Origin/T3E                     2.6              2.3
           Myrinet Linux/T3E             12                3.5
           Ethernet Linux/T3E            30               10
Table 1: Time ratios for test 1 with m = 1, 500, 1000 and n = 1250.

Notice in Figure 3 the "roof" pattern in the data for the Myrinet Linux cluster, and that the "roof" shape becomes more pronounced as the message size increases. To find out whether this behavior is independent of the use of MPI derived types and user packing, a simple ping-pong test was run using real*8 arrays A(n) and B(n) with n = 1 and 3000. No MPI derived types or packing/unpacking techniques were used. The performance results show the same "roof" pattern, which becomes more pronounced as the message size increases. To explain this behavior, the ping-pong test was run using one MPI process per node and then using two MPI processes per node. Comparing the timing results shows that the "roof" pattern occurs when the MPI communication is within the same node (the MPI processes of rank 0 and 2 are assigned to node 0 and the MPI processes of rank 1 and 3 are assigned to node 1). It appears that the MPI communication within the node is slower than the communication between nodes! To further investigate this behavior, the above ping-pong testing was done on the Myrinet Linux cluster "Platinum" located at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign [16]. Platinum uses a faster processor (a dual 1 GHz Pentium III) and a faster Myrinet interconnect (Myrinet 2000) [15] than the Myrinet Linux cluster in New Mexico. Figure 4 shows the ping-pong performance results with four processors on Platinum and on the Myrinet Linux cluster in New Mexico. The MPI process assignment on Platinum is different from that of the Myrinet Linux cluster in New Mexico: on Platinum, the MPI processes of rank 0 and 1 are assigned to node 0 and the MPI processes of rank 2 and 3 are assigned to node 1, whereas on the Myrinet Linux cluster in New Mexico the MPI processes of rank 0 and 2 are assigned to node 0 and the MPI processes of rank 1 and 3 are assigned to node 1. To make the data comparable, Figure 4 assumes that Platinum employs the same MPI process assignment as the Myrinet Linux cluster in New Mexico. Note that on Platinum the MPI communication within a node is faster than the MPI communication between nodes. The NCSA web site [16] indicates that the MPI library being used is "node aware", which is supposed to provide improved MPI performance within a node. The MPI used on the Myrinet Linux cluster in New Mexico apparently is not optimized for communication within a node.
Figure 4: Ping-pong test results on the Myrinet Linux cluster and on the Myrinet Linux cluster Platinum with 4 processors for A(n) with n = 1 and 3000. [Plot omitted: time (ms) versus processor rank for each cluster and message size.]
3.2 Test 2: Sending the Lower Triangular Portion of an Array A(n,n)
The purpose of this test is to compare the performance of MPI derived types with user packing when processor 0 sends the lower triangular portion of a real*8 A(n,n) matrix, where n = 2, 500, 1000. Clearly this test involves the sending of non-contiguous data. The MPI derived type lower was created and committed as follows, see [18]:

   call mpi_type_indexed(n,LEN,D,mpi_real8,lower,ierror)
   call mpi_type_commit(lower,ierror)
The first argument in mpi_type_indexed is the number of blocks. LEN and D are integer arrays of length n, where the elements of LEN contain the length of each block and the elements of D contain the displacements of the data blocks from A(1,1). Thus, LEN and D are initialized as:

   integer LEN(n), D(n)
   do k = 1, n
      D(k) = (n+1)*(k-1)
      LEN(k) = n - k + 1
   end do
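As a worked example (ours, not from the paper), the following small program prints the block descriptions this loop produces for n = 3; block k starts at the diagonal element A(k,k) of the column-major array and runs to the bottom of column k.

   program show_lower_blocks
      implicit none
      integer, parameter :: n = 3
      integer :: LEN(n), D(n), k
      do k = 1, n
         D(k)   = (n+1)*(k-1)
         LEN(k) = n - k + 1
      end do
      print *, 'D   =', D      ! 0 4 8  -> starting elements A(1,1), A(2,2), A(3,3)
      print *, 'LEN =', LEN    ! 3 2 1  -> blocks A(1:3,1), A(2:3,2), A(3,3)
   end program show_lower_blocks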
Notice that to send the lower triangular part of A from processor 0 to processor j using the above MPI derived type, one merely writes (timings were done as described in Section 2):

   if (rank == 0) then
      call mpi_send(A,1,lower,j,1,mpi_comm_world,ierror)
      call mpi_recv(B,1,lower,j,1,mpi_comm_world,status,ierror)
   else if (rank == j) then
      call mpi_recv(A,1,lower,0,1,mpi_comm_world,status,ierror)
      call mpi_send(B,1,lower,0,1,mpi_comm_world,ierror)
   end if
If MPI derived types are not used, then the user must pack and unpack the data into temporary arrays that are contiguous in memory, declared as real*8 temp(dim) and temp1(dim), where dim = (n^2 + n)/2. Thus, the equivalent user-packing code is as follows (timings were done as described in Section 2):

   if (rank == 0) then
      k = 1
      do r = 1, n
         do i = r, n
            temp(k) = A(i,r)
            k = k+1
         end do
      end do
      call mpi_send(temp,dim,mpi_real8,j,1,mpi_comm_world,ierror)
      call mpi_recv(temp1,dim,mpi_real8,j,1,mpi_comm_world,status,ierror)
      k = 1
      do r = 1, n
         do i = r, n
            B(i,r) = temp1(k)
            k = k+1
         end do
      end do
   else if (rank == j) then
      call mpi_recv(temp,dim,mpi_real8,0,1,mpi_comm_world,status,ierror)
      k = 1
      do r = 1, n
         do i = r, n
            A(i,r) = temp(k)
            k = k+1
         end do
      end do
      k = 1
      do r = 1, n
         do i = r, n
            temp1(k) = B(i,r)
            k = k+1
         end do
      end do
      call mpi_send(temp1,dim,mpi_real8,0,1,mpi_comm_world,ierror)
   end if
Clearly the MPI derived type code is much simpler than the user-packing code; this simplicity makes the program less error-prone and easier to read.
Figure 5 presents the performance data for this test. As for test 1, for all machines and for all message sizes, MPI derived types perform at least as well as user packing, and in some cases MPI derived types significantly outperform user packing. With a message size of n = 2, for the T3E-900, MPI derived types show performance similar to user packing; for the Origin 2000, employing user packing gives better performance than using MPI derived types; and for both clusters the performance of MPI derived types and user packing is nearly the same. With a message size of n = 500, for the T3E-900, using MPI derived types is slightly better than employing user packing; for the Origin 2000, using MPI derived types is 2 times faster than employing user packing; for the Myrinet Linux cluster, using MPI derived types is about 1.2 times faster than employing user packing; and for the Ethernet Linux cluster, using MPI derived types gives better performance than employing user packing. With a message size of n = 1000, for the T3E-900, using MPI derived types is about 1.5 times faster than employing user packing; for the Origin 2000, using MPI derived types is about 3 times faster than employing user packing; for the Myrinet Linux cluster, using MPI derived types is about 1.2 times faster than employing user packing; and for the Ethernet Linux cluster, using MPI derived types gives better performance than employing user packing. Except for messages of size n = 2, the T3E-900 shows the best performance, followed by the Origin 2000, the Myrinet Linux cluster and, last, the Ethernet Linux cluster. The Myrinet Linux cluster is 3 to 7 times faster than the Ethernet Linux cluster. Table 2 summarizes the time ratios for this test, with n = 2, 500 and 1000. For this test the Myrinet Linux cluster was the fastest machine for messages of size n = 2, while the T3E-900 was the fastest machine for all other message sizes:
                                            MPI Derived Type   User Packing
n = 2      T3E/Myrinet Linux                      1.2               1
           Origin/Myrinet Linux                   3.2               2.7
           Ethernet Linux/Myrinet Linux           7.5               7.2
n = 500    Origin/T3E                             1.1               2
           Myrinet Linux/T3E                      3.4               3.3
           Ethernet Linux/T3E                    11                 9
n = 1000   Origin/T3E                             1.2               2.2
           Myrinet Linux/T3E                      4                 3.3
           Ethernet Linux/T3E                    13                 9.4
Table 2: Time ratios for test 2 with n = 2, 500 and 1000.
Figure 5: Test 2 with n = 2, 500 and 1000. [Plots omitted: time (ms) versus processor rank for the derived-type (DT) and user-packing (UP) versions on each machine; one panel per value of n.]
Notice in Figure 5 the unusual pattern in the top graph for the Ethernet Linux cluster data: the time on the processor of rank 1 is greater than the times on the processors of rank 2 and 3. This happens for all the tests when small messages are sent. To find out whether this behavior is independent of the use of MPI derived types and user packing, a ping-pong test was run on this machine using real*8 arrays A(n) and B(n) with n = 1. The ping-pong test does not employ either MPI derived types or packing/unpacking techniques. Figure 6 shows the performance results of this ping-pong test. Notice that the ping-pong test shows the same behavior as in Figure 5.

Figure 6: Ping-pong test results on the Ethernet Linux cluster with four processors and for A(n) with n = 1. [Plot omitted: time (ms) versus processor rank.]

To investigate why this is occurring, examination of the hostnames shows that the MPI process of rank 0 is assigned to node 0 and the MPI processes of rank 1, 2 and 3 are assigned to node 1! The ping-pong test was then run using three MPI processes, with the MPI process of rank 0 assigned to node 0 and the MPI processes of rank 1 and 2 assigned to node 1; no performance degradation was observed. Thus, the ping-pong performance degrades only when a third MPI process is placed on node 1. It appears that the assignment of three MPI processes to a two-processor node is causing the performance degradation observed.
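The hostname examination mentioned above can be done from within the test program itself; a minimal sketch (ours, not the authors' code) using the standard mpi_get_processor_name routine is:

   character(len=mpi_max_processor_name) :: host
   integer :: namelen, ierror
   call mpi_get_processor_name(host, namelen, ierror)
   print *, 'MPI process of rank', rank, 'runs on ', host(1:namelen)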
3.3 Test 3: Sending a Sparse Array A(n,n)
The purpose of this test is to compare the performance of MPI derived types with user packing when sending non-contiguous data randomly placed in memory. This can occur in communications involving sparse arrays. In this test processor 0 sends the non-zero elements of a real*8 A(n,n), where n = 500 and 1000. The sparse array A has m = n/10 non-zero elements per column, for a total of m*n non-zero elements. The following code shows how the sparse array A has been set up:

   integer, parameter :: n = 500, m = n/10    ! or n = 1000
   real*8 A(n,n), B(m)
   integer index(m), D(m*n), i, j, k
   call random_number(B)
   index(1:m) = B(1:m)*(n-1) + 1
   D(1:m) = index(1:m)
   do j = 1, m
      call random_number(A(index(j),1))
   end do
   k = m + 1
   do i = 2, n
      call random_number(B)
      index(1:m) = B(1:m)*(n-1) + 1
      do j = 1, m
         call random_number(A(index(j),i))
      end do
      D(k:k+m-1) = (i-1)*n + index(1:m)
      k = k + m
   end do
where D(m*n) is a randomly generated array of integers representing the displacements of the non-zero elements of A from A(1,1). The MPI derived type spa was created and committed as follows, see [18]:

   call mpi_type_indexed(m*n,LEN,D,mpi_real8,spa,ierror)
   call mpi_type_commit(spa,ierror)
The first argument in mpi_type_indexed is the number of non-zero elements of A. LEN is an integer array of length m*n whose elements contain the length of each block of data. For this test a block of data is a single element of the sparse array A, so each element of LEN is set to 1. Observe that the array of displacements D, initialized above along with the sparse array A, is needed to construct the MPI derived type spa. The code for sending and receiving the sparse array using MPI derived types is simply (timings were done as described in Section 2):

   if (rank == 0) then
      call mpi_send(A,1,spa,j,1,mpi_comm_world,ierror)
      call mpi_recv(B,1,spa,j,1,mpi_comm_world,status,ierror)
   else if (rank == j) then
      call mpi_recv(A,1,spa,0,1,mpi_comm_world,status,ierror)
      call mpi_send(B,1,spa,0,1,mpi_comm_world,ierror)
   end if
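The initialization of LEN is not listed in the paper; given the description above it is presumably just the following (our reconstruction):

   integer :: LEN(m*n)
   LEN = 1      ! every non-zero element of A is its own block of length 1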
The code used for packing, sending, receiving, and unpacking the data is (timings were done as described in Section 2):

   if (rank == 0) then
      k = 1
      z = 1
      do i = 1, n
         do j = z, z+m-1
            temp(k) = A(D(j)-(i-1)*n,i)
            k = k+1
         end do
         z = z + m
      end do
      call mpi_send(temp,m*n,mpi_real8,j,1,mpi_comm_world,ierror)
      call mpi_recv(temp1,m*n,mpi_real8,j,1,mpi_comm_world,status,ierror)
      k = 1
      z = 1
      do i = 1, n
         do j = z, z+m-1
            B(D(j)-(i-1)*n,i) = temp1(k)
            k = k+1
         end do
         z = z + m
      end do
   else if (rank == j) then
      call mpi_recv(temp,m*n,mpi_real8,0,1,mpi_comm_world,status,ierror)
      k = 1
      z = 1
      do i = 1, n
         do j = z, z+m-1
            A(D(j)-(i-1)*n,i) = temp(k)
            k = k+1
         end do
         z = z + m
      end do
      k = 1
      z = 1
      do i = 1, n
         do j = z, z+m-1
            temp1(k) = B(D(j)-(i-1)*n,i)
            k = k+1
         end do
         z = z + m
      end do
      call mpi_send(temp1,m*n,mpi_real8,0,1,mpi_comm_world,ierror)
   end if
Notice that if MPI derived types are not used, then the user must pack and unpack the data into temporary arrays that are contiguous in memory. For this test, temp and temp1 are declared as real*8 arrays of dimension m*n. Recall that the elements of the array D(m*n) contain the displacements of the non-zero elements of A from A(1,1). Observe how much simpler and less error-prone the MPI derived type code is compared with the user-packing code; this simplicity also makes the program easier to read.

Figure 7 shows the performance data for this test. On all machines, and for almost all message sizes, user packing outperforms the MPI derived type implementation. With message size n = 500, for the Origin 2000, employing user packing is about 1.2 times faster than using MPI derived types; for both Linux clusters, employing user packing is about 2 times faster than using MPI derived types; and for the T3E-900, employing user packing is about 4 times faster than using MPI derived types. With message size n = 1000, for the Origin 2000, using MPI derived types is about 1.2 times faster than employing user packing; for the Myrinet Linux cluster, employing user packing is 2 times faster than using MPI derived types; for the Ethernet Linux cluster, employing user packing is 1.4 times faster than using MPI derived types; and for the T3E-900, employing user packing is 4 times faster than using MPI derived types. For MPI derived types and for all message sizes, the Origin 2000 shows the best performance, followed by the Myrinet Linux cluster, the Ethernet Linux cluster and, last, the T3E-900. For this test and for all message sizes and implementations, the Myrinet Linux cluster is about 2 to 6 times faster than the Ethernet Linux cluster. Unlike test 2, mpi_type_indexed does not perform well relative to user packing on any of these machines for this test; apparently the implementation of mpi_type_indexed is not efficient for randomly placed data. Table 3 summarizes the time ratios for test 3, with n = 500 and 1000. For MPI derived types, the Origin 2000 was the fastest machine for both n = 500 and n = 1000; for user packing, the Myrinet Linux cluster was the fastest machine for both message sizes:
n = 500    MPI Derived Type   T3E/Origin                         3.3
                              Myrinet Linux/Origin               1.3
                              Ethernet Linux/Origin              1.5
n = 500    User Packing       T3E/Myrinet Linux                  1.5
                              Origin/Myrinet Linux               1.5
                              Ethernet Linux/Myrinet Linux       2.5
n = 1000   MPI Derived Type   T3E/Origin                         2
                              Myrinet Linux/Origin               1.1
                              Ethernet Linux/Origin              1.5
n = 1000   User Packing       T3E/Myrinet Linux                  1.5
                              Origin/Myrinet Linux               2
                              Ethernet Linux/Myrinet Linux       2.5

Table 3: Time ratios for test 3 with n = 500 and 1000.
Figure 7: Test 3 with n = 500 and 1000. [Plots omitted: time (ms) versus processor rank for the derived-type (DT) and user-packing (UP) versions on each machine; one panel per value of n.]
3.4 Test 4: Sending an Array of Mixed Data Types
The purpose of this test is to compare the performance of MPI derived types with user packing when sending messages containing elements of different data types. This can be done using MPI derived types or by using mpi_pack, mpi_unpack and mpi_pack_size. In this test processor 0 sends an array of mixed data, declared as follows:

   integer, parameter :: n = 1 ! or 500, 1000
   type data
      character(17) :: name
      integer :: serial_number
      real*8 :: R(10)
   end type data
   type(data) :: A(n), B(n)
The MPI derived type mixed was created and committed as follows, see [18]:

   call mpi_type_struct(3,LEN,D,types,mixed,ierror)
   call mpi_type_commit(mixed,ierror)
The first argument of mpi_type_struct is the number of different MPI data types used to construct mixed. The elements of the integer array LEN contain the length of each block, and the elements of the integer array D contain the byte displacements of the blocks. The following shows how much simpler it is to send the mixed data using the MPI derived type (timings were done as described in Section 2):

   if (rank == 0) then
      call mpi_send(A,1,mixed,j,1,mpi_comm_world,ierror)
      call mpi_recv(B,1,mixed,j,1,mpi_comm_world,status,ierror)
   else if (rank == j) then
      call mpi_recv(A,1,mixed,0,1,mpi_comm_world,status,ierror)
      call mpi_send(B,1,mixed,0,1,mpi_comm_world,ierror)
   end if
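The paper does not list how LEN, D and types are initialized for mixed. A minimal sketch (our assumption, under the record layout above: 17 characters, one integer and ten real*8 values per element) could be the following; the displacements are computed with mpi_type_extent, as in the packing code below, although a fully portable version would obtain them with mpi_address on an actual variable of type data, since the compiler may insert padding.

   integer :: LEN(3), D(3), types(3)
   integer :: sizeofchar, sizeofint, ierror

   call mpi_type_extent(mpi_character, sizeofchar, ierror)
   call mpi_type_extent(mpi_integer,   sizeofint,  ierror)

   LEN   = (/ 17, 1, 10 /)                            ! block lengths: name, serial_number, R
   types = (/ mpi_character, mpi_integer, mpi_real8 /)
   D(1)  = 0                                          ! byte displacements within type(data),
   D(2)  = 17*sizeofchar                              ! assuming no padding between components
   D(3)  = 17*sizeofchar + sizeofint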
The following code shows how to treat the mixed information using the subroutine mpi_pack_size:

   character, allocatable :: temp(:), temp1(:)
   integer D(3)
   integer :: LEN(3) = (/ 17, 1, 10 /)
   integer sizeofchar, sizeofreal, sizeofint, mix_size
   integer size1, size2, size3

   call mpi_type_extent(mpi_character,sizeofchar,ierror)
   call mpi_type_extent(mpi_real8,sizeofreal,ierror)
   call mpi_type_extent(mpi_integer,sizeofint,ierror)
   D(1) = 0
   D(2) = 17*sizeofchar
   D(3) = 17*sizeofchar + sizeofint
   call mpi_pack_size(LEN(1),mpi_character,mpi_comm_world,size1,ierror)
   call mpi_pack_size(LEN(2),mpi_integer,mpi_comm_world,size2,ierror)
   call mpi_pack_size(LEN(3),mpi_real8,mpi_comm_world,size3,ierror)
   mix_size = size1 + size2 + size3
   allocate(temp(n*mix_size), temp1(n*mix_size))
The mpi_type_extent calls return the extents of the primitive types, and mpi_pack_size returns an upper bound on the space needed to pack the corresponding data, see [18]. The user-packing code copies the data into the contiguous temporary arrays temp and temp1 using the mpi_pack and mpi_unpack subroutines. The following is the code for the packing, sending, receiving and unpacking, where position is an integer variable that is automatically incremented each time mpi_pack/mpi_unpack is called. Due to limited space, only a part of the round trip between processor 0 and processor j = 1, 2, 3 is presented.

   if (rank == 0) then
      position = 0
      do i = 1, n
         call mpi_pack(A(i)%name,LEN(1),mpi_character,temp, &
                       n*mix_size,position,mpi_comm_world,ierror)
         call mpi_pack(A(i)%serial_number,LEN(2),mpi_integer,temp, &
                       n*mix_size,position,mpi_comm_world,ierror)
         call mpi_pack(A(i)%R,LEN(3),mpi_real8,temp, &
                       n*mix_size,position,mpi_comm_world,ierror)
      end do
      call mpi_send(temp,position,mpi_packed,j,1,mpi_comm_world,ierror)
   else if (rank == j) then
      position = 0
      call mpi_recv(temp,n*mix_size,mpi_packed,0,1, &
                    mpi_comm_world,status,ierror)
      do i = 1, n
         call mpi_unpack(temp,n*mix_size,position,B(i)%name,LEN(1), &
                         mpi_character,mpi_comm_world,ierror)
         call mpi_unpack(temp,n*mix_size,position,B(i)%serial_number, &
                         LEN(2),mpi_integer,mpi_comm_world,ierror)
         call mpi_unpack(temp,n*mix_size,position,B(i)%R,LEN(3), &
                         mpi_real8,mpi_comm_world,ierror)
      end do
   end if
Notice how much simpler and less error-prone the MPI derived type code is than the user-packing code. In this test a very simple structured type is sent/received, and the amount of packing/unpacking work is already significant; clearly, for this test, using MPI derived types is much simpler than employing the mpi_pack/mpi_unpack routines. Figure 8 presents the performance data for this test. Note that MPI derived types outperform user packing for all message sizes.
Figure 8: Test 4 with n = 1, 500 and 1000. [Plots omitted: time (ms) versus processor rank for the derived-type (DT) and user-packing (UP) versions on each machine; one panel per value of n.]
With message size n = 1, for the Myrinet Linux cluster, employing user packing is 1.2 times faster than using MPI derived types; for the T3E-900, using MPI derived types shows better performance than employing user packing; for the Origin 2000, using MPI derived types is slightly faster than employing user packing; and for the Ethernet Linux cluster, using MPI derived types is 1.2 times faster than employing user packing. For message size n = 500, for the T3E-900, using MPI derived types is about 2.3 times faster than employing user packing; for the Origin 2000, using MPI derived types is about 1.7 times faster than employing user packing; for the Myrinet Linux cluster, using MPI derived types is 1.4 times faster than employing user packing; and for the Ethernet Linux cluster, using MPI derived types is 1.2 times faster than employing user packing. For message size n = 1000, for the Origin 2000, using MPI derived types is about 2 times faster than employing user packing; for the T3E-900, using MPI derived types is about 2.3 times faster than employing user packing; for the Myrinet Linux cluster, using MPI derived types is about 1.4 times faster than employing user packing; and for the Ethernet Linux cluster, using MPI derived types is about 1.2 times faster than employing user packing.

With message size n = 1, the Myrinet Linux cluster shows the best performance for both the MPI derived type and user-packing implementations; the T3E-900 is next best, followed by the Origin 2000 and then the Ethernet Linux cluster. With message size n = 500, for MPI derived types, the T3E-900 shows the best performance, followed by the Origin 2000, the Myrinet Linux cluster and the Ethernet Linux cluster; for user packing, the Origin 2000 shows the best performance, followed by the T3E-900, the Myrinet Linux cluster and the Ethernet Linux cluster. With message size n = 1000, the Origin 2000 shows the best performance for both the MPI derived type and user-packing (mpi_pack and mpi_unpack) implementations; the T3E-900 is next best, followed by the Myrinet Linux cluster and the Ethernet Linux cluster.

Tables 4, 5 and 6 summarize the time ratios for test 4, with n = 1, 500 and 1000. With n = 1 the Myrinet Linux cluster was the fastest machine for both MPI derived types and user packing. With n = 500, the T3E-900 was the fastest machine for MPI derived types, while the Origin 2000 showed the best performance for user packing. With n = 1000, the Origin 2000 was the fastest machine for both MPI derived types and user packing:
n = 1                                MPI Derived Type   User Packing
T3E/Myrinet Linux                          2.1               2.6
Origin/Myrinet Linux                       2.7               3.1
Ethernet Linux/Myrinet Linux               4.8               5.3

Table 4: Time ratios for test 4 with n = 1.
n = 500   MPI Derived Type   Origin/T3E                        1.2
                             Myrinet Linux/T3E                 1.8
                             Ethernet Linux/T3E                4.1
n = 500   User Packing       T3E/Origin                        1.2
                             Myrinet Linux/Origin              1.2
                             Ethernet Linux/Origin             2.4
Table 5: Time ratios for test 4 with n = 500.
n = 1000                             MPI Derived Type   User Packing
T3E/Origin                                 1                 1.2
Myrinet Linux/Origin                       1.7               1.3
Ethernet Linux/Origin                      4                 2.5
Table 6: Time ratios for test 4 with n = 1000.
4 Conclusion

The primary purpose of this paper is to compare the performance of MPI derived types with user packing on an SGI Origin 2000, a Cray T3E-900, a Myrinet Linux cluster and an Ethernet Linux cluster. It is also the purpose of this paper to demonstrate that using MPI derived types is an easy-to-use, elegant method for sending non-contiguous data. Four communication tests were chosen to represent commonly-used operations involving sending non-contiguous data: sending row blocks of a matrix, sending the lower triangular portion of a matrix, sending a sparse matrix, and sending an array of mixed data types. These tests are written in Fortran 90 and use mpi_type_vector, mpi_type_indexed, and mpi_type_struct [18]. Except for test 3 (sending a sparse matrix), MPI derived types outperform user packing. Relative performance between machines varied with the test and the message size used. All tests showed that employing MPI derived types is an easy-to-use and elegant way to send non-contiguous data.
5 Acknowledgments

We would like to thank SGI for allowing us to use their Origin 2000. We would like to thank Cray for giving us access to their T3E-900. We would like to thank the University of New Mexico for access to their Albuquerque High Performance Computing Center; this work utilized the (UNM) Alliance Roadrunner Supercluster (Myrinet and Ethernet Linux clusters). We also would like to thank the National Computational Science Alliance for giving us access to their Linux supercluster "Platinum" located at Urbana-Champaign, Illinois.
References

[1] AHPCC Linux Supercluster Web Server. http://www.arc.unm.edu/.
[2] Alta Technology Corporation Web Server. http://www.altatech.com/.
[3] J. Ammon, "Hypercube Connectivity within a ccNUMA Architecture," Silicon Graphics, May 1998.
[4] D. Bovet, M. Cesati, "Understanding the Linux Kernel," O'Reilly, October 2000.
[5] Cray Research Inc., "CRAY T3E Programming with Coherent Memory Streams," December 18, 1996.
[6] Cray Research Inc., "CRAY T3E Fortran Optimization Guide SG-2518 3.0," 1997.
[7] Cray Research Web Server. http://www.cray.com/.
[8] J. Fier, "Performance Tuning Optimization for Origin 2000 and Onyx2," Silicon Graphics, 1996. http://techpubs.sgi.com/.
[9] J. Laudon, D. Lenoski, "The SGI Origin: A ccNUMA Highly Scalable Server," Silicon Graphics, 1997.
[10] G. R. Luecke, J. J. Coyle, "Comparing the Performances of MPI on the Cray T3E-900, the Cray Origin 2000 and the IBM P2SC," The Journal of Performance Evaluation and Modelling for Computer Systems, June 1998. http://hpc-journals.ecs.soton.ac.uk/PEMCS/.
[11] G. R. Luecke, B. Ran, J. J. Coyle, "Comparing the Scalability of the Cray T3E-600 and the Cray Origin 2000 Using SHMEM Routines," The Journal of Performance Evaluation and Modelling for Computer Systems, December 1998. http://hpc-journals.ecs.soton.ac.uk/PEMCS/.
[12] G. R. Luecke, B. Ran, J. J. Coyle, "Comparing the Communication Performance and Scalability of a SGI Origin 2000, a Cluster of Origin 2000's and a Cray T3E-1200 Using SHMEM and MPI Routines," The Journal of Performance Evaluation and Modelling for Computer Systems, October 1999. http://hpc-journals.ecs.soton.ac.uk/PEMCS/.
[13] G. R. Luecke, B. Ran, J. J. Coyle, "Comparing the Communication Performance and Scalability of a Linux and a NT Cluster of PCs, a Cray Origin 2000, an IBM SP and a Cray T3E-600. Extended Version," April 2000. http://www.public.iastate.edu/~grl/publications.html.
[14] MPI Web Server. http://www-unix.mcs.anl.gov/mpi/.
[15] Myricom Web Server. http://www.myricom.com/.
[16] NCSA Linux Supercluster Web Server. http://www.ncsa.uiuc.edu/UserInfo/Resources/.
[17] R. Reussner, J. Larsson Träff, G. Hunzelmann, "A Benchmark for MPI Derived Datatypes," 2000.
[18] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra, "MPI - The Complete Reference," Volume 1, The MPI Core, MIT Press, second edition, 1998.
[19] Special Karlsruher MPI (SKaMPI) Web Server. http://liinwww.ira.uka.de/~skampi/.
[20] C. E. Spurgeon, "Ethernet: The Definitive Guide," O'Reilly, February 2000.