Automatic Data-Flow Graph Generation of MPI Programs*

Rafael Ennes Silva, Guilherme Pezzi, Nicolas Maillard and Tiarajú Diverio
Institute of Informatics - UFRGS
Av. Bento Gonçalves 9500, PO Box 15064, Zip Code 90501-910, Porto Alegre - RS - Brazil
{resilva, pezzi, nmaillard, diverio}@inf.ufrgs.br

Abstract

The Data-Flow Graph (DFG) of a parallel application is frequently used to make scheduling decisions, based on the information that it models (dependencies among the tasks and volume of exchanged data). In the case of MPI-based programs, the DFG may be built at run-time by overloading the data-exchange primitives. This article presents a library that enables the generation of the DFG of an MPI program, and its use to analyze the network contention on a test application: the Linpack benchmark. It is a first step towards the automatic mapping of an MPI program onto a distributed architecture.

1. Introduction

Many parallel applications are implemented on distributed architectures using message-passing tools. Although these may lack high-level, abstract mechanisms to describe the parallelism of the program (such as those provided by OpenMP [3] and Cilk [1] in a shared-memory context, or Java-RMI [9]), they have imposed themselves as the main solution for efficient implementations on large-scale distributed machines. MPI [12] is the most representative tool in this context, and the de facto standard in HPC.

Frequently, high-level parallel programming tools represent an application by a graph. In the shared-memory context, a thread can be considered as a vertex of the graph and the spawning of new threads represents an edge (dependency graph). If the application uses message passing, the graph can be made more detailed to take into account the communication between the processes: the data-flow graph (DFG) of the program is then obtained, with each edge (u, v) representing the communication volume between processes u and v.

* Research done in cooperation with HP Brasil.

The DFG representation is used in many cases to model a parallel application and to produce information in order to schedule the computation, i.e. to decide at which time-step and on which processor each process will be executed, so as to optimize the use of the resources (e.g. by load balancing) or one of the application's characteristics (e.g. the makespan). Some parallel programming environments explicitly use the DFG of an application: for example, Cilk [1], Athapascan [7] or Satin [13] use it both as a programming model and to obtain efficient schedules. On the other hand, MPI does not specify how the run-time maps the processes onto the processor topology. The standard leaves this concern to the implementors, due to the difficulty of standardizing all the scheduling possibilities in a distributed system (queue-based batch systems, operating-system schedulers, possible thread-based scheduling on each node, etc.). Yet, in MPI, the DFG may be deduced from the analysis of the data-exchange primitives: in a way, the programmer explicitly details the DFG when he uses send/receive instructions.

The proposal of this article is to provide a library that automates the DFG generation, in order to enable its study and an optimized scheduling of the program. In this paper, the use of the library is restricted to the generation of the DFG and to the study of its characteristics for a given application: the Linpack benchmark.

The paper is structured as follows: the next section relates the use of the DFG to some classical parallel programming tools. Section 3 presents the proposed library, called β-MPI. Section 4 introduces the Linpack benchmark and its optimized use on a test cluster, as well as the evaluation of β-MPI and of its use with Linpack. Finally, the last section concludes the work.

2. Use of the Data-Flow Graph in Parallel Programming

This section presents some environments that make use of the DFG of an application, either to schedule it efficiently or to specify the application itself. It is divided in two parts: one dedicated to parallel environments based on the data-flow approach, and one that explains how message-passing environments are related to the DFG concept.

2.1. Data-Flow Approach

Two environments are presented: Cilk, which mainly runs on shared-memory computers, and Athapascan, which targets distributed environments. Both are based on data-flow analysis.

2.1.1. Cilk [1] is a multithreaded programming language that extends C with primitives for parallel control. The language constrains the DFG of the program to represent fully strict computations, for instance of the "Divide & Conquer" type. This restricted computation model allows an efficient schedule to be computed by work-stealing. In order to restrict the parallelism to this kind of DFG, Cilk introduces a keyword, spawn, to create new threads. The spawning thread is considered to be the parent; the spawned ones are the children. Another keyword, sync, enables the synchronization of the threads. Figure 1 shows an example, as presented in [1]. Each rectangle is a task, divided into a sequence of threads shown inside the rectangle. The spawning of threads is illustrated by the downward arrows, while an upward arrow indicates the return of a value to a parent procedure and the dependency produced by the synchronization.

Figure 1. Cilk model of multithreaded computations.

This language restricts the kind of DFG being considered. This work aims at studying DFGs of whatever type. 2.1.2. Athapascan [7] is a parallel programming environment explicitly based on the DFG and on a virtually shared memory. The Athapascan API is based on C++ and offers basically one keyword, task, in order to specify the sequential parts of its program; and a specification of the kind of accesses that the tasks make to the shared memory (read access, write access, read-write access). With this information, the run-time may interpret the accesses to the data to

detect the logical dependencies and, dynamically, build the DFG and take scheduling decisions [2]. Athapascan is a powerful and high-level tool, that obtained good performance results in both shared memory environments [6] and distributed environments.Thus, it validated the usefulness of the DFG to deduce efficient schedules. But the computation of the DFG is made at the cost of a high-level, C++ API, which implies reprogramming existing codes to use it. The approach of this work is based on standard MPI norm.

2.2. Message Passing Approach

We present two environments related to the message-passing model. The first one, Pyrros, uses a graph to describe a program; the second, MPE, was built for debugging and tracing MPI applications, but provides the data needed to build the DFG.

2.2.1. Pyrros [15] is a software system for automatic static scheduling and code generation. Pyrros takes as input parallel program tasks with precedence constraints and produces code for message-passing architectures such as the nCUBE-2 and INTEL-2. Pyrros has three components: a task graph language, with a C interface, allowing users to partition their tasks and data; a scheduling system for load balancing and physical mapping, as well as communication and computation ordering; and a message-passing code generator that inserts synchronization primitives and performs code optimization for specific architectures. Additionally, Pyrros provides a graphical displayer to show the task graphs and the scheduling results. The β-MPI proposal takes exactly the opposite approach: instead of generating message-passing code from a graph description, it starts from the MPI code and generates the DFG, in order to schedule it.

2.2.2. MPE (Multiprocessing Environment) [8] was conceived for debugging and tracing MPI applications. It provides programmers with performance analysis tools, including a set of profiling libraries, a set of utility programs, and a set of graphical visualization tools. After an MPI program has been executed, a log file is created. The information produced in the log file is read by a graphical visualization tool that provides a view of all the processes involved. It could also be parsed to build the application's DFG, since it is easy to trace the data exchanges between the processes (source and destination processes, with the data size). MPE is designed to trace an MPI program for debugging, and it produces a log file after the end of the application's execution. The log file needs to be examined to find the message-passing primitives, the processes involved in the communication and the volume of data exchanged. With this information it is possible to build the DFG. In the β-MPI solution, the DFG is already available when the execution ends.

3. Construction of the Data-Flow Graph in MPI: the β-MPI Library

In order to build the DFG during the execution of an MPI program, the β-MPI library overloads the classical message-passing primitives. All the user has to do is (re)compile his program including the beta-mpi.h header (through an #include directive) and link his application against the libbeta-mpi.a library. For each overloaded instruction, the data-structure implementing the DFG is first updated with the message's characteristics (origin, destination and size, in bytes); then the original primitive is called and its return code is passed back to the caller. As an implementation example, Fig. 2 shows the overloading of the MPI send primitive; the _add_msg primitive updates the graph data-structure.

#define MPI_Send(buf, count, type, dest, tag, comm) \
        bMPI_Send(buf, count, type, dest, tag, comm)

int bMPI_Send(void *buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm)
{
    int datatype_size;
    MPI_Type_size(datatype, &datatype_size);
    _add_msg(_rank(), dest, count * datatype_size);
    return MPI_Send(buf, count, datatype, dest, tag, comm);
}

Figure 2. Implementation example of β-MPI.
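The _add_msg routine itself is not shown in the paper. The following is only a minimal sketch of what such an update could look like, assuming the DFG is kept as a flat, per-process array of byte counters indexed by destination rank; the names and the layout are illustrative, not taken from the β-MPI sources.

#include <mpi.h>
#include <stdlib.h>

/* Assumed layout: dfg_edges[dest] accumulates the bytes sent by this
 * process to rank dest (ranks relative to MPI_COMM_WORLD). */
long long *dfg_edges = NULL;
static int dfg_nprocs = 0;

int _rank(void)
{
    int r;
    MPI_Comm_rank(MPI_COMM_WORLD, &r);
    return r;
}

void _add_msg(int src, int dest, int size)
{
    if (dfg_edges == NULL) {                       /* lazy allocation */
        MPI_Comm_size(MPI_COMM_WORLD, &dfg_nprocs);
        dfg_edges = calloc(dfg_nprocs, sizeof(long long));
    }
    (void)src;                /* src is always the local rank here */
    dfg_edges[dest] += size;  /* accumulate the communication volume */
}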

All the message-passing primitives overloaded in β-MPI are shown in Fig. 3. Collective message-passing operations have not been treated yet, since they were not used in the test benchmark; their overload would follow the same scheme (a sketch is given after Fig. 3). Two technical problems had to be dealt with during the implementation. The first is related to the size of the messages' data: the MPI standard allows the user to define complex datatypes (MPI_Datatype), which can aggregate data of any basic type, possibly located at non-contiguous memory addresses. In order to determine their size, the MPI_Type_size primitive is required. The second problem concerns the rank of the processes involved in the communication.

int bMPI_Init(int *argc, char ***argv);
int bMPI_Finalize();
int bMPI_Send(void *buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm);
int bMPI_Recv(void *buf, int count, MPI_Datatype datatype,
              int source, int tag, MPI_Comm comm, MPI_Status *status);
int bMPI_Isend(void *buf, int count, MPI_Datatype datatype,
               int dest, int tag, MPI_Comm comm, MPI_Request *request);
int bMPI_Irecv(void *buf, int count, MPI_Datatype datatype,
               int source, int tag, MPI_Comm comm, MPI_Request *request);
int bMPI_Ssend(void *buf, int count, MPI_Datatype datatype,
               int dest, int tag, MPI_Comm comm);
int bMPI_Issend(void *buf, int count, MPI_Datatype datatype,
                int dest, int tag, MPI_Comm comm, MPI_Request *request);
int bMPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                  int dest, int sendtag,
                  void *recvbuf, int recvcount, MPI_Datatype recvtype,
                  int source, int recvtag, MPI_Comm comm, MPI_Status *status);

Figure 3. Overloaded MPI primitives in β-MPI.
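As noted above, collective operations are not overloaded in the current library, but their overload would follow the same scheme. The function below is only a sketch of how an MPI_Bcast wrapper could account for the induced traffic: it is not part of the published β-MPI, and it assumes the root records one DFG edge per receiving process, using the _add_msg and _rank helpers sketched after Fig. 2.

#include <mpi.h>

/* Prototypes as assumed in the sketch following Fig. 2. */
extern void _add_msg(int src, int dest, int size);
extern int  _rank(void);

/* Hypothetical wrapper, not in the published library. */
int bMPI_Bcast(void *buf, int count, MPI_Datatype datatype,
               int root, MPI_Comm comm)
{
    int datatype_size, nprocs, me, p;

    MPI_Type_size(datatype, &datatype_size);
    MPI_Comm_size(comm, &nprocs);
    MPI_Comm_rank(comm, &me);

    if (me == root) {
        /* One logical edge from the root to every other process. The rank p
         * is local to comm and would still have to be translated to
         * MPI_COMM_WORLD, as discussed below. */
        for (p = 0; p < nprocs; p++)
            if (p != root)
                _add_msg(_rank(), p, count * datatype_size);
    }
    return MPI_Bcast(buf, count, datatype, root, comm);
}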

With the MPI communicator mechanism, the rank is local to the process's group, which may be customized by a programmer who wants to isolate subsets of the processes. Since the β-MPI library requires a unique identifier for each process (node of the DFG), the natural choice is to use the rank of the process in the MPI_COMM_WORLD communicator. Therefore, the rank of a process involved in a communication must be translated from its local communicator to MPI_COMM_WORLD. The classical solution is to determine the group associated with MPI_COMM_WORLD (with the MPI_Comm_group call) and to use the MPI_Group_translate_ranks primitive to do the translation, as sketched below.
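A minimal sketch of this translation follows; the helper name is illustrative, not taken from the β-MPI sources.

#include <mpi.h>

/* Translate a rank that is local to `comm` into the corresponding rank
 * in MPI_COMM_WORLD. */
static int _world_rank(int local_rank, MPI_Comm comm)
{
    MPI_Group local_group, world_group;
    int world_rank;

    MPI_Comm_group(comm, &local_group);
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    /* Map one rank from the local group onto the world group. */
    MPI_Group_translate_ranks(local_group, 1, &local_rank,
                              world_group, &world_rank);

    MPI_Group_free(&local_group);
    MPI_Group_free(&world_group);
    return world_rank;
}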

The β-MPI library thus makes it possible to trace the DFG associated with an MPI program, in order to investigate its communication topology and to decide on a better mapping of the processes onto that topology. The next section gives an example based on the Linpack benchmark.

4. Experimental Evaluation of the β-MPI Library

The target application used to test the β-MPI library is the Linpack benchmark, as implemented in its HPL version. We executed HPL with the β-MPI library to measure the overhead it implies, and used it to analyze the communication pattern induced by the LU factorization.


4.1. Configuration of the HPL Benchmark

The experiments with the HPL benchmark have been run on the labtec cluster, at the Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Brazil. The cluster is composed of one server with two Xeon 1.8 GHz processors and 20 nodes, each a dual Pentium III 1.13 GHz, all with 1 GB of RAM. The operating system is Debian Linux, with kernel 2.6. At the time the experiments were run, only 16 nodes were available.

4.1.1. The HPL Algorithm. HPL (Highly Parallel Linpack) [4] is a benchmark developed for distributed-memory computers. It solves a linear system of equations by LU factorization with partial pivoting [10]. The matrix is distributed cyclically by blocks (of dimension NB × NB), in both dimensions, over a virtual grid of P × Q processes (a sketch of this mapping is given below). At each iteration of the main loop, a panel of NB columns is factored, and the trailing submatrix is updated. The update is made in two steps:

1. broadcast of the pivot block along the rows of the matrix; with this pivot, the factorization of NB rows can be done;
2. broadcast of these newly factored rows along the columns.

Thus, the data exchanges follow the two axes of the Cartesian topology. In order to optimize these broadcasts, which are the main limiting operations of the algorithm, the HPL implementation offers six different broadcast algorithms based on point-to-point MPI_Send/MPI_Recv operations (instead of relying on the MPI_Bcast call, whose implementation depends on the installed MPI distribution).
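The two-dimensional block-cyclic mapping can be illustrated by a short sketch (not HPL source code): the owner of a global block is obtained by taking the block coordinates modulo the grid dimensions.

/* Sketch only: owner of global block (I, J) in the two-dimensional
 * block-cyclic distribution, for an N x N matrix cut into NB x NB blocks
 * over a P x Q process grid. */
typedef struct { int p_row, p_col; } owner_t;

static owner_t block_owner(int I, int J, int P, int Q)
{
    owner_t o;
    o.p_row = I % P;   /* block-row I mapped cyclically over the P grid rows */
    o.p_col = J % Q;   /* block-column J mapped cyclically over the Q grid columns */
    return o;
}

/* Example: with P = Q = 4 and NB = 100, global element (a, b) lies in block
 * (a / 100, b / 100) and is therefore owned by process
 * ((a / 100) % 4, (b / 100) % 4) of the virtual grid. */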

4.1.2. Blas Optimization. HPL makes extensive use of level-3 matrix operations at the block level. An optimized Blas (Basic Linear Algebra Subprograms) [5] was used to get the best possible performance on the labtec cluster. The Atlas [14] library has been chosen, since it gives good results on the Pentium III architecture [11]. Fig. 4 gives the performance (in MFlops/s) obtained for dgemm (matrix product) with the sequential Blas on one node of the labtec cluster. The peak performance is 800 MFlops/s; for a dual processor with a 1.1 GHz clock, this result is considered weak ([11] reports more than 80% of the peak performance on such an architecture). Fig. 5 shows the results obtained with Atlas compiled with thread support: the best performance obtained is 1.68 GFlops/s.

Figure 4. Atlas performance on dgemm (MFlops/s vs. matrix size), on a dual Pentium III node, without thread support.

Figure 5. Blas performance (MFlops/s vs. matrix size) with thread support, on a dual Pentium III node.

4.1.3. HPL Configuration. The cluster and HPL configuration therefore used message-passing communication between the 16 nodes and shared-memory communication internally, within the dual-processor nodes, through the multithreaded Blas. The HPL input file has been adapted to the labtec architecture, in order to reach the highest possible performance:

• process grid of size 4 × 4;
• block size NB = 100;
• sequential variant of the factorization algorithms set to Crout;
• matrix size N chosen so as to fill the total memory available in the cluster (i.e. N² × sizeof(double) ≈ 16 GB), which gave N = 41,000.

With these parameters, a typical run lasts around 35 minutes. The best achieved performance was 22.84 GFlops/s.

4.2. β-MPI Overhead

The β-MPI library allows the automatic generation of the DFG of the HPL benchmark. It has been tested with the best parameters determined for Linpack, with the six different broadcast algorithms, in order to determine the best algorithm for our network topology. The overhead due to the DFG generation is small, as can be seen in Fig. 6 and Fig. 7: the largest difference measured was 10 MFlops/s, i.e. an overhead of 0.02% compared with the HPL execution without β-MPI. This was to be expected, since the β-MPI library only updates, locally, a data-structure in memory, which is negligible compared with the huge volume of messages transmitted during a Linpack run. A sketch of how such locally accumulated counters could be assembled into the final DFG is given after Fig. 7.

Figure 6. Linpack performance (GFlops/s) for the six variants of broadcast, with and without β-MPI.

Figure 7. Execution time of β-MPI with the six variants of broadcast.
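The paper does not show the body of the overloaded bMPI_Finalize (Fig. 3). The following is only a sketch of one way it could assemble the global DFG, assuming the per-process dfg_edges[] counters sketched in Section 3: each rank contributes one row of the adjacency matrix, and rank 0 gathers and prints the edges.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

extern long long *dfg_edges;   /* as in the sketch after Fig. 2 */

int bMPI_Finalize(void)
{
    int me, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (dfg_edges == NULL)     /* this process never sent anything */
        dfg_edges = calloc(nprocs, sizeof(long long));

    long long *matrix = NULL;
    if (me == 0)
        matrix = malloc((size_t)nprocs * nprocs * sizeof(long long));

    /* Collect every process's outgoing-traffic row on rank 0. */
    MPI_Gather(dfg_edges, nprocs, MPI_LONG_LONG,
               matrix, nprocs, MPI_LONG_LONG, 0, MPI_COMM_WORLD);

    if (me == 0) {
        for (int i = 0; i < nprocs; i++)
            for (int j = 0; j < nprocs; j++)
                if (matrix[i * nprocs + j] > 0)
                    printf("P%d -> P%d : %lld bytes\n",
                           i, j, matrix[i * nprocs + j]);
        free(matrix);
    }
    return MPI_Finalize();
}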

4.3. Analysis of the Data-Flow Graph of the Linpack Benchmark

HPL allows one of the six following broadcast algorithms to be chosen: increasing ring, modified increasing ring, increasing two-ring, modified increasing two-ring, bandwidth-reducing and modified bandwidth-reducing. The ring algorithms propagate the update data in a pipelined fashion: in the increasing ring variant, process 0 sends the data to process 1, process 1 sends it to process 2, and so on (a sketch of this pattern is given below). The increasing two-ring variant propagates the data in two pipelines concurrently: the N processes are divided into two parts (0 . . . N/2 − 1 and N/2 . . . N − 1), and process 0 sends the data to process 1 and to process N/2 − 1; processes 1 and N/2 − 1 then start the two pipelines. In the modified increasing ring algorithm, process 0 sends two messages, one to process 1 and one to process 2; process 2 receives the message and forwards it to the remaining processes through a classical ring. The modified increasing two-ring algorithm behaves as follows: first process 0 sends to process 1, then the N − 2 remaining processes are divided into two equal parts (2 . . . N/2 and N/2 + 1 . . . N − 1); process 0 sends to process 2 and process 1 sends to process N/2 + 1; processes 2 and N/2 + 1 then act as the sources of two rings. The bandwidth-reducing algorithm cuts the message into N parts, according to the number of processes, and scatters them across the N processes. The modified bandwidth-reducing algorithm is the same, except that process 0 first sends to process 1 and the bandwidth-reducing algorithm is then applied to the other processes.
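The plain increasing-ring propagation can be summarized by a small sketch (not HPL source code); the root is taken as process 0 among n processes.

#include <mpi.h>

/* Sketch only: successor of process `me` in the plain "increasing ring"
 * broadcast described above. The last process has nobody to forward to. */
static int increasing_ring_next(int me, int n)
{
    return (me == n - 1) ? MPI_PROC_NULL : me + 1;
}

/* In the "increasing two-ring" variant, process 0 feeds two such pipelines
 * (starting at processes 1 and N/2 - 1, as described above), so each message
 * traverses roughly half as many hops before reaching the last process. */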

The β-MPI library has been used to generate the DFG associated with each type of broadcast used by HPL. The resulting graphs are shown in Fig. 8 to Fig. 12. In all the figures, dashed, normal and thin lines represent increasing communication volumes: light, intermediate and heavy traffic, respectively. Table 1 shows, for each type of broadcast, the exact volume of data transferred, as represented in the figures (the bandwidth-reducing DFG has only two classes of traffic). The first point to note is that the β-MPI library traces only row/column communications in the virtual grid of processes: no communication has been traced outside these network routes, as expected. Fig. 8 and Fig. 9 show the DFGs associated with the same broadcast algorithm (2ringM), but for two different virtual grids of processes: 2 × 8 in the first case and 4 × 4 in the second. They are composed, respectively, of 16 thin, 8 normal and 24 dashed edges, and of 16 thin, 24 normal and 8 dashed edges.


Algorithm         dashed    normal    thin
1 ring            1 Mb      0.5 Gb    1.3 Gb
1 ringM           0.5 Gb    0.8 Gb    1.4 Gb
2 ring            0.5 Gb    0.8 Gb    1.4 Gb
2 ringM           0.5 Gb    0.8 Gb    1.4 Gb
Band-Reducing     0.4 Gb    --        1.4 Gb
Band-ReducingM    0.5 Gb    0.8 Gb    1.1 Gb

Table 1. Volume of Data Transfer for each Broadcast Algorithm.

Notice that although "diagonal" links appear in the 2 × 8 case, these are due to broadcasts along a row of processes, since a virtual row includes, in this topology, 8 processes (P1-P8 and P9-P16). Clearly, the traffic is larger in the 2 × 8 case, because of these overloaded "diagonal" links: this topology leads to long linear chains of processes along which to broadcast the data, and the consequent accumulation of blocks to route explains the measured overload.

Figure 8. DFG of the 2ringM Broadcast, topology 2 × 8.

Once the superiority of one topology has been established by the β-MPI DFG, what can be said of the different broadcast algorithms? The bandwidth-reducing algorithm (Fig. 10) shows heavily loaded links for the broadcasts along the columns (edges P1-P13, P2-P14, P3-P15 and P4-P16, all with 1.4 Gb of messages); it also shows more traffic along some of the horizontal links than the other broadcasts (e.g. P1-P4: 1.4 Gb, vs. 0.8 Gb in the modified 2-ring, Fig. 9). This algorithm therefore seems to be worse than the others. Notice that Fig. 6 showed that this broadcast algorithm yields the worst timing results, which is consistent.

Figure 9. DFG of the 2ringM Broadcast, topology 4 × 4.

On the other hand, the modified bandwidth-reducing algorithm, presented in Fig. 11, obtained better performance than the plain bandwidth-reducing variant. The modified variant is divided into three classes of traffic (with 32 thin, 8 normal and 8 dashed edges), whereas the original variant is only broken into two classes (32 thin and 16 dashed edges), as shown in Table 1. The topology of both graphs is the same; what distinguishes them is the network load: the most heavily loaded links of the modified version only route 1.1 Gb of data, vs. 1.4 Gb in the original case. Since these links are the most numerous (there are 32 of them), they weigh heavily in the total data volume to be exchanged. Once more, our analysis of the DFG is consistent with the performance measurements, which show that the modified bandwidth-reducing variant is much better than the plain bandwidth-reducing one.

Finally, as far as ring-based algorithms are concerned, the DFG of the 1-ring algorithm (see Fig. 12) can be compared to the graph of the modified 2-ring broadcast already shown in Fig. 9 for the same 4 × 4 topology. The graphs of the modified 1-ring and of the 2-ring are not shown, since they are equal to the DFG of the modified 2-ring. One can clearly see that:

• the 12 vertical edges (e.g. P1-P13, P1-P5, P2-P6, etc.) are in the same "thin" category in both algorithms, yet in the 1-ring case this means 1.3 Gb, vs. 1.4 Gb in the other one; based on that criterion alone, the 1-ring algorithm should be slightly better;
• 4 edges that appear in the modified 2-ring case are not present in the 1-ring DFG (P14-P16, P10-P12, P6-P8 and P2-P4);
• most edges (P13-P15, P9-P11, P5-P6, P1-P3, as well as P9-P1, P13-P9, P10-P2, P14-P6, etc.) are as loaded (around 0.5 Gb) in the modified 2-ring case as in the 1-ring case (see Table 1);
• 12 horizontal edges (P13-P14, P14-P15, P15-P16, etc.) are heavily loaded in the 1-ring case (1.3 Gb), but only normally loaded in the 2-ring case (0.8 Gb).

The DFG analysis thus shows a higher traffic for the 1-ring algorithm, due to the 12 heavily loaded "horizontal" edges, which sum up to an extra 6 Gb of data to be routed. The 12 links that save 1.2 Gb and the 4 edges that save 3.2 Gb are not enough to compensate for these extra 6 Gb. Without surprise, the measured performance confirms this analysis. Yet it can also be noticed that the difference observed in Fig. 6 between the 1-ring and 2-ring algorithms is slight, which is consistent with the balance of pros and cons observed in the DFGs of this case.

Figure 10. DFG of the Bandwidth-reducing Broadcast, topology 4 × 4.

Figure 11. DFG of the Bandwidth-reducing (modified) Broadcast, topology 4 × 4.

Figure 12. DFG of the 1ring Broadcast, topology 4 × 4.

5. Conclusion

Many parallel programming environments use the data-flow graph of an application to analyze and enhance its performance. In this work, a generic tool has been presented that allows the DFG of an MPI program to be created and analyzed. The Linpack benchmark, as implemented in its HPL version, has been used as the target application to validate the use of β-MPI and to conduct a deeper study of the kind of analyses that such a tool allows, with the six native broadcast algorithms offered by the HPL implementation. For each of the six algorithms, the study of the DFG provided by β-MPI has made it possible to explain the relative loss or gain of performance in the benchmark. All the tests have been done on a real cluster, with all the required optimizations, in order to work on a "real-life" use of Linpack. Thus, the interest of the DFG appears clearly, without any artifact due to a wrong tuning of the benchmark (e.g. a better performance due to some factor not related to the network).

Currently, the β-MPI library only provides the application DFG. The next step in our project is to use the DFG to automatically map the MPI processes onto the parallel architecture, by partitioning the graph according to the topology of the network. A second step will be to use process migration to obtain an on-line version of the allocation mechanism.

References

[1] M. A. Bender and M. O. Rabin. Scheduling Cilk multithreaded parallel programs on processors of different speeds. In SPAA '00: Proceedings of the Twelfth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 13-21, New York, NY, USA, 2000. ACM Press.
[2] G. G. H. Cavalheiro. Athapascan-1: interface générique pour l'ordonnancement dans un environnement d'exécution parallèle. PhD thesis, Université de Grenoble, 1999.
[3] R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, and J. McDonald. Parallel Programming in OpenMP. 2001.
[4] J. Dongarra, P. Luszczek, and A. Petitet. The Linpack benchmark: past, present, and future, December 2001. http://www.cs.utk.edu/~luszczek/articles/hplpaper.pdf.
[5] J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw., 16(1):1-17, 1990.
[6] F. Galilée, J.-L. Roch, G. G. H. Cavalheiro, and M. Doreille. Athapascan-1: on-line building data flow graph in a parallel language. In IEEE PACT, pages 88-95, 1998.
[7] F. Galilée, J.-L. Roch, G. Cavalheiro, and M. Doreille. Athapascan-1: on-line building data flow graph in a parallel language. Grenoble, France.
[8] W. Gropp, E. Lusk, and A. Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface, 1999.
[9] W. Grosso. Java RMI. O'Reilly & Associates, Newton, USA, 2002.
[10] A. Petitet, R. Whaley, J. Dongarra, and A. Cleary. HPL - a portable implementation of the High-Performance Linpack benchmark for distributed-memory computers, January 2004. http://www.netlib.org/benchmark/hpl/algorithm.html.
[11] B. Richard, P. Augerat, N. Maillard, S. Derr, S. Martin, and C. Robert. I-Cluster: reaching TOP500 performance using mainstream hardware. Technical Report HPL-2001-206, HP Laboratories Grenoble, August 2001.
[12] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra. MPI: The Complete Reference. The MIT Press, 1996.
[13] R. V. van Nieuwpoort, J. Maassen, G. Wrzesinska, T. Kielmann, and H. E. Bal. Satin: simple and efficient Java-based grid programming, 2004. Accepted for publication in Journal of Parallel and Distributed Computing Practices.
[14] R. C. Whaley and J. Dongarra. Automatically Tuned Linear Algebra Software. In Ninth SIAM Conference on Parallel Processing for Scientific Computing, 1999. CD-ROM proceedings.
[15] T. Yang and A. Gerasoulis. Pyrros: static scheduling and code generation for message passing multiprocessors. In 6th ACM International Conference on Supercomputing, pages 428-437, Washington, D.C., July 1992.

