Self-optimizing MPI Applications: A Simulation-Based Approach

Emilio P. Mancini (1), Massimiliano Rak (2), Roberto Torella (2), and Umberto Villano (1)

(1) RCOST and Dip. di Ingegneria, Università del Sannio
    {epmancini, villano}@unisannio.it
(2) Dip. di Ingegneria dell'Informazione, Seconda Università di Napoli
    {massimiliano.rak, r.torella}@unina2.it
Abstract. Historically, high-performance systems use schedulers and intelligent resource managers in order to optimize system usage and application performance. Most of the time, applications simply issue resource requests to the central system. This centralized approach is an unnecessary constraint for a class of potentially flexible applications, whose resource usage can be modulated as a function of the system status. In this paper we propose a tool that, in a way essentially transparent to final users, lets the application self-tune as a function of the status of the target execution environment. The approach hinges on the MetaPL/HeSSE methodology, i.e., on the use of simulation to predict execution times and of skeletal descriptions of the application to describe run-time resource usage.
1 Introduction
Geographically distributed software systems are pervasive in current computing applications. In commercial and business environments, the majority of time-critical applications have moved from mainframe platforms to distributed systems. In academic and research fields, the advances in high-speed networks and improved microprocessor performance have made clusters, networks of workstations and computational GRIDs an appealing vehicle for cost-effective parallel computing. However, the systematic use of distributed systems can be frustrating, due to difficulties in resource usage optimization and in the development of applications with good performance. Usually, high-performance systems such as clusters or NOWs (Networks of Workstations) adopt resource allocation systems and schedulers (e.g., Nimrod [1], MOAB [2] or MAUI [3], or the GRAM [4] component in Globus [5]). Following this approach, applications act as resource requesters: when an application starts, it requests the resources it needs. GRID environments, such as Globus, offer languages to describe the resources an application needs (RSL [6]). When the resources are actually available, the application is started. The main limit of this approach is that application requests are "static", i.e., the total amount of resources needed by an application cannot vary as a function of the
system status. For example, an application may declare how many nodes or processors it needs, but it has no way to make this choice taking the system status into account: the resource allocator blocks it until all the requested nodes or processors are available. This makes the problems of performance portability [7] and performance engineering [8] really hard to manage. Application developers, on the other hand, try to develop codes that are as flexible as possible, in order to optimize them easily on systems whose architecture changes continuously, due to the acquisition of new nodes inside a cluster, or to distribution in a GRID environment [9, 10]. Du, Ghosh, Shankar and Sun [11] propose an alternative approach: they develop a tool that allows MPI tasks to migrate from one node to another in order to optimize resource usage. This approach, as always with code migration, suffers from the large overhead due to migration and to the system monitoring that triggers migrations. In this paper we propose a new approach, which involves the application in resource allocation, even if it hides this process from final users. The idea is to rely on a tool that predicts the optimal application configuration (for example, in terms of number of nodes needed), using performance data collected by direct measurement, and simulation to predict system usage. The tool, MHstarter, is invoked every time users launch a new instance of the application. It hinges on an existing description language and simulation environment (MetaPL and HeSSE, respectively [12,13,14,15,16,17,18]). In other words, newly-developed services or applications can exploit the simulator to predict the system behavior in real, fictitious or possible future working conditions, and use its output to help optimize themselves, anticipating the need for new configurations or tunings. The remainder of the paper is structured as follows: the next section introduces the HeSSE/MetaPL prediction methodology. Then the tool architecture and its components are described, explaining how it can be used. After this, a simple case study is proposed. Finally, the conclusions are drawn and our future work is outlined.
2 MetaPL/HeSSE Methodology
HeSSE is a simulation tool that allows the user to simulate the performance behavior of a wide range of distributed systems for a given application, under different computing and network load conditions. The HeSSE compositional modeling approach makes it possible to describe Distributed Heterogeneous Systems by interconnecting simple components. Each component reproduces the performance behavior of a section of the complete system at a given level of detail. A HeSSE component is basically an object, hard-coded with the performance behavior of a section of the whole system. In more detail, each component has to reproduce both the functional and the temporal behavior of the subsystem it represents. In HeSSE, the functional behavior of a component is the set of services it exports to the other components, so that connected components can ask each other for services. The temporal behavior of a component describes the time spent servicing each request.
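To make the component abstraction more concrete, the following minimal C sketch pictures a component that exports a service (functional behavior) and charges simulated time for servicing a request (temporal behavior). It is purely illustrative: the type names and the interface are ours, not the HeSSE API.

    /* Illustrative only: NOT the HeSSE component interface. */
    #include <stdio.h>

    typedef struct Component Component;

    /* Functional behavior: a service other components can request.
       Temporal behavior: the simulated time the service takes.      */
    typedef double (*ServiceFn)(Component *self, const char *request);

    struct Component {
        const char *name;
        ServiceFn   serve;
        double      service_time;   /* time charged per request */
    };

    static double simple_serve(Component *self, const char *request)
    {
        printf("%s servicing \"%s\"\n", self->name, request);
        return self->service_time;  /* temporal behavior of this subsystem */
    }

    int main(void)
    {
        double sim_clock = 0.0;
        Component net = { "network", simple_serve, 0.002 };

        /* A connected component asks "network" for a service; the
           simulation clock advances by the time spent servicing it. */
        sim_clock += net.serve(&net, "send message");
        printf("simulated time: %g s\n", sim_clock);
        return 0;
    }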
System modeling is performed primarily at the logical architecture level. For example, physical-level performance, such as that resulting from a given processor architecture, is generally modeled with simple analytical models or by aggregate, rather than instruction-level, behavioral simulation. In other words, the use of a processor to execute instructions is modeled as the total time spent computing, without considering the per-instruction behavior. HeSSE uses traces to describe applications. A trace is a file that records all the actions of a program that are relevant for simulation. For example, the trace of an MPI application is a sequence of CPU bursts and requests to the runtime environment. Each trace is the representation of a specific execution of the parallel program. Trace files are simulation-oriented application descriptions, typically obtained through application instrumentation. When the application is not available, e.g., it is still being developed, they can be generated using prototypal languages. In past years the HeSSE framework was provided with an XML-based metalanguage for parallel program description, MetaPL [16,15,18]. MetaPL is language independent, and can support different programming paradigms or communication libraries. The core MetaPL notation can be extended through Language Extensions (XML DTDs), which introduce new constructs into the language. Starting from a MetaPL program description, a set of (extensible) filters makes it possible to produce different program views, among which are trace files that can be used to feed the HeSSE simulation engine. A detailed description of the MetaPL approach to program description, and of the trace generation process, is outside the scope of this paper and can be found in [15, 18].
The simulation and analysis process of a given application can be represented graphically as in Fig. 1. It is subdivided in three steps: System Description, Simulation and Results Analysis. Unsatisfactory analysis results may lead to the use of new or improved algorithms/code structures and to further analysis sessions. The System Description phase includes:
– MetaPL metacode production (Application Description);
– system architecture model definition (System Architecture Modeling);
– evaluation of time parameters (model tuning).
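To make the notion of a trace more concrete, the sketch below uses a simplified, hypothetical representation (not the actual HeSSE trace format) in which a trace is a sequence of CPU bursts and MPI runtime requests that a simulator can replay to advance its clock.

    /* Simplified, hypothetical trace representation: NOT the HeSSE format. */
    #include <stdio.h>

    typedef enum { CPU_BURST, MPI_REQUEST } EventKind;

    typedef struct {
        EventKind   kind;
        double      duration;   /* CPU burst length, in simulator time units */
        const char *mpi_call;   /* e.g. "MPI_Allgather" for MPI requests     */
        int         bytes;      /* message size for MPI requests             */
    } TraceEvent;

    int main(void)
    {
        /* One specific execution of a fictitious process: two CPU bursts,
           then a collective communication request.                        */
        TraceEvent trace[] = {
            { CPU_BURST,   10.0, NULL,            0    },
            { CPU_BURST,   10.0, NULL,            0    },
            { MPI_REQUEST,  0.0, "MPI_Allgather", 6720 },
        };

        double clock = 0.0;
        for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
            if (trace[i].kind == CPU_BURST)
                clock += trace[i].duration;   /* charge the CPU burst */
            else
                /* in a real simulator this request would be handed to a network model */
                printf("at %g: %s (%d bytes)\n",
                       clock, trace[i].mpi_call, trace[i].bytes);
        }
        printf("CPU time replayed: %g\n", clock);
        return 0;
    }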
Fig. 1. HeSSE simulation session
    <MetaPL>
      <Code>
        <Task name="test">
          <Block>
            <CodeBlock coderegion="Init" time="10"/>
            <Loop iteration="10" variable="nstep">
              <Block>
                <CodeBlock coderegion="Calc" time="10"/>
                ...

Fig. 2. A simple "core" MetaPL description

    <MetaPL>
      <Code>
        <Task name="test">
          <Block>
            <CodeBlock coderegion="Init" time="10"/>
            <MPI_Init/>
            <MPIAllGather dim="6720"/>
            <MPI_Finalize/>
      <Mapping>
        <NumberOfProcesses num="4"/>

Fig. 3. MetaPL description of a simple MPI code
The application description step consists essentially of the development of MetaPL prototypes, which make it possible to generate the trace files needed to drive the simulation. The system architecture model definition consists of the choice (or development, if needed) of the HeSSE components useful for the simulation, which are successively composed through a configuration file. At the end of this step, it is possible to reproduce the system evolution. The last step consists of running benchmarks on the target system, in order to fill the simulator and the MetaPL description with parameters, typically related to timing information (a sketch of such a micro-benchmark is shown at the end of this section). Among the extensions currently available for MetaPL, there is one that enables the description of applications developed using MPI. Figure 2 shows an example of a description exploiting only the core of the MetaPL language: a simple task containing a loop. Figure 3 describes an MPI application that performs just an AllGather primitive; the Mapping element contains information useful to define the execution conditions, such as the number of processes involved.
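As an example of the model-tuning step, the following micro-benchmark sketch measures the mean cost of an MPI_Allgather in which each process contributes 6720 bytes, the dim value used in Figure 3. It is only a sketch under our own assumptions (buffer handling, repetition count); it is not the benchmark suite actually used to tune HeSSE.

    /* Timing micro-benchmark sketch for model tuning (illustrative only). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, nproc;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        const int dim  = 6720;     /* bytes contributed by each process */
        const int reps = 100;      /* repetitions to average out noise  */
        char *sendbuf = calloc(dim, 1);
        char *recvbuf = calloc((size_t)dim * nproc, 1);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++)
            MPI_Allgather(sendbuf, dim, MPI_CHAR,
                          recvbuf, dim, MPI_CHAR, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("mean MPI_Allgather time on %d processes: %g s\n",
                   nproc, (t1 - t0) / reps);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }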
3 Simulation-Based Autonomic Applications: MHstarter
MetaPL and HeSSE help users predict the performance of a given system configuration for the execution of a target application. In this paper we propose their integration in a new tool, MHstarter, which automatically uses them to optimize the execution of MPI applications. From a user perspective, the tool is a sort of application launcher, just like mpirun, even if it needs to be suitably configured before the application is started. The user declares a new project, and provides the tool with the application executable and its MetaPL description. It should be noted that, according to the MetaPL/HeSSE methodology, application descriptions are generated in an integrated way with the final application code, thus making it possible to evaluate in a proactive way the performance tied to the design choices. MHstarter creates a project, in which it stores both the executable and the corresponding MetaPL description. Then the tool analyzes the MetaPL description and, using a suitable MetaPL filter, defines the configurations and the traces used to run the simulator for each chosen configuration. For any given problem dimension the user has to "prepare" the application launch, so that the tool can reduce the overhead at application start. Once the project is created, the application can be started simply by giving the tool a reference to the project, whose name is the same as that of the executable. Every time the user asks the starter to launch the application, the tool performs one or more simulations, retrieves the results of previous simulations, automatically analyzes this information and chooses the optimal configuration for the application. Finally, the application is started on the chosen nodes with optimal parameters.
Figure 4 summarizes the tool architecture. The main component of the proposed solution is the starter (in the following we will call MHstarter both the complete tool and the starter component). It is the executable used by final users to launch the applications, and its performance affects the application startup time. Along with the starter, the tool provides two simple programs, create and prepare, which create a new project and prepare the program for execution, respectively. The creation and preparation components help reduce the amount of time spent choosing the optimal configuration at application startup. In addition, the tool provides a data collector, which acts as a local daemon: at regular time intervals, the collector queries the available nodes, collecting such data as the computational workload on each node or the network load. The collector stores all the information collected in a file representing the system status. MHstarter uses this file to update, off-line, the simulator configuration files. The collector works independently of user requests, so it does not affect the performance of the application. On the downside, the proposed approach increases the application launch latency. In fact, the tool needs to perform several simulations to choose the optimal hardware/software configuration, and this affects the application execution time. As anticipated, in order to reduce the time spent to launch the
Fig. 4. Overview of the MHstarter architecture
applications, a large set of information is collected off-line (i.e., during project creation and application preparation), thus reducing the number of simulations performed when the application starts. The idea is that when the final user creates a new project and prepares it for startup (choosing application parameters such as the problem dimension), a set of simulations takes place, collecting results about possibly useful configurations. When the user starts the application, most of this information is already available.

3.1 Configuration Choice
In order to choose the optimal configuration, the tool has to evaluate:
– the system status, i.e., the current availability of compute nodes;
– the application-manageable parameters, i.e., the parameters available for optimization.
In this paper we use as optimization parameters the number of processes making up the application and several different allocation policies. The target system considered is a generic SMP cluster, with N nodes and m processors per node. For simplicity's sake, we will assume that the processors are not hyperthreaded. The system status is modeled simply as the mean load on every processor of the target environment. We represent the status by means of a Workload Matrix (WM), whose rows represent nodes and whose columns represent processors. In other words, the element WM_ij is the mean CPU usage of the j-th processor of the i-th node. The data collector component, acting as a daemon, continuously updates the system status. In the current implementation, the collector simply reports the actual status of the processors. Future versions of the tool might provide an extrapolation based on CPU usage history and current requests. Under the assumptions mentioned above, MHstarter has to evaluate the optimal system configuration, and hence how many processes to launch and on which processors. We represent a process allocation configuration by means of an Allocation Matrix (AM). Element AM_ij of the matrix is 1 if a process of the target application will be started on the j-th CPU of the i-th node, and 0 otherwise. MHstarter is able to convert this representation into HeSSE configuration files tuned on the target environment whenever necessary. It should be noted
that the number of possible configurations grows with the cluster dimension, and this may lead to an excessive time spent in simulation. A few considerations make it possible to limit the number of configurations useful for analysis:
– A process which shares its CPU with another task may be a bottleneck, i.e., sometimes it is better not to use a processor if the application would share it with another task.
– In an SMP cluster, a node may be a bottleneck, even if it has a free processor, if there is CPU load external to the application on one of the node's CPUs. In other words, sometimes it is better to have four CPUs on two biprocessor nodes than five CPUs on three biprocessor nodes.
– It is possible to simulate the time spent in execution on a system without load external to the application, before the user asks to start the application.
In order to reduce the application startup latency, the user has to ask the tool, off-line, to prepare application executions for the given problem dimension. The tool performs a set of simulations and collects the estimated times, so that when the application starts, the variation of the application execution times as a function of the number and distribution of available processors is already known. Note that usually the user starts the application multiple times with the same problem dimension (even if with different data). So, given a system status wm, we can define two allocation matrices (an illustrative sketch of these two policies is shown at the end of this subsection):
– am = P(wm, t), where am_ij = 1 ⇔ wm_ij < t, i.e., we start a process on the processors whose usage is under the given threshold;
– am = N(wm, t), where am_ij = 1 ⇔ (∀j, wm_ij < t), i.e., we start a process only on the nodes whose CPUs all have a usage under the given threshold.
Due to these considerations, MHstarter simulates at project startup all the possible application configurations with CPUs carrying no load external to the application, and stores the results. When the user starts the application, the tool:
– simulates the application behavior using all available processors and the actual system status;
– retrieves the results of the simulations (already done) for the allocation matrices given by P(wm, 0) and N(wm, 0);
– compares the results, chooses the best configuration, generates a suitable MPI machine file, and starts the application.
Using this approach, only one simulation is performed before the application is started, while simulation results for many other configurations are already available, and it is extremely likely that the optimal configuration is among those considered. The only drawback of this approach is that the user has to request explicitly a new project set-up when interested in a different problem dimension.
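The following C sketch illustrates the two allocation policies defined above. Identifiers, matrix sizes and the example workload are our own illustrative choices, not MHstarter internals.

    /* Illustrative implementation of the P(wm, t) and N(wm, t) policies. */
    #include <stdio.h>

    #define NODES 7   /* N nodes in the cluster    */
    #define CPUS  2   /* m processors per SMP node */

    /* P: am_ij = 1 iff wm_ij < t (use every CPU whose load is below the threshold). */
    void policy_P(double wm[NODES][CPUS], double t, int am[NODES][CPUS])
    {
        for (int i = 0; i < NODES; i++)
            for (int j = 0; j < CPUS; j++)
                am[i][j] = (wm[i][j] < t) ? 1 : 0;
    }

    /* N: am_ij = 1 iff all CPUs of node i are below the threshold
       (discard a whole node as soon as one of its CPUs carries external load). */
    void policy_N(double wm[NODES][CPUS], double t, int am[NODES][CPUS])
    {
        for (int i = 0; i < NODES; i++) {
            int free_node = 1;
            for (int j = 0; j < CPUS; j++)
                if (wm[i][j] >= t)
                    free_node = 0;
            for (int j = 0; j < CPUS; j++)
                am[i][j] = free_node;
        }
    }

    int main(void)
    {
        /* Example workload matrix: one CPU of node 3 and of node 5 is loaded. */
        double wm[NODES][CPUS] = {{0,0},{0,0},{0,100},{0,0},{0,99},{0,0},{0,0}};
        int p[NODES][CPUS], n[NODES][CPUS];

        /* A small positive threshold stands in for the paper's t = 0, so that
           idle CPUs (load exactly 0) are still selected.                      */
        policy_P(wm, 1.0, p);
        policy_N(wm, 1.0, n);

        for (int i = 0; i < NODES; i++)
            printf("node %d: P = [%d %d]  N = [%d %d]\n",
                   i + 1, p[i][0], p[i][1], n[i][0], n[i][1]);
        return 0;
    }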
4 A Case Study
The proposed technique is able to optimize the performance behavior of a configurable application, even if it introduces an overhead due to the time spent in simulation. In order to judge the validity of the approach, we present here a description of the technique as applied to a real system running an application developed with MetaPL/HeSSE. In particular, the case study is an implementation of the Jacobi method for iteratively solving linear systems of equations. We first modeled and developed the proposed algorithm. Then we used MHstarter to launch the application. The launcher chooses the optimal application parameters, using the simulation environment to predict the optimal execution conditions, and starts the application.

4.1 The Case Study Application
The application chosen to show the validity of the approach is an implementation of the Jacobi method for iteratively solving linear systems of equations. This numerical method is based on the computation, at each step, of the new values of the unknowns using those calculated at the previous step. At the first step, a set of random values is chosen. The method works under the assumption that the coefficient matrix is diagonally dominant [19]. Due to the particular method chosen, parallelization is very simple: each task calculates only a subset of the unknowns and gathers the rest of the unknowns vector from the other tasks after a given number of steps. Figure 5 shows a partial MetaPL description of the application (to improve readability, some XML tags not needed for code understanding have been omitted). This application is particularly suitable for our case study, because its decomposition can be easily altered, making it possible to run it on a larger or smaller number of processors, or even to choose uneven work-sharing strategies. Note that in this paper the only parameter used for optimization will be the number

    <MetaPL>
      <Code>
        <Task name="Jacobi">
          <Block>
            <CodeBlock coderegion="MPI_Init" time="10"/>
            <CodeBlock coderegion="Init" time="10"/>
            <Loop iteration="10" variable="nstep">
              <Block>
                <CodeBlock coderegion="Calc" time="10"/>
                <MPIAllGather dim="6720"/>
                ...
Fig. 5. Jacobi MetaPL description
of processes into which the application is decomposed. Of course, this affects the duration of the CPU bursts, i.e., of the times spent in the Calc code regions (a minimal MPI sketch of this decomposition scheme is shown below). The study of the effect of uneven decompositions will be the object of our future research. The execution environment is Orion, a cluster of the PARSEC laboratory at the Second University of Naples. Orion is an IBM Blade cluster with an X345 front-end. Its seven nodes are dual-processor SMPs with Pentium Xeon 2.3 GHz CPUs and 1 GB of RAM, interconnected by Gigabit Ethernet. The system is managed through Rocks [20], which currently includes Red Hat ES 3 and well-known administration tools such as PBS, MAUI and Ganglia. A Pentium Xeon 2.6 GHz machine with 1 GB of RAM is adopted as the front-end.
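To fix ideas, the following minimal MPI sketch shows the decomposition scheme described above: an even block partition of the unknowns, with an MPI_Allgather after each update step. The problem size, the number of steps and the update formula are placeholder assumptions of ours, not the code used in the experiments.

    /* Minimal MPI sketch of the parallel Jacobi scheme (illustrative only). */
    #include <mpi.h>
    #include <stdlib.h>

    #define N_UNKNOWNS 1680   /* assumed global problem size */
    #define N_STEPS    10     /* number of Jacobi iterations */

    int main(int argc, char **argv)
    {
        int rank, nproc;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* Even work sharing: each task owns a contiguous block of unknowns
           (N_UNKNOWNS is assumed divisible by the number of processes).   */
        int chunk = N_UNKNOWNS / nproc;
        double *x     = malloc(N_UNKNOWNS * sizeof *x);  /* full vector */
        double *x_new = malloc(chunk * sizeof *x_new);   /* local block */

        /* First step: start from the same random guess on every task. */
        for (int i = 0; i < N_UNKNOWNS; i++)
            x[i] = (double)rand() / RAND_MAX;

        for (int step = 0; step < N_STEPS; step++) {
            /* "Calc" region: update only the owned unknowns, using the
               values of the previous step (A and b are omitted here).  */
            for (int i = 0; i < chunk; i++)
                x_new[i] = 0.5 * x[rank * chunk + i];   /* placeholder update */

            /* Each task gathers the blocks computed by all the others. */
            MPI_Allgather(x_new, chunk, MPI_DOUBLE,
                          x,     chunk, MPI_DOUBLE, MPI_COMM_WORLD);
        }

        free(x);
        free(x_new);
        MPI_Finalize();
        return 0;
    }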
4.2 Results
We used a synthetic workload to model different working conditions of the computing cluster. In practice, the synthetic workload was injected on a variable number of nodes, in order to model different global workload states of the target machine. Our objective was to model the presence of other running applications or services. In each state, the application was started using both the standard mpirun launcher and the newly proposed one, and the performance figures were compared. The objective was to prove that MHstarter can react to the presence of an uneven additional workload, automatically finding a decomposition of the application tuned to the CPU cycles available in each node, and suitably taking into account the difference between intra-node and inter-node communications. In any case, our expectation was that the more uneven the available CPU cycles (due to the injected workload), the more significant the performance increase due to the adoption of the self-optimizing launcher should be. In the following, we will show for brevity's sake the system behavior in a simple working condition, evaluating the tool performance as compared to the absence of control (use of mpirun). Then, we will present a more systematic test of the effects of the synthetic workloads. The system state first considered is the following: a synthetic load is present in two of the seven available nodes of the system. The workload is injected asymmetrically, in only one of the two CPUs available per node. The collector component updates the system status file, so that, when the application starts, the workload matrix and the allocation matrices are (one row per node, rows separated by semicolons):

wm = [0 0; 0 0; 0 100; 0 0; 0 99; 0 0; 0 0]
P  = [1 1; 1 1; 1 0; 1 1; 1 0; 1 1; 1 1]
N  = [1 1; 1 1; 0 0; 1 1; 0 0; 1 1; 1 1]
Table 1 shows the application response times, both the simulated and the actually measured one, together with the resulting prediction error. In the first column of the table, W is the trivial configuration (use of all nodes and of all
Table 1. Application response time (seconds)

Configuration   real   simulated   % error
W               65.7   66          0.45%
P               52.7   53          0.57%
N               60.4   61          0.99%
CPUs, independently of the presence of the workload), whereas P and N are the configurations described in Section 3. The results show that the best configuration is P, which yields an execution time 20% lower (13 s less, in absolute terms). The application execution time with MHstarter (including the starting overhead) is t = 54.5 s, whereas with mpirun it is t = 65.7 s. This means a 16.8% performance increase due to the use of our technique. The example shown above is a single case in which the approach is advantageous. In order to show that it can be effective (i.e., that it improves application performance) in almost all the relevant cases, we measured the application response time using MHstarter and mpirun in different system states, injecting, as mentioned before, a synthetic workload in a variable number of nodes (once again, asymmetrically, in only one CPU per node). Figure 6 shows the application response time using both launchers, varying the number of nodes into which the extraneous workload was injected. mpirun always uses the trivial application decomposition (even decomposition over all the CPUs of all nodes), while MHstarter uses each time a different configuration, the one found by proactive simulation. Figure 7 shows the performance increase obtained using MHstarter in the seven proposed conditions. Note that on the right side of the diagram (where the workload is injected in 5, 6 and 7 nodes) the standard launcher performs better. This is due to the number of available CPU cycles becoming more and more even as the number of nodes into which the synthetic workload is injected rises. After all, in these conditions MHstarter chooses as the best one the "trivial" configuration, just as mpirun does, but introduces a
Fig. 6. Response times with MHstarter and mpirun for a variable number of nodes with injected workload
Fig. 7. Performance increase using MHstarter
larger overhead. It is important to point out that the overhead is almost independent of the execution time of the application. Hence, if the problem dimension grows, the effect of overhead tends to become negligible. On the other hand, Figure 7 also shows that the use of MHstarter is extremely advantageous when the available CPU cycles are unevenly distributed, and can lead to performance improvements up to 15%. This result confirms our expectations, and shows the validity of the approach.
5 Conclusions
In this paper we have proposed an innovative approach to application self-optimization. The proposed tool, in a way transparent to the final user, simulates a set of possible target system configurations and chooses the one that optimizes the application response time. The tool limits the number of simulations, collecting a large set of results off-line, in order to reduce the overhead linked to application start. The approach has been tested on a real case study, an implementation of the Jacobi algorithm running on an SMP cluster. The results of the tests show the substantial validity of the technique. Future work will extend the tool in order to improve the prediction of the system status, by estimating the expected load on the basis of the load history and of current user requests. The next releases of the tool will also take into account other application-dependent optimization parameters, automatically chosen from the MetaPL description.
References

1. Buyya, R., Abramson, D., Giddy, J.: Nimrod/G: An architecture of a resource management and scheduling system in a global computational grid. In: Proc. of HPC Asia 2000, Beijing, China (2000) 283–289
2. Cluster Resources Inc.: Moab Grid Scheduler (Silver) Administrator's Guide. (2005) http://www.clusterresources.com/products/mgs/docs/index.shtml
3. Jackson, D.B., Jackson, H.L., Snell, Q.: Simulation based HPC workload analysis. In: IPDPS '01: Proceedings of the 15th International Parallel & Distributed Processing Symposium, Washington, DC, USA, IEEE Computer Society (2001) 47
4. Globus Alliance: WS GRAM: Developer's Guide. (2005) http://www-unix.globus.org/toolkit/docs/3.2/gram/ws/developer
5. Foster, I., Kesselman, C., Tuecke, S.: The anatomy of the Grid: Enabling scalable virtual organizations. The International Journal of High Performance Computing Applications, Volume 15, num. 3, USA, Sage Publications (2001) 200–222. http://www.globus.org/research/papers/anatomy.pdf
6. The Globus Alliance: The Globus Resource Specification Language RSL. (2005) http://www-fp.globus.org/gram/rsl_spec1.html
7. Foster, I., Kesselman, C., Nick, J.M., Tuecke, S.: The physiology of the Grid: An open grid services architecture for distributed systems integration. (2002) http://www.globus.org/research/papers/ogsa.pdf
8. Smith, C., Williams, L.: Software performance engineering for object-oriented systems. In: Proc. CMG, CMG, Orlando, USA (1997)
9. Kleese van Dam, K., Sufi, S., Drinkwater, G., Blanshard, L., Manandhar, A., Tyer, R., Allan, R., O'Neill, K., Doherty, M., Williams, M., Woolf, A., Sastry, L.: An integrated e-science environment for environmental science. In: Proc. of Tenth ECMWF Workshop, Reading, England (2002) 175–188
10. Keahey, K., Fredian, T., Peng, Q., Schissel, D.P., Thompson, M., Foster, I., Greenwald, M., McCune, D.: Computational grids in action: the national fusion collaboratory. Future Gener. Comput. Syst. 18 (2002) 1005–1015
11. Du, C., Ghosh, S., Shankar, S., Sun, X.H.: A runtime system for autonomic rescheduling of MPI programs. In: ICPP '04: Proceedings of the 2004 International Conference on Parallel Processing (ICPP'04), Washington, DC, USA, IEEE Computer Society (2004) 4–11
12. Rak, M.: A Performance Evaluation Environment for Distributed Heterogeneous Computer Systems Based on a Simulation Approach. PhD thesis, Facoltà di Ingegneria, Seconda Università di Napoli, Naples, Italy (2002)
13. Mazzocca, N., Rak, M., Villano, U.: The transition from a PVM program simulator to a heterogeneous system simulator: The HeSSE project. In: Dongarra, J., et al. (eds.): Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science, Volume 1908, Berlin (DE), Springer-Verlag (2000) 266–273
14. Mazzocca, N., Rak, M., Villano, U.: Predictive performance analysis of distributed heterogeneous systems with HeSSE. In: Proc. 2001 Conference Italian Society for Computer Simulation (ISCSI01), Naples, Italy, CUEN (2002) 55–60
15. Mazzocca, N., Rak, M., Villano, U.: The MetaPL approach to the performance analysis of distributed software systems. In: Proc. of 3rd International Workshop on Software and Performance (WOSP02), IEEE Press (2002) 142–149
16. Mazzocca, N., Rak, M., Villano, U.: MetaPL: a notation system for parallel program description and performance analysis. In: Malyshkin, V. (ed.): Parallel Computing Technologies, Lecture Notes in Computer Science, Volume 2127, Berlin (DE), Springer-Verlag (2001) 80–93
17. Mancini, E., Rak, M., Torella, R., Villano, U.: Off-line performance prediction of message passing applications on cluster systems. In: Dongarra, J., et al. (eds.): Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science, Volume 2840, Berlin (DE), Springer-Verlag (2003) 45–54
18. Mancini, E., Mazzocca, N., Rak, M., Villano, U.: Integrated tools for performance-oriented distributed software development. In: Proc. SERP'03 Conference, Volume 1, Las Vegas (NE), USA (2003) 88–94
19. Quinn, M.J.: Parallel Computing: Theory and Practice. 2nd edn. McGraw-Hill (1994)
20. Papadopoulos, P.M., Katz, M.J., Bruno, G.: NPACI Rocks: tools and techniques for easily deploying manageable Linux clusters. Concurrency and Computation: Practice and Experience 15 (2003) 707–725