
Para 2010 – State of the Art in Scientific and Parallel Computing – extended abstract no. 57 University of Iceland, Reykjavik, June 6–9 2010 http://vefir.hi.is/para10/extab/para10-paper-57.pdf

MATE: toward scalable automated and dynamic performance tuning environment

Anna Morajko, Andrea Martinez, Eduardo César, Tomás Margalef, Joan Sorribes
Computer Architecture and Operating System Department, Universitat Autònoma de Barcelona, 08192 Bellaterra, Spain
{Anna.Morajko, Eduardo.Cesar, Tomas.Margalef, Joan.Sorribes}@uab.es, [email protected]

MATE (Monitoring, Analysis and Tuning Environment) is a tuning environment for MPI parallel applications [1]. It augments on-line automated performance diagnosis with dynamic code optimization, combining the advantages of automated analysis and computational steering. MATE does not require program modifications to expose steerable parameters; instead, it uses dynamic instrumentation to adjust program parameters. With MATE an application is monitored, its performance bottlenecks are detected, solutions are proposed, and the application is tuned on the fly to improve its performance. All these steps are performed automatically, dynamically, and continuously during application execution. MATE uses DynInst [2] to insert instrumentation into running code, collect performance data, and finally tune the application. The fundamental idea is that dynamic analysis and on-line modifications adapt the application behavior to changing conditions in the execution environment or in the application itself. MATE thus consists of the following components, which cooperate to control and improve the execution of the application [3]:

• Application Controller (AC) — a daemon-like process that controls the execution and dynamic instrumentation of individual MPI tasks.
• Dynamic monitoring library (DMLib) — a library that is dynamically loaded into application tasks to facilitate performance monitoring and data collection.
• Analyzer — a process that carries out the application performance analysis and decides on monitoring and tuning. It automatically detects performance problems on the fly and requests appropriate changes to improve the application performance.

MATE obtains the functionality required to parse and modify binary executables from the DynInst API. It provides a lightweight data-collection framework composed of a number of distributed daemon processes and a centralized analyzer process.
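The division of labor among these components can be sketched as follows. This is our own illustration, not MATE's actual interfaces: the event record fields and class names are hypothetical, standing in for the data that DMLib captures at each instrumentation point and that each AC forwards to the central Analyzer.

```python
# Hypothetical sketch of MATE's event flow (names are illustrative, not MATE's
# API): each instrumented task emits timestamped events through DMLib, its AC
# forwards them, and the central Analyzer consumes the merged stream.

from dataclasses import dataclass

@dataclass
class Event:
    task_id: int        # MPI rank that produced the event
    event_id: str       # instrumentation point, e.g. "iter_start"
    timestamp: float    # wall-clock time of the event
    params: tuple = ()  # values captured at the instrumentation point

class Analyzer:
    """Central process: receives events from all ACs and keeps per-task state."""
    def __init__(self):
        self.events_by_task = {}

    def receive(self, ev: Event):
        self.events_by_task.setdefault(ev.task_id, []).append(ev)

    def event_count(self):
        return sum(len(evs) for evs in self.events_by_task.values())

# Each AC would forward the events produced by its task's DMLib:
analyzer = Analyzer()
for rank in range(4):
    analyzer.receive(Event(rank, "iter_start", 0.0))
    analyzer.receive(Event(rank, "iter_end", 1.5))
```

The sketch makes the centralization visible: every event from every rank lands in one process, which is precisely the property the later sections set out to relax.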
The centralized performance analyzer is driven by a number of so-called tunlets that implement specific performance models and algorithms; these evaluate the current behavior of a running application and suggest tuning actions. MATE has been demonstrated to be an effective and feasible tool for improving the performance of real-world applications [4]. Extensive experimental work has been conducted with parallel applications based on the master/worker paradigm and with automatic tuning of data-distribution algorithms such as factoring. The basic tunlet that was evaluated implements the factoring data-distribution tuning algorithm [5]. This algorithm calculates the optimal values of the factoring distribution parameters; these values are then applied to the application by dynamic instrumentation. By design, MATE is suitable for any Linux-based cluster environment running MPI applications. In particular, automatic tuning has been applied to a parallel master/worker implementation of a forest fire simulator called XFire, developed at UAB [6]. Forest fire propagation is a very complex problem that involves several distinct aspects: meteorological conditions such as temperature, wind, and moisture; vegetation features; and terrain topography. The simulation system is used to predict the evolution of a forest fire in different scenarios and to help minimize fire risks and damage. Given its complexity, this simulation requires high computing capabilities. The experiments with this highly resource-demanding application and MATE in a cluster environment have proven the benefits of dynamic tuning. However, the tool has so far been used only on a small cluster. The next step of our research aims to port the existing implementation of MATE to large-scale parallel systems. The objective is to examine and resolve all scalability issues that may appear when running on thousands of processors. The key problems are related to the volume of collected data and to the centralized performance analysis.
MATE bases its performance analysis on a global application view that takes into consideration all the processes and their interactions. Such an approach is feasible for environments with a relatively small number of nodes. However, the centralized analysis becomes a scalability bottleneck as the number of nodes increases. We want to solve this problem by distributing the performance analysis. The Analyzer component is the main bottleneck of the MATE environment because of the following factors:

• All events generated by all application processes are sent to this component.
• The number of connections and AC daemons to manage grows with the application size.
• The volume of events the Analyzer must process increases its response time.
• The performance models, and thus the tuning techniques (tunlets), that we have analyzed are adequate for a centralized and sequential approach. Although these models are kept simple enough to be evaluated at run time, in many cases their cost depends on the number of processes; if this dependence is not linear, scalability will be poor.

To overcome these barriers to scalability, MATE is being extended using overlay networks. An overlay network is a network of computers constructed on top of another network. Nodes in the overlay are connected by virtual or logical links, each of which corresponds to a path in the underlying network; for example, many peer-to-peer networks are overlay networks because they run on top of the Internet. Such networks are scalable, flexible, and extensible. Therefore, to make the MATE environment scalable, we propose to adapt it by applying a Tree-Based Overlay Network (TBON) inside the MATE infrastructure. TBONs [7, 8] are virtual networks of hierarchically organized processes that exploit the logarithmic scaling properties of trees to provide efficient data multicast, gather, and aggregation. An example implementation of the TBON model is the MRNet framework [9], developed at the University of Wisconsin. MRNet is a software overlay network that provides efficient multicast and reduction communications for parallel and distributed tools and systems.
It is useful for building scalable parallel tools, as it introduces a tree hierarchy of processes between the tool's front-end and back-ends to improve group communication performance. These internal processes are also used to distribute many important tool activities that are normally performed by the front-end or by a tool daemon. Because computation is distributed and performance data are aggregated, MRNet reduces data-analysis time and keeps the front-end load manageable. The TBON model allows us to develop a new structure for the MATE environment, shown in Figure 1, that will make it scalable.
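The logarithmic scaling property that makes TBONs attractive can be illustrated with a minimal sketch. This is our own illustration of the tree-reduction idea, not MRNet's API: internal nodes of fanout k each combine the data of their children, so the front-end sees only k connections and the tree depth grows as log_k of the number of back-ends.

```python
# Minimal sketch of the TBON idea (illustrative, not MRNet's API): internal
# processes arranged in a tree of fanout k aggregate data from their children,
# so a front-end serving N back-ends sees only k direct connections and the
# data crosses only ~log_k(N) levels.

import math

def tbon_depth(num_backends, fanout):
    """Number of internal levels needed to reach num_backends leaves."""
    return max(1, math.ceil(math.log(num_backends, fanout)))

def aggregate(values, fanout, combine=sum):
    """Reduce leaf values level by level, as internal TBON processes would."""
    level = list(values)
    while len(level) > 1:
        level = [combine(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]

# 1024 back-end daemons with fanout 32 need only 2 internal levels, and the
# front-end receives a single already-aggregated value.
depth = tbon_depth(1024, 32)
total = aggregate(range(1024), 32)
```

Any associative `combine` function (sum, max, histogram merge) can be pushed into the tree this way, which is exactly the property the next section exploits for event aggregation.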

Currently, MATE sends all events from all processes to the central Analyzer. By applying the TBON architecture, this data flow will be reduced. The problem of the number of connections to the global Analyzer will be solved by introducing a hierarchy of processes between the ACs and the Analyzer. These internal processes will provide event aggregation: each process will be responsible for receiving the events generated by a small set of daemons, aggregating them, and passing the result to the next level up in the network. In this way, aggregated data will arrive at the global Analyzer. Each level may aggregate, filter, or sort data, which reduces the data volume and the processing time required by the analysis. All events generated by the application processes are sent through the levels of the TBON (in the daemon-to-front-end direction). In the same way, all requests that must be delivered to the daemons may be transmitted through this network; for example, a tuning request issued by the global Analyzer (front-end) will be passed down level by level until it reaches the proper AC (daemon) process. However, for certain tuning techniques the TBON alone will not solve all scalability problems. Techniques that require the evaluation of performance metrics calculated over a certain event pattern will still bottleneck the front-end process. Consider, for example, calculating an iteration time for each application process as the difference between the start and the end of the iteration, or calculating the delay between a send event in one process and the corresponding receive event in another. It is impossible to evaluate these metrics directly in the application process, as the required information is available only at the global Analyzer. To solve these problems, we propose a new approach based on the distributed evaluation of metrics. The idea is to delegate certain calculations to the internal TBON processes and thus offload the global Analyzer.
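The iteration-time example above can be sketched as a filter running inside an internal TBON process. The class and event names here are our own illustration of the idea: the node matches start/end event pairs per process and forwards only the computed iteration times, rather than the raw events, toward the Analyzer.

```python
# Sketch (our own illustration, not MATE's implementation) of delegating
# metric evaluation to a TBON node: the node pairs "iter_start"/"iter_end"
# events per task and emits only the resulting iteration times upward.

class IterationTimeFilter:
    """Runs inside an internal TBON process, serving the daemons below it."""
    def __init__(self):
        self.pending_starts = {}   # task_id -> timestamp of unmatched iter_start

    def on_event(self, task_id, event_id, timestamp):
        """Consume one event; return (task_id, iteration_time) when a pair closes,
        or None while the pattern is still incomplete."""
        if event_id == "iter_start":
            self.pending_starts[task_id] = timestamp
            return None
        if event_id == "iter_end" and task_id in self.pending_starts:
            return (task_id, timestamp - self.pending_starts.pop(task_id))
        return None

f = IterationTimeFilter()
f.on_event(3, "iter_start", 10.0)        # no output yet: pattern incomplete
result = f.on_event(3, "iter_end", 12.5) # pair closed: iteration took 2.5 s
```

Two raw events per iteration per process are thus replaced by a single derived value, and the front-end never needs to see the individual timestamps.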
Each tunlet must provide arithmetic expressions that characterize the event patterns, and these declarations are distributed to the TBON nodes. In this way, each TBON node provides a filter that detects the required patterns and evaluates the given arithmetic expressions. This analysis is distributed and transparent to the global Analyzer. Moreover, we can apply this solution to performance models beyond plain Master/Worker, such as hierarchical Master/Worker or Master/Worker of pipelines. The first model can be applied in the case of the XFire simulator, where the data distribution may cause a scalability bottleneck: the Master process may distribute the work to a set of Sub-Masters, each of which manages a set of workers.
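A tunlet-supplied metric declaration might look like the following. The declaration format is entirely our invention, used only to make the mechanism concrete: the tunlet names an ordered event pattern and an arithmetic expression over the matched events, and every TBON node applies the same declaration to its own slice of the event stream.

```python
# Hedged sketch of a tunlet-supplied metric declaration (the format is our
# invention, not MATE's): a pattern of events plus an arithmetic expression
# over their timestamps, evaluated by each TBON node independently.

metric_decl = {
    "name": "iteration_time",
    "pattern": ("iter_start", "iter_end"),         # ordered event pair, same task
    "expression": lambda start, end: end - start,  # over the pair's timestamps
}

def evaluate(decl, events):
    """Apply decl to one task's event list of (event_id, timestamp) tuples."""
    results = []
    first_id, second_id = decl["pattern"]
    start_ts = None
    for event_id, ts in events:
        if event_id == first_id:
            start_ts = ts
        elif event_id == second_id and start_ts is not None:
            results.append(decl["expression"](start_ts, ts))
            start_ts = None
    return results

times = evaluate(metric_decl, [("iter_start", 1.0), ("iter_end", 3.0),
                               ("iter_start", 4.0), ("iter_end", 7.5)])
```

Because the declaration travels with the tunlet, new metrics can be added without changing the TBON nodes themselves; the nodes remain a generic pattern-matching and expression-evaluation layer.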

[Figure 1 shows Machines 1 through N, each hosting application tasks instrumented through a DMLib and controlled by an AC; the ACs connect through internal TBON processes (hosted on machines up to N+K) to the global Analyzer.]

Figure 1: MATE architecture based on TBON overlay network.

In the case of Master/Worker with pipelines (where each worker is a pipeline), some parts of the analysis can be performed independently for each pipeline, and other parts globally for the whole Master/Worker. Here, each application process is controlled by one AC, and all the ACs of a pipeline are attached to one TBON node. Each TBON node collects the events and performs the part of the analysis corresponding to its pipeline, then transmits the result to the next level up and finally to the global Analyzer.
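The split between pipeline-local and global analysis can be sketched as follows. This is an illustrative decomposition of our own, under the simplifying assumption that a pipeline's throughput is bounded by its slowest stage and the whole Master/Worker by its slowest pipeline; a real tunlet would use its performance model's actual expressions.

```python
# Sketch (illustrative assumption: slowest stage bounds a pipeline, slowest
# pipeline bounds the Master/Worker) of splitting analysis between TBON nodes
# and the global Analyzer in the Master/Worker-of-pipelines model.

def pipeline_local_analysis(stage_times):
    """Computed by the TBON node attached to one pipeline's ACs."""
    return max(stage_times)   # the slowest stage bounds pipeline throughput

def global_analysis(per_pipeline_results):
    """Computed by the front-end from the per-pipeline partial results only."""
    return max(per_pipeline_results)   # the slowest pipeline dominates

# Two pipelines of three stages each: only two partial results reach the
# front-end, instead of six raw stage measurements.
pipelines = {"p0": [1.0, 2.5, 1.2], "p1": [0.8, 0.9, 3.0]}
partials = {name: pipeline_local_analysis(t) for name, t in pipelines.items()}
bottleneck = global_analysis(partials.values())
```

The data reaching the front-end scales with the number of pipelines rather than the number of processes, which is the point of pushing the pipeline-local part of the analysis into the tree.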

References

[1] Morajko, A. "Dynamic Tuning of Parallel/Distributed Applications". PhD Thesis, Universitat Autònoma de Barcelona, 2004.
[2] Buck, B., Hollingsworth, J.K. "An API for Runtime Code Patching". Journal of High Performance Computing Applications, 2000.
[3] Morajko, A., Margalef, T., Luque, E. "Design and Implementation of a Dynamic Tuning Environment". Journal of Parallel and Distributed Computing, vol. 67, pp. 474-490, 2007.
[4] Morajko, A., Caymes-Scutari, P., Margalef, T., Luque, E. "MATE: Monitoring, Analysis and Tuning Environment of Parallel/Distributed Applications". Concurrency and Computation: Practice and Experience, vol. 19, pp. 1517-1531, 2006.
[5] Morajko, A., Caymes-Scutari, P., Margalef, T., Luque, E. "Automatic Tuning of Data Distribution Using Factoring in Master/Worker Applications". Lecture Notes in Computer Science, vol. 3315, pp. 132-139, 2005.
[6] Jorba, J., Margalef, T., Luque, E., Andre, J., Viegas, D.X. "Application of Parallel Computing to the Simulation of Forest Fire Propagation". Proc. 3rd International Conference in Forest Fire Propagation, vol. 1, pp. 891-900, November 1998.
[7] Roth, P.C., Miller, B.P. "On-line Automated Performance Diagnosis on Thousands of Processes". Proc. 11th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 69-80, New York City, March 2006.
[8] Arnold, D.C., Pack, G.D., Miller, B.P. "Tree-based Overlay Networks for Scalable Applications". 11th International Workshop on High-Level Parallel Programming Models and Supportive Environments (HIPS 2006), Rhodes, Greece, April 2006.
[9] Roth, P.C., Arnold, D.C., Miller, B.P. "MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools". SC '03, Phoenix, AZ, 2003.
