Modeling execution time of selected computation and communication kernels on Grids

M. Boullón1, J.C. Cabaleiro1, R. Doallo2, P. González2, D.R. Martínez1, M. Martín2, J.C. Mouriño2, T.F. Pena1, and F.F. Rivera1
1 Dept. Electronics and Computing, Univ. Santiago de Compostela, Spain.
2 Dept. Electronics and Systems, Univ. A Coruña, Spain.
Abstract. This paper introduces a methodology to model the execution time of several computation and communication routines developed in the frame of the CrossGrid project. The purpose of the methodology is to provide performance information about selected computational kernels when they are executed on a grid. The models are based on analytical expressions obtained from exhaustive monitored measurements. Although the kernels considered in this work include both application-dependent and general-purpose routines, the methodology can be applied to any kernel whose execution time is dominated by computations and/or communications. We focus on MPI-based communications. In addition, an interactive Graphical User Interface was developed to summarize and show the information provided by the models from different views.
1
Introduction
Performance evaluation, instrumentation, prediction and visualization of parallel codes have been found to be a complex multidimensional problem [1] in parallel and distributed systems, and the situation is even more critical in grid environments. Tuning the performance of codes on distributed memory systems has been a highly time-consuming task for users. When programming these systems, the reasons for poor performance of parallel message-passing and data-parallel codes can be varied and complex, and users need to be able to understand and correct performance problems in order to achieve good results. This is especially relevant when high-level libraries and programming languages are used to implement parallel codes. Performance data collection, analysis, prediction and visualization environments are needed to detect the effects of architectural and system variations. In high-performance computing, application performance is very sensitive to problem features such as code and data partitioning, and to machine computation and communication parameters. Performance prediction is an important engineering tool that provides timely feedback on design choices in program synthesis as well as in machine architecture development. Apart from prediction accuracy, prediction cost largely determines the utility of the tool. Performance prediction
approaches take many shapes, the choice of underlying modeling formalism depending on the desired trade-off between prediction cost and accuracy. Although potentially accurate, modeling formalisms such as stochastic Petri nets [2] or process algebras [3] are not attractive due to their exponential solution cost. Although approaches based on combinations of directed acyclic task graphs and queuing networks [4, 5] pair comparably high modeling power with high efficiency, the polynomial time complexity of the solution process still entails considerable cost for very large problem sizes. In analytical approaches, the application is transformed into an explicit, algebraic performance expression. In contrast to the above numeric approaches, symbolic prediction techniques offer even lower solution cost that is less dependent on problem size. For manual approaches, such as BSP [6] and LogP [7], the modeling cost is significant because of the labor-intensive and error-prone derivation effort. Alternatively, symbolic prediction techniques based on stochastic and deterministic directed acyclic graphs [8, 9] offer a mechanized scheme. However, unacceptable prediction errors arise when performance is largely dominated by contention for resources such as locks, servers, processors, network links, or memories. Sophisticated performance prediction tools are being developed by a number of groups; in particular, we can cite the PACE toolset [10], PERFORM and LEBEP [11], DIMEMAS [12], INSPIRE [13], Carnival [14], ALPES [15], P3T [16], PerFore [17] and Bricks [18]. Other works use prediction as part of a software-aided approach for particular applications, e.g. solvers; we cite SPEED [19], SCIRun [20] and PARAISO [21]. A long-term effort is the AlpStone [22] project, which simulates parallel performance using information from a database of kernel benchmarks and underlying hardware parameters, following a “skeleton application” approach.
The kernel benchmarks are representative of “algorithm classes”, and a “synthetic program” is built from these to represent the real application. The application of these techniques has so far been largely restricted to accurate performance prediction of code kernels, with the goal of automatic parallelization or execution steering. Moreover, all of them focus on specific parallel systems, at most on clusters of workstations, like DIMEMAS or Carnival; to our knowledge, none is currently fully adapted to grid environments, except Bricks and DAMIEN [23], a project that pursues this goal for DIMEMAS. On the other hand, most of these tools require simulation of the codes, which takes much time; they are not application-dependent; and some of them are tied to specific programming platforms: P3T is for High Performance Fortran programs, AlpStone and Carnival are for PVM programs, PERFORM is just for sequential systems, and PerFore is integrated in a specific compiler. In this work, we establish a methodology to obtain analytical models based on exhaustive measurements of execution times obtained in a monitored environment. The idea is to correlate execution times with monitoring information; in particular, we consider the computational power of each node as well as network bandwidth and latency.
The proposed models are application-dependent, and they focus on some kernels from the applications of the CrossGrid project [24], as well as other general-purpose kernels. In this way, the results are more accurate and, as the performance is modeled by simple analytical expressions, predictions are obtained very fast. A model for each MPI-coded kernel on the grid is established. An additional and important feature is that the information is shown in a friendly interactive visual tool for users and developers of applications. This GUI uses the analytical models to establish performance predictions, and it can include detailed information about specific parameters of the codes, as well as predicted information about their execution. In fact, it is an interface that provides interactivity in the analysis of the behaviour of the codes under different conditions (number of processors, distributions, input data, network parameters, ...). The GUI is intended for application developers and users interested in analyzing the performance of the selected kernels under different scenarios. The results could be used to modify the parameters of the parallel execution of the application, such as the number of nodes, the size of the problem, or the distribution of data. To enable a user to quickly find his way in the multidimensional design space, the GUI needs to be used interactively. In the long term, it can also be used by resource brokers and schedulers to select the best platform according to the predicted figures offered by the tool. Some of the main features of the tool are:
– It is an application-dependent tool, specifically applied to selected kernels that were analyzed in terms of performance to obtain analytical models. It can also be broadly used to study the performance of the communication routines themselves.
– It is an interactive tool, in the sense that the user can easily change the parameters that characterize the system or the problem, and then analyze their influence.
– It uses analytical models, and therefore the predictions are obtained fast. This feature is very important for interactivity.
– It is specifically developed for heterogeneous systems.
In the next sections, the methodology to obtain the models is described in detail.
2
A methodology to model execution times
The methodology is divided into two main stages. The first is a study of the behaviour of the kernels, which is useful to establish the parameters to be considered in the models. The second is to obtain the models by correlating execution times under different grid conditions, considering a range of values for the parameters that characterize the kernels.
2.1
Static characterization of the kernels
Initially, the kernels were characterized statically, in terms of the precise number of relevant events. One of these events is the number of floating point operations (FLOPs) required by the target kernel. This number was counted manually, or using tools like PAPI [25] when necessary. This value is then summarized in simple algebraic expressions involving parameters of the kernel, such as the size of some matrices, the number of iterations of some loop, the number of non-zero entries in a sparse matrix, or the degree of some polynomial preconditioners. All these parameters are statically established.
The study of the communication patterns generated by the MPICH-G2 routines used in the kernels is essential for predicting their overheads. As with the number of FLOPs, information about the number and size of the communications performed by each processor can be statically stated. In order to model MPI collective communications, the behaviour of many collective routines in terms of individual point-to-point communications was extracted. In this way, with information about latency and bandwidth, the cost of these routines can be estimated. MPICH-G2 distributes the processors in groups at different levels according to their communication behaviour; for example, level 0 is for WANs, level 1 for LANs, and so on. Therefore, the characterization of collective communications in MPICH-G2 as a set of point-to-point communications is based on this hierarchy of protocol levels (WAN, LAN, local, ...). In particular:
– MPI Bcast is implemented sequentially in level 0, and as a binary tree in the other levels.
– MPI Scatter is sequential in level 0, and a binary tree in the other levels.
– MPI Gather is also sequential in level 0, and a binary tree in the other levels.
– The associative MPI Reduce is implemented as an MPI Bcast in reverse order, and it includes the arithmetic or logic operation in each processor.
– The non-associative MPI Reduce is implemented as an MPI Gather, after which the operation is performed sequentially in the root processor.
– The implementation of MPI Barrier is based on a hypercube communication in each level, followed by an all-to-all communication in level 0.
Figure 1 shows an example of this hierarchical structure for a broadcast in a system with 12 nodes distributed in different levels.

2.2
The models
To obtain a precise model for the performance of the computational kernels, a large number of executions under different scenarios were performed. In particular, the kernels were executed on different sites and on the whole grid. Once all the static information about computations and communications was modelled, we dealt with the development of the performance models.

Fig. 1. A broadcast on 12 nodes

In this stage, JIMS [26], a monitoring tool developed in the CrossGrid project, was used. The cost of each kernel is measured through a large number of executions under different network conditions. Even though JIMS offers a broad amount of information, just a reduced set of its functionalities is needed for our purposes. In particular:
– The online workload per CPU is used to establish the performance models for computations.
– The latency and bandwidth between each pair of processors are used to define the models for communications.
The main idea of our methodology is to correlate monitoring information and features of the kernel with measured execution times. This method is based on the concept of a “cube of tests”. Consider, as an example, a point-to-point communication of a certain size. This kernel is executed K times in a short period of time, producing K measures of runtime, named Ti. Consider that M monitoring tests were performed in this same short period of time, producing M measures of latency and bandwidth, named Lj and Wj respectively. A cube of tests is defined as the cube in the (in this particular case, three-dimensional) space {L, W, T} limited by the minimum and maximum values achieved for these three parameters. Figure 2 illustrates a cube of tests. Note that this cube is defined for a certain message size, so a fourth dimension, the size of the message, has to be taken into account to obtain the model. Figure 3 illustrates how the monitored information is extracted when the kernels are executed. In order to minimize the number of executions that are influenced by the monitoring process, their number has to be higher than the number of monitoring tests. According to our experience, about 10 executions of the kernel between two consecutive monitoring stamps are enough.
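The construction of a cube of tests from the samples gathered in one window can be sketched as follows. The function name and the sample values are illustrative, not taken from the actual tool:

```python
# Sketch of building a "cube of tests" from monitored samples.
# Names and values are illustrative assumptions.

def cube_of_tests(latencies, bandwidths, runtimes):
    """Return the bounding box of the (L, W, T) samples gathered in a
    short time window: K runtime measures and M latency/bandwidth
    measures taken while the kernel was repeatedly executed."""
    return {
        "L": (min(latencies), max(latencies)),
        "W": (min(bandwidths), max(bandwidths)),
        "T": (min(runtimes), max(runtimes)),
    }

# Example: M = 3 monitoring samples and K = 6 kernel runs, keeping
# several kernel runs between consecutive monitoring stamps.
cube = cube_of_tests(
    latencies=[0.42, 0.45, 0.44],                   # ms
    bandwidths=[2.1e7, 2.0e7, 2.2e7],               # b/s
    runtimes=[11.8, 11.9, 12.0, 11.7, 12.1, 11.9],  # ms
)
```

A separate cube is built for each message size, giving the fourth model dimension mentioned above.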
Fig. 2. A cube of tests

Fig. 3. Samples of the execution of the kernels in a monitored environment
The monitoring information obtained in the period of time that defines the measurements must be homogeneous. If this is not the case, we assume that the status of the system is not stable, and it cannot produce a consistent cube of tests. In fact, a threshold to guarantee this homogeneity must be established to define the cubes. Figure 4 shows, as an example, the measured execution times for a ping-pong kernel executed 5000 times with a 32 KB message. Note that most of the measures (in this case 3150) lie in a thin interval of execution times; the others are not considered, either because they are influenced by the monitoring process or because they are regarded as spurious measurements.
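The homogeneity test applied before accepting a window of measurements can be sketched as follows. The threshold and acceptance ratio are illustrative assumptions, not the values used in the actual tool:

```python
# Sketch of the homogeneity test: keep only measures within a relative
# threshold of the median, and reject the whole window if too few
# samples survive. rel_tol and min_fraction are assumed values.
import statistics

def homogeneous_samples(times, rel_tol=0.05, min_fraction=0.6):
    """Return the accepted samples, or None if the system state was
    not stable enough to define a consistent cube of tests."""
    med = statistics.median(times)
    kept = [t for t in times if abs(t - med) <= rel_tol * med]
    if len(kept) < min_fraction * len(times):
        return None  # unstable window: discard it entirely
    return kept

# A single outlier (e.g. a run perturbed by monitoring) is dropped:
stable = homogeneous_samples([10.0, 10.1, 10.2, 9.9, 50.0])
```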
Fig. 4. Execution time (ms) of 5000 samples of the ping-pong kernel
Fig. 5. Samples of the monitored latency (ms) when the ping-pong kernel is executed
Fig. 6. Samples of the monitored bandwidth (b/s) when the ping-pong kernel is executed
The monitoring process is also affected by the execution of the kernels. As an example, Figures 5 and 6 show the values of, respectively, the latency and bandwidth obtained from JIMS while the experiment is being performed. Note that these values change when the kernel is actually being executed (this corresponds to the band in the middle of each figure, around sample number 800). Therefore, the values obtained just before and after the execution of the kernel are considered to define the cube of tests. Note that some spurious values were obtained; they have to be discarded.

2.3
The kernels
The kernels currently considered in the tool are the following:
– PARAISO is an MPI library of iterative methods for solving large sparse matrix systems. This library includes the implementation of analytical models to characterize a number of communication routines, which can also be considered as kernels. These models for the communication routines are needed not only for implementing the models of the kernels, but also for other users who are not interested in these specific kernels.
– From the air pollution application of the CrossGrid project, we focused on the routine that consumes most of the runtime of this application, called vertlq. It was analyzed, and an analytical model characterizing its performance was developed. This routine basically consists of an intensive computational part that is executed in parallel, i.e. locally, and just one reduction operation that involves communications. Both parts are uncoupled, so the model adds both contributions.
– From the flooding application of the CrossGrid project, we developed models for the kernels of this task. For this application, the standard PETSc library for solving large sparse systems is the main kernel.
– Concerning the HEP application of the CrossGrid project, it was coded using the master-slave paradigm. Its main kernel is the learning process of a neural network, which includes two parts: a parallel, computationally intensive part, and some communications: a gather and a scatter from the master to the nodes, and some point-to-point communications.
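Several of these kernel models rest on the point-to-point decomposition of the collectives given in Section 2.1. A minimal sketch of a broadcast cost estimate under that hierarchy, assuming a linear point-to-point model t = L + m/W and illustrative parameter values:

```python
# Sketch of a cost estimate for the hierarchical MPI_Bcast of
# Section 2.1: sequential over the level-0 (WAN) links, binary tree
# inside each site. The linear point-to-point model and all parameter
# values are illustrative assumptions, not the actual fitted models.
import math

def p2p_time(msg_bytes, latency_s, bandwidth_bps):
    """Linear model for one point-to-point message."""
    return latency_s + 8 * msg_bytes / bandwidth_bps

def bcast_time(msg_bytes, sites, wan, lan):
    """sites: node counts per site; wan/lan: (latency_s, bandwidth_bps).
    The root sends sequentially to each other site (level 0), then each
    site runs a binary-tree broadcast over its local nodes."""
    t_wan = (len(sites) - 1) * p2p_time(msg_bytes, *wan)
    # a binary tree over n nodes needs ceil(log2(n)) communication steps
    t_lan = max(
        math.ceil(math.log2(n)) * p2p_time(msg_bytes, *lan) if n > 1 else 0.0
        for n in sites
    )
    return t_wan + t_lan

# 32 KB broadcast over 3 sites of 4 nodes each (assumed WAN/LAN figures)
t = bcast_time(32 * 1024, [4, 4, 4], wan=(0.05, 1e7), lan=(1e-4, 1e8))
```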
3
The Graphical User Interface
An interactive graphical user interface (GUI) was developed in this task. It shows three types of information:
– Information based on the analytical models, such as predicted execution times, or load balance based on the predicted execution times on each node.
– Information about features of the kernel, such as the number and size of some collective communications.
– Information about the status of the grid, such as the latency between a certain pair of nodes.
Therefore, this tool can include detailed information about specific parameters of the code, as well as predicted information about its execution. The user can interact with this information by changing some parameters, such as the latency between a pair of nodes, and in this way analyze their effect on the overall performance. The interface provides interactivity in the analysis of the behaviour of the code under different conditions (number of processors, distributions, input data, network parameters, ...). The GUI of the PPC tool is intended for application developers and users interested in analyzing the performance of the selected kernels under different scenarios; the results could be used to modify the parameters of the parallel execution of the application, such as the number of nodes, the size of the problem, or the distribution of data. To enable a user to quickly find his way in the multidimensional design space, the GUI needs to be used interactively. In the long term, it can also be used by resource brokers and schedulers to select the best platform according to the predicted figures offered by the tool. Some of the main features of this GUI are:
– It is an application-dependent tool, specifically applied to selected kernels for which analytical models are available. Note, however, that it can also be used by others than the CrossGrid application developers, for example to study the performance of the communication routines themselves.
– It is an interactive tool, in the sense that the user can easily change the parameters that characterize the system or the code itself, and then analyze the influence of these changes on the overall performance.
– It uses analytical models, and therefore the predictions are obtained very fast. This feature is very important for interactivity.
– It is specifically developed for heterogeneous systems.
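The interactivity rests on the fact that evaluating a closed-form model is immediate. The "what-if" loop behind the GUI can be sketched with a hypothetical additive model (not one of the actual kernel models):

```python
# Sketch of the "what-if" evaluation behind the GUI: because the
# predictions come from closed-form analytical models, changing a
# parameter and re-evaluating is immediate. The additive model below
# is an illustrative placeholder.

def predicted_time(flops, node_power, msg_bytes, latency_s, bandwidth_bps):
    """Additive model: local computation plus one communication."""
    return flops / node_power + latency_s + 8 * msg_bytes / bandwidth_bps

# baseline prediction for an assumed scenario
baseline = predicted_time(1e9, 1e9, 1_000_000, latency_s=0.05, bandwidth_bps=1e8)
# the user doubles the latency in the GUI; the prediction updates instantly
what_if = predicted_time(1e9, 1e9, 1_000_000, latency_s=0.10, bandwidth_bps=1e8)
```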
4
Conclusions
This work presents a methodology to characterize the execution of selected kernels in a Grid environment. It is based on exhaustive executions of the kernels under different states of the Grid. As a Grid can often change its behavior, we only consider measures that are homogeneous in terms of monitored information. These pieces of homogeneous results define “cubes of tests” that are used to correlate execution times with the network and node information provided by a monitoring system. The methodology can be applied to kernels that are massively computational or that are dominated by communications. A GUI that exploits the resulting models was also presented in this paper.
5
Acknowledgments
This work was supported in part by the European Union through the IST-2001-32243 project “CrossGrid”.
References
1. M. Simmons and R. Koskela. Performance Instrumentation and Visualization. ACM Press, 1990.
2. M. Ajmone Marsan, G. Balbo and G. Conte. A class of generalized stochastic Petri nets for the performance analysis of multiprocessor systems. ACM Trans. Computer Systems, Vol. 2, pp. 93-122, 1984.
3. N. Götz, U. Herzog and M. Rettelbach. Multiprocessor and distributed system design: the integration of functional specification and performance analysis using stochastic process algebras. Proc. SIGMETRICS'93.
4. V.S. Adve. Analyzing the behavior and performance of parallel programs. PhD thesis, Techn. Report 1201, Univ. of Wisconsin, 1993.
5. V. Mak and S. Lundstrom. Predicting performance of parallel computations. IEEE Trans. Parallel and Distributed Systems, Vol. 1, pp. 253-270, 1990.
6. L. Valiant. A bridging model for parallel computation. Comm. ACM, Vol. 33, pp. 103-111, 1990.
7. D. Culler et al. LogP: towards a realistic model of parallel computation. Proc. 4th ACM SIGPLAN Symp., pp. 1-12, 1993.
8. T. Fahringer. Estimating and optimizing performance for parallel programs. Computer, pp. 47-56, Nov. 1995.
9. C. Mendes and D. Reed. Integrated compilation and scalability analysis for parallel systems. Proc. Int. Conf. Parallel Architectures and Compilation Techniques, pp. 385-392, 1998.
10. D.J. Kerbyson, E. Papaefstathiou, J.S. Harper, S.C. Perry and G.R. Nudd. Is Predictive Tracing too late for HPC Users? In “High Performance Computing”, Proc. HPCI'98 Conference, R.J. Allan, M.F. Guest, D.S. Henty, D. Nicole and A.D. Simpson (eds.), pp. 57-67, Plenum/Kluwer Publishing, 1999.
11. T. Hey, A. Dunlop and E. Hernandez. Realistic Parallel Performance Estimation. Parallel Computing, Vol. 23, 1997.
12. DIMEMAS. http://www.pallas.de/pages/dimemas.htm
13. K. Kubota, K. Itakura, M. Sato and T. Boku. Practical Simulation of Large-Scale Parallel Programs and its Performance Analysis of the NAS Parallel Benchmarks. Lecture Notes Comp. Sci. 1470, pp. 244-254, 1998.
14. Carnival. http://www.cs.rochester.edu/u/leblanc/prediction.html
15. J.P. Kitajima, C. Tron and B. Plateau. ALPES: a Tool for Performance Evaluation of Parallel Programs. In “Environments and Tools for Parallel Scientific Computing”, J.J. Dongarra and B. Tourancheau (eds.), North-Holland, pp. 213-228, 1993.
16. T. Fahringer. Estimating and optimising performance from parallel programs. Special issue, IEEE Computer, Vol. 28, pp. 47-56, 1995.
17. PerFore. http://ParaMount.www.ecn.purdue.edu
18. Bricks. http://www.is.ocha.ac.jp/~takefusa/bricks/
19. C.-C. Hui, M. Hamdi and I. Ahmad. SPEED: A Parallel Platform for Solving and Predicting the Performance of PDEs on Distributed Systems. Concurrency: Practice and Experience, Vol. 9, pp. 537-568, 1996.
20. M. Miller, C.D. Hansen and C.R. Johnson. Simulated Steering with SCIRun in a Distributed Environment. In “Applied Parallel Computing”, Proc. 4th International Workshop PARA'98, LNCS 1541, Springer, pp. 366-376, 1998.
21. PARAISO. http://www.ac.usc.es/~paraiso
22. AlpStone. http://www.ifi.unibas.ch/generate.doc/English/Research/ParProg/alpstone/doc.html
23. DAMIEN. http://www.hlrs.de/organization/pds/projects/damien/
24. CrossGrid project. http://www.eu-crossgrid.org/
25. PAPI. http://icl.cs.utk.edu/papi/
26. JIMS. http://wp3.crossgrid.org/