Graph Based Characterization of Distributed Applications

Gabriele Kotsis, Markus Braun

Institute of Applied Computer Science and Information Systems, University of Vienna, Lenaugasse 2/8, A-1080 Wien, Österreich, {gabi|braun}@ani.univie.ac.at

Abstract

A critical task in the development and execution of distributed applications is to identify the potential degree of parallelism contained in the application. This information is needed in the design of applications, in order to pursue only promising algorithmic ideas for implementation, but also in the execution of existing applications, for resource allocation and scheduling decisions. In this paper, we present analytical techniques to derive the potential degree of parallelism of distributed applications described by means of Timed Structural Parallelism Graphs (TSPGs). A TSPG allows the specification of a distributed application in terms of its components, the activation and dependence relations among the components, and histogram/interval based estimates of the execution times of the components. Based on an analysis of the paths through the TSPG (corresponding to paths in the execution) and by applying interval arithmetic, we are able to derive from the TSPG model a set of potential parallelism profiles. From these profiles, further performance indices such as the average degree of parallelism and the hypothetical speedup can be derived. In this paper we focus on an evaluation of the analysis technique with respect to its computational complexity, and validate the proposed approach by a comparison with results obtained from simulation.

Preprint submitted to Elsevier Preprint, 30 January 1999.

1 Introduction

Performance evaluation studies should be an integral part of the design of (parallel and distributed) applications in order to reduce development and performance debugging costs [1]. While measurement techniques can be the method of choice for evaluating existing systems, modeling approaches are necessary when evaluating systems in earlier design stages. Various performance prediction frameworks for evaluating parallel and distributed systems have been developed, in which aspects of workload (program), architecture, and mapping can be specified and evaluated in a structured and flexible way. Examples are PSEE [2], the PAPS tool set [3], the PRM approach [4], the approach by Mitschele-Thiel [5,6], or the GRADE environment [7]. In most of these approaches, graph models have been used to characterize the application. In previous work, we have introduced a new graph model, the structural parallelism graph (SPG) [8,9], which provides two major advantages over the traditional task graph: (1) it is more flexible with respect to the level of granularity of the distributed application it represents, and (2) communication restrictions are diminished, because a task can also communicate while it is still processing. In [10], the SPG concept has been extended to include quantitative information on the computation and communication demands of the application; this extended model is called a Timed Structural Parallelism Graph (TSPG), and an analysis technique for deriving parallelism profiles that considers this timing information has been proposed. In this paper, this analysis technique is evaluated with regard to the factors influencing its computational complexity and the accuracy of the information it produces.

Fig. 1. Example of a Timed Structural Parallelism Graph

2 The TSPG Model

Figure 1 shows an example of a TSPG, which is defined as follows:

Definition 1. A Timed Structural Parallelism Graph is an acyclic directed graph TSPG = (V, D, W, T), where V is a set of vertices corresponding to the components (parts of the application), D = D_P ∪ D_A is a set of directed arcs defining two different types of relations between the components, W is a set of arc weights (branching probabilities), and T = T_V ∪ T_A is a set of timing parameters. A timing parameter tv_i ∈ T_V is associated to a node and represents the duration of the execution of the corresponding component. The timing parameter ta_{i,j} ∈ T_A gives the instant in time, relative to the execution time of node i, at which node j is activated.

A component is a part of the application; depending on the granularity chosen for the modeling study, it can be, for example, a procedure or a task. In the following, we will simply use the term component. Single-valued parameters could be used as timing information for the components and arcs in the TSPG, producing a single point measure for each performance index of interest. However, the exact value of every component execution time may not be known to the performance analyst, leading to uncertainties in the parametrization. Furthermore, the performance of distributed systems may be variable due to (1) non-deterministic processing requirements (if the CPU requirements of the program vary significantly across different executions on a particular input), as well as (2) random delays due to interprocess communication events and contention for shared hardware and software resources. Existing approaches cope with these problems by assuming a certain distribution for the timing parameters, at the cost of either simplifying model assumptions (e.g., exponentially distributed parameters) or complex solution techniques. In our approach, we propose the use of intervals. To be concise: a timing parameter tv_i is given by

tv_i : pv_{i,k} [tv_{i,k}^-, tv_{i,k}^+], k = 1, ..., K,

where tv_{i,k}^- is the lower and tv_{i,k}^+ the upper bound of the k-th execution time interval, and pv_{i,k} represents the probability that the execution time of component i lies within interval k. The timing parameter ta_{i,j} is defined as

ta_{i,j} : pa_{i,j,l} [xa_{i,j,l}^-, xa_{i,j,l}^+], l = 1, ..., L,

where xa_{i,j,l}^- denotes a percentage and is the lower bound of the relative part of the execution time of component i at which i can activate component j, xa_{i,j,l}^+ represents the corresponding upper bound, and pa_{i,j,l} represents the probability that component j is activated by component i within interval l.

The example graph shown in Figure 1, representing a distributed application, therefore has the following semantics. After performing some initial computation, component 1 activates components 2, 3 and 4. During its execution, component 2 may activate component 5 or 6 (e.g., two subprocedures or subfunctions of component 2) at the specified activation intervals; it activates either component 5 (with probability 0.4) or component 6 (with probability 0.6). Component 3 activates both of its successor components during its execution. Component 4 activates either component 9 (with probability 0.2), component 10 (with probability 0.7), or component 11 (with probability 0.1). Component 20' is just a dummy node to reunite the OR branches. The execution of component 20 starts after the execution of its activating predecessor (component 19, 18, or 11) has finished.

Finding appropriate estimates to characterize the timing behavior is crucial for any further analysis. Parameters can be obtained, for example, from a static analysis of the program code. In [11,12] a performance estimator is introduced which computes a set of parallel program parameters such as network contention, transfer and computation time. These parameters can be determined selectively for statements, loops, procedures, or the entire program, and could be used to obtain appropriate lower and upper bounds for the execution times of different program parts. Another possibility to obtain interval parameters would be to measure the execution times of different program parts during execution; from several measurements, lower and upper bounds for the execution time of each program part could be obtained. Finally, the analyst might be able to estimate the execution times of different program parts in terms of best- and worst-case assumptions.
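As an illustration, the TSPG parameters above map naturally onto a small data structure. The following Python sketch (class and field names are our own, not part of the paper) encodes histogram/interval timing parameters for nodes and activation arcs, using values taken from Figure 1:

```python
from dataclasses import dataclass

# A histogram/interval estimate: (probability, lower bound, upper bound) triples,
# i.e. the p_k [low_k; high_k] notation of the paper.
Cases = list  # [(p_k, low_k, high_k), ...] with probabilities summing to 1

@dataclass
class Node:
    ident: int
    exec_time: Cases            # tv_i: execution-time intervals
    or_semantics: bool = False  # True if outgoing arcs are exclusive alternatives

@dataclass
class Arc:
    src: int
    dst: int
    activation: Cases  # ta_{i,j}: relative activation point(s) within src's execution

def check(cases: Cases) -> bool:
    """Validate a timing parameter: probabilities sum to 1, bounds are ordered."""
    return (abs(sum(p for p, _, _ in cases) - 1.0) < 1e-9
            and all(lo <= hi for _, lo, hi in cases))

# Component 2 of Figure 1: two alternative execution-time intervals ...
n2 = Node(2, [(0.7, 3.0, 3.1), (0.3, 5.8, 6.0)], or_semantics=True)
# ... activated by component 1 within one of two relative intervals.
a12 = Arc(1, 2, [(0.3, 0.1, 0.2), (0.7, 0.7, 0.9)])
assert check(n2.exec_time) and check(a12.activation)
```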

3 Deriving Parallelism Profiles

A parallelism profile represents the degree of parallelism of an application over time. In this paper we will use the following definition:

Definition 2. A parallelism profile (PP) is a sequence of n time intervals I_i, i = 1, ..., n, where i_i^- and i_i^+ are the lower and upper bound of interval I_i. To each time interval a numeric value n_{I_i} is assigned, indicating the number of components active in the SPG during the respective time interval.

Further performance indices can be derived from a parallelism profile, e.g. the average, minimum and maximum degree of parallelism (DOP), which are important parameters in mapping and scheduling decisions [13]. From the parallelism profiles, hypothetical execution times T(n) and speedups S(n) can also be derived, assuming n available processing elements.

A straightforward approach to derive parallelism profiles from TSPGs is to enumerate all possible states of the program (numbers of active nodes) and to derive state transition probabilities from the timing information. This approach is not applicable from a practical point of view, because of state space explosion and because the transition probabilities depend on previous states (non-Markovian behavior). To overcome these problems, alternative solution techniques have been proposed in [10]. A simplified approach is to consider either the minimum or the maximum execution time for each node. Each node then has only a single-valued timing parameter, and the computation of the potential parallelism profile is comparatively fast and simple, but the resulting profiles will not be representative of the actual program behavior. A more detailed analysis is provided by the Activation-Interval approach. First, all alternative paths through the TSPG are determined; the number of paths depends on the number of outgoing arcs with OR semantics. Then, for all nodes, the possible starting and terminating time intervals and the corresponding probabilities are determined. The input to this step is the specification of the TSPG (its structure and the timing parameters); the output is a set of interval pairs for each node. By comparing the start and end times of the components along all alternative paths, parallelism profiles can be derived. The algorithm produces a set of parallelism profiles and, for each profile, an associated probability indicating the likelihood that the actual execution exhibits this particular parallelism behavior.

It is obvious that both the accuracy and the computational complexity of this approach depend on the specific TSPG under study. Factors of influence include the structure of the TSPG (regular versus irregular structures), the number of nodes, the number and length of timing intervals, etc. In the following, we analyze the effect of these factors on computational complexity and accuracy.
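To make the profile-building step concrete, here is a simplified Python sketch (our own illustration, not the paper's exact algorithm): given fixed start and end times for the components along one path, a parallelism profile and the average DOP follow from a sweep over the event points:

```python
def parallelism_profile(spans):
    """spans: list of (start, end) times of active components.
    Returns a profile as (interval_low, interval_high, dop) triples."""
    points = sorted({t for s, e in spans for t in (s, e)})
    profile = []
    for lo, hi in zip(points, points[1:]):
        # A component is active in [lo, hi) if its span overlaps the interval.
        dop = sum(1 for s, e in spans if s < hi and e > lo)
        profile.append((lo, hi, dop))
    return profile

def average_dop(profile):
    """Time-weighted average degree of parallelism of one profile."""
    total = sum(hi - lo for lo, hi, _ in profile)
    return sum((hi - lo) * d for lo, hi, d in profile) / total

prof = parallelism_profile([(0, 4), (1, 3), (1, 2)])
# during [1, 2) all three components are active: prof contains (1, 2, 3)
```

In the Activation-Interval approach proper, the start and end times are intervals with probabilities rather than points, so this sweep is performed per alternative path and per interval combination.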

4 Evaluating the Computational Complexity of the Activation-Interval Approach

In order to evaluate the computational complexity, a series of experiments has been devised applying the

- Minimum approach,
- Maximum approach, and
- Activation-Interval approach

Approach              Time*    Input+   IOS    CoPa    CET    InSp   FNP    GePr   PP
Minimum               0.00753  24.38    0.15   12.70   8.70   17.04  4.33   32.70   6
Maximum               0.00752  24.39    0.15   12.72   8.76   16.95  4.33   32.69   6
Activation-Interval   0.01780   9.76    0.06    5.35  10.10   22.86  5.62   46.20  18

* Time gives the execution time in seconds.
+ The "Input" procedure reads the parameters of a TSPG from an input file.
The columns Input through GePr give the percentage of time spent in the respective procedures: IOS = identification of OR semantics, CoPa = computation of paths, CET = calculating end times, InSp = interval splitting, FNP = finding next profiles, GePr = generation of profiles. PP = number of profiles generated.

Table 1. Comparison of the Min/Max approaches and the Activation-Interval approach

on the TSPG in Figure 1. The minimum approach always takes the lower bound of the execution times and the earliest possible activation times for arcs, thus resulting in a deterministic timing model. The maximum approach analogously takes the upper bound of the execution times and the latest possible activation times. Table 1 shows the averaged results 1 for each approach; the number of profiles and the portion of time spent in the most important procedures are also given. As can be seen in Table 1, processing the TSPG with the Activation-Interval technique requires approximately two and a half times as much time as processing it with the minimum or maximum technique. Most time is spent in the "Interval Splitting" procedure, which computes starting and ending times for nodes with multiple predecessors and incoming AND semantics (approximately 17% for Min/Max and 23% for the Activation-Interval approach), and in the procedure "Generate Profile", which assembles parallelism profiles and writes them to a file (approximately 33% for Min/Max and 46% for the Activation-Interval approach).

1 All measurement experiments discussed in the following have been made on a Silicon Graphics Indigo 2 using Speedshop, an integrated package of performance tools. Each experiment series was repeated 100 times and averaged values are reported.

Fig. 2. Relative Execution Times of the Procedures

To analyze which procedures cause the time differences between the three approaches, Figure 2 compares the absolute time spent in the most important procedures. It shows that for the procedures responsible for determining all paths through the TSPG (procedures "Identify Or Semantics" and "Compute Path"), the execution time is approximately the same: the same number of paths has to be computed in each of the solution approaches. Differences can be observed for the procedures which compute starting and ending times for the nodes (procedures "Compute Execution Times" and "Interval Splitting") and for the procedure "Generate Profile". On average, these need approximately three times as long when processing the TSPG with the Activation-Interval approach as with the Min/Max approach. This increase is caused by the greater number of parallelism profiles that have to be generated for the Activation-Interval approach; exactly three times as many profiles are produced, resulting in the observed time increase. In principle, the number of profiles is driven by the number of paths and by the number of timing intervals associated to the nodes and arcs. As the Min/Max approach reduces the timing information to a single minimum or maximum value, the only remaining influence factor on the number of profiles is the number of paths through the TSPG. Differences in computational complexity between the Activation-Interval approach and the Min/Max approach will therefore usually arise when multiple timing intervals are associated to nodes or arcs; for TSPGs with only one timing interval per node and arc, the computational complexity of the Min/Max approach corresponds to that of the Activation-Interval approach. The analysis of the factors influencing the computational complexity of the Activation-Interval approach (the number of nodes, the number of arcs with AND/OR semantics, the number of timing intervals associated to nodes and arcs), which is part of the next section, therefore includes the analysis of the factors influencing the performance of the Min/Max approach (the number of nodes and the number of arcs with AND/OR semantics).

Fig. 3. Regularly Structured TSPGs

4.1 2^k Factorial Design

The second "benchmark" application for the Activation-Interval approach is characterized by the well-known "divide and conquer" structure (see Figure 3); in this example, all arcs are assumed to be activation arcs. To analyze the main factors influencing the performance of the Activation-Interval approach, appropriate concepts had to be found to obtain the maximum amount of information with a reasonable number of experiments. As the number of factors 2 is large and the number of levels (the values a factor can assume) is theoretically infinite, a full factorial design (in which all combinations of the levels of the different factors are tested) is impossible. Therefore, a 2^k factorial design was chosen to determine the effect of k factors. Four factors have been chosen for the analysis (see Table 2). Four binary variables x_A, x_B, x_C and x_D are introduced, representing the level each factor assumes (x_i = 1 denotes Level-1 and x_i = 2 denotes Level-2). Experiments have been made for all combinations of these variables; the results are listed in Table 3.

2 Following [14], a factor is a variable that affects the computational complexity and has several alternatives; in the context of the Activation-Interval approach, e.g., the number of nodes or the number of timing intervals associated to a node are factors.

Factor  Description                                        Level-1              Level-2
A       number of nodes of the TSPG                        9                    16
B       number of timing intervals associated to a node    n*                   n + 7**
C       number of activation intervals associated to an    n                    n + 7
        activation arc
D       type of outgoing arcs (AND or OR)                  all nodes connected  all nodes connected
                                                           via AND branches     via OR branches

* n corresponds to the number of nodes
** to seven nodes, two timing intervals are associated

Table 2. Factors and levels used in the Activation-Interval study

x_A x_B x_C x_D   Exec. Time y (sec)     x_A x_B x_C x_D   Exec. Time y (sec)
1   1   1   1     0.00167                2   1   1   1     0.00336
1   1   1   2     0.00298                2   1   1   2     0.04425
1   1   2   1     0.06555                2   1   2   1     0.15692
1   1   2   2     0.00927                2   1   2   2     0.06635
1   2   1   1     0.05678                2   2   1   1     0.14552
1   2   1   2     0.02877                2   2   1   2     0.11723
1   2   2   1     7.11214                2   2   2   1     18.46458
1   2   2   2     0.13913                2   2   2   2     0.28005

Table 3. Observed execution times

To determine which factors have the largest influence on the execution time, a regression model was chosen. The execution time y can be regressed on x_A, x_B, x_C and x_D using a nonlinear regression model of the form

y = q_0 + q_A x_A + q_B x_B + q_C x_C + q_D x_D
    + q_AB x_A x_B + q_AC x_A x_C + q_AD x_A x_D + q_BC x_B x_C + q_BD x_B x_D + q_CD x_C x_D
    + q_ABC x_A x_B x_C + q_ABD x_A x_B x_D + q_BCD x_B x_C x_D
    + q_ABCD x_A x_B x_C x_D

Solving this regression model for the observed execution times results in the following equation:

y = 1.668 + 0.741 x_A + 1.625 x_B + 1.618 x_C - 1.582 x_D
    + 0.717 x_A x_B + 0.714 x_A x_C - 0.700 x_A x_D + 1.588 x_B x_C - 1.569 x_B x_D - 1.581 x_C x_D
    + 0.701 x_A x_B x_C - 0.701 x_A x_B x_D - 1.557 x_B x_C x_D
    - 0.696 x_A x_B x_C x_D
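The effect estimates q in such a 2^k design can be obtained with a simple sign-table computation: each effect is the average of the responses weighted by the product of the corresponding factor signs. A minimal sketch of this (our own illustration, using the conventional -1/+1 coding onto which the table's level indices 1 and 2 map directly):

```python
from itertools import product

def prod_signs(levels, mask):
    """Product of the signs of the factors selected by mask."""
    s = 1
    for lvl, m in zip(levels, mask):
        if m:
            s *= lvl
    return s

def effects_2k(y, k=4):
    """y: 2^k responses, ordered so that the last factor varies fastest
    (matching product([-1, 1], repeat=k)).
    Returns {'': q0, 'A': q_A, 'AB': q_AB, ...}."""
    names = [chr(ord('A') + i) for i in range(k)]
    runs = list(product([-1, 1], repeat=k))
    effects = {}
    for mask in product([0, 1], repeat=k):
        label = ''.join(n for n, m in zip(names, mask) if m)
        # Contrast: responses weighted by the product of the selected signs.
        total = sum(r * prod_signs(levels, mask) for levels, r in zip(runs, y))
        effects[label] = total / len(y)
    return effects
```

The division by 2^k is the normalization used in [14] for the effects that enter the variation decomposition below.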


Parameter             q_A    q_B    q_C    q_D    q_AB   q_AC   q_AD
Var. explained (%)     2.6   12.5   12.4   11.8    2.4    2.4    2.3

Parameter             q_BC   q_BD   q_CD   q_ABC  q_ABD  q_BCD  q_ABCD
Var. explained (%)    11.9   11.6   11.8    2.3    2.3   11.4    2.2

Table 4. Portion of variation explained

The result can be interpreted as follows: the mean execution time is 1.668 seconds, the effect of the number of nodes is 0.741 sec., the effect of the number of timing intervals per node is 1.625 sec., and so on. To measure the importance of a factor, or of a combination of factors, the portion of the total variation in the execution time that is explained by this factor (or factor combination) can be determined by computing the "Sum of Squares Total" (SST) as follows [14]:

SST = 2^4 (q_A^2 + q_B^2 + q_C^2 + q_D^2
      + q_AB^2 + q_AC^2 + q_AD^2 + q_BC^2 + q_BD^2 + q_CD^2
      + q_ABC^2 + q_ABD^2 + q_BCD^2 + q_ABCD^2)
    = 8.79 + 42.23 + 41.91 + 40.06
      + 8.23 + 8.15 + 7.85 + 40.33 + 39.40 + 39.98
      + 7.86 + 7.86 + 38.79 + 7.75 = 339.21

The portions of variation explained by these effects are listed in Table 4. The analysis of these results shows that the portions of variation explained by the factors B (number of timing intervals per node), C (number of activation intervals per arc) and D (AND or OR relation between the arcs) and their combinations all lie between 11.4% and 12.5%. The portions of variation explained by factor A (number of nodes) and all of its combinations with other factors lie only between 2.2% and 2.6%. Consequently, under the chosen experimental conditions, the factors B, C and D have approximately the same influence on the computational complexity, while the number of nodes appears to be of less importance.
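The variation decomposition can be reproduced directly from the reported effect estimates (a sketch; only the q values of the fitted model are used):

```python
q = {  # effect estimates from the fitted regression model (q_0 omitted)
    'A': 0.741, 'B': 1.625, 'C': 1.618, 'D': -1.582,
    'AB': 0.717, 'AC': 0.714, 'AD': -0.700,
    'BC': 1.588, 'BD': -1.569, 'CD': -1.581,
    'ABC': 0.701, 'ABD': -0.701, 'BCD': -1.557, 'ABCD': -0.696,
}

# Sum of Squares Total: 2^4 times the sum of the squared effects.
sst = 2**4 * sum(v * v for v in q.values())

# Percentage of variation explained by each factor (combination).
share = {name: 100 * 2**4 * v * v / sst for name, v in q.items()}
# e.g. share['B'] is close to the 12.5% reported in Table 4
```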

5 Validation by Simulation

In addition to the analysis of the computational complexity, the accuracy of the information obtained from the Activation-Interval approach is evaluated by comparing its results to results produced by simulation. In the simulation, we have to make an assumption on the distribution of the execution time of nodes and of the time at which a component activates another component. In the first series of experiments, a uniform distribution was chosen, where we expected a close match between the simulation results and the prediction of the interval approach, as the boundaries of the uniform distribution correspond directly to the interval boundaries. In the second series of simulations, a normal distribution was chosen. To simulate the execution behavior of a real application, the timing information for the nodes and arcs of the TSPG in Figure 1 is mapped to the parameters of the normal distribution by truncating its tails as follows: each interval [tv_{i,n}^-, tv_{i,n}^+] is taken to have width 4σ with μ at its center, i.e. the following equations hold:

μ = tv_{i,n}^- + (tv_{i,n}^+ - tv_{i,n}^-)/2,   σ = (tv_{i,n}^+ - tv_{i,n}^-)/4

              DOP     Speedup   DOP AI/simulation   Speedup AI/simulation
Simulation 1  2.378   1.3785    1.0105              1.0181
Simulation 2  2.443   1.4426    0.9836              0.9729
AI            2.403   1.4035

Table 5. Avg. deviation to Activation-Interval profiles
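The truncation rule above amounts to a two-line computation; a worked restatement (function name ours):

```python
def normal_params(low, high):
    """Map an interval [low, high] to (mu, sigma) of a normal distribution
    whose central 4-sigma range coincides with the interval."""
    mu = low + (high - low) / 2  # mu sits at the center of the interval
    sigma = (high - low) / 4     # interval width equals 4 * sigma
    return mu, sigma

# e.g. component 2's first execution-time interval [3.0, 3.1] from Figure 1
mu, sigma = normal_params(3.0, 3.1)
```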

The average degree of parallelism and the derived hypothetical speedup are used as first indicators for comparing the results obtained from simulation with those obtained from the analytical approach. In Table 5, Simulation 1 gives the results for the simulation series based on a uniform distribution, Simulation 2 gives the results for a simulation based on a normal distribution of execution and activation times, and AI gives the results of the Activation-Interval approach. The last two columns give the ratio of the AI results to the simulation results; a value greater than one indicates that the AI approach overestimates the DOP and the speedup as compared to the simulation. From these experiments we observe that the AI approach tends to overestimate the potential degree of parallelism if the execution and activation times follow a uniform distribution, while it underestimates the DOP in case of a normal distribution. In both cases, the results produced by the Activation-Interval approach deviate only slightly from the results of the simulation. This can also be seen by comparing the shapes of the parallelism profiles obtained by the Activation-Interval approach against those obtained from various simulation runs: Figures 4 and 5 show the parallelism and speedup profiles for ten arbitrarily chosen simulation runs (with uniform and normal distribution, respectively) compared to the AI profiles generated for the corresponding paths through the TSPG.

6 Conclusions and Future Work

We have presented a graph model for characterizing distributed applications, to be used in early stages of the program development life cycle. The model supports the specification of the application's structure at a high level of abstraction, in terms of components communicating with, or activating, each other during execution. Timing parameters representing the estimated execution times of components and the occurrence of activations/communications can be specified as intervals. A technique has been presented which allows a fast evaluation of these graph models with respect to the exploitable parallelism, given by parallelism profiles. As the quality of such an approach depends both on the accuracy of the results and on the complexity of providing them, a study of the factors influencing the computational complexity (using a 2^k factorial design) and a study of the accuracy of the estimates (validation against simulation results) have been presented. The studies have shown that the proposed approach performs well with respect to both criteria, but they also revealed potential for further improvements. Two possible directions can be identified:

- "Similar" timing intervals could be aggregated. When generating starting or terminating intervals, some intervals are rather close to each other or even overlap, and others may have only a very small probability of occurrence. In these cases, it might be helpful to combine similar intervals into a single interval, or to omit intervals whose probability is very small. Ideally, the analyst should be able to specify the degree up to which aggregation is allowed.
- "Similar" profiles could be aggregated. When generating parallelism profiles, many profiles have a similar shape. By defining some kind of profile pattern, similar profiles may be combined into one profile with a larger probability. It may also be helpful to omit profiles with a very small probability.

To study the trade-off between the gain in performance and the loss in accuracy of these modifications is subject to future research.
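The first direction (aggregating similar timing intervals) could, for instance, be realized as follows. This is our own sketch of one possible policy, not the paper's proposal; the merge tolerance `gap` and the probability cutoff `p_min` play the role of the analyst-specified degree of aggregation:

```python
def aggregate_intervals(cases, gap=0.0, p_min=0.0):
    """cases: (probability, low, high) triples. Merge intervals whose distance
    is at most `gap`, drop cases with probability below `p_min`, renormalize."""
    cases = [c for c in cases if c[0] >= p_min]  # drop unlikely cases
    cases.sort(key=lambda c: c[1])               # sort by lower bound
    merged = []
    for p, lo, hi in cases:
        if merged and lo - merged[-1][2] <= gap:
            # Close to (or overlapping) the previous interval: combine them.
            mp, mlo, mhi = merged[-1]
            merged[-1] = (mp + p, mlo, max(mhi, hi))
        else:
            merged.append((p, lo, hi))
    total = sum(p for p, _, _ in merged)
    return [(p / total, lo, hi) for p, lo, hi in merged]
```

Applied to, e.g., [(0.5, 0.0, 1.0), (0.3, 1.2, 2.0), (0.2, 5.0, 6.0)] with gap=0.5, the first two cases collapse into a single interval [0.0, 2.0] carrying probability 0.8, at the cost of a wider (less accurate) bound.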

Fig. 4. Parallelism Profiles - Uniform Distribution vs. AI Approach


Fig. 5. Parallelism Profiles - Normal Distribution vs. AI Approach


References

[1] Connie U. Smith, Performance Engineering of Software Systems, Addison Wesley, 1990.
[2] E. Luque, R. Suppi, and J. Sorribes, "Designing parallel systems: a performance prediction problem", Information and Software Technology, vol. 34, no. 12, pp. 813-823, Dec. 1992.
[3] H. Wabnig, G. Kotsis, and G. Haring, "Performance prediction of parallel programs", in Messung, Modellierung und Bewertung von Rechen- und Kommunikationssystemen, B. Walke and O. Spaniol, Eds., pp. 64-76, Springer Verlag, 1993.
[4] A. Ferscha, "A petri net approach for performance oriented parallel program design", Journal of Parallel and Distributed Computing, vol. 15, pp. 188-206, August 1992.
[5] U. Herzog, "Performance evaluation as an integral part of system design", in Proceedings of the Transputers'92 Conference, M. Becker et al., Eds., IOS Press, 1992.
[6] A. Mitschele-Thiel, "Automatic configuration and optimization of parallel transputer applications", in Proc. World Transputer Congress, Aachen, Germany, IOS Press, 1993.
[7] P. Kacsuk, J. Cunha, G. Dozsa, J. Lourenco, T. Fadgyas, and T. Antao, "A graphical development and debugging environment for parallel programs", Parallel Computing, vol. 22, no. 13, pp. 1747-1770, 1997.
[8] Maria Calzarossa, Guenter Haring, Gabriele Kotsis, Alessandro Merlo, and Daniele Tessera, "A hierarchical approach to workload characterization for parallel systems", in High Performance Computing and Networking, LNCS vol. 919, B. Hertzberger and G. Serazzi, Eds., pp. 102-109, Springer, 1995.
[9] Markus Braun, Guenter Haring, and Gabriele Kotsis, "Deriving parallelism profiles from structural parallelism graphs", in Proceedings of TDP'96, pp. 455-468, 1996.
[10] M. Braun and G. Kotsis, "Interval based workload characterization for distributed systems", in Proceedings of the Int. Conf. on Modelling Techniques and Tools, Lecture Notes in Computer Science, Springer, 1997.
[11] T. Fahringer, Automatic Performance Prediction of Parallel Programs, Kluwer Academic Publishers, Boston, USA, 1996, ISBN 0-7923-9708-8.
[12] T. Fahringer, "Compile-time estimation of communication costs for data parallel programs", Journal of Parallel and Distributed Computing, vol. 39, no. 1, pp. 46-65, November 1996.
[13] K. C. Sevcik, "Characterization of parallelism in applications and their use in scheduling", Performance Evaluation Review, Special Issue, 1989 ACM SIGMETRICS, vol. 17, no. 1, pp. 171-180, May 1989.
[14] Raj Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling, John Wiley and Sons, Inc., 1991.
