PACT
Systematic Approach for Workload Characterization of Parallel Programs Deliverable D3H-1
Responsible Partner: Institut für Angewandte Informatik und Informationssysteme, Univ. of Vienna
Contributing Partners:
Authors: A. Ferscha, J. Johnson, G. Kotsis
Version: 1.1
Date: March 27, 1994
Status: release
Confidentiality: restricted
Copyright notice:
© 1994 by the PACT Consortium
All rights reserved. No part of this document may be photocopied, reproduced or translated in any way, without prior written consent of the author(s). The Austrian members of the PACT Consortium are University of Vienna - Institute for Software Technology and Parallel Systems, Johannes Kepler University Linz - Research Institute for Symbolic Computation (RISC), Johannes Kepler University Linz - Department of Computer Science, University of Salzburg - Research Institute for Softwaretechnology and University of Vienna - Department of Computer Science.
Contents
1 Motivation 2
2 Classical Workload Characterization of Parallel Programs 4
  2.1 Parameter Based Models 4
    2.1.1 Single Value Parameters 4
    2.1.2 Signatures, Profiles and Shapes 7
  2.2 Behavior Graph-Based Models 9
    2.2.1 Undirected Graph Models 9
    2.2.2 Directed Graph Models 10
    2.2.3 Petri Nets 16
    2.2.4 PERT Networks 17
    2.2.5 Queueing Network Models 17
3 Program Code based WL Models 20
  3.1 Performance Oriented Parallel Program Development 20
  3.2 The Program Skeleton Specification Language 21
    3.2.1 Tasks 21
    3.2.2 Processes 22
    3.2.3 Packets 22
    3.2.4 Interprocessor Communication 22
    3.2.5 Source Texts 23
    3.2.6 Task Structure Specification (TSS) 23
    3.2.7 Task Requirements Specification (TRS) 24
    3.2.8 Task Behavior Specification (TBS) 24
    3.2.9 Packet Requirements Specification (PRS) 25
    3.2.10 Packet Behavior Specification (PBS) 25
    3.2.11 System Requirements File 26
    3.2.12 Example: Householder Reduction 26
  3.3 Scheduling and Simulation of Virtual Processor Programs 33
4 Conclusions 34

D3H-1/Rel 1.1/March 27 1994
1 Motivation

Since the use of massively parallel processing hardware in science and engineering is primarily motivated by the desire to increase computational performance, it is obvious that a performance orientation must be the driving force in the development process of parallel programs. Until now, only very few efforts have been devoted to the integration of performance engineering activities into the early design phases of parallel application software; computerized tools suitable to support parallel programming from a performance point of view are rare today. Performance evaluation and engineering [Smit 90a] in the context of parallel processing has progressed over the past quarter of a century along different lines. Historically first, works on the so-called mapping problem appeared [Ston 77a, Bokh 81], which addressed assignment and scheduling issues in distributed memory MIMD systems. A large body of literature is available on the subject (see e.g. [Norm 93] for a classification of the particular problems studied); however, most of the models have become obsolete due to the technological development of multiprocessor hardware, now in its third generation. An interesting observation within this branch of research is that it attempted to bridge the gap between the application programmer and the target architecture in the following way: parallel program code produced by the programmer is translated or abstracted into a higher-level behavior description, usually represented by a directed, mostly acyclic and sometimes stochastic program graph (DAG [Gele 89]) integrating an execution precedence relation, a computation cost model, and/or a communication cost model and/or a contention model. On the other hand, an abstracted hardware model, generally also a graph model reflecting topological aspects of the underlying hardware, serves as input to a procedure trying to find a (semi-)optimal schedule or mapping of (near-)minimum completion time.
This program-architecture gap, however, is somewhat artificial: application programmers do not ignore topological and structural properties of the target architecture being programmed. On the contrary, they seek to efficiently utilize hardware characteristics, thus preventing the mapping problem from arising at all. It is generally seen that contributions to this artificial problem are more or less obsolete, and mapping research has significantly declined today. The classical branch of computer systems performance evaluation has also found fields of research and application for its methods in parallel processing [Akyi 92]. Model based performance analysis has produced an overwhelming mass of contributions in the various paradigms: queueing network models [Heid 82, Heid 83, Funk 91, Munt 90], Petri net modeling [Ajmo 84] and Markovian/stochastic performance models [Ciar 90, Smit 90b], to give some pointers. Modeling has been applied to describe interdependencies of hardware resources [Mahg 92a], to characterize the behavior of parallel program components and to model the workload for parallel systems [Calz 93]; approaches integrating parallel program, multiprocessor hardware and mapping factors for performance evaluation have also been reported [Fers 90b, Mak 90]. Most of the results produced in this branch, however, are not related to the development process of parallel programs, but rather stand as contributions to the "art of modeling" in its own right. First approaches to relate performance engineering to the development cycle were attempted by performance measurement and monitoring [Malo 92] of parallel systems, the visualization [Hari 93] of concurrent behaviors and the performance tuning [Ande 90] of codes. All
these activities are launched at the very end of the development cycle, such that early performance mistakes, although detected, studied and fine-tuned, cannot be corrected without major revisions and re-developments. At the point where actual operational codes are available, the most critical performance decisions are usually already settled and not reversible without significant effort. This misconception is addressed by more recent work which tries to move performance engineering up to the front of the development cycle. Several approaches for performance prediction based on models have been reported recently [Mak 90, Sree 92, Gemu 93a], some of which could be successfully applied in restricted application areas [Wabn 94]. The drawback of "single parameter set" model evaluation strategies is removed by very recent ideas to automate parameter variation and scalability analysis based on models [Gram 93, Malo 94]. Applications in practice, however, fail to achieve the performance predicted by analytical models. The main reasons for their limited practical relevance are the complexity of implementation details with significant performance effects, the inaccuracy of hardware models, and the unforeseen dynamics of interacting parts and layers of the application at run time. Performance prediction accurate enough to rank different implementation alternatives of an application based on program or workload (WL) models is just not feasible in general, so that the only practical evaluation consists in an evaluation of the actual implementation. In this work we propose performance/behavior prediction based on real (skeletal) codes (rather than on models) in the early development stages.
Our ambition is twofold: First, treat performance/behavior prediction as a performance engineering activity in the sense that it helps create an efficient application (in our approach a parallel program is "engineered", rather than a program (workload) model as in the "model engineering" approaches mentioned above). Second, discover flaws that degrade performance in a parallel program as early as possible, i.e. based on a skeletal representation covering just those program structures that are most responsible for performance. Program skeletons can be provided very quickly by the application programmer and incrementally refined to the full implementation under simultaneous performance supervision. We call this implementation strategy performance oriented parallel program development [Fers 92], and demonstrate its conception and potential practical operation within the proposed tool environment. The report is organized in two main parts. In the first part we give a comprehensive overview of classical workload characterization with the focus on parallel workloads. We judge the suitability and drawbacks of classical WL models in the light of our goal: early performance prediction. From Section 2 we conclude that a systematic approach for workload characterization of parallel programs must be related directly to the development of program code. This motivates us to establish WL models based on skeletal parallel programs, a novel WL modeling idea extensively presented in Section 3.
2 Classical Workload Characterization of Parallel Programs
2.1 Parameter Based Models
2.1.1 Single Value Parameters

2.1.1.1 Sequential and Parallel Fraction In algorithms developed for parallel architectures, not all parts of the problem may be decomposable; some parts of serial computation will remain in the code. The ratio of serial to parallel work in a program is the first characteristic we want to investigate. The parallel and serial amounts of work may be characterized by the time the system spends in each phase. This approach was used by Amdahl [Amda 67] when deriving his famous bound on speedup. For algebraic simplicity it may be assumed that the serial and parallel times sum up to one, giving the serial and parallel fractions of time. Amdahl's Law states that speedup is bounded by 1/(s + p/N), where s and p denote the serial and parallel fraction, respectively, and N denotes the number of processors. Gustafson [Gust 88] has shown that Amdahl's Law, giving an upper bound for speedup, is unsuitable for the concept of massive parallelism. Amdahl's Law is based on the assumption that the parallel fraction is independent of the problem size (fixed-size speedup). An alternative model, called scaled-size speedup, is based on the assumption that the problem size scales with the number of processors. Under this assumption, speedup is bounded by s + p·N. Several amendments have been made to this law, considering not only the parallel and serial fraction but also other aspects of the program. In [Sun 91] new metrics for speedup are presented, relating the parallel and the sequential work (sizeup) and the parallel and sequential speed (generalized speedup) instead of the time fractions. Gelenbe [Gele 89] proposes to consider the probability that a parallel program cannot effectively use all processors. A distribution function is given for the probability π_i that a program uses only i processors. In addition, an imbalance factor β and a communication time c(i) are introduced. Using these additional parameters, the following bound on speedup is obtained:
S ≤ N / ((1 − π)(1 + β) + N·c(N) + log₂ N)

where π = Σ_{i=1}^{N−1} π_i denotes the probability that not all processors are used.

In [Helm 90] an effort is made to explain superunitary speedup. When estimating speedup based on the serial and parallel amount of work and the cost to perform one unit of sequential work, Amdahl's Law is obtained as a bound on speedup. A more detailed characterization of the workload may help in explaining superunitary speedup. In some applications superunitary speedup was detected because the cost per operation increases with the number of processes on a processor for some operations (e.g. resource management). To detect and show such behavior, it is necessary to distinguish not only between the serial and parallel fraction of work, but also the fraction of work spent in those particular operations. Several other examples are given by the authors (including an analysis of cache size).
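To make the contrast between the fixed-size and scaled-size views concrete, the two bounds can be sketched as follows. This is a minimal illustration of our own; the serial fraction s = 0.05 and the function names are hypothetical choices, not taken from the cited works.

```python
def amdahl_speedup(s: float, n: int) -> float:
    """Fixed-size speedup bound: 1 / (s + p/N), with p = 1 - s."""
    p = 1.0 - s
    return 1.0 / (s + p / n)

def gustafson_speedup(s: float, n: int) -> float:
    """Scaled-size speedup bound: s + p*N, with p = 1 - s."""
    p = 1.0 - s
    return s + p * n
```

With s = 0.05, for example, the fixed-size bound stays below 1/s = 20 for any N, while the scaled-size bound grows almost linearly: at N = 1024 it reaches 0.05 + 0.95·1024 = 972.85.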
Annaratone [Anna 92] distinguishes not only between parallel and sequential parts of the code, but also between the number of cycles spent in memory stalling and in I/O operations to mass storage. By introducing six parameters (fraction of serial execution time, locality of code in the parallel section, fraction of I/O in serial code, fraction of non-local memory references, remote access penalty, and average number of hops that a message has to take) the author derives a more accurate bound on speedup, which is used in a case study comparing two hypothetical parallel machines, one with a small number of fast PEs and the other with a large number of slow PEs. In [Chri 91] theoretical bounds on speedup are derived, considering the overhead due to communication, where architectural characteristics (interconnection topology) are taken into account. With all these approaches, only a rough upper bound on speedup can be given. For most of the objectives in performance analysis (predicting a program's runtime, explaining performance degradations, determining scheduling and mapping), a more detailed description of the load is required. The approaches of Annaratone [Anna 92] and Christianson [Chri 91] attempt to consider additional, resource oriented parameters to obtain more accurate results on speedup bounds. In the next paragraph we will investigate other resource oriented parameter sets, whose purpose is not to establish bounds on speedup, but to provide input parameters for performance models.
2.1.1.2 Resource Oriented Parameters Resource oriented parameters are usually obtained from profiling or measuring the execution of the program itself, or from benchmarking. Several statistical techniques are then applied to the measured data to get a more concise representation of the workload. Benchmarking techniques are used for example in [Peir 91] to derive the workload parameters for a clustered multicomputer system, i.e. a system where several tightly-coupled machines are combined into a large loosely-coupled architecture. Dynamic parameters are obtained from an analysis of the program execution on a real system in [Mass 93]. Based on the measurement of the execution time, speedup, efficiency and efficacy are derived and used for describing the workload. Based on these parameters, a queueing network model was constructed representing a parallel architecture. Performance indices are obtained from the solution of this model. In [Gohb 91] the implementation of a parallel Levinson-type algorithm for Toeplitz matrices on a shared-memory architecture is analyzed. The authors derive a simple processing timing model, dependent on the number of processors and the problem size, the time factor for computing within one iteration, and the overhead time factor due to communication per iteration (obtained from measurements using a least squares fit). The model was used to determine the optimum number of processors. Unfortunately, the optimality criterion is not given by the authors. Cypher et al. [Cyph 93] investigate the behavior of several parallel scientific applications by quantifying the following parameters: memory requirements, I/O requirements, computation versus communication ratio, and characteristics of message traffic. The authors develop an analytical model for studying the effects of increasing problem size and increasing number of
processors on these parameters for a set of scientific applications. In [Nico 89] the workload model is derived from an analysis of the loop structure of the parallel program. The critical loops (i.e. those parts where most of the execution time will be spent) are to be identified. For those loops, an analytic model for predicting the execution time has to be constructed. The parameters are obtained from measuring critical sections of the parallel program. The objective of the study is to find remapping strategies balancing the workload among the processors. A different approach is pursued by Fahringer [Fahr 93], where characteristics are collected by profiling a sequential run of the program. The obtained parameters (frequency information, loop counts and true ratios) serve as the input to a performance prediction tool, where parameters characterizing the behavior of the parallel program (work distribution, number and amount of data transfers, transfer times, network contention, cache misses) are computed.
2.1.1.3 Structure Oriented Parameters In contrast to resource oriented parameters,
structure oriented parameters characterize the program independently of the architecture. They can be derived either from the program code itself (an example might be the number of nested loops) or from a description of the program in terms of behavior oriented models (e.g. task graphs). For the purpose of performance prediction, a purely structural description might be inappropriate, and it is therefore usually combined with a description in terms of resource oriented parameters. In [Cand 93] an effort was made to identify and qualify those parameters which most influence the performance of a parallel application. Experiments were made to investigate the influence of the following parameters: the number of nodes in the task graph and the average node degree as structure oriented parameters, and the computation time of a task (assumed to be normally distributed with known mean and standard deviation) and the message length (again assumed to be normally distributed with known mean and standard deviation) as resource oriented parameters. In [Cand 92] these parameters are used in a comparison of different mapping strategies. A similar study is performed in [Mass 93] for the purpose of predicting system behavior in terms of processor utilization and average number of busy processors. Static parameters are derived from the structural analysis of the precedence graph. Those parameters include the number of nodes and the node degree.
2.1.1.4 Application Oriented Parameters The performance of an application may
not only depend on the program structure or architectural characteristics, but also on factors inherent to the problem. We will call this set of parameters application oriented. The problem size is the most frequently used application oriented parameter; other examples are convergence rates in simulated annealing or rollback probability in distributed simulation. In [Azmy 92] a parallel algorithm for the neutron transport problem was implemented on an iPSC/2 hypercube. Performance results obtained from measurements were used to construct a performance model for execution time. The load parameters in these models are the problem size (described by three application specific parameters) and the number of
processors. This model was used to predict the scaling behavior of the application.

Figure 1: Execution Signatures of a Hypothetical Application
[Figure: the left panel plots T(p), Tcomp(p) and Tcomm(p) against the number of PEs; the right panel plots speedup, efficacy and efficiency against the number of PEs, marking the processor working set (pws).]

In [Jako 93] a detailed analysis of a two-dimensional cellular automata simulation was performed to derive application-specific expressions for the computation and communication time for each phase of the algorithm. Based on these expressions, speedup and the scaling behavior were modeled and compared to measurements on a transputer based system and on a SUPRENUM cluster. In [Sing 93] several application-specific parameters are used to model the scaling behavior of an application. The influence of the parameters on various scaling models (time constraint, memory constraint) is investigated.
2.1.2 Signatures, Profiles and Shapes

A signature is a (graphical) representation of a certain program characteristic as a function of p, the number of processors. Typically, these characteristics include the execution time, the communication and computation time, and the derived measures speedup, efficiency and efficacy. Signatures are usually obtained from the execution or simulation of the algorithm on a parallel machine using p processing elements and are therefore system dependent metrics [Calz 93].

Execution Signature The execution signature T(p) of a parallel program is defined as

T(p) = Tcomp(p) + Tcomm(p)

where Tcomp(p) is the computation signature, representing the fraction of the execution time where at least one processor was performing computations, and Tcomm(p), the communication signature, represents communication and waiting times (caused by synchronization delays).
Speedup The speedup S(p) of a parallel program is defined as

S(p) = T(1) / T(p)

the ratio of the execution time of a sequential program (or the run on a single processor) to the execution time on p processors.

Efficiency The efficiency E(p) of a parallel program is defined as

E(p) = S(p) / p

the ratio of speedup to the number of processors, and is used as an indicator of average processor utilization. If we assume that speedup is bounded by p, then efficiency values lie in the range (0, 1].

Efficacy The efficacy η(p) of a parallel program is defined as

η(p) = S(p) · E(p)

the product of speedup and efficiency. It relates the benefits of using more processing elements (increase in speed) to the costs (decrease in efficiency). The maximum of this function is called the processor working set pws (see Figure 1).

In contrast to signatures, where the metrics are given as a function of the number of PEs, profiles depict these metrics as a function of time (showing the progress of execution). The information of a profile can also be represented in a cumulative way, called a shape (a function of x, where x is the percentage of time during which the metric took a particular value) [Park 92, Dowd 93].
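As a sketch of how these derived measures relate, the definitions above can be evaluated on a made-up execution signature; the timing values in T below are purely hypothetical:

```python
def speedup(T, p):
    """S(p) = T(1) / T(p)."""
    return T[1] / T[p]

def efficiency(T, p):
    """E(p) = S(p) / p."""
    return speedup(T, p) / p

def efficacy(T, p):
    """eta(p) = S(p) * E(p)."""
    return speedup(T, p) * efficiency(T, p)

# hypothetical measured execution times, indexed by the number of PEs
T = {1: 100.0, 2: 52.0, 4: 30.0, 8: 22.0, 16: 21.0}

# processor working set: the p at which the efficacy curve peaks
pws = max(T, key=lambda p: efficacy(T, p))
```

For this made-up signature the efficacy maximum, and hence the processor working set, falls at p = 4: beyond that, the speedup gain no longer outweighs the loss in efficiency.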
Parallelism Profile The parallelism profile P(t) of a parallel program is defined as the number of processors active (computing or communicating) at time t. This number can either be the actual number used during the execution, or the maximum number of processing elements given by the degree of parallelism of the application.

Computation Profile The computation profile Pcomp(t) of a parallel program is defined as the number of processors performing computations at time t.

Communication Profile The communication profile Pcomm(t) of a parallel program is defined as the number of processors communicating at time t.

Size of Activity Set The size of the activity set n(t) is defined as the number of tasks executed in parallel at time t.

In [Klei 92] the workload is characterized by the number of processors P′ desired for execution during a certain interval of time (the parallelism profile of the application). The program is assumed to be divisible, i.e. the time for executing w units of work is given by
max(w/P, w/P′), where P denotes the number of processors actually used in the execution. w is assumed to be a random variable (with given mean and coefficient of variation), but the time for executing one unit of work is assumed to be deterministic (exactly one unit of time). In this model, neither precedence restrictions nor communication or I/O overheads are considered. Based on this workload model, speedup and power (defined as the ratio of throughput to response time) are derived using queueing theoretic analysis. Sevcik [Sevc 89] has characterized the workload of a parallel system for the purpose of scheduling. It was shown that, on the one hand, single parameter characterizations such as the ratio of parallel and sequential code are insufficient for this purpose. On the other hand, the most detailed information, described in a full data dependency graph, is not tractable. Therefore a set of parameters is proposed, small enough to be handled efficiently but still representing the basic characteristics of the workload necessary for deriving a scheduling strategy. From a parallelism profile, the following parameters are extracted: sequential and parallel fraction, average, maximum and minimum parallelism, and variance in parallelism. The behavior of three static scheduling policies is investigated under varying parameter mixes. It turned out that scheduling strategies which take these parameters into account behave comparatively well. In [Ghos 92] the execution signature and communication and computation profiles are obtained from the execution of selected example programs running on a Connection Machine and on a Transputer based system. It was shown that even more concise derived characteristics (average parallelism, maximum parallelism, processor working set) may be used for the purpose of scheduling. Similar studies have been performed by [Dowd 93] and [Park 92].
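A minimal sketch of this style of parameter extraction from a parallelism profile; the profile values are invented for illustration, and we take the sequential fraction to be the fraction of time steps with parallelism 1 (an assumption of ours, not a definition from the cited work):

```python
# hypothetical parallelism profile P(t), sampled at unit time steps
profile = [1, 1, 4, 8, 8, 8, 4, 2, 1, 1]

n = len(profile)
avg_par = sum(profile) / n                            # average parallelism
max_par = max(profile)                                # maximum parallelism
min_par = min(profile)                                # minimum parallelism
var_par = sum((x - avg_par) ** 2 for x in profile) / n  # variance in parallelism
seq_frac = sum(1 for x in profile if x == 1) / n      # sequential fraction
par_frac = 1.0 - seq_frac                             # parallel fraction
```

Such condensed parameters keep a scheduling model tractable while still reflecting the shape of the profile.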
Those measures are typically used to characterize the workload whenever a decision has to be made on the number of processing elements to be assigned to a problem (e.g. mapping or scheduling), but they are not suitable for performance prediction during program design, where the execution signature should be the output of the study, and not the input.
2.2 Behavior Graph-Based Models
In contrast to parameter oriented models, behavior graph models specify the dependence and communication structure of the workload explicitly. Frequently used representations are directed or undirected graphs, where the nodes represent tasks, processes or events in the program, and directed arcs or undirected edges represent relations among the nodes. Relations are typically precedence orders if the arcs are directed, or indicate the need for communication if the edges are undirected. Two other graph models suitable for expressing parallelism, Petri nets and PERT networks, will also be discussed in this section. Finally, the suitability of queueing network models is analyzed.
2.2.1 Undirected Graph Models
A static process graph [Ston 77b] G = (V, E, f, e) consists of a set V of nodes representing the processes in a parallel or distributed application, a set E of undirected arcs representing
two-way communication, a function f associating computation costs with the nodes, and a function e associating communication costs with the arcs. This model is suitable for representing systems with coarse granularity, where the processes are persistent during main parts of the execution time and have stable communication patterns [Chu 84]. Performance prediction based on this model is restricted, because the model contains no information on how frequently communication may occur.
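As an illustration of how such a model supports assignment decisions, the cost of mapping processes to processors can be evaluated by charging communication only for arcs whose endpoints sit on different processors. This is a simplified sketch of our own in the spirit of the mapping work cited above, with hypothetical costs; it is not the cost function of any particular cited paper.

```python
# static process graph G = (V, E, f, e) with hypothetical costs
f = {"A": 4.0, "B": 6.0, "C": 3.0}        # computation cost per process
e = {("A", "B"): 2.0, ("B", "C"): 5.0}    # communication cost per arc

def assignment_cost(assign):
    """Total cost of a process-to-processor assignment: all computation
    costs, plus communication costs of arcs cut by the assignment."""
    comp = sum(f.values())
    comm = sum(c for (u, v), c in e.items() if assign[u] != assign[v])
    return comp + comm
```

Keeping the heavily communicating pair B and C on the same processor yields cost 13 + 2 = 15, whereas separating them costs 13 + 5 = 18.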
2.2.2 Directed Graph Models

Representing a parallel program is most naturally done by drawing the system as a directed graph. This graph may represent dependencies among tasks (task graph), communication among processes, temporal relations between events, the control flow, or the data flow of the program. Depending on possible restrictions on the graph structure (series-parallel graphs, fork-join structures, meshes, ...) and the semantics of incoming and outgoing links (exclusive or inclusive), further distinctions are possible. The models also differ in how timing information is associated with the graph. Timing information associated with the nodes usually represents execution time or computational demands; timing associated with the links represents communication time or message transfer demands. Whenever time is associated, hardware characteristics are already included, and the workload model is therefore hardware dependent. If only the corresponding demands are specified (number of instructions, data volume), the workload model is architecture independent. This has the advantage of flexibility, i.e. the same model can be used for the evaluation of different architectures, but will usually result in increased complexity when solving the performance model, where the hardware aspects then have to be considered. Finally, we have to distinguish probabilistic and deterministic models, i.e. models where the associated parameters may or may not be random variables.
2.2.2.1 Communication Graphs A communication graph CG = (V, E) is a directed graph, where V is a set of nodes representing tasks or processes in a parallel program, and E is a set of directed arcs. An arc going from node i to node j represents communication from node i to node j. In [Iane 94] the communication aspect of the workload is modeled for a class of data parallel algorithms, characterized by alternating phases of (individual, independent) computations and data exchange (synchronization). An algorithm is modeled as a directed graph, where the nodes represent tasks and the arcs represent communication. Weights associated with the arcs represent the data transfer volume. Together with information on the architecture topology and the mapping, a hardware communication graph is derived. In this graph, only those processors and links are shown which are involved in the communication demands for a given process. The links are directed and weighted with the aggregation of all weights from the corresponding demands in the algorithm graph for this process. If the algorithm graph is symmetric, an expression for communication overhead can be derived. A resource-oriented approach, specifying the dependence structure and the demands for computation and communication on the corresponding resources (processors, communication
network), is used in [Gemu 93b]. The structure is described in a program description language. The parallel architecture is also modeled in this language. Performance prediction is based on the method of serialization analysis. Andre [Andr 87] also characterizes the program by its communication structure using a description language. The time for executing sequential parts of the program is obtained from actual execution of these parts (i.e. they have already been implemented); different processor speeds may be simulated. Communication times are determined from the amount of data and the transfer rate (both assumed to be known, constant values); the synchronization and communication delays are obtained by a simulator tool. Another simulation tool is presented in [Hart 90] for predicting the execution of a parallel algorithm on a transputer system based on a description of the task structure, the communication pattern, and expressions for the computational and communication demands. Computation demands are given as a timing function dependent on the problem size and the number of processors. Communication demands are specified in terms of the number of bytes to be transferred. The language used for describing the task structure is a subset of occam. In [Kape 92] a computation control graph (ccg) is proposed to model the program behavior. Nodes, representing the tasks, are connected via directed arcs. Multiple incoming arcs can be grouped either conjunctively (AND) or disjunctively (XOR); probability weights may be assigned to multiple disjunctive outgoing arcs. In contrast to task graphs, ccgs do not necessarily have to be acyclic. The execution times are assumed to be exponentially distributed with a given mean. Analytical performance evaluation techniques are based on a hierarchical aggregation of segments of the ccg.
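The derivation of a hardware communication graph from an algorithm graph and a mapping, as described above for the communication-graph approach of [Iane 94], can be sketched roughly as follows. The task names, byte counts and mapping are hypothetical, and the aggregation rule (summing only inter-processor transfers) is our reading of that scheme.

```python
from collections import defaultdict

algo_edges = {("t1", "t2"): 10, ("t2", "t3"): 4, ("t1", "t3"): 6}  # bytes
mapping = {"t1": "P0", "t2": "P1", "t3": "P1"}                     # task -> PE

# aggregate algorithm-graph weights onto directed processor links
hw_graph = defaultdict(int)
for (u, v), w in algo_edges.items():
    pu, pv = mapping[u], mapping[v]
    if pu != pv:                 # intra-processor transfers generate no traffic
        hw_graph[(pu, pv)] += w
```

Here the t1→t2 and t1→t3 transfers are aggregated onto the single link P0→P1 (weight 16), while the t2→t3 transfer stays local to P1 and disappears from the hardware graph.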
2.2.2.2 Task Graphs A task graph TG = (V, E) is a directed, acyclic graph, where V is a set of nodes representing the tasks in a parallel application, and E is a set of directed arcs representing precedence relations among tasks. Task graphs may be augmented by assigning weights to the nodes and arcs, representing computation and communication costs. Such a model is used in [Noh 92], where computation time functions are associated to the nodes and data transfer functions to the arcs. Both functions may depend on the number of processors, the problem size and other application dependent factors, but the values given are assumed to be deterministic. These parameters are then used for estimating the communication and computation signatures assuming an execution on a SIMD architecture. For estimating the communication signature, additional parameters characterizing the hardware are needed: an allocation factor, a contention factor and the time to transfer one data unit. These values were obtained from experimental measurements on real hardware. In order to deal with the complex representation of large task graphs, hierarchical structures have been proposed. A composite task graph [Lewi 91] or hierarchical task graph may contain regularly structured sub-task graphs as components, which may be aggregated to a single node at a higher level in the hierarchy. Model parameters are the set of tasks to be executed, the precedence relation between the tasks, a communication matrix showing the amount of data transfer between any two tasks, and a function characterizing the operations to execute a task. This function may for example be a complexity function dependent on the problem size. Computation and communication times are derived by combining hardware parameters (execution rate, communication speed) with the workload parameters (amount of data, number of operations). Expressions for deriving the execution time and the optimum number of processors for a given time complexity function and a certain subset of sub-task structures (divide and conquer, nearest neighbor mesh) are derived. A tool supporting the evaluation of different scheduling strategies based on this model is presented in [Lewi 93]. The assumptions of deterministic computation and communication demands may not always allow a representative characterization of the workload. Therefore probabilistic or stochastic models have been proposed, where computation and communication demands may be defined as random variables. In [Mena 92] the parallel program is represented as a task graph together with an assignment function allocating the tasks to processors. The computation time for a task k is given by a function Ak + Bk tshr,k, where Ak is a deterministic factor estimating the actual computation time, Bk is a deterministic factor estimating the number of shared memory accesses, and tshr,k is a nondeterministic function accounting for the average delay due to contention for a single shared memory access. Ak and Bk are application and architecture dependent and may be obtained from measurement experiments and benchmarking. tshr,k depends on the access rates or probabilities, which are obtained in an iterative procedure from the deterministic part of the execution time Ak, the number of access requests per task and the interconnection network cycle time. In [Vinc 91] an acyclic task graph is used to model a parallel algorithm. The task execution times are assumed to be independent and identically distributed; communication time is neglected.
The task graph is analyzed under the assumption of an unlimited number of available hardware resources, so there is no delay because of contention. The structure of the task graph is deterministic. Bounds on the execution time were derived for certain classes of structures. In [Mada 91] for each task the execution time is given by its mean and variance. Communication and contention costs are assumed to be either incorporated in the task execution time or are neglected. The analysis is restricted to a subset of task graphs with regular structures, where the tasks can be grouped in certain levels and there are only dependences among nodes in adjacent levels. Based on these assumptions it is possible to estimate speedup for several allocation strategies analytically. In [Hart 93, Hart 92, Sotz 90] several performance evaluation techniques are described based on a stochastic graph model of the workload. In this model, iterations and loops are represented by cyclic nodes or hierarchical loop nodes. The execution time for a task can be given by a distribution (deterministic, exponential, Erlang) or numerically. Since the exact solution of non-series-parallel graphs is very costly and even intractable for larger systems, several approximation techniques are proposed and implemented in a tool called PEPP [Daup 93]. By restricting the structure of the task graph to certain regular structures, more efficient solution techniques can be given. Frequently used structures are fork-join models, where a single task spawns a set of (sub)tasks performing independent computations in parallel. After a computation phase, the tasks are joined together at a synchronization point. In [Agwr 92] an upper bound on the speedup of such iterative synchronized algorithms is derived. The algorithm is modeled as a task, which consists of a number of atoms, which are parts of the execution that cannot be further decomposed and whose execution takes exactly one unit of time. An atom becomes active with a certain probability and has to be executed on the processor where it is allocated. Tasks (and atoms, resp.) are distributed statically among the processes. The total execution is composed of a given number of iterations over this task with synchronization points between two successive iterations, i.e. only after all atoms have finished computation can the next iteration start (fork-join structure). Although in a real application messages are usually exchanged at a synchronization point, the communication time is neglected in this model. The characterizing parameters are the number of iterations, the number of atoms, and the probability for an atom to become active. It is assumed that all atoms have the same probabilities. In [Maju 88] a set of jobs of a fork-join structure serves as a workload model in scheduling. Jobs arrive at a certain rate and will fork into a certain number of tasks to be executed in parallel. The number of tasks is a random variable with given mean and coefficient of variation. The cumulative computation demands (I/O and memory demands are neglected) for each job are also modeled as a random variable with given mean and coefficient of variation. A job is completed after completion of all its tasks. One particular instance of this model is investigated, where the job service demands are linearly correlated with the number of tasks, which might be a realistic assumption for classes of applications. The effects of variations in the workload parameters are investigated with respect to the performance reached when using several scheduling strategies. A queueing network model is used for the analysis. In [Sree 92] the workload of a multicomputer system is modeled as a collection of tasks in a divide and conquer structure. Tasks enter the multiprocessor system at a single root processor.
Upon arrival of a task, each processor either computes the task locally or splits it into subtasks. The resulting task structure is a tree. Nodes within one level are assumed to have identical split, join and execution times. Also the communication time between two successive levels is assumed to be equal for all nodes. Only the communication overhead is considered; it is assumed that the actual time to transfer the data may be overlapped with computation. Another structure frequently found in linear algebra problems are triangular task graphs. Such graphs have a synchronization structure similar to fork-join models, but the number of tasks spawned either decreases or increases during the iteration, resulting in a triangular shape. In [Mak 90] the performance of parallel programs is investigated whose structure can be represented as a series-parallel directed, acyclic graph. To each node representing a task, an exponentially distributed random variable is associated, characterizing the task's execution time. The parallel system is modeled as a queueing network. A methodology is proposed to combine both models to derive predictions for the execution time. In [Staf 93] the performance of parallel applications with a triangular task graph structure is investigated. The workload is specified by a DAG with probabilistic task execution times; communication costs are neglected. The performance is evaluated for two different scheduling strategies by the analysis of Markov chains.
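Under the unlimited-resource, no-contention assumption used in several of the deterministic task graph models above, the predicted execution time reduces to a longest-path computation over the weighted DAG. A minimal sketch (the task set, weights and dependences are invented for the example):

```python
def critical_path_length(tasks, deps):
    """tasks: {task: compute_time}; deps: {(pred, succ): comm_time}.
    Earliest finish times assuming unbounded processors and no
    contention, i.e. each task starts as soon as all predecessors
    (plus communication delays) are done."""
    preds = {t: [] for t in tasks}
    for (u, v), c in deps.items():
        preds[v].append((u, c))
    finish = {}
    def ft(v):                      # memoized earliest finish time
        if v not in finish:
            start = max((ft(u) + c for u, c in preds[v]), default=0.0)
            finish[v] = start + tasks[v]
        return finish[v]
    return max(ft(t) for t in tasks)

tasks = {"a": 2.0, "b": 3.0, "c": 1.0}
deps = {("a", "b"): 1.0, ("a", "c"): 0.5, ("b", "c"): 0.0}
print(critical_path_length(tasks, deps))   # 7.0
```

With stochastic task times this quantity becomes a random variable, which is exactly why the exact analysis of non-series-parallel stochastic graphs is costly.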
D3H-1/Rel 1.1/March 27 1994
2.2.2.3 Event Graphs An event graph EG = (V, E) is a directed, acyclic graph, where V is a set of nodes denoting events in (the execution of) a parallel program, and E is a set of directed arcs representing temporal relationships among events. Usually, all events associated to a certain process (or processor) are shown along a line, similar to a Gantt chart. In [Hick 92] a parallel program is characterized by the analysis of execution traces. Such a trace is a collection of time stamped events collected during the execution on a real architecture. An event marks the occurrence of a primitive action performed on a processor. Based on this event trace, the execution graph and the event graph are derived. The execution graph is a machine dependent representation of the program. Each node in the execution graph corresponds to the occurrence of an action associated to a process and a processor. Directed arcs connecting the nodes represent precedence relations. The event graph is a machine independent representation, which is obtained from the execution graph by removing all machine dependent information (i.e. processor numbers, durations between the occurrence of events). Now a node represents the occurrence of an action associated to a process. Finally, a timing function may be derived from the event graph, which is an expression relating the program execution time to the execution time of primitive actions. In [Yang 88] a program activity graph is created from the program's execution trace. In a program activity graph, each node corresponds to a certain event during the execution; connecting arcs indicate the time between the occurrence of the events. Usually, all events associated to a certain processor are shown on a line, so that communication can be easily detected as arcs connecting events on different lines. In [Lo 91] a tool supporting the mapping of parallel programs onto parallel architectures is presented. The program is modeled as a temporal communication graph (TCG), which incorporates the concepts of process-time graphs, static task graphs and directed acyclic graphs.
Each process consists of a sequence of atomic events (compute, send, receive), which are the nodes in the TCG. Directed edges represent the precedence relation between the events in a process and form a line. Communication is represented by a directed arc from the send event of one process to the receive event of another process. A TCG can be seen as the unrolling of a static task graph over time. Weights associated to nodes and arcs represent computation and communication costs, respectively. Regularities in the temporal behavior can be specified for computation and communication phases. Instead of showing a sequence of structurally identical phases, repetition statements can be given, resulting in a more concise representation than obtained when using DAGs.
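The construction of such an activity/event graph from a time stamped trace can be sketched as follows. The trace tuple format and the tag-based matching of sends to receives are assumptions made for this illustration, not the format of any of the cited tools.

```python
# Hypothetical trace format (an assumption): (timestamp, processor, kind,
# tag), where kind is "compute", "send" or "recv", and a tag matches a
# send event to its corresponding receive event.
def activity_graph(trace):
    """Arcs of a program activity graph: program-order arcs between
    consecutive events on each processor, plus send->recv arcs."""
    last_on, sends, arcs = {}, {}, []
    for ev in sorted(trace):            # assumes timestamps are consistent
        _, p, kind, tag = ev
        if p in last_on:
            arcs.append((last_on[p], ev))   # sequential arc on processor p
        last_on[p] = ev
        if kind == "send":
            sends[tag] = ev
        elif kind == "recv":
            arcs.append((sends[tag], ev))   # communication arc
    return arcs

trace = [(0, 0, "compute", None), (1, 0, "send", "m"),
         (1, 1, "compute", None), (2, 1, "recv", "m")]
print(len(activity_graph(trace)))   # 3 arcs: two sequential, one message
```

Edge weights (elapsed time between events) could be attached by subtracting the endpoint timestamps.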
2.2.2.4 Control Flow Graphs A computation structure model is used for characterizing parallelism in an application, which describes the time cost behavior of applications at the computation level. The model consists of a control graph and a data graph. The control graph depicts the control structure of the application. Node types include a decision node, a fork node, a join node, an operation node, and start and end nodes. To each node a time cost function is associated. In the data graph, the relations between operations and data are depicted. In [Qin 93] flow analysis and time cost analysis techniques are presented to derive the execution time for a program specified by its control graph. In [Abra 87] a flow chart is used to model the behavior of a process in a parallel algorithm.
The nodes correspond to phases of computation; directed arcs represent communication, but communication partners are not depicted. Communication may not occur in a conditional statement, and each process loops without termination. The parallel program consists of several processes and is described by its states, which are in turn defined by the states of each process (nodes in the process flow chart). The execution times for each node are assumed to be deterministic. For the purpose of performance modeling, the states of the flow charts are translated into a geometric concurrency model, which is solved to obtain the steady state trajectory (a sequence and timing of transitions in the flow charts, which is repeated forever).
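A time cost analysis over a control graph of the kind described above can be sketched by a small expected-time evaluator. The node kinds and costs are illustrative, and the fork rule (maximum of the branch expectations) is only a crude bound, since the expectation of a maximum exceeds the maximum of expectations for random durations.

```python
# Minimal expected-time evaluation in the spirit of control-graph time
# cost analysis. Node encoding and all costs are illustrative assumptions.
def expected_time(node):
    kind = node["kind"]
    if kind == "op":                        # operation node with fixed cost
        return node["cost"]
    if kind == "seq":                       # sequential composition
        return sum(expected_time(c) for c in node["body"])
    if kind == "decision":                  # probabilistic branch
        return sum(p * expected_time(c) for p, c in node["branches"])
    if kind == "fork":                      # parallel section joined at a
        return max(expected_time(c)         # synchronization point (crude
                   for c in node["body"])   # bound for deterministic costs)
    raise ValueError(kind)

prog = {"kind": "seq", "body": [
    {"kind": "op", "cost": 2.0},
    {"kind": "decision", "branches": [
        (0.5, {"kind": "op", "cost": 4.0}),
        (0.5, {"kind": "op", "cost": 0.0})]},
    {"kind": "fork", "body": [{"kind": "op", "cost": 3.0},
                              {"kind": "op", "cost": 1.0}]}]}
print(expected_time(prog))   # 2 + 2 + 3 = 7.0
```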
2.2.2.5 Dataflow Graphs In [Luqu 92] a tool is presented for modeling and analyzing the behavior of parallel algorithms. The behavioral dataflow graph, representing the parallel program in an architecture independent way, may either be specified directly or derived from the code of an existing program together with computation and communication demands. Together with a description of the architecture and a mapping, the execution of the program can be simulated and performance results are obtained. In [Ha 91] a dataflow graph representation of the program, which may contain data dependent iterations, is used as a workload model. The objective of the performance study is to derive optimum scheduling strategies subject to a minimization of the run-time cost. In [Hous 90] a parallel application is modeled by a stochastic dataflow graph, which is a weighted precedence and-or logic graph, where the node and link weights represent the computation and communication requirements. The weights are exponentially distributed random variables with given mean. Because of the existence of conditional branches and loops, it is not known in advance how often a module will actually be executed or how much data will be transferred. But the execution frequencies can be obtained from the analysis of the Markov chain described by the stochastic dataflow graph. After obtaining these frequencies, it is possible to derive a deterministic dataflow graph, which has a simpler structure. Now the weights of the nodes give the total processing requirements and the weights of the links, which are now undirected, give the total communication demands. Parameters derived from both the stochastic and the deterministic graph are used to model the workload of real-time systems for the purpose of developing optimum allocation strategies with respect to minimizing the total run time of the application. In [Gele 86] the structure of the task graph is not given deterministically, but follows certain stochastic rules. The workload characteristics (number of tasks, maximum and average parallelism) are derived. From all models presented in this section, stochastic or probabilistic task graphs seem to be the most suitable for performance prediction of parallel applications. What is missing in these models is the possibility to express conditional statements and loops (with the exception of the models presented in [Hart 92]), and efficient analysis techniques for arbitrary structures are lacking. In [Wabn 93] an approach is presented where the task graph is translated into a Petri net, which is simulated to obtain performance results. Other Petri net based models are presented in the next section.
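The derivation of expected module execution frequencies from the Markov chain of a stochastic flow graph, as described above, can be sketched with a small linear solve. The transition matrix and start module below are illustrative assumptions, and the Gaussian elimination is kept naive to stay dependency-free.

```python
# Expected execution frequencies of modules in a stochastic flow graph,
# from its absorbing Markov chain: solve v = e_start + v P.
def expected_visits(P, start):
    """P[i][j]: branching probability from transient module i to module j
    (rows may sum to < 1; the remainder is termination probability).
    Returns the expected number of executions of each module."""
    n = len(P)
    # v (I - P) = e_start, transposed to (I - P)^T v = e_start
    A = [[(1.0 if i == j else 0.0) - P[j][i] for j in range(n)]
         for i in range(n)]
    b = [1.0 if i == start else 0.0 for i in range(n)]
    for col in range(n):                       # naive Gaussian elimination
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(n):
            if r != col and A[r][col]:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * p for a, p in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return [b[i] / A[i][i] for i in range(n)]

# Module 0 loops back to itself with probability 0.5, then calls module 1.
P = [[0.5, 0.5], [0.0, 0.0]]
print(expected_visits(P, 0))   # [2.0, 1.0]
```

Multiplying these frequencies by the mean per-execution demands yields the node and link weights of the deterministic graph.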
2.2.3 Petri Nets
A stochastic Petri net SPN = (P, T, A, M0, R) consists of a set P of places, a set T of transitions, a set A of directed arcs connecting places with transitions and vice versa, an initial marking M0, i.e. a distribution of tokens over the set of places, and a set R of firing rates, one associated to every transition. This rate can either be finite, with an exponentially distributed firing time (for a timed transition), or infinite (for an immediate transition). A detailed definition of various time extensions to Petri nets can be found in [Fers 90a]. When modeling the parallel program in terms of (stochastic) Petri nets, this model serves on the one hand as the workload model, but may on the other hand already be the performance model. With respect to expressive power, Petri nets extended with a timing concept are the most powerful formalism of all models presented here. All important structural aspects of a parallel program (parallelism, synchronization, communication, conditional branches) can be expressed, and the association of a timing concept supports performance analysis. The major problem in the use of Petri nets is the exploding number of states, which prohibits the analysis of large systems using analytic techniques (Markov analysis). Although several attempts have been made at making these models analytically tractable (see for example [Gran 92]), simulation will in general be the only possibility to analyze these models. Within the Petri net formalism, both the parallel architecture as well as the parallel program can be described. In [Fers 91, Fers 92] a method is proposed combining both models in a unique net using PRM-nets [Fers 90c, Fers 90a]. The application as well as the architecture are modeled as generalized stochastic Petri nets; the mapping can also be expressed within the Petri net formalism. For each transition in the program net representing the execution of a program part, resource demands can be specified.
When assigning processors from the resource net to this transition, the actual firing time will be computed considering the architectural parameters associated to the node representing the processor in the resource graph. The same technique is applied for determining the communication times. Therefore the model allows a clear separation between the description of the workload and the description of the architecture, but still provides an easy to use formalism to combine both into a performance model. A detailed modeling example of a parallel algorithm is given in [Mahg 92b], where a colored stochastic Petri net is used for modeling the behavior of a parallel algorithm. The statements of the parallel program are translated into Petri net constructs. The execution of a statement is represented by the firing of a timed transition; the execution time is assumed to be exponentially distributed with a mean of one time unit. Therefore, all timed transitions will have the same exponentially distributed firing rate. Structural analysis of the net allows the proof of deadlock freeness. Performance aspects are studied by analyzing the underlying Markov chain. An example of modeling a message passing parallel architecture using Petri nets is [Wang 91]. The stochastic Petri net is used to model the interconnection topology (a two dimensional grid with link and bus connections) of the hardware. The workload (computation and communication demands) is assumed to be uniformly distributed among all processing elements. A shared memory architecture is modeled for example in [Ho 93], where a clustered shared-memory multiprocessor (MR-1) was modeled using extended deterministic and stochastic Petri nets. The workload is considered on the one hand by determining the probabilities for communication and memory access requests and on the other hand by determining the firing rates of the transitions representing the computation and the memory access times.
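When Markov analysis is intractable, such nets are evaluated by simulation, as noted above. A minimal discrete-event sketch with exponential race semantics (the two-transition fork-join net below is an illustrative example, not one of the cited models):

```python
# A minimal stochastic Petri net simulator: at each step, the enabled
# exponential transitions race, and the winner fires.
import random

def simulate_spn(marking, transitions, horizon, seed=0):
    """transitions: list of (inputs, outputs, rate), with inputs/outputs
    as dicts place -> token count. Returns the marking reached when no
    transition is enabled or the time horizon is exhausted."""
    rng = random.Random(seed)
    m = dict(marking)
    t = 0.0
    while t < horizon:
        enabled = [(i, o, r) for i, o, r in transitions
                   if all(m.get(p, 0) >= k for p, k in i.items())]
        if not enabled:
            break
        total = sum(r for _, _, r in enabled)
        t += rng.expovariate(total)        # time to the next firing
        x = rng.uniform(0, total)          # pick the winner of the race
        for i, o, r in enabled:
            x -= r
            if x <= 0:
                for p, k in i.items():
                    m[p] -= k              # consume input tokens
                for p, k in o.items():
                    m[p] = m.get(p, 0) + k # produce output tokens
                break
    return m

# fork: start -> a + b ; join: a + b -> done
net = [({"start": 1}, {"a": 1, "b": 1}, 1.0),
       ({"a": 1, "b": 1}, {"done": 1}, 2.0)]
print(simulate_spn({"start": 1}, net, horizon=100.0))
```

A real GSPN simulator would additionally handle immediate transitions, priorities and inhibitor arcs, and would collect statistics over the trajectory rather than just the final marking.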
2.2.4 PERT Networks The technique of PERT or GERT networks, frequently used in project planning, may also be applied in the domain of parallel processing. A PERT network G = (V, E, T) is a directed, acyclic graph, where V is a set of nodes (representing the tasks in a project) and E is a set of directed arcs, connecting two nodes if there exists some precedence relation. T is a set of timing functions associating a transition time (a random variable with given distribution) to each task. Applied to the model of a parallel program, the nodes represent the tasks in the program and the arcs represent the dependences between the tasks. The timing function corresponds to the task's execution time. In [Taqi 92] Petri nets and networks described in SLAM are compared for their suitability to model the behavior of concurrent systems. However, the analysis is restricted to structural aspects; no timing aspects are considered. An example of modeling a parallel wave equation algorithm using GERT networks described in SLAM can be found in [Sinz 93]. In [Cuba 91] a bilogic extension of PERT networks (called BIPERT) is proposed as a suitable model for parallel programs. In a BIPERT network, ingoing and outgoing links may either be inclusive (IN) or exclusive (EX), thus allowing the representation not only of parallelism but also of conditional branches and alternative parallelism, as well as of dead ends (if a node spawns several inclusive outgoing links, which are collected later on in an exclusive ingoing node). Graphs containing such dead constructs are prohibited and called illegal. A network where all nodes are of IN/IN or EX/EX type belongs to the class of series-parallel graphs, where exact analytic solution techniques are feasible. All other networks are to be solved using simulation or approximation techniques.
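For series-parallel structures, the completion time composes by addition along series sections and by the maximum across inclusive parallel branches; when distributions make the resulting integrals inconvenient, Monte Carlo sampling is a common evaluation route. A small sketch (the structure and the exponential durations are illustrative assumptions):

```python
# Monte Carlo evaluation of a series-parallel PERT-style structure:
# series sections add, inclusive parallel branches join at the maximum.
import random

def sample_duration(node, rng):
    kind = node[0]
    if kind == "task":                 # ("task", mean): exponential duration
        return rng.expovariate(1.0 / node[1])
    if kind == "series":               # ("series", [children])
        return sum(sample_duration(c, rng) for c in node[1])
    if kind == "parallel":             # ("parallel", [children]): join waits
        return max(sample_duration(c, rng) for c in node[1])
    raise ValueError(kind)

net = ("series", [("task", 1.0),
                  ("parallel", [("task", 2.0), ("task", 2.0)]),
                  ("task", 0.5)])
rng = random.Random(1)
n = 20000
est = sum(sample_duration(net, rng) for _ in range(n)) / n
print(round(est, 2))   # close to 1 + 3 + 0.5 = 4.5
```

The exact value here is 4.5, since the expected maximum of two independent exponentials with mean 2 is 3; exclusive (EX) branching would instead sample one branch according to its probability.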
2.2.5 Queueing Network Models A queueing network model usually models the computer architecture, where the resources are represented by servers in the network. The workload is described in terms of arrival rates of jobs and their service demands. In [Chia 91] mean value analysis is used to estimate the performance of a multiprocessor architecture. The results are compared to trace-driven simulation. In [Ghos 90] the average parallelism derived from a dataflow graph is used to represent the population in a closed queueing network modeling the Manchester dataflow computer. In [Akyi 92] a closed queueing network model is used for predicting the performance of a multiprocessor system, which consists of a fixed number of identical processors having access to global memory. The workload for this system consists of a fixed number of processes. Each process can either be ready (waiting to be scheduled on a processor), active (receiving service on a processor) or blocked (due to communication or synchronization). Blocking times
and probabilities are obtained in a communication submodel; response time, throughput, and processor utilization are obtained from a global queueing network model, where the transition probabilities depend on the blocking probabilities and the service demands are derived from the blocking times. In [Kush 93] a queueing network model is constructed for analyzing the performance of distributed systems. The system consists of a set of clients and a set of servers connected via a network. Clients request access to files from the servers. Model parameters are the number and think time of the clients (the time between two requests), the network transfer time, and the file access times at the servers. The performance of this system (utilization of servers, waiting times and system throughput) is analyzed with respect to its sensitivity to these parameters. In [Leuz 89] the performance of a parallel system is analyzed under a two-program workload using a closed queueing network model. Both programs may either be at the host computer, or receiving service at the multiprocessor, or one program is at the host while the other is at the multiprocessor system. These are the different states describing the system. The service rate is state dependent on the multiprocessor system and state independent on the host. An additional potential factor is introduced on the multiprocessor system, since a parallel program may not execute at its optimum potential due to contention for shared resources. The execution rate for a program when receiving service exclusively from the multiprocessor is derived from the execution signature of the application. The model is used to calculate system throughput. Validation experiments on an Intel iPSC/2 hypercube were performed. In [Vinc 88] the tasks arriving at a parallel system, which consists of an unlimited number of servers and a queue of infinite capacity, may only be processed subject to the precedence relations given by an acyclic task graph.
The processing times and the interarrival times are assumed to be independent and generally distributed. A general expression for the stability condition of such systems was derived. In [Nels 88] jobs, arriving at the system according to a Poisson stream, consist of several independent tasks that may be executed concurrently. The number of tasks is given by a random variable with known probability distribution; the execution times per task are independent and identically exponentially distributed with given mean. The jobs are to be executed on a parallel system consisting of c identical processing elements. The effects of using a distributed or a centralized queue for collecting the arriving jobs, and of job splitting versus no splitting, are compared. The performance values are obtained from solving the steady state equations of this system modeled as a continuous time, discrete state Markov process. A similar workload model is used in [Nels 90]. Since it may not always be possible to characterize the workload precisely, i.e. determine the distribution of the number of tasks in a job for a certain application, an alternative model is proposed, based on the fractions of sequential and parallel amounts of work and on the average number of parallel tasks. The relative error in the results when using this workload model compared to the results when using the more detailed model was comparatively small, thus motivating a more concise workload representation. In [Kats 92] the behavior of a single processor in a message passing architecture is analyzed using a queueing network model. The workload is characterized by an arrival rate of messages; the service time for each message is assumed to be proportional to the message
length. Since each message is assumed to leave the station immediately after receiving service and to join the next station, the interarrival times between messages are proportional to the message length. Therefore a single processor can be modeled as a single queue with interdependent arrival and service rates. Conditions for the existence of a stationary distribution are derived and solutions for a number of interarrival time distributions are developed. In [Lauw 93] the performance of a parallel architecture with point-to-point link connections is analyzed using queueing theory. The tasks of the program are modeled by the jobs in the queueing network; the processors and links are modeled by queueing centers. After receiving service, a task may request service from the same processor or from any other processor in the network according to certain routing probabilities. Service demands per task on a server are exponentially distributed; communication or synchronization restrictions are not considered. Furthermore it is assumed that the degree of parallelism (i.e. the number of jobs in the system) is constant over time. This assumption is unrealistic, but necessary from a queueing theoretic point of view for the system to operate in steady state. In [Mabb 94] a queueing network model is used to analyze the performance of a clustered shared-memory multiprocessor (MR-1). Since a queueing network model requires a characterization of the workload at a physical level, the workload parameters are the resource demands (memory access requests) and the arrival rates of the requests. To enhance the tractability of the model, the arrival stream is assumed to be a Poisson process, and the access times are assumed to be exponentially distributed. In [Kant 88] the workload consists of a continuous flow of applications. Applications are assumed to arrive at the parallel machine in a Poisson arrival stream, requiring a certain substructure of processing elements.
A fixed set of K different substructures may arrive (represented by different job classes in a queueing network model); for each type of substructure an exponentially distributed holding time with given mean is assumed. The objective is to find allocation schemes for assigning subsets of processors to applications. A queueing model is constructed for determining system throughput, queueing delays and processor utilization. Unfortunately, the load balance assumption will not hold for this model in general, thus solution methods become expensive. In [Chen 92] queueing network models are used to compare centralized and decentralized task queue organizations in parallel systems. The workload is assumed to have a fork-join job structure, where each job is assumed to consist of a set of independent tasks, which may be executed concurrently on any processor. There is no communication among tasks within a job. This workload is modeled by an arrival rate of jobs and a random variable giving the number of tasks per job. Each task is assumed to have an exponentially distributed execution time with given mean. Although solution techniques for product form queueing networks are well known, the necessary assumptions are often not satisfied when modeling a parallel system. A further drawback of queueing models for the purpose of predicting the performance of a parallel algorithm is that these models are architecture oriented, i.e. the queueing network model represents the architecture in detail, but the program structure is considered insufficiently. Summarizing all existing approaches, we conclude that an entirely new approach is needed for the description of workloads in terms of parallel program structure, behavior, and relevant timing parameters.
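For the fork-join workloads with exponential task times recurring in several of the models above, the synchronization penalty has a well known closed form: the expected maximum of n independent exponentials with mean 1/mu is H_n/mu, with H_n the n-th harmonic number (by memorylessness, the maximum decomposes into stages with rates n*mu, (n-1)*mu, ..., mu). A minimal sketch:

```python
# Expected makespan of one fork-join job: n independent exponential
# tasks of the given mean, one per processor, joined at a barrier.
def fork_join_makespan(n, mean_task_time=1.0):
    h_n = sum(1.0 / k for k in range(1, n + 1))   # n-th harmonic number
    return h_n * mean_task_time

for n in (1, 4, 16):
    t = fork_join_makespan(n)
    print(n, round(t, 3), round(n * 1.0 / t, 2))  # tasks, makespan, speedup
```

The resulting speedup n/H_n grows only like n/ln n, which quantifies why the synchronization point, even without any communication cost, limits the benefit of additional parallelism in these models.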
3 Program Code based WL Models
3.1 Performance Oriented Parallel Program Development
Performance engineering activities in performance oriented parallel program development range from performance prediction in the early stages (design), over modeling (analytical modeling and simulation) in the detailed specification and coding phases, to finally monitoring and measurement in the testing and correction phase. The quality of performance analysis based on modeling is fairly limited by the method specific assumptions necessary for the tractability of model evaluation (several of these have been presented extensively in the previous section). For simulation analysis almost all assumptions can be relaxed, yielding arbitrarily detailed WL models, but still leaving the chance of intractability of the model evaluation. For the performance prediction aspect of a performance oriented parallel program development process as aimed at by this workpackage, our emphasis is on attainable accuracy of performance predictions. We have decided to aim for the utmost precision in WL modeling by taking the preliminary program code itself as a WL characterization, and to evaluate the system performance by interpreting/simulating the WL model. Two major strengths appear to be inherent to this approach:
Accuracy: The WL characterization can be arbitrarily detailed, growing with the progress in implementation. Moreover, accuracy can be made fully user controllable through the possibility of annotating code with further (extensional or restrictive) WL characterizations.

Implicitness: WL characterization is no longer a performance modeler's activity, but happens implicitly with the coding of programs. No additional WL characterization efforts are required for performance predictions, although this possibility is not precluded. Vice versa, the performance engineering activities do not interfere with the development of operable program code: no additional coding efforts are imposed.

Within this workpackage we shall develop a tool (preliminarily called the N-MAP tool) exemplifying the idea of WL characterization based on skeletal parallel programs. To this end we first relate parallel program specifications, and their use as a means for roughly expressing the skeletal structure of a program, to the early phases of development and preliminary performance prediction. According to Figure 2 we consider the algorithmic idea as the first step towards a parallel application. To express his implementation intent, the programmer should not be forced at this point to provide detailed program code, but to focus on the constituent and performance critical program parts. We provide an extension to the C language, abstract enough to allow a quick denotation of the principal communication and computation pattern, but also syntactically detailed enough to support automated translation, verification and analysis. Special emphasis was devoted to meeting the demands of the incremental and iterative nature of the proposed development process, where a program may be specified at various levels of explicitness. In each phase of program development, more and more detailed information may be provided to further specify program functionality until a complete executable parallel program is created. As a result, certain parts of the full specification may be absent at various
stages of development, depending on whether these are of momentary interest or necessary for present program analysis and evaluation.

[Figure 2: Performance Oriented Parallel Program Development Cycle (Abstracted). The figure shows the cycle: Algorithmic Idea, High Level Specification (refine / modify / improve), Parse and Translate, Virtual Processor Parallel Program, Simulate Execution; alternatively the Physical Processor Parallel Program is compiled and executed; both paths yield an Execution Trace to View and Analyse.]

In the following subsections we look into the syntactical details of a potential program skeleton (or communication pattern) specification language defined on top of C.
3.2 The Program Skeleton Specification Language
The language C has been chosen as the basis for the implementation not only because of its wide dissemination, familiarity and acceptance, but also because it is well suited for syntactical extensions. The proposed extensions are described in the following from a syntactical point of view; they consist essentially of the definition of three types, task, process and packet, and two syntactical structures for communication calls, send() and recv().
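The interplay of these constructs can be sketched as follows. The sketch is purely illustrative: the identifiers (compute, border, worker) are our own, and the exact invocation and argument syntax of send() and recv() is not prescribed here.

```
task   compute[N];      /* declarations: N computational tasks    */
packet border[N];       /* N data packets for boundary exchange   */

process worker[i] where { i=0..N-1; } {
    compute[i];         /* pure computation, no communication     */
    send(border[i]);    /* communication happens outside the task */
    recv(border[i]);
}
```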
3.2.1 Tasks

The term task is used in the context of this work to designate a contiguous and self-contained sequential series of (computational) operations. Contrary to some definitions of the term task in the literature, a task is defined here as having no communication calls within its code (body). All communication calls necessary for task execution must be performed outside of the task itself. All tasks that occur in the program must be declared prior to their usage by means of the type specifier task. The task identifier may optionally be indexed. An example declaration of tasks could syntactically look like:

task my_task, readmatrix[N], transform[N][N];
3.2.2 Processes

The term process will be used to refer to the sequential stream of task and communication calls to be performed on a (virtual) processor. In this sense, the terms process and processor can be used interchangeably. A process is defined using the type specifier process:

<process> ::= process <identifier> [ where '{' <index_ranges> ';' [ <condition> ] '}' ] '{' <process_body> '}'
An optional clause, beginning with the keyword where, can be used to replicate a process. Example a) defines a single process, b) defines 100 similar process(or)s named column[0], column[1], etc., and c) defines a triangular matrix of process(or)s.

a) process column { }

b) process column[i] where { i=0..99; } { }

c) process column[i][j] where { i=0..N, j=0..N; i