A Petri Net Approach for Performance Oriented Parallel Program Design
A. Ferscha
Institut für Statistik und Informatik, Universität Wien
Lenaugasse 2/8, A-1080 Vienna, AUSTRIA
Tel.: +43 1 408 63 66 5, Fax: +43 1 408 63 66 6
Email:
[email protected]
Proposed running head: Performance Oriented Parallel Program Design
A. Ferscha, Institut für Statistik und Informatik, Universität Wien, Lenaugasse 2/8, A-1080 Vienna, AUSTRIA, Tel.: +43 1 408 63 66 5, Fax: +43 1 408 63 66 6, Email:
[email protected]
Abstract
Performance orientation in the development process of parallel software is motivated by outlining the misconception of current approaches, where performance activities come in at the very end of the development, mainly in terms of measurement or monitoring after the implementation phase. At that time the major part of the development work is already done, and performance pitfalls are very hard to repair, if this is possible at all. A development process for parallel programs that launches performance engineering in the early design phase is proposed, based on a Petri net specification methodology for the performance critical parts of a parallel system. The Petri net formalism is used to define Program Resource Mapping-net (PRM-net) models, which serve as an integrated performance model of parallel processing systems, combining performance characteristics of parallel programs (P-net), parallel hardware (R-net) and the assignment of programs to hardware (Mapping) into a single performance model, while simultaneously representing the specification of a parallel application. Predictable parallel algorithm skeletons are worked out as PRM-net models to simultaneously serve as generic program templates to be instantiated according to specific needs, thus representing the starting point of the development process. The systematic integration of a set of tools supporting performance oriented parallel programming results in the CAPSE (Computer Aided Parallel Software Engineering) environment, which is being built around the PRM-net methodology. Specification and performance prediction of parallel applications at the algorithm structure level are demonstrated by example.
1 Introduction
Although the skills in developing parallel algorithms and programs are growing with the use of parallel and massively parallel hardware [16], there are no general development rules or guidelines that conform with the primary motivation for using this technology, i.e. gaining maximum performance from the hardware. It is clear, at least today, that peak performance (the theoretical upper limit of machine instructions executed per time unit) and the performance attainable by a `practical' application often diverge by orders of magnitude. Summarizing current efforts to escape from the `parallel software' crisis shows that alarmingly little attention is given to the most critical issue in this respect, namely performance:
Compiler Approach Vectorizing compilers have been successfully used for restructuring innermost loops of programs to be executed on vector processors [2]. Parallelizing compilers [46] try to extract parallelism at the instruction and/or loop level, thus making explicit parallel programming and the parallelization of existing code unnecessary. The compiler approach seems promising in the detection and exploitation of fine grain parallelism, but is not realistic for large grain parallelism [29]. Hence we can expect satisfactory efficiency only for a small subclass of parallel architectures (SIMD).
Parallel Languages A broad spectrum of programming languages [40] for coding parallel applications has been developed, influenced by communication and synchronization mechanisms [3] as well as by programming paradigms as such [5]. At least imperative parallel languages are relatively easy to implement and reflect to some extent the underlying hardware model, so that the programmer can embody his application domain knowledge to optimally partition and assign his problem. However, he is often burdened with correctly managing communication and synchronization, which overwhelms his mental capacity. The functional, logic and object oriented approaches try to relieve this burden by providing higher expressiveness of the language, but considerably complicate implementation [44]. Theoretical approaches to parallel programming [9] are still far from efficient implementations. Support for parallel programming at a higher level of abstraction, away from the underlying hardware, is given by language constructs for manipulating a shared data structure like the Linda tuple space [8], but again efficient implementations are lacking. The language approach tends to follow the same development as sequential languages (from third to fourth and fifth generation) by abstracting more and more from machine models and thus enlarging expressiveness. No attempts seem to be undertaken to support any kind of performance analysis at the language level.
Environments The need for toolsets assisting program development for parallel machines has first been addressed in environmental prototypes like the FAUST [25] or the CODE [7] project. Recent advances like the PPSE project [31] address target machine independent design and implementation of parallel software systems based on large grain data flow, using a graphical and hierarchical construction of realistically sized programs. Working projects deal e.g. with source code generation from textual/graphical design, code visualization in terms of annotated graphs like in ParaGraph [4], concurrency analysis of Ada tasking programs in Total, visualization and interactive parallelization of Fortran code in Pat, of C programs in Aspar, data flow and static dependency analysis etc. [26]. With respect to performance the focus is on visualization tools [39], all of which use some combination of color, graphics, animation and even sound to present performance data of executing programs on the fly, or traced data (gathered by software or hardware monitors) post mortem. Examples of resource utilization visualization are the System Performance Displayer [41], or the ParaGraph environment [27], which takes execution profile data collected from message passing parallel computers to animate the behaviour of a parallel program in various modes of motion via several simultaneous views. The TraceView [32] tool has nearly the same aims for machines supporting smaller grained parallelism. The IPS-2 instrumentation system [34] supports a hierarchical view of programs and provides performance data for each level (program, machine, process, procedure and instruction level).
From these (seemingly representative) examples of development environments we can conclude that the parallel program development support given is dominated by traditional life cycle models for sequential software. Performance engineering activities come into the development process as soon as a running implementation of some parallel application is at hand. Of course, at that time it is already too late for performance investigations, because pitfalls detected in the final implementation would cause tremendous reimplementation efforts, sometimes even a redesign of the entire application.
This work is to be seen as an attempt in favour of a support to the process of developing parallel software. We first propose performance oriented parallel program design and systematically describe a parallel software development cycle (Section 2). In a second step a Petri net based design methodology integrating functional specification and performance evaluation techniques (Section 3) is introduced. The design process is conceptually demonstrated for the specification of a general purpose, reusable application template at the algorithm structure level (Section 4), and practically performed in the CAPSE development environment (Section 5).
[Figure 1 shows the phases of parallel program development together with the tool support demanded in each phase: from the algorithm idea via a high level specification (hybrid editing and diagramming, proof of functional validity, performance prediction), an implementation skeleton (editing, code generation, compiling, debugging, testing, reusability and library management, performance prediction), coding and assignment (placement by hand, data partitioning and mapping, program decomposition and mapping, automated mapping), to a running version of the application that is tuned (performance measurement, visualization) into an efficient parallel application; all tools are accessed through a common user interface.]
Figure 1: Phases of Parallel Program Development.
2 Performance Oriented Parallel Program Design
Compared to the traditional development process of sequential software, where performance issues are insufficiently considered, one is now (naturally) convinced that performance evaluation is a critical factor in the upcoming parallel software development methodology. The challenge is to support the process of developing parallel software by effective (automated) performance engineering during the whole software development cycle. In contrast to the approaches of the environments presented above, in a performance oriented parallel program design performance engineering has to set in in the early design phases and accompany this process until the completion of the application. Performance engineering activities must hence range from performance prediction in early development stages, over performance analysis (based on both analytical and simulation modeling) in the detailed design and coding phase, to monitoring and measurement in the testing and correction phase. The following steps are to be taken towards an efficient application in the performance sense (figure 1 shows the development phases as well as useful and demanded tool support):
For expressing the rough algorithm idea of a parallel program the designer should at first not have to use traditional programming languages and immediately start with an implementation, as this is usually time consuming and error prone. This work proposes a high level specification method (based on the Petri net formalism) able to support automatic verification of functional validity and performance prediction, while simultaneously exploiting the graphical expressiveness of the framework (top of figure 2). Tool support in this development phase is given by graphical and hybrid (combined graphical and textual) editing tools consistent with the specification method.
After arriving at an implementation skeleton (program template) in the specification phase, one is interested whether the predicted performance promises an efficient implementation. To this end the specification is refined until a level of abstraction is reached where detailed functionality is specified in terms of program code (middle of figure 2). Code generators can translate from the specification into high level languages (bottom of figure 2); the use of several programming language paradigms, facilities and formalisms for language definitions by the user, and library maintenance tools become essential. The reuse of specifications as well as program fragments has to be supported by retrieval and adaption facilities. Performance prediction in this phase relies on additional user information, hence interactive tools for the specification of the expected dynamic program behaviour are necessary. On the other hand the designer may wish to improve the predicted performance by experimentally modifying implementations, again by using performance prediction tools.
If functional validity of the program is assured and the performance figures are acceptable, the designer can now turn to assigning components of his program to devices of the target architecture. Tool support can be given either by totally automatic (but generally inefficient) process-to-processor or data-to-processor mappers, by providing a set of standard mappings, or by facilitating the search for an optimum mapping by experimentation. In the latter case an object oriented, graphical, direct manipulative user interface with associated assignment-code generators and instrumentation tools is required.
Finally, if a running parallel program has been achieved, performance studies using monitoring techniques can be fruitful in detecting system bottlenecks and aid in `fine-tuning' the application. Further tool support in this phase is required for visualizing the running system on the basis of computation and communication events. The result of this phase is either an efficient parallel application, or the finding that the implementation has to be modified or rebuilt.
The integration of the tools mentioned to form an environment is illustrated in figure 8. The major focus of this work, though, will be on the presentation of the specification technique and its application in the first development phase.
3 PRM-net Models as Parallel Program Specifications
3.1 Formal Specification and Petri Nets
Among the most frequently used formalisms for specifying the functional behaviour of dynamic systems are automata, process algebras and Petri nets. Process algebras (e.g. CSP [28] or CCS [35]), like Petri nets, allow a precise description of systems due to their formal syntax and behavioural semantics, but also algebraic reasoning, deduction of properties and behaviour preserving equational transformation. Both approaches have proven highly suitable for parallel and distributed system construction, as simple composition operators (like sequential, parallel, iteration and nondeterministic choice) can be given to incrementally construct complex systems out of simpler ones. Additionally, Petri nets have a simple graphical representation, thus supporting the visualization of concurrent actions.
Specification of a parallel program (in the sense of this work) must also capture performance requirements effectively, not only the functional requirements. Beyond (traditional) requirements on specification methods (like consistency, a formal framework etc.) we claim support for functional and temporal specification, suitability for the specification of various aspects of parallel systems like software (control flow, data flow, communication, synchronization, nondeterminism etc.) and hardware (resources like processing elements, memory modules, communication media etc.), simple but expressive graphical means, the capability of investigations on various levels of abstraction (i.e. concepts of modularity and hierarchical decomposability), support for (automated) analysis concerning performance, functional validity, correctness and expected behaviour, and support for generating executable and analysable application prototypes as well as for the generation of high level language source code.
It is well known and proven in practice that Petri nets support the structured, hierarchical specification of phenomena typically arising in parallel programming (e.g. asynchronous or synchronous concurrent execution of cooperating processes), while at the same time providing a graphical formalism. Furthermore the generation of the corresponding performance models out of high level specifications is a (more or less simple) translation step, and several methods and tools exist for evaluating Petri net performance models. On the other hand, despite superior modeling power, the Petri net formalism is not actively used by software engineers. The fundamental problems of applying formal methods in general, although
their analysis capabilities are commonly recognized, lie in the difficulty of their use [14]. Therefore, for application to practical work we have to provide means to specify at a more abstract level with a set of `easy to use' descriptive methods. In the case of Petri nets this could be achieved by reducing the generality of the framework to an acceptable trade-off between expressive power and practical usability in software engineering.
In this work we apply a restricted class of hierarchical Petri nets (but one sufficient with respect to parallel programming) for specification at the highest level, namely the (parallel) algorithm structure level. At this level mainly communication patterns among independent (sequential) tasks are specified, reflecting the kind of parallelism actually exploited for the dedicated hardware. Obviously at that level the most critical performance decisions are being made, and performance prediction becomes essential. The Petri net specification is parametrized by performance characteristics of the dedicated hardware and estimated resource requirements, analogously to classical performance modeling techniques (e.g. [17]), where a machine model (characteristics of the servicing system) is joined with a workload model (characteristics of service demands) to form a system model for which performance indices are evaluated. (See the illustration in figure 2.) As long as there is only a rough estimate of the resource requirements, only a vague prediction of the performance behaviour will be possible. Nevertheless it should be possible to judge whether some specification promises satisfactory performance or not. If so, the application development should proceed with a refinement of the specification. (The possibility and importance of reusing approved algorithm skeletons (parallel program templates) has been pointed out in [20].) At the program code level of specification, a change from graphical to multilingual textual representation is performed, in order to allow the engineer to use the languages he is used to for expressing the functionality and the dynamic behaviour of the program under development. Depending on the predictability of the languages used in this refinement step, the estimation of the resource requirement parameters can be improved by automatic or semiautomatic parametrization of the model. The specification at the algorithm structure level and the program code level are consistent in the sense that automatic code generation can be applied to translate the specification into a syntactically and semantically correct parallel program in textual form (along with the specified mapping information), to be fed into a traditional compiler generating object code and execution schedules.
[Figure 2 illustrates, for each development phase, how the specification at the algorithm structure level (with rough resource requirement parameters), at the program code level (with improved resource requirement parameters) and finally the generated code (with automatically generated resource requirement parameters) is combined with the service characteristics of the dedicated hardware and the program to hardware mapping information, transformed into a performance model and evaluated, yielding a rough, an improved and finally a "precise" performance prediction.]
Figure 2: Petri Net Specification and Prediction of Parallel Applications.
Performance prediction, based on a (semi-)automatic generation of resource requirement parameters, is again improved.
3.2 Performance Modeling and Petri Nets
Petri net research has significantly contributed to the performance evaluation field in recent years [43], [36], [1], [15], [45], [13], and analysis and evaluation tools have evolved [11]. The simultaneous use of the formalism for analysing qualitative and quantitative aspects of systems has proven itself in practice and can be seen for example in [6].
The performance of parallel systems (a parallel program executing on parallel hardware) is not only determined by the performance of the hardware itself (e.g. processor, bus or link, and memory access speed etc.), but also by the structure of the parallel program (the underlying algorithm, the communication pattern, synchronization of tasks etc.) and the assignment of program parts (tasks that execute concurrently and cooperatively) to resources. Neither the approach of resource oriented performance evaluation of parallel processing systems, where only the system resources are modeled to some extent of detail [38], nor the program or process oriented approach, where exclusively software aspects are subject to the analysis [45], [23], [10], seems adequate to characterize the performance of parallel systems. The actual performance of such systems is always determined by the interdependencies between hardware performance and the requirements of parallel programs, i.e. the proper utilization of hardware performance by the program. With PRM-nets [19] a modeling technique has been given that considers hardware, software and mapping as the performance influencing factors, along with a computationally efficient and accurate method for the prediction of the performance of parallel computations running on parallel hardware. The performance measures of interest are the (expected) execution time of the program and the degree of resource utilization at the hardware level.
In the sequel we briefly introduce PRM-net models as a specification method for parallel programs, focussing mainly on the structure and the resource requirements of an application, and on those aspects of the specification that help to construct a corresponding performance model. The aspect of specifying detailed program functionality is of secondary importance at the algorithm structure level. We will denote functionality at that level of specification by the symbol φ, which stands for a functional specification in some conventional programming language to be added to the specification later on in the development process.
[Figure 3: P-net model of a parallel program. Part (a) shows two cyclic processes (Process 1 and Process 2) built from the compound processes C1.1, C1.2, C2.1 and C2.2, each iterating in a loop of n iterations and synchronizing through the communication process SR12 via the places BSR1/BSR2 and ESR1/ESR2. Part (b) shows the refinement of C1.1 into the subprocesses sub 1, sub 2 and sub 3, whose resource requirements are multisets over the primitive processes π1, π2, π3 of resource type p (processor) and aggregate to the compound requirement ((2π1 + 2π2 + 4π3), p) of C1.1.]
Figure 3: P-net model of a parallel program.
3.3 P-nets
A Petri net oriented process model is used to specify the structure, functionality and resource requirements of parallel programs. A process is graphically represented by a transition, where input places and output places of the transition are used to model the current state of the process. A process t is ready to get active if its corresponding process transition is enabled; the process gets active as the corresponding transition starts firing, and remains active for the firing duration. The process behaves according to a predefined functionality and terminates by releasing tokens to output places, therewith making subsequent processes (transitions) ready to get active (enabled). The Petri net specification of processes (components of parallel programs) is called a P-net or process graph. Processes can be arranged to be executed in sequence (figure 4 (b)), in parallel (figure 4 (c)), alternatively (figure 4 (d)) or iteratively (figure 4 (e)). Concurrent processes are allowed to communicate on a synchronous or asynchronous message passing basis. In figure 3 (a) the P-net of a simple program constituted by two cyclic processes working in parallel and communicating with each other is given. It is built
by a set of process transitions in a proper arrangement determining the dynamic behaviour of the program. With every process transition in the P-net a multiset of services offered by some physical (hardware) resource is associated, which we call the resource requirements of the process. To support hierarchical specification, process compositions can be folded to form a single, compound process, graphically represented by a single transition (box), by aggregation of the resource requirements of all the constituting processes. The opposite is also possible: a single process can be refined by specifying its internal structure in terms of complex process compositions. Figure 3 (b) shows that process C1.1 is constituted by three subprocesses sub 1, sub 2 and sub 3, each of them requiring a certain amount of the physical resource services π1, π2 and π3. The type of resource, p (processor), is also specified. When aggregating sub 1, sub 2 and sub 3 to comp 1, the resource requirements are cumulated. More formally we have:
Definition 3.1 (P-net) A P-net (or process graph) is a six-tuple $P = (P^P, T^P, R^P, M_{start}, R, \Phi)$ where:
(i) $(P^P, T^P, R^P)$ is the underlying net with $P^P = \{p_1, p_2, \ldots, p_{n_P}\}$ and $T^P = \{t_1, t_2, \ldots, t_{n_T}\}$. The elements $t_i \in T^P$ are called processes.
(ii) $\exists\, p_E \in P^P$ with $p_E \notin t^O \ \forall t \in T^P$. $p_E$ is the entry place of $P$.
(iii) $\exists\, p_A \in P^P$ with $p_A \notin t^I \ \forall t \in T^P$. $p_A$ is the termination place of $P$.
(iv) $\forall t \in T^P$: $t$ is either a primitive or a compound process.
(v) The direction of each $r \in R^P$ defines the direction of the flow of control.
(vi) $M_{start} = \{m_1, m_2, \ldots, m_{n_P}\}$ is the initial marking with $m_E = 1$ and $m_i = 0 \ \forall p_i \in P^P \setminus \{p_E\}$.
(vii) $R = \{\varrho_1, \varrho_2, \ldots, \varrho_{n_T}\}$ is the set of resource requirements of $\{t_1, t_2, \ldots, t_{n_T}\}$, where $\varrho_i$, the requirement of process $t_i$, is a set of tuples $(\gamma, \omega)$ in which $\gamma$ is a multiset of primitive processes requiring a resource of type $\omega$.
(viii) $\Phi = \{\varphi_1, \varphi_2, \ldots, \varphi_{n_T}\}$ is the set of functionalities of $\{t_1, t_2, \ldots, t_{n_T}\}$.
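To make the tuple structure of Definition 3.1 concrete, the following minimal sketch (not part of the original paper; all identifiers are illustrative) represents a P-net as a small Python data structure, with each resource requirement kept as one multiset of primitive processes per resource type:

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Process:
    name: str
    # resource requirement rho: one multiset of primitive processes per resource type omega
    requirement: dict = field(default_factory=dict)   # e.g. {"p": Counter({"pi1": 2, "pi2": 4})}
    functionality: str = "phi"                        # placeholder, to be filled in later phases

@dataclass
class PNet:
    places: set        # P^P, including the entry place "pE" and the termination place "pA"
    processes: dict    # T^P: name -> Process
    arcs: set          # R^P: flow relation, (place, process) or (process, place) pairs
    marking: dict      # M_start: pE holds one token, all other places are empty

# a two-process fragment in the spirit of figure 3 (illustrative values only)
net = PNet(
    places={"pE", "p1", "pA"},
    processes={
        "C1.1": Process("C1.1", {"p": Counter({"pi1": 2, "pi2": 2, "pi3": 4})}),
        "C1.2": Process("C1.2", {"p": Counter({"pi1": 1, "pi3": 3})}),
    },
    arcs={("pE", "C1.1"), ("C1.1", "p1"), ("p1", "C1.2"), ("C1.2", "pA")},
    marking={"pE": 1, "p1": 0, "pA": 0},
)
```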
[Figure 4: Graphs of process compositions: (a) synchronous communication via a common send/receive synchronization transition, (a') asynchronous communication via a deposit place, (b) sequential composition, (c) parallel composition with fork and join processes, (d) alternative composition with guards f1, ..., fnT, and (e) iterative composition with a loop place and entry/exit processes tE and tA.]
Figure 4: Graphs of process compositions.
We distinguish two types of process transitions: Primitive processes are deterministic in behaviour, i.e. they have deterministic resource requirements in that they always require the same amount of services from the physical resources. They are not further divisible and hence represent the building blocks of a parallel program. The set of all primitive processes within a parallel program is denoted by $\Pi = \{\pi_1, \pi_2, \ldots, \pi_{n_\Pi}\}$. The graph of a primitive process is a single transition (represented by a bar) with one entry place and one termination place. Complex structures of parallel programs are represented by process compositions, which are built as an arrangement of primitive or compound processes. A sufficient [28] set of process compositions (to specify any kind of block structured parallel application) is given in terms of process graph compositions in figure 4. For the aggregation of resource requirements when folding a process composition to a single, compound process (represented by a double bordered box) we have the
following rules:
Sequential A sequential composition $P_{seq} = (P^P, T^P, R^P, M_{start}, R, \Phi)$ of processes $T^P = \{t_1, t_2, \ldots, t_{n_T}\}$ with resource requirements $R = \{\varrho_1, \varrho_2, \ldots, \varrho_{n_T}\}$ and functionalities $\Phi = \{\varphi_1, \varphi_2, \ldots, \varphi_{n_T}\}$ is aggregated to a compound process $P = (P^P = \{p_E, p_A\}$, $T^P = \{t\}$, $R^P = \{(p_E, t), (t, p_A)\}$, $M_{start} = \{1, 0\}$, $R = \{\varrho\}$, $\Phi = \{\varphi\})$, where $\varrho$ contains one tuple (multiset of primitive processes, resource type) for each of the $n_\omega$ different types of resources required by $t_1, t_2, \ldots, t_{n_T}$ (given explicitly below), and $\varphi$ is the (abstract) product of the functionalities $\varphi_i$. Let $\varrho_i = \bigcup_{k=1}^{n_{\omega_i}} (\gamma_k, \omega_k)$ be the resource requirement of the process $t_i \in T^P$, assuming that $n_{\omega_i}$ is the number of different types of resources required by $t_i$. $\Omega_i = \{\omega_1, \omega_2, \ldots, \omega_{n_{\omega_i}}\}$ is the set of all types of resources required by $t_i$, and $\Omega = \bigcup_{i=1}^{n_T} \Omega_i$ is the set of all types of resources wanted by the whole composition. The resource requirement of the compound process is
$$\varrho = \bigcup_{j \,\mid\, \omega_j \in \Omega} \Big( \sum_{i \,\mid\, (\gamma_k, \omega_j) \in \varrho_i} \gamma_k \,,\; \omega_j \Big),$$
where $\sum$ is a symbol for the sum of multisets. This means that when aggregating a set of sequential processes to a single compound process as a matter of abstraction, the resource requirements of the constituent processes have to be cumulated with respect to the different types of physical resources.
Parallel The process graph $P_{par}$ of a parallel process composition comprises two additional processes: a fork process $t_f$ and a join process $t_j$, such that the parallel processes $(t_1, t_2, \ldots, t_{n_T})$ are allowed to get active concurrently once the fork process has terminated. The join process $t_j$ gets active as soon as the last $t \in \{t_1, t_2, \ldots, t_{n_T}\}$ has terminated. Folding a parallel process composition to a single compound process is similar to the sequential case with respect to the cumulation of resource requirements, since it is necessary to serve all the requirements of the (potentially parallel) subprocesses. (We assume that neither the fork nor the join process requires resources for its execution.)
Communication Parallel processes, i.e. processes having a fork transition in common, are allowed to communicate with each other. Interprocess communication with synchronous message passing is expressed by matching send (!) and receive (?) primitives and takes place if both processes issuing these commands are ready for the message exchange at the same time. We recognize the operations ! and ? to be processes and specify the communication of processes by a synchronization transition (see figure 4 (a)). (Note that the interrupted bar for the synchronization transition is only a drawing convention and represents a single transition in the usual sense.) With asynchronous message passing, on the other hand, where the send primitive is nonblocking, a facility like a monitor or buffer is used to deposit the message for collection at a later time. In this case the operations ! and ? can again be seen as processes. A deposit place is used to specify asynchronous communication (see figure 4 (a')).
Alternative The graph of an alternative process composition $P_{alt}$ has only two places $P^P = \{p_E, p_A\}$, allowing at most one of the alternative processes $(t_1, t_2, \ldots, t_{n_T})$ to get active at the same time (free choice conflict). Boolean guards $f_i$ are assigned to the arcs $(p_E, t_i)$ for conflict resolution. If none of the guards $f \in \{f_1, f_2, \ldots, f_{n_T}\}$ is true, none of the corresponding processes can get active. If more than one of them is true simultaneously, one of the corresponding processes is selected at random to get active (nondeterministic choice). The aggregation of the resource requirements in $P_{alt}$ hence is probabilistic: let $q_i = P\{f_{(p_E, t_i)} = \mathrm{true}\}$ be the probability that guard $f_i$ is true ($\sum_{i=1}^{n_T} q_i = 1$); then $P_{alt}$ is aggregated to a compound process with resource requirement
$$\varrho = \bigcup_{j \,\mid\, \omega_j \in \Omega} \Big( \sum_{i \,\mid\, (\gamma_k, \omega_j) \in \varrho_i} q_i\, \gamma_k \,,\; \omega_j \Big).$$
Iterative The graph of an iterative process composition $P_{iter}$ contains a `loop place' $p_L$ with entry and exit processes $t_E$ and $t_A$ (which are again assumed to require no resource services). The arc $(p_L, t_A)$ is a counter arc [15] denoting a specific number of loop iterations (consecutive enablings of the `loop body' $t$). (In the example P-net in figure 3 (a), $(P_{Loop1}, t_{A1})$ is a counter arc labeled by $n$, denoting that $t_{A1}$ is enabled if there are $n+1$ tokens in $P_{Loop1}$. $(P_{Loop1}, C_{1.1})$ is the corresponding counter-alternate arc enabling $C_{1.1}$ when the count in $P_{Loop1}$ is between 1 and $n$ inclusively. Firing of $C_{1.1}$ is allowed whenever a new token enters $P_{Loop1}$, but does not remove tokens from there, and places tokens on subsequent output places.)
Let $n$ be an estimate of the number of loop iterations; then $P_{iter}$ aggregates to a single, compound process with
$$\varrho = \bigcup_{k=1}^{n_\omega} (n\, \gamma_k, \omega_k).$$
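The three aggregation rules share a single pattern: per resource type, the multisets of primitive processes of the constituent processes are summed, each weighted by 1 (sequential and parallel compositions), by the branch probability q_i (alternative composition) or by the loop estimate n (iterative composition). A hedged sketch of this folding step, reusing the Counter-based representation from the sketch above:

```python
from collections import Counter

def aggregate(requirements, weights=None):
    """Fold the requirements of a process composition into one compound requirement:
    a multiset of primitive processes per resource type, with each constituent
    multiset scaled by its weight (1, q_i or n)."""
    weights = weights or [1] * len(requirements)
    compound = {}
    for rho, w in zip(requirements, weights):
        for omega, gamma in rho.items():                 # gamma is the multiset for type omega
            acc = compound.setdefault(omega, Counter())
            for prim, count in gamma.items():
                acc[prim] += w * count
    return compound

sub1 = {"p": Counter({"pi1": 2, "pi2": 1})}
sub2 = {"p": Counter({"pi1": 1})}
sub3 = {"p": Counter({"pi3": 3})}

seq   = aggregate([sub1, sub2, sub3])                    # sequential or parallel composition
alt   = aggregate([sub1, sub2], weights=[0.7, 0.3])      # alternative composition with q1, q2
loop  = aggregate([seq], weights=[10])                   # iterative composition, n = 10 iterations
```

The same routine can be applied bottom-up along the composition hierarchy, which is exactly how folding a refinement such as sub 1, sub 2, sub 3 into C1.1 cumulates the requirements.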
Concise specification of replicated P-nets For process compositions with replicated structures, P-nets can be specified in the flavour of Pr/T-nets [24] to get a more concise and easier to understand specification of compositions of huge sets of identical processes. This is of practical importance especially for sequential and parallel process compositions. We call $P_{seq}$ with $T^P = \{t_1, t_2, \ldots, t_{n_T}\}$ a replicated sequential process composition if all $t_i \in T^P$ are instances of one and the same process $t$ (see figure 5 (a)). In this case $P_{seq}$ is denoted by the abbreviation as in figure 5 (c), with an interpretation according to the semantics of a Pr/T-net as in figure 5 (b). The variable predicate $P_E\,x$ initially holds for all individual processes $\langle i \rangle$, $1 \le i \le N$. The transition selector $x = \min_i \langle i \rangle$ of $t$ selects processes by increasing identifiers to be executed (fired) by $t$ (removed from $P_E\,x$ and deposited into $P_A\,x$). (In general: an individual $\langle i \rangle$ is allowed to leave $P_E\,x$ only if the individual $\langle j \rangle$, $j = i - 1$, has arrived in $P_A\,x$.) Multiple replications are possible and denoted by subsequent annotation of replicators. The same is possible for replicated parallel process compositions with notations as in figure 5 (d)-(f). Note that here $t$ has no transition selector, i.e. $t$ can fire concurrently in all modes of $x$. The range of replication is enclosed in brackets (figure 5 (f)). The following relation among individual tokens in multiple (hierarchical) replications and plain tokens guarantees consistency with plain P-nets: let $u_k \le i_k \le o_k$ be the range of individuals in the $k$-th replication; then
$$\sum_{(u_k \le i_k \le o_k)} \langle i_1, i_2, \ldots, i_k \rangle \;=\; \langle i_1, i_2, \ldots, i_{k-1} \rangle$$
holds, with $\sum_{(u_1 \le i_1 \le o_1)} \langle i_1 \rangle = \bullet$, i.e. the presence of all individuals of a single replication is equivalent to the presence of the plain token.
Firing Rules P-nets have the same firing behaviour as plain Place/Transition nets [42], with the exception of the guarded transition firing in alternative process compositions. In this case the conflict among enabled transitions is solved depending on data in the specification, usually performed (interactively) by an application designer interested in the behaviour of the program when simulating the P-net.
[Figure 5: Replicated sequential/parallel process compositions: (a) a sequential composition of N instances of the same process t, (b) its Pr/T-net interpretation with the transition selector x = min ⟨i⟩, (c) the abbreviated notation (i = 1..N); (d)-(f) the corresponding parallel replication with the bracketed range [i = 1..N] and no transition selector.]
Figure 5: Replicated Sequential/Parallel Process Compositions.
For the purpose of investigating structural or performance properties of the P-net it is no longer necessary to consider guarded processes; one is rather interested in what the `real alternatives' are. Hence we can restrict our further considerations to P-nets meeting the following assumption of definitive choice: whenever the entry place of an alternative process composition is reached, at least one guard is true. According to this assumption we can eliminate guards from P-nets by simultaneously assigning appropriate random switches for conflict resolution.
3.4 Properties of P-nets
Every P-net is safe in every marking $M_i$ reachable from the initial marking $M_{start} = \{m_E, m_2, \ldots, m_A\} = \{1, 0, \ldots, 0\}$, and satisfies the free choice condition by definition. For the investigation of behavioural properties of P-nets we first extend P-nets as follows: let $P = (P^P, T^P, R^P, M_{start}, R, \Phi)$ be a P-net. The net $\overline{P} = (P^{\overline{P}}, T^{\overline{P}}, R^{\overline{P}}, M_{start})$ with $P^{\overline{P}} = P^P$, $T^{\overline{P}} = T^P \cup \{t_e\}$ and $R^{\overline{P}} = R^P \cup \{(p_A, t_e)\} \cup \{(t_e, p_E)\}$ is called the extension of $P$. (The extension of the P-net in figure 3 is constructed by adding a transition $t_e$ and arcs from $p_A$ and to $p_E$, thus `looping back' the token flow from $p_A$ to $p_E$.) The extension $\overline{P}$ of a valid P-net $P$ is a strongly connected Free Choice net (FC-P-net): it meets the free choice condition and $|p^I| \ge 1$, $|p^O| \ge 1$ $\forall p \in \overline{P}$. If there are no alternative or iterative process compositions in $P$, then $\overline{P}$ is a strongly connected Marked Graph (MG-P-net): $\forall p \in \overline{P}$, $|p^I| = |p^O| = 1$.
[Figure 6: PRM-nets obtained by mapping the P-net of figure 3 (a) onto (a) two processors connected by a communication link and (b) a single processor; the processes C1.1, C1.2, C2.1 and C2.2 are attached to the home places of the processor resources, and the communication process SR12 to the link resource.]
Figure 6: PRM-nets for (a) two processor and (b) one processor mapping.
To determine whether a parallel program is free of static deadlocks we consider a P-net $P$ with $M_{start} = \{m_E, m_2, \ldots, m_A\} = \{1, 0, \ldots, 0\}$ to terminate if $M_{stop} = \{m_E, m_2, \ldots, m_A\} = \{0, 0, \ldots, 1\}$ is reachable from every marking $M_i$ reachable from $M_{start}$. Now let $\overline{P}$ be the extension of $P$. $P$ terminates if $p_E$ as well as $p_A$ are covered by all (minimal) place invariants of $\overline{P}$ (for a proof of an equivalent theorem see [18]). Thus, if an invariant does not cover $p_E$ as well as $p_A$, then two undesirable situations appear, depending on whether the invariant is initially marked or not: if it is marked, then there is a cycle in the process causing a livelock; if not, then (some) subprocesses can never get active (dead transitions). Solving for the minimal place invariants of the extension of the P-net in figure 3 (a) (solve $I_{\overline{P}}^{T}\,\vec{i} = \vec{0}$ for the incidence matrix $I_{\overline{P}}$) we obtain two of them, both covering $p_E$ as well as $p_A$; thus $P$ will terminate. Moreover, in this example the number of invariants obtained (2) simultaneously expresses the degree of exploitable parallelism: $P$ can (potentially) execute on two different processing elements. We can conclude that P-nets, due to their definition, allow the application of the powerful methods (and tools) developed for structural analysis within the framework of P/T nets. P-nets at that stage represent qualitatively analyzable specifications of parallel applications at the algorithm structure level.
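The termination criterion can be checked mechanically: compute the non-negative place invariants of the extended net (vectors x with x^T C = 0 for the incidence matrix C) and test whether each of them covers both pE and pA. A minimal Farkas-style sketch, assuming a small dense incidence matrix; this is an illustration, not the analysis tool used by the author:

```python
from fractions import Fraction

def place_invariants(C):
    """Farkas-style computation of non-negative place invariants x with x^T C = 0.
    C is the incidence matrix as a list of rows (one row per place, one column per
    transition). Returns the support (set of place indices) of each invariant found."""
    n_p, n_t = len(C), len(C[0])
    # working rows: [identity part (the invariant) | remaining columns of x^T C]
    rows = [[Fraction(int(i == j)) for j in range(n_p)] + [Fraction(c) for c in C[i]]
            for i in range(n_p)]
    for t in range(n_t):
        col = n_p + t
        pos = [r for r in rows if r[col] > 0]
        neg = [r for r in rows if r[col] < 0]
        rows = [r for r in rows if r[col] == 0]
        for rp in pos:                           # combine rows so that column t cancels
            for rn in neg:
                rows.append([(-rn[col]) * a + rp[col] * b for a, b in zip(rp, rn)])
    return [{i for i in range(n_p) if r[i] != 0} for r in rows]

def terminates(C, p_E, p_A):
    """A P-net terminates if every place invariant of its extension covers pE and pA."""
    invariants = place_invariants(C)
    return bool(invariants) and all({p_E, p_A} <= support for support in invariants)

# toy extension of a purely sequential P-net: pE -> t1 -> p1 -> t2 -> pA -> te -> pE
C = [[-1,  0,  1],   # pE
     [ 1, -1,  0],   # p1
     [ 0,  1, -1]]   # pA
print(terminates(C, p_E=0, p_A=2))   # True: the single invariant {pE, p1, pA} covers both
```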
3.5 PRM-nets
After the qualitative analysis of the P-net model one is interested in the expected performance of the parallel application under development. This requires the inclusion of hardware performance characteristics and program to hardware mapping information into the model. Parallel processing systems typically employ pools of resources like memory, processing elements and communication devices, allowing their concurrent usage. The pool of resources, their connectivity and interactivity as well as their potential performance are in the sequel modeled by R-nets (resource nets). We assume that for every resource in a parallel processing environment one can identify its type and a set of services (primitive processes) offered to applications.
Definition 3.2 Let $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_{n_\Sigma}\}$ be the set of resources and $\Pi_\sigma$ the set of primitive processes that can be executed by $\sigma \in \Sigma$. An R-net is a resource graph $R = (P^R, T^R, R^R, M_{init}, T)$ where:
(i) $P^R = \{h_1, h_2, \ldots, h_{n_\Sigma}\}$ is a finite set of `home' places for the resources $\sigma \in \Sigma$.
(ii) $T^R = \{t_1, t_2, \ldots, t_{n_T}\}$ is a finite set of transitions and $R^R \subseteq (P^R \times T^R) \cup (T^R \times P^R)$ is a flow relation.
(iii) $M_{init}: P^R \mapsto \Sigma$ initially assigns resource tokens $\sigma_i \in \Sigma$ to the `home' places $h_i \in P^R$.
(iv) $T = \{\tau_1, \tau_2, \ldots, \tau_{n_\Sigma}\}$ is a set of functions, each $\tau_i: \Pi_{\sigma_i} \mapsto Z$ assigning deterministic timing information to the primitive processes executable by $\sigma_i$.
Every resource in the system is modeled by a token in the R-net having its proper home place. Presence of the (resource) token in that place indicates the availability of the resource (idle, ready to serve). Arcs in $R^R$ describe the direction of resource flows and help, together with the transitions in $T^R$, to model interactions and interdependencies between resources. With every resource $\sigma$ a set of primitive processes $\Pi_\sigma$ is associated, along with timing information $\tau(\pi)$ for each $\pi \in \Pi_\sigma$. $\tau(\pi)$ is the time it would take the resource to serve the primitive process $\pi$. The assignment of parallel (software) processes to resources is now expressed by a set of arcs combining P-nets and R-nets into a single Petri net:
Definition 3.3 A mapping is a set of arcs $M \subseteq (P^R \times T^P) \cup (T^P \times P^R)$ where
(i) $P^R \times T^P$ is the set of arcs leading from home places to processes such that if $(h_i, t_j) \in P^R \times T^P$ and the type of $\sigma_i$ is $\omega$, then $\pi \in \Pi_{\sigma_i} \ \forall \pi \in \gamma$ for all tuples $(\gamma, \omega) \in \varrho_j$.
(ii) $T^P \times P^R$ is the set of arcs leading from processes to home places, with $(t_j, h_i) \in T^P \times P^R \Rightarrow \exists (h_i, t_j) \in P^R \times T^P$ as in (i).
Assigning home places to process transitions is allowed only if all the primitive processes required by the transition are offered as services by the resource (and the resource also has to be of the desired type). We finally call the triple $PRM = \{P, R, M\}$, the combination of a P-net and an R-net by a mapping into a single, integrated net model, a PRM-net model. The PRM-net model is at the same time the specification of the parallel application at the algorithm structure level (see figure 2 and the detailed example in the next section).
Figure 6 gives sample mappings of the P-net in figure 3 (a) to a set of resources. Assume that the compound processes C1.1, C1.2, C2.1 and C2.2 all require resources of type p (processor), with Processor 1 and Processor 2 being resources of that type; then the process transitions are allowed to be mapped to them if they can serve all the primitive processes required by C1.1, C1.2, C2.1 and C2.2. Figure 6 (a) shows the assignment of the parallel program to two processors connected to each other by a communication link, fully exploiting the inherent parallelism. During one iteration step C1.1 and C2.1 can be executed concurrently by Processor 1 and Processor 2. The communication process SR12 synchronizes the two processes when being executed by a link type resource. (The link resource is made available only if both processes are ready for communication, i.e. there is a flow token both in BSR1 as well as in BSR2. This is expressed by bidirectional arcs between BSR1 (BSR2) and the transition preceding place Link, enabling this transition only if BSR1 and BSR2 are marked.) Finally C1.2 and C2.2 are executed concurrently. In the second case (figure 6 (b)) the program is mapped, without any change in the specification of the algorithm structure, to a single processor (capable of also serving communication processes, e.g. emulated by shared variables). All the processes C1.1, C1.2, C2.1 and C2.2, as well as SR12, are executed sequentially. Processes are scheduled according to their flow precedence in the P-net: when starting a new loop iteration only C1.1 and C2.1 are ready to get active. This is due to C1.1 and C2.1 being enabled by tokens in the loop places (mLoop1 = 1, mLoop2 = 1), and the residence of the resource token of Processor 1 in its home place (mProcessor1 = 1). The conflict among C1.1 and C2.1 is resolved as in ordinary Petri nets by nondeterministic selection of one transition (process) to fire (execute).
Assume C1.1 is chosen to execute first; then, after the resource used is replaced in its home place, only C2.1 is enabled and will fire. After that, the control flow in the P-net forces the communication SR12 to happen (mBSR1 = 1, mBSR2 = 1 and mProcessor1 = 1), etc. To conclude, a transition (process) assigned to some resource is enabled (ready to get active) if all of its input places in the P-net hold a (flow) token and the required resource token is in its home place. The transition (process) fires (executes) by removing all the flow tokens from its input places in the P-net and the resource token from the R-net, making the resource unavailable for other processes. After a certain firing period, flow tokens are placed on the output places in the P-net, while the resource token is placed back in its home place (R-net), making the resource available again.
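Definition 3.3 (i) reduces to a containment test: a process transition may be attached to a home place only if the resource behind it has the required type and offers every primitive process the transition asks for. A small sketch under the Counter-based representation used earlier (the service sets are invented for illustration):

```python
from collections import Counter

def mapping_allowed(requirement, resource_type, resource_services):
    """Definition 3.3 (i): every primitive process in the requirement tuple whose
    type matches the resource must be offered as a service by that resource."""
    gamma = requirement.get(resource_type)
    return gamma is not None and set(gamma) <= set(resource_services)

req_C11 = {"p": Counter({"pi1": 2, "pi2": 2, "pi3": 4})}   # requirement of C1.1
processor1_services = {"pi1", "pi2", "pi3"}                 # services offered by Processor 1

print(mapping_allowed(req_C11, "p", processor1_services))   # True: the arc (h_Processor1, C1.1) is legal
print(mapping_allowed(req_C11, "l", {"send", "recv"}))      # False: wrong resource type
```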
Timing in PRM-nets To support independence between the specification of the algorithm structure of an application and the characteristics of the resources we introduce the notion of interactive timing in the evaluation of PRM-nets. At the time a parallel application is being developed, the configuration of the target hardware is generally not known. The number, type and arrangement of processing elements, for example, is often determined on the basis of the parallel program so as to achieve optimum performance. To this end a model of the program has to reveal the amount and type of services required from hardware resources, independently of an actual, physical resource constellation. The P-net provides all this information: the amount of resources required is implicitly specified by the number of processes and the resource types required by them. The amount and type of services is explicitly in the model in terms of the resource requirements associated to process transitions. The resource requirements are expressed in terms of multisets of primitive processes during the parametrization of the P-net, and aggregated in the case of process compositions according to the rules given above. The actual execution time of some process is based on the performance characteristics of the assigned resource; these characteristics are explicitly in the R-net model in the shape of timing functions for primitive processes, and the assignment information is explicitly in the PRM-net model. Given now a process transition $t_i$ with resource requirement $\varrho_i = (\gamma, \omega)$, where $\gamma = \sum_{k \mid \pi_k \in \Pi_i} n_k \pi_k$ is a multiset of primitive processes out of $\Pi_i$ ($n_k$ denotes the multiplicity of $\pi_k$ in $\gamma$), assigned to a resource $\sigma_j$ with services $\Pi_j$ and service times $\tau_j(\pi)$, $\pi \in \Pi_j$ ($\Pi_i \subseteq \Pi_j$), the (deterministic) firing time for that transition is interactively (at the firing instant) calculated as
$$\sum_{k \mid \pi_k \in \Pi_i} n_k\, \tau_j(\pi_k).$$
The compound process C1.1 in figure 6 (a) for example, with resource requirements as in figure 3 (b), assigned to the resource Processor 1 with execution times $\tau_1(\pi_1)$, $\tau_1(\pi_2)$ and $\tau_1(\pi_3)$ for the primitive processes $\pi_1$, $\pi_2$ and $\pi_3$, would take $2\tau_1(\pi_1) + 2\tau_1(\pi_2) + 4\tau_1(\pi_3)$ time steps to execute. In the PRM-net (figure 6 (a)) this would cause (after firing of $t_{fork}$, which is not assigned to a resource since it does not require services, in zero time) a removal of the resource token from place Processor 1 in the R-net and of the flow token from place PLoop1 in the P-net at time zero, and a release of both tokens (the resource token back to place Processor 1, and the flow token to BSR1) at time $2\tau_1(\pi_1) + 2\tau_1(\pi_2) + 4\tau_1(\pi_3)$. During that time period, neither the specific flow token nor the resource token would be available (visible) in the PRM-net (note that this firing policy is in contrast to the usual `race policy'). Based on the above timing conventions, the expected overall execution time of the parallel program in the specification phase is evaluated by simulation of the PRM-net model. Tools [12] have proven useful for this task, although preprocessing is necessary to simulate interactive timing. With the simulation of the PRM-net one at the same time observes the token distributions of the home places in the R-net, representing the basis for the derivation of figures describing resource usage. At this point it is easily seen that PRM-nets serve as the basic model for performance prediction, while at the same time acting as the specification of the parallel application at the algorithm structure level. Moreover, PRM-net models bear the potential of being reused as general purpose application templates, instantiated by assigning a specific functionality for a special purpose. We will demonstrate this aspect of PRM-net based specification in the next section, by demonstrating the development steps from the algorithm idea down to an implementation skeleton (figure 2), and the former aspect (performance prediction) in terms of another example in the CAPSE environment afterwards.
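Interactive timing therefore amounts to evaluating the sum shown above at the instant a process transition fires on its assigned resource. A hedged sketch reproducing the C1.1 example; the service times are invented numbers, not values from the paper:

```python
from collections import Counter

def firing_time(gamma, tau):
    """Deterministic firing time of a process transition: multiplicity of each required
    primitive process times the service time the assigned resource needs for it."""
    return sum(count * tau[prim] for prim, count in gamma.items())

gamma_C11 = Counter({"pi1": 2, "pi2": 2, "pi3": 4})    # aggregated requirement of C1.1
tau_processor1 = {"pi1": 3.0, "pi2": 1.5, "pi3": 0.5}  # tau_1(pi_k), illustrative only

print(firing_time(gamma_C11, tau_processor1))           # 2*3.0 + 2*1.5 + 4*0.5 = 11.0
```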
4 Specification of a Process Pipeline
The principle common to pipelined parallel algorithms [37] is that data flows through a cascade of processes (pipeline stages) and is modified by the process activities. The processes acting as transformers of data usually have different functionality. Applying this functionality in a predefined sequence
upon the input data stream generates an output data stream representing the problem solution. The behaviour of a pipeline stage is to accept data from the predecessor stage, apply a set of operations on that data and pass the result to the successor stage. Consequently, as the stages have different functionality, the computations in a pipeline are asynchronous. Communication in a pipeline is always between consecutive stages, and usually a data item is allowed to enter a stage only if the predecessor has already released it, i.e. communication is regular. As the arrival and departure of data drives the computations in a pipeline, the regular communications naturally synchronize the asynchronous computations.
For the development of a pipelined parallel application we start, according to the scheme in figure 1, with an algorithm idea, which has been described verbally above. Formally we can think of a pipeline stage as a box with some hidden functionality as at the top left corner of figure 7, and of the pipeline itself as a linear arrangement of stages. A P-net specification for a single stage is easily derived: first a data item is accepted from the previous stage, which is specified as a receive process with resource requirement {(⟨?⟩, l)} (where l denotes a link type resource). After the data is received, some operation with functionality φ has to be performed, which is expressed by a process transition with resource requirement to a processor {(..., p)} (a multiset of primitive processes describing the resource requirements to the processor is intentionally left open here). Finally the data is passed on to the succeeding stage, specified by a send communication process with resource requirement {(⟨!⟩, l)}. This has to be repeated iteratively for a given set of data, which is specified by an iterative process composition that terminates after a certain number (n) of iterations. A P-net for the behaviour of a single pipeline stage is developed in the upper third of figure 7.
Assuming synchronous message passing communication, the pipeline stages have to be tied to each other such that the output (!) of some stage i forms a single (synchronization) transition with the input (?) of stage i+1. In the middle of figure 7 a three stage pipeline is given, where the iterative processes representing the single stages are forced to operate in parallel by being arranged as a parallel process composition. The P-net derived can now be analyzed from a correctness and termination point of view by applying structural analysis to the corresponding Petri net. Qualitative analysis of the P-net reveals that a process in stage i can get ready for execution in iteration j only if stage i−1 has already completed iteration j and passed the resulting data to stage i.
[Figure 7: Specification for a process pipeline: the algorithmic idea (a cascade of stages with hidden functionality), the P-net behaviour of a single pipeline stage (iterated receive {(⟨?⟩, l)}, compute {(..., p)}, send {(⟨!⟩, l)}), the three stage pipeline P-net obtained as a parallel composition of the stages, and the PRM-net resulting from mapping the three stages onto three processors connected by the links 1-2 and 2-3; changing the mapping turns the same P-net into different implementation skeletons for performance prediction.]
Figure 7: Specification for a Process Pipeline.
Furthermore, all the stages can potentially execute their processes in parallel: stage i in iteration j, stage i−1 in iteration j+1, stage i+1 in iteration j−1, etc. The maximum degree of parallelism is determined by the maximum overlap of stage computations, which can be verified by applying invariant analysis.
For performance prediction we have to provide additional information to the P-net model: first the program to hardware mapping information, and second the service characteristics of the hardware in use. This is achieved by extending the P-net to a PRM-net. In the case of the availability of a three processor pipeline, one possible mapping results from assigning a single pipeline stage to a single processor, and the communications among stages to the communication links connecting the processors (lower third of figure 7). The PRM-net can be evaluated under its parametrization (the set of all resource requirements in the P-net) by simulating the corresponding timed Petri net. The efficiency of pipelined algorithms is determined by the homogeneity of the stages, i.e. the balance of operations in the stages, and by the grain size of the stage computations, i.e. the communication frequency between stages. If a p-type resource is also able to serve an l-type requirement (emulation of message exchange), then communication could also be mapped to a processor. Without changing the P-net, several different mappings could be investigated to balance the operations across processors by variations of the R-net and the mapping arcs, yielding different parametrizations of the PRM-net. (Practical case studies with a real application on a real multiprocessor system are reported in [19], [22].)
As soon as a specification for which the predicted performance is satisfactory (or optimum) is found, the specification is kept as an application or program template, and the designer can start with the implementation by stepwise refinement of the specification, i.e. by providing the functionality of the stages in terms of a high level programming language. Program templates at that high level of abstraction can be instantiated in future applications by simply assigning different functionality to the stages, thus serving as general purpose, reusable, performance optimized implementation skeletons which can considerably ease application design and save development time.
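Even before any stage code exists, the predicted behaviour of such a mapping can be approximated by replaying the PRM-net timing by hand: with estimated stage and link firing times, the completion time of the three stage pipeline follows from a simple recurrence over items and stages. A rough sketch (the stage and link times are assumptions, not measured values); this is not the simulation tool referred to in the text:

```python
def pipeline_makespan(stage_times, link_times, n_items):
    """Estimate the execution time of a linear pipeline with synchronous message
    passing: an item enters stage j once the stage is free and stage j-1 has
    delivered the item over link j-1."""
    stages = len(stage_times)
    free_at = [0.0] * stages                  # time at which stage j finished its last item
    for _ in range(n_items):
        arrival = 0.0                         # the item enters the pipeline at the source
        for j in range(stages):
            start = max(arrival, free_at[j])  # wait for the item and for the stage
            free_at[j] = start + stage_times[j]
            arrival = free_at[j] + (link_times[j] if j < stages - 1 else 0.0)
    return free_at[-1]

# three unbalanced stages mapped onto three processors, n = 100 data items
print(pipeline_makespan([4.0, 6.0, 3.0], [0.5, 0.5], n_items=100))
```

Experiments of this kind expose exactly the homogeneity argument made above: the slowest stage dominates the makespan, so rebalancing the stage requirements (or changing the mapping) is where the predicted performance is won or lost.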
5 Application Specification in the CAPSE Environment
5.1 The CAPSE Environment
[Figure 8: Architecture of the CAPSE Environment. A graphical, object oriented user interface (X Windows) integrates the design tools (textual/graphical editor, hierarchy diagram editor), the coding tools (compiler, debugger, library manager), the performance prediction and measurement tools (model generator, model editor, evaluator for analysis/simulation, software monitor, performance viewer), the assignment tools (process/data placer, process/data mapper, mapping library manager) and the execution visualization tools (event tracer, execution viewer, view specifier), which communicate through shared repositories for specifications, code and executables, libraries, performance models, performance data, the mapping base, trace information and view settings.]
Figure 8: Architecture of the CAPSE Environment.
In contrast to the technology of computer aided software engineering (CASE), which provides a set of tools for software development according to traditional methodologies like the waterfall model, the spiral model or rapid prototyping, the Computer Aided Parallel Software Engineering (CAPSE) environment [21] aims to assist the proposed performance oriented development cycle (figure 1) in all development phases by integrating automated performance tools with the CASE methodology. It integrates the PRM-net based specification method for parallel programs with Petri net related structural and performance analysis methods. The interdependencies and hence the connectivity of the tools in the compound toolset are given in figure 8. All the tool functionality is hidden behind a single, graphical, window oriented user interface. The design tools follow the need for graphical program development at the algorithm structure level by applying the PRM-net metaphor, and are used to work out the specification, which is used by the coding tools to generate or develop program code, and by the performance tools to derive performance models. In this section we give a flavour of the development process from the algorithm idea to the implementation skeleton of the application in the CAPSE environment (the first step in figure 1, applying tools for semigraphical editing and performance prediction). As in this phase the major performance decisions have to be made, and as the proposed method applies mainly in this step (see the upper third of figure 2), we concentrate only on this subset of CAPSE (see [21] for a systematic description).
[Figure 9: (a) the P-net of a single systolic cell as edited in the specification tool, (b) the replicated grid of cells with merged synchronization transitions.]
Figure 9: Developing a Systolic Application in CAPSE.
5.2 Developing a Systolic Application in CAPSE
The major characteristic of systolic computation [30], in contrast to pipelining, is the homogeneity of the `stages', better known as systolic cells, whereas the operational principle is the same. A systolic algorithm can be completely expressed in terms of simple, functional cells, usually implemented in hardware (VLSI) and arranged so as to allow only nearest neighbour communication (`locality of communication'). The set of operations applied to the incoming data stream(s) is the same in each cell. All cells operate synchronously in parallel (i.e. every cell causes (approximately) the same processing time) in compute-communicate cycles. In the compute phase all the cells are busy operating, while in the communicate phase the cells propagate and receive data.
In CAPSE, a reusable parallel program template for systolic computations would be designed, specified, performance predicted and created as follows: The building block of a systolic computation, a cell, is represented by a process with input ports for accepting data from previous cells, a process body comprising a set of operations to be applied to the input stream, and output ports to pass (possibly modified) data on to succeeding cells. The algorithm idea hence is an arrangement, e.g. a two dimensional grid, of functional cells (concurrently) accepting two input streams, applying some functionality and outputting the streams. A corresponding P-net describing this behaviour is directly constructed in the specification tool of CAPSE (figure 9 (a)) using a graphical, hierarchical, high level Petri net editor (based on Design/CPN [33]). The entry place (Process enter) (a user defined place identifier corresponding to pE) precedes a fork which spawns two parallel receive (?) operations, say from the north and from the west; the cell then executes some multiset of primitive processes according to the functionality φ (to be specified later in the development process). Finally data is sent to the east and to the south by two concurrent send (!) operations. Concentrating only on the algorithm's local behaviour, we at first postpone the matching of the send and receive processes of a single cell with those of the neighbouring cells. This receive-compute-send cycle is repeated N times, thus the whole net is sequentially replicated by (i = 1..N). The overall algorithm structure is formed by a grid of cells. Assuming the necessity for M × M cells that iterate concurrently in synchronized lock step, we simply replicate the cell in parallel for the two grid dimensions: [j = 1..M] [k = 1..M] (figure 9 (b)). The inner replication index i hence defines the iteration level of a cell located at position (j, k) in the grid. The initial marking of place [j = 1..M][k = 1..M](i = 1..N) is the cartesian product of the grid (row/column) indices with the iteration index, and represents the symbolic sum Σ ⟨j, k, i⟩ of individuals ⟨j, k, i⟩ with j ∈ {1, 2, ..., M}, k ∈ {1, 2, ..., M} and i ∈ {1, 2, ..., N}. In the initial marking all the individuals ⟨j, k, 1⟩ are fired (simultaneously and independently of the values of j and k). As soon as ⟨j, k, 1⟩ has arrived in the place Process exit (pA), the corresponding ⟨j, k, 2⟩ is allowed to leave pE. As soon as all individuals ⟨j, k, i⟩, i ∈ {1, 2, ..., N}, are in pA, they are reduced to ⟨j, k⟩. As soon as all ⟨j, k⟩, j ∈ {1, 2, ..., M}, k ∈ {1, 2, ..., M}, are in pA, they are reduced to the plain token.
We finally have to assure that communication happens only between neighbouring cells in the grid that operate in consecutive iterations. This is realized by the transition selectors assigned to the preliminary synchronization transitions. An individual is denoted by ⟨r, s, t⟩ if it appears as a receiver in the communication, and by ⟨u, v, w⟩ if it appears as a sender. A west-east communication happens for individuals ⟨r, s, t⟩ and ⟨u, v, w⟩ for which the value bindings r = u, s = v + 1 and t = w hold (same row index, column index differs by 1). For a north-south communication to happen the value bindings have to fulfill r = u + 1, s = v and t = w (same column index, but row index differs by 1).
Sequential replication over the iteration index guarantees that whenever the sender is in iteration i, the receiver is in iteration i - 1. At iteration level i, cell (j, k) communicates with cell (j - 1, k) (north) and (j, k - 1) (west) in receiving mode, while with cell (j, k + 1) (east) and (j + 1, k) (south) in sending mode via the same transitions at iteration level i - 1. So the north-to-south (west-to-east) communication transitions drawn twice in the P-net of figure 9 (a) are one and the same in a `nonconnected' array of cells, and have been merged to form single synchronization transitions as in figure 9 (b). This corresponds to interconnecting the functional cells in the mental model of the parallel algorithm. We have now arrived at the algorithm skeleton of a systolic algorithm (for brevity we did not draw the `border processes' at the top and the left hand side of the grid acting as data stream sources; also not drawn are the data stream sink processes to the right and at the bottom of the grid), which can be assigned to a grid of processing elements described in the R-net, while the synchronization transitions between cells are mapped to hardware links. (The mapping is trivial and therefore omitted.) The model can now be parametrized by assigning resource requirements to process transitions directly in the diagram. The optimum placement of the P-net onto an R-net can then be found, e.g. by experimentation (investigating various different parametrizations), since performance prediction is possible directly on the basis of the specification - before any implementation work.
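As a small illustration of the transition selectors just described, the following hedged sketch (Python, hypothetical function names) tests whether a receiver individual <r, s, t> and a sender individual <u, v, w> are allowed to synchronize on the merged communication transitions.

def west_east_enabled(r, s, t, u, v, w):
    # same row, receiver one column to the east of the sender, same index
    return r == u and s == v + 1 and t == w

def north_south_enabled(r, s, t, u, v, w):
    # same column, receiver one row below the sender, same index
    return r == u + 1 and s == v and t == w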
5.3 Case Study: Performance Prediction of Matrix Multiplication

The systolic algorithm specified in figure 9 (b) can be considered as a general and reusable template for a whole class of systolic algorithms. We can use the specification e.g. for the problem C = A * B on a processor grid, given that A has dimension n1 x n2, B has dimension n2 x n3, and processing elements are arranged in a (now more general) N x M grid, if A can be partitioned into N equal sized row blocks (there exists lA with n1 div N = lA), B can be partitioned into M equal sized column blocks (there exists lB with n3 div M = lB) and the columns of A (rows of B) can be partitioned into s iterations (there exists s with n2 div s = l, 1 <= l <= n2 and l a natural number) such that

\[
A = \begin{pmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,n_2} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,n_2} \\
\vdots  & \vdots  & \ddots & \vdots   \\
a_{n_1,1} & a_{n_1,2} & \cdots & a_{n_1,n_2}
\end{pmatrix}
=
\begin{pmatrix}
A_{1,1} & A_{1,2} & \cdots & A_{1,s} \\
A_{2,1} & A_{2,2} & \cdots & A_{2,s} \\
\vdots  & \vdots  & \ddots & \vdots  \\
A_{N,1} & A_{N,2} & \cdots & A_{N,s}
\end{pmatrix},
\]
    A        B      Topology        Size Subm.    IP    Exec. Time PRM     Proc.  Speedup
 n1 x n2  n2 x n3    N x M     s    n_s    w_e     i    T_NxM(s) [usec]      P      S_P
  6 x 8    8 x 12    2 x 3     1     32     24    96        1742.755         6     1.436
  6 x 8    8 x 12    2 x 3     2     16     12    48        1156.457         6     2.164
  6 x 8    8 x 12    2 x 3     4      8      6    24         872.017         6     2.869
  6 x 8    8 x 12    2 x 3     8      4      3    12         747.215         6     3.349
  6 x 8    8 x 12    3 x 2     1     48     16    96        1998.691         6     1.252
  6 x 8    8 x 12    3 x 2     2     24      8    48        1348.409         6     1.856
  6 x 8    8 x 12    3 x 2     4     12      4    24        1031.977         6     2.425
  6 x 8    8 x 12    3 x 2     8      6      2    12         891.179         6     2.808
  6 x 8    8 x 12    6 x 1     1     96      8    96        4436.225         6     0.564
  6 x 8    8 x 12    6 x 1     2     48      4    48        2762.031         6     0.906
  6 x 8    8 x 12    6 x 1     4     24      2    24        1933.643         6     1.294
  6 x 8    8 x 12    6 x 1     8     12      1    12        1536.867         6     1.628
  6 x 8    8 x 12    1 x 6     1     16     48    96        3092.561         6     0.809
  6 x 8    8 x 12    1 x 6     2      8     24    48        1898.247         6     1.318
  6 x 8    8 x 12    1 x 6     4      4     12    24        1309.799         6     1.910
  6 x 8    8 x 12    1 x 6     8      2      6    12        1032.993         6     2.422
Figure 10: Speedup for Matrix Multiplication using 6 processing elements

where Ai,j is a lA x l submatrix of A, and B is analogously partitioned into s * M submatrices of dimension l x lB. In this case, in order to implement C = A * B, every cell would have to perform multiplication of lA x l submatrices of A and l x lB submatrices of B locally, i.e. the functionality of every cell (associated as program code to transition compute in figure 9 (b)) would be
\[
c_{i,j} = c_{i,j} + \sum_{h=1}^{l} a_{i,h} \cdot b_{h,j} \qquad \forall i, j \mid (1 \le i \le l_A),\ (1 \le j \le l_B),
\]
thus yielding lA x lB submatrices Ci,j of the solution C in every cell after s iterations (this is a plain matrix multiplication algorithm, since every cell multiplies and cumulates submatrices of A and B respectively).
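A minimal sketch of this per-cell functionality (plain Python, no particular matrix library assumed; the function name is illustrative) accumulates the product of an lA x l block of A and an l x lB block of B into the local lA x lB result block:

def cell_compute(C_local, A_block, B_block):
    """One firing of the compute transition: C_local += A_block * B_block.

    C_local is lA x lB, A_block is lA x l, B_block is l x lB,
    all given as lists of lists."""
    lA, l, lB = len(A_block), len(B_block), len(B_block[0])
    for i in range(lA):
        for j in range(lB):
            for h in range(l):
                C_local[i][j] += A_block[i][h] * B_block[h][j]
    return C_local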
Example Assume A to be 6 x 8, B to be 8 x 12 and the availability of 2 x 3 (N = 2, M = 3) processing elements to calculate C = A * B. Choosing s = 4 implies A to be partitioned into 2 * 4 = 8 submatrices of dimension 6/2 x 8/4 = 3 x 2, whereas B is partitioned into 4 * 3 = 12 submatrices of dimension 8/4 x 12/3 = 2 x 4. In every cell submatrices of A and B are multiplied to 6/2 x 12/3 = 3 x 4 submatrices of C, which are cumulated
in s = 4 iterations. In an experiment, the timing function of the transputer hardware modeled at the R-net level has been obtained by measurement and gave the following values: Only one primitive process of the resource processor (p) was required to model the application, namely p = {<+>}, where <+> is a process to multiply two (32-bit) integers and to cumulate them (c := c + a * b), with firing time c_b = 3.001 usec. For the link resource (l) we had l = {<?[i]>, <![i]>} (where <![i]> (<?[i]>) stands for sending (receiving) i integers through a hardware link), with transfer time alpha_b + i * beta_b, alpha_b = 2.903 usec and beta_b = 3.999 usec (alpha_b denotes the data transfer setup time and beta_b the propagation time for a single number). The resource requirements for the synchronization transition communicate west-east are {(<![6]> + <?[6]>, l)}, as in one single communication step a submatrix of A (6/2 * 8/4 = 6 integers) has to be transmitted if assigned to a resource of type l. Transition communicate north-south requires {(<![8]> + <?[8]>, l)} (8/4 * 12/3 = 8 integers). The computation transition (compute) requires {(24 <+>, p)} for 6/2 * 12/3 * 2 = 24 cumulated products if assigned to a resource of type p. Performance prediction based on the specification in figure 9 (b) revealed (for the example) an execution time of 872.017 usec using 6 processors, which represents a speedup of 2.869 over a mapping that uses only one processor. The table in figure 10 summarizes performance predictions for variations of the size of submatrices, the iteration factor s and the topological arrangement of processors - all derived with one and the same specification (figure 9 (b)). We discover that one should better choose s = 8 for a 2 x 3 processor topology, and note that one cannot gain higher speedup by topologically rearranging the processor array. So before having started with coding (figure 1), we can be sure of having found the optimum decomposition of the problem and the corresponding mapping.
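The derivation of these resource requirements from the problem and partition parameters can be sketched as follows (plain Python; the parameter names mirror the text, the helper name and the rest are illustrative assumptions):

def resource_requirements(n1, n2, n3, N, M, s):
    """Integers per west-east / north-south transfer and multiply-accumulate
    operations per compute firing, for an N x M grid and s iterations."""
    lA, lB, l = n1 // N, n3 // M, n2 // s
    west_east = lA * l      # submatrix of A sent eastwards
    north_south = l * lB    # submatrix of B sent southwards
    products = lA * lB * l  # cumulated products per compute transition
    return west_east, north_south, products

# Example from the text: 6x8 times 8x12 on a 2x3 grid with s = 4 gives
# 6 and 8 integers per link transfer and 24 cumulated products per cell.
print(resource_requirements(6, 8, 12, 2, 3, 4))   # -> (6, 8, 24)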
Analytical Verification The predicted execution times closely relate to the measurements of the fully implemented application [19] and are easily verified analytically. Let the firing time of a north-south synchronization transition be tau_{n-s}, and that of a west-east transition tau_{w-e}, respectively. The firing time of a process cell transition performing multiplication of submatrices is denoted by tau.
Given n1, n2, n3 and a partition lA, lB and s, these firing times are

\[
\tau = \frac{n_1}{N} \cdot \frac{n_3}{M} \cdot \frac{n_2}{s} \cdot c_b = l_A\, l_B\, l\, c_b,
\]
\[
\tau_{w-e} = \alpha_b + \frac{n_1}{N} \cdot \frac{n_2}{s} \cdot \beta_b = \alpha_b + l_A\, l\, \beta_b,
\]
\[
\tau_{n-s} = \alpha_b + \frac{n_2}{s} \cdot \frac{n_3}{M} \cdot \beta_b = \alpha_b + l\, l_B\, \beta_b.
\]
Then, under the assumption of concurrent send and receive operations, the execution time in a cell is tau + 2 max(tau_{n-s}, tau_{w-e}) because of the inhomogeneous submatrix transfer times. The fill-time of the first (leftmost) n_s pipeline (i.e. the time the first submatrix of A uses to reach the last cell in the array) with N stages (cells) is

\[
t^{n-s}_{fill,N}(1) = \max(\tau_{n-s}, \tau_{w-e}) + N\,(\tau + \tau_{n-s}),
\]

as computation in the first cell can start at time max(tau_{n-s}, tau_{w-e}) at the earliest. The second north-to-south pipeline can start computation at max(tau_{n-s}, tau_{w-e}) + (tau + tau_{w-e}), and the fill-time of pipeline i is hence

\[
t^{n-s}_{fill,N}(i) = \max(\tau_{n-s}, \tau_{w-e}) + (i-1)\,(\tau + \tau_{w-e}) + N\,(\tau + \tau_{n-s}).
\]

Analogously the fill-time for the i-th M-stage w_e pipeline is

\[
t^{w-e}_{fill,M}(i) = \max(\tau_{n-s}, \tau_{w-e}) + (i-1)\,(\tau + \tau_{n-s}) + M\,(\tau + \tau_{w-e}).
\]

After filling both groups of pipelines, submatrices are released in time intervals of

\[
t^{n-s}_{rel,N}(i) = t^{w-e}_{rel,M}(i) = \tau + 2\,\max(\tau_{n-s}, \tau_{w-e}).
\]

As s submatrices have to be propagated through every pipeline, the execution time is

\[
t^{n-s}_{exec,N}(i, s) = t^{n-s}_{fill,N}(i) + (s-1)\, t^{n-s}_{rel,N}(i) = \max(\tau_{n-s}, \tau_{w-e}) + (i-1)\,(\tau + \tau_{w-e}) + N\,(\tau + \tau_{n-s}) + (s-1)\,(\tau + 2\,\max(\tau_{n-s}, \tau_{w-e}))
\]

and

\[
t^{w-e}_{exec,M}(i, s) = t^{w-e}_{fill,M}(i) + (s-1)\, t^{w-e}_{rel,M}(i) = \max(\tau_{n-s}, \tau_{w-e}) + (i-1)\,(\tau + \tau_{n-s}) + M\,(\tau + \tau_{w-e}) + (s-1)\,(\tau + 2\,\max(\tau_{n-s}, \tau_{w-e})),
\]

respectively. The execution time T_{N x M} of a systolic computation on an N x M grid with a data stream length of s is therefore

\[
T_{N \times M}(s) = \max\bigl(t^{n-s}_{exec,N}(M, s),\; t^{w-e}_{exec,M}(N, s)\bigr).
\]
The overall execution time T_{2x3}(4) in our example, with tau = 72.024 usec and max(tau_{n-s}, tau_{w-e}) = max(34.895 usec, 26.897 usec) = 34.895 usec, is hence

T_{2x3}(4) = max(34.895 + (3 - 1) * 98.921 + 2 * 106.919 + (4 - 1) * 141.814, 34.895 + (2 - 1) * 106.919 + 3 * 98.921 + (4 - 1) * 141.814) = max(872.017, 864.019) = 872.017 usec,

which is exactly the predicted execution time.
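The whole prediction can be reproduced with a short script (a minimal sketch in Python, assuming the alpha_b, beta_b and c_b values given above and the analytical formulas just derived; the function name is illustrative and the code is not part of CAPSE itself):

def predicted_time(n1, n2, n3, N, M, s,
                   c_b=3.001, alpha_b=2.903, beta_b=3.999):
    """Predicted execution time T_{NxM}(s) in microseconds, following the
    analytical model above (firing times, pipeline fill and release times)."""
    lA, lB, l = n1 // N, n3 // M, n2 // s
    tau = lA * lB * l * c_b                 # compute transition
    tau_we = alpha_b + lA * l * beta_b      # west-east transfer
    tau_ns = alpha_b + l * lB * beta_b      # north-south transfer
    start = max(tau_ns, tau_we)             # earliest start of the first cell
    rel = tau + 2 * max(tau_ns, tau_we)     # release interval per iteration
    t_ns = start + (M - 1) * (tau + tau_we) + N * (tau + tau_ns) + (s - 1) * rel
    t_we = start + (N - 1) * (tau + tau_ns) + M * (tau + tau_we) + (s - 1) * rel
    return max(t_ns, t_we)

# Reproduces the worked example: T_{2x3}(4) = 872.017 usec.
print(round(predicted_time(6, 8, 12, 2, 3, 4), 3))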
6 Conclusion

This paper dealt with the performance oriented development of parallel programs, proposing a development cycle that launches performance engineering activities (performance prediction) in the initial phase, namely the design of the application at the algorithm structure level. A set of `easy to use' Petri net process templates along with a modular hierarchy concept was proposed to serve as a functional and temporal specification formalism for parallel program development. The graphical expressiveness of Petri nets has been exploited to support semigraphical specification at the highest level of abstraction, with a consistent interface to functional specification at a lower level of abstraction (in terms of high level programming languages). Applying the method for generating rapid prototypes of general purpose program templates revealed the close relationship between the programmer's mental model of the structure of the underlying parallel algorithm and its denotation in terms of the specification formalism. Moreover, the specification obtained is at the same time a Petri net performance model which can be used for performance prediction after appropriate parametrization with hardware characteristics, resource requirements at the software level and process to processor mapping information. Software tools are composed to form the CAPSE (Computer Aided Parallel Software Engineering) environment, whose promise mainly lies in performance prediction in the early phases, rather than in performance measurement of fully implemented applications. Recognizing that the driving forces for parallel processing technology originate in almost every case from computer architects and hardware designers, leaving the difficulty of developing applications for such machines to the software engineer, we see a chance to escape from the `parallel software' crisis by using a toolset of the proposed kind.
References

[1] Ajmone Marsan, M., Conte, G., and Balbo, G., A Class of Generalized Stochastic Petri Nets for the Performance Evaluation of Multiprocessor Systems, ACM Trans. Comput. Syst., 2(2):93-122, May 1984.
[2] Allen, J. R. and Kennedy, K., PFC: A Program to Convert Fortran to Parallel Form. In Proc. of the IBM Conf. on Parallel Computers and Scientific Computation, Rome, 1982.
[3] Andrews, G. R. and Schneider, F. B., Concepts and Notations for Concurrent Programming, ACM Comput. Surv., 15(1):3-43, Mar. 1983.
[4] Bailey, D. A., Cuny, J. E., and Loomis, C. P., ParaGraph: Graph Editor Support for Parallel Programming Environments, International Journal of Parallel Programming, 19(2):75-110, July 1990.
[5] Bal, H. E., Steiner, J. G., and Tanenbaum, A. S., Programming Languages for Distributed Computing Systems, ACM Computing Surveys, 21(3):261-322, Sept. 1989.
[6] Balbo, G., Bruell, S. C., Chen, P., and Chiola, G., An Example of Modelling and Evaluation of a Concurrent Program using Colored Stochastic Petri Nets: Lamport's Fast Mutual Exclusion Algorithm, IEEE Transactions on Parallel and Distributed Systems, (to appear).
[7] Browne, J. C., Azam, M., and Sobek, S., CODE: A Unified Approach to Parallel Programming, IEEE Software, 6(4):10-18, Dec. 1989.
[8] Carriero, N. and Gelernter, D., How to Write Parallel Programs: A Guide to the Perplexed, ACM Computing Surveys, 21(3):323-358, Sept. 1989.
[9] Chandy, K. M. and Misra, J., Parallel Program Design. A Foundation. Addison-Wesley Publ. Comp., Reading, Massachusetts, 1988.
[10] Chimento, P. F. and Trivedi, K. S., The Performance of Block Structured Programs on Processors Subject to Failure and Repair. In Gelenbe, E., (Ed.), High Performance Computer Systems, pp. 269-280, North-Holland, Amsterdam, 1988.
[11] Chiola, G., GreatSPN Users Manual. Version 1.3, September 1987. Technical report, Dipartimento di Informatica, corso Svizzera 185, 10149 Torino, Italy, 1987.
[12] Chiola, G., GreatSPN1.5 Software Architecture. In Proc. of the 5th Int. Conf. on Modelling Techniques and Tools for Computer Performance Evaluation, Torino, Italy, Feb 13-15, 1991 (to appear), pp. 117-132, 1991.
[13] Chiola, G., Bruno, G., and Demaria, T., Introducing a Color Formalism into Generalized Stochastic Petri Nets. In Proc. of the 9th European Workshop on Application and Theory of Petri Nets, June 22-24, 1988, Venice, Italy. (to appear), pp. 202-215, 1988.
[14] Denning, P., Technology or Management?, Communications of the ACM, 34(3):11-12, Mar. 1991.
[15] Dugan, J. B., Trivedi, K. S., Geist, R. M., and Nicola, V. F., Extended Stochastic Petri Nets: Applications and Analysis. In Proc. of the 10th Int. Symp. on Computer Performance (Performance 84), Paris, France, Dec 19-21, 1984, pp. 507-519, 1984.
[16] Duncan, R., A Survey of Parallel Computer Architectures, IEEE Computer, 23(2):5-16, Feb. 1990.
[17] Ferrari, D., Computer Systems Performance Evaluation. Prentice-Hall, Englewood Cliffs, New Jersey, 1978.
[18] Ferscha, A., Modellierung und Leistungsanalyse Paralleler Systeme mit dem PRM-Netz Modell. PhD thesis, University of Vienna, Institute of Statistics and Computer Science, May 1990.
[19] Ferscha, A., Modelling Mappings of Parallel Computations onto Parallel Architectures with the PRM-Net Model. In Girault, C. and Cosnard, M., (Eds.), Proc. of the IFIP WG 10.3 Working Conf. on Decentralized Systems, pp. 349-362. North Holland, 1990.
[20] Ferscha, A., PRM-Net Modules for Parallel Programming Paradigms. Technical Report ACPC/TR 91-9, February 1991, Austrian Center for Parallel Computation, Technical University of Vienna, Austria, 1991.
[21] Ferscha, A. and Haring, G., On Performance Oriented Environments for the Development of Parallel Programs, Kybernetika a Informatika, Proceedings of the 15th Symposium on Cybernetics and Informatics '91, April 3-5 1991, Smolenice Castle, CSFR, 4(1/2), 1991.
[22] Ferscha, A. and Kotsis, G., Optimum Interconnection Topologies for the Compute-Aggregate-Broadcast Operation on a Transputer Network. In Proceedings of the TRANSPUTER '92 Conference, Amsterdam, 1992. IOS Press (to appear).
[23] Gelenbe, E., Montagne, E., and Suros, R., A Performance Model of Block Structured Parallel Programs. In Cosnard, M., Quinton, P., Robert, Y., and Tchuente, M., (Eds.), Parallel Algorithms and Architectures, pp. 127-138, North-Holland, Amsterdam, 1986.
[24] Genrich, H. J., Predicate/Transition Nets. In Brauer, W., Reisig, W., and Rozenberg, G., (Eds.), Petri Nets: Central Models and Their Properties. Advances in Petri Nets 1986. LNCS Vol. 254, pp. 207-247. Springer Verlag, 1987.
[25] Guarna, V. A., Gannon, A., Jablonowsky, D., Malony, A. D., and Gaur, Y., Faust: An Integrated Environment for Parallel Programming, IEEE Software, 6(4):20-27, Dec. 1989.
[26] Harrison, W., Tools for Multiple CPU Environments, IEEE Software, 7(3):45-51, May 1990.
[27] Heath, M. T. and Etheridge, J. A., Visualizing Performance of Parallel Programs. Technical Report ORNL/TM-11813, Oak Ridge National Laboratory, May 1991.
[28] Hoare, C. A. R., Communicating Sequential Processes, Commun. ACM, 21(8), Aug. 1978.
[29] Kennedy, K., Compiling for Parallel Computers. In Strategic Directions in Computing Research, pp. 15-17. ACM Press, 1989.
[30] Kung, H., Why Systolic Architectures?, Computer, 15(1):37-46, 1982.
[31] Lewis, T. G. and Rudd, W. G., Architecture of the Parallel Programming Support Environment. Technical Report 90-80-2, Oregon State University Computer Science Department, 1990.
[32] Malony, A. D., Hammerslag, D. H., and Jablonowski, D. J., Traceview: A Trace Visualization Tool, IEEE Software, 8(5):19-28, September 1991.
[33] Meta Software, Design/CPN. A Tool Package Supporting the Use of Colored Petri Nets. Technical report, Meta Software Corporation, Cambridge, MA, USA, 1991.
[34] Miller, B. P., Clark, M., Hollingsworth, J., Kierstead, S., Lim, S.-S., and Torzejewski, T., IPS-2: The Second Generation of a Parallel Program Measurement System, IEEE Transactions on Parallel and Distributed Systems, 1(2):206-217, April 1990.
[35] Milner, R., A Calculus of Communicating Systems, volume 92 of Lecture Notes in Computer Science. Springer Verlag, New York, 1980.
[36] Molloy, M. K., Performance Analysis Using Stochastic Petri Nets, IEEE Trans. Comput., C-31(9):913-917, Sept. 1982.
[37] Nelson, P. A. and Snyder, P., Programming Paradigms for Nonshared Memory Parallel Computers. In Jamieson, L. H., Gannon, D. B., and Douglass, R. J., (Eds.), The Characteristics of Parallel Algorithms, pp. 3-20. MIT Press, 1987.
[38] Nelson, R., Towsley, D., and Tantawi, A. N., Performance Analysis of Parallel Processing Systems, IEEE Transactions on Software Engineering, 14(4):532-539, April 1988.
[39] Nichols, K., Performance Tools, IEEE Software, 7(3):21-30, May 1990.
[40] Perrott, R. H., Parallel Programming. Addison-Wesley, 1987.
[41] Po, W.-P., Lewis, T., and Thakkar, S., System Performance Displayer: A Performance Monitoring Tool. Technical Report 91-30-6, Oregon State University Computer Science Department, 1991.
[42] Reisig, W., Place/Transition Systems. In Brauer, W., Reisig, W., and Rozenberg, G., (Eds.), Petri Nets: Central Models and Their Properties. Advances in Petri Nets 1986. LNCS Vol. 254, pp. 117-141. Springer Verlag, 1987.
[43] Sifakis, J., Use of Petri Nets for Performance Evaluation, pp. 75-93. North-Holland, 1977.
[44] Treleaven, P. E., Parallel Computers. Object-Oriented, Functional, Logic. Series in Parallel Computing, John Wiley & Sons, 1990.
[45] Vernon, M. K. and Holliday, M. A., A Generalized Timed Petri Net Model for Performance Analysis. In Proc. Int. Workshop on Timed Petri Nets, pp. 181-190. IEEE Comp. Soc. Press, July 1985.
[46] Zima, H. P. and Chapman, B., Supercompilers for Parallel and Vector Computers. ACM-Press Frontier Series, Addison-Wesley, 1990.