Metrics and Techniques for Automatic Partitioning and Assignment of Object-based Concurrent Programs

Lonnie R. Welch, Binoy Ravindran, Jorge Henriques, Dieter K. Hammer
March 23, 1995

Abstract

The software crisis is defined as the inability to meet the demands for new software systems, due to the slow rate at which systems can be developed. To address the crisis, object-based design techniques and domain models have been developed. Furthermore, languages such as Modula-2, Ada, Smalltalk and C++ have been developed to enable the software realization of object-based designs. However, object-based design and implementation techniques do not address an additional problem that plagues systems engineers: the effective utilization of distributed and parallel hardware platforms. This problem is partly addressed by program partitioning languages that allow an engineer to specify how software components should be partitioned and assigned to the nodes of concurrent computers. However, very little has been done to automate the task of configuration, that is, the tasks of partitioning and assignment. Thus, this paper describes automated techniques for distributed/parallel configuration of object-based applications, and demonstrates them on programs written in Ada. The granularity of partitioning is the program unit, including software components such as objects, classes, tasks, packages (including generics) and subprograms. The partitioning is performed by constructing a call-rendezvous graph (CRG) for the application program. The nodes of the graph represent the program units and the edges denote call and task interaction/rendezvous relationships. The CRG is augmented with edge weights depicting inter-program-unit communication relationships and concurrency relationships, resulting in a weighted CRG (WCRG). The partitioning algorithm repeatedly cuts edges of the WCRG with the goal of producing a set of partitions among which (1) there is a small amount of communication and (2) there is a large degree of potential for concurrent execution.
Following the partitioning of the WCRG into tightly coupled clusters, a random neural network is employed to assign clusters to physical processors. Additionally, a graphical interface is provided to allow viewing and modification of the software-hardware configuration. This two-pass approach to configuration is useful in large systems, where it is necessary to first reduce the complexity of the problem before applying an accurate assignment optimization technique (such as neural networks, genetic algorithms or simulated annealing). Although the tools described in this paper partition and assign Ada programs, they are easily extended to work with other languages by simply changing the front-end parser. Thus, a general solution to the configuration problem for object-based concurrent programs is provided.

This work is supported in part by The U.S. NSWC (N60921-93-M-1912 and N60921-94-M-G096) and by the U.S. ONR (N00014-92-J-1367). Welch, Ravindran and Henriques are with The Department of Computer and Information Science, New Jersey Institute of Technology, Newark, NJ 07102, e-mail: [email protected], phone: 201-596-5683. Hammer is with The Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands, e-mail: [email protected], phone: +31-40-472734.


1 Introduction

The software engineering techniques of reuse, encapsulation and information hiding are increasingly being employed in the construction of large, mission-critical software. Thus, systems composed of abstract data type (ADT) hierarchies and tasks are being produced. Simultaneously, the hardware platforms continue to increase in sophistication, particularly in the area of concurrency. Thus, we are confronted with the problem of bridging the gap between (1) large computer-based software systems composed of tasks and ADTs/objects and (2) parallel and distributed computer platforms. As an example of the type of software system under consideration, consider the track processing system, common in air traffic control and defense systems, which can be implemented with ADTs as represented in Figure 1. The figure shows the ADT instances used in the implementation, as well as call relationships among the instances. The track file is implemented as a list of tracks. A track is an ADT implemented as a queue of the last n snapshots of the track's state. The research described in this paper addresses the mapping of layered software systems onto parallel and distributed hardware platforms. We are currently using a 64-node Intel Paragon as a parallel computing testbed, and a collection of DEC workstations as a distributed platform. Given a software system, software components are partitioned/clustered according to some binding relationships (such as communication, concurrency or shared data access), and the clusters are assigned to processors in a way that causes efficient utilization of hardware resources and simultaneously obeys system constraints [20, 22, 19, 17]. For example, the ADTs implementing the software for the AEGIS cruiser with doctrine regions could be assigned to processors as shown in Figure 2.
The magnitude of the software systems under consideration as well as the complexity of the objectives and constraints make partitioning and assignment too difficult for the average human. Furthermore, the software systems are too large for the effective use of a single-pass, accurate assignment algorithm. A partial solution to this problem is to define languages for describing how software components should be partitioned and assigned. Previous work [18, 7] shows the need for an algorithm to assign ADT modules to processors, but development of such an algorithm was not the focus of that work. In [18], the items identified as distributable (assignable) are Ada packages and tasks, which are a subset of the items considered here. APPL [7] is a language for specifying mappings of Ada programs to concurrent architectures. This permits programs to be implemented without taking hardware configurations into account, the same philosophy used here. A successor to APPL is the Distributed Application Development System (DADS) [10], which not only provides a distribution specification language, but also provides a linker, code generator and run-time system. These efforts addressed important problems in exploiting concurrent computer platforms, but none addressed actual partitioning and assignment. Although solutions have been published for partitioning and assignment, the approach outlined here differs from previous approaches in the following ways. The majority of the previously reported assignment algorithms, of which [13, 8, 1, 26, 2] are representative, consider the unit of distribution to be a procedure, a task, or a process; in contrast, the assignment algorithm presented here also considers generic abstract data type (ADT) modules as units of distribution. Furthermore, the metrics which drive our configuration are rigorously defined and address concurrency, two features not seen in conjunction in previous work.
The configuration technique described in this paper is depicted in Figure 3. The first step of our approach is to extract concurrency and communication metrics. Following this, the software components are partitioned, or assigned to logical processors. Partitioning is done using coarse, fast techniques for (1) computing metrics and (2) optimization. Partitioning is followed by the assignment of partitions to processors (i.e., logical processors are mapped to physical processors). Assignment uses more accurate and more costly metrics and optimization heuristics. Finally, the assignment and partitioning specifications are combined with the application and executed in a concurrent manner. The rest of the paper describes our approach to configuration. First, our analysis process is explained, showing the abstract representation of systems that our parsing tools produce, and defining techniques to compute communication and concurrency metrics from the abstract representation. The description

Figure 1: A design of doctrine processing software (ADT paradigm).


Figure 2: A possible assignment of ADT modules to processors.

[Figure 3 shows the pipeline: program → Analysis → Metrics + IR → Partitioning → Partitions/DADS spec → Generate Metrics II → Generate Assignment → Assignment/DADS spec → Link, Load, Execute.]

Figure 3: The configuration approach.


of the analysis process is followed by a discussion of the partitioning tool: the algorithm employed, an example and the graphical interface for manually tuning the partitions. A second analysis phase occurs after partitioning, to compute concurrency and communication metrics for partitions. The computation of these metrics is necessary before the assignment of partitions to processors, since the assignment algorithm uses the metrics to assess the suitability of the assignment with respect to the objectives. Finally, a random neural network solution to the partition assignment problem is presented.

2 Analysis: Extraction of IR and Computation of Metrics

In this section the analysis techniques and products that enable partitioning and assignment are described. The first part of the section discusses the intermediate representation of programs that is extracted by compiler tools. The remainder of the section describes how communication and concurrency metrics are produced from the intermediate representation.

2.1 The Intermediate Representation

This section defines a language-independent intermediate representation (IR) for capturing computer-based systems' software features that are essential for the partitioning and assignment processes. Systems such as AEGIS and the HiPer-D vision of the 21st Century shipboard computing system [12] contain software having the structure depicted in Figure 4. We term this structure the mission critical software architecture (MCSA) [23], in order to contrast it with the NSF's grand challenge software architecture. The MCSA depicts software systems that are composed of several layers (or tiers). Elements at tier i are implemented in terms of elements at tier i+1. Tier 1 consists of a set P of M executable programs or partitions: {P1, P2, ..., PM}. It is sometimes the case that the programs are implemented in different languages. Partitions are collections of program units like tasks, packages, classes, methods and objects. The potential concurrency among partitions Pi and Pj is given by the term Cij. At tier 2 are tasks (independent threads of control), which may share resources, and are permitted to run concurrently. The task tier is represented by the task rendezvous graph, a directed graph TRG = (V, E), wherein a vertex v ∈ V denotes a task object f(v), and an edge (x, y) ∈ E indicates that the code of task object f(x) initiates a rendezvous with an entry provided by task object f(y). Each task object T may be periodically executed, in which case PR(T) denotes the period of the task. Alternatively, a task may be asynchronous; that is, it may be activated by one or more events E(T) = {e_{T,1}, e_{T,2}, ...}. Tier 3 is composed of modules with multiple entry points (as in CMS-2), ADT packages (as in Ada, Modula, and Clu) and object classes (as in C++, Smalltalk and Eiffel).
The employment of tier 3 components provides conceptual clarity, enables tier 2 tasks to be implemented "on top of" abstractions exported by modules, and promotes reuse. At the package instance level, a directed graph is used to show call relationships among instances. A program is modeled by a directed graph CGRAPH_P = (V, E), where a vertex v ∈ V denotes a package instance f(v), and an edge (x, y) ∈ E indicates that the code of instance f(x) calls some subprogram(s) provided by instance f(y). The elements of tier 3 are implemented in terms of subprograms (or methods), the tier 4 elements. At the granularity of the subprogram, a directed graph CGRAPH_S = (V, E) is used to represent the call relationships by letting each vertex m ∈ V denote a subprogram f(m), and each edge (m, n) ∈ E indicate that the code of subprogram f(m) calls subprogram f(n). The research project described in this paper has produced software tools to extract the IR from Ada applications. With the Ada paradigm, it is possible for a subprogram to initiate a rendezvous with tasks or to call subprograms exported by packages, in addition to calling other subprograms. Likewise, in addition to rendezvousing with other tasks, Ada tasks may call subprograms. Similarly, packages may contain calls to subprograms and rendezvouses with task entries. Such interactions must be considered

Figure 4: Mission critical software architecture.


Figure 5: A call-rendezvous graph (CRG).


during the system configuration process. Thus, we define the call-rendezvous graph (CRG) by combining items from tiers 2, 3 and 4. The CRG combines the vertices and edges of TRG, CGRAPH_P and CGRAPH_S, and inserts directed edges representing calls from tasks to subprograms and packages, and indicating rendezvous initiations from subprograms and packages to tasks. A sample CRG is given in Figure 5; in the figure, dashed lines denote rendezvous edges, solid lines denote subprogram calls, boxes represent task objects, and circles represent packages and subprograms. Methods are implemented as a collection of statements or instructions (tier 5 elements). At this level of granularity, several important features are captured during parsing. The symbol table (SymTab) and the statement table (StmtTab) [11] are extracted and used for dependence and flow analyses. Dependence analysis involves processing of the StmtTab to extract graphs that represent statement-level precedence relations due to control dependences, data dependences, and code dependences. Dependence graphs represent program statements as nodes and use directed edges to denote statement ordering implied by the dependences in a source program. Different kinds of ordering requirements are represented in different dependence graphs. In the data dependence graph (DDG) 1 a directed edge denotes a data dependence (destination and source nodes need the same value). The instance dependence graph (IDG) 2 uses undirected edges to denote instance dependences (which occur when two nodes use operations exported by the same instance [21]). The subprogram dependence graph (SDG) uses an undirected edge to denote when two statements use the same subprogram. A directed edge in the control dependence graph (CDG) denotes that execution of the destination statement depends on a decision made by the source statement.
In addition to the dependence graphs, the control flow graph (CFG) is extracted at the statement level, indicating the sequential flow of control dictated by the order of the statements in the source code.
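To make the CRG construction concrete, here is a small Python sketch (our own simplification, not the paper's implementation; all identifiers are hypothetical) that merges the task, package, and subprogram graphs and tags each edge as a call or a rendezvous:

```python
# Build a CRG by merging the three graphs. Each argument is an iterable
# of (source, destination) pairs; extra_calls holds calls from tasks to
# subprograms/packages, extra_rendezvous holds rendezvous initiated by
# subprograms/packages toward tasks.
def build_crg(trg_edges, package_call_edges, subprog_call_edges,
              extra_calls=(), extra_rendezvous=()):
    crg = {}                                  # edge -> "call" | "rendezvous"
    def add(edge, kind):
        crg.setdefault(edge, kind)            # first classification wins
    for e in trg_edges:          add(e, "rendezvous")
    for e in package_call_edges: add(e, "call")
    for e in subprog_call_edges: add(e, "call")
    for e in extra_calls:        add(e, "call")
    for e in extra_rendezvous:   add(e, "rendezvous")
    return crg

crg = build_crg(trg_edges=[("T1", "T2")],
                package_call_edges=[("Track_File", "List")],
                subprog_call_edges=[("Insert", "Grow")],
                extra_calls=[("T1", "Track_File")],
                extra_rendezvous=[("Track_File", "T2")])
```

In a rendering such as Figure 5, the "rendezvous" edges would be drawn dashed and the "call" edges solid.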

2.2 Metrics for Communication and Concurrency

To enable optimization of partitioning and assignment, this section defines metrics for communication and concurrency among program units.

2.2.1 Metrics for Communication

The communication that takes place between program units is measured in terms of the amount of data that is passed in calls (or rendezvouses) and in terms of the frequencies of calls (or rendezvouses). We define CF_AB to be the frequency of calls from A to B. For the calculation of the concurrency matrix in Section 4 we also need the call frequency CF_kl of a particular operation O_kl of package PA_k. These parameters are approximated with the compile-time techniques shown in Figure 6. The call-rendezvous graph (CRG) is constructed and weights are placed on the edges to indicate the amount of data exchanged among program units, yielding a weighted CRG (WCRG) (see Figure 7). Initially, the edge weights are set to zero. The statement table is examined to determine when a call (or rendezvous) occurs. For each call (or rendezvous), the receiver (callee or rendezvous acceptor) is identified, as well as the sizes and modes (IN, OUT, or IN and OUT) of the actual parameters. This information is used to update the appropriate edge in the WCRG. Following the identification of all calls and the determination of the amount of information passed in each call, the call frequencies are propagated top-down in the WCRG and the communication weights are scaled accordingly. Further discussion of communication metrics is contained in Section 4.
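A simplified sketch of the weight-accumulation idea follows (ours, with made-up data; the actual tool propagates call frequencies top-down through the WCRG rather than multiplying directly as done here):

```python
from collections import defaultdict

# Each detected call/rendezvous contributes its total parameter size,
# scaled by its call frequency, to the WCRG edge weight.
def build_wcrg(calls):
    """calls: iterable of (sender, receiver, frequency, param_sizes),
    where param_sizes lists the byte sizes of the IN/OUT actual
    parameters of one call site."""
    wcrg = defaultdict(float)             # (sender, receiver) -> comm weight
    for sender, receiver, freq, param_sizes in calls:
        wcrg[(sender, receiver)] += freq * sum(param_sizes)
    return dict(wcrg)

wcrg = build_wcrg([("T1", "Track_File", 10, [8, 4]),
                   ("T1", "Track_File", 10, [4]),
                   ("Track_File", "List", 100, [8])])
```

Multiple call sites between the same sender and receiver accumulate onto one edge, as in the WCRG of Figure 7.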

2.2.2 Metrics for Concurrency

For the partitioning approach described in this paper it is necessary to compute metrics indicating the amount of potential concurrency among subprograms, ADT instances, and tasks. For the metrics

1 The DDG and the CDG are used for identifying concurrency inherent in software.
2 The IDG and the SDG are used for analysis of slowdown due to contention for the code of instances and subprograms.


[Figure 6 depicts process 1.5.4, Compute Communication Metrics, as steps: 1.5.4.1 Initialize weighted CRG; 1.5.4.2 Detect call or rendezvous (from StmtTab); 1.5.4.3 Identify receiver; 1.5.4.4 Identify parameters (from SymTab); 1.5.4.5 Determine parameter sizes; 1.5.4.6 Determine parameter modes; 1.5.4.7 Update WCRG weight; 1.5.4.8 Propagate call frequencies.]

Figure 6: The process for extracting communication metrics.


Figure 7: A weighted call-rendezvous graph (WCRG).


computation, we assume that tasks are the basic unit of concurrency, and define metrics that build on previous work wherein we have developed techniques to compute the following concurrency metrics:

- Inherently parallel percentage [25] of methods, ADT instances, and activities.
- Concurrency dependences [20] (e.g., A and B can run concurrently iff B and C can run concurrently).
- Maximum number of replicas of methods and class/package instances that can be used concurrently [27, 21].
- The set of potentially concurrent entities, at the levels of statements, methods, ADT instances, activities, and beads [20, 27, 25, 16].

Our approach to obtaining concurrency metrics is to exploit semantic knowledge that is implicit in the weighted call-rendezvous graph (WCRG). To define the metrics, it is necessary to define the following terms:

- M = {M1, M2, ...} is the set of modules (subprograms and packages/ADT-instances) in an application.
- T = {T1, T2, ...} is the set of tasks (activities) in an application.
- M(Ti) = {M_{i,1}, M_{i,2}, ...} is the set of modules used exclusively by task Ti. This also includes modules used indirectly or transitively. Note that M(Ti) ⊆ M.
- Q(Ti) = {Q_{i,1}, Q_{i,2}, ...} is the set of all modules used by task Ti. This also includes modules used indirectly or transitively. Note that M(Ti) ⊆ Q(Ti) ⊆ M.
- U(Mi) is the set of tasks that (directly or indirectly) use (call) module Mi.
- γ(x, y) is the percentage of concurrency among program units x and y. Either x or y may be a module (package or subprogram) or a task. γ(x, y) is a number between zero and one, where 0 denotes no concurrency and 1 full concurrency.

With these definitions, the concurrency metric γ(x, y) is formally defined as:

1. ∀i ∀x,y: γ(M_{i,x}, M_{i,y}) = 0
2. ∀i ∀x: γ(Ti, M_{i,x}) = 0
3. ∀i,j (i ≠ j) ∀x,y: γ(M_{i,x}, M_{j,y}) = 1
4. ∀i,j (i ≠ j) ∀x: γ(Ti, M_{j,x}) = 1
5. ∀j ∀i such that Ti ∉ U(Mj): γ(Ti, Mj) = 1
6. ∀j ∀i such that Ti ∈ U(Mj): 0 < γ(Ti, Mj) < 1

In order to approximate γ(x, y) for the last case, we use the call frequencies of the WCRG. If several tasks of the set U(Mj) call module Mj concurrently, the amount of concurrency for a particular task is proportional to its call frequency CF(Ti, Mj); i.e., a task has a higher probability of proceeding concurrently to a common module the less frequently it can be blocked in calling this module. The amount of concurrency can thus be evaluated as follows:

γ(Ti, Mj) = CF(Ti, Mj) / Σ_{Tk ∈ U(Mj)} CF(Tk, Mj), with

CF(Tk, Mj) = Π_{Ml in call chain Tk → Mj} CF(Tk, Ml)
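The frequency-proportional approximation for the shared-module case can be sketched as follows (a toy Python example with made-up frequencies, ours; the call-chain product that yields CF is assumed to be precomputed):

```python
# gamma(Ti, Mj): Ti's share of the total call frequency into module Mj
# from its user set U(Mj). A task outside U(Mj) gets full concurrency
# (rule 5); a task inside gets a value strictly between 0 and 1 when
# the module has several users (rule 6).
def gamma(task, module, call_freq, users):
    """call_freq: dict mapping (task, module) -> effective call frequency
    CF along the call chain; users: dict mapping module -> set of tasks
    that (directly or indirectly) call it."""
    if task not in users[module]:
        return 1.0                        # rule 5: no contention at Mj
    total = sum(call_freq[(t, module)] for t in users[module])
    return call_freq[(task, module)] / total

users = {"Queue": {"T1", "T2"}}
cf = {("T1", "Queue"): 30.0, ("T2", "Queue"): 10.0}
```

With these numbers, T1 calls Queue three times as often as T2, so it receives three quarters of the concurrency share.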

3 Partitioning: Mapping onto Logical Processors

The general partitioning problem addressed herein is to divide the tasks, packages and subprograms into groups, with the objective of maximizing concurrency and minimizing communication among groups. The reason for partitioning at this level of granularity is the magnitude of the applications for which the tool is used. A typical example is the AEGIS Weapon System [12], which consists of about 8 million lines of code. In such systems, the exploitation of concurrency at the statement level or at the loop iteration level is not practical in general, although it may occasionally be useful in isolated situations. Given the intermediate representation and the concurrency metrics, the program units of an Ada program are distributed automatically, according to the technique shown in Figure 8. The generation of the distribution specification is accomplished by considering concurrency and communication relationships among components, and clustering the components such that low communication cost results, while also achieving a high amount of potential concurrency. The distribution specification is produced in an abstract, language-independent form. The tool also contains a filter that maps from the language-independent form into the Rational/Verdix DADS distribution specification language [10]. The partitioning algorithm distributes the program units among an unbounded number of logical processors. This is performed by repeatedly cutting edges in the WCRG, until a collection of disconnected components exists. Each component represents a logical processor, or a partition. The edges are cut in the WCRG by considering concurrency and communication costs. The following three rules are applied during partitioning:

1. when γ(x, y) = 1, x and y are placed into different partitions; that is, partit(x) ≠ partit(y)
2. when γ(x, y) = 0, x and y are placed into the same partition; that is, partit(x) = partit(y)
3.
when 0 < γ(x, y) < 1, the communication metrics are used to determine whether x and y are placed in different partitions.

The partitioning rules are observed by the partitioning algorithm shown in Figure 9. Initially, the algorithm removes the rendezvous edges from the WCRG, resulting in groups that, except for rendezvous, do not communicate with each other. For example, the graph shown in Figure 10(a) was produced after the first step of partitioning. Each of the groups resulting from the initial partitioning step contains at least one task, and thus the groups can all run concurrently. 3 The first step of partitioning accounts for all cases where partitioning rules 1 and 2 apply. Partitioning rule 1 applies to two tasks, or to a package/subprogram used exclusively by task Ti and a task Tj (or to a package/subprogram used exclusively by Tj); in both of these cases the concurrency is 1. Partitioning rule 2 applies when two program units have no concurrency; that is, the rule applies to a task and a package/subprogram that it uses exclusively, or to two packages/subprograms used exclusively by the same task. Although the initial partitions can all potentially run concurrently, it may be the case that some potential concurrency is masked due to cases where multiple tasks reside in the same partition. This occurs when tasks share one or more packages or subprograms, and thus call edges link the tasks; such cases call for the application of partitioning rule 3. In fact, this is the case for tasks 1, 2, 3 and 4 in Figure 10(a). To increase the potential concurrency, the partitioning algorithm continues to selectively cut the edges of the WCRG until each partition contains a single task. To accomplish this, each group is searched for the presence of multiple tasks.
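The cheapest-cut step for separating two tasks in one group can be sketched as follows (our own minimal Python version: one call path is found by depth-first search and its lightest edge returned; the actual algorithm breaks every path between the task pair and, as described below, restricts cuts to a window in the center of each path):

```python
# Find a path of call edges between two tasks and return the cheapest
# edge on it; cutting that edge splits the group at the point of least
# communication cost.
def split_group(edges, weights, task_a, task_b):
    """edges: undirected adjacency dict (node -> neighbor list);
    weights: dict (u, v) -> communication weight.
    Returns the cheapest edge on one path from task_a to task_b,
    or None if no path exists."""
    stack, seen = [(task_a, [])], {task_a}
    while stack:
        node, path = stack.pop()
        if node == task_b:
            return min(path, key=lambda e: weights[e])  # cheapest edge to cut
        for nbr in edges.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                stack.append((nbr, path + [(node, nbr)]))
    return None

# Toy group: tasks T1 and T2 linked through shared packages P and Q.
edges = {"T1": ["P"], "P": ["T1", "Q"], "Q": ["P", "T2"], "T2": ["Q"]}
weights = {("T1", "P"): 5.0, ("P", "Q"): 1.0, ("Q", "T2"): 4.0}
```

Here the cut falls on the package-to-package edge (P, Q), leaving each task with its more heavily used neighbor.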
Groups having more than a single task are further partitioned by arbitrarily selecting two tasks in the group, finding each path of call edges between the two tasks, and breaking each of the paths at the link that has the smallest communication weight (as determined by the communication metrics defined in Section 2.2.1). It was observed in application of the partitioning algorithm that, occasionally, partitions containing only a single task were produced. This is sometimes undesirable for load balancing. To get a balanced distribution of program units among partitions, we also use a parameter which defines a window in the center of each path within which cuts may be made;

3 Note that since tasks are non-callable units, they appear as nodes without parents in the WCRG after removal of the rendezvous edges.


[Figure 8 shows the tool pipeline: Ada application (*.a) → Parse & Extract → IR → Concurrency & Communication Metrics → Partitioning → Distribution Spec. Library (DSL) → Display / Assignment.]

Figure 8: Design of the partitioning tool.


GenerateClusters(WCRG, GROUPS, num_groups)
begin
    num_groups := remove_rendezvous_edges(WCRG, GROUPS);
    for i = 1 to num_groups do
        if no_tasks(GROUPS(i)) > 1 then
            for each j in GROUPS(i) do
                if j.type = TASK then
                    for each k in GROUPS(i), k /= j do
                        if k.type = TASK then
                            E := Extract_Call_Chain(WCRG, j, k);  /* E: set of edges in call chain */
                            Break_Chain_Groups(E, x, y);          /* new groups x and y */
                            GROUPS(i) := GROUPS(i) - {x};
                            num_groups := num_groups + 1;
                            GROUPS(num_groups) := {x};
                        end if;
                    end for;
                end if;
            end for;
        end if;
    end for;
end GenerateClusters;

Figure 9: The partitioning algorithm.

outside of that window (i.e., close to the ends of the path) no cuts are permitted. Thus, the partitioning algorithm always chooses the cheapest edge within the limits specified by the parameter. Figure 10(b) shows how task 3 and some associated packages are placed into a partition separate from tasks 1, 2, and 4, by cutting a call edge in the WCRG. The process is repeated until there exists no group with more than a single task. One more repetition produces Figure 10(c). The final partitioned WCRG is shown in Figure 10(d). To allow graphical viewing of partitions and to permit manual intervention in the partitioning process, a graphical user interface (GUI) has been developed. As shown in Figure 11, partitions are represented in the GUI as large circles, while program units are depicted as small circles drawn inside the large circles. Pointing to a small circle and clicking displays the name of the program unit and indicates whether it is a task, package or procedure. The partitioning may be manually modified by pointing to a program unit, holding down the mouse button and dragging the unit into another partition. Clicking on the save button saves modifications in a file. The partitioning tool produces a platform-independent and distribution-language-independent description of the partitions. Essentially, it is a list of partitions, each containing a list of program units. However, for practicality, a filter has also been implemented to map the list of lists into a distribution specification in the DADS language. The DADS specification, along with an Ada program, the DADS linker and run-time system, and a concurrent computer, are sufficient for execution. The partitioning tool thus generates the group specification required by DADS. Each group consists of program units and corresponds to a logical processor. Such a specification is sufficient for execution, and would result in each group being assigned to one physical processor, if possible. The specification grammar for a group is:
The speci cation grammar for a group is: group_specification::= GROUP identifier IS identifier_or_asterisk; {identifier_or_asterisk;} END GROUP;

The group is given an identifier name and each element of the group is referenced by its name as defined in the application program. An asterisk may appear in at most one group; it symbolizes a wildcard

Figure 10: Steps of partitioning the WCRG.


Figure 11: The partitioning tool's graphical user interface.


GROUP 1 is
    task .At4.T4;
    package At4;
    package Atwo4;
    package Atfr1;
    package Atfr2;
    package Atfr3;
    package Atfr4;
    package Atwo5;
    package Att4;
    package Atwo3;
    *;
END GROUP;

[Groups 2 through 7 are listed similarly in the figure, each naming its task and packages.]

Figure 12: The DADS partitioning specification.

which means that unspecified program components are to be placed in that group. The (partial) DADS distribution specification corresponding to Figure 10(d) is given in Figure 12. Each DADS group corresponds to a partition or to a logical processor. In addition to groups, DADS also provides the station construct. Each station may contain program units and DADS groups, and each station corresponds to a physical processor. The following two sections consider assignment of logical processors to physical processors, i.e., the assignment of groups to stations.
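For illustration, a partition's member list can be rendered in this group syntax with a few lines of Python (our own helper, mirroring the grammar above; it is not part of the paper's filter):

```python
# Emit one DADS GROUP for a partition: each member is listed by the name
# it carries in the application program; at most one group may hold the
# wildcard that absorbs all unspecified components.
def emit_group(name, members, wildcard=False):
    lines = [f"GROUP {name} IS"]
    lines += [f"    {m};" for m in members]
    if wildcard:
        lines.append("    *;")
    lines.append("END GROUP;")
    return "\n".join(lines)

spec = emit_group("1", ["task .At4.T4", "package At4"], wildcard=True)
```

Mapping every partition through such a helper yields the list-of-groups specification that DADS consumes.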

4 Assessing Concurrency Among Partitions

The result of the partitioning step is a set of partitions P = {Pi | 1 ≤ i ≤ N}, each consisting of one task and a number of associated modules. In order to calculate the optimal assignment of partitions to PEs, the neural network uses a so-called concurrency matrix C as input. Each element Cij of this matrix gives

the potential concurrency between partitions Pi and Pj if they are assigned to different PEs, i.e., the concurrency without taking into account communication and synchronization with other partitions. In order to be implementation independent, we assume identical PEs with no runtime overhead for context switches etc. 4 and an ideal congestion-free network. This subsection describes the construction of this concurrency matrix. In order to solve the problem described above, we start from the observation that on each PE, at any time, only one thread of control can be active. In order to determine which one this is, we consider the dynamic execution of the system and not its static structure as given by the CRG 5. Since we do not know beforehand which execution path such a thread of control follows at runtime, we have to consider all possibilities. We call the graph of all possible execution paths an activity [6]. The nodes of an activity are calls to operations provided by modules and the edges are the precedence relations between these calls 6. We call the set of all concurrent activities an execution. Usually the different activities are not independent, but communicate via rendezvous and may synchronize at common resources 7. The difference between an execution and a CRG is that the latter essentially describes a structural (i.e., static) relationship, while an execution describes logical and temporal (i.e., dynamic) relationships. An execution can thus be considered as a CRG that is unrolled in time. Similar to the CRG, it can be easily derived from the information collected by the compiler. Since an object-oriented design establishes a hierarchy of abstractions, executions and activities can also be hierarchically decomposed. For this analysis, however, we make no use of the latter possibility.
In an object-oriented environment, common resources are always encapsulated by modules, and thus we only have to consider the rendezvous and the common modules of the CRG. Since the Ada runtime environment prevents concurrent access to shared modules, the only difference between a rendezvous and a call to a shared module is whether the synchronization of activities is mandatory or situational, i.e. dependent on the precise timing. Since our analysis only considers possible execution paths anyway, we can abstract from this difference. We thus arrive at an execution graph whose nodes are either tasks or shared modules and whose edges are precedence relations denoting either sequential execution (within an activity) or synchronization (between different activities). At the lowest level of abstraction, the execution graph of any reasonable application is far too complex. One possibility to deal with this problem would be hierarchical (de)composition of executions, i.e. (de)composition along the dynamic axis. For our purpose, however, we make use of the information already gathered in the WCRG, which describes a particular aggregation⁸, i.e. a (de)composition of the structure. Figure 10 shows the partitioned WCRG of our sample application. In order to construct the execution graph at the highest level, we proceed as follows:

1. We aggregate tiers of nested module calls in the following way:

- We approximate the computational weight W_kl of an operation call O_kl of module M_k by the number of non-commented source lines NCSL_k of the module divided by the number of operations NO_k: W_kl = NCSL_k / NO_k. At the source-code level of a high-level language like Ada, this is of course a very rough approximation, because the executable statements vary widely in execution time. At the target-code (assembler) level on a RISC machine, the estimate is very reasonable, since all statements take approximately the same time.
⁴ Alternatively, one could average the runtime overhead of the operating system over the execution time of the various statements.
⁵ We consider the rendezvous and iterations shown in the CRG as static information, since they can be derived from a static analysis of the code.
⁶ Such an activity is thus a partial logical and a total temporal order.
⁷ An execution is thus a partial logical and temporal order.
⁸ Since we consider the CRG and not the class/type hierarchy, we prefer to talk about different levels of aggregation instead of abstraction.


- We define the call frequency CF_kl as the number of times a particular operation O_kl of module M_k is invoked by all tasks per superperiod SP, as defined in step 2.
- We assume the execution time E_kl of operation O_kl to be directly proportional to W_kl, i.e. each primitive statement has execution time 1: E_kl = W_kl.
- Within one PE we assume that the communication overhead is included in the general overhead of the runtime system. If an operation O_kl at another PE is invoked, we define the communication cost CC_kl as the extra time needed to transmit the parameters and the results of size S_kl over an ideal network without congestion: CC_kl = c1 · (S_kl + 2·c2), where the constant c1 represents the ratio between the bandwidth of the network and the bandwidth of the PEs. The constant c2 accounts for the fixed overhead of a single message, e.g. if an empty message is sent for synchronization purposes. Note that S_kl depends on the size and type of the various parameters: IN, OUT, or IN OUT.
- We account for guarded commands by their average contribution to the total execution time. The average execution time of an n-fold branch with branching probabilities p_h (h = 1 to n), Σ_h p_h = 1, is thus Σ_h E_h · p_h. In a similar way, we unfold all iterations and recursions and take the average number of repetitions into account.
- We account for nested calls of O_kl by their weight, their call frequency CF_kl and their communication cost CC_kl, i.e. we perform a transitive summation: E_kl(level n) = 1 + Σ_{calls to level n+1} (E_kl + CC_kl) · CF_kl. In this formula, the constant 1 represents the call statement itself.
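The step-1 metrics above can be sketched as small functions. This is our hedged illustration (function names and the flat parameter shapes are ours), not the implemented tool:

```python
def op_weight(ncsl_k, no_k):
    """W_kl = NCSL_k / NO_k: rough computational weight of one operation of a
    module with NCSL_k non-commented source lines and NO_k operations."""
    return ncsl_k / no_k

def comm_cost(s_kl, c1=1.0, c2=1.0):
    """CC_kl = c1 * (S_kl + 2*c2): extra time for a remote call shipping
    parameters/results of total size S_kl over an ideal network; c1 is the
    network-to-PE bandwidth ratio, c2 the fixed per-message overhead."""
    return c1 * (s_kl + 2 * c2)

def avg_branch_time(branches):
    """Average execution time of an n-fold guarded branch, given
    (E_h, p_h) pairs whose probabilities sum to 1."""
    assert abs(sum(p for _, p in branches) - 1.0) < 1e-9
    return sum(e * p for e, p in branches)

def transitive_cost(nested_calls):
    """E_kl(level n) = 1 + sum over nested calls of (E + CC) * CF,
    where the constant 1 represents the call statement itself."""
    return 1 + sum((e + cc) * cf for e, cc, cf in nested_calls)
```

For instance, with c1 = c2 = 1 (as chosen for the sample application later in this section) a remote call shipping three parameters costs comm_cost(3) = 5 time units.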

2. We construct an approximate schedule at the highest level of aggregation, based on the results from step 1:

- We assume that all tasks start at the same point in time and that each task gives rise to one activity.
- Activity A_m is either (1) repetitive with period P_Am⁹ or (2) asynchronously triggered by an (internal or external) event with frequency F_Am. A real application may then be an arbitrary mix of these two possibilities. We thus have to consider a superperiod SP = LCM(P_Am | all periodic activities) and account for the sporadic activities with their maximum frequency.
- The schedule can now be constructed by any suitable static scheduling algorithm, like the ones described in [14] and [15]. At a high aggregation level, such a schedule becomes so simple that it can even be drawn manually. The units for constructing the schedule, i.e. the operation calls, are considered as non-preemptable pieces of code, called beads [6]. Note that the probably large size of these beads is no problem, because the schedule comprises only two activities on two PEs. Normal scheduling problems usually involve several activities per PE, which easily results in an infeasible schedule if preemption is excluded.

3. From this approximate schedule we deduce the intervals B_ij where the two activities are blocked (because of a rendezvous or a synchronization at a call of an operation of a shared module) and cannot execute in parallel on their respective PEs. The termination of an activity before the end of SP is counted as equivalent to blocking. From this we can easily calculate the concurrency C_ij = (SP − Σ B_ij)/SP, with 0 ≤ C_ij ≤ 1. Note that C_ij = 0 represents a (dead)lock and C_ij = 1 represents the independent execution of the two activities or partitions, respectively.

4. A refinement of steps 1 to 3 is possible by (1) lowering the level of aggregation or (2) refining the execution times. The first possibility is obvious. The second one means calculating the actual length of the various operations O_kl rather than their average weight W_kl, e.g. by means of a profiler.
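Steps 2 and 3 hinge on two small computations, the superperiod and the concurrency value. A minimal sketch (our code, assuming integer periods):

```python
from functools import reduce
from math import gcd

def superperiod(periods):
    """SP = LCM(P_Am | all periodic activities)."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)

def concurrency(sp, blocked_total):
    """C_ij = (SP - sum of blocked intervals B_ij) / SP, in [0, 1]:
    0 represents a (dead)lock, 1 fully independent execution."""
    c = (sp - blocked_total) / sp
    assert 0.0 <= c <= 1.0
    return c
```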
If the evaluation is done at source-code level, the Halstead metrics ([5] and [9]) can be used to account for the complexity of the code.

⁹ Note that the Ada programming model implies that P_Am = P_Tm as defined in Section 2.1.


To demonstrate the method, we consider a sample Ada application consisting of five tasks, each of them giving rise to one activity. These five tasks and the modules they use are distributed over five partitions by the method described above. All activities of this sample application are non-periodic and all parameters are IN OUT. The weighted CRG of this sample application is shown in Figure 7. For the sake of simplicity, we calculate the concurrency matrix at the highest level of aggregation of the source code, without accounting for the complexity of different programming constructs. For the same reason, we choose c1 = c2 = 1 and take only the number of parameters into account, not their size. Since we are only interested in the elements C_ij of the concurrency matrix, we consider the activities pairwise. The resulting partial schedule for activities 2 and 3 is shown in Figure 13. From this schedule we determine the potential concurrency between partitions 2 and 3 as C_23 = (66/1957) × 100 = 3.4%. In a similar way we can determine all the other elements of the concurrency matrix, which is the input for the allocation algorithm described in the next section.
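The figure for C_23 can be reproduced directly: the two activities can overlap for only 66 of the 1957 time units of the superperiod.

```python
# 66 time units of potential overlap out of a superperiod of 1957:
c23 = (66 / 1957) * 100
print(round(c23, 1))  # 3.4 (percent)
```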

5 Assignment: Mapping Logical Processors onto Physical Processors

This section describes a solution to the problem of mapping partitions (logical processors) onto physical processors, with the objective of maximizing concurrency (i.e., the goal is to assign partitions to different physical processing elements (PEs) if the partitions can execute concurrently). Note that this objective can be achieved by placing each partition onto a processor by itself. However, such a solution typically yields low processor utilization. To avoid this, we include the objective of minimizing the number of processors used to maximize concurrency. It is also desirable to conserve processors on parallel computers that allow the set of processors to be partitioned among different users. This paper presents an assignment heuristic that is implemented using Gelenbe's random neural network model (RNN) [3, 4]. The neural network serves as the second phase of a two-phase assignment approach that, during phase 1, forms communication-intensive clusters with a high amount of potential concurrency among them, and that considers concurrency and processor conservation when assigning clusters to physical processors.

5.1 The Assignment Problem

To effectively utilize a parallel or distributed computer, the partitions of a program must be distributed intelligently among the processors. We denote the set of partitions to be assigned to processors as FAC = {f1, f2, ..., fF}, and the set of PEs as PES = {p1, p2, ..., pP}. An assignment is represented as a function PE : FAC → PES. An acceptable assignment is one in which no two partitions that can execute concurrently are assigned to the same PE (i.e., all potential inter-module concurrency can be exploited). The relation f // g indicates that partitions f and g have some potential concurrency, the relation f ⊥ g indicates that f and g have no potential concurrency, and the function C(f, g) gives the amount of potential concurrency between f and g. The property of an acceptable assignment is formally stated as: for all f_i, f_j in FAC, ((PE(f_i) = PE(f_j)) ∧ (i ≠ j)) ⇒ f_i ⊥ f_j. An optimal assignment is an acceptable assignment that minimizes the number of processors required. Acceptable assignments cannot be obtained if a sufficient number of PEs is not provided in the parallel computer. In such cases we redefine what it means to have an optimal assignment. We say that a conflict occurs in an assignment whenever two partitions that can execute in parallel are assigned to the same PE. The cost of a conflict is equal to the amount of potential concurrency between the conflicting partitions. An optimal assignment is one in which the number of conflicts is minimized. This criterion can often be satisfied by several different assignments and with varying numbers of PEs. We further restrict the definition of an optimal assignment to the one that minimizes conflicts while requiring the fewest PEs possible to achieve that number of conflicts. In this section we show how a random neural network model can be constructed and solved for any instance of this assignment problem.
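The acceptability and conflict-cost criteria can be stated compactly in code. The following sketch is ours (the dictionary shapes and names are assumptions for illustration), not part of the paper's tool:

```python
def conflict_cost(assignment, conc):
    """Sum of C(f, g) over all pairs of distinct partitions sharing a PE.
    `assignment` maps partition -> PE; `conc[f][g]` is C(f, g), with 0
    meaning f and g have no potential concurrency (f is orthogonal to g)."""
    parts = sorted(assignment)
    total = 0.0
    for i, f in enumerate(parts):
        for g in parts[i + 1:]:
            if assignment[f] == assignment[g]:
                total += conc[f][g]
    return total

def acceptable(assignment, conc):
    """Acceptable iff no two concurrent partitions share a PE."""
    return conflict_cost(assignment, conc) == 0.0
```

Placing two concurrent partitions on one PE incurs a cost equal to their potential concurrency; an optimal assignment minimizes this total, and among equal-cost assignments the one using the fewest PEs is preferred.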

[Figure 13: The partial schedule of two activities. Activities 2 and 3 (partitions 2 and 3 on PE2 and PE3, at the highest level of aggregation): the rendezvous is initiated at t = 0 and accepted at t = 66, termination occurs at t = 581, and the superperiod ends at t = 1957.]


5.2 Gelenbe's Random Neural Network Model

The random neural network (RNN) model was invented by E. Gelenbe [3, 4]. A primary advantage of the RNN model over the traditional Hopfield neural network model is that its stable state can be obtained analytically, rather than through simulation. Thus, the stable state of a network can be obtained by iteratively solving for the stable probabilities that neurons are excited. Random neural networks are composed of interconnected neurons, with each neuron i having a potential greater than or equal to zero. A neuron emits signals, or "fires", at rate r(i), which is an exponentially distributed random variable. It emits a positive signal to neuron j with probability p+(i, j), emits a negative signal to neuron j with probability p−(i, j), and emits a signal that departs from the network with probability d(i). Furthermore, the following equality must hold:

    d(i) + Σ_j (p+(i, j) + p−(i, j)) = 1    (1)

Only neurons with positive potential are permitted to fire. The arrival of a positive signal at a neuron increases its potential by 1, and the arrival of a negative signal decreases it by 1. Firing decreases the potential by 1. Potentials are never allowed to become negative, so negative signals are ignored by neurons whose potential is zero. Signals also arrive at each neuron from outside the network: positive signals arrive at neuron i with rate Λ(i), and negative signals arrive at neuron i with rate λ(i). The long-term (stable) probability that neuron i is excited is the rate at which positive signals arrive at the neuron, divided by the rate at which the potential of the neuron is decreased:

    q_i = λ+(i) / (λ−(i) + r(i))    (2)

The arrival rate of positive signals at neuron i is computed as:

    λ+(i) = Λ(i) + Σ_j q_j r(j) p+(j, i)    (3)

Similarly, the rate at which negative signals arrive at neuron i is given by:

    λ−(i) = λ(i) + Σ_j q_j r(j) p−(j, i)    (4)
5.3 The Random Neural Network Solution

The network for approximating the solution to the problem of maximizing inter-partition concurrency (i.e., minimizing conflicts) with minimal PEs is constructed as follows. We use G(f) to represent the number of partitions that can execute concurrently with partition f, PE to indicate the number of PEs available, and F to denote the number of partitions to be assigned. As shown in Figure 14, there are two neurons for each possible (partition, PE) pair. A neuron R(f, p) represents the assignment of partition f to PE p, and is excited whenever f is tentatively assigned to PE p. Conversely, neuron r(f, p) is excited whenever partition f is not tentatively assigned to PE p. The objectives of the assignment problem are achieved by placing connections between neurons in ways that cause the neurons to enter a state representing an (approximately) optimal solution. There is an inhibitory (negative) connection from each R(f, p) to the corresponding r(f, p), to ensure that the contradictory state in which both R(f, p) and r(f, p) are excited does not occur when the network stabilizes. There are excitatory (positive) connections from each r(f, p) to all R(f, q), where p ≠ q. These connections encourage the eventual assignment of f to some PE. When g // f, there is an inhibitory connection from R(f, p) to R(g, p), to discourage f and g from being assigned to the same PE; the weight of such an

[Figure 14: A local view of the random neural network. Neuron R(f, p) receives external rates Λ(R(f, p)) and λ(R(f, p)), and neuron r(f, p) receives Λ(r(f, p)) and λ(r(f, p)); inhibitory (−) edges run from R(f, p) to r(f, p) and to R(g, p) with g // f; excitatory (+) edges run from r(f, p) to R(f, q) with q ≠ p, and from R(f, p) to R(h, p) with f ⊥ h.]

edge is proportional to the amount of concurrency between g and f, that is, it is proportional to C(g, f). Similarly, there is a positive connection from R(f, p) to R(h, p) when f ⊥ h, to encourage assignment of f and h to the same PE. This tends to reduce the number of PEs required while not limiting concurrency. To obtain a solution to the assignment problem, the analytical solution to the network's stable state is obtained and the node with the highest probability of being excited is selected. That node indicates the assignment of one partition to a PE. The assignment is made and the process is repeated until all partitions are assigned. The solution procedure is outlined below:

Procedure RNN:
1. Number the partitions and PEs.
2. Initialize all q_R(f,p) = 0.
3. Assign partition f to PE p, where Λ(R(f, p)) is the largest over all f and p. This is accomplished by setting q_R(f,p) = 1 and q_R(f,q) = 0 for all q ≠ p.
4. Let the rest of the network "relax": use the equations above to calculate q_R(f',p), where f' has not yet been fixed to a PE. Repeat this calculation until stable q_R values result.
5. Find the maximum q_R, say q_R(f1,p1), and make the corresponding assignment by setting q_R(f1,p1) = 1 and q_R(f1,q) = 0 for all q ≠ p1.
6. Repeat steps 4 and 5 until all partitions are assigned.
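Procedure RNN is essentially a greedy loop around a relaxation step. The sketch below is our outline (the `relax` routine, which would iterate the stable-state equations over the unfixed neurons, is passed in as a parameter rather than implemented here):

```python
def rnn_assign(partitions, pes, relax):
    """Fix one (partition, PE) pair per round: relax the network, pick the
    neuron R(f, p) with the highest excitation probability among unassigned
    partitions, and pin it (q = 1 for the chosen PE, 0 for the others)."""
    q = {(f, p): 0.0 for f in partitions for p in pes}    # step 2
    assignment = {}
    while len(assignment) < len(partitions):
        q = relax(q, assignment)                          # step 4
        f1, p1 = max((fp for fp in q if fp[0] not in assignment),
                     key=lambda fp: q[fp])                # step 5
        assignment[f1] = p1
        for p in pes:                                     # pin f1 to p1
            q[(f1, p)] = 1.0 if p == p1 else 0.0
    return assignment                                     # step 6
```

With a `relax` routine that solves equations (2)–(4) over the network of section 5.3, each round fixes the most strongly excited tentative assignment, mirroring steps 4 and 5 of the procedure.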


6 Conclusions

This paper has described an automated approach for the partitioning and assignment of object-based concurrent programs. The approach bridges the gap between application software and parallel or distributed architectures. Additionally, the paper defines techniques for computing the communication and concurrency metrics that drive partitioning and assignment, and defines a language-independent intermediate representation from which the metrics are computed. Software tools incorporating these techniques have been implemented and are described as well. The output of the processes described in this paper allows a program to be executed on a concurrent architecture. For example, the partitioning and assignment produced can be expressed in the DADS distribution specification language to produce a complete software system that can be automatically executed on a parallel or distributed computer on which the DADS run-time system has been installed. Novel aspects of this work include the consideration of tasks and objects as units of partitioning and assignment. Additionally, the concurrency metrics constitute a significant result in the area of static assessment of concurrency. Unlike other work in partitioning, we provide a partitioning tool, not just a partitioning language or a graphical tool for constructing partitions. Furthermore, this work has been motivated by a real need in the applications domain: the ability to distribute mission-critical software over concurrent hardware platforms. Future work includes additional metrics for concurrency and communication, as well as for timing, fault tolerance and reliability. Although the output of partitioning and assignment is language-independent, future work includes the exploitation of advanced features of the DADS partitioning language, as well as exploitation of the partitioning construct of the Ada-95 language.
Additionally, we will compare various partitioning and assignment techniques on several mission-critical systems. Most importantly, the tools described herein are being used in the ARPA/AEGIS-sponsored HiPer-D project, which is establishing essential capabilities for shipboard computing in the 21st century.

References

[1] Shahid H. Bokhari, "Partitioning Problems in Parallel, Pipelined, and Distributed Computing," IEEE Transactions on Computers, January 1988, 37(1), pages 48-57.
[2] F. Ercal, J. Ramanujam and P. Sadayappan, "Task Allocation onto a Hypercube by Recursive Mincut Bipartitioning," Journal of Parallel and Distributed Computing, Oct. 1990, pages 35-44.
[3] E. Gelenbe, "Random neural networks with negative and positive signals and product form solution," Neural Computation, 1(4), 1989.
[4] E. Gelenbe, "Theory of the random neural network model," Neural Networks: Advances and Applications, E. Gelenbe, editor, Elsevier Science Publishers, 1991.
[5] M. Halstead, Elements of Software Science, North Holland, 1977.
[6] D.K. Hammer, P. Lemmens, E. Luit, O. van Roosmalen, P. van der Stok and J. Verhoosel, "DEDOS: A Distributed Environment for Object-Oriented Real-Time Systems," Journal of Parallel and Distributed Technology, Winter 1994.
[7] R. Jha, J.M. Kamrad II, and D.T. Cornhill, "Ada Program Partitioning Language: A Notation for Distributing Ada Programs," IEEE Transactions on Software Engineering, March 1989, 15(3), pages 271-280.
[8] Virginia Mary Lo, "Heuristic Algorithms for Task Assignment in Distributed Systems," IEEE Transactions on Computers, November 1988, 37(11).
[9] R.S. Pressman, Software Engineering: A Practitioner's Approach, McGraw-Hill, 1992.
[10] The Rational Corporation, "Distributed Application Development System Guide," version 6.2.3, December 16, 1994.
[11] B. Ravindran, "Extracting parallelism at compile-time through dependence analysis and cloning techniques in an object-based paradigm," M.S. Thesis, New Jersey Institute of Technology, May 1994.


[12] A. L. Samuel, E. Sam, J. A. Haney, L. R. Welch, J. Lynch, T. Mot, and W. Wright, "Application of a Reengineering Methodology to Two AEGIS Weapon System Modules: A Case Study in Progress," Proceedings of The Fifth Systems Reengineering Technology Workshop, Naval Surface Warfare Center, February 1995.
[13] H.S. Stone, "Multiprocessor scheduling with the aid of network flow algorithms," IEEE Transactions on Software Engineering, Vol. SE-3, No. 1, pp. 85-93, January 1977.
[14] J.P.C. Verhoosel, E.J. Luit, D.K. Hammer and E. Jansen, "A Static Scheduling Algorithm for Distributed Hard Real-Time Systems," The Journal of Real-Time Systems, Vol. 3(3), September 1991.
[15] J. Verhoosel, "Pre-Run-Time Scheduling of Distributed Real-Time Systems: Models and Algorithms," PhD Thesis, Department of Computing Science, Eindhoven University of Technology, January 1995.
[16] J. P. C. Verhoosel, L. R. Welch, D. Hammer, and A. D. Stoyenko, "Assignment and Pre-Run-time Scheduling of Object-Based, Parallel Real-Time Processes," IEEE Symposium on Parallel and Distributed Processing, Oct. 1994.
[17] J. P. C. Verhoosel, L. R. Welch, D. K. Hammer, A. D. Stoyenko, and E. J. Luit, "A Formal Deterministic Scheduling Model for Object-Based, Hard Real-Time Executions," Journal of Real-Time Systems, 8(1), January 1995.
[18] R. A. Volz, T. N. Mudge, G. D. Buzzard, and P. Krishnan, "Translation and execution of distributed Ada programs: Is it still Ada?," IEEE Transactions on Software Engineering, vol. 15, no. 3, pp. 281-292, March 1989.
[19] L. R. Welch, A. D. Stoyenko and S. Chen, "Assignment of ADT Modules with Random Neural Networks," The Hawaii International Conference on System Sciences, pages II-546-555, Jan. 1993.
[20] L. R. Welch, "Assignment of ADT Modules to Processors," Proceedings of the International Parallel Processing Symposium, pages 72-75, March 1992.
[21] L. R. Welch, "Cloning ADT Modules to Increase Parallelism: Rationale and Techniques," Fifth IEEE Symposium on Parallel and Distributed Computing, pages 430-437, December 1993.
[22] L. R. Welch, A. D. Stoyenko, T. J. Marlowe, "Response Time Prediction for Distributed Periodic Processes Specified in CaRT-Spec," Control Engineering Practice, (in press).
[23] L. R. Welch, A. Samuel, M. Masters, R. Harrison, M. Wilson and J. Caruso, "Reengineering Complex Computer Systems for Enhanced Concurrency and Layering," Journal of Systems and Software, July 1995, (to appear).
[24] L. R. Welch, "A Parallel Virtual Machine for Programs Constructed from Abstract Data Types," IEEE Transactions on Computers, 37(11), pages 1249-1261, Nov. 1994.
[25] L. R. Welch, G. Yu, J. Verhoosel, J. A. Haney, A. Samuel, and P. Ng, "Metrics for Evaluating Concurrency in Reengineered Complex Systems," Annals of Software Engineering, 1(1), Spring 1995.
[26] J. Xu and D. L. Parnas, "Scheduling Processes with Release Times, Deadlines, Precedence, and Exclusion Relations," IEEE Transactions on Software Engineering, Vol. 16, No. 3, pp. 360-369, March 1990.
[27] G. Yu and L. R. Welch, "Program Dependence Analysis for Concurrency Exploitation in Programs Composed of Abstract Data Type Modules," IEEE Symposium on Parallel and Distributed Processing, Oct. 1994.
[28] G. Yu and L. R. Welch, "A Novel Approach to Off-line Scheduling in Real-Time Systems," Informatica, (to appear in 1995).

