Cluster Comput DOI 10.1007/s10586-007-0010-2

Predictive performance modelling of parallel component compositions

Lei Zhao · Stephen A. Jarvis

© Springer Science+Business Media, LLC 2007

Abstract Large-scale scientific computing applications frequently make use of closely-coupled distributed parallel components. The performance of such applications is therefore dependent on the component parts and their interaction at run-time. This paper describes a methodology for the predictive performance modelling and evaluation of parallel applications composed of multiple interacting components. The fundamental steps and operations involved in the modelling and evaluation process are identified, including component decomposition, component model combination, M × N communication modelling, dataflow analysis and overall performance evaluation. A case study is presented to illustrate the modelling process, and the methodology is verified through experimental analysis.

Keywords Performance modelling · Parallel component composition · M × N communication

1 Introduction

Today's large-scale scientific computing applications, such as those used in multi-physics simulations, require solving problems that span multiple scales, domains and disciplines. The resulting applications are composites of highly specialized codes, developed by large, diverse, geographically distributed and often independent research teams.

L. Zhao (✉) · S.A. Jarvis
High Performance Systems Group, Department of Computer Science, University of Warwick, Coventry, UK
e-mail: [email protected]

S.A. Jarvis
e-mail: [email protected]

As a result, there have been several efforts to provide support for the coupling of parallel applications, including PAWS [1], CUMULVS [2], Meta-Chaos [3] and MCT [4]. More advanced software architectures are also under development, such as CCA [5] and ICENI [6], in which the interaction between parallel programs is standardized and uniform component-oriented programming environments are provided. An alternative approach, which has come about through the development of Grid computing technologies, and particularly those that advocate service-oriented architectures [7], has fostered the use of workflow technologies for the composition of scientific computing programs. While the component technologies mentioned above focus on the interaction between component programs, workflow systems concentrate on the description of component task dependencies and on resource management. Workflow and component assembly are therefore complementary technologies, a combination of which will integrate high-performance scientific computing applications in both space and time. Indeed, the component frameworks research community has already begun to consider the use of workflow languages for scripting component compositions, and the use of workflow engines for the coordinated execution of component tasks; for details see [8, 9].

The focus of this paper is the predictive performance modelling of parallel programs composed of interacting and/or interdependent components. The term component is used in its most general sense: a component may be any software program that implements some algorithm and produces the expected output when given valid input; it is not specific to the component programming models adopted in the CCA and ICENI frameworks. Since components are usually developed independently and integrated into applications at a later date,


a reasonable approach is to build performance prediction models for each component and to combine these component models into a composite model when considering the application's overall performance. This approach allows heterogeneous models, developed using various performance modelling techniques, to be integrated. The modelling techniques considered most appropriate are chosen according to the type of component task and the type of platform on which the component jobs are to be run: computation-intensive tasks [10] versus I/O-intensive tasks [11, 12], or multiprocessors [13] versus heterogeneous clusters [14, 15], for example. The resulting heterogeneous models, however, need to follow a common specification in order to be combined.

The remaining sections of this paper provide: an investigation into state-of-the-art parallel program component composition technologies (Sect. 2); the development of a methodology for the predictive performance modelling of parallel component compositions (Sect. 3); experimental results from a case study that verifies the proposed component-based modelling approach (Sect. 4); and a closing discussion and conclusion (Sect. 5).

2 Parallel component composition

A component is a re-usable software object which encapsulates certain functionality and is required to interact with other components. A component has a clearly defined interface and conforms to a prescribed behavioural specification common to all components within an architecture. Multiple components may be composed to build other components.

The component interface defines the set of operations which can be invoked by component users. Function calls and message passing are the two most common styles in which the operations supported by a component are defined. For example, an object-oriented (OO) component will define its supported operations in the form of an interface containing a set of methods. The operations supported by an MPI- (Message Passing Interface) based program are constrained to the message passing routines defined by the MPI standard; this kind of interface is often implicitly specified and can only be identified from the program documentation or by inspecting the source code.

In many cases, a component needs functionality provided by other components in order to work properly. This requirement needs to be specified explicitly in the interface as "out-going" operations. The component interface therefore includes specifications both for the operations that the component provides and for the operations that it requires. Components are linked through their interfaces to build composite applications. In general, these components are either linked into a single executable unit or are distributed and interact through messaging mechanisms.
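As an illustration of provided and required operations, the following is a hedged C sketch of one possible interface representation. The types, names and binding routine are illustrative assumptions, not the paper's design or any framework's actual API.

    /* Illustrative sketch: a component interface listing both provided and
       required ("out-going") operations. All names are hypothetical. */
    #include <string.h>

    typedef void (*operation_fn)(void *args);

    typedef struct {
        const char  *name;   /* operation identifier in the interface        */
        operation_fn impl;   /* NULL for a required operation until a binding
                                supplies an implementation                    */
    } operation;

    typedef struct {
        const char *component_name;
        operation  *provides;    /* operations this component implements     */
        int         n_provides;
        operation  *requires;    /* operations it needs from other components */
        int         n_requires;
    } component_interface;

    /* Binding: satisfy each unresolved required operation of `user` with a
       provided operation of `provider` that has a matching name. */
    static int bind_components(component_interface *user,
                               const component_interface *provider) {
        int bound = 0;
        for (int i = 0; i < user->n_requires; i++)
            for (int j = 0; j < provider->n_provides; j++)
                if (user->requires[i].impl == NULL &&
                    strcmp(user->requires[i].name,
                           provider->provides[j].name) == 0) {
                    user->requires[i].impl = provider->provides[j].impl;
                    bound++;
                }
        return bound;
    }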

Fig. 1 Component connection schemes

Fig. 2 A tree-structure performance model assembly for SPMD programs

In the context of high performance computing, a component is a parallel program which will span multiple processes when executed. This is illustrated in Fig. 1, which is adapted from [16] and was originally used to contrast direct-connected frameworks and distributed frameworks in the context of CCA. However, the concept is general enough to demonstrate the differences between a scheme where all components share every single process (left graph of Fig. 1) and a scheme where components are distributed over different processes (right graph of Fig. 1). In both cases components span several processes. In the sub-sections that follow, we discuss how the different component connection schemes determine the performance analysis process of component compositions.

2.1 Component direct connection

In this scheme the components are linked together to form a single executable unit. Inter-component communication is implemented using direct function calls, with an additional (although often small) overhead required to allow inter-language calls. The inter-component function calls are usually collective (in the MPI sense) because the programs are often SPMD (Single Program Multiple Data) parallel. The composition structure can be represented as a tree, which corresponds to the application's call trace with the root being the program entry (see Fig. 2).
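As the discussion below notes, the overall prediction for such a tree is simply the sum of the per-node predictions. A minimal C sketch of that evaluation follows; the types are illustrative, not taken from the paper.

    /* Illustrative sketch: a sequential (direct-connected) composition as a
       call tree, where each node carries a component performance model. */
    #include <stddef.h>

    typedef struct tree_node {
        double (*predict)(long problem_size);  /* component performance model */
        long problem_size;                     /* input size for this call    */
        struct tree_node **children;
        size_t n_children;
    } tree_node;

    /* In the SPMD direct-connection scheme only one component executes at a
       time, so the composite prediction is the plain sum over all nodes. */
    double predict_composition(const tree_node *n) {
        double t = n->predict(n->problem_size);
        for (size_t i = 0; i < n->n_children; i++)
            t += predict_composition(n->children[i]);
        return t;
    }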


It is possible to assemble primitive components into a tree structure because, in the SPMD paradigm, the calling component is blocked until the called component returns. At any logical point in the program, only one component is executing. Therefore, this scheme is also referred to as component sequential composition. In this scheme the inter-component communication cost is usually small compared with that of the component computation, and is considered negligible when developing composite models. Every node in the tree contributes to the overall performance; when evaluating the overall performance, we can therefore simply traverse the tree and sum the performance of each node. However, if linked components use different data representations (e.g. sparse versus dense), a data structure translation has to be performed before the called component function can be invoked. Furthermore, in the case of a distributed data structure (i.e. one distributed across a group of processes, with every member process holding a portion of the data elements), the converted data structure may need to be redistributed. This results in more inter-process communication and may therefore have a significant effect on performance. In this case, we might envisage the insertion of a virtual adapter between two linked components with different data representations and/or different data distributions. This adapter performs data representation translation and data redistribution. Some programming environments provide system-specific components to help users realize this data transformation and redistribution. Without such system support, programmers must provide this implementation by hand.

In order to evaluate the performance models of components which are successively invoked, the input size must be available before the models are computed. Therefore, before we can evaluate the performance, we need to compute the size of the results that a modelled component will produce given a particular input. This means that the program's data flow needs to be analysed and evaluated. Finally, the cost involved in data representation translation and data element redistribution should be included in the overall performance evaluation.

2.2 Component distributed connection

In this scheme parallel components are independent programs. A call to functions provided by another component typically involves some kind of network communication, because components are mapped to separate processes (and often to separate processors or geographically distributed computers). A major difference between the component distributed connection and the direct connection is that more potential parallelism can be exploited by concurrently running components. This scheme is also referred to as component concurrent composition.

Component parallelism makes it impossible to assemble the components into a tree structure because of the concurrency and synchronization of multiple threads of (component) execution. An analysis of performance behaviour may therefore rely on general-purpose concurrency modelling formalisms, such as Petri Nets and Process Algebras with timed extensions for stochastic analysis, or on DAG- (Directed Acyclic Graph) based critical path analysis techniques for one-time execution analysis. We return to these evaluation techniques in Sect. 3.

Inter-component communication in a distributed environment involves the problem of transferring data from a parallel program running on M processors to another parallel program running on N processors. This is often referred to as the M × N problem, and is a generalisation of the problem of data representation translation and data element redistribution discussed in Sect. 2.1. Specifically, M × N communication involves the following operations (a simple block-distribution example is sketched after this list):
• Data structure translation: translate the data representation (e.g. column-major arrays) on the sender side into the representation (e.g. row-major arrays) used on the receiver side. Developing a standardized translation solution is a recognized problem, largely because data representations are often very application-specific. Several efforts are currently underway to investigate the representation of a selection of commonly used data structures in scientific computing, such as multi-dimensional arrays [17–19].
• Data redistribution: redistribute the data elements in a converted representation over the target processes on the receiver side. Each sender process must determine which data elements in its local memory it should send, and similarly each receiver process must determine which data elements in its local memory it expects to receive.
• Communication scheduling: compute a schedule that determines which parts of the data associated with each sender process will be sent to which process on the destination side. The communication schedule should conform to the constraints expressed in the data distribution specifications on both sides; that is, certain data of a given process may have to be sent to a specific process but not to others.
• Parallel transfer: transfer data in parallel using the available communication facilities. The communication mechanisms employed usually depend on the runtime environment and on implementation policies. For example, the sender component may have a set of designated processes that transfer all the data gathered from a local group using a high-throughput data transfer protocol (such as GridFTP [20]) over a wide area network, or the component may send data through a Unix-like pipe model (such as that implemented in [21], which is built on PVFS in a local cluster environment).
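To make the redistribution and scheduling operations concrete, the following is a minimal C sketch of the simplest M × N case: a 1-D array block-distributed over M sender and N receiver processes, where the schedule entry for each (sender, receiver) pair is the overlap of their index ranges. All names are illustrative; production libraries such as InterComm [18] handle far more general distributions.

    /* Illustrative sketch: a block-to-block M x N communication schedule
       for a 1-D array of n elements. */
    #include <stdio.h>

    /* Block owned by rank r out of p ranks for n elements: [lo, hi). */
    static void block_range(long n, int p, int r, long *lo, long *hi) {
        long base = n / p, rem = n % p;
        *lo = r * base + (r < rem ? r : rem);
        *hi = *lo + base + (r < rem ? 1 : 0);
    }

    static void print_schedule(long n, int m_senders, int n_receivers) {
        for (int s = 0; s < m_senders; s++) {
            long slo, shi;
            block_range(n, m_senders, s, &slo, &shi);
            for (int r = 0; r < n_receivers; r++) {
                long rlo, rhi;
                block_range(n, n_receivers, r, &rlo, &rhi);
                long lo = slo > rlo ? slo : rlo;  /* overlap of the blocks */
                long hi = shi < rhi ? shi : rhi;
                if (lo < hi)
                    printf("sender %d -> receiver %d: elements [%ld, %ld)\n",
                           s, r, lo, hi);
            }
        }
    }

    int main(void) { print_schedule(16, 2, 4); return 0; }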


Significant research [17–19] has been undertaken to develop advanced communication libraries that solve the M × N problem. This communication middleware encapsulates complex technical details and is invoked through an interface that makes the M × N communication as transparent as possible. Programmers use well-defined constructs to describe the data distribution on both sides, and the libraries compute the communication schedule and perform the data transfer automatically.

Consequently, the performance modelling of distributed connected components is decomposed into two aspects: intra-component computation modelling and inter-component communication modelling. The inter-component communication model should be able to compute the data transfer schedule and evaluate the parallel transfer performance. The size of the data arriving at the receiver should also be computed; this requires information about the data structure and distribution specification on both the sender side and the receiver side. Again, the cost of computing the data translation and the communication schedule should be considered, although very often this computation cost is negligible compared with the communication cost itself.

It is noted that the separation of component computation modelling and inter-component communication modelling also applies to the component direct connection scheme, in which the inter-component communication can be regarded as a special case of M × N communication where the sender and the receiver are the same group of processes and, if there is no data transformation, the communication consists of several in-memory function calls.

3 Predictive performance modelling of parallel component composition

In this section a methodology is presented for the performance prediction of programs composed of multiple components, based on the investigation outlined in Sect. 2. Over the past decade, numerous techniques and toolkits have been developed for the performance modelling and evaluation of parallel and distributed applications, with a focus on standalone programs. We consider applying these techniques in a compositional context.

We define a performance prediction model as one that predicts the execution time of an algorithmic implementation running on a given set of resources. Whatever the format adopted (e.g. a C library, a Java class or a CCA component), a software component implements one or more algorithms, operates on some data structure and produces the desired output. Its corresponding performance model operates on the input problem size, models the component's performance behaviour and predicts its execution time. Some algorithms require processes to be arranged in a specific topology (e.g. a 2-D mesh), and performance is greatly influenced by how tasks are mapped to the resources.

Program performance is also influenced by real-time resource characteristics such as CPU availability and network contention. Descriptions of problem size and process topology are clearly application dependent, and the description of real-time resource characteristics depends on the resource models being used.

The process of component composition performance modelling and evaluation is separated into three parts: component decomposition, composite model construction, and data flow analysis and model evaluation. The approach is to separate the inter-component communication activities of a component program from its intra-component computation activities. Each component's behaviour is modelled as a process consisting of two types of activities: intra-component computation and inter-component communication. The performance of each type of activity is evaluated separately, from which the overall performance of a composition can be derived. This approach is adopted because the performance of a component's internal computations does not immediately rely on interactions with the external world, although it may depend on data transferred through inter-component communications. The behaviour and performance of inter-component communication activities cannot be evaluated until a composition is made; clearly this also applies to the behaviour and performance of the composite application.

By decomposing a component into a process of internal computation activities and external communication activities, a clear view of the component's structure and behaviour can be formed, and data dependencies between computation and communication activities can be identified. A performance model is built and attached to each intra-component computation activity when a component is constructed. The problem sizes required by these computation performance prediction models can be reasoned about through inter-component dataflow analysis once compositions are decided. Finally, a complete picture of the overall performance behaviour is obtained after the performance of the communication activities is evaluated and synchronization is considered.

In order to illustrate the modelling and evaluation process concisely and yet remain as precise as possible, we base the process description on a set of sets, relations and functions. Despite its mathematical representation, this is not a rigorous mathematical formalization; the mathematical constructs are used for illustrative purposes only. Many of the elements are not formally defined since their meanings are straightforward to infer from their names.

3.1 Component decomposition

Component compositions are specified through a set of bindings which connect compatible component interfaces.


Fig. 3 Flow graphs of component compositions

The static structure of a composition can be pictured in a representation known as a flow graph (see Fig. 3), composed of components, interfaces and bindings. As discussed above, we choose to represent a component and model its behaviour as a process consisting of intra-component computation activities and inter-component communication activities, and on this basis the overall performance is evaluated.

The component interface specifies all the operations that a component is able to perform. Operations are abstracted as communication actions. This is straightforward for message-passing-style interface operations. For operations defined in the form of function or method calls, however, we model each function as two communication actions: one sending a message from the caller to the callee, carrying the function arguments; the other sending a message from the callee back to the caller, carrying the return values. We use C to denote the set of communication actions specified in the component interface, and P to denote the set of internal computation actions, which represent the computation parts separated by the communication actions in C (see Fig. 4). The term activities is used to distinguish modelling activities from actions; that is, a performance activity is attached to each action and provides a predicted duration for that action. A performance model is a function from the problem size to the time domain.

Fig. 4 Component decomposition

A specification of the problem size may include a description of the size of the data structure that the computation operates on, the number of processes or processors being used, or the tasks and their precedence relationships, all of which are application dependent. The set of performance models is denoted by F. In addition to the performance model, a function for predicting the new problem size resulting from a computation action is also included in the computation activity. The set of resulting problem size prediction functions is denoted by G. Note that neither the performance model nor the resulting problem size computation function is provided for communication activities at this stage; these functions are provided when the composite model is constructed. Therefore, as shown in Fig. 4, we use the empty set ∅ to indicate that prediction functions for communication activities are not yet available at the component decomposition stage. Finally, the process of a component is defined in terms of both sets of activities. Note that each occurrence of an activity in the process of a component makes a different contribution to the performance cost of the entire program; a component process is therefore in fact defined over instances of activities.

When we evaluate the performance of a long-running software system, which may serve many requests, we are usually more concerned with long-term performance indicators such as throughput and utilization than with the service time of a single request. The derivation of these system-average performance measures usually depends on well-established mathematical theory, namely modelling the system as a stochastic process and studying its steady-state behaviour. Well-studied formalisms such as Stochastic Petri Nets (SPN) [22] and Stochastic Process Algebras (SPA) [23] have been developed for the quantitative modelling of stochastic systems composed of multiple concurrent and interacting components.

Let
  P be a set of intra-component computation actions of a component, ranged over by p1, p2, . . .
  C be a set of inter-component communication actions of a component, ranged over by c1, c2, . . .
  F be a set of performance models, where each model is a function f : Problem_size → Duration
  G be a set of resulting problem size computation functions g : Problem_size → Problem_size

We define:
  Computation_activities : P → F × G
  Communication_activities : C → ∅ × ∅
  Activities = Computation_activities ∪ Communication_activities
  Component is a partial ordering on the set of instances of Activities
  A Binding of components A and B is a relation from C_A to C_B
  Compositions is a set of Bindings
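These definitions translate almost directly into code. The following hedged C rendering is illustrative only; in particular, representing Problem_size as a single number is a simplification, since the text notes that problem size descriptions are application dependent.

    /* Illustrative C rendering of the definitions above. A computation
       activity carries a performance model f and a resulting problem size
       function g; a communication activity's f and g are left NULL (the ∅
       above) until composite model construction supplies them. */
    #include <stddef.h>

    typedef double problem_size;  /* simplification; application-dependent */
    typedef double duration;      /* predicted execution time              */

    typedef duration     (*perf_model)(problem_size in);  /* f : size -> time */
    typedef problem_size (*size_model)(problem_size in);  /* g : size -> size */

    typedef enum { COMPUTATION, COMMUNICATION } activity_kind;

    typedef struct {
        activity_kind kind;
        perf_model f;  /* NULL for communication actions at decomposition   */
        size_model g;  /* NULL likewise; filled in when bindings are fused  */
    } activity;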


In this paper, however, we focus our performance analysis on the transient behaviour of software systems, i.e. predicting the timely execution of applications. We therefore choose to represent the process of a component as a partial ordering on activity instances.

3.2 Composite model construction

The development of composite models includes two steps: combining component process models by fusing each pair of communication actions in a binding, and deciding appropriate performance prediction and problem size computation functions for each fused inter-component communication activity (see Fig. 5). Recall that a binding between components A and B is a relation from the communication action set of A to the communication action set of B. For each pair of communication actions (c, c′) contained in that binding, we replace both c and c′ with a new action identifier c_f, so that both A and B refer to the same communication activity. Figure 6 illustrates this fusion process through a simple example. Since a component process is defined as a partial order, it can be represented as a Directed Acyclic Graph (DAG) known as a precedence graph, with nodes representing activities and arcs indicating the order of precedence in an execution.

Communication actions are attached to the communication channel through ports (see Fig. 7). In any one-way communication activity, the communication action at the sender side writes data out to an attached port, and the communication action at the receiver side reads data in from its associated port. In the case of parallel component compositions, the communication channel is an M × N channel. Ports are defined to be compatible in the M × N sense, in that data representations are not required to be identical on both sides, provided that data can be cast from one side to the other. We define a Port as a data flow interface, which decides the layout of a parallel data structure used in the related communication actions, and based on which a parallel data transfer over the M × N channel can be established.

As discussed in Sect. 2, an M × N communication involves operations such as data representation translation and data element redistribution, communication schedule computation, and parallel data transfer. Both the data layout translation and the communication schedule computation require knowledge of the data structure on both sides, as well as of how the data is spread across the processes. A specification of a distribution pattern normally requires the processes to be arranged in some topology. A Port is consequently characterized by a 3-tuple (consisting of descriptions of a global data structure, a logical process topology and a data decomposition specification), which unambiguously defines a parallel data structure layout in most cases. For example, a port (1-D array, linear process topology, block decomposition) defines an even distribution of array data over all the processes, with each process holding a contiguous portion of the array.
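A Port of this kind might be represented as follows. This is a speculative C sketch with illustrative enumerators; it is not InterComm's or any framework's actual type.

    /* Illustrative sketch of the Port 3-tuple described above: a global data
       structure, a logical process topology and a data decomposition. The
       enumerators are example values, not an exhaustive set. */
    typedef enum { ARRAY_1D, ARRAY_2D } data_structure;
    typedef enum { TOPO_LINEAR, TOPO_MESH_2D } process_topology;
    typedef enum { DECOMP_BLOCK, DECOMP_CYCLIC } decomposition;

    typedef struct {
        data_structure   structure;  /* e.g. a 1-D array                  */
        process_topology topology;   /* e.g. a linear process arrangement */
        decomposition    decomp;     /* e.g. block decomposition          */
        int              n_procs;    /* number of processes on this side  */
    } port;

    /* (1-D array, linear topology, block decomposition): an even distribution
       in which each process holds one contiguous array portion. */
    const port splitter_port = { ARRAY_1D, TOPO_LINEAR, DECOMP_BLOCK, 4 };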

ALGORITHM 1 Constructing the Composite Model from a set of Bindings
procedure model_construction (L: Bindings)
for each pair (c, c′) in L
begin
  new c_f
  c → c_f, c′ → c_f
  (Port_c, Port_c′) → translation function g
    /** g is a function determining the problem size at the receiver side **/
  Port_c → data_distribution_of_c
  Port_c′ → data_distribution_of_c′
  (data_distribution_of_c, data_distribution_of_c′) → M × N_schedule
  M × N_schedule → performance prediction function f
    /** f is the performance prediction model for the M × N communication activity involving c and c′ **/
  associate f and g to c_f
end
Fig. 5 Composite model construction

Fig. 6 Component process model fusion

Fig. 7 Port and communication channel

Given the port specifications on both sides, we can determine a problem size computation function g for the receiver. The function g is in effect a translation function, translating the problem size specification at the sender side into a problem size specification suitable for use at the receiver side. With the data layout obtained from the port specifications, an M × N communication schedule is computed, which describes how the data will be transferred.


Fig. 8 Control flow structures

A performance model f for the parallel transfer can accordingly be constructed from the schedule. Finally, we attach the functions f and g to the communication action c_f, created by fusing c and c′, so as to form a complete communication activity.

After all the communication action pairs defined by the bindings have been fused, and the corresponding performance prediction and receiver-side problem size computation functions have been determined, the development of a composite model is complete if no control flow structures such as loops and choices need to be considered. The resulting composite model without control flow structures is also a partial ordering on the union of the sets of component activity instances, and hence can also be represented by a DAG (as shown in Fig. 6). Furthermore, a sub-DAG can be introduced as a node into another DAG in order to form a hierarchy, which means that composite models can be further combined and reused.

There are two cases to consider when extending the composite model to include control flow structures. A control flow structure either contains only intra-component computation activities, or it also contains inter-component communication activities. In the first case, the control flow structure resides within a single component, is managed during component decomposition, and should be hidden from the component model composition. In the second case, the control flow structure spans all the components participating in the inter-component communication activities and has to be considered during composite model construction. In either case, however, similar techniques are used to handle choices and loops.

A common approach to dealing with a choice in performance prediction is to associate a branching probability with each branch of the choice. A weighted sum can then be used to represent the execution time of the choice block, if deterministic values are used to characterise execution time (see Fig. 8 (1)). The output size of the choice block is computed similarly. If the performance evaluation is based on stochastic analysis, however, such as Markovian models, the branching probabilities are used to compute state transition rates and derive steady-state distributions; see [22, 23] for further details. In such stochastic-process-based analysis techniques, a loop is treated in the same way as a choice construct, which simplifies the analysis.
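Both the choice rule above and the deterministic loop rules detailed in the next paragraph reduce to a few lines of code. A minimal C sketch, assuming deterministic time values (all names are illustrative):

    /* Illustrative sketch of the deterministic control-flow rules: a choice
       block's time is the probability-weighted sum of its branches
       (Fig. 8 (1)); a loop with an iteration-independent body is body time
       times the iteration count (Fig. 8 (2)); otherwise per-iteration
       predictions are summed (Fig. 8 (3)). */
    double choice_time(const double *branch_time, const double *prob, int n) {
        double t = 0.0;
        for (int i = 0; i < n; i++)
            t += prob[i] * branch_time[i];  /* weighted sum over branches  */
        return t;
    }

    double loop_time_uniform(double body_time, long iterations) {
        return body_time * iterations;      /* iteration-independent body  */
    }

    double loop_time_varying(double (*body_time)(long iter), long iterations) {
        double t = 0.0;
        for (long i = 0; i < iterations; i++)
            t += body_time(i);              /* sum per-iteration predictions */
        return t;
    }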

In situations where deterministic models are used, a widely adopted approach to handling a loop is to estimate the number of iterations of that loop. If the performance of the activities in the loop body does not depend on the specific iteration (many scientific computing applications do have this property in practice), the execution time of the loop block can be computed simply by multiplying the execution time of the loop body by the number of iterations (see Fig. 8 (2)). If the loop body performance varies from iteration to iteration, we must either use an average to approximate the loop body performance, or predict the performance of each iteration and sum them (see Fig. 8 (3)). We note that the second approach is quite inefficient, and an application-specific aggregate function is normally developed as a viable alternative, where this is possible. Finally, the output size from the last iteration is used as the output size of the loop block.

3.3 Data flow analysis and model evaluation

This section is concerned with composite model evaluation. Because a composite model includes many small primitive models, the problem size for these models needs to be computed before model evaluation. Problem size determination is achieved by forward data flow analysis (see Fig. 9). Provided with problem size specifications for all the component participants in a composition, the problem sizes for all the activities involved in the composite model are subsequently computed along the data dependency path (see Fig. 10 for an example). Note that in practice a data stream could be split, switched or multicast along the data flow path, and the input size for each activity (i.e. in_n) should be adjusted correspondingly. However, for simplicity and clarity, complex "split-point" operations are omitted from Fig. 9.

The primitive models attached to each activity are evaluated once the problem size is available. It is not difficult to assemble these prediction results in the precedence graph. A simple algorithm computes the overall performance by simulating the application's execution (see Fig. 11). In each step, the activity with the least remaining running time (weight) is chosen. Its time is added to the overall performance and subtracted from the remaining execution time of all the other activities running in parallel. New activities are added to the running list when they become available for execution, the completed activity nodes having been removed. The computation repeats until all the activities have been considered, thereby tracing the execution of the application. More advanced algorithms, such as those for critical path analysis, can be used for finding critical activities, which is particularly useful for bottleneck identification.
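A hedged C sketch of this simulation (formalized below as ALGORITHM 2, Fig. 11) follows; it uses a linear scan in place of the sorted list and fixed-size arrays for brevity, so the names and bounds are illustrative only.

    /* Illustrative sketch: evaluate a composite model's precedence graph by
       simulating execution, repeatedly completing the ready activity with
       the least remaining time and releasing its successors. */
    #define MAX_NODES 64

    typedef struct {
        double weight;                /* predicted duration of this activity */
        int indegree;                 /* unfinished predecessors             */
        int succ[MAX_NODES], n_succ;  /* successor indices in the DAG        */
        int running;
    } node;

    double evaluate(node g[], int n) {
        double t_exec = 0.0;
        for (int i = 0; i < n; i++)
            if (g[i].indegree == 0) g[i].running = 1;
        for (;;) {
            int v = -1;                              /* least remaining time */
            for (int i = 0; i < n; i++)
                if (g[i].running && (v < 0 || g[i].weight < g[v].weight))
                    v = i;
            if (v < 0) break;                        /* nothing left to run  */
            double step = g[v].weight;
            t_exec += step;
            for (int i = 0; i < n; i++)              /* concurrent activities
                                                        advance by the same
                                                        amount of time       */
                if (g[i].running && i != v) g[i].weight -= step;
            g[v].running = 0;
            for (int j = 0; j < g[v].n_succ; j++) {  /* release successors   */
                node *w = &g[g[v].succ[j]];
                if (--w->indegree == 0) w->running = 1;
            }
        }
        return t_exec;
    }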


Let
  A be a set of activities of a composite model
  I be the input problem size to the application
For each activity n ∈ A, let in_n be the input size for n, out_n be the output size for n, and f_n be the problem size transfer function for n. Thus we have:
  ∀n: out_n = f_n(in_n)
  ∀n ≠ n_0: in_n = ∨ { out_m | m ∈ pred(n) }
  in_{n_0} = I
Fig. 9 Forward data flow analysis

Fig. 10 Data flow example

ALGORITHM 2 Evaluating the Overall Performance
procedure model_evaluation (G: precedence graph)
  t_exec = 0
  L = ∅ /** L is a list sorted in increasing order of weight **/
  for all nodes v in G with v.indegree = 0
    insert v in L
  while L is not empty
  begin
    remove v from the head of L
    t_exec = t_exec + weight(v)
    for all elements e in L
      weight(e) = weight(e) − weight(v)
    remove v from G
    for all nodes w in G with w.indegree = 0
      insert w in L
  end
Fig. 11 Overall performance evaluation

The time and space complexity of this component-oriented performance prediction method is determined by the number of component activities in the final composite model (which decides the number of component models that need to be evaluated) and by the time and space complexity of each component model computation. The evaluation of a composite model consists of two phases: first, the input problem size for each component model is computed; then each component model is evaluated and the prediction results are combined. Suppose the number of component models of a composite model x is n and the input size to the application that x models is m. The problem size for each component activity i is then a function g_i of m, i.e. g_i(m), which is the input parameter for the component model computation f_i. In each round of the composite model evaluation, one function is computed at each step (g_i in the first round and f_i in the second round). If C(g_i) and C(f_i) represent the cost of computing g_i and f_i respectively, the total number of steps needed to evaluate the composite model is Σ_{i=1}^{n} C(g_i) + Σ_{i=1}^{n} C(f_i). The time complexity is therefore O(n · (C(f) + C(g))) and the space complexity is determined by the maximum space required by the f_i and g_i.

4 An illustrative case study

Fig. 12 A case study—splitter-sorter

An illustrative case study is presented for constructing and evaluating a composite performance model for a splitter-sorter parallel application (Fig. 12). The methodology presented in Sect. 3 is a "macro" prediction approach, which focuses on integration and depends on the availability of component models. The splitter-sorter application represents a typical parallel RPC scenario, one of the more common constructs used in today's scientific computing applications. The simplicity of this combined program helps to demonstrate the modelling process following the steps outlined in Sect. 3. Previous research at Warwick has developed performance models for merge-sort algorithms with a high degree of predictive accuracy (>90%, see [24]), which provides a good foundation for this study and demonstrates that previously developed models can be reused and integrated.

Splitter and sorter are two parallel programs written in C which use PVM [25] for component-wide communication. The splitter divides an array of data over the sorter component processes and gathers the sorted data after the component sorters finish. The sorter implements a parallel version of the sorting network [26]. For efficiency, each process of the sorter first sorts the data in local memory using the qsort routine provided by the standard C library, and then exchanges data with other processes following the merge-sort algorithm of the sorting network (see Fig. 13). Each data exchange is based on the compare-split operation: processes with lower rank keep the smaller part while higher-rank processes keep the larger part. When sorting finishes, the arrays in the processes' local memories (from P_0 to P_{N−1}) constitute a globally sorted array.
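The compare-split operation at the heart of each exchange can be sketched in C as follows. This is an illustrative reconstruction (the paper does not give the sorter's source), with the PVM data exchange itself omitted.

    /* Illustrative compare-split: `mine` and `theirs` are two sorted arrays
       of length n. After the call, `mine` holds the n smallest elements of
       their union when keep_small is non-zero, or the n largest otherwise,
       still sorted. */
    #include <stdlib.h>
    #include <string.h>

    void compare_split(int *mine, const int *theirs, int n, int keep_small) {
        int *merged = malloc(2 * n * sizeof *merged);
        int i = 0, j = 0, k = 0;
        while (i < n && j < n)                    /* standard two-way merge */
            merged[k++] = (mine[i] <= theirs[j]) ? mine[i++] : theirs[j++];
        while (i < n) merged[k++] = mine[i++];
        while (j < n) merged[k++] = theirs[j++];
        memcpy(mine, keep_small ? merged : merged + n, n * sizeof *mine);
        free(merged);
    }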


Fig. 13 The composite model for the splitter-sorter program

Inter-component communication is implemented using the InterComm library [18]. InterComm is a runtime library that achieves direct data transfers between data structures managed by multiple data parallel languages and libraries in different programs. The latest version of InterComm (1.1) supports components written in C, Fortran77, C++ and Fortran90. Supported data parallel libraries include Chaos, MultiBlockParti and P++. Currently InterComm supports only multi-dimensional array data structures, and it classifies data distribution patterns into two types: a block decomposition, where entire blocks of an array are assigned to tasks, and a translation table, where individual elements of an array are assigned independently to particular tasks. In the former case, the data distribution descriptor is relatively small and can be replicated on each of the tasks. In the latter case, the translation table is large and must be partitioned across tasks.

In the splitter-sorter application, both the splitter and the sorter partition an array evenly across the participating tasks, i.e. a block decomposition. The splitter communicates with the sorter through the same type of Port (a 1-D array, linear process topology, block decomposition). No additional data representation translation is needed, and the problem size of the sorter equals that of the splitter.

There are two communication actions in the splitter-sorter interaction: unsorted data sent from the splitter to the sorter, and sorted data sent back from the sorter to the splitter, both involving data redistribution and parallel transfer. Since the data are evenly partitioned on both sides, the data decomposition is simple: array_size/process_number elements for each process. This application is intended to run on a network of homogeneous workstations, where the way in which processes are mapped to processors causes no perceptible difference in performance. Suppose the array is of length Size, the splitter has M tasks and the sorter has N tasks. For the splitter-to-sorter communication, each splitter process needs to send data of size Size/N to N/M sorter processes. Since InterComm is built on PVM, and in PVM message sending is asynchronous, at each process the sending of the N/M messages is effectively concurrent. Therefore, the splitter-to-sorter communication involves N concurrent message sends, where each message is of size Size/N. The sorter-to-splitter communication is similar.
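A minimal cost model for this transfer might look as follows, assuming a simple latency-bandwidth model in which the N concurrent sends do not contend (an assumption Sect. 4 shows tends to underestimate contention). The parameter names and values are illustrative, not the paper's calibrated model.

    /* Illustrative sketch: splitter-to-sorter transfer modelled as N
       concurrent messages of Size/N elements each. Because the sends are
       concurrent and assumed contention-free, the predicted cost is that of
       a single message, not N of them. */
    double transfer_time(long size_elems, int n_sorters,
                         double latency_s, double bandwidth_elems_per_s) {
        double msg_elems = (double)size_elems / n_sorters;
        return latency_s + msg_elems / bandwidth_elems_per_s;
    }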

Performance models are built for the computation parts of the sorter, namely the local qsort and the global merge-sort, using the PACE toolkit [10], a toolset for the performance prediction of parallel and distributed systems. The PACE models capture the inter-process communication structure, count the operation frequency of each process computation, and evaluate the performance of computation and communication by combining the application model with CPU and network benchmarks. Since the PACE model simply counts the number of primitive operations in relation to the array size and processor number, its computation is quite efficient (time complexity O(1)) when the array size and process number are known. In these experiments, the evaluation takes only a few milliseconds, while the application runs from seconds to minutes as the array size increases. An asymptotic representation of the PACE model for the sorter is:

  T_qsort(Size/N) + Σ_{i=1}^{log N} Σ_{j=1}^{i} T_compare_split(Size/N),

which equals T_qsort(Size/N) + ((1 + log N) · log N / 2) · T_compare_split(Size/N), where Size is the array size and N is the number of sorter processes.

As shown in Fig. 13, the composite model is a sequential composition of four activities: split, qsort, merge-sort and gather. Split and gather are the M × N inter-component communication activities, and qsort and merge-sort are internal computation activities of the component sorter. The overall performance is therefore T_split + T_qsort + T_merge-sort + T_gather, where T_split and T_gather are equal.

The application performance was benchmarked on a dedicated network of homogeneous workstations, consisting of 24 nodes connected using 100 Mb Ethernet. Each node consists of an Intel 2.4 GHz processor and 512 MB RAM. The OS is Linux 2.4.21-20.EL with gcc 3.2.3, pvm 3.4 and InterComm 1.1. Example benchmark results and predictions are shown in Fig. 14. The application was tested for eight different configurations: the splitter was allocated 2 and 4 processors, and the sorter was allocated 2, 4, 8 and 16 processors, respectively. For each configuration, benchmarks were collected with the input array size varying from 64 k to 64 M, and the benchmarking results were compared with the model predictions. It is noted that the prediction error is smaller with a smaller experimental configuration; as the number of processors employed increases, the error also slowly increases. For example, the average prediction error for a combination of 4 splitter processes and 16 sorter processes increased by 10% compared with a configuration of 2 splitter processes and 4 sorter processes.
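Putting the pieces together, a hedged C sketch of the case-study composite model follows. The sub-model coefficients are invented placeholders standing in for the benchmarked PACE and communication models; only the combining structure mirrors the text.

    /* Illustrative sketch of the composite model: T_split + T_qsort +
       T_merge-sort + T_gather, with the merge-sort term expanded using the
       asymptotic PACE form given above and T_gather = T_split. */
    #include <math.h>
    #include <stdio.h>

    /* Placeholder sub-models; the coefficients are invented, standing in
       for the benchmarked PACE models and the ping-pong-derived
       communication model. */
    static double T_qsort(double n)         { return 1.0e-8 * n * log2(n + 1.0); }
    static double T_compare_split(double n) { return 2.0e-8 * n; }
    static double T_split(long size, int n_sorters) {
        /* N concurrent sends of Size/N elements each */
        return 1.0e-4 + ((double)size / n_sorters) * 1.0e-8;
    }

    static double predict_splitter_sorter(long size, int n_sorters) {
        double local = (double)size / n_sorters;  /* elements per sorter   */
        double lg = log2((double)n_sorters);      /* merge-sort stages     */
        double t_mergesort = (1.0 + lg) * lg / 2.0 * T_compare_split(local);
        return 2.0 * T_split(size, n_sorters) + T_qsort(local) + t_mergesort;
    }

    int main(void) {
        /* e.g. 64 k elements across 4 sorter processes */
        printf("predicted time: %g s\n", predict_splitter_sorter(65536L, 4));
        return 0;
    }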


Fig. 14 Benchmarks and predictions of the splitter-sorter program with different configurations

It is also noted that the predicted execution time is normally smaller than the actual execution time. Careful inspection reveals that the prediction error is mainly attributable to the communication model used. The same communication model is used both in the sorter's merge-sort performance model and in the splitter-sorter communication performance evaluation. The communication model was developed by benchmarking ping-pong times between an arbitrary pair of nodes in the cluster. However, network contention can easily be caused by the communication-intensive behaviour of the sorting algorithm. During each merge-sorting phase, every two processes are paired for local array exchange, which involves N (the number of sorter processes) concurrent message transfers. The number of concurrent communications grows linearly with the number of sorter processes, as does the probability of network congestion. Simple communication models such as this clearly underestimate communication contention. Inter-component communication therefore increases the prediction inaccuracy slightly in this case; in all, an additional 8% prediction error is attributed to component composition. A communication contention model could be introduced in order to improve the prediction accuracy for both the sorter component and the composite program. These results are, however, well within acceptable bounds.

5 Discussion and conclusion

This paper describes a methodology for the predictive performance modelling and evaluation of parallel applications composed of multiple interacting components. The necessary steps and operations (i.e. component decomposition, M × N modelling, component model combination, dataflow analysis, and composite model evaluation) involved in the modelling and evaluation process are identified.

Further research is still needed, however, to define formal interface and behavioural specifications for compositional prediction models; only by unifying the ad hoc interfaces between individual models can the process of model integration and evaluation be fully automated.

The advantage of this component-based prediction method is its engineering benefit. With this method it is possible to build a library of performance models alongside the component library. When components are assembled, their models can be retrieved and combined on-the-fly in order to evaluate the performance of the final application, tune important parameters, or optimize the composition. However, this method requires component models to be amenable to composition, and is more appropriate in situations where deterministic values are chosen for time characterization. Stochastic models are generally more difficult to combine and normally require significant computational effort. This methodology is also aimed at predicting one-off application performance, and is not suitable for characterizing the average performance behaviour of long-running server systems.

Many parallel applications execute over heterogeneous computing platforms. In these environments there are two additional considerations for parallel component composition. The first is where components run on disparate homogeneous platforms. In this case, each component is modelled separately, with different resource models and possibly with different techniques, and the separate evaluation results of these component models are combined to derive the composite performance. The second is where a component itself runs on a heterogeneous platform. In this case, one approach might be to decompose the component into smaller units until each sub-component can be treated as running on a homogeneous platform. Another approach might be to introduce layers into the models, abstracting program and platform details, so that heterogeneous resources are characterized using the same set of parameters and metrics.


We can consider this latter approach as using a virtual resource model to cover resource heterogeneity; the research in [14, 24] has demonstrated concrete examples based on this approach.

The availability of large-scale computational resources for scientific research is becoming commonplace, and there is a growing demand for these sophisticated architectures to support increasingly complex scientific applications. Efficient and effective resource utilisation is therefore attracting a significant amount of research effort. Techniques for evaluating compositional performance models are likely to form the basis for large-scale distributed application scheduling, planning and reservation.

Acknowledgements This work is sponsored in part by funding from the EPSRC e-Science Core Programme (contract no. GR/S03058/01), the NASA AMES Research Centre (administered by USARDSG, contract no. N68171-01-C-9012) and the EPSRC (contract no. GR/R47424/01).

References

1. Beckman, P., Fasel, P., Humphrey, W., Mniszewski, S.: Efficient coupling of parallel applications using PAWS. In: Proceedings of the 7th IEEE International Symposium on High Performance Distributed Computing, July 1998
2. Geist, G.A., Kohl, J.A., Papadopoulos, P.M.: CUMULVS: Providing fault-tolerance, visualization and steering of parallel applications. Int. J. High Perform. Comput. Appl. 11(3), 224–236 (1997)
3. Edjlali, G., Sussman, A., Saltz, J.: Interoperability of data parallel runtime libraries. In: Proceedings of the 11th International Parallel Processing Symposium, IEEE Computer Society Press, Washington (1997)
4. Larson, J.W., Jacob, R., Foster, I., Guo, J.: The model coupling toolkit. In: Proceedings of the International Conference on Computational Science, 2001
5. Common Component Architecture (CCA) Forum, http://www.cca-forum.org/
6. Furmento, N., Mayer, A., McGough, S., Newhouse, S., Darlington, J.: A component framework for HPC applications. In: 7th International Euro-Par Conference, LNCS 2150, August 2001, pp. 540–548
7. Foster, I., Kesselman, C., Nick, J., Tuecke, S.: The physiology of the grid: an open grid services architecture for distributed systems integration. Open Grid Service Infrastructure WG, Global Grid Forum, June 2002
8. Govindaraju, M., Krishnan, S., Chiu, K., Slominski, A., Gannon, D., Bramley, R.: Merging the CCA component model with the OGSI framework. In: Proceedings of CCGrid2003, 3rd International Symposium on Cluster Computing and the Grid, May 2003
9. Mayer, A., McGough, S., Furmento, N., Lee, W., Newhouse, S., Darlington, J.: ICENI dataflow and workflow: Composition and scheduling in space and time. In: UK e-Science All Hands Meeting, Nottingham, UK, September 2003
10. Nudd, G., Kerbyson, D., Papaefstathiou, E., Perry, S., Harper, J., Wilcox, D.: PACE: a toolset for the performance prediction of parallel and distributed systems. Int. J. High Perform. Comput. Appl. 14(3), 228–251 (2000)
11. Qin, X., Jiang, H., Zhu, Y., Swanson, D.R.: Towards load balancing support for I/O-intensive parallel jobs in a cluster of workstations. In: Proceedings of the 5th IEEE International Conference on Cluster Computing (Cluster 2003), December 2003, pp. 100–107
12. Rosti, E., Serazzi, G., Smirni, E., Squillante, M.S.: Models of parallel applications with large computation and I/O requirements. IEEE Trans. Softw. Eng. 28(3), 286–307 (2002)
13. Adve, V.S., Vernon, M.K.: Parallel program performance prediction using deterministic task graph analysis. ACM Trans. Comput. Syst. 22(1), 94–136 (2004)
14. Yan, Y., Zhang, X., Song, Y.: An effective and practical performance prediction model for parallel computing on non-dedicated heterogeneous NOW. J. Parallel Distrib. Comput. 38(1), 63–80 (1996)
15. Qin, X., Jiang, H., Zhu, Y., Swanson, D.R.: Dynamic load balancing for I/O-intensive tasks on heterogeneous clusters. In: Proceedings of the 10th International Conference on High Performance Computing (HiPC 2003), December 2003, pp. 300–309
16. Bertrand, F., Bramley, R.: DCA: a distributed CCA framework based on MPI. In: Proceedings of the 9th International Workshop on High-Level Parallel Programming Models and Supportive Environments, April 2004
17. Keahey, K., Fasel, P., Mniszewski, S.: PAWS: Collective interactions and data transfers. In: Proceedings of the 10th IEEE High Performance Distributed Computing, August 2001
18. Lee, J., Sussman, A.: Efficient communication between parallel programs with InterComm. Technical Report CS-TR-4557 and UMIACS-TR-2004-04, University of Maryland, Department of Computer Science and UMIACS, January 2004
19. Damevski, K.: Parallel RMI and M-by-N data redistribution using an IDL compiler. Master's Thesis, The University of Utah, May 2003
20. GridFTP Protocol Specification, Global Grid Forum Recommendation GFD.20, March 2003, http://www.globus.org/research/papers/GFD-R.0201.pdf
21. Bertrand, F., Yuan, Y., Chiu, K., Bramley, R.: An approach to parallel M × N communication. In: Proceedings of the Los Alamos Computer Science Institute Symposium, October 2003
22. Marsan, M.A., Conte, G., Balbo, G.: A class of generalised stochastic Petri nets for the performance evaluation of multiprocessor systems. ACM Trans. Comput. Syst. 2(2), 93–122 (1984)
23. Hillston, J.: A Compositional Approach to Performance Modelling. Cambridge University Press, New York (1996)
24. Papaefstathiou, E., Kerbyson, D.J., Nudd, G.R., Atherton, T.J., Harper, J.S.: An introduction to the layered characterisation for high performance systems. Research Report RR335, Department of Computer Science, University of Warwick, December 1997
25. Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R., Sunderam, V.: PVM: Parallel Virtual Machine: A User's Guide and Tutorial for Networked Parallel Computing. Scientific and Engineering Computation Series. MIT Press, Cambridge (1994)
26. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn. MIT Press, Cambridge (2001)

Lei Zhao is a PhD student and research associate in the High Performance Systems Group in the Department of Computer Science at the University of Warwick. He received his Master's degree with distinction from Imperial College, London, UK, and his first degree from the University of Posts and Telecommunications, Beijing, China. He was the chief researcher and engineer on a number of large-scale high-end distributed systems. His areas of interest are distributed computing, grid computing, Web services, biomedical computing, performance engineering and resource management. He is the primary author of a number of conference and journal papers.


Dr Stephen Jarvis is a Senior Lecturer in the High Performance Systems Group at the University of Warwick. He has authored over 100 refereed publications (including three books) in the area of software and performance evaluation. While previously at the Oxford University Computing Laboratory he worked on the development of performance tools with Oxford Parallel, Sychron Ltd and Microsoft Research in Cambridge. He has close research ties with IBM, including current projects with IBM's T.J. Watson Research Center in New York and with IBM Hursley Park in the UK. He is also principal investigator on the recently awarded EPSRC Fundamentals of Computer Science for e-Science project Dynamic Operating Policies for Commercial Hosting Environments, which draws on research collaboration with IBM, BT at Martlesham Heath, Hewlett Packard Research Laboratories in Bristol and Palo Alto, the National Business to Business Centre and the University of Newcastle. Dr Jarvis sits on a number of international programme committees for high-performance computing and enterprise systems. He is co-organiser of one of the UK's High End Scientific Computing Training Centres, Manager of the Midlands e-Science Technical Forum on Grid Technologies, and an elected member of the EPSRC Review College; he holds a William Penney Fellowship to support research on computational performance analysis for the UK Atomic Weapons Establishment.