Tuning Parallel Programs with Computational Steering and Controlled Execution

Michael Oberhuber, Sabine Rathmayer, Arndt Bode
LRR-TUM, Technische Universität München
e-mail: {oberhube, maiers, [email protected]

Copyright 1998 IEEE. Published in the Proceedings of the Hawai'i Int. Conference On System Sciences, January 6-9, 1998, Kona, Hawaii.

Abstract

On-line visualization and computational steering of parallel scientific applications have been widely recognized as the key to better insight into and understanding of the observed simulation. From the parallel program developer's point of view, further problems arise and need to be solved. The behavior and performance of parallel programs depend not only on the input data but also on inter-process communication. To reflect this fact we propose a novel combination of on-line visualization and computational steering of parallel High Performance Computing applications with controlled deterministic execution. Both the visualization and the classical part of steering are based on the VIPER tool. For the control of the communication we rely on a tool called codex, which was developed to test and control communication by the use of control patterns. Together, VIPER and codex form an environment for tuning, steering and testing based on VIPER's extended programming model.
1. Introduction

Parallel computing has become a widely used technology during the past few years. Parallelizing existing applications or developing new parallel applications, even though still difficult, has become a profession among programmers. The interest in parallel computing has therefore shifted towards actually working with parallel programs and optimizing them. On-line visualization and computational steering have been well recognized as a key to interacting with long-running and resource-intensive applications in many ways. The scope of computational steering varies considerably. On the application level it means that the user can observe his computation during runtime and, for example, change parameters of his program in order to do experiments. By doing so,
he can enormously reduce the amount of time spent in the so-called application cycle. This application cycle usually consists of a pre-processing phase in which the user, for example, generates a partitioned grid. Very often the data then has to be transferred to a parallel computer which might reside somewhere far away in the WAN (Wide Area Network). And, after waiting for the parallel computation to finish, the data must be collected again and somehow post-processed before it can finally be visualized. Applications range from Computational Fluid Dynamics to Finite Element Analysis, medicine, physics, and many more.
But computational steering also covers other important aspects within the field of parallel computing. Performance optimization is one example: through the detection of load imbalances at run-time the user can, for instance, trigger redistributions of the workload, given that the application provides the corresponding features. Detecting errors in parallel computations is another aspect. Errors ranging from inappropriate communication to algorithmic errors made during program development can be detected and handled or removed. The exploitation of nondeterminism and the manipulation of communication, i.e. the modification of access orders, offers new advantages for the development of parallel programs. To reflect this fact we propose a novel combination of on-line visualization, computational steering of parallel High Performance Computing applications and controlled deterministic execution. The visualization and the classical steering are based on the VIPER tool [9]. For the control of the communication we rely on a tool called codex [3], which was developed to test and control communication by the use of control patterns.
After giving a brief overview of related research in the area, we introduce both VIPER and codex and then propose new methods for steering highly interactive parallel programs.
2. Related work
There has been quite a lot of research during the last couple of years in the field of on-line visualization and computational steering. Some work concentrates more on the application aspect of the problem, whereas others have their main interest in performance optimization. The tools and environments which do application-oriented visualization and steering are usually very problem specific. Therefore, efforts towards more application-independent tools are necessary. Here we want to refer to the CUMULVS environment, which is based on PVM [6]. For a more detailed overview of work on computational steering, please see the annotated bibliography by Jeffrey Vetter [11], which gives a nice overview of past and current research in the field. So far, however, we are not aware of any work that applies techniques of controlled or deterministic execution in the area of program steering; they are only used for debugging and testing parallel programs. The basics of this idea stem from the experience that the approach of recording and replaying executions [7] is insufficient for testing and debugging nondeterministic parallel programs. There are two major contributions: Carver/Tai's deterministic execution and Damodaran-Kamal's controlled execution. Carver and Tai [1] propose the use of so-called SYN-sequences, which are in fact a trace of the interprocess communication. The sequences may be manipulated, either manually or automatically, and the execution can be replayed up to the point of manipulation. This method is not suitable for interactive manipulation of the communication, however, but rather for the automatic generation of test runs [10]. Controlled execution, the contribution by Damodaran-Kamal [2], combines on-line race detection with interactive selection of the messages to be received. The selection must be specified manually, it is only valid for one communication operation, and the approach can only be used for one selected process. Finally, we have to state that there is as yet no proper solution for integrating these techniques into a steering environment.
3. On-line visualization and interactive steering with VIPER

VIPER stands for on-line VIsualization of massively Parallel simulation algorithms for Extended Research and is described in more detail in [9]. It was initially developed for on-line visualization and interactive program steering of parallel CFD (Computational Fluid Dynamics) programs. A basic feature, besides visualization and steering, is the control of the amount of data transferred during the program run, as well as the analysis and evaluation of scientific data. Figure 1 shows an overview of the VIPER system.
Figure 1. The VIPER system: GUI (Java), object base with computational objects, VIPER server, and parallel application processes (pap 1 ... pap n).

So-called computational objects are built from distributed and replicated data structures of the parallel simulation algorithm, e.g. a CFD code. The data structures are based on parameters of the mathematical model and parameters of the numerical methods. These objects are defined and computed in the parallel application processes. They are passed through a pipeline architecture within the VIPER system. This pipeline consists of a computational unit, i.e. the parallel application processes, a connecting unit, i.e. the VIPER server, the VIPER object base, and a simplified visualization unit. The central component is the VIPER server, which extracts the data from the parallel application processes (pap) and passes it over the LAN or WAN to the object base. Both the graphical user interface with the object base and the parallel simulation program request different services from the server, and they run independently of each other. The parallel application processes only pass the data to the server, not knowing what happens with the data afterwards. The server can pass the data to the object base or write it to trace files. At the same time the graphical user interface only requests data from the server; the data can come either from trace files or from the object base. The user interacts with the different modules of the VIPER system. Interaction with the parallel application processes happens through program steering via the graphical user interface. To steer the simulation process, a modification operator is applied to parameters of the mathematical
model or parameters of the numerical methods. Interaction with the server is performed by controlling the amount of data which has to be extracted, transferred, and handed to the object base. Here, the relevant parameters are switched on or off, meaning that the user can specify by selection in the graphical user interface which objects he wants to observe. The objects are then transferred accordingly. The selection may also be changed by the user during the running simulation. A powerful key to interactive visualization is VIPER's object base. It holds the data from the distributed and replicated data structures of the parallel simulation algorithm after receiving it from the server and offers an interface to which any existing visualization system can connect. Also, visualization and filter operators can be applied to each object of the object base before it is transferred to a visualization system. The VIPER system is designed so that all components keep an independent character. Synchronization of the parallel application with the visualization system may be intended for observation purposes but can also be kept loose; different levels of synchronization can be specified in the graphical user interface. The parameters of the application program (e.g. scalar, matrix, physics, grid) are declared and built via source code instrumentation to become the VIPER computational objects. Each object is associated with so-called interaction points (IPs). Just like the objects, they have to be defined and inserted via source code instrumentation. IPs specify locations in the program where certain objects, namely the ones associated with the IP, have values that are relevant for the on-line visualization. These interaction points are executed during the program run. Executing an interaction point in a parallel application process means sending a signal (request) to the VIPER server. Depending on which objects are requested by the user, i.e. switched on, the objects are either extracted by the VIPER server (if running on an Intel Paragon) or sent to the VIPER server by the application (if running in a workstation cluster). The first version of VIPER on the Intel Paragon uses a special Mach feature, namely the possibility to read from or write to a different process' address space. Here the objects can be extracted by the server without any further action of the application processes. In a network of workstations without this feature, the server sends back which data objects are requested by the user and the application processes send the data accordingly. Objects are built from distributed and replicated data structures of the parallel application. In the VIPER object base the distributed objects are collected and sorted in lists. For every single object it is possible to store a certain number of incarnations. This way the user can switch back and forth within a so-called time-line and also view
several incarnations of one object at the same time. If a specific incarnation needs to be kept for later use, it can be stored in external files. In order to achieve good performance for the whole VIPER system, parallelism is exploited in all processes of the VIPER server using the multi-threading facilities of Mach and Unix (multi-threaded server). Likewise, a special technique provides for the parallel execution of visualizations of scientific data on the display as well as the import of data into the object base.
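To make the instrumentation step concrete, the following C sketch shows how declaring a computational object and inserting an interaction point might look inside an application process. The paper does not give VIPER's actual API; the names viper_declare_object and viper_ip and their stub bodies are purely illustrative assumptions.

#include <stdio.h>
#include <stddef.h>

/* Illustrative stubs only; a real implementation would talk to the
   VIPER server instead of printing. */
static void viper_declare_object(const char *name, double *data, size_t n)
{
    /* would register (name, data, n) so the server can extract the
       object or request it from the process later */
    printf("declared computational object %s (%zu values)\n", name, n);
    (void)data;
}

static void viper_ip(const char *ip_name)
{
    /* would signal the server; for objects that are switched on, their
       current values are extracted (Paragon) or sent (workstations) */
    printf("interaction point %s reached\n", ip_name);
}

static double pressure[1024];   /* local part of a distributed CFD field */

int main(void)
{
    viper_declare_object("pressure", pressure, 1024);
    for (int iter = 0; iter < 3; iter++) {
        /* ... compute one iteration on the local grid partition ... */
        viper_ip("IP_after_iteration");
    }
    return 0;
}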
4. Controlled execution of parallel programs

On a lower, more implementation-specific level we can no longer hide the fact that the sequential parts of parallel programs communicate in order to solve a common problem. This level will always be entered when programs have to be tuned or faults have to be fixed. At this point codex (controlled deterministic execution) can be applied to move the focus to the communication of a parallel application. In order to provide an intermediate level between an algorithmic view and the implementation, we present a model that reflects the communication on an abstract basis. After that we introduce a technique to control the communication behavior using control patterns.
4.1. Modeling parallel executions

To model parallel executions we have to take into account that they are fundamentally nondeterministic [4]. Communication that offers some degree of freedom is responsible for the inherent nondeterminism. The freedom may either result from intended nondeterminism to exploit parallelism more extensively or from erroneous communication. For modeling parallel executions, communication between threads of execution is therefore an essential topic. But to avoid adaptations to different communication models (like message passing or shared memory), it is necessary to abstract from individual implementations. Therefore, our approach is based on an abstract communication model called Common Objects, which can be mapped to most implementations. Common Objects thus serve as the medium for inter-thread communication. We require that the functionality of the model consist of at least one method with properties to modify objects and another one to read the state or contents of an object. For our model, we call these basic operations simply modify and read. All other methods can be based upon combinations of these. The modify operation is characterized as asynchronous and non-blocking, while the read operation has a blocking character. A successful invocation of modify does not necessarily mean that the operation has been executed successfully.
The read operation always blocks until it either returns an error code or the requested data. There are two reasons for the choice of these properties. First, read operations are synchronous by nature. Even an asynchronous read operation first has to read some status information in order to know that there is no data to be read; at least some error or status information of the related object needs to be regarded in advance. Second, modifying operations need not be synchronous or blocking, as we know from several communication models. A synchronous modify would have been too strict a limitation of the model. But in combination with a read operation, the entire method gets a stricter, synchronous characterization. All other characterizations of operations on Common Objects can be based upon a mixture of reads and modifies. This is, for example, important for several message passing environments. A blocking receive call is a combination of a read operation to get the corresponding data and a final modify operation to destroy the information. The modify is necessary because a receive call is a consuming operation. While these two operations are atomic in the case of a blocking operation, they are split in the non-blocking case. To stay with the receive call: in the non-blocking case the read simply checks the contents of the Common Object and the remaining modification is skipped. For our approach we have to focus on operations that are applied to identical Common Objects. Essentially, we are interested in order relations between reads and modifies at Common Objects. There are two fundamental competitions. First, a set of modifies competes for a manipulation sequence. Second, a set of reads competes for different versions of the corresponding Common Object, i.e. in fact there is a competition with the modify operations. Of course, the access structure of a Common Object is additionally regulated by synchronization mechanisms, but for generality we neglect them for the moment. The competitions are sketched in figure 2; in part b) an initial value of the Common Object is assumed.
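As a minimal sketch of these semantics (our own illustration, not code from codex), the following C fragment models a single-slot Common Object with the two basic operations and shows how a consuming, blocking receive decomposes into a read followed by a modify. The single-threaded stand-in returns a status code where the real read would block.

#include <stdio.h>

typedef struct {
    const char *co;     /* unique label of the Common Object     */
    int         value;  /* current contents (a single slot here) */
    int         full;   /* 1 if a value is present                */
} common_object;

/* modify: asynchronous, non-blocking; success of the call does not
   guarantee the operation took effect (here: a full slot is dropped) */
static void co_modify(common_object *c, int v)
{ if (!c->full) { c->value = v; c->full = 1; } }

/* read: blocking by nature; returns the data or an error code */
static int co_read(const common_object *c, int *out)
{ if (!c->full) return -1; *out = c->value; return 0; }

/* blocking receive = read + modify (the modify consumes the message) */
static int co_receive(common_object *c, int *out)
{ if (co_read(c, out) != 0) return -1; c->full = 0; return 0; }

int main(void)
{
    common_object ch = { "CO_one", 0, 0 };
    int v;
    co_modify(&ch, 42);               /* send                */
    if (co_receive(&ch, &v) == 0)     /* consuming receive   */
        printf("received %d\n", v);
    return 0;
}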
Figure 2. Access competitions.

Definition 1 (Common Object) A Common Object is a triple (M, V, co) consisting of:
– a finite set M of methods, with read, modify ∈ M,
– a finite set V of values,
– a unique label co of the object.

The set of methods comprises all possible methods to access an object, i.e. it describes the interface of the Common
Object, which contains at least read and modify. The label is simply a unique identifier of the object. If we regard a message passing environment, a Common Object would be a logical channel between threads of execution. For a complete model of parallel executions we introduce Execution Objects, which represent threads of execution. An Execution Object is timeless and has only static properties. Although its existence is static, it may be invoked dynamically at runtime for a certain execution, i.e. its existence is independent of executions. An Execution Object interacts with other Execution Objects via Common Objects. Common Objects are invoked by using the access methods of M. In fact, an Execution Object is a set of accesses to Common Objects together with a corresponding order relation on this set. All possible accesses of an Execution Object are represented by the set of modify accesses A_W and the set of read accesses A_R. These sets do not contain cyclic accesses in a loop or recursive invocations of methods (see Definition 2).

Definition 2 (Execution Object) An Execution Object is a triple (A, →, eo) consisting of:
– a finite set A = A_W ∪ A_R of accesses,
– an order relation → ⊆ A × A on A,
– a unique identifier eo of the object.
The relation → is responsible for the order of accesses inside an Execution Object. At least in the case of exactly one execution this results in a totally ordered set of accesses; otherwise it is partially ordered. Furthermore, we need a mapping from Execution Objects to Common Objects: each single access of an Execution Object has to be assigned a unique Common Object identifier. This mapping is provided by the function defined in Definition 3. Our model of a parallel program, called the Parallel Object Execution Model (POEM), which is based on the definitions above, is designed to allow an intuitive and easy formulation of communication behavior. The formal definition of a POEM follows in Definition 3; an example is shown in figure 3.

Definition 3 (Parallel Object Execution Model) A POEM is a triple (EO, CO, λ) consisting of:
– a finite set EO of Execution Objects,
– a finite set CO of Common Objects,
– a mapping λ : ⋃_{(A,→,eo)∈EO} A → CO from each access to the corresponding Common Object.
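One way to read Definitions 1–3 is as plain data structures. The C sketch below is our own rendering, under the assumption that the order relation of an Execution Object is a simple sequence and that the mapping of Definition 3 is stored with each access; all field names are illustrative.

#include <stddef.h>

typedef enum { M_READ, M_MODIFY } method_t;

typedef struct {                 /* Definition 1: (M, V, co)           */
    const char *co;              /* unique label                       */
    /* M and V left implicit: read/modify plus an int value domain     */
} common_object_t;

typedef struct {                 /* one access of an Execution Object  */
    method_t               m;    /* read or modify                     */
    const common_object_t *co;   /* target, i.e. the mapping of Def. 3 */
} access_t;

typedef struct {                 /* Definition 2: (A, ->, eo)          */
    const char *eo;              /* unique identifier                  */
    access_t   *A;               /* accesses; array order encodes ->   */
    size_t      n;
} execution_object_t;

typedef struct {                 /* Definition 3: (EO, CO, mapping)    */
    execution_object_t *EO; size_t n_eo;
    common_object_t    *CO; size_t n_co;
    /* the access-to-Common-Object mapping is stored per access above  */
} poem_t;

int main(void)
{
    /* a fragment of the global sum example of figure 3 */
    common_object_t co_two = { "CO_two" };
    access_t a0 = { M_MODIFY, &co_two };       /* eo1 sends its result */
    execution_object_t eo1 = { "eo1", &a0, 1 };
    poem_t p = { &eo1, 1, &co_two, 1 };
    (void)p;
    return 0;
}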
In a real programming environment, the set of Execution Objects represents all kinds of threads of execution (e.g. threads, tasks, processes, ...) that are potentially created during a program run.

Figure 3. Parallel Object Execution Model (POEM) for a global sum example: Execution Objects eo0, eo1 and eo2 exchange reads and modifies over the Common Objects CO_one, CO_two, CO_three and CO_four.

Figure 3 shows all elements of a POEM except Contexts, which are introduced below. The example represents a
simple global sum where the Execution Objects communicate via message passing and we assume that the Common Objects have a FIFO behavior (as is usually guaranteed by the communication system). CO_one and CO_three serve as initialization channels for Execution Objects eo1 and eo2. CO_two is the input channel for the global sum. There is no synchronization, which means that eo1 and eo2 can put their results into CO_two in any order. By default, there is no predefined sequence of accesses. The actual sequence during an execution is either determined by environmental influences like processing load, network traffic or cache effects, or it is controlled by the user with the help of a steering tool. Contexts are groups of homogeneous objects that behave like a single object of the same type. Here, the Common Objects for the initialization could have been grouped to control the process of initialization. Since dynamic behavior is characteristic of modern parallel programming environments, we allow this property for both Common Objects and Execution Objects. From the view of a POEM, an object may be invoked into an execution at some time, but not necessarily at the beginning. It is only important to know which objects may potentially be invoked during an execution. POEM objects exist independently of executions, and a subset of the objects forms an actual execution. Finally, the model is restricted to our purposes, representing the basic features of parallel programming. It ignores time because it must not imply an order where there is none. A POEM implies relations between Execution Objects and Common Objects that are static. Whether these relations exist during a specific execution depends on the input and the steering activities. A POEM may contain all or some potential executions of a parallel program. The sources from which to build a POEM differ depending on what kind
of sources are available and their corresponding quality. In general, specifications are a suitable source from which to derive a POEM. But very often specifications are not available; therefore it is possible to extract the necessary information from static analysis, from the analysis of trace files, or from on-line data as in the case of program steering.
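For the global sum of figure 3, the gathering side might look as follows in PVM [5]. The wildcard receive (tid = -1) is exactly the degree of freedom a POEM captures: the partial sums may arrive in any order. This is a sketch; spawning, the initialization over CO_one and CO_three, and the result broadcast over CO_four are omitted, and the worker count and the message tag standing in for CO_two are our choices.

#include <stdio.h>
#include <pvm3.h>

#define NWORKERS 2
#define TAG_SUM  10        /* message tag playing the role of CO_two */

int main(void)
{
    double total = 0.0, part;
    for (int i = 0; i < NWORKERS; i++) {
        pvm_recv(-1, TAG_SUM);       /* -1: accept any sender (eo1, eo2, ...) */
        pvm_upkdouble(&part, 1, 1);
        total += part;               /* arrival order is left open */
    }
    printf("global sum = %f\n", total);
    pvm_exit();
    return 0;
}

Each worker would correspondingly pack its partial sum with pvm_initsend(PvmDataDefault) and pvm_pkdouble and pvm_send it to the gathering task.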
4.2. Control Patterns

We now develop a method to control and manipulate accesses to Common Objects on top of a POEM. Control patterns shall provide the possibility to change the communication behavior at runtime. For that purpose, we have to analyse typical access sequences in order to provide an easy and useful method. It should offer an intuitive way to formulate patterns of accesses. The intention is not to change the communication patterns which are fixed as part of the algorithm; we want to focus on those parts of the communication where some degrees of freedom exist.

Characterization of accesses. In order to describe accesses, we have to analyse what kinds of behavior should be expressible. The most primitive, but nevertheless important, behavior is the exact order of two or more accesses. It must be possible to specify an exact sequence of accesses. Often it is just necessary to describe: let Execution Object EO1 read the contents of variable var1 before EO2 modifies it. Furthermore, it could be useful to express repetitions of either single accesses or whole sequences. If we call the above description of an exact order expr1, the following statement must be expressible: repeat expr1 exactly five times. But sometimes the number of repetitions may either be unknown or it may be unimportant for
controlling the communication. In the latter case it would be sufficient to state that the access has to happen at least once, just like the closure in a regular expression. That means an Execution Object may perform its accesses as long as there is no rival Execution Object accessing the same Common Object. If the number of repetitions is unknown but important for the result, we have to find a way to describe a limit of repetitions that is independent of the actual number. Two events can be used to signal the end of an access sequence of a single Execution Object: the first is an access to another Common Object, the second is a blocking situation of the corresponding Execution Object. For instance, if we want all accesses that are executed inside a loop to happen at a Common Object before anything else, we may express this with the following term: let Execution Object EO1 access Common Object CO1 as long as EO1 does not access another Common Object. But this could be fatal if an access to CO1 blocks before the loop is finished: in this case no other Execution Object is allowed to access CO1. If the access is blocked due to a full buffer, all affected Execution Objects are blocked and the application may finally run into a deadlock. This is, of course, not our intention. Therefore we have to extend the above formulation with the following subordinate clause: ... or it is not blocked. The blocking itself is also a decisive event, and we can even extend the source of this event to other Common Objects. Since the fact of blocking could also be relevant if it does not happen at the corresponding Common Object, it is possible to express the term: let Execution Object EO1 access Common Object CO1 as long as EO1 is not blocked at CO1 or another Common Object. That way, we have two criteria to terminate repetitive accesses which are independent of the real number of repetitions. Very often it is necessary to order not the accesses of individual Execution Objects but the accesses of specific types of Execution Objects. That means the requirement to have an arbitrary Execution Object out of a specified set of Execution Objects could be valuable. Imagine we have a producer/consumer situation communicating via a FIFO queue, with a set of producers and a set of consumers. If we want the queue to hold at most a single element, we have to define an access sequence that enforces alternating accesses of producers and consumers. In most applications there are different phases like initialization, data distribution and several steps of computation. Consequently, we need patterns that are related to a specific phase. To that effect, it is necessary to define switches that trigger transitions between different patterns. A phase transition in the application can be reflected by a specific communication event or a given line of code.
Specification of control patterns. Control Patterns are either tied to an individual Common Object or to a set of Common Objects, i.e. to a Context. They act as mediators between Common Objects and Execution Objects with respect to given rules. This is reflected in figure 4, which illustrates how Control Patterns are used.
Figure 4. Control pattern mediating the accesses of several Execution Objects to one Common Object.

To make control patterns a useful tool, they must be easy to use and allow an efficient specification of accesses. There are three ways to specify accesses. The most exact expression reflects both the accessing Execution Object eo ∈ EO and its method m ∈ M, denoted eo.m. If only the Execution Object is of importance and not the access method, it is sufficient to write eo. And vice versa, if only the access method is relevant, one simply writes m. To formulate patterns we use operations that are similar to regular expressions. The corresponding alphabet consists of Execution Objects, access methods and the cross product of Execution Objects and access methods. For identification, a name is assigned to each pattern. A pattern is formulated in the following way:

p(co) : body

p is simply the unique identifier of the pattern, which is applied to the Common Object whose identifier is given in parentheses, i.e. co is the identifier of (M, V, co) ∈ CO. After the colon the most important part follows, the body of the pattern. The body contains the expression that is built from the alphabet A = {eo, eo.m, m | (A, →, eo) ∈ EO, (M, V, co) ∈ CO such that m ∈ M}, as mentioned before. The access method m must, of course, be an appropriate operation on the corresponding Common Object. Furthermore, it is possible to build hierarchical patterns where pattern identifiers are elements of the alphabet. This way, a complex pattern can be divided into small pieces. As conjunctions between individual accesses (resp. patterns) we propose the operations motivated above; a detailed description is given in table 1. The reason for the distinction between basic and extended operations is that extended operations are only allowed for accesses where the Execution Object is identifiable (e.g.
eo1.read) and that they are not defined for patterns (e.g. p! is not defined). Concatenation (a a′) guarantees the exact order of accesses of two expressions, a^x specifies an exact number of repetitions of a, while a^+ provides a soft specification for a loop. More complex descriptions are given by the extended operations, but they may only be used on access descriptions which contain an Execution Object, not on pattern identifiers and methods. In addition, there is an extension to the original definition of control patterns [3]: the combination of the soft specification and a hard limit (a^{x+}), which was introduced especially for steering purposes; an example is given in section 5. The pattern P(BUF) : EO1^5 EO2.read describes the following behavior: at the Common Object BUF, first the Execution Object EO1 performs five arbitrary accesses, then the Execution Object EO2 performs a read operation. This shows both the restriction to a specific access and the description of arbitrary accesses of one Execution Object. Besides hierarchical patterns we can also formulate repetitions of whole patterns: we put the body in parentheses and provide it with a basic operation to get a repetitive pattern. This results in a description like the following: pat(CO_one) : (eo1.read eo2.modify)^5. According to the requirements, phase transitions inside applications should be considered. Control Patterns support this by event-based transitions of individual patterns. Transitions allow the activation and deactivation of patterns at runtime according to specified events. This mechanism provides a dynamic adaptation to varying runtime conditions. Currently we support only communication events, but in general events like hitting a defined line of source code can be regarded too. If an event triggers a transition on the local Common Object, the current pattern is simply replaced by the new one. But if the event affects a remote Common Object, things are not straightforward: the transition does not necessarily reach the remote Common Object in a state which can be guaranteed to always be the same. In other words, we could introduce some nondeterminism by the use of transitions. Since this is not our purpose, transitions on remote Common Objects have to be well defined. To that effect the causal relationship of Common Objects is introduced. A causal relationship between Common Objects uses the path along accesses of Execution Objects. The transition finally happens as soon as the information about the event arrives at the affected Common Object via an access. This topic is beyond the scope of this article and is therefore not discussed any further. Of course, it is also possible to use the Contexts that were introduced with POEM to formulate control patterns. A Context simply groups homogeneous objects. Its external behavior is identical to a single object of the same type, i.e.
the identifier of the Context can be employed in a pattern like a single object. During the execution each member may take part in a communication with respect to the pattern the Context is involved in. This allows nondeterminism to be formulated deliberately, for convenience, which may be useful to allow alternatives where a strict order is not necessary. In summary, we have shown an approach to control communication where this is possible. Control patterns are built from simple expressions, similar to regular expressions, in order to specify exact orders and different kinds of repetitions. They are adaptable to new conditions by the use of transitions, and to ease their use, Contexts can be defined to group homogeneous objects.
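To illustrate how such patterns could be enforced, the following C sketch encodes a pattern as a sequence of steps with repetition bounds and lets a gatekeeper decide whether an access may proceed. codex's actual enforcement mechanism is not specified at this level, so both the representation and the matching rule for the operators (a^x as min = max = x, a^{x+} as min 1 / max x) are our assumptions.

#include <stdio.h>
#include <string.h>

typedef struct {
    const char *eo;      /* required Execution Object, NULL = any      */
    const char *m;       /* required method, NULL = any                */
    int min, max;        /* repetition bounds of this step             */
} step_t;

typedef struct { const step_t *s; int n, cur, done; } pattern_t;

/* Returns 1 if the access may proceed now, 0 if it must be delayed.   */
static int pattern_permits(pattern_t *p, const char *eo, const char *m)
{
    if (p->cur >= p->n) return 1;             /* pattern exhausted     */
    const step_t *st = &p->s[p->cur];
    int match = (!st->eo || !strcmp(st->eo, eo)) &&
                (!st->m  || !strcmp(st->m,  m));
    if (match) {
        if (++p->done >= st->max) { p->cur++; p->done = 0; }
        return 1;
    }
    /* a non-matching access may jump ahead only if the current step
       has already reached its minimum number of repetitions          */
    if (p->done >= st->min) {
        p->cur++; p->done = 0;
        return pattern_permits(p, eo, m);
    }
    return 0;                                 /* delay the access      */
}

int main(void)
{
    /* P(BUF) : EO1^5 EO2.read */
    step_t steps[] = { { "EO1", NULL,   5, 5 },
                       { "EO2", "read", 1, 1 } };
    pattern_t p = { steps, 2, 0, 0 };
    printf("%d\n", pattern_permits(&p, "EO2", "read")); /* 0: too early */
    for (int i = 0; i < 5; i++)
        pattern_permits(&p, "EO1", "modify");
    printf("%d\n", pattern_permits(&p, "EO2", "read")); /* 1: allowed   */
    return 0;
}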
5. Methods for steering highly interactive parallel programs

In the previous sections both VIPER and codex have been introduced and described in order to show their individual benefits for the problems they were designed for. Now we first show how these tools can be merged and then discuss how they can be used with respect to specific parallel problems.
5.1. Integration of VIPER and codex

VIPER and codex differ not only in the sort of information they provide, but also in the way they collect it. While VIPER is mainly an on-line tool, codex is based on static, post-mortem and on-line analysis. The more data it collects, the better the quality of the information it can provide. Though codex can also work without static and post-mortem analysis, it then lacks information about potential communication in the future of the ongoing computation. In contrast, VIPER only gets additional information about already performed executions; the main data, i.e. the data of the current program run, is provided by an on-line object base. The solution for an integration is the common use of VIPER's object base (see figure 5). It now contains static information as well as on-line information for both tools. For the acquisition of the data we also decided to rely on the infrastructure of VIPER and adapt it to the requirements of codex. In detail, this means that all actual Common Objects and Execution Objects are reported to the object base at the interaction points (IPs) and that information about control patterns is delivered to the control instance of the communication library. codex's static data, the specifications of Common Objects and Execution Objects resulting from trace analysis and static analysis, are also stored in the object base. For the use of our tool combination it is necessary to have an instrumented communication library (e.g. for PVM [5])
managing the communication according to control patterns, and to instrument the source code to identify IPs and computational objects.

Table 1. Operations of control patterns
Basic operations:
  a a′ – a has to be fulfilled before a′
  a^x – a must be fulfilled exactly x times
  a^+ – a has to be fulfilled at least once
  a^{x+} – a has to be fulfilled at least once, but with an upper limit of x
Extended operations:
  a! – a has the highest priority at its Common Object as long as it does not access another Common Object or is not blocked, respectively
  a# – a has the highest priority at its Common Object as long as it is not blocked

Figure 5. Integration of VIPER and codex: a steering frontend (Java), the shared object base (computational objects, control patterns, and the specifications of Common Objects and Execution Objects), the VIPER server, and the parallel application processes pap 1 ... pap n with interaction points such as IP1(do_1) and IP2(do_1, do_2) over data objects do_1, do_2, ...

In the near future we will be able to use the PVM implementation of OMIS [8], which is currently under development. OMIS provides a comprehensive standard on-line monitoring interface for the observation and manipulation of parallel executions. So far we have shown the integration of the lower levels; what is still missing is a common user interface. Both existing interfaces are built in JAVA, so from the technical point of view it is no problem to integrate them. It still has to be evaluated how we can integrate the different levels: the application view of VIPER and the more detailed but nevertheless abstract view of codex. Furthermore, through the use of JAVA we have a platform-independent interface which can run on almost arbitrary machines to manipulate the execution of a big application on a remote cluster of workstations. Finally, we obtain a tool which extracts runtime information and data of the computation and manipulates computational objects as well as communication sequences.
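A possible shape of such an instrumented receive in the communication library is sketched below, with illustrative stub hooks standing in for the control instance and the event reporting; the real codex/OMIS instrumentation is not detailed in this paper.

#include <pvm3.h>

/* Stub hooks; real ones would consult the control patterns held in the
   object base and report the event for on-line analysis. */
static void codex_wait_permission(const char *co, const char *eo,
                                  const char *method)
{ (void)co; (void)eo; (void)method; /* would block until permitted */ }

static void codex_report(const char *co, const char *eo,
                         const char *method)
{ (void)co; (void)eo; (void)method; /* would feed the object base */ }

/* Receive wrapper: the pattern check brackets the real communication. */
int instrumented_recv(int tid, int tag, const char *co, const char *eo)
{
    codex_wait_permission(co, eo, "read");  /* enforce control pattern */
    int bufid = pvm_recv(tid, tag);         /* the actual receive      */
    codex_report(co, eo, "read");           /* record the access       */
    return bufid;
}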
5.2. Manipulation of computational objects
On-line visualization of application data offers insight into and control of program parameters of parallel applications. Parallel CFD simulations, for example, are known to be very long-running and resource-intensive applications. With off-line visualization the user would probably wait for hours, days, or even longer only to find out that the problem has not been specified well enough to reach a converged solution. Here it is of great benefit to the user if he can observe the simulation on-line during run-time and not only see the aforementioned problem but also locate where the problems occur at an early stage. Steering in this context additionally offers a way to manipulate parameters of the program and thereby either improve the behavior of, for example, a numerical simulation or improve the performance of the program. Regarding
the problem described above, the user might change some parameters of the program and thereby enable convergence, or trigger a restart of the program after changing certain parameters. Both alternatives would significantly reduce the amount of time spent in the application cycle of CFD programs. In the other case the user may trigger a different load distribution with steering operators. The use of computational objects has the advantage of providing a well-defined interface to the parallel application: there are predefined states and parameters which can be changed via the object model.
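As a sketch of how such a parameter change could take effect at an interaction point (the poll call and its semantics are assumptions on our part; VIPER's real steering interface is described in [9]):

#include <stdio.h>

static double relax = 1.0;  /* a numerical parameter, e.g. a relaxation factor */

/* Illustrative stub: would return 1 and fill *v if the user applied a
   modification operator to the named object in the GUI. */
static int viper_poll_update(const char *name, double *v)
{ (void)name; (void)v; return 0; }

static void ip_before_iteration(void)
{
    double v;
    if (viper_poll_update("relax", &v)) {
        relax = v;                  /* the steering takes effect here */
        printf("relax steered to %f\n", relax);
    }
}

int main(void)
{
    for (int iter = 0; iter < 3; iter++) {
        ip_before_iteration();
        /* ... one solver iteration using relax ... */
    }
    return 0;
}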
5.3. Manipulation of communication objects

The manipulation of algorithmic objects covers all aspects of the computation. But in the case of parallel or distributed programs the aspect of communication has a major impact on executions. The impact may be constant if there is only synchronous exchange of information; this leaves no space for the manipulation of communication during runtime. But if the implementation offers some degrees of freedom through the use of nondeterministic communication operations (e.g. select calls, use of wild-cards), a decisive impact on the execution may be the result. Let us motivate this fact with a few examples. Suppose we compute a global sum in order to determine an important parameter for the remainder of the computation, e.g. a convergence criterion or a threshold. For simplicity, the sum is built by gathering the information on a dedicated node, as already shown in figure 3. Now suppose that there are more than two Execution Objects sending their results back to eo0. Since addition is associative and commutative, there is obviously no need to sequentialize this process. But in some cases on some machines, effacement of values might happen due to lack of precision. If we realize at a synchronization point that this might happen because some dominating values exist, we are normally not able to influence the competition between the different values for arrival. This is the point where control patterns can be applied. If there were, e.g., two values causing the effacement and we want them to arrive at the very beginning, we simply say: first eo1 modifies CO_two, then eo2, and afterwards n modify operations of arbitrary Execution Objects follow. The corresponding pattern looks like: p(CO_two) : EO1.modify EO2.modify modify^n. In case this takes place regularly and we always want the same order, we simply transform the expression into a repetitive one. For the last example we assumed that we know that there is exactly one access by each Execution Object. But if we do not know the number of repetitive accesses, we can make use of the knowledge that after the access of eo1 to CO_two a read operation on CO_four follows. The same is true for all following Execution Objects to the right of eo1 (see figure 3). That way, we can formulate an adaptive pattern: p(CO_two) : EO1.modify! EO2.modify! modify^n. As an effect the convergence might be accelerated, or at least the result has a higher precision.
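The effacement effect itself is easy to reproduce. The following self-contained C example (single precision, values chosen by us) shows why it pays to let two dominating values of opposite sign arrive first:

#include <stdio.h>

int main(void)
{
    const float big = 1.0e8f;      /* two dominating values: +big, -big  */
    float first = big + (-big);    /* they meet at the very beginning    */
    float last  = big;             /* -big arrives only at the end       */
    for (int i = 0; i < 16; i++) {
        first += 1.0f;             /* small contributions survive        */
        last  += 1.0f;             /* absorbed: 1.0e8f + 1.0f == 1.0e8f  */
    }
    last += -big;
    printf("dominating first: %.1f, dominating last: %.1f\n", first, last);
    /* prints 16.0 vs 0.0 in IEEE single precision */
    return 0;
}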
Figure 6. Tree of data distribution; one node is marked as a fast node.
Another area where control patterns are useful is the distribution or redistribution of data. We take a tree structure for the data distribution, as shown in figure 6, where each node demands data from its father node. Imagine some computing nodes in a network with very high computing power compared to the rest. Each of the server nodes, i.e. each father node, has a budget of data to be distributed; the budget is sufficient for multiple iterations. Now it is possible that the fast nodes consume more parts of the budget, and once the budget is exhausted a redistribution of the data of parts of the tree, or even of the whole tree, becomes necessary. Since a redistribution is very expensive, it would be beneficial to block the fast nodes from time to time in favor of slower nodes to avoid early redistributions. Let's say we want to block a fast node before it demands new data for the third time after all the others got new information. If EO4, EO8 and EO15 are fast nodes which signal their demands via modify accesses to COx, COy and COz respectively, the resulting total pattern is shown below:

p1(COx) = (modify ∈ EO4^{2+})^+
p2(COy) = (modify ∈ EO8^{2+})^+
p3(COz) = (modify ∈ EO15^{2+})^+
p(total) = p1, p2, p3

Finally, we can combine both problems. The visualization of the data shows us that a specific part of the data, in general a matrix, contains extreme values. To bring this data to an adequate place of computation we may steer the redistribution of the data in an appropriate way in order to lose as little precision as possible. In this section we have shown that we need special techniques to deal with interactive applications. First, we need
the possibility to integrate scattered data in a consistent way to provide a total view. Second, the exchange of data must be modifiable in order to influence data movements and access sequences. The effect, finally, is a higher quality of the result and/or an accelerated computation.
6. Conclusion

This work presents a novel combination of tools for parallel processing. In addition to steering computational objects, we use a method of controlled execution to manipulate the communication structure during runtime. It is not only possible to modify objects of the computation; we additionally provide a means to modify access sequences to objects of communication. This steering approach allows the developer to optimize his application as well as to find errors. The integrated visualization has proved to be very supportive during the final phases of the development cycle. Currently, both VIPER and codex are being implemented in JAVA to allow an easy integration. While VIPER already exists in a different implementation, codex is a new tool developed as part of THE TOOL-SET project [12]. The target environment for the common implementation is clusters of workstations, and the selected communication library is PVM [5]. Through the consistent use of an object-based approach we are able to offer a homogeneous view of a parallel computation independently of the tool. Future work will focus on the integration of other approaches related to debugging, with the intention of simplifying the process of uncovering errors in huge parallel program systems.
References

[1] R. H. Carver and K. Tai. Test sequence generation from formal specifications of distributed programs. In Int. Conference on Distributed Computing Systems, pages 360–367. IEEE, 1995.
[2] S. K. Damodaran-Kamal and J. Francioni. Testing races in parallel programs with an otot strategy. In T. Ostrand, editor, Proceedings of the 1994 International Symposium on Software Testing and Analysis, SIGSOFT, special issue, Seattle, Aug. 1994. ACM.
[3] M. Frey and M. Oberhuber. Testing and Debugging Parallel and Distributed Programs with Temporal Logic Specifications. In Proc. of Second Workshop on Parallel and Distributed Software Engineering, pages 62–72, Boston, May 1997. IEEE Computer Society.
[4] E. Fromentin, N. Plouzeau, and M. Raynal. An Introduction to the Analysis and Debugging of Distributed Computations. In First Int. Conf. on Algorithms and Architectures for Parallel Processing, pages 545–553, Brisbane, Mar. 1995. IEEE.
[5] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine – A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press, 1994.
[6] A. Geist, J. Kohl, and P. Papadopoulos. CUMULVS: Providing Fault-Tolerance, Visualization, and Steering of Parallel Applications. SIAM, 1996.
[7] T. J. LeBlanc and J. M. Mellor-Crummey. Debugging Parallel Programs with Instant Replay. IEEE Transactions on Computers, C-36(4):471–481, April 1987.
[8] T. Ludwig, R. Wismüller, M. Oberhuber, and A. Bode. An Open Interface for the On-Line Monitoring of Parallel and Distributed Programs. Intl. Journal of Supercomputer Applications and High Performance Computing, 11(2), 1997.
[9] S. Rathmayer and M. Lenke. A Tool for On-line Visualization and Interactive Steering of Parallel HPC Applications. In 11th International Parallel Processing Symposium. IEEE Computer Society, 1997.
[10] K. C. Tai. Reachability Testing of Asynchronous Message-Passing Programs. In Proc. of 2nd International Workshop on Software Engineering for Parallel and Distributed Systems, pages 50–61, Boston, May 1997. IEEE Computer Society.
[11] J. Vetter. Computational steering annotated bibliography. Technical report, College of Computing, Georgia Institute of Technology, 1997.
[12] R. Wismüller, T. Ludwig, A. Bode, R. Borgeest, S. Lamberts, M. Oberhuber, C. Röder, and G. Stellner. THE TOOL-SET Project: Towards an Integrated Tool Environment for Parallel Programming. In X. De, K.-E. Grospietsch, and C. Steigner, editors, Proceedings of Second Workshop on Advanced Parallel Processing Technologies, APPT'97, pages 9–16, Koblenz, Germany, Sept. 1997. Verlag Fölbach.