In: Pervasive Computing, Editor: Mary L. Howard

ISBN: 978-1-61122-057-5 © 2010 Nova Science Publishers, Inc.

Chapter 5

High-Performance Pervasive Computing∗

Carlo Bertolli†, Gabriele Mencagli‡ and Marco Vanneschi§
Department of Computer Science, University of Pisa, Largo B. Pontecorvo 3, 56125 Pisa, Italy

Abstract

In this work we define a general framework and a programming model for the design, development and evaluation of High-Performance Pervasive Computing applications. High-Performance Pervasive Computing (HPPC) is an emerging paradigm aiming at the effective exploitation of heterogeneous, complex pervasive platforms for the adaptive deployment and execution of parallel and distributed high-performance computations. Examples are real-time, mission-critical, emergency-management, and homeland-security applications. The high-performance requirement has a profound impact on the exploitation of emerging distributed platforms because, with respect to more traditional application fields, the control strategies, reconfigurations and deployment actions themselves have to respect tight performance constraints in order to guarantee the expected QoS.

∗ The authors have been partially supported by the FIRB Project In.Sy.Eme. RBIP 063BPH.
† E-mail address: [email protected]
‡ E-mail address: [email protected]
§ E-mail address: [email protected]


In the target applicative scenarios, the number of events, their meaningful combinations, the possible choices, and the expected performance introduce a high complexity of design and optimization. For this reason, design methodologies and cost models are needed that are quite new with respect to traditional systems and applications. We consider the presence of distributed platforms characterized by general-purpose processing nodes (e.g. servers, clouds), but also by specialized/embedded processing nodes (e.g. wearable devices, sensors, smart-phones), and by different heterogeneous wired and/or wireless networks. All these resources, even those until recently considered limited in processing power and capacity, are used to actively take part in the distributed high-performance computation. A proper parallel programming model has to provide mechanisms for the explicit definition of the application control logic, which decides and manages adaptive reconfigurations in response to various events. An associated cost model has to be defined to formalize and carry out the most suitable control strategies for achieving the desired QoS level (e.g. response time, energy saving, precision of computed results, and other execution metrics and their proper combinations). In our approach, the control logic provides the ability to switch the application among several distinct versions of the same application, each able to exploit at best the available resources and/or the resources onto which the application could be forced to run. To each version switch a dynamic reconfiguration of the application, or of its parts, is associated.

1. Introduction

The last decade has been characterized by the emergence or strong evolution of several innovative applications, thanks to the development of wireless communication technologies, the widespread adoption of multicore processors (also for mobile devices), and, consequently, their integration in a single pervasive platform. These kinds of platforms, which offer a computing and communication service that can be called pervasive, include centralized high-performance computing nodes (e.g. clusters or large shared memory machines), as well as interface nodes providing communication facilities between distributed nodes, and mobile/pervasive devices such as Personal Digital Assistants (PDAs), laptops and, in some cases, sensors. Examples of applications for these platforms are Emergency Management, Intelligent Transportation, Environmental Sustainability and National Defense.


A common point in such a heterogeneous set of applications resides in the performance constraints, i.e. Quality of Service (QoS) parameters, for the provided services, which can also vary dynamically during the execution. Such QoS is typically related to the performance at which computing results or communication facilities are provided to users, and/or to the service availability and reliability w.r.t. software and hardware failures, although several other metrics exist depending on the application semantics. This class of applications can be defined as High-Performance Pervasive Applications. A main issue in executing this class of applications on pervasive platforms is hence the definition of proper programming models and run-time supports enabling the specification and dynamic satisfaction of QoS constraints. This issue is made even more complex by the strong dynamic variability of the computing and communication services provided by pervasive platforms, also caused by the mobility of some of their nodes and by their geographical distribution. Figure 1 shows an abstract view of a High-Performance Pervasive Computing (HPPC) platform, focusing on the heterogeneity of computing resources and of interconnection network technologies. The HPPC paradigm implies the development, deployment, execution and management of high-performance applications that, in general, are also dynamic and adaptive in nature. Adaptivity concerns the number and the specific identification of cooperating high-performance components, the deployment and composition of the most suitable versions of such components, and the processing and networking resources and services, i.e. both the quantity and the quality of the application components needed to achieve the required QoS. The specification and requirements of QoS itself vary dynamically during the application execution, according to the user intentions and to the information produced by sensors and services, as well as according to the monitored state and performance of networks and nodes. HPPC applications include data- and compute-intensive processing (e.g. forecasting and decision support models) not only for off-line centralized activities, but also for on-line and decentralized activities. Consider the execution of software components performing a forecasting model or a decision support system model, which are critical compute-intensive activities to be executed respecting operational real-time deadlines. In "normal" connectivity conditions we are able to execute these components on a centralized server, exploiting its


Figure 1. Scheme of a generic pervasive application: a set of wired and wireless communication technologies (center) interconnects a set of heterogeneous computing platforms, including INPUT and OUTPUT components providing input tasks to, and consuming output results from, one of the three computations: a task farm of data parallel computations on the cluster architecture (central server); a data parallel computation on a decentralized architecture (interface node); and a task farm on a network of mobile nodes.


processing power to achieve the highest possible performance. Critical conditions in the application scenario (e.g. in emergency management) can lead to different user requirements (e.g. increasing the performance to complete the forecasting computation within a given, new deadline). Changes in network conditions (e.g. network failures or congestion situations) can lead to the necessity of executing a version of the application model directly on spatially local resources available to the users (e.g. personnel, rescuers, emergency managers and stakeholders): when central servers are not available or reachable, such resources are interface nodes and/or mobile devices themselves. In such cases, the forecasting model can be executed on different or additional computing resources, including a set of distributed mobile resources running different application software versions which are specifically defined and designed to exploit such kinds of resources. In other words, in this scenario it is important to assure service continuity, adapting the application to different user requirements but also to the so-called context, which corresponds to the actual conditions of both the surrounding environment and the computing and communication platform. So the key issue is the definition of high-performance parallel programming paradigms, models, and frameworks to design and develop these kinds of complex and dynamic applications, focusing on Adaptivity and Context Awareness as crucial issues to be solved and to be integrated with high-performance and real-time features. In HPPC applications, various heterogeneous fixed and mobile computers (e.g. PDAs, wearable devices, new-generation handsets) and networks must be able to pervasively provide users with the necessary services under various connectivity, processing and location-based conditions. According to current trends in computer technology, interface nodes and mobile devices can be equipped with very powerful parallel computing resources, such as multi-/many-core components or GPUs, thus rendering the embedding of compute-intensive functions quite feasible at low power dissipation. These devices can be part of self-configuring ad-hoc/mesh networks, in such a way that they can cooperatively form a distributed embedded system executing specific application components, as well as being able to cooperate with centralized servers (e.g. a workstation cluster) and wired networks. A unified approach for programming large pervasive grid infrastructures, especially for defining time-critical ubiquitous applications, does not yet exist.


Some research works focus on HPC computations in real-time environments, but in these approaches the "pervasive part" of the application definition is essentially missing: there are no tools, programming constructs or methodologies to manage and define interactions with sensor devices and to manage context information by means of proper knowledge models. Other research works achieve the necessary expressiveness to define context-aware and adaptive applications, but they do not address intensive real-time computations performed by HPC centralized resources or by distributed systems of mobile devices. The general reference point for these kinds of platforms is the Grid paradigm [7, 23] which, by definition, aims at enabling the access, selection and aggregation of a variety of distributed and heterogeneous resources and services. However, although notable advancements have been achieved in recent years, current Grid technology is not yet able to supply the software tools needed to match the high-performance feature with the high adaptivity, ubiquity, proactivity, self-organization, scalability, interoperability, as well as fault tolerance and security, of the emerging applications running on a very large number of fixed and mobile nodes connected by various kinds of networks. In conclusion, the definition of next-generation Pervasive Grid platforms [39] is still at its beginning: the integration of traditional applications with ubiquitous applications and devices is a field still requiring intensive theoretical and experimental research. The integration must provide a proper combination of high-performance programming models and pervasive computing frameworks, in such a way as to express a QoS-driven adaptive behavior for critical high-performance applications. A high-level programming model is the only solution to one of the most crucial issues in high-performance application design, i.e. the so-called performance portability: defining parallel programs having a reasonable expectation about their performance, and in general their behavior, when they are executed on different architectures (e.g. a multiprocessor, a workstation cluster, a distributed system of pervasive devices or multicore components). Performance portability is even more important in HPPC, which must be able to dynamically reconfigure applications onto very different and heterogeneous computing and communication resources. Structured Parallel Programming [17] is a valuable high-level approach for developing highly portable parallel


applications. In this approach parallel programs are expressed by using well-known abstract parallelism schemes (e.g. task-farm, pipeline, data-parallel, divide&conquer), for which the implementations of communication and computation patterns are known. Performance portability can be exploited by using proper performance models for each specific scheme, which make it possible to measure and dynamically modify the application performance and its resource utilization (e.g. performance and memory utilization, battery consumption for mobile nodes). This feature renders feasible the definition of efficient fault-tolerance [11] and adaptivity [44] high-performance mechanisms, which, as we will show in Section 3., are not present in other pervasive computing projects. Structured parallel programming is a valuable starting point; however, it is not sufficient for a HPPC programming model, which must also be characterized by reconfiguration mechanisms to achieve adaptivity. We distinguish two kinds of reconfigurations: functional and non-functional ones. Non-functional reconfigurations preserve the application semantics and involve non-functional parameters of a computation (e.g. its memory utilization, its performance, or power consumption). In parallel processing projects and in pervasive computing projects (notably [25]) an "invisible" approach to adaptivity is adopted, i.e. delegating the reconfiguration actions to the run-time system, without introducing specific mechanisms visible to the programmer. However, an invisible approach is not sufficient for complex ubiquitous applications. Suppose we have an intensive computation which is processed on a centralized HPC server. Due to some events related to the state of the surrounding execution platform, we could require the migration of this computation onto a set of mobile intelligent devices. This migration is a complex operation, concerning not only simple technological issues (e.g. changing the data format and migrating a partial task, as in Aura), but also the relevant differences of the newly available resources and their efficient exploitation. A parallel computation for a cluster architecture might not be efficiently executed on a set of mobile nodes, due to their possible limitations, such as memory and processing capacity, or the performance offered by their mobile interconnection networks. In this case, a reconfiguration approach can exploit a specific property of Structured Parallel Programming: we can change the composition of different parallelization schemes without modifying the computation semantics [45], for example the parallelism degree, the data partitioning scheme, the aggregation/disaggregation


of program modules according to known cost models. In this way we are able to express multiple compatible behaviors of a certain application part, replacing it without modifying the other parts. Functional reconfigurations consist of providing a set of different versions of the same application or component, each one suitable for specific context situations (e.g. mapping onto specific available resources or when some network conditions occur). All these versions have a different but compatible semantics: they can exploit different sequential algorithms, different parallelization schemes or optimizations, but they preserve the component's interfaces in such a way that the selection of a different version does not modify the behavior of the global application. Again, the run-time system is not able to decide the proper version selection strategy in an invisible way. Instead, the programmer is directly involved in defining the mapping between different context situations and the corresponding functional reconfigurations: for this purpose, specific programming constructs for reconfigurations are provided by the programming model. In conclusion, both for functional and non-functional reconfigurations, adaptivity is not completely application-transparent, since the programmer must be aware of the adaptation process, similarly to the application-aware adaptivity in Odyssey, but according to an approach which is not limited to the quality of visualized data and includes the quality of any application phase. Initially in this chapter we review research works aimed at providing support for general pervasive applications, both those not specifically oriented towards supporting High-Performance Pervasive applications and those more related to them. Then we describe our research efforts, which are strongly characterized by an architectural approach inherited from the High-Performance parallel programming research field, and which are based on the study and development of a programming model representing the methodological glue of our contributions. This programming model enables programmers to develop applications which adapt themselves to the dynamic platform conditions and to users' requests. Application performance and reliability can be analyzed by means of cost models that are defined by using the high-level properties of the programming model. The outline of this chapter is as follows: Section 2. describes related work introducing programming models and supports for general pervasive applications


and for High-Performance ones. Section 3. describes a typical application case for pervasive platforms, from which we derive the requirements for a programming model for High-Performance Pervasive applications. Section 4. gives the full details of the ASSISTANT programming model, also including some examples. Section 5. describes the analytical tools which are used to define the run-time performance of applications. Section 6. shows an implementation of the ASSISTANT model, and Sections 7. and 8. extend the described implementation with mechanisms implementing protocols to dynamically switch between component versions and to modify their support, also guaranteeing the consistency of these activities w.r.t. the application semantics. Section 9. describes an Emergency Management application which we implemented in ASSISTANT. The results of experiments are described in Section 10. Finally, Section 11. gives the conclusions of this chapter.

2. Related Work

In this section we describe the state of the art concerning adaptivity both for high-performance computations and for mobile systems executed on pervasive distributed execution platforms. In particular, our objective is to describe how adaptivity and dynamicity are expressed, focusing on the expressiveness of the different approaches. In many cases adaptivity is expressed at the run-time support level only. During the execution, the run-time system can select different protocols, algorithms or alternative implementations of the same mechanisms, in response to specific events which describe the actual context situation. This support level is often called middleware [22]: a set of common services, operating on lower-level resources, utilized by distributed cooperating application components. At the middleware level adaptivity can be expressed by a proper static or dynamic selection of different service implementations, or by setting specific parameters of configurable primitives. In this approach all the reconfigurations and adaptation processes are in general fully invisible to the applications. In other research works adaptivity is a key issue which is directly expressed at the application level. Mechanisms and tools are provided that allow programmers to define how their applications can be reconfigured and what the sensed events are. Applications can be defined in such a way that multiple versions of the same component or module are provided, and a proper version selection


strategy must be expressed by the programmer. Thus, adaptation strategies and policies are directly part of the application semantics, which can be characterized by a functional part and by a control logic expressing the adaptive behavior of the application. This section is structured as follows: first we partially describe the state of the art concerning adaptation methodologies for high-performance applications on traditional HPC platforms (e.g. clusters of workstations, supercomputers, mainframes and Grid architectures). Next we introduce some relevant research works concerning self-adaptation for mobile pervasive applications dealing with dynamic user intentions and highly variable mobile platform conditions.

2.1. Methodologies for Adaptive High-Performance Applications

A crucial issue in the development of parallel applications is to take advantage of the underlying execution platform in the best possible way. The definition of applications as a composition of well-known parallel patterns [17] makes it possible to study formal performance models based on queueing theory principles, and thus to develop optimized solutions for specific execution platforms. According to this approach, called Structured Parallel Programming (SPP), some notable languages [45] and tools [34] have been developed, able to analyze a high-level description of a parallel application and to produce an optimized implementation for a specific platform. In recent years, the SPP approach has received growing attention due to the hardware technology evolution/revolution of multi-core components (multiprocessors on chip). The proper exploitation of such technologies is currently a leading problem, which imposes a resolute change from the sequential vision to the parallel vision in application development [4]. In some research works this issue has an important impact not only on the availability of next-generation general-purpose parallel servers, but mainly on a new way of exploiting specialized and embedded devices and network processors [24]. With the widespread diffusion of dynamic environments and Grids/Clouds, optimizations based on static knowledge about the execution environment are no longer effective: applications must be able to change their structure at run-time and to exploit an autonomic behavior [33]. In this scenario the exploitation of SPP paradigms makes it possible to define autonomic applications in


which all the reconfiguration activities are completely transparent to the application programmer. It is the run-time support which is responsible for selecting a proper control strategy and for deciding the reconfiguration activities able to change the application configuration [2, 44]. However, these kinds of autonomic processes are limited to parallelism degree modifications or to mapping decisions; a first attempt at defining more powerful reconfiguration classes has been introduced in ASSISTANT (In.Sy.Eme Project, MIUR-FIRB) [8], in which the so-called "functional reconfigurations" have been introduced. In this case the programmer can define a set of equivalent behaviors (versions) of the same component, which usually differ in the implemented algorithm or in the exploited parallelism scheme. In non-structured parallel programming approaches autonomicity can still be introduced [42], but with some limitations: in this case all the reconfiguration mechanisms have to be developed entirely by the programmer. Moreover the adaptive logic cannot rely on proper performance models, and reconfiguration decisions must be taken without any a-priori knowledge about their impact on the overall performance.

2.2. Frameworks for Adaptive Pervasive Mobile Applications

Independently of high-performance applications, application adaptivity is a key feature in many real-world systems, especially in Pervasive Mobile execution environments characterized by a highly variable degree of resource availability (both for computing and network resources). Some research works deal with the issue of adapting mobile systems according to the actual network behavior. For instance, in Odyssey [38] mobile applications feature run-time reconfigurations which are noticed by the final users as a change in the application execution quality. The Odyssey framework is responsible for periodically performing resource monitoring activities and for interacting with mobile applications, raising or lowering the corresponding quality levels. In this approach all these reconfigurations are automatically activated by the run-time system without any user intervention. One of the most important drawbacks of this approach is the quite limited definition of the quality concept: in many cases it consists in the quality of the visualized data, but this assumption can be restrictive when we consider more complex applications involving an intensive cooperation between computation, communication and visualization.


In other research projects adaptivity requires the execution of migration activities of specific parts of mobile applications, for meeting variable user intentions, for energy consumption optimizations and for reliability reasons. As an example, in Aura [25] adaptivity is expressed by introducing the abstract concept of task, i.e. a specific piece of work that a user has submitted to the system (e.g. writing a document) and which can be completed by many applications, suitable for different classes of computing resources (e.g. workstations and smartphones). The framework is responsible for deciding the execution of migration activities in a fully transparent way w.r.t. the final user. Aura essentially considers very simple applications (e.g. writing a presentation) which allow a straightforward implementation of such migrations. On the other hand, if we consider more complex mobile applications (e.g. executing a forecasting model for disaster prevention), transferring a partially computed task to a different platform can be critical for both performance and consistency reasons. In recent years some first attempts have been made to execute time-critical applications, which require compute-intensive elaborations, also on mobile computing platforms. A relevant work is MB++ [37], a framework for developing compute-intensive applications in Pervasive Grid environments. Such applications are pervasive (i.e. designed for mobile devices) and also require the execution of high-performance computations performed by HPC centralized resources (e.g. a cluster architecture). Typical examples are transformations on data streams (e.g. data fusion, format conversion, feature extraction and classification) for metropolitan-area emergency response infrastructures. In this approach one of the most important drawbacks is the quite limited utilization of mobile nodes, which perform only pre-processing or post-processing activities, whereas compute-intensive elaborations are suitable only for HPC resources. In many other critical scenarios, such as emergency response systems, we could also require the possibility to dynamically execute real-time intensive computations on a distributed set of localized mobile resources equipped with sufficient computational power (e.g. multicore smartphones). In this section we have presented the actual state of the art concerning self-adaptive systems, both for traditional parallel computing problems and for mobile applications. From our point of view there is not yet a unified approach for programming adaptive high-performance computations executed on highly heterogeneous and dynamic execution environments, such as Pervasive Grid


infrastructures. Some research works focus on HPC computations in real-time environments, but in these approaches the "pervasive part" of the application definition is essentially missing: there are no tools, programming constructs or methodologies to define adaptation strategies in the context of mobile systems and networks. Other research works achieve the necessary expressiveness to define context-aware and adaptive applications, but they do not address intensive real-time computations performed by HPC centralized resources or by distributed systems of mobile devices. In this chapter we describe our programming model approach for High-Performance Pervasive Grid applications, in which some drawbacks and limitations of previous research works can be overcome.

3. An Example of High-Performance Pervasive Application

In this section we introduce an example of High-Performance Pervasive application to define the requirements, in terms of support mechanisms and language constructs, of a high-level programming model for these kinds of applications. We have seen that pervasive platforms can include computing and network resources which are highly heterogeneous. For instance, we can exemplify three kinds of parallel computing nodes:

• a centralized server, supported by a distributed memory architecture, e.g. by a cluster of multicores (e.g. Roadrunner [1]) interconnected by means of high-performance networks (e.g. InfiniBand [27]);

• a decentralized node, such as a wireless/wired network interface node or a mobile node, supported by a single multicore processor. This kind of node may be affected by the energy consumption issue, and it is certainly subject to memory-related degradations (e.g. due to the presence of a memory hierarchy and/or specific cache coherence mechanisms);

• a network of mobile devices (e.g. Personal Digital Assistants, or PDAs), where each device supports a "unicore" processor (or a small multiprocessor) with limited capacity in terms of available memory and energetic autonomy.


Figure 2. Representation of a task farm parallel structure: tasks received from an input stream are assigned to worker processes (W) by an emitter (E) process (each task goes to one worker) and their results are passed to a collector process (C) which delivers them to the output stream. Input tasks are represented by gray-scale circles, and results by squares with the same color as the corresponding input tasks. The F function denotes the execution of the task passed as input value, with the meaning that the application of F is replicated over the workers.

Suppose that we want to map an application component to one of these nodes. To fix the ideas we may consider numerical problems related to the resolution of differential equation systems (e.g. direct methods for tridiagonal systems or Cholesky decomposition for more general systems) or a physical/chemical simulation (e.g. the well-known Ocean [18] benchmark or the Barnes-Hut application [6]). By focusing on each of the three described architectures we can select: (i) a specific best algorithm for the chosen application; for instance, if we are implementing a tridiagonal system solver, we may select a specific direct method among the existing ones; (ii) a specific parallelization scheme, optimizing the parallel execution of the chosen sequential algorithm on the target architecture. Below we exemplify three parallel programs, one for each architecture, based on the following well-known parallelism structures:

Task Farm: this is a stream-based parallelism structure (see Figure 2) in which each task is solved sequentially by one of the worker (W) processes, which represent the parallelism unit. Input tasks are assigned to workers by a special emitter process (E), according to round-robin or on-demand strategies (the latter to optimize load-unbalance situations), and results are collected by another special collector process (C), according to a First-In-


First-Out (FIFO, in short) strategy. As is known, this parallelism paradigm does not decrease the processing latency of a single element of the stream, but it decreases the service time¹ (i.e. it increases the throughput), provided that the stream interarrival time is significantly less than the sequential processing time (a stream processing situation in the true meaning).
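As a purely illustrative aid (the notation and the numbers below are ours, not taken from the chapter), the standard bottleneck cost model of structured parallel programming can be sketched as follows: the steady-state service time of a farm is bounded by its slowest stage, the worker pool acting as a single stage of time t_worker/n.

    # Minimal sketch (our notation, simplifying assumptions: negligible
    # communication costs): the steady-state service time of a farm is bound
    # by its slowest stage, the n workers acting as one stage of time t_worker/n.
    def farm_service_time(t_emitter, t_worker, t_collector, n_workers, t_arrival):
        """Ideal inter-departure time of results for a task farm."""
        return max(t_arrival,             # cannot go faster than tasks arrive
                   t_emitter,             # per-task scheduling time
                   t_worker / n_workers,  # worker pool served in parallel
                   t_collector)           # per-task collection time

    # Example: t_worker = 80 ms, 10 workers, 10 ms interarrival time: the farm
    # is arrival-bound (service time 10 ms) while per-task latency stays ~80 ms.
    print(farm_service_time(1e-3, 80e-3, 1e-3, 10, 10e-3))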


Figure 3. Representation of a stream-based data parallel computation: each input task corresponds to or generates an input state which is scattered (S) to worker processes (W). After receiving their partition, workers start a data parallel computation including multiple steps, each one possibly characterized by a communication stencil. At the end of the stencil computation the resulting partitions are gathered (G) (some further parallel computation may be introduced here, such as a reduce) and the final result is passed to the output stream.

Data Parallel: in these kinds of computations an initial composite state is partitioned by a scatter process (S) onto replicated workers (W), where each worker performs the sequential algorithm for its partition (see Figure 3).

¹ The request latency is defined as the time needed to serve a request, which in the stream-based computing terminology corresponds to the time needed to perform a task. The service time refers to a whole stream-based parallel computation and corresponds to the time passing between two successive result deliveries, i.e. it does not take into account the cost of performing a single task but an average over the whole task stream.


The sequential algorithm may be subdivided into successive computing steps, and workers may cooperate during each step according to a proper communication stencil. At the end of the stencil computation the partial results on each worker are collected by a gather process (G). With respect to the task farm structure, this parallelism paradigm works both in a stream processing situation and when only a single system has to be processed (i.e., equivalently, when the stream interarrival time is greater than the sequential processing time for a single task). Moreover, it is able to decrease the processing latency of a single tridiagonal system and the memory size per node. In a stream situation, the disadvantage of a stencil-based data parallel structure, w.r.t. the farm paradigm, is a potential load unbalance and a more critical impact of the communication/computation time ratio, thus in general a greater service time.
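A similarly rough sketch, under the simplifying assumptions of uniform partitioning and a fixed per-step stencil cost (again our own notation, not the chapter's), shows why the data parallel scheme reduces per-task latency while its communication term grows in weight with the parallelism degree.

    # Minimal sketch (our notation, simplifying assumptions: uniform partitions,
    # fixed per-step stencil cost): per-task latency of an N-step data parallel
    # computation with n workers, and its service time without internal pipelining.
    def data_parallel_latency(n_steps, t_calc_per_step, t_stencil_per_step, n_workers):
        return n_steps * (t_calc_per_step / n_workers + t_stencil_per_step)

    def data_parallel_service_time(latency, t_scatter, t_gather, t_arrival):
        # The next task starts only after the previous one has been gathered.
        return max(t_arrival, t_scatter + latency + t_gather)

    # Example: 100 steps, 1 ms of computation and 0.1 ms of stencil per step:
    # latency drops from ~100 ms (sequential) to 22.5 ms with 8 workers, but the
    # fixed stencil term (10 ms overall) weighs more as n grows.
    print(data_parallel_latency(100, 1e-3, 0.1e-3, 8))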

In this example we do not want to enter into the specific details of an algorithm and its optimal parallelization for a given architecture. Rather, for the purposes of this chapter, we are interested in describing three different parallel programs, all solving the same problem, but possibly based on different sequential algorithms and different parallelization schemes. For each program we also give some details related to the dynamic modification of the parallelism degree and to fault tolerance mechanisms:

Parallel Program for Centralized Server: for this architecture we select a task farm program, where each worker is mapped to a whole multicore processor and is implemented as a data parallel computation. That is, the whole program is a composition of task farm and data parallel. As in a classical task farm each task is assigned to a single worker, but in this case it is solved in parallel by the workers of the data parallel program mapped to a multicore processor. From a performance viewpoint this parallel program optimizes the service time but it also reduces the request latency. Nevertheless, note that the parallelism degree of each data parallel computation is limited to the number of cores in each multicore processor. If needed, we can modify the parallelism degree of this computation principally by modifying the number of used multicores, i.e. by modifying the number of task farm workers. To increase the parallelism degree it is sufficient to connect the new worker (or workers) to the emitter and the


collector, and to modify the behavior of these latter processes to take it into account. Instead, a larger effort is required when decreasing the parallelism degree, as the action of removing a worker influences the execution of the task currently mapped to the removed worker. In general terms, we have to guarantee that worker removal is consistent with the computation (otherwise one or more results will never be delivered²). We can define two simple protocols guaranteeing a correct worker removal, depending on the relation between the average time needed to perform a single task, i.e. the request latency (denoted with Ltask), and the time available to remove the worker³ (denoted with Tremove): (i) if Ltask < Tremove we can wait for the worker to terminate the task and deliver the result, and then disconnect it from the emitter and the collector; (ii) otherwise, if Ltask ≥ Tremove, we have to stop and remove the worker before it can terminate the task. Assuming that the worker can be terminated at any time, or that there are specific computation points at which it can be stopped and that the time passing between two such points is less than the admitted removal time, then we only have to re-schedule the task previously assigned to the removed worker. To do so, we can keep on the emitter all tasks whose results have not yet been passed to the collector. Fault tolerance mechanisms can be defined to support a worker failure. For this failure model we can use the same re-scheduling mechanism as for worker removal. Note that this is only a solution given for the purposes of the example: further mechanisms can be based on scheduling the same task to multiple workers (i.e. on replicating the task execution).

Parallel Program for Decentralized Node: for this architecture we can select a data parallel program, where, as above, each worker is mapped to a core. The performance characteristics of this computation are the same as those of the data parallel structure described above, instantiated to the specific program implemented (i.e. the overall performance depends on the program semantics).

² We do not enter into the details of the definition of a correct stream-based computation. Refer to Section 8. for a formal definition.
³ For performance reasons we assume that there is some limitation on the time needed to increase or decrease the parallelism degree.


As is known, computations executed on multicore processors may feature performance degradations due to the presence of a hierarchy of shared memory (i.e. due to concurrent accesses to the memory) and, for some applications, due to the presence of cache coherence mechanisms. Besides optimizing the sequential algorithm and the used parallel structure, a way to control the extent of these kinds of degradations can be based on modifying either the parallelism degree or the grain of tasks. In the first case we simply modify the number of processes trying to concurrently access the shared memory, while in the second case we possibly diminish the size in memory of the computation state, obviously depending on the implemented application. Unlike the task farm case, to modify the parallelism degree of a data parallel computation (i.e. either to increase or to decrease it) we have to run some protocol guaranteeing consistency. In general, we have to consider the average time needed to perform a task, Ltask, and the time available to apply the increase or decrease, Tmodify: if Ltask < Tmodify then we can wait for the termination of the task currently in execution and then modify the parallelism degree; otherwise, if Ltask ≥ Tmodify, we have to modify the parallelism degree during the task execution. For this purpose, we can stop the computation and re-start the task execution from its beginning. Otherwise, if we cannot lose the work performed so far for the current task, we can reach a consistent state by means of an ad-hoc protocol for data parallel programs (see Section 8.), and then we can re-scatter the state to the workers and re-start the computation from the last data parallel step. Note that for this kind of architecture it is highly improbable that one of the cores will stop executing or fail while the other cores continue to work correctly: more likely, an actual failure will affect the whole processor. Therefore, we can rely on the existence of further spare multicore processors which can be used as hot or cold replicas⁴ of the main one, according to some replication strategy. If we adopt hot replication, the spare replicas can be managed according to active [40] or primary/backup [28] management strategies. If we adopt cold replication, in case of failure we have to deploy and start the new replica. In both cases we have to consider the time available to re-start the computation on the new replica, Trestart, and the time needed to terminate the task


currently in execution on the failed replica, Ltask: if Ltask < Trestart, we can restart the task execution from its beginning on the new replica; if Ltask ≥ Trestart, we have to adopt some checkpointing protocol. In the case of hot replication we proactively propagate the checkpointed states to the replicas; in the case of cold replication we have to save the checkpointed states on some stable storage support, surviving the processor failure. The replica deployed after the failure of the original computation will recover the last checkpointed state from that support and will re-start from the partially computed state.

Parallel Program for Network of Mobile Nodes: for this architecture we select a task farm computation, where workers are mapped to mobile nodes and the emitter and the collector are mapped to: (i) an access point in a typical infrastructured network solution, leading to what we can call an infrastructured task farm; (ii) mobile nodes in a typical ad-hoc network, leading to what we can call an ad-hoc task farm. Note that the very same performance models described above are valid, with special care for wireless communications, which require a specific modeling taking into consideration node distance and the possible presence of obstacles. In this chapter we avoid detailing the issues and solutions related to parallel computing on wireless networks, which we will study in future work. For this kind of computation we also have to consider issues related to the limited available memory and to the energy consumption, as well as the possibly high mobility of nodes. We can target the memory and energetic limitations by deriving proper cost models (e.g. see [14] for an example of memory models) and by dynamically modifying the task grain, based on the assumption that for most applications larger task grains mean either a larger memory requirement or a larger energy consumption. To target the mobility issue we can map node disconnections to transient failures, i.e. see disconnections as failures and recoveries of processes, and hence support them according to proper fault tolerance mechanisms. Therefore, for this kind of computation we have to consider

⁴ Replication is hot when the replicas are deployed on spare processors, and are active and ready to be used or are to be activated; replication is cold when the replicas are not deployed.


both full node failures, e.g. because batteries have run down, and transient failures caused by disconnections. An efficient solution may be based on task re-scheduling techniques and, in the case of an ad-hoc task farm, on emitter and collector replication according to optimized techniques (e.g. see [10]). We can also consider supports which guarantee some mobility properties, such as Virtual Mobile Nodes [19], which enable the selection of an application trajectory (e.g. maximizing node coupling) independently of the actual node mobility. This is implemented by means of proper process replication and migration techniques aiming at moving the computation in the desired direction.

Given these three computations, there are several cases in which we may select one of them, independently of the way in which they are composed into an application. For instance, there are performance and availability constraints which induce a specific selection:

• in some cases we can select a computation minimizing the request latency rather than the service time. For instance, the data parallel solution on the decentralized node could be preferred to the task farm on the mobile network;

• depending on the used mechanisms, their configuration, and the context situation, one of the computations may provide a lower response time to failures than another one. For instance, the cluster computation may provide higher availability (i.e. a lower node failure recovery time) than the network of mobile nodes;

• one of the computations may feature a lower overhead (w.r.t. the other ones) in modifying its parallelism degree. Broadly, a task farm structure features a lower overhead than other kinds of parallel structures (e.g. data parallel or its composition with the task farm). Moreover, we may support the task farm with spare worker copies, which have been previously deployed and which can be promptly included in the set of workers, hence minimizing the cost of modifying the parallelism degree. In this case, we should select the task farm on the mobile node network.

Now consider the three computations as computing components of an application, which also includes a component generating input tasks (denoted with

Given these three computations there are several cases in which we may select one of them, independently of the way in which they are composed into an application. For instance there are performance and availability constraints which induce a specific selection: • in some cases we can select a computation minimizing the request latency rather than the service time. For instance, the data parallel solution on the decentralized node could be preferred to the task farm on the mobile network; • depending on the used mechanisms, its configuration, and the context situation, one of the computations may provide a lower response to failure than another one. For instance the cluster computation may provide higher availability (i.e. lower node failure recovery time) than the network of mobile nodes; • one of the computations may feature a lower overhead (w.r.t. the other ones) in modifying its parallelism degree. Broadly, a task farm structure features a lower overhead than other kinds of parallel structures (e.g. data parallel or its composition with the task farm). Moreover, we may support the task farm with spare worker copies, which have been previously deployed, and which can be promptly include to the set of workers, hence minimizing the modification of the parallelism degree. In this case, we should select the task farm on the mobile node network. Now consider the three computations as computing components of an application, which also includes a component generating input tasks (denoted with


INPUT) and a further component collecting output results (denoted with OUTPUT), according to a generic graph-like composition which we do not specify, to remain as general as possible. The application will be mapped to a pervasive platform which includes the three architectures related to the computations described above, but also further nodes and communication resources to map the INPUT and OUTPUT components and make them communicate, as well as any other required components which depend on the application structure and semantics. As shown in Figure 1, the three components will communicate by means of wired and wireless technologies, also depending on their geographical mapping. Moreover, some components may be mapped to the same nodes: for instance the OUTPUT component may be mapped to the same resources on which the infrastructured mobile task farm is mapped. Considering the application as a whole, we can define further motivations for selecting one of the three computations to solve the intended problem. For instance we can exemplify the following cases:

• if our goal is to minimize the computing component service time over a task stream, we also need to consider the task interarrival time to the component, which corresponds to the inverse of the task reception frequency. The interarrival time is evaluated as the communication latency between the INPUT component and the selected computing component, if we assume that the INPUT component service time is negligible. For instance, if the communication latency between the INPUT component and the cluster computation is higher than the computing component service time (communications are the bottleneck), then we can think of selecting another computation. For instance, if the decentralized node is geographically nearer to the INPUT component, we may expect a lower communication latency between the two components, possibly leading to a better utilization of the decentralized node computation w.r.t. the cluster one. In this case we should select the decentralized node computation (a small sketch of this check is given after this list);

• in a complex network, like the one supporting the considered pervasive platform, we may expect that some network links are not available, hence leading to the disconnection of the INPUT and OUTPUT components from one or more of the computing components. In this case we have to select the computing component which can communicate with both


INPUT and OUTPUT components, or at least with the OUTPUT one, if INPUT data can be simulated.
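The bottleneck check mentioned in the first bullet above can be made concrete with a small sketch; the component names and figures are hypothetical and only illustrate the reasoning.

    # Minimal sketch (hypothetical names and figures): a candidate computation is
    # limited either by its own service time or by the communication latency from
    # the INPUT component; we pick the one with the best effective service time.
    def effective_service_time(t_service, t_comm_from_input):
        return max(t_service, t_comm_from_input)   # the slower term dominates

    candidates = {
        "cluster_farm":      {"t_service": 5e-3,  "t_comm_from_input": 40e-3},
        "decentralized_dp":  {"t_service": 20e-3, "t_comm_from_input": 8e-3},
        "mobile_adhoc_farm": {"t_service": 60e-3, "t_comm_from_input": 5e-3},
    }

    best = min(candidates, key=lambda c: effective_service_time(**candidates[c]))
    print(best)   # -> decentralized_dp: the cluster is network-bound here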

Note that the exemplified events may also occur dynamically: for instance, a link failure may happen during the computation, leading to the disconnection of two components. The communication latency between two components may vary dynamically due to the mobility of one of the communication parties. This may happen if the OUTPUT component is mapped to a single mobile device. It may also vary due to the presence of other applications demanding a higher-quality communication service (e.g. a videocall). Moreover, also the final user requirements in terms of application performance may change dynamically: for instance, at a given execution point we may be required to start minimizing the computing component service time rather than its request latency. As a consequence of these observations we can see that the logic under which we select one of the three computations has the form of some function (e.g. it is a strategy optimizing some function) which has variable parameters. The parameters assume the values of application-, support- and environment-related quantities, such as the service time of components, the possible failure of processes and, for some applications, the value returned by environmental probes or monitors (e.g. the air temperature). Therefore, to face these dynamic events we have to support at least the following mechanisms:

• it must be possible to dynamically modify the mechanisms used to support the computations, as well as their configuration. As described above, it must be possible to modify the parallelism degree, the mapping of processes to computing nodes and of logical communications to the physical links implementing them, the replication degree or the checkpointing frequency. We denote these kinds of dynamic actions as non-functional reconfigurations, as they do not modify the computation executed but only some aspects of its implementation (a sketch of the timing decision underlying such reconfigurations follows this list);

• it must be possible to dynamically select one of the computations solving the same problem, possibly without provoking a modification of the application semantics (i.e. the dynamic selection must be consistent). We denote this kind of action as a functional reconfiguration, as we dynamically modify either the sequential algorithm or the parallelism structure used.
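The consistency protocols sketched in Section 3. for worker removal, parallelism-degree changes and replica restart all reduce to the same timing comparison; the following fragment (hypothetical names, a minimal sketch rather than the actual ASSISTANT support) captures that common pattern.

    # Minimal sketch (hypothetical names): choose how to apply a reconfiguration
    # by comparing the task latency L_task with the time T_avail within which
    # the reconfiguration must complete (T_remove, T_modify or T_restart above).
    def reconfiguration_mode(l_task, t_avail):
        if l_task < t_avail:
            return "graceful"     # wait for the current task, then reconfigure
        return "preemptive"       # stop now, then re-schedule or restore the task

    assert reconfiguration_mode(l_task=0.5, t_avail=2.0) == "graceful"
    assert reconfiguration_mode(l_task=3.0, t_avail=2.0) == "preemptive"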


We can now see how the exemplified computations represent alternative forms, or versions, of the same computing component, which adapts itself to the dynamic platform and application conditions. The computing component hence includes in its program the three programs of the exemplified computations. Along with these programs, the component should also include a further code section, issuing reconfiguration actions which aim at guaranteeing that certain parameters of continuity of execution and efficiency are offered by the application during its execution. This further component code can be denoted as the control or adaptivity component logic, as opposed to the functional component logic, which is described by the three parallel programs described above. As an example, the adaptivity logic of a component should map specific context situations to reconfiguration actions, for instance by means of a set of condition-action rules (a small sketch of such rules is given after the following list). In more general terms we can think of at least three ways of defining the adaptivity logic:

• as exemplified above, it can react to the verification of certain context events. We denote this kind of solution as reactive logic;

• it can also perform reconfiguration actions before specific context events occur, either to avoid their occurrence or to make their handling more efficient and/or simpler. We denote this kind of solution as proactive logic. For this purpose, it may be possible to define prediction strategies for the future state of the execution parameters of the application. For this reason, this solution is also denoted as predictive logic.
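Purely as an illustration of a reactive logic (the event names, thresholds and actions below are hypothetical), such condition-action rules could take the following form, evaluated at every monitoring update.

    # Minimal sketch (hypothetical events, thresholds and actions): a reactive
    # adaptivity logic as ordered condition-action rules over the monitored context.
    RULES = [
        (lambda ctx: not ctx["server_reachable"],
         "switch_version:mobile_adhoc_farm"),             # functional reconfiguration
        (lambda ctx: ctx["service_time"] > ctx["required_service_time"],
         "increase_parallelism_degree"),                   # non-functional
        (lambda ctx: ctx["battery_level"] < 0.2,
         "decrease_task_grain"),                           # non-functional
    ]

    def react(context):
        return [action for condition, action in RULES if condition(context)]

    print(react({"server_reachable": False, "service_time": 0.05,
                 "required_service_time": 0.02, "battery_level": 0.9}))
    # -> ['switch_version:mobile_adhoc_farm', 'increase_parallelism_degree']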

4. The ASSISTANT Programming Model

We now introduce the ASSISTANT programming model, which includes all the programming features described above in a unified model. That is, even if the behaviors described above could be implemented by composing different programming models and supporting tools in a single program, we claim that the only way to guarantee actual continuity and efficiency of applications is to rely on a high-level unified model from which we can derive the desired properties.


4.1. Applications

An ASSISTANT application is composed of distributed and interconnected application components. By using a proper programming construct (i.e. the application construct), the programmer can express an ASSISTANT application as a directed graph of components interconnected by means of streams of data, i.e. sequences, possibly of unlimited length, of typed elements. The set of ingoing and outgoing data streams to and from a component identifies its input and output interfaces. ASSISTANT provides constructs to express two kinds of components:

• Parallel and Adaptive components, expressed by means of the ParMod construct. With this construct the programmer can express both the functional logic of the component (e.g. the parallel computation) and its control logic. As hinted, the definition of this logic is a crucial issue in order to efficiently deal with a dynamic execution environment and time-variable QoS requirements;

• Primitive Context interfaces, expressed by means of the primitive_interface construct. These are sequential components which periodically perform application monitoring activities (e.g. to obtain a module service time) and platform monitoring activities (e.g. to obtain the available network bandwidth on a specific connection). We refer to this information as the current execution context.

Figure 4 shows an example of an ASSISTANT application implementing a flood emergency management service. In the figure, black arrows denote data streams while black-and-white arrows denote context streams. The application includes third-party components, such as a Wireless Sensor Network (WSN) providing current environmental data, a precipitation data module and a Geographical Information System (GIS). These components provide the input on which the meteorological (i.e. precipitation) and flood forecasting models are applied, which in this application are implemented as ParMods. The results of these forecasting activities are passed both to application clients (which are themselves expressed as a ParMod) and to a further ParMod implementing a decision support system. In this chapter we are mainly interested in describing how adaptivity logics and multiple functional versions can be expressed in the ASSISTANT ParMod.


Figure 4. Example of an Emergency Management application implemented as an ASSISTANT application. The application includes third-party components (i.e. components implemented in languages other than ASSISTANT), ParMods, and context interfaces monitoring the service time of processes and the communication latencies. We leave to future work the description and implementation of the interconnection logic under which ASSISTANT and third-party components can interact.
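ASSISTANT's concrete syntax is not reproduced here; the following sketch only models, with hypothetical Python structures and stream types, the information the application construct has to capture: a directed graph of ParMods, third-party components and context interfaces connected by typed streams, mirroring Figure 4.

    # Minimal sketch (hypothetical structures and stream types, not ASSISTANT
    # syntax): an application as a directed graph of components and typed streams.
    from dataclasses import dataclass, field

    @dataclass
    class Component:
        name: str
        kind: str      # "parmod" | "third_party" | "context_interface"

    @dataclass
    class Application:
        components: dict = field(default_factory=dict)
        streams: list = field(default_factory=list)   # (source, destination, type)

        def add(self, component):
            self.components[component.name] = component

        def connect(self, src, dst, elem_type):
            self.streams.append((src, dst, elem_type))

    app = Application()
    for name, kind in [("WSN", "third_party"), ("PrecipitationData", "third_party"),
                       ("GIS", "third_party"), ("MeteoForecast", "parmod"),
                       ("FloodForecast", "parmod"), ("DecisionSupport", "parmod"),
                       ("Client", "parmod"), ("ServiceTimeMonitor", "context_interface")]:
        app.add(Component(name, kind))
    app.connect("WSN", "MeteoForecast", "EnvData")
    app.connect("PrecipitationData", "MeteoForecast", "PrecipData")
    app.connect("MeteoForecast", "FloodForecast", "Forecast")
    app.connect("GIS", "FloodForecast", "MapData")
    app.connect("FloodForecast", "DecisionSupport", "FloodMap")
    app.connect("FloodForecast", "Client", "FloodMap")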

4.2. The ASSISTANT ParMod

The ParMod construct is used in ASSISTANT to express parallel adaptive computations. From an abstract point of view a ParMod can be seen as composed of two cooperating entities:

• the Operating Part executes the Functional Logic of the module. The programmer can express the elaboration performed in terms of multiple versions, as required when supporting Pervasive Applications, where at run-time one of the versions is dynamically selected as the currently executed one. The execution of a version computation is activated by the reception of elements from its input interfaces (i.e. ingoing data streams), and it produces results on its output interfaces (i.e. outgoing data streams). The elaboration is applied to each input element and it can be a sequential code or a parallel computation expressed according to any scheme of Structured Parallel Programming [17], even in complex and compound forms;

• the Control Part of the ParMod executes the adaptive logic of the com-

26

C. Bertolli, G. Mencagli and M. Vanneschi ponent (see Section 3.). In our approach to adaptivity the run-time support provides a set of predefined ParMod reconfigurations, whereas, the adaptation strategy, i.e. when and how the computation has to be reconfigured, must be directly expressed by the programmer by means of a specific programming construct. As hinted, from our point of view this is an effective way to optimize application execution w.r.t. the state of the underlying platform and variable QoS requirements.

Note that all alternative versions of the same ParMod feature the same input and output interfaces, which correspond to the input and output data streams of the ParMod. Figure 5 shows the abstract structure of the ParMod. The Operating Part is activated either according to a data-flow scheme (i.e. the module waits for values from all the ingoing streams) or according to a non-deterministic behavior (e.g. a CSP-like [31] semantics based on guarded commands). As hinted, the Operating Part can perform a sequential or a parallel computation, hence it can possibly be implemented by means of a network of distributed processes.


Figure 5. Abstract overview of an ASSISTANT ParMod.

The Control Part is started whenever a monitoring message is received:

• from the set of primitive context interfaces connected to it;

• from the Operating Part, which can periodically send execution monitoring information concerning the actual performance of the computation (e.g. its throughput) and other execution metrics (e.g. memory occupation).

In both cases we refer to this information as a monitoring update. When a monitoring update is received, a set of reconfigurations can be decided according to a programmer-defined adaptation strategy. Many strategies can be expressed to deal with several execution conditions and to configure the ParMod behavior with the goal of guaranteeing specific QoS objectives over time (e.g. as introduced in our previous works [8, 9, 13]).

As we have discussed in Section 3., when implementing a High-Performance Pervasive application programmers need to express two kinds of reconfiguration activities, i.e. functional and non-functional ones. The ParMod includes constructs to implement both kinds:

• Functional Reconfigurations. The ParMod functional logic can be mapped to different classes of computing nodes by using the definition of multiple versions. Thus, based on the actual state of the execution platform, the control logic can select the best platform onto which to map the computation. "Best" in the sense that, according to the actual platform conditions and QoS requirements, the control logic can infer the mapping which most probably leads to a better expected performance. For this purpose ASSISTANT provides an automatic support to dynamically switch between versions, while the logic under which version switching is issued is expressed as part of the control logic by the programmer;

• Non-Functional Reconfigurations. ASSISTANT provides programmers with an automatic support for non-functional reconfigurations, such as the dynamic modification of the component parallelism degree (i.e. of the main version) and of the mapping of processes to computing nodes. Also these kinds of reconfigurations must be issued from the ParMod control logic.

The logical interconnections between the Operating and Control Part (Figure 5) are required to implement the reconfiguration protocol. When the control logic has selected the reconfiguration which must be executed, proper reconfiguration commands are sent to the Operating Part (to all involved processes and/or threads). When the reconfiguration is completed, the Operating Part sends a corresponding feedback message to its Control Part, which notifies the completion of the reconfiguration phase.

In the following we give details on the functional and control logics.

4.2.1. ParMod Functional Logic

The programmer expresses each alternative version of the functional part of the ParMod by means of the operation construct, i.e. each operation defines (at least) a ParMod version. As we discuss below, when implementing adaptivity logics according to reactive strategies, part of the control logic is also expressed inside the operation construct. In this sense the operation construct is the expressivity unit of the functional and control parts of an ASSISTANT ParMod. Inheriting from our previous experience in parallel programming [45], each operation is expressed according to a well-known parallelism scheme (e.g. stream-parallel schemes such as pipe and task-farm, but also data-parallel schemes such as map, map&reduce and different classes of communication stencils). This methodology is widely known as Structured Parallel Programming (SPP) [17, 45]. Many SPP approaches have interesting features in terms of high-level programmability and performance portability compared to other, non-structured parallel programming models based on message-passing or shared-memory languages (e.g. MPI and OpenMP). In addition, the SPP approach is of special interest in ASSISTANT when expressing the ParMod Control Logic (see Section 5.).

Each operation of a ParMod can define a parallel computation according to a generalized scheme, inherited from [45], which requires the definition of three sections:

• input_section: to express how the input streams are accessed, the programmer can use a non-deterministic selection based on a guarded command [31], in which each guard contains a set of input streams. After the non-deterministic selection of a guard, a distribution strategy of the received data to the parallel activities performing the computation is executed. Some strategies are predefined (e.g. on-demand, scatter and multicast);

• virtual_processors (VPs): these are the logical units of parallelism which execute the main function delegated to the module. For each received task a virtual processor executes a sequential code and produces a corresponding result. VPs can be programmed to express different parallelism schemes (e.g. both task-parallelism and data-parallelism can be implemented), and they are mapped to a set of distributed processes (called workers) responsible for their execution. The number of worker processes is the actual parallelism degree of the computation;

• output_section: in this section we express the collection of results from virtual processors and their transmission to the ParMod outgoing streams. The collection phase can be programmed by means of some predefined strategies (e.g. gather and FIFO).

Operating Part Example

We introduce the constructs to express a ParMod functional logic version by means of a simple example, in which we implement a task farm program (see Figure 6). The operation name is farmExample; it consumes elements of type task from an input stream named inT and produces elements of type result on the output data stream named outR (see line 1). Both task and result must be specified as data types inside the ASSISTANT application construct (for brevity, we do not show their definition). The virtual_processors in this operation are anonymous and they all behave in the same way: this is expressed at line 3 by the topology construct. The input_section is applied until the input data stream is eventually closed (do-while construct at lines 6 and 8). When an input element is consumed, it is assigned to one of the workers according to the on-demand strategy (line 7). As discussed above, guarded commands can be used to implement several input stream accessing strategies, which we do not show in this example. All workers feature the very same behavior (lines 10-15), which is expressed as part of the wBehavior elaboration (in ASSISTANT an elaboration is a sequence of procedure calls). The execution of a task is implemented here in the compute procedure, taking an element consumed from the input stream and producing a result for the output stream. The implementation of compute depends on the actual application semantics and is not shown here for generality. The result of each task execution is collected in the output_section according to a FIFO strategy, which in ASSISTANT can be expressed by means of the ANY worker selection command.

1   application assistantApp(InputStream task inT; OutputStream result outR) {
2     operation farmExample {
3       topology none Workers;
4
5       input_section {
6         do {
7           guard1: on , , inT { distribution inT on_demand to workers }
8         } while (true);
9       }
10      virtual_processors {
11        wBehavior (in guard1; out outR) {
12          VP {
13            compute (in inT; out outR);
14          }
15        }
16      }
17      output_section {
18        collect outR from ANY workers;
19      }
20    }
21  }

Figure 6. Implementation of a task farm program using the ASSISTANT functional logic constructs. See the text for a description of the program.

4.2.2. ParMod Control Logic

As hinted, in our methodology each application component must be explicitly programmed in such a way as to express both its parallel computation and its adaptation strategy. The adaptive behavior has the main objective of maintaining desired execution properties despite time-varying execution conditions. These requirements can be expressed according to different specifications:

• we might require the optimization of some execution parameters (e.g. the number of completed tasks), i.e. the Control Part is responsible for solving a utility optimization problem;

• we might require to maintain specific execution parameters within a user-defined range (e.g. keep the service time in a given range). In this case we refer to this approach as a threshold specification problem;

• we might require to maintain some execution parameters as close as possible to a set of desired reference values, as in a classical set-point regulation problem [29].

Following these specifications, in ASSISTANT we can express two classes of adaptation strategies:

• Reactive Control: the programmer expresses a proper mapping between specific run-time operation conditions and corresponding module reconfiguration activities (either functional or non-functional);

• Predictive or Proactive Control: instead of merely reacting to stimuli, the control logic acts in a proactive way. Being proactive means that a ParMod consciously acts in advance of a future situation. As an example, the control logic can avoid the violation of QoS constraints by performing proper reconfigurations in advance. This approach is typically based on a systematic use of predictions (e.g. for time-varying workloads and resource utilization) and on-line optimization techniques [16], from which we derive the term Predictive.

Though the second approach is an interesting research issue in the field of autonomic and self-adaptive computing [33, 35], in this chapter we introduce the control logic constructs used to express a reactive control. Predictive/proactive control logics will be the main object of future work.

In ASSISTANT it is possible to express a reactive behavior by defining how the Operating Part must be reconfigured in response to specific events or conditions. The Control Part periodically receives monitoring updates allowing it to identify whether the actual ParMod configuration behaves as the user expects. Hence, the periodic reception of such updates makes it possible to identify the presence of some QoS violations:

• some execution parameters are not equal to certain desired reference values;

• some execution parameters are not within a required range (e.g. the average service time is higher than a threshold).

Figure 7. Scheme of the ParMod control automaton. States represent operations, transitions between states represent functional reconfigurations, and self-transitions represent non-functional reconfigurations.

The main essence of this kind of control is to properly react to these QoS violations by automatically modifying the actual ParMod configuration, in such a way as to reach the desired execution goals as soon as possible.

on_event:
  condition 0: do
    // Non-functional reconfiguration:
    value = cost_model(...);
    parallelism value;
  enddo
  ...
  condition N-1: do
    // Functional reconfiguration:
    operationName.start();
  enddo

Figure 8. Syntactic form of the on_event construct. This construct can be used in an operation definition to express reactive control when the corresponding operation is the currently active one within the ParMod. The construct includes a set of non-deterministic event-reaction rules.

The mapping between undesired conditions and reconfiguration activities is a key issue. A proper mapping can be defined by exploiting a model reflecting the computation behavior. As hinted, the parallel computations of an ASSISTANT ParMod are well-known parallelism schemes, characterized by specific interaction patterns between parallel processes and a clear, well-defined semantics, which makes the definition of proper performance models feasible. By performance model we mean an analytical formulation which can describe:

• the expected performance of the computation, for instance its average service time, as a function of the parallelism degree and the current inter-arrival time. These models are based on queueing models [36] derived by considering the set of messages exchanged within the overall application process network as a traffic flow between service nodes. In previous works we have studied the performance models of ASSISTANT applications in several schemes and configurations [8, 9, 13] by using Queueing Theory results [36] (i.e. by assuming that the computation performance behaves as a Markovian process);

• the overall memory occupation of a parallelism scheme, as a function of the parallelism degree, the task size and the memory occupation of the sequential algorithm. In [14] memory utilization models are dynamically instantiated to configure a parallel computation executed on a set of mobile nodes equipped with limited memory capacity.

In Section 5. we give full details of the methodology used to derive performance models from a functional logic program. In ASSISTANT the reactive control logic is formally expressed by means of a control automaton, in which:

• each internal state of the automaton corresponds to a different operation (i.e. alternative version);

• input states are logical combinations of QoS-related or platform-related boolean expressions, e.g. stating the presence of a certain QoS violation and a certain level of network and computing resource availability;

• output states are reconfiguration actions related to each transition. These actions can be: non-functional reconfigurations, denoted with reconf(OPi), an example of which is the use of the parallelism N construct to modify an operation parallelism degree; or functional ones, denoted with switch(OPi, OPj), which identifies a switch between two different operating modes (i.e. from OPi to OPj).

Figure 7 depicts an overview of a generic control automaton. Its starting state is the operating mode which the programmer has specified with the keyword initial in the operation definition. According to the ParMod semantics, only one operating mode can be marked as initial. In our automaton we can observe that non-functional reconfigurations are self-transitions, whereas transitions between different internal states are functional reconfigurations.

For the considered reactive control, a control automaton is syntactically expressed as scattered across the operation definitions within a ParMod. That is, as we have introduced above, each operation defines, aside from its functional logic, how the module can react to specific events when that operating mode is currently executed. To do so, the programmer makes use of a set of non-deterministic clauses expressed in a specific programming construct (i.e. on_event, Figure 8), where each clause is related to a reaction program. A reaction program is expressed with C-like constructs and reconfiguration commands. The event-reaction rules defined inside an operation identify the possible outgoing transitions from the related operation. In the shown on_event scheme we make use of the parallelism construct, which dynamically re-sets the parallelism degree of the currently active operation, and the start method for operations, which implements the operation switching (from the currently executed operation to the operationName one).

Finally, note that reconfiguration commands can only be received at certain points of the Operating Part execution, i.e. when the corresponding processes (implementing the active operation) check for the presence of reconfiguration commands on channels linking them to the processes implementing the Control Part. After a notification of reconfiguration commands from the Control to the Operating Part, it may be necessary to guarantee that the reconfiguration commands do not modify the application semantics, i.e. that the transition between two successive versions or different configurations of the same version is consistent with the application semantics.

Consistency can be obtained by making the processes implementing the Operating Part run proper distributed algorithms, which in ASSISTANT can be implicitly defined by using the parallel structure definition or explicitly implemented by the programmer. Reconfiguration points and protocols are the subject of two separate sections, i.e. Sections 7. and 8.

Control Part Example

We show an example of Control Part based on the example discussed in Section 3.. Figure 9 shows the automaton which we want to implement. The automaton includes three states, each corresponding to one of the introduced versions: the cluster operation, implementing a composition of task farm and data parallel; the decentralized node operation, based on a data-parallel scheme; and the mobile network operation, implemented as a task farm program. In the automaton we show a subset of all possible events which can be defined to express a proper adaptive logic. For instance, we switch from the task farm on the mobile node network to the decentralized node operation in the case in which: (i) the QoS requirements change from a strategy based on the minimization of the service time to another strategy which can be reduced to a minimization of the request latency, and (ii) the cluster operation is either unreachable or features high communication latencies. The network and cluster operations also include self-transitions used to express the modification of the task farm parallelism degree as a reaction to a newly monitored or requested service time.

Figure 10 shows the program of the mobile network operation, limited to the Control Part section. The ASSISTANT application includes three declarations and corresponding definitions of operations. For the mobile node network operation we show two event-reaction rules:

• in case a new service time is monitored or requested by the clients, a new optimal parallelism degree n is computed by applying the operation performance model, and the operation functional part is reconfigured to target it by means of the parallelism construct;

• in case we need to optimize the request latency we switch to the decentralized node operation, by invoking its start method.

Figure 9. Example of control logic automaton for the application described in Section 3. For brevity, transition labels only show the event which induces a reconfiguration, omitting the reconfiguration program. In this example events are expressed in an informal way, whereas the corresponding control logic program must make use of logical clauses. Note that the programmer does not need to specify how reconfigurations are actually implemented (i.e. parallelism degree change and operation switching): such activities are automatically performed by the ASSISTANT run-time.
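To make the shape of such a reactive control logic concrete, the following Python sketch (not ASSISTANT code; the operation names, event names and the cost-model stand-in are illustrative assumptions) represents the transitions of Figure 9 as a table mapping (current operation, event) pairs to reconfiguration actions.

def optimal_degree(context):
    # Hypothetical stand-in for an operation performance model (Section 5.):
    # a parallelism degree balancing service time and inter-arrival time.
    return max(1, round(context["Tw"] / context["Ta"]))

# (current operation, event) -> reconfiguration action
AUTOMATON = {
    ("mobileNetwork", "newServiceTime"):
        lambda ctx: ("parallelism", optimal_degree(ctx)),
    ("mobileNetwork", "optimizeRequestLatency"):
        lambda ctx: ("switch", "decentralizedNodeOperation"),
    ("decentralizedNodeOperation", "reconnectionCluster"):
        lambda ctx: ("switch", "clusterOperation"),
    ("clusterOperation", "disconnectionCluster"):
        lambda ctx: ("switch", "decentralizedNodeOperation"),
    ("clusterOperation", "newServiceTime"):
        lambda ctx: ("parallelism", optimal_degree(ctx)),
}

def control_step(current_op, event, context):
    # One reaction of the Control Part: look up the rule for the active
    # operation and return the command to be sent to the Operating Part.
    rule = AUTOMATON.get((current_op, event))
    return rule(context) if rule else None

# Example: a new service time is monitored while the mobile network
# operation is the active one.
print(control_step("mobileNetwork", "newServiceTime", {"Tw": 8.0, "Ta": 2.0}))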

application exampleApplication(InputStream task inT; OutputStream result outR) {
  operation clusterOperation { .. }
  operation decentralizedNodeOperation { .. }
  operation mobileNetwork {
    // functional part (see the task farm example)
    ...
    on_event {
      newServiceTime: do
        int n = mobileNetworkPerformanceModel( /* .. read relevant context interfaces .. */ );
        parallelism n;
      enddo
      optimizeRequestLatency: do
        decentralizedNodeOperation.start();
      enddo
    }
  }
}

Figure 10. Implementation of the control part of the mobile node network operation. In case a new service time is monitored or requested by the clients, a new optimal parallelism degree n is computed by applying the operation performance model, and the operation functional part is reconfigured to target it by means of the parallelism construct. In case we need to optimize the request latency, we switch to the decentralized node operation by invoking its start method. Both the parallelism and start methods are automatically implemented by the ASSISTANT run-time.

5. Performance Models

As hinted, one of the main issues in programming high-performance adaptive applications on pervasive platforms is the definition of proper adaptation strategies, which make it possible to achieve the required QoS constraints characterizing critical applications. The control logic of an ASSISTANT ParMod is responsible for identifying when a ParMod reconfiguration can lead to a significant gain in the application execution (e.g. a performance improvement, a lower memory occupation or energy consumption, depending on the QoS perspectives). Independently of whether the control logic exploits only the actual execution conditions (e.g. the current context) or also more sophisticated prediction techniques for proactive strategies, this decision phase requires that the control logic be able to model the QoS behavior of its parallel computation and to quantify how specific reconfiguration activities can influence the overall QoS, in such a way as to take proper adaptation decisions. For these reasons the control logic needs to measure:

• for each possible reconfiguration decision, its impact on the QoS behavior of the application. This requires a proper modeling of different metrics, e.g. performance, memory occupation and energy consumption;

• for each possible reconfiguration, the run-time overhead spent in completing the reconfiguration and in making the new ParMod configuration effective.

The goal of this section is to provide a solid framework for evaluating the performance of parallel programs, which is essential knowledge for comparing different application configurations and for taking the best reconfiguration decisions. We also provide a clear interaction scheme between the Operating Part and the Control Part of an ASSISTANT ParMod, which is a key point in defining the overhead necessary to perform reconfiguration activities.

5.1. Performance Modeling for a generic acyclic application graph

We model a generic parallel computation (e.g. performed by the Operating Part of an ASSISTANT ParMod) by means of a directed acyclic graph of distributed processes cooperating by sending and receiving typed messages (see Section 4.). For lack of space, in this chapter we do not study the case of cyclic application graphs, which can nevertheless be expressed as ASSISTANT applications. In an acyclic application graph each distributed process corresponds to a network node logically coupled with an input queue of typed messages (a queueing station). For each incoming message (served according to a FIFO policy) a queueing station performs a service (i.e. a specific elaboration) and the corresponding result is transmitted to another queueing station in the network. Figure 11 depicts an example of an acyclic queueing network.

Figure 11. Example of the acyclic queueing networks studied in this section.

From the performance point of view we are interested in describing the behavior of each queueing station according to three main random variables:

• service time (ts): it corresponds to the time passing between the beginning of two subsequent elaborations of different input elements. We model this random variable by using an exponential distribution with average value Ts and variance σs;

• inter-arrival time (ta): it corresponds to the time passing between the reception of two subsequent input elements from source queueing stations. We model this random variable by using an exponential distribution with average value Ta and variance σa;

• inter-departure time (td): it corresponds to the time passing between two successive result externalizations from the queueing station. The distribution and the average value of the inter-departure time can be derived from the two other random variables according to specific theorems discussed below.

As hinted, we consider exponential (memory-less) distributions, which makes it possible to study the queueing station behavior as a Markov process. In this way a set of interesting results from Queueing Theory [36] can be applied for our performance modeling purposes, as described in the rest of this section.

Figure 12. Representation of the performance-related random variables for a single queueing node.

The exponential distribution assumption is not as restrictive as one can expect, because it describes many real-world applications in which each input task transmission is independent from the previous ones. We start by considering the formal modeling of the performance of a single queueing station M (a process in our distributed application logically coupled with a queue Q), as shown in Figure 12. We assume that the queueing station is characterized by an exponential distribution for both its service time and its inter-arrival time from a source queueing station. We also assume the knowledge of their average values, i.e. Ta for the inter-arrival time and Ts for the service time. Thus, we can introduce the utilization factor of the input queue Q of the process M, which is defined as:

ρ = Ts / Ta    (1)

To characterize the performance of a queueing station we consider two different situations: the input queue Q is stable or it is not. In the unstable case the utilization factor is greater than 1, thus the average service time is larger than the average inter-arrival time to the queueing station. This situation corresponds to a performance bottleneck in the application graph: on average, the queueing station is slower at serving requests than the incoming message rate (i.e. the inverse of the inter-arrival time). In this situation we can prove [36] that the inter-departure random variable has an exponential distribution and its average value is the same as the average value of the service time (intuitively, the result externalization rate is equal to the service rate of the queueing station).

Figure 13. Representation of a tandem queueing network (or pipe).

On the other hand, the stable case describes the dual situation in which the utilization factor is smaller than 1. Therefore, in the average case, the queueing station M is faster at serving incoming requests than the inter-arrival rate. In this scenario it is possible to prove that the inter-departure time can be modeled as a random variable with an exponential distribution and an average value equal to that of the inter-arrival random variable (this result is the so-called Burke's theorem [36]). Equation (2) summarizes the previous results for the average inter-departure time from the queueing station M:

Td = max {Ts, Ta}    (2)

An interesting application of the previous results is the performance modeling of a tandem queueing network (i.e. a graph describing a pipe of queueing stations, as in Figure 13). Each node Ni in the queueing network is characterized by a service time with average value Tsi. Suppose that Ta identifies the average value of the inter-arrival time to the first node of the pipe. By using relation (2), we can express the average inter-departure time from the last node of the pipe as:

Td = max {Ta, Ts1, Ts2, ..., Tsn}    (3)
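As a concrete illustration of equations (1)-(3), the following Python sketch (with purely illustrative numbers; the function names are ours, not part of ASSISTANT) computes the utilization factor of a station and the average inter-departure time of a single station and of a pipe of stations.

def utilization(Ts, Ta):
    # Utilization factor rho = Ts / Ta (equation 1).
    return Ts / Ta

def inter_departure(Ts, Ta):
    # Average inter-departure time of a single station (equation 2).
    return max(Ts, Ta)

def tandem_inter_departure(Ta, service_times):
    # Average inter-departure time from the last node of a pipe (equation 3).
    return max([Ta] + list(service_times))

# Example: a three-stage pipe fed every 2.0 time units.
Ta = 2.0
stages = [1.5, 3.0, 1.0]                    # average service times Ts1..Ts3
print(utilization(stages[1], Ta))           # 1.5 -> the second stage is a bottleneck
print(inter_departure(stages[1], Ta))       # 3.0 -> it forwards at its service rate
print(tandem_inter_departure(Ta, stages))   # 3.0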

We now present two other main results from Queueing Theory which are of special interest for studying the behavior of complex acyclic networks. Consider the following situation: we have a queueing station M which receives incoming messages of the same type from different source nodes acting as clients (e.g. nodes C1, C2, ..., Cn). In this case, independently of the specific source of a message, we consider a unique queue Q coupled with the node M. This situation is described in Figure 14.

Figure 14. Example of a queueing network with multiple clients sending requests to a single queueing station.

If we suppose that each inter-arrival time from a specific source Ci is an exponential random variable with average value Tai, it is possible to prove that the overall inter-arrival time from all sources is an exponential random variable whose average value TA is described by the following equation:

TA = 1 / ( Σ_{i=1}^{n} 1/Tai )    (4)

The dual problem considers a queueing station M and multiple servers S1, S2, ..., Sn. M, acting as a client, can send different classes of messages to a specific server according to a known probabilistic distribution; for instance, each server may be specialized for executing a particular elaboration. This scenario is depicted in Figure 15. Suppose that the global inter-departure time from the node M is an exponential random variable with average value Td. Here we are interested in defining, for each server Si, its specific inter-arrival time Tai. Suppose we know the probability pi of sending a message to a specific server node Si, such that the following equality holds:

Σ_{i=1}^{n} pi = 1

In this case we can prove that the inter-arrival time to a specific server Si is an exponential random variable with average value expressed by the following equation:

Tai = Td / pi    (5)

Figure 15. Example of a queueing network with multiple servers receiving input requests from a single queueing station.

The previous results make it possible to characterize the performance of a general queueing network in which exponential random variables describe the main performance parameters of each node. By using these results it is possible to identify the steady-state behavior of a parallel application composed of a set of distributed processes. As stated in [43], it is possible to use these results for studying the performance of a distributed system and for identifying the performance bottlenecks of the application. To remove performance bottlenecks we need to decrease the average service time of a process by parallelizing its elaboration. In this chapter we show how we can model the performance of well-known parallelism schemes, i.e. of structured parallel programming models, which is the model used in ASSISTANT. Such parallelization schemes are based on functional replication (e.g. as in a classical task-farm scheme) and/or data partitioning and replication (e.g. as in data-parallelism patterns). In the rest of this section we introduce the specific performance modeling of both the task-farm scheme and the general data-parallel scheme, applying the basic results described in the previous part of this section.
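The following Python sketch (with hypothetical values) illustrates equations (4) and (5): how several client streams aggregate at a single queueing station, and how the output stream of a station splits among servers according to known routing probabilities.

def merged_inter_arrival(client_inter_arrivals):
    # Average inter-arrival time seen by a station fed by several clients
    # (equation 4): the arrival rates of the clients add up.
    return 1.0 / sum(1.0 / ta for ta in client_inter_arrivals)

def per_server_inter_arrival(Td, probabilities):
    # Average inter-arrival time to each server (equation 5), given the
    # sender's inter-departure time Td and routing probabilities summing to 1.
    assert abs(sum(probabilities) - 1.0) < 1e-9
    return [Td / p for p in probabilities]

# Example: three clients feeding M, and M routing results to two servers.
print(merged_inter_arrival([4.0, 6.0, 12.0]))        # 2.0
print(per_server_inter_arrival(2.0, [0.75, 0.25]))   # [2.67, 8.0]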

5.2. Performance Modeling for the task-farm scheme

As introduced in Section 3., the task-farm scheme is based on the replication of the same functionality (e.g. a sequential function F) on a set of worker processes W (see Figure 2). An input stream (i.e. a possibly infinite sequence of typed elements) is scheduled by an emitter E, possibly by means of a load-balancing strategy (on-demand). Each worker applies F to each received input and produces a result which is sent to a collector C, which collects the results according to a FIFO strategy.

To quantify the service time of each application process we consider both the time spent completing a sequential elaboration (i.e. the computation time) and the time needed by the process to perform an inter-process communication (a send and a receive primitive on a communication channel, as described in Section 6.). We express this communication latency in terms of a specific function Lcom(S), which defines the time needed to perform an inter-process communication as a function of the number of bytes of the sent message. The definition of Lcom(S) is a critical aspect because, in any case, it is an architecture-dependent function. Moreover, we have to consider the possibility of overlapping communication latencies with the elaboration time. This feature is especially supported in high-performance architectures, as a strong performance optimization for many fine-grained parallel problems. Although it is quite an important optimization, in this section we provide a performance model for a task-farm structure which does not consider it: we assume no overlapping between computation and communication, in order to model the most general case.

A simple performance model can be derived for the task-farm by first considering the service time of each involved process:

Emitter Process: we can evaluate the average emitter service time Te as the sum of the time needed to schedule a worker plus the input and output communication latencies of messages of size tsize. If we assume that these communication latencies are equal (which in many cases is an upper-bound approximation), we can define Te = Tsched + 2 × Lcom(tsize), in which Tsched identifies the computation time needed to decide the scheduling of a task to a selected worker process.

Worker Process: the worker average service time Tw can be defined as the time needed to receive a task, plus the time needed to evaluate F, plus the time needed to send the result to the collector: Tw = TF + Lcom(tsize) + Lcom(rsize), where rsize is the size of a result.

Collector Process: the collector service time Tc can be defined similarly to the one of the emitter: Tc = Tcollect + 2 × Lcom(rsize), where Tcollect is the computation time for collecting a result and for possible post-processing activities.

For modeling the performance of a task-farm we can consider the pipe model introduced in Section 5.1., where the emitter is the first stage, the set of workers is the second stage and the collector is the third stage of the pipe. Since we can consider a uniform probability distribution for sending an input task to a specific worker, we can quantify the average inter-departure time from the task-farm as:

Td-farm = max {Ta, Te, Tw / N, Tc}    (6)

where Ta is the average inter-arrival time to the emitter and N represents the overall number of worker processes, that is the so-called parallelism degree of the task-farm parallelization scheme. If we assume that neither the emitter process nor the collector process is a performance bottleneck, we can observe that in a general acyclic graph we can find the optimal parallelism degree such that the average inter-departure time from the task-farm is equal to the average inter-arrival time (which is the best performance that we can expect). This optimal parallelism degree can be defined by the following equation:

Nopt = Tw / Ta    (7)
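A minimal Python sketch of this task-farm model follows; the linear Lcom cost function and all numeric parameters are illustrative assumptions, not measurements of any real architecture.

def lcom(size_bytes, latency_per_byte=1e-8, setup=1e-6):
    # Illustrative architecture-dependent communication cost Lcom(S):
    # a fixed setup time plus a per-byte transfer time.
    return setup + latency_per_byte * size_bytes

def task_farm_inter_departure(Ta, Tsched, TF, Tcollect, tsize, rsize, N):
    Te = Tsched + 2 * lcom(tsize)          # emitter service time
    Tw = TF + lcom(tsize) + lcom(rsize)    # worker service time
    Tc = Tcollect + 2 * lcom(rsize)        # collector service time
    return max(Ta, Te, Tw / N, Tc)         # equation (6)

def optimal_degree(TF, Ta, tsize, rsize):
    Tw = TF + lcom(tsize) + lcom(rsize)
    return Tw / Ta                         # equation (7)

# Example: tasks arrive every 1 ms and each one takes 4 ms of sequential work.
print(optimal_degree(TF=4e-3, Ta=1e-3, tsize=1024, rsize=256))            # about 4
print(task_farm_inter_departure(1e-3, 1e-5, 4e-3, 1e-5, 1024, 256, N=4))  # about Ta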

5.3. Performance Modeling for a generic Data-Parallel scheme

We consider a generic data-parallel scheme in which a composite state (e.g. an array or a matrix) is partitioned amongst a set of workers W which iteratively apply a function (e.g. H) to each element of their assigned partition. The function evaluation can feature functional dependencies: for instance, the evaluation of H on the i-th element of an array can depend on its nearest neighbors i − 1 and i + 1. Functional dependencies are called stencils and, for this performance characterization, must be statically known. Stencils can vary between different iterations of the same data-parallel program (variable stencil) or they can be equal for all steps (fixed stencil). At each step i, all functional dependencies are related to the element values computed at the end of the previous step i − 1. In particular cases we can also have completely independent elaborations between workers, that is, the elaboration on their partitions can be executed without any communication between workers (this case is the so-called map scheme). At the implementation level, functional dependencies spanning different state partitions are solved by means of asynchronous communications between workers.

In a generic data-parallel scheme we consider different kinds of processes:

Scatter Process: the scatter process S, for each received input data, performs a corresponding distribution strategy to the set of worker processes. The distribution can be a scatter collective primitive (partitioning of the input data among workers) or a multicast (sending a copy of the input data to each worker).

Worker Process: each worker process W executes the data-parallel loop, iterating it for a fixed number of steps or until some convergence condition becomes true. In the latter case the convergence evaluation could require proper communications between workers (e.g. the execution of an associative reduce between state elements).

Gather Process: at the end of the loop a process G performs the gathering of the local results of each worker. The resulting value is transmitted to the output stream.

The performance of a data-parallel program can be defined both in terms of its average inter-departure time (from the gather process) and in terms of the completion time for each input task (the processing latency). In fact, we can observe that the data-parallel scheme, w.r.t. the task-farm, has the main advantage of possibly reducing both the service time and the processing latency of the computation. Therefore it is a suitable parallelism scheme to be employed when we do not have an input stream of tasks, but only a small number of tasks that must be completed (in this case, if the task number is limited, task-parallelism approaches are not useful anymore).

As for the task-farm modeling, we assume no overlapping between communication and computation. We provide a general evaluation of the performance of a data-parallel program. Suppose we have a fixed number of steps s (otherwise, in many cases we can define an upper bound for a convergence condition). We consider the most general case in which there are functional dependencies between workers, and these dependencies can vary at each execution step; that is, we are exemplifying a variable-stencil data-parallel program. For this parallel structure we can evaluate the time needed by each worker to perform a single step i of the loop:

Tstep(i) = Th + Tstencil(gsize(i))

where Th is the time needed to apply the H sequential elaboration to all the elements in the local partition, and Tstencil(gsize(i)) is the cost (in terms of communication latencies Lcom) of receiving remote ghost elements and sending the local ghosts to the destination processes at step i. For this purpose we obtain the global size of sent and received messages (which can be variable at each step of the loop) from a function gsize. Note that, if we know the number of processes, the size of the local partitions and the mapping between partitions and processes, we can precisely compute the size of the messages to be sent and received, i.e. we can statically define the gsize function. Note also that this can be done because we are assuming that stencils are statically defined. At this point we can compute the average service time of the overall set of worker processes as follows:

Tw = Σ_{i=1}^{s} Tstep(i)

i.e. as the sum of the execution times of each step of the data-parallel loop. After this time the workers are able to receive the next input task from the scatter process. The scatter service time is the time needed to perform a collective primitive (a scatter or a multicast). We consider a sequential implementation of such collective primitives, executed by only one process (parallel versions can also be exploited, with a k-ary tree of processes performing the communications). We express the time for performing a distribution as Tdistrib and the time for performing a collection as Tcollect. These costs can be properly defined in terms of communication latencies; for instance, if we consider a sequential multicast of an input data structure (of size tsize) to a set of N workers, we have:

Tdistrib = Tmulticast(tsize) = N × Lcom(tsize)

Similarly to the distribution phase, we can define the time to perform a gather primitive as Tcollect. Thus, we can measure the overall processing latency Tdp-exec to compute a specific input task as the sum of the three main elements described before:

Tdp-exec = Tdistrib + Tw + Tcollect    (8)

We can also compute the average inter-departure time Td from the gather process as the maximum between the scatter average inter-arrival time Ta and the service time of each stage of the data-parallel scheme (i.e. the scatter process, the set of workers and the gather process):

Td = max {Ta, Tdistrib, Tw, Tcollect}    (9)
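The following Python sketch instantiates equations (8) and (9) under illustrative assumptions: a simple linear Lcom cost function, one ghost exchange per step, and a sequential gather whose cost is assumed analogous to the multicast formula given above.

def lcom(size_bytes, latency_per_byte=1e-8, setup=1e-6):
    # Illustrative architecture-dependent communication cost Lcom(S).
    return setup + latency_per_byte * size_bytes

def data_parallel_model(Ta, Th, ghost_sizes, N, tsize, rsize):
    # ghost_sizes[i] plays the role of gsize(i): bytes exchanged at step i.
    Tw = sum(Th + lcom(g) for g in ghost_sizes)   # sum of per-step costs
    Tdistrib = N * lcom(tsize)                    # sequential multicast
    Tcollect = N * lcom(rsize)                    # assumed sequential gather
    latency = Tdistrib + Tw + Tcollect            # equation (8)
    Td = max(Ta, Tdistrib, Tw, Tcollect)          # equation (9)
    return latency, Td

# Example: 10 steps, 8 workers, fixed ghost size of 4 KiB per step.
print(data_parallel_model(Ta=0.5, Th=0.02, ghost_sizes=[4096] * 10,
                          N=8, tsize=1 << 20, rsize=1 << 20))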

5.4. Reconfiguration Cost Model for the ASSISTANT ParMod

The performance modeling introduced in the previous sections is the basic building block for evaluating the performance of an ASSISTANT application and for deciding when reconfiguration activities are necessary for each application ParMod, in order to improve the expected performance according to the actual conditions of the surrounding execution platform. For deciding which reconfiguration activities could improve the performance of our application, we need to evaluate and compare the performance improvement of exploiting a different configuration (e.g. modifying the version or the deployment decisions for an application ParMod or a set of ParMods) w.r.t. the overhead spent on completing the reconfiguration activity itself. For this reason we need to quantify the cost incurred when a reconfiguration must be executed.

Regardless of the particular control logic defined by the programmer (e.g. reactive or proactive, as hinted in Section 4.), we can describe the interaction pattern (Figure 16) between the Control and Operating Part of an ASSISTANT ParMod.

Figure 16. Abstract interaction scheme between the Operating and the Control Part.

A ParMod can be executed by using different alternative configurations. By configuration we mean the used operation and its current implementation features (e.g. its parallelism degree). Whenever the Control Part receives a context information update, the corresponding control logic is started with the main objective of selecting a new configuration which is more suitable for the current execution conditions. To achieve the reconfiguration activities, the Control Part must decide a set of reconfiguration commands that must be transmitted to the Operating Part in order to start the reconfiguration execution. The average execution time of this selection algorithm is a parameter TControl of our reconfiguration cost model.

When reconfigurations have been decided, the Control Part is also responsible for instantiating the new configuration. We can identify different run-time support activities:

• if the control logic decides to perform a non-functional reconfiguration (e.g. a parallelism degree variation), it has to interact with the processes implementing the Operating Part and possibly instantiate new ones. For instance, if a parallelism degree increase is decided, it has to instantiate a set of workers on available computing nodes;

• if the control logic decides to switch to a different operation, the entire Operating Part must be instantiated on a proper set of computing nodes.

In the model we denote the cost of both cases as the average time TDeployment. This overhead can be heavily influenced by different factors:

• if we suppose a static knowledge of the underlying distributed platform (i.e. target nodes are known), all or a subset of the processes implementing the Operating Part must be dynamically deployed on the selected computing nodes. The parameter TDeployment identifies the average time needed to transfer the process source codes (or the corresponding executables) and to instantiate them on the corresponding processing nodes;

• in many real-world cases, especially for dynamic distributed systems (as in [30, 39]), the Control Part does not have a static knowledge of the available nodes. Each operation corresponds to a general class of execution platforms (e.g. cluster architectures or multicore nodes). When a reconfiguration decision is made, the Control Part has to discover a set of compatible nodes on which it can start the selected operation or a set of workers. This discovery phase is part of the TDeployment overhead (e.g. see [12]).

Once the reconfigurations have been decided and all the necessary deployment actions have been done, the Control Part notifies the reconfiguration commands to the Operating Part. At this point, when the Operating Part elaboration reaches a reconfiguration point (described in greater detail in Section 7.), the reconfiguration commands are processed and the new configuration becomes effective. At this point, according to the class of reconfigurations which are required, some run-time support activities could be necessary. For instance, in the case of a functional reconfiguration we must switch to a different ParMod operation. In our reconfiguration model, when the reconfiguration commands are received, the processes implementing the new operation have already been deployed on the selected computing nodes. To complete the reconfiguration the Operating Part processes must execute specific reconfiguration protocols (see Section 8.). Next, the new ParMod operation is activated. In general, the average time to make a reconfiguration effective is the overhead TStartup depicted in Figure 16. In our adaptation cost model we can identify two important reconfiguration metrics:

• the total reconfiguration time (i.e. TReconf) is the time between the reception of a context event and the time in which the new configuration is being executed;

• the reconfiguration latency (i.e. LReconf) is the time needed to reconfigure the ParMod, that is the time for the deployment activities and for starting the new module configuration.

By referring to Figure 16, these two metrics can be described by the following expressions:

LReconf = TDeployment + δ + TStartup    (10)

TReconf = TControl + LReconf    (11)

δ is the average time between the reception of the reconfiguration commands and the feedback externalization; it depends on the average time between two subsequent reconfiguration points during the Operating Part execution. The presented adaptation cost model makes possible a partial overlap between the total reconfiguration time and the "normal" execution of the Operating Part. For instance, we can observe from Figure 16 that the TControl and TDeployment overheads are always overlapped with the Operating Part execution: the Control Part sends the reconfiguration commands only when the new configuration has been decided (i.e. the control logic execution is completed) and after the completion of all the necessary deployment actions. In our model TStartup is the only overhead which is not overlapped with the ParMod computation; it is an unavoidable critical overhead which must be paid to make the new configuration effective. We can also observe that, when no reconfigurations are necessary, the TControl delay is completely overlapped with the Operating Part elaboration, TDeployment = 0, and the TStartup overhead is negligible because it consists only in the delay of an asynchronous notification of the feedback message from the Operating Part to the Control Part.
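A minimal Python sketch of these two metrics, with purely illustrative values, is the following.

def reconfiguration_metrics(Tcontrol, Tdeployment, delta, Tstartup):
    # delta: average time between the reception of reconfiguration commands
    # at a reconfiguration point and the feedback externalization.
    Lreconf = Tdeployment + delta + Tstartup   # equation (10)
    Treconf = Tcontrol + Lreconf               # equation (11)
    # Only Tstartup is not overlapped with the Operating Part computation.
    return Lreconf, Treconf

# Example: deployment dominates; startup is the only non-overlapped cost.
print(reconfiguration_metrics(Tcontrol=0.01, Tdeployment=1.2,
                              delta=0.05, Tstartup=0.2))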

6. Implementation

We describe the ParMod implementation in terms of a concurrent program composed of a set of distributed processes cooperating by means of proper typed communication channels. Inter-process communications are expressed by using specific send and receive primitives, which are the only way that processes have to communicate and cooperate in a classical message-passing model (i.e. there are no shared variables). First of all we provide the basic concepts of a concurrent language and then we describe the ParMod implementation.

6.1. Brief introduction to a Concurrent Language

A message-passing concurrent language such as [5, 31] provides the concept of typed channel. A channel is a run-time support data structure which is composed of:

• a queue of incoming messages. The queue has a maximum size and each element is a message;

• a set of control information (e.g. the size of a single message, the actual number of unread messages in the queue and other data for correct process synchronization).

Two specific classes of communication primitives are defined. The send primitive enables an outgoing communication between a sender process (which executes the primitive) and a receiver process. The primitive takes at least two input parameters: (i) a channel identifier and (ii) a message buffer which is copied into the channel queue. If the channel queue is full at the act of sending a message, the sender is suspended until the other communication party (i.e. the receiver process) performs a corresponding receive primitive. Similarly, a receive primitive takes at least two parameters: (i) a channel identifier and (ii) a buffer into which the received message must be copied (i.e. the so-called tag variable). During the receive execution the first message present in the channel queue is copied into the tag variable passed as an input parameter to the primitive; if the channel queue is empty, the receiver process is suspended until a send on the same channel is invoked by another process.

Different kinds of communication channels can be defined and used in a full concurrent language: synchronous or asynchronous communications, and symmetric or asymmetric communications (with multiple sender and/or receiver processes). Communication channels are a fundamental mechanism for the composition and interfacing of processes in a concurrent program. The input channels of a process may be used according to a data-flow scheme where, for each activation, the process waits for values from all the input channels (e.g. in a specific deterministic order). In the most general case a subset of input channels can be selected at each activation: this expresses a non-deterministic behavior. The semantics we consider for non-determinism is the one of the CSP model [31], based on guarded commands, and in particular the semantics of the ECSP concurrent language [5]: every guard contains at most three elements, each of which may be absent:

• a global guard (an input or output guard, that is a proper send or receive primitive);

• a local guard (i.e. a boolean predicate on local state variables of the process);

• a priority (an integer number which can be varied by program).

This non-deterministic choice between different input/output channels is expressed by means of a specific programming construct called the alternative command, whose general syntax is shown in Figure 17.

alternative {
  on g1: priority, pred_1(...), send/receive(...) do { cmd-list_1 };
  ...
  on gN: priority, pred_N(...), send/receive(...) do { cmd-list_N };
}

Figure 17. Alternative command in a concurrent language.

The evaluation of each guard can lead to three different situations:

• the guard is failed if its boolean predicate evaluates to false;

• the guard is verified if it is not failed and the send or receive primitive can be performed without suspending the calling process (e.g. for a send primitive this happens when the channel queue is not full, and for a receive primitive when the channel queue is not empty);

• the guard is suspended if it is neither verified nor failed.

The semantics of the alternative command is the following: when all the guards are failed, the command fails; otherwise, if there are some verified guards, one of these is non-deterministically chosen and the corresponding command list is executed. Finally, if there are no verified guards and there are some suspended guards, the calling process is suspended until at least one guard becomes verified (i.e. a corresponding send or receive primitive is executed by another concurrent process).
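As an informal illustration of this semantics (a polling-based emulation in Python, not the ECSP or ASSISTANT run-time, and limited to input guards), the following sketch evaluates a set of guards, fails if all local predicates are false, executes one verified guard chosen non-deterministically, and otherwise waits until a suspended guard becomes verified.

import queue, random, time

class Guard:
    def __init__(self, channel, predicate, reaction):
        self.channel = channel          # input channel (a bounded queue)
        self.predicate = predicate      # local guard: () -> bool
        self.reaction = reaction        # command list: (message) -> None

def alternative(guards, poll_interval=0.01):
    # Guards with a false local predicate are failed; a guard whose channel
    # has a pending message is verified; the remaining ones are suspended.
    active = [g for g in guards if g.predicate()]
    if not active:
        return False                                  # all guards failed
    while True:
        verified = [g for g in active if not g.channel.empty()]
        if verified:
            g = random.choice(verified)               # non-deterministic choice
            g.reaction(g.channel.get())
            return True
        time.sleep(poll_interval)                     # all active guards suspended

# Example: two input channels; the second one is disabled by its local guard.
ch1, ch2 = queue.Queue(maxsize=8), queue.Queue(maxsize=8)
ch1.put("task-42")
alternative([
    Guard(ch1, lambda: True,  lambda m: print("guard 1 fired:", m)),
    Guard(ch2, lambda: False, lambda m: print("guard 2 fired:", m)),
])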


Figure 18. Implementation of the ParMod as a distributed set of processes including: an input section process IN, an output section process OUT, a set of workers W and a set of managers M.

6.2. ParMod implementation in terms of a concurrent program The ParMod Operating Part includes a set of distributed processes cooperating by proper communications on typed channels. We can identify three different kinds of processes: the IN process performs distribution and scheduling activities, a set of Worker processes execute the parallel computation of the ParMod. Workers can operate independently (e.g. as in a task-farm or map parallelism scheme) or in cooperation (e.g. as in a data-parallel with communication stencils). Finally, the OUT process exploits all the result collection and post-processing activities. We can notice that parallel implementation for the distribution and collection activities are also possible but we do not consider these cases in this first ParMod implementation scheme. The overall overview of the Operating Part implementation in terms of both concurrent distributed processes and their communication channel is depicted in Figure 18. IN process executes the specific distribution strategy receiving data from a set of incoming input channels IN1 , IN2 , . . . , INN (which implement the

High-Performance Pervasive Computing

55

ParMod input streams). The behavior of the IN process is shown in Figure 19.

    IN:: while (cond) {
        alternative {
            on g_1 : receive(IN_1) do { pre_elab_1(); schedule_1(); }
            ...
            on g_N : receive(IN_N) do { pre_elab_N(); schedule_N(); }
        }
    }

Figure 19. Pseudo-code of the IN process.

It is based on a classic guarded alternative command, where the activation of each guard is followed by an optional pre-elaboration of the input data and by a proper scheduling strategy (e.g. scatter, multicast, on-demand or round-robin) towards the workers. This policy requires the transmission of input tasks (or parts of them) to the worker set by using specific communication channels. The implementation of a generic worker process can be broadly defined as shown in Figure 20.

    W:: while (cond) {
        receive(..);
        compute(..);
        send(..);
    }

Figure 20. Pseudo-code for a generic worker process.

The worker cyclically receives input tasks from the IN process, executes a sequential computation on them, possibly collaborating with other workers, and optionally submits the obtained results to the OUT process by means of a proper communication channel. Worker cooperation is a common situation in many data-parallel schemes: in this case workers send and receive messages on specific communication channels within the worker set. Finally, the OUT process behavior is shown in Figure 21.

    OUT:: while (cond) {
        collect(..);
        post_elaborate(..);
        send(..);
    }

Figure 21. Pseudo-code for the OUT process.

The OUT process collects result(s) from one or multiple workers (according to specific strategies such as collect-from-any or gather), performs some post-elaborations, and forwards the possibly transformed results to one or more of the out-going channels OUT_1, OUT_2, . . . , OUT_N, which implement the output streams of the application ParMod.

Finally, we conclude this section by briefly describing some implementation aspects of the Control Part of an ASSISTANT ParMod. The ParMod Control Part is composed of a set of concurrent processes cooperating with the Operating Part processes described above. We can define different implementation schemes:

• in the simplest case the Control Part is implemented by means of a unique, centralized process, which is responsible for executing the adaptation strategy of the ParMod and for deciding the proper reconfiguration actions. In this case the process (which we call manager in the rest of the section) receives event messages both from the manager processes of other Control Parts (Control Parts can cooperate with each other to perform a global adaptation of the overall application graph) and from context interfaces, which are responsible for carrying out platform monitoring activities and for periodically providing information concerning the actual platform conditions. After the execution of the control logic, the manager process transmits reconfiguration commands to the Operating Part processes (the IN process, the OUT process and the set of worker processes) by means of specific reconfiguration channels (i.e. reconf_channel);

• for fault-tolerance reasons, in many cases we may replicate the Control Part execution on a set of replicated managers executed on different distributed platforms. This replication can be: (1) active, if the replicas are fully active and managed according to a-priori consistency-guaranteeing protocols [40]; (2) primary/backup, if only one replica is active and, in case of failure, a manager replica must be properly activated in a consistent way [28];

• we can also partition the Control Part execution into different sets of manager processes, each responsible for controlling, and sending reconfiguration commands to, a subset of the Operating Part processes. This situation can be useful to simplify the mapping decisions between ParMod implementation processes and their execution resources, and to parallelize the execution and management of the reconfiguration protocol.
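As an illustration of the centralized scheme, the following Go sketch (our own, with hypothetical names; it is not the ASSISTANT run-time support) shows a manager loop that reacts to context-interface events and to peer-manager messages and broadcasts reconfiguration commands to the Operating Part processes over dedicated reconfiguration channels.

    package main

    import "fmt"

    // ReconfCommand is a hypothetical reconfiguration command sent to the
    // Operating Part processes (e.g. "switch to version 2", "add 2 workers").
    type ReconfCommand struct{ Action string }

    // manager implements a centralized Control Part: it receives monitoring
    // and peer events and emits reconfiguration commands.
    func manager(contextEvents, peerEvents <-chan string, reconf []chan<- ReconfCommand, done <-chan struct{}) {
        for {
            select {
            case e := <-contextEvents:
                // control logic (placeholder): decide a reconfiguration
                cmd := ReconfCommand{Action: "adapt to " + e}
                for _, ch := range reconf { // notify IN, OUT and all workers
                    ch <- cmd
                }
            case e := <-peerEvents:
                fmt.Println("cooperating with other Control Parts on:", e)
            case <-done:
                return
            }
        }
    }

    func main() {
        ctx := make(chan string, 1)
        peers := make(chan string, 1)
        done := make(chan struct{})
        opPart := make(chan ReconfCommand, 1) // one reconf_channel, e.g. to a worker
        go manager(ctx, peers, []chan<- ReconfCommand{opPart}, done)

        ctx <- "low bandwidth detected"
        fmt.Println("worker received:", <-opPart)
        close(done)
    }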

7. Reconfiguration Points

The main objective of this section is to clearly define the concept of reconfiguration point, which has been partially introduced in the previous parts of this chapter. In the ASSISTANT programming model reconfiguration points are specific computing instants during the Operating Part execution at which the various implementation processes (or threads) test whether the Control Part has notified reconfiguration commands. The design and the concrete implementation of these points is especially critical for adaptive applications, as their definition has a relevant impact on the time needed for a ParMod to reconfigure itself (i.e. the reconfiguration latency introduced in Section 5.).

As stated in the previous section, the Operating Part and the Control Part of an ASSISTANT ParMod need to cooperate with each other in order to carry out the necessary run-time reconfigurations. The Control Part implements the adaptivity logic by selecting proper reconfiguration activities. When reconfigurations are selected, the Control Part communicates a set of corresponding reconfiguration commands to the Operating Part. The reception of these commands at proper reconfiguration points and the execution of the corresponding reconfiguration protocols make the newly selected ParMod configuration effective.

Reconfiguration points may not automatically coincide with a consistent state of the Operating Part computation: they can simply be considered execution points at which the Operating Part processes test the presence of reconfiguration commands submitted by the Control Part. As a consequence of this


fact, in some cases it is required that processes first reach a consistent state before applying the necessary reconfiguration activities (see Section 8.). Thus, a main trade-off arises: more frequent reconfiguration points typically induce a lower reconfiguration latency, but they may also require a more costly reconfiguration protocol to ensure consistency during the reconfiguration. In this section we present a concrete definition of the reconfiguration point concept by describing the generic behavior of the Operating Part of a ParMod and its cooperation with the corresponding Control Part. As we have seen in Section 6., the Operating Part computation is executed by a set of distributed processes whose behavior can be described by means of a concurrent language.

7.1. Generic Description of Reconfiguration Points

For an ASSISTANT ParMod we can define the following general reconfiguration points:

• in the IN process, the presence of reconfiguration commands from the ParMod Control Logic can be tested before entering the alternative command, or while waiting for one guard to be fired (see Figure 19). Further reconfiguration points can be defined for the IN process by enabling the reception of reconfiguration commands also during the pre-elaboration and scheduling phases;

• in the worker processes, reconfiguration commands can be tested before receiving an input task and after producing the result for the previous task (see Figure 20);

• in the OUT process, reconfiguration commands can be tested after sending a result to one or more output streams and before starting to collect the next result from the worker(s) (see Figure 21). Further reconfiguration points can be defined during the post-processing phase.

The above description of reconfiguration points is quite general and can be specialized to the parallelism schemes considered in Section 5.. In this chapter, for brevity, we provide the concrete identification of reconfiguration points only for the class of task farm computations, leaving a similar definition for data parallel programs to future work.
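The worker reconfiguration point described above can be realized as in the following Go sketch (illustrative only, with hypothetical names): before receiving the next task, the worker listens, in the same non-deterministic choice, both to its task channel and to the reconf_channel from the Control Part.

    package main

    import "fmt"

    type Task struct{ ID int }
    type Reconf struct{ Cmd string }

    // worker tests for reconfiguration commands before receiving each new task;
    // after a task is received, it is computed and its result is emitted.
    func worker(tasks <-chan Task, reconfChannel <-chan Reconf, results chan<- int, done chan<- struct{}) {
        defer close(done)
        for {
            select {
            case cmd := <-reconfChannel: // reconfiguration point: no pending task here
                fmt.Println("reconfiguration command received:", cmd.Cmd)
                return // a real ParMod would run the reconfiguration protocol
            case t := <-tasks:
                results <- t.ID * t.ID // sequential computation (placeholder)
            }
        }
    }

    func main() {
        tasks := make(chan Task, 2)
        reconfChannel := make(chan Reconf, 1)
        results := make(chan int, 2)
        done := make(chan struct{})

        go worker(tasks, reconfChannel, results, done)
        tasks <- Task{ID: 3}
        fmt.Println("result:", <-results)
        reconfChannel <- Reconf{Cmd: "switch operation"}
        <-done
    }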


7.2. Reconfiguration Points for the Task Farm scheme

In Figure 2 (see Section 3.) IN and OUT implement the emitter (E) and collector (C) processes. For these processes we can identify the following reconfiguration points.

Emitter Reconfiguration Points

In general, for the emitter we can consider two classes of reconfiguration points: (1) points placed while waiting to receive an input task from the other application modules; (2) points placed before and/or during the scheduling phase.

The emitter can be implemented as an alternative command between all the in-coming channels which implement the ParMod input streams. A possible reconfiguration point can be placed by considering a further input guard of the same alternative command, corresponding to a special input channel which provides reconfiguration commands from the ParMod Control Part (i.e. the so-called reconf_channel). In this case, if the emitter process is waiting for the reception of the next input task, it is always possible to accept reconfiguration commands and to perform the proper reconfiguration activities. We can also observe that at this kind of reconfiguration point the emitter has scheduled the previous task but it has not yet received the next one, hence there are no pending tasks on it.

In the latter case we consider the scheduling phase: the emitter has received an input task which must still be scheduled to a worker process for execution. We select a load-balancing scheduling strategy for the task-farm, based on an on-demand scheme: the emitter schedules each input task to an available worker, that is, a worker which is immediately ready to start the corresponding computation. This is implemented by special “free” channels on which each worker signals to the emitter that it is available to receive the next task. In more detail, the scheduling phase employs the reception of availability messages from worker processes by means of the set of “free” channels. After the reception of an input task the emitter non-deterministically waits for an availability message by using another alternative command featuring an input guard from each worker “free” channel. The corresponding command lists are responsible for post-processing activities and


for scheduling the input task to an available worker. In this situation a second reconfiguration point can be placed while waiting for the availability messages. Therefore, we provide a further input guard from the Control Part reconf_channel. If the emitter process is waiting for the reception of an availability message, it is always possible to accept reconfiguration commands and to perform the proper reconfiguration activities before the completion of the scheduling phase. As a difference with respect to the first kind of reconfiguration points, during the reconfiguration we must consider the presence of a pending task, that is, a task which has been completely received by the emitter but not yet scheduled to an available worker.

Figure 22. Emitter behavior for a task-farm scheme.

In Figure 22 the emitter behavior is depicted, considering both the task reception phase and the scheduling phase towards the workers. In this representation the emitter process inside the Operating Part is represented as a circle labeled with the symbol E, and alternative commands are represented as square boxes. The emitter behavior considers a first alternative command for the reception of input tasks from other application modules (input channels IN_1, IN_2, . . . , IN_n) and a second non-deterministic selection for receiving availability messages from workers (channels free_1, free_2, . . . , free_p), in such a way as to perform an on-demand distribution. Reconfiguration commands can be received in both situations by adding a further input guard from reconf_channel in each of the two alternative commands. Channels task_1, task_2, . . . , task_p are used to schedule a task to a specific worker.
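A minimal Go sketch of this emitter behavior is reported below (our own illustration, hypothetical names): the outer select corresponds to the task-reception alternative command and the inner one to the on-demand scheduling alternative command; in both, an extra case on the reconfiguration channel realizes the reconfiguration point.

    package main

    import "fmt"

    type Task struct{ ID int }
    type Reconf struct{ Cmd string }

    // emitter: on-demand scheduling with reconfiguration points both while
    // waiting for a task and while waiting for an available worker.
    func emitter(in <-chan Task, free <-chan int, taskCh []chan Task, reconfChannel <-chan Reconf, done chan<- struct{}) {
        defer close(done)
        for {
            // first alternative command: next task or reconfiguration command
            select {
            case cmd := <-reconfChannel:
                fmt.Println("reconfiguration, no pending task:", cmd.Cmd)
                return
            case t := <-in:
                // second alternative command: availability message or
                // reconfiguration command (here one pending task exists)
                select {
                case cmd := <-reconfChannel:
                    fmt.Println("reconfiguration, one pending task:", cmd.Cmd, t.ID)
                    return
                case w := <-free: // availability message from worker w
                    taskCh[w] <- t // schedule the pending task to worker w
                }
            }
        }
    }

    func main() {
        in := make(chan Task, 1)
        free := make(chan int, 1)
        reconfChannel := make(chan Reconf, 1)
        done := make(chan struct{})
        workers := []chan Task{make(chan Task, 1), make(chan Task, 1)}

        go emitter(in, free, workers, reconfChannel, done)
        free <- 1
        in <- Task{ID: 7}
        fmt.Println("worker 1 got task:", (<-workers[1]).ID)
        reconfChannel <- Reconf{Cmd: "switch operation"}
        <-done
    }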

Collector Reconfiguration Points

As in the previous case, we can identify two classes of reconfiguration points which characterize the collector process execution: (1) points placed while waiting to receive results from the worker set; (2) points placed before forwarding the received results to the output streams of the ParMod.

During the task-farm execution the collector process performs a classical collection strategy based on a non-deterministic reception of results from the set of workers. At the implementation level the collector elaboration is activated by an alternative command featuring a set of input guards, one from each worker. A possible reconfiguration point can be placed before the reception of the next result by considering a further input guard of the same alternative command, corresponding to the special input channel reconf_channel from the Control Part. In this case, if reconfiguration commands are received, there are no pending results which have been completely collected but not yet forwarded to the output streams of the application ParMod.

When a result is received, the collector optionally performs post-processing elaborations and then the result is sent to an out-going channel which implements an output stream of the application module. At this point we can consider the following situation: if the queue of the output channel on which the result must be forwarded is full, the corresponding send primitive will suspend the collector process until an element is extracted from the queue. It means that the collector process will not be able to receive any reconfiguration commands (and so to start the reconfiguration protocol) until the result transmission is completed. To avoid this situation the collector should also be able to check the presence of reconfiguration commands during the result transmission phase. For these reasons the collector, after a result reception, executes a second alternative command in which we consider a set of output guards (i.e. involving a send primitive on an output channel) and a further input guard of the same alternative command

corresponding to the special reconf_channel which provides reconfiguration commands. In this way it is possible to receive reconfiguration commands before the completion of a result transmission. Obviously, in this case there is one pending result which has been completely received by the collector but not yet transmitted to an output stream.
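In a CSP-like setting the output guard of this second alternative command corresponds to a send case inside a non-deterministic choice, as in the following illustrative Go sketch (hypothetical names): even when the output channel cannot accept the pending result, the collector can still accept a reconfiguration command.

    package main

    import "fmt"

    type Result struct{ Value int }
    type Reconf struct{ Cmd string }

    // forwardOrReconf is the collector's second alternative command: an output
    // guard (send of the pending result) plus an input guard on reconfChannel.
    func forwardOrReconf(out chan<- Result, pending Result, reconfChannel <-chan Reconf) {
        select {
        case out <- pending: // output guard: result transmission completed
            fmt.Println("result forwarded:", pending.Value)
        case cmd := <-reconfChannel: // reconfiguration with one pending result
            fmt.Println("reconfiguration received, pending result kept:", cmd.Cmd)
        }
    }

    func main() {
        out := make(chan Result) // unbuffered and without a consumer: the send would block
        reconfChannel := make(chan Reconf, 1)
        reconfChannel <- Reconf{Cmd: "switch operation"}

        // even though the output stream is not ready, the collector is not stuck
        forwardOrReconf(out, Result{Value: 42}, reconfChannel)
    }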

Figure 23. Collector behavior for a task-farm scheme.

In Figure 23 the collector behavior is depicted, considering both the result reception and the subsequent transmission to the output streams. The collector behavior considers a first alternative command for result reception from workers (channels result_1, result_2, . . . , result_p) and a second non-deterministic selection for forwarding the results to the output streams (channels OUT_1, OUT_2, . . . , OUT_m). Reconfiguration commands can be received in both situations by adding a further input guard from reconf_channel in each of the two alternative commands.

Worker Reconfiguration Points

At the beginning of its execution a worker process initially signals to the emitter that it is free to receive a task (availability message). Next, it cyclically receives a task from the emitter, performs a sequential computation and transmits the result to the collector by means of a proper send primitive. Finally, it signals again to the emitter that it is available for another task assignment. A worker can be signaled of reconfiguration commands before receiving


a task and after the corresponding sequential elaboration. In the former case the worker elaboration starts with a non-deterministic reception (i.e. an alternative command) among the input channel from the emitter and the reconf_channel from the Control Part. In this case, if reconfiguration commands are received, there are no pending tasks on the worker. In the latter case, reconfiguration commands can be signaled to the worker before its result has been transmitted (a transmission which can also suspend the process). For this reason the worker result transmission uses a second alternative command in which an output guard involves a send primitive to the collector (for providing the result) and an input guard from the Control Part checks the presence of reconfiguration commands. In this case there is one pending result in the worker process during the reconfiguration.

Figure 24. Worker behavior for a task-farm scheme.

In Figure 24 the worker behavior is depicted, considering both the input task reception and the subsequent result transmission to the collector process. Worker i considers a first alternative command for the reception of tasks from channel task_i, and a second non-deterministic selection for forwarding the results to the output channel result_i. Reconfiguration commands can be received in both situations by adding a further input guard from reconf_channel in each of the two alternative commands.


8. Consistent Reconfiguration Protocols

A main issue when a ParMod Operating Part applies a reconfiguration consists in guaranteeing that the application semantics is respected. In fact, the activities performed by processes when applying reconfigurations may modify the semantics of the computation performed by the component, depending on the kind of applied reconfiguration and on the parallel structure implemented by the Operating Part. We denote protocols which guarantee that the component semantics is not modified during reconfiguration as consistent reconfiguration protocols. This requirement may be issued either by the client or by the application programmer: that is, in some cases there is no need for reconfiguration activities to respect the application semantics. To clarify the goal of a consistent reconfiguration protocol, consider the following examples:

• as described in Section 3., consider the case in which we reduce the parallelism degree of a data parallel program. This must be done by performing a proper protocol: otherwise, if we remove a worker without any further effort, we lose the partition which it was processing, and other workers may wait indefinitely to receive a message (satisfying the program stencil) from the removed worker. A solution is to wait for the conclusion of the current task and to remove the worker before starting to execute the next task. In another case, we may be required to perform the reduction during the execution of a task: to do so, we can re-assign the partition of the removed worker to the other workers. This must be done with some care: in fact we cannot assign state elements belonging to different computing steps to the same worker because, for instance, the worker program would not be correct when resolving the stencil (clearly, this depends on how workers are implemented; for the purposes of this example we assume that worker programs are developed assuming that state elements in a same partition always refer to the same data parallel step at any time in the computation). Therefore, we may first force workers to reach a common computing step, then re-assign the partition of the removed worker, and finally remove the worker;

• when the Operating Part performs an operation switching from a given operation (source operation) to another one (target operation), it is required that the tasks currently in execution on the source operation are


not lost. To do so, we can perform the following set of actions before applying the switching: stop receiving input tasks; wait for the termination of the tasks which are currently in execution; deliver their results to the ParMod output stream. Otherwise, we may transfer from the source to the target operation those tasks which are currently in execution on the source operation, by taking proper computing snapshots, which clearly depend on the semantics of both operations.

Note that these protocols are different from the ones used to support fault tolerance, as the former have to guarantee that the whole reconfiguration phase is consistent w.r.t. the application semantics, whereas fault tolerance protocols: (i) in some cases have to restore consistency after a failure that has modified the computation semantics; (ii) in some other cases need to guarantee continuous consistency in spite of failures, i.e. they are applied during the whole execution (e.g. replication-based techniques).

In this section we assume that, in the supported applications, tasks are independent: for instance, we do not consider the case of tasks which require the existence of an input state shared between different tasks. These application cases can simply be modeled in our independent task model by aggregating dependent tasks into a single super-task and by executing its sub-tasks all inside the same parallel computation (e.g. as in the case of the sub-tasks of a data parallel program). Moreover, we assume that each task execution produces a result on the output streams.

In general terms, consistent reconfiguration protocols are applied to the situation modeled in Figure 25: a ParMod consumes from a set of input streams and produces results to a set of output streams. If we take a computation snapshot, we can see that there are: (i) a set of tasks T_IN which are still to be received and which are stored in the input streams; (ii) a set of tasks T_P currently in execution on the ParMod; (iii) a set of task results T_OUT previously produced on the output streams by the ParMod. In this sense, the goal of a consistent reconfiguration protocol is to properly manage the set of T_P tasks.

Figure 25. Graphical representation of a ParMod computation snapshot highlighting the relevant information related to tasks. T_IN is the set of tasks produced on the input streams but not yet consumed by the ParMod. T_P is the set of tasks currently in execution on the ParMod. T_OUT is the set of tasks whose results have been previously produced on the output stream.

We now formalize the concept of consistency implemented by a reconfiguration protocol. In this section we consider two definitions: the first definition only requires that all tasks passed to the ParMod are executed and their results are delivered to the output streams:


Definition 1 (Weak Consistency). All elements produced to the ParMod input streams are processed by the ParMod and their results are delivered to the intended consumers.

Note that this definition does not allow a result to be lost, but it permits the replication of results. The second definition also forbids the replication of results:

Definition 2 (Strong Consistency). All elements produced to the ParMod input streams are processed by the ParMod and their results are delivered to the intended consumers at most once.

In this section we introduce a formalization methodology for consistent reconfiguration protocols which is based on a proper modeling tool enabling us to define protocols in terms of tasks and results.

8.1. Formalization of Reconfiguration Protocols

As hinted, a consistent reconfiguration protocol must manage tasks and their results (i.e. stream elements) in a proper way. For this purpose we introduce a modeling tool enabling us to uniquely identify stream elements, and which requires that such elements may be recovered at any instant of the computation by accessing the stream with the proper element name. The tool is inspired by the Incomplete Structure (or I-Structure) data structure, introduced with other purposes in data-flow programming models [3] and previously used in [11] to model fault tolerance protocols.


In our work an I-Structure is a possibly unlimited collection of typed elements where each element has a sequence identifier, or position. In this paper we map sequence identifiers to integer numbers. There are two ways of accessing an I-Structure:

• we can read the element stored in a given position. This operation is denoted with get(position, element) and, in case the provided position is empty, it blocks until a value is produced on that position (i.e. it operates according to a blocking semantics);

• we can write a value to a given position. This operation is denoted with put(position, element) and it features the following write-once property: it is not possible to perform a put operation more than once on each I-Structure position.
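A minimal sketch of such an I-Structure is shown below in Go (our own illustration, not the ASSISTANT run-time support): put enforces the write-once property and get blocks until the requested position has been filled.

    package main

    import (
        "fmt"
        "sync"
    )

    // IStructure: unbounded collection of typed elements indexed by position,
    // with write-once put and blocking get.
    type IStructure[T any] struct {
        mu    sync.Mutex
        cond  *sync.Cond
        items map[int]T
    }

    func NewIStructure[T any]() *IStructure[T] {
        s := &IStructure[T]{items: make(map[int]T)}
        s.cond = sync.NewCond(&s.mu)
        return s
    }

    // Put writes element at position; a second put on the same position is refused.
    func (s *IStructure[T]) Put(position int, element T) error {
        s.mu.Lock()
        defer s.mu.Unlock()
        if _, ok := s.items[position]; ok {
            return fmt.Errorf("position %d already written", position)
        }
        s.items[position] = element
        s.cond.Broadcast()
        return nil
    }

    // Get blocks until the element at position has been produced, then returns it.
    func (s *IStructure[T]) Get(position int) T {
        s.mu.Lock()
        defer s.mu.Unlock()
        for {
            if v, ok := s.items[position]; ok {
                return v
            }
            s.cond.Wait()
        }
    }

    func main() {
        in := NewIStructure[string]()
        go func() { in.Put(0, "task-0") }()
        fmt.Println(in.Get(0))              // blocks until position 0 is written
        fmt.Println(in.Put(0, "task-0bis")) // write-once: returns an error
    }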

Figure 26. I-Structure model of ParMod input and output streams. I-Structure elements colored with a darker gray denote elements which have been consumed, while elements colored with a lighter gray denote elements produced on the stream but not yet consumed. The shown indexes denote the last consumed elements (i.e. j, l) and the last produced elements (i.e. i, k).

In Figure 26 we show how the I-Structure tool is used: all input streams of a ParMod are mapped to a single I-Structure denoted with IN, and all output streams are mapped to a further I-Structure denoted with OUT. When accessing IN or OUT, the producers, the consumers and the ParMod processes must arrange the generation of positions in such a way that both the write-once property and the application semantics are respected. For a correct understanding of how I-Structures are used, note that there is no notion of ordering between the input IN and output OUT streams: broadly, a


task produced on the input stream has an index i_IN and its corresponding result has an index i_OUT on the output stream, where it may be that i_IN ≠ i_OUT, i.e. tasks and results are not necessarily ordered in the same way on the input and output streams. In this model we can identify, for each I-Structure, two indexes: the index of the last consumed element (e.g. j, l in the figure) and the index of the last produced element (e.g. i, k in the figure). By using these indexes we can precisely characterize the task sets shown in Figure 25: for instance, T_P can be defined as the set of elements with indexes on the input stream from 0 to j, from which we subtract all elements whose results have been produced to the output stream, i.e. the result elements with indexes on the output stream from 0 to k. This index-based modeling is especially useful when proving the correctness of consistent reconfiguration protocols. Nevertheless, in this paper we are interested in a more practical description of consistent protocols and we do not give correctness proofs. For this purpose we define reconfiguration protocols at the level of the implementation, by mapping the information of the abstract I-Structure level down to the implementation level.

8.2. Implementation

We have seen that in the ASSISTANT implementation we map each stream element to a message in the corresponding channel. To correctly implement the I-Structure abstract model we have to extend the described implementation. In fact, the I-Structure abstract model requires that, at the implementation level, elements (or messages) can be recovered at any time during the computation. If we look at the channel implementation we can see that, when a message is received (i.e. extracted) from the message queue, its content can be overwritten by successive messages. Therefore, we have to extend the implementation level (see Section 6.) to guarantee message recovery. We can think of at least two ways of implementing this property:

• we can support communication channels with message logging techniques [21]: when a message is stored to the message queue it is also copied to an external memorization support and it cannot be overwritten. If a process needs to recover a given element, i.e. an element that it has previously received, it accesses the external memorization support with the proper position;

High-Performance Pervasive Computing

69

• we can require all application components to re-generate elements upon request. Also in this case, a component can recover a previously received element by performing a special request to the generating component with the related element position.

Note that, on one hand, message logging techniques may induce an additional overhead on communications, which is experienced during the whole computation, i.e. not only when applying a reconfiguration. The implementation based on element re-generation is not affected by this additional overhead. On the other hand, message logging techniques do not require components to re-generate elements, thus limiting the scope of a reconfiguration protocol to the reconfigured component itself. This means that in the message logging-based implementation reconfiguration protocols scale better with the number of application components. In contrast, we cannot define an upper bound on the number of involved components during a reconfiguration when we select re-generation to recover stream elements. In this paper we show an implementation based on element re-generation; we will show a message logging-based implementation of I-Structures in future work. Nevertheless, the protocols presented in this paper are valid independently of the way in which stream elements are recovered.
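The following Go sketch (illustrative, hypothetical types) outlines the first option: a channel wrapper that copies every sent element, labeled with its sequence identifier, to a log, so that a previously received element can be recovered by position without asking the producer to re-generate it.

    package main

    import (
        "fmt"
        "sync"
    )

    // Element is a stream element labeled with its input stream sequence identifier.
    type Element struct {
        Seq     int
        Payload string
    }

    // LoggedChannel couples a message queue with an external log of all sent
    // elements, so that received (extracted) elements can still be recovered.
    type LoggedChannel struct {
        queue chan Element
        mu    sync.Mutex
        log   map[int]Element
    }

    func NewLoggedChannel(capacity int) *LoggedChannel {
        return &LoggedChannel{queue: make(chan Element, capacity), log: make(map[int]Element)}
    }

    func (c *LoggedChannel) Send(e Element) {
        c.mu.Lock()
        c.log[e.Seq] = e // copy to the memorization support before enqueueing
        c.mu.Unlock()
        c.queue <- e
    }

    func (c *LoggedChannel) Receive() Element { return <-c.queue }

    // Recover returns a previously sent element by its sequence identifier.
    func (c *LoggedChannel) Recover(seq int) (Element, bool) {
        c.mu.Lock()
        defer c.mu.Unlock()
        e, ok := c.log[seq]
        return e, ok
    }

    func main() {
        ch := NewLoggedChannel(4)
        ch.Send(Element{Seq: 1, Payload: "task-1"})
        _ = ch.Receive() // the message is extracted from the queue...
        e, _ := ch.Recover(1)
        fmt.Println("recovered:", e.Payload) // ...but can still be recovered
    }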

Figure 27. Implementation of the I-Structure model: input and output streams are respectively implemented with two communication channels. Recovery of elements is implemented with re-generation from components, i.e. without message logging.

The implementation of the I-Structure model for streams is shown in Figure 27. All input streams are mapped to a single channel CH_IN and all output streams to a single channel CH_OUT. For the purpose of re-generation, elements are labeled


with their input stream sequence identifier when they are produced, i.e. when performing a put (or a send to CH_IN) the sent element includes the element content itself along with the input stream sequence identifier. The input stream identifier is also preserved when the corresponding task result is produced to the output stream, i.e. the element passed to the output stream includes the result plus the sequence identifier of the related task on the input stream. In this way we guarantee that all processes have a common view of task and result identifiers and can straightforwardly map results to be re-produced to their corresponding input tasks.

Finally, we define the notion of Vector Clock (VC, in short), which is used in the definition of some kinds of consistent recovery protocols. A Vector Clock models a mapping between I-Structure sequence identifiers. In this specific case, for re-generation purposes we are required to map output stream identifiers to the corresponding input stream identifiers (i.e. result identifiers to the related input task identifiers). This is needed because, as hinted, input and output streams are not necessarily ordered between themselves. Generically, a Vector Clock includes information of the form shown in Figure 28.

Figure 28. Model of a Vector Clock mapping output stream (OUT) sequence identifiers to input stream (IN) ones. The VC is ordered w.r.t. the output stream sequence identifiers, hence in some cases we can omit the OUT fields.

On the top of the VC we put the sequence identifiers of the results on the output stream (from 1 to N); these are related to the corresponding sequence identifiers of the input tasks generating the results (from k_1 to k_N). In a VC we can identify the maximum contiguous sequence identifier MC, in the set k_1, . . . , k_N, as the maximum integer included in the input stream identifiers such that all its predecessors are also included in the set k_1, . . . , k_N. For instance, in the example of Figure 29, MC = 3, as the mapped identifiers include all numbers from 1 to 3, and the next included number is 12.


OUT   1    2    3    4    5    6
IN    18   3    15   1    2    12

Figure 29. Example of VC in which input and output identifiers are un-ordered. This is the typical case resulting from a task farm computation in which the task execution time has a large deviation.

The value of MC also depends on the ordering of tasks on the input stream w.r.t. the ordering of the corresponding output stream results. For instance, in a data parallel computation the input and the output streams are ordered between themselves, hence MC is also the maximum received sequence identifier of results. This is not true for the task farm case, as the input and output streams may be un-ordered. In some cases, depending on how we use the information included in a Vector Clock, we can reduce its size by removing all elements in the set k_1, . . . , k_N with sequence identifiers smaller than MC.
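The sketch below (Go, illustrative) computes MC and the set of identifiers that a generic rollback protocol (Section 8.3.) would ask the generators to re-generate, given the input stream identifiers recorded in the VC; with the data of Figure 29 it yields MC = 3 and the missing identifiers 4 to 11, 13, 14, 16 and 17.

    package main

    import "fmt"

    // mc returns the maximum contiguous sequence identifier: the largest m such
    // that all identifiers 1..m appear in the vector clock.
    func mc(vc []int) int {
        present := make(map[int]bool, len(vc))
        for _, id := range vc {
            present[id] = true
        }
        m := 0
        for present[m+1] {
            m++
        }
        return m
    }

    // missing returns the identifiers between mc+1 and the maximum identifier in
    // the vector clock that do not appear in it: these are the tasks the target
    // operation asks the generators to re-generate.
    func missing(vc []int) []int {
        present := make(map[int]bool, len(vc))
        max := 0
        for _, id := range vc {
            present[id] = true
            if id > max {
                max = id
            }
        }
        var out []int
        for id := mc(vc) + 1; id <= max; id++ {
            if !present[id] {
                out = append(out, id)
            }
        }
        return out
    }

    func main() {
        vc := []int{18, 3, 15, 1, 2, 12} // IN identifiers of Figure 29
        fmt.Println("MC =", mc(vc))      // 3
        fmt.Println("to re-generate:", missing(vc))
    }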

8.3. Consistent Operation Switching Protocols

We exemplify consistent reconfiguration protocols by implementing the notable case of operation switching reconfigurations. The interested reader can refer to [44] for examples of consistent reconfiguration protocols for the modification of the parallelism degree. There are at least three ways of implementing a consistent operation switching protocol:

• we can wait for the source operation to perform all tasks in T_P and then make the target operation start from the first task which was not consumed by the source operation. When the source operation is notified of the reconfiguration, it first needs to stop receiving input tasks from IN. To re-start from the correct input task, the target operation must know the position of the last value consumed by the source operation. We denote these kinds of protocols as rollforward protocols;


• when requested to perform an operation switching from the Control Part, the source operation can simply stop its execution. The target operation has to re-obtain the tasks in T_P and re-start their execution. We denote these kinds of protocols as generic rollback, because they can be supported independently of the actual parallel computation performed by the source and the target operations. The design of these kinds of protocols requires the target operation to characterize the T_P set, possibly sharing information with the source operation. A solution consists in analyzing the IN and OUT streams and a VC relating them;

• we can transfer the set of T_P tasks from the source to the target operation. Note that, strictly speaking, the values of the tasks in the source operation may be modified by their execution (e.g. as in some data parallel programs). We then have two options: (i) all processes in the Operating Part keep a copy of the task they are processing; (ii) depending on the parallel program implemented by the source and target operations, we can define a mapping between the two computations without necessarily rolling back to the input task definition. In both cases we denote these kinds of protocols as specialized rollback or rollforward, because their design depends on the parallel program implemented by the switched operations.

In this section we describe: a rollforward and a generic rollback protocol; two specialized rollback/rollforward protocols supporting operation switching when the source and target operations are both data parallel or both task farm programs. In the description we assume that the target operation has been previously deployed and is ready to start its execution.

Generic Rollforward

As hinted, a rollforward protocol is based on “flushing out” all the tasks in T_P in the source operation and then starting the execution of the target operation from the first task which was not consumed by the source operation. To do so, we can simply connect both operations to the input and output channels CH_IN and CH_OUT and implement the following sequence of actions on the ParMod Control and Operating Parts:

• the Control Part notifies the Operating Part processes of the source operation to perform an operation switching;


• these processes go on with their computation, except that they stop receiving tasks from the input stream;

• when the source operation processes have completed all tasks and sent their corresponding results to the output stream, they signal to the Control Part that they have terminated the rollforward protocol;

• the Control Part notifies the processes of the Operating Part of the target operation to start their execution;

• these processes simply start to receive input tasks from CH_IN, whose message queue contains the last un-received task. This is automatically obtained from the characteristics of the ASSISTANT implementation (see Section 6.).

Note that this protocol implements the strong consistency definition, as all tasks are executed and result duplication is avoided. The applicability of this protocol depends, from a performance viewpoint, on the time available to perform the operation switching and on the time needed to execute the protocol. This time includes three communications between the Operating and Control Parts, plus the time T_rollfwd needed to complete all tasks currently in execution on the source operation. This latter time depends on the kind of parallel computation implemented by the source operation. We consider two notable cases:

• if the Operating Part of the source operation implements a task farm, we have to consider the parallel execution of all tasks possibly in execution on the workers (each of which costs T_F), plus the time needed to schedule, execute, collect and deliver the last task received from the input stream and currently held by the emitter. The time needed to flush this last task (T_E + T_W + T_C) can be overlapped with the time needed to collect all results of the tasks currently executed on the workers, i.e. N · T_C. Therefore we can define:

    T_rollfwd^farm ≤ 2 T_F + T_E + T_C      if T_E + T_W ≥ N · T_C
    T_rollfwd^farm ≤ T_F + N · T_C          otherwise

Note that this evaluation of T_rollfwd is pessimistic, in the sense that some workers may be idle and/or the emitter may not have a task to be scheduled;

• if the Operating Part of the source operation implements a data parallel computation, we have to consider the termination of the task currently in execution (taking time T_ws) plus the time needed to scatter, execute and gather a task possibly present on the scatter process, assuming that T_S + T_ws > T_G, or the time needed to gather and deliver the result to the output stream otherwise. Therefore:

    T_rollfwd^dp ≤ 2 T_ws + T_S + T_G      if T_S + T_ws ≥ T_G
    T_rollfwd^dp ≤ T_ws + T_G              otherwise

Generic Rollback

In a rollback protocol we stop the execution of the source operation and we immediately switch to the execution of the target operation. To do so, the Control Part sequentially signals the processes in the Operating Part of both operations (i.e. it first stops the source operation processes, and then it signals the start to the target operation processes). The target operation needs to obtain all tasks in T_P from the related generators. To do so, it needs to access the Vector Clock mapping output to input sequence identifiers, which can be provided directly by the source operation: the process producing results on the output streams of the source operation can generate the corresponding VC by simply analyzing the sequence identifier appended to each result received from its workers. When this process is signaled to perform an operation switching according to a generic rollback protocol, it passes some kind of information to the process of the target operation accessing the input streams. Depending on the information passed, we obtain different protocols:

• the information passed can consist of the MC value: in this case the target operation will request the generators to re-generate the elements with sequence identifiers greater than MC. Note that some of these elements may have been previously executed and their results sent to the output stream (this cannot happen if the input and output streams are ordered, as described above). That is, the MC information alone may induce a duplication of results: therefore this version of the generic rollback protocol implements the weak consistency definition;

• the source operation can pass the whole VC to the target operation, possibly reducing its size as described above. In this case the target oper-


ation guides the re-generation of the elements in T_P by issuing to the generator components the requests for the elements corresponding to the missing sequence identifiers, which are found by scanning from MC to the maximum sequence identifier of the submitted results in the VC. For instance, looking at the example of Figure 29, the target operation will issue requests for the elements with sequence identifiers from 4 to 11, and 13, 14, 16 and 17. Note that this protocol avoids the re-execution of previously performed tasks, hence it implements the strong consistency definition.

As in the rollforward protocol, the target operation, after recovering and re-executing all tasks in T_P, can re-start receiving messages from the input channel. To avoid the replication of input tasks we can, for instance, clear the input channel CH_IN (other solutions can be based on comparing the sequence identifiers of received tasks and submitted results). Therefore, the target operation performs the following sequence of steps in the two versions:

• the target operation issues the required elements to the generators, i.e. it requests all messages with sequence identifiers from MC + 1 to the sequence identifier of the message on the top of the CH_IN queue (first version), or it directly passes the VC to the generators;

• the generators start sending the elements whose sequence identifiers are greater than MC (i.e. from MC + 1 onwards) or which are missing in the VC, and then re-start from the last sequence identifier sent before the switching;

• when all elements are recovered, the target operation goes on receiving elements from channel CH_IN.

The choice of applying this kind of protocol depends on the amount of work which we can lose. We can quantify this work depending on the parallel computation performed by the source operation:

• if the source operation implements a task farm computation, we have to re-execute at most N + 2 tasks, corresponding to one task for each of the N workers, plus a task on the emitter and a task on the collector. In addition, in the first version of the protocol we also have to count all the tasks whose results have been delivered in an un-ordered way to the

output stream. This value depends on the relative speed of the workers and on the deviation of the actual task execution times w.r.t. their average (i.e. T_F);

• if the source operation implements a data parallel program, we have to re-execute at most three tasks: one on the scatter process, one currently executed by the workers, and one corresponding to the result processed by the gather process. As hinted, the input and the output streams are ordered between themselves, hence MC is equal to the maximum sequence identifier of the tasks whose results have been sent to the output stream. Therefore, in both versions of the protocol there are no further tasks which must be executed.

9. Emergency Management Application on a Pervasive Platform

In our research work within the Italian FIRB In.Sy.Eme. (Integrated Systems for Emergency) Project (project reference FIRB RBIP063BPH) we have focused on Emergency Management applications, and we have applied the ASSISTANT model to implement a flood management application.

These kinds of applications are of special interest in the context of High-Performance Pervasive Computing, as they include highly integrated computing and communication components, which are mapped to platforms including wireless and wired communication technologies and strongly heterogeneous computing resources (i.e. an instance of the platform described in Section 3.). The application described in this chapter is represented in Figure 4 and it includes the following set of components:

• the third-party (wireless) sensor network (WSN) includes a set of sensors deployed along and around the upper part of a river basin, with the aim of monitoring both punctual precipitations and the river conditions (e.g. water depth and punctual water speed). Data collected from the sensors are stored in a database and, if needed, they can be produced as a data stream to other components;


• a third-party Geographical Information System (GIS) stores both real-time monitored sensor data and historical data about the geographical development of the area of interest;

• real-time precipitation data are also collected from facilities such as meteorological satellites and observation laboratories. Meteorological forecasts computed in the past may also be used;

• real-time and historical data are the input of a complex precipitation forecasting model. We do not focus on this specific component, but we note that its implementation can be performed by using an ASSISTANT ParMod, also depending on its QoS requirements;

• the main core of the ASSISTANT application is the flood forecasting component, which takes as input meteorological forecasting data as well as GIS data, and returns near- or far-future forecasts of the river conditions in large or specific areas. Below we show a ParMod implementing this component;

• a decision support system is applied to map flood and other forecasts to proper procedures, aiming at avoiding disasters or at managing them in case they cannot be avoided;

• the Clients of the application are both the civil protection personnel in an institutional center and the mobile operators directly involved in the emergency management activities. Both operate by observing flood forecasts and applying the results of the decision support system.

In this chapter we focus on the flood forecasting component, whose main computing core corresponds to the resolution of a system of differential equations modeling the river flow starting from discretized river information. Below, we give a description of how the flood forecasting model can be expressed in algorithmic terms, of how these algorithms can be parallelized to target different computing platforms, and of the resulting performance models, which can be used as a starting point to define adaptivity strategies for this component.

In Figure 30 we exemplify a platform for mapping the flood emergency management application. The exemplified platform includes two main areas:


Figure 30. Representation of an example platform for flood management applications, characterized by a centralized metropolitan area and a decentralized emergency area. In the metropolitan area we can find the civil protection central division, which is in charge of coordinating the whole emergency management activities. In the decentralized area we can find sensors on the upstream and downstream parts of the river and mobile operators supported by PDAs, forming a structured or ad-hoc network. The centralized and decentralized areas are connected by means of a set of wired and wireless links, represented with antenna and router symbols. Some wireless links also provide access to their computational support, which is here expressed as a multicore router.

• a centralized metropolitan area in which the civil protection central division is situated. The central division has the task of controlling the overall emergency management activities and it is supported by a certain


number of servers: in this example we assume that it is supported by a shared memory parallel architecture (e.g. an SGI Altix node) and a cluster architecture (e.g. an IBM BlueGene/L node);

• a decentralized area near the monitored environment. In this area we include a set of upstream and downstream sensors, as well as the mobile operators directly involved in the emergency management activities and controlled by the central division.

The two areas are internally connected, and inter-connected between themselves, by a series of wired and wireless communication links covering a large geographical area. For instance, we can employ WiMAX technologies to cover large geographic areas and 802.11 technologies for localized communications (e.g. inside a wireless network of mobile nodes), both in infrastructured and ad-hoc modalities. Note that some of the network links also provide computing services, which we can assume to be supported by a multicore processor.

Considering the overall mapping of the application onto the described platform, we can define two typical use-cases which we are interested in supporting:

• an off-line forecasting phase, in which we run the flood forecasting component to understand whether in the remote future the monitored area will be subject to an emergency. This phase is mostly related to risk prevention and it is characterized by less stringent performance requirements than the following use-case. In practice, this activity gives detailed information on the future situation of the whole monitored area over a temporal horizon of weeks. In typical settings, a complex hydrodynamic software is executed on the central servers, acquiring input data from all available monitors (e.g. sensors and satellites) and delivering results exclusively to the monitoring board of the civil protection central division;

• an on-line or real-time forecasting phase, in which we run the flood forecasting component to support emergency management activities while the emergency is in progress, i.e. for the nearest future. This phase has strong constraints on the quality of the results and on the time at which they are delivered. Clients of the application are either the personnel at the central civil protection division, or the mobile operators directly involved in


the emergency management. For this use-case we are required to map the forecasting component in such a way that we dynamically guarantee the quality, timing and availability constraints of the application during the whole emergency. Therefore, we can initially map the component to the central servers but, if needed, we can dynamically modify its mapping to decentralized nodes, and in some cases to the client devices themselves (e.g. to the clients' PDAs). The input of the forecasting component can be a subset of the sensors, also depending on the computational complexity which can be reached with the available computing support and on the actual users' needs. For instance, we can employ only the downstream sensors to minimize the communication latency between the components providing the input and the one performing the computation.

After describing the flood forecasting algorithms, their parallelization and their performance characterization in different scenarios, in this section we also show how the ASSISTANT model can be used to define an adaptive flood forecasting component, and we analyze some notable dynamic cases.

9.1. Flood Forecasting: Parallel Algorithms

The main core of the flood forecasting component can be expressed in terms of a bi-dimensional hydrodynamic model. To do so, we define a bi-dimensional discretization of the river basin and, for each river point of the model, we solve a system of partial differential equations modeling the conservation of mass and momentum with the following parameters (see [41] for the definition of the equations): water surface elevation; depth-averaged velocity components in the X and Y directions; depth of water; distance in the X and Y directions; horizontal diffusion of momentum coefficient; Coriolis coefficient; Chézy coefficient; logical time at which the variable values are collected. The result is represented by the sum of the components of the external forces in the X and Y directions. We can solve the system according to a finite difference method, broadly obtaining a task consisting of, for each system, the resolution of four tri-diagonal systems of linear equations. Therefore, the main computing core of this forecasting component can be defined as follows (see Figure 31): it takes as input the information described above from the GIS and/or the WSN; it instantiates the system of partial differential equations and it solves it by solving the related


Figure 31. Scheme representing input data, computation and output data of an implementation of the flood forecasting component: river basin data at each point (water depth, speed, etc.) feed Phase 1 (generation of the system of partial differential equations for mass and momentum) and Phase 2 (resolution of the differential equations with a finite difference method), producing the X and Y force components at each river point.

four tri-diagonal systems according to the chosen finite difference method; it returns the values of the external force components. We model this computing core as an ASSISTANT component where:

• input stream elements include all the information described above, as 8 double precision floating point values;

• the computation quality can be configured by selecting finer or coarser grains of time discretization, in fact impacting on the size of each solved tri-diagonal system;

• the output stream elements are vectors expressing the force components, whose size varies with the size of the solved systems.

There exist multiple optimized techniques for solving tri-diagonal systems of linear equations [20]. In this paper we employ direct methods, i.e. methods which attempt to find an exact solution in a fixed number of steps, which only depends on the system configuration (e.g. its size). This property permits a precise definition of performance models, which cannot always be obtained for methods whose termination is defined by means of a dynamically evaluated condition (e.g. iterative methods). Examples of direct approaches for tridiagonal systems are twisted factorization and cyclic reduction [20]. In principle, we may define multiple ParMod versions by using different methods. To avoid an excessive discussion


of mathematical aspects, in this chapter we focus on cyclic reduction methods. These methods are also especially attractive as they can be easily generalized to banded and block tridiagonal systems [20]. We focus on two cyclic reduction algorithms which can be derived from the same formalization [32]. The choice of two algorithms is motivated by the fact (discussed below) that each algorithm is best suited for a specific parallelization scheme.

9.1.1. First Algorithm

This algorithm includes two main procedures. In the first part (denoted as transformation) the input system, including N rows, is transformed in q − 1 steps (where q = log_2(N + 1)). At each step l we consider all rows i such that i mod 2^l = 0 and for each one of them we solve the following set of equations:

    a_i^l = α_i · a_{i−2^{l−1}}^{l−1}
    c_i^l = γ_i · c_{i+2^{l−1}}^{l−1}
    b_i^l = b_i^{l−1} + α_i · c_{i−2^{l−1}}^{l−1} + γ_i · a_{i+2^{l−1}}^{l−1}
    k_i^l = k_i^{l−1} + α_i · k_{i−2^{l−1}}^{l−1} + γ_i · k_{i+2^{l−1}}^{l−1}
    α_i = −a_i^{l−1} / b_{i−2^{l−1}}^{l−1}
    γ_i = −c_i^{l−1} / b_{i+2^{l−1}}^{l−1}                                    (12)

where a_i, b_i, c_i and k_i are the diagonal coefficients and the constant term of the i-th row. The superscripts denote the computational step at which their values are taken; α_i and γ_i are used in this notation to make the equations easier to read. The stencil, i.e. the functional dependencies between successively computed values, refers to the same element i and to two neighbors: rows i − 2^{l−1} and i + 2^{l−1}.

The second part of this algorithm solves the system according to a fill-in procedure, and we denote it as resolution. This part includes q steps, for l = q, q − 1, . . . , 1. At each step l we consider all rows i for which i mod 2^l = 0 and we solve the following equation:

    x_i = (k_i^{l−1} − a_i^{l−1} · x_{i−2^{l−1}} − c_i^{l−1} · x_{i+2^{l−1}}) / b_i^{l−1}        (13)

The solution can be computed directly by solving this equation at each step, and we do not need to transform the x values. As it can be noted, the same stencil of the first part of the algorithm is applied. In Figure 32 we show a graphical representation of the stencil of this algorithm.


Figure 32. Stencil of the first cyclic reduction algorithm. The vertical lines represent the system transformation and resolution operations, where each line denotes the computation on a single system line.
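For concreteness, the following Go sketch gives a sequential implementation of this cyclic reduction algorithm (our own illustration; 1-based indices, virtual boundary rows x_0 = x_{N+1} = 0, N = 2^q − 1, and the standard textbook step ordering, which may differ in minor details from the variant described above). It reproduces the transformation (12) and resolution (13) phases; the parallel ASSISTANT versions distribute the per-row computations of each step.

    package main

    import "fmt"

    // cyclicReduction solves a tridiagonal system with n = 2^q - 1 rows.
    // a, b, c, k are the sub-diagonal, diagonal, super-diagonal coefficients
    // and the constant terms, stored in positions 1..n (0 and n+1 are padding).
    func cyclicReduction(a, b, c, k []float64, q int) []float64 {
        n := (1 << q) - 1
        // transformation: q-1 steps, updating rows i with i mod 2^l == 0
        for l := 1; l < q; l++ {
            h := 1 << (l - 1)
            for i := 1 << l; i <= n; i += 1 << l {
                alpha := -a[i] / b[i-h]
                gamma := -c[i] / b[i+h]
                a[i], c[i], b[i], k[i] =
                    alpha*a[i-h],
                    gamma*c[i+h],
                    b[i]+alpha*c[i-h]+gamma*a[i+h],
                    k[i]+alpha*k[i-h]+gamma*k[i+h]
            }
        }
        // resolution: fill-in from the most reduced row down to the odd rows
        x := make([]float64, n+2) // x[0] and x[n+1] stay 0 (virtual boundaries)
        for l := q; l >= 1; l-- {
            h := 1 << (l - 1)
            for i := h; i <= n; i += 1 << l {
                x[i] = (k[i] - a[i]*x[i-h] - c[i]*x[i+h]) / b[i]
            }
        }
        return x[1 : n+1]
    }

    func main() {
        // 1D Poisson-like system of size 7: -x[i-1] + 2x[i] - x[i+1] = k[i]
        q, n := 3, 7
        a := make([]float64, n+2)
        b := make([]float64, n+2)
        c := make([]float64, n+2)
        k := make([]float64, n+2)
        for i := 1; i <= n; i++ {
            a[i], b[i], c[i], k[i] = -1, 2, -1, 0
        }
        a[1], c[n] = 0, 0
        k[1], k[n] = 1, 1 // right-hand side chosen so that x = (1, 1, ..., 1)
        fmt.Println(cyclicReduction(a, b, c, k, q))
    }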

9.1.2. Second Algorithm

The second algorithm also includes transformation and resolution procedures. In this algorithm the transformation procedure includes the same number of steps as the previous algorithm, but at each step we apply the equations (12) to all system rows, rather than to a subset of rows as in the first algorithm. The stencil pattern of this first part is the same as in the first algorithm, but it includes all rows at all steps. Due to the higher computing effort in the transformation part (w.r.t. the first algorithm), the resolution part can be compressed into a single step in which we solve the following equation for all system rows: x_i = k_i^q / b_i^q. In this case the stencil is defined on the same row, i.e. row i only needs its previously transformed coefficients resulting from the last transformation step. Figure 33 shows a graphical representation of the stencil pattern of this algorithm.


Figure 33. Stencil of the second cyclic reduction algorithm.

9.1.3. Discussion

We discuss the features of each algorithm to show which parallelization scheme is best suited to it. The performance features of the described algorithms can be characterized as follows:

• Number of steps: the first algorithm performs q − 1 = log_2(N + 1) − 1 steps during the transformation part and q = log_2(N + 1) in the resolution one. The second algorithm performs fewer steps: q = log_2(N + 1) transformation steps and only one resolution step. Therefore the first algorithm performs more steps than the second algorithm.

• Number of operations: the first algorithm performs a lower number of operations in the first part w.r.t. the second algorithm. This is because the second algorithm applies, at each step, the equations (12) to all system elements, instead of only a subset of them (a rough count is sketched after this list). The second part of both algorithms involves the same number of operations.

• Number of functional dependencies: the first algorithm includes a lower number of functional dependencies because, at each step of the transformation part, equations (12) are solved only for a subset of elements.
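A back-of-the-envelope count makes the second point concrete (this is only a sketch, counting transformed rows rather than single floating point operations, and assuming N = 2^q − 1). The first algorithm updates about (N+1)/2^l − 1 rows at transformation step l, so over the whole transformation part it touches

    \sum_{l=1}^{q-1}\Bigl(\frac{N+1}{2^{l}}-1\Bigr) \;=\; (N+1)\bigl(1-2^{1-q}\bigr)-(q-1) \;\approx\; N - q

rows, whereas the second algorithm updates all N rows at each of its q steps, i.e. about N\,\log_2(N+1) row updates, roughly a logarithmic factor more work.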


9.2. Parallel Cyclic Reduction Algorithms

As discussed above, the first algorithm minimizes the number of operations performed in the whole computation. Thus, it is reasonable to parallelize it according to the task farm model, which does not have the strong requirements, in terms of communication efficiency, of the data parallel solution. The second algorithm, instead, is well-suited for a parallelization based on a data parallel scheme. The task farm program can be derived by simply performing the following substitutions in the skeleton program shown in Figure 6:

• the task data type is a struct of 8 double precision floating point values;

• the result data type includes two arrays of double precision floating point values, each with size N;

• the compute function takes a task as input, generates the four tridiagonal systems and applies to each one of them the transformation and resolution procedures described above.

For the data parallel program we show the pseudo-code in the listing below. The parmod TridiagonalSolver is connected to an input stream of tasks (task data structure). For each task the input section generates four tridiagonal systems, storing them in the inputSystem variable, and cyclically scatters them to the virtual processor state (see below). For each system, the virtual processors apply the transform and solve procedures, whose implementation solves the equations described above on a system row. Finally, the output section performs a gather communication of the solution of each system by means of the specific ALL construct. Note that in this operation we have defined a single-sized array topology, as in this abstract program we map each virtual processor to a single system row. At the level of implementation, subsets of virtual processors are mapped to the same worker process, which is responsible for sequentially executing the corresponding virtual processor code on all the elements of its assigned partition (i.e., in this application, on all its assigned rows). Also note that in this program virtual processors can access a shared variable, expressed with the attribute construct.


    parmod TridiagonalSolver(InputStream task inputData; OutputStream solutions sols) {
      ...
      operation TS-DP {
        topology array [i:N] VP;
        attribute syst_row computingSyst[2][N] scatter S[][*i] onto VP[i];

        do input_section {
          guard1: on , , input_syst {
            systRow inputSystem[N][4] = generateSystems(inputData);
            for (int i = 0; i < 4; i++)
              distribution inputSystem[*s][i] scatter to computingSyst[0][s];
          }
        } while (true)

        virtual_processors {
          solve_system(in guard1 out sols) {
            VP i {
              for (l = 1; ... ) {
                ... /* transformation and resolution on the assigned rows
                       (equations (12) and (13)) */
              }
            }
          }
        }

        output_section {
          ... /* gather of the solutions through the ALL construct */
        }
      }
    }

When T_A >> T_TS, T_A is the bottleneck of the system and we can only


minimize the single system resolution time. In this case we should select the data parallel program, as it minimizes the service latency. Note that in the considered dynamic platform this condition may change during the execution, due to changes in the performance of the communication and computation technologies in use. For instance, when we start the application the input generator may not be the bottleneck (on-stream case). Thus, we initially select the task farm version. During the execution, the communication network between the input generator and the flood forecasting component can become the bottleneck of the computation, possibly because the network between the components is partially used by a high-priority service (e.g. a phone call). The interarrival time at the TridiagonalSolver then becomes higher than the corresponding service time (i.e. T_A >> T_TS), even for the lowest parallelism degree of the forecasting component. Therefore, we can manage this situation in real-time by switching from the task farm computation to the data parallel one. Note that, when the adaptive logic of the described ParMod takes this decision, we can also instantiate the reconfiguration cost model to understand whether the reconfiguration is actually useful, or whether it only adds useless overhead to the application performance. This overhead is useless if:

• the adaptivity logic has some hints about the future network situation: for instance, it predicts that the high-priority phone call will end within a given time;

• the cost of applying the reconfiguration from the task farm to the data parallel version and back is higher than the cost of accepting the lower performance obtained by continuing the execution with the task farm version for the whole duration of the phone call. This can happen if the reconfiguration is supported by an application-consistent protocol, as described in Section 8.

We can see that this latter adaptive logic is defined according to a predictive behavior, as opposed to a reactive behavior which would not consider the prediction of future network situations.
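The decision just described can be condensed into a small rule. The following C++ fragment is only an illustrative sketch of such an adaptive logic (the names, the predictor and the cost estimate are hypothetical, not part of the ASSISTANT runtime): the version is chosen by comparing the observed inter-arrival time T_A with the best service time reachable by the task farm, and a switch is performed only if the predicted duration of the perturbation makes the round-trip reconfiguration cost worthwhile.

    enum class Version { TaskFarm, DataParallel };

    struct Monitor {             // hypothetical monitoring data
        double T_A;              // observed task inter-arrival time (s)
        double T_TS_min;         // task farm service time at max parallelism (s)
        double L_reconf;         // estimated cost of one version switch (s)
        double T_perturbation;   // predicted duration of the network slowdown (s)
        double penalty_per_sec;  // performance loss per second if we do not switch
    };

    Version choose_version(const Monitor& m, Version current) {
        // On-stream case: the generator is not the bottleneck, keep the farm.
        if (m.T_A <= m.T_TS_min) return Version::TaskFarm;
        // The generator/network is the bottleneck: the data parallel version,
        // which minimizes latency, is preferable -- but only if the round-trip
        // reconfiguration cost is lower than the predicted loss.
        const double loss_if_we_stay = m.penalty_per_sec * m.T_perturbation;
        if (current == Version::TaskFarm && 2.0 * m.L_reconf < loss_if_we_stay)
            return Version::DataParallel;
        return current;  // predictive behavior: ride out short perturbations
    }

The comparison against 2·L_reconf mirrors the second bullet above: switching is only convenient when the cost of reconfiguring to the data parallel version and back is recovered during the perturbation.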


Figure 36. Representation of the application scenario in which we are applying the flood forecasting model to a reduced local river area. In this scenario the flood forecasting model can be applied to a decentralized node or to a mobile node network near the sensors. The organization of the application is that of a client-server model.

Flood Forecasting on Local Area. The second scenario (see Figure 36) concerns the application of the flood forecasting model to a localized river area, which is of interest to a group of mobile operators managing the flood emergency. In this scenario the clients also act as providers of input data for the flood forecasting component, by acquiring input information from their nearest sensors. The acquired input data are passed to the forecasting component, which acts as a server: for each input data item, possibly consisting of one or more river discretization points, it instantiates and solves the associated hydrodynamic model, returning the computed force components to the requesting client. Therefore, we are in a typical client-server situation and, unlike the previous scenario, the number of clients is now a critical parameter of the performance model. As discussed in Section 5., the performance of these kinds of client-server graphs can be modeled as a Queuing System: a set of clients sends requests that are logically stored in the input queue of the flood forecasting component (i.e. Q). As, for brevity, in Section 5. we have not considered the case of cyclic graphs, we discuss here a performance model for this specific client-server application. We are interested in the performance experienced by each client w.r.t.


the system: that is, the performance characterization of this scenario is given by the perception that each client has of the speed of the forecasting service, which can be modeled in this client-server pattern as the client service time (T_Cl). By exploiting the queuing theory results presented in Section 5. we can derive that T_Cl depends on the service latency (L_TS) and on the average waiting time W_Q in the queue of the flood forecasting component, which in turn depends on the component service time T_TS. Therefore, unlike the first scenario, we are interested in optimizing both values (T_TS and L_TS) at the same time. As described above, the data parallel version minimizes both values, while the task farm one only optimizes T_TS. The adaptivity logic therefore has to dynamically instantiate the two variables to understand which version is best suited, depending on the dynamic platform and application conditions.
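Under the usual single-queue approximation (a sketch only, assuming for illustration Poisson arrivals and exponential service, i.e. an M/M/1-like station, which the chapter does not commit to), the quantity perceived by each client can be written as

    T_{Cl} \;\approx\; W_Q + L_{TS}, \qquad
    W_Q \;\approx\; \frac{\rho}{1-\rho}\, T_{TS}, \qquad
    \rho \;=\; \frac{T_{TS}}{T_A},

which makes explicit why both T_TS (through the utilization ρ and hence W_Q) and L_TS enter the client service time. In the closed, cyclic case considered here T_A itself depends on the number of clients and on T_Cl, so these relations have to be solved jointly.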

9.4. Adaptive Control

We exemplify use-cases in which we are required to adapt the application components to dynamic events characterizing a typical emergency scenario. The goal is to show how the ASSISTANT model and support is used to guarantee that certain QoS parameters are respected during the execution. Consider the temporal situation in which an emergency has been forecasted in past days by the off-line forecasting activities and the emergency is currently approaching. The mobile operators have been deployed over the critical areas in which some effort could be made to limit the consequences of the disaster, and we are executing the flood forecasting component for the nearest future (e.g. the next hour). The forecasting component considers the whole emergency area by receiving input monitoring data from upstream and downstream sensors, and the intended clients of the forecast are the mobile operators. We can think of multiple dynamic events which should be properly managed by some application reconfiguration activity to guarantee the availability, quality and performance of the forecasting service:

• some communication links may become overused due to the presence of higher priority activities on the same links or due to hardware and software failures;

• the mobility of operators near the emergency area may cause their isolation from the centralized area;

• the clients may request a forecast on a specific area of interest, rather than on the whole emergency area, possibly also requiring a higher quality of the forecasting results (e.g. higher image resolution or more frequent refresh of forecasts).

All these situations could be handled, depending on the adaptive strategy employed to support the application, by switching from the operating mode for the central servers to the one for the decentralized node. In this switching we can decide to limit the used input data from all sensors to only the downstream ones. The operation switching may be supported by some consistent reconfiguration protocol: for instance, if the central server operation is supported by a task farm computation and the decentralized node operation by a data parallel computation, we can employ a general rollback protocol, re-generating for the decentralized node all the input data produced for the central server. Next, the chosen decentralized node may itself become overused (e.g. one of the operators is performing a high-priority phone call to alert the central division of a specific critical situation). In this case we can move the computation to another decentralized node, if available. For this purpose we can guarantee consistency of the operation re-mapping by using a specialized data parallel rollforward protocol. Finally, the operators may become disconnected from any decentralized node, switching the implementation of their communication support from an infrastructure wireless network to an ad-hoc modality. In this case we might not guarantee consistency of the switching, and we could simply re-start the flood forecasting component on the PDAs' network.
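The events above, and the corresponding reconfigurations, can be summarized by a small dispatch routine. The C++ sketch below is purely illustrative (the event names, operating modes and consistency labels are our own, not ASSISTANT constructs); it only encodes the mapping just described between events and reconfiguration actions.

    enum class Event { LinkOverused, OperatorsIsolated, LocalHighQualityRequest,
                       DecentralizedNodeOverused, OperatorsDisconnected };
    enum class Mode  { CentralServers, DecentralizedNode, AdHocPDANetwork };

    struct Decision {
        Mode        target;
        const char* consistency;   // protocol used to preserve stream semantics
    };

    Decision on_event(Event e, Mode current) {
        switch (e) {
        case Event::LinkOverused:
        case Event::OperatorsIsolated:
        case Event::LocalHighQualityRequest:
            // Switch from the central servers to a decentralized node,
            // restricting the input to the downstream sensors; consistency
            // via a general rollback protocol (input data are re-generated).
            return {Mode::DecentralizedNode, "rollback"};
        case Event::DecentralizedNodeOverused:
            // Re-map the data parallel operation onto another decentralized
            // node, using a specialized data parallel rollforward protocol.
            return {Mode::DecentralizedNode, "rollforward"};
        case Event::OperatorsDisconnected:
            // Fall back to the ad-hoc PDA network; no consistency guarantee,
            // the forecasting component is simply restarted.
            return {Mode::AdHocPDANetwork, "restart"};
        }
        return {current, "none"};
    }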

10. Experiments

The goal of this section is to instantiate the performance models described in Section 5. on specific application cases, i.e. on a subset of the cases described in Section 9., and to validate their coherence with the results of experiments. For brevity we show only a subset of the described cases.


Concretely, we will show:

• how performance models for task farm and data parallel programs can be instantiated to the case of tri-diagonal system solving algorithms, as described in the previous section;

• the numerical results of experiments with the two programs mapped to different architectures, from which we compute the scalability7 of the computations.

Finally, we consider the notable case of operation switching supported by a rollforward consistent reconfiguration protocol, applying the performance model described in Section 8. and validating it with experiments. The computing nodes used in the experiments are the following:

• cluster: 30 nodes with Pentium III 800 MHz processors, 512 KB of cache and 1 GB of main memory, interconnected by a 100 Mbit/s Fast Ethernet;

• multicore: an Intel Xeon E5420 Dual Quad Core processor, featuring 8 cores at 2.50 GHz, 12 MB of L2 cache and 8 GB of main memory.

In the experiments we map the task farm program to both the cluster and the multicore architecture, and the data parallel program to the multicore one only, due to the strict requirements, in terms of ratio between communication and computation, of this data parallel program. This platform is sufficient to reproduce the performance features of computing nodes which can be found in a pervasive platform. Nevertheless, we will perform experiments on a fully-compliant pervasive platform, e.g. including wireless/wired networks and mobile nodes, in future work. Moreover, for these experiments, we have employed off-the-shelf programming technologies, including distributed and shared memory implementations of the Message Passing Interface (MPI) standard and the Linux socket implementation.

7 In parallel computing, the scalability is a measure of the efficiency of the computation as its parallelism degree varies.
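One common way to make this footnote precise (given here only as a sketch, not necessarily the exact definition adopted by the authors) is

    s(p) \;=\; \frac{T_S(1)}{T_S(p)},

where T_S(p) is the service time measured with parallelism degree p; ideal behavior corresponds to s(p) = p.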


10.1. Task Farm for Flood Forecasting in Large Area

We have implemented the task farm operation for the cluster architecture by using the LAM/MPI [15] support. The dynamic support includes the dynamic modification of the grain of tasks (in terms of the size of the resolved system) and of the parallelism degree (i.e. the number of workers). The program is considered in the Large Area pipeline-shaped application scenario, in which the generator and the client are implemented as ASSISTANT components. The performance metric defined in the performance models and evaluated in the experiments is the interarrival time of results to the client component, which, in our vision, gives a good point of view on the whole application performance. As hinted, the client interarrival time can be defined as the time passing between two successive result deliveries from the computing component to the client, and is subject to the variations of the parallelism degree and the task grain. To statically evaluate this performance metric, we have to instantiate the application performance model with the following values: computation time of a single worker; communication latency of a task from generator to emitter and from emitter to worker; communication latency of a result from a worker to the collector and from the collector to the client. Further input values for the performance model are the task farm parallelism degree and the size of the solved tri-diagonal systems. In this chapter we instantiate these values by performing proper measurements on the target platform: for instance, we measure the cost of a task as the average over a large number of task executions. Similarly, we measure the communication latency by performing a large number of experiments with the proper message size. Table 1 shows the worker computing time needed to perform a task, by varying the system size. We have selected system sizes between 8MB and 32MB because they represent good scalability cases, i.e. cases in which we do not experience degradations with small parallelism degrees, but in which communication latency becomes the predominant factor beyond certain parallelism degrees. Table 2 shows the average values measured for the communication latency of results respectively exchanged from a worker to the collector and from the collector to the client. Finally, the average task communication latency between the generator and


Table 1. Average task execution time by varying the size of the solved system.

    task size            8MB       16MB      32MB
    exec. time (sec.)    4.0436    8.4194    17.4813

Table 2. Communication latency (in seconds) between worker and collector (L_com^{W-C}) and between collector and client (L_com^{C-Cl}), by varying the size of the solved systems.

    task size        8MB       16MB      32MB
    L_com^{W-C}      0.3575    0.7143    1.4281
    L_com^{C-Cl}     0.3596    0.7514    1.4456

the emitter is L_com^{G-E} = 0.000639 seconds, and between the emitter and a worker it is L_com^{E-W} = 0.000163 seconds.

We show how the performance model can be solved for one instantiation of its variables, for systems of 8MB and parallelism degree equal to 2. The task farm service time can be computed as:

    T_S^{farm} = \max\{\, T_E;\; T_W / p;\; T_C \,\}

where T_E ≈ L_com^{G-E} + L_com^{E-W} = 0.000802 seconds; T_C ≈ L_com^{W-C} + L_com^{C-Cl} = 0.7171 seconds; T_W = L_com^{E-W} + T_task + L_com^{W-C} = 4.4013 seconds; and p is the task farm parallelism degree. To compute these values we have considered an upper bound, by assuming that communication is not overlapped with computation. Thus, T_Cl = T_S^{farm} = T_W / p = 2.2001 seconds. For comparison, the experiments gave us a client inter-arrival time equal to 2.2044, i.e. the performance model closely matches the measured performance. The comparison between the experimental performance and the one obtained by instantiating the performance models is shown in Table 3. The results are shown as pairs, where the first value is the result from experiments and the one between parentheses is from the instantiation of the performance model.
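This instantiation can be reproduced mechanically; the short C++ fragment below plugs the measured values of Tables 1 and 2 into the task farm service time formula (a convenience sketch, using the 8MB figures reported in the text):

    #include <algorithm>
    #include <cstdio>

    // Task farm service time for the 8MB case, from the measured values.
    int main() {
        const double L_GE = 0.000639, L_EW = 0.000163;  // generator->emitter, emitter->worker
        const double L_WC = 0.3575,   L_CCl = 0.3596;   // worker->collector, collector->client
        const double T_task = 4.0436;                   // Table 1, 8MB
        const double T_E = L_GE + L_EW;                 // ~0.000802 s
        const double T_C = L_WC + L_CCl;                // ~0.7171 s
        const double T_W = L_EW + T_task + L_WC;        // ~4.4013 s
        for (int p = 1; p <= 16; p *= 2) {
            const double T_farm = std::max({T_E, T_W / p, T_C});
            std::printf("p=%2d  T_farm=%.4f s\n", p, T_farm);
        }
        return 0;
    }

For p = 2 the program yields approximately 2.20 seconds, in line with the value discussed above; from p = 8 onwards the collector term T_C dominates, which is exactly the effect highlighted in bold in Table 3.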


Table 3. Results of experiments for the task farm version on the cluster architecture, with the corresponding worker service times between parentheses. The results are given w.r.t. system size (in MB) and parallelism degree (p).

    MB     p = 1              p = 2              p = 4              p = 8              p = 16
     8     4.5769 (4.4012)    2.2044 (2.2001)    1.1423 (1.1003)    0.7092 (0.5502)    0.7090 (0.2751)
    16     9.1450 (9.1339)    4.6102 (4.5669)    2.3094 (2.2835)    1.4229 (1.1417)    1.4226 (0.5709)
    32     19.258 (18.910)    9.6308 (9.4548)    4.7711 (4.7274)    2.8505 (2.3637)    2.8505 (1.1818)

The cases in which the results of the instantiation of the performance model are shown in bold are those in which the communication latency has overcome the task farm service time, hence it dominates the client inter-arrival time. In bold we show the worker service time, to make clear the difference between the performance of the task farm and the limitations given by the used communication technology. For a correct understanding of the experimental values, note that the collector service time is, for the three system sizes, equal to: T_C(8MB) ≈ 0.3575 + 0.3596 = 0.7171 seconds, T_C(16MB) ≈ 0.7143 + 0.7514 = 1.4657 seconds, and T_C(32MB) ≈ 1.4281 + 1.4456 = 2.8737 seconds.
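These bold entries can be checked directly: once the collector dominates, the client inter-arrival time is bounded below by T_C rather than by T_W/p. For instance, for 8MB and p = 8, using the figures above under the same non-overlapping assumption,

    \max\Bigl\{\frac{T_W}{p},\; T_C\Bigr\} \;=\; \max\{0.5502,\; 0.7171\} \;=\; 0.7171\ \text{s},

which is close to the measured 0.7092 s and explains why the measured times stop decreasing between p = 8 and p = 16.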

10.2. Data Parallel for Flood Forecasting in Local Area

We have implemented the data parallel version for the multicore processor by using an MPI implementation for shared memory architectures [26]. For this version we have limited the dynamicity to the task grain, whereas the parallelism degree can only be set at application start-up. As in the task farm case, we consider the data parallel computation as included in the pipeline-shaped application scenario. We have chosen to experimentally monitor the following values to instantiate the performance models: average worker execution time during the stencil computation; communication latencies between scatter, gather and workers; task and result communication latencies. A further value to be considered when instantiating the performance model is the task grain. The effects of different parallelism degrees, instead, are automatically included in the worker task execution time: in fact, this value decreases with the parallelism degree, as the same state is partitioned amongst a larger number of workers.

Measured communication latencies (in seconds) for the data parallel version, by varying the system size.

    task size        1MB       2MB       4MB
    L_com^{W-G}      0.0452    0.0900    0.1795
    L_com^{G-Cl}     0.0045    0.0899    0.1802

Table 4. Execution time of the data parallel version on the multicore processor w.r.t. system size (in MB) and parallelism degree (p). Between parentheses we include the corresponding worker execution time.

    MB     p = 1              p = 2              p = 4              p = 6              p = 8
     1     0.1386 (0.1381)    0.0850 (0.0846)    0.0471 (0.0465)    0.0403 (0.0281)    —
     2     0.2956 (0.2937)    0.1836 (0.1823)    0.1036 (0.1068)    0.0779 (0.0763)    0.1112 (0.1138)*
     4     0.6498 (0.6455)    0.4461 (0.4441)    0.2564 (0.2556)    0.2163 (0.2004)*   0.2746 (0.2859)*

The choice of evaluating the parallel computation service time by means of dedicated experiments is needed because the performance models introduced in Section 5. do not consider degradation factors typically associated with multicore processors (see below). We will study in future work a modification of the presented models that also includes such degradations. During the experiments we also noted that the used MPI implementation for this kind of architecture enables a partial overlapping of communications and computations, resulting in a halved communication latency experienced while monitoring the application performance. The full communication latencies measured for the considered system sizes are reported in the table shown before Table 4. As hinted in previous sections, a performance degradation factor for shared memory multicore architectures is given by possible conflicts between accesses to the memory hierarchy. This kind of degradation influences the worker computing time and is thus reflected also in the client service time. Therefore, the client inter-arrival time can be computed by considering the average worker computing time for each task. Table 4 shows the worker service time and the client inter-arrival time. We have limited the size of the systems for this architecture to 1MB, 2MB and 4MB to avoid excessive performance


degradations due to the described memory access conflicts. The results are shown in bold when the communication latency exceeds the computation time; we mark with an asterisk (*) those cases in which the memory conflicts induce a performance degradation in the worker computation. Note that such degradations have been observed only for the largest system sizes and for the larger parallelism degrees.

10.3. Generic Rollforward for Task Farm and Data Parallel Operations

We have performed experiments to evaluate the cost of executing a consistent reconfiguration protocol based on a rollforward logic. The experiments refer to the situation in which we are switching from the task farm version on the cluster to another task farm version mapped to the multicore architecture. As we have seen in Section 8., this reconfiguration protocol consists in the following sequence of actions: the source operation stops receiving input tasks; the source operation executes all pending tasks and delivers their results to the client; the target operation starts its execution by receiving the last un-received message from the generator. We have seen that, from a performance modeling viewpoint, the rollforward protocol cost depends on the time needed to "flush out" all tasks currently mapped to the source operation. This can be broadly evaluated as double the time needed to perform a task (T_F in Section 8.). This abstract value should be modified so as to include the collection of results and the variability of the task farm service time. That is, depending on how close the number of task farm workers is to its optimal value (see Section 5.), performing all tasks currently mapped on the workers may require more than T_F. As we note below, this is reflected in the experiments. In the experiments we monitor the time needed to apply the reconfiguration, L_reconf, which is evaluated as the time passing from the notification of the operation switching from the Control Part to the Operating Part, to the instant at which the target operation starts delivering results to the client. The results presented in Figure 37 show the L_reconf behavior w.r.t. the system size, which influences, as noted above, the worker service time. In the results we distinguish three curves, each related to a different parallelism degree.
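A rough reading of the statement above (our own formulation, intended only to fix the orders of magnitude, not the model of Section 8.) is

    L_{reconf} \;\approx\; 2\,T_F \;+\; T_{collect} \;+\; \Delta(p),

where T_collect accounts for the delivery of the results of the flushed-out tasks and Δ(p) ≥ 0 is an extra term that grows when the parallelism degree deviates from its optimal value; the experiments below show both the dependence on T_F (through the system size) and the effect of Δ(p).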


Figure 37. Switching overhead (L_reconf) of the rollforward protocol by varying the task grain (i.e. the solved system size), for parallelism degrees 5, 15 and 28.

Figure 38 shows the L_reconf behavior for a wider set of system sizes by varying the parallelism degree. As we can note, the case p = 15 is the optimal one w.r.t. the cases p = 5 and p = 28, as it best fits the optimal value for the task farm parallelism degree.

Figure 38. Switching overhead (L_reconf) of the rollforward protocol by varying the parallelism degree.

11. Conclusion

In this chapter we have described our approach to High Performance Pervasive applications, based on a high-level programming model and an optimized support, aiming to guarantee QoS parameters in terms of application performance and availability. We have identified the class of High Performance Pervasive applications by characterizing the issues related to the target platforms and the application requirements, defining the basic abstraction tools which must be provided by programming models for these kinds of applications. We have described and formalized the main concepts, in terms of constructs and mechanisms, of the ASSISTANT programming model, which is based on Structured Parallel Programming, related to the development of the Functional application Logics, and on a Control Logic in which the programmer can express


all the adaptivity and dependability strategies required to fulfill application and platform requirements. To analyze and quantify the performance of the autonomic self-adaptive component provided by the ASSISTANT model, we have introduced performance modeling tools, based on Queueing Models and Theory, which enable programmers to study the performance of the Functional Logics of applications and the overheads due to the application of adaptation strategies in the Control Logic. Next, we have shown how the ASSISTANT main components can be implemented by means of an abstract concurrent language, also introducing aspects concerning the reliability support for the Control Logic. We have also extended the implementation of ASSISTANT by considering two main points which strongly influence the performance overheads due to reconfiguration activities: (i) we have defined reconfiguration points, or interruption points, which are specific execution points of the processes implementing the Functional Part at which reconfiguration commands can be received from the Control Part. These points specifically impact on the time needed to apply a reconfiguration and on the kind of reconfiguration protocol which is needed whenever we have to guarantee the consistency of the reconfiguration


w.r.t. the application semantics; (ii) we have defined consistent reconfiguration protocols for the case in which the Functional Part implements a general parallel computation, also giving performance models for the notable cases in which the Functional Part implements either a task farm or a data parallel computation. To concretise the research results of the ASSISTANT model and support, we have described a flood emergency management application, showing how different sequential and parallel versions can be defined for the Functional Part, how their dynamic performance behavior can be characterized, and giving some hints on how adaptation strategies can be derived for some notable platform and application events. Finally, we have presented experimental results, whose main goal is to prove the actual efficacy of the presented performance models for the developed versions of the Functional Logic, and to quantify the overheads due to consistent reconfiguration in the case of a general rollforward protocol.

References

[1] Alan Gara and Ravi Nair. Exascale computing: What future architectures will mean for the user community. In Barbara Chapman, Frédéric Desprez, Gerhard R. Joubert, Alain Lichnewsky, Frans Peters, and Thierry Priol, editors, Parallel Computing: From Multicores and GPU's to Petascale, volume 19 of Advances in Parallel Computing, pages 3–15. IOS Press, 2010. [2] Marco Aldinucci, Sonia Campa, Marco Danelutto, Marco Vanneschi, Peter Kilpatrick, Patrizio Dazzi, Domenico Laforenza, and Nicola Tonellotto. Behavioural skeletons in gcm: Autonomic management of grid components. In PDP '08: Proceedings of the 16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008), pages 54–63, Washington, DC, USA, 2008. IEEE Computer Society. [3] Arvind, R. S. Nikhil, and K. K. Pingali. I-structures: data structures for parallel computing. ACM Transactions on Programming Languages and Systems, 11(4):598–632, 1989.


[4] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. A view of the parallel computing landscape. Commun. ACM, 52(10):56–67, 2009. [5] F. Baiardi, L. Ricci, and M. Vanneschi. Static checking of interprocess communication in ecsp. SIGPLAN Not., 19(6):290–299, 1984. [6] Josh Barnes and Piet Hut. A hierarchical o(n log n) force-calculation algorithm. Nature, 324:446–449, December 1986. [7] Fran Berman, Geoffrey Fox, and Anthony J. G. Hey. Grid Computing: Making the Global Infrastructure a Reality. John Wiley & Sons, Inc., New York, NY, USA, 2003. [8] C. Bertolli, D. Buono, S. Lametti, G. Mencagli, M. Meneghin, A. Pascucci, and M. Vanneschi. A programming model for high-performance adaptive applications on pervasive mobile grids. In Proceeding of the 21st IASTED International Conference on Parallel and Distributed Computing and Systems, pages 38–54, November 2009. [9] C. Bertolli, D. Buono, G. Mencagli, and M. Vanneschi. Expressing adaptivity and context-awareness in the assistant programming model. In Proceedings of the Third International ICST Conference on Autonomic Computing and Communication Systems, volume 23, pages 32–47, September 2009. [10] C. Bertolli, M. Coppola, and C. Zoccolo. The co-replication methodology and its application to structured parallel programs. In CompFrame ’07: Proceedings of the 2007 symposium on Component and framework technology in high-performance and scientific computing, pages 39–48, New York, NY, USA, 2007. ACM. [11] Carlo Bertolli. Fault tolerance for High-Performance applications using structured parallelism models. VDM Verlag, 2009. [12] Carlo Bertolli, Daniele Buono, Gabriele Mencagli, Matteo Mordacchini, Franco M. Nardini, Massimo Torquati, and Marco Vanneschi. Resource

High-Performance Pervasive Computing

103

discovery support for time-critical adaptive applications. In The 6th International Wireless Communications and Mobile Computing Conference. Workshop on Emergency Management: Communication and Computing Platforms), 2010, to appear. [13] Carlo Bertolli, Gabriele Mencagli, and Marco Vanneschi. Adaptivity in risk and emergency management applications on pervasive grids. In ISPAN ’09: Proceedings of the 2009 10th International Symposium on Pervasive Systems, Algorithms, and Networks, pages 550–555, Washington, DC, USA, 2009. IEEE Computer Society. [14] Carlo Bertolli, Gabriele Mencagli, and Marco Vanneschi. Analyzing memory requirements for pervasive grid applications. In The 18th Euromicro International Conference on Parallel, Distributed and Network-Based Computing, Washington, DC, USA, 2010, to appear. IEEE Computer Society. [15] Greg Burns, Raja Daoud, and James Vaigl. LAM: An Open Cluster Environment for MPI. In Proceedings of Supercomputing Symposium, pages 379–386, 1994. [16] Eduardo F. Camacho and Carlos A. Bordons. Model Predictive Control in the Process Industry. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1997. [17] Murray Cole. Bringing skeletons out of the closet: a pragmatic manifesto for skeletal parallel programming. Parallel Comput., 30(3):389–406, 2004. [18] D. E. Culler, A. Gupta, and J. Pal Singh. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997. [19] Shlomi Dolev, Seth Gilbert, Nancy A. Lynch, Elad Schiller, Alex A. Shvartsman, and Jennifer Welch. Brief announcement: virtual mobile nodes for mobile ad hoc networks. In PODC ’04: Proceedings of the twenty-third annual ACM symposium on Principles of distributed computing, pages 385–385, New York, NY, USA, 2004. ACM.


[20] I.S. Duff and H.A. van der Vorst. Developments and trends in the parallel solution of linear systems. Par. Comp., 25(13-14):1931–1970, 1999. [21] E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv., 34(3):375–408, 2002. [22] Wolfgang Emmerich. Software engineering and middleware: a roadmap. In ICSE '00: Proceedings of the Conference on The Future of Software Engineering, pages 117–129, New York, NY, USA, 2000. ACM. [23] Ian Foster and Carl Kesselman. The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003. [24] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. Introduction to the wire-speed processor and architecture. IBM Journal of Research and Development, 54(1):3:1–3:11, January/February 2010. [25] David Garlan, Dan Siewiorek, Asim Smailagic, and Peter Steenkiste. Project aura: Toward distraction-free pervasive computing. IEEE Pervasive Computing, 1(2):22–31, 2002. [26] William D. Gropp and Ewing Lusk. User's Guide for mpich, a Portable Implementation of MPI. Mathematics and Computer Science Division, Argonne National Laboratory, 1996. [27] Paul Grun. Introduction to InfiniBand(TM) for End Users. InfiniBand Trade Association. http://www.infinibandta.org.

[28] Rachid Guerraoui and André Schiper. Software-based replication for fault tolerance. Computer, 30(4):68–74, 1997. [29] Joseph L. Hellerstein, Yixin Diao, Sujay Parekh, and Dawn M. Tilbury. Feedback Control of Computing Systems. John Wiley & Sons, 2004. [30] Vipul Hingne, Anupam Joshi, Tim Finin, Hillol Kargupta, and Elias Houstis. Towards a pervasive grid. In IPDPS ’03: Proceedings of the 17th


International Symposium on Parallel and Distributed Processing, page 207.2, Washington, DC, USA, 2003. IEEE Computer Society. [31] C. A. R. Hoare. Communicating sequential processes. Commun. ACM, 21(8):666–677, 1978. [32] R. W. Hockney and C. R. Jesshope. Parallel Computers Two: Architecture, Programming and Algorithms. IOP Publishing Ltd., Bristol, UK, 1988. [33] Markus C. Huebscher and Julie A. McCann. A survey of autonomic computing—degrees, models, and applications. ACM Comput. Surv., 40(3):1–28, 2008. [34] Yuki Karasawa and Hideya Iwasaki. A parallel skeleton library for multicore clusters. In ICPP '09: Proceedings of the 2009 International Conference on Parallel Processing, pages 84–91, Washington, DC, USA, 2009. IEEE Computer Society. [35] Jeffrey O. Kephart and David M. Chess. The vision of autonomic computing. Computer, 36(1):41–50, 2003. [36] Leonard Kleinrock. Queueing Systems, Volume 1: Theory. Wiley-Interscience, 1975.

[37] David J. Lillethun, David Hilley, Seth Horrigan, and Umakishore Ramachandran. Mb++: An integrated architecture for pervasive computing and high-performance computing. In RTCSA ’07: Proceedings of the 13th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, pages 241–248, Washington, DC, USA, 2007. IEEE Computer Society. [38] B. D. Noble and M. Satyanarayanan. Experience with adaptive mobile applications in odyssey. Mob. Netw. Appl., 4(4):245–254, 1999. [39] Thierry Priol and Marco Vanneschi. Towards Next Generation Grids: Proceedings of the CoreGRID Symposium 2007. Springer Publishing Company, Incorporated, 2007.


[40] Fred B. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Comput. Surv., 22(4):299–319, 1990. [41] Bill Syme. Dynamically linked two-dimensional/one-dimensional hydrodynamic modelling program for rivers, estuaries and coastal waters. Technical report, WBM Oceanics Australia, 1991. available at: http://www.tuflow.com/Downloads/. [42] Sathish S. Vadhiyar and Jack J. Dongarra. Self adaptivity in grid computing: Research articles. Concurr. Comput. : Pract. Exper., 17(2-4):235– 257, 2005. [43] M. Vanneschi. Architettura degli Elaboratori. Edizioni PLUS, 2009. [44] M. Vanneschi and L. Veraldi. Dynamicity in distributed applications: issues, problems and the assist approach. Parallel Comput., 33(12):822– 845, 2007. [45] Marco Vanneschi. The programming model of assist, an environment for parallel and distributed portable applications. Parallel Comput., 28(12):1709–1732, 2002.