Workflow Management for Soft Real-time Interactive Applications in Virtualized Environments

Spyridon Gogouvitis (a), Kleopatra Konstanteli (a), Stefan Waldschmidt (b), George Kousiouris (a), Gregory Katsaros (a), Andreas Menychtas (a), Dimosthenis Kyriazis (a), Theodora Varvarigou (a)

(a) National Technical University of Athens, Greece
(b) Digital Film Technology, Germany

Abstract

Many applications, especially the ones implementing multi-user collaborative environments, fall within the context of soft real-time systems, in which only small deviations from timing constraints are allowed. The advancements in distributed computing have made it possible to follow a service-oriented approach, taking advantage of the benefits this provides. In this context, applications consist of soft real-time critical application service components that interact with each other to provide the corresponding application functionality, forming application workflows. In this paper we present a new architectural design and implementation of a Workflow Management approach. This approach covers enacting soft real-time application service components according to a workflow description language, synchronizing the application components, monitoring the execution and reacting to events within a distributed virtualized environment. We also demonstrate the operation of the implemented mechanism and evaluate its effectiveness using an application scenario with soft real-time interactivity characteristics, namely Film Post-production, under realistic settings.

Keywords: Workflow Management, Quality of Service, Cloud Computing, Service Oriented Infrastructures, Soft Real-time applications

Email addresses: [email protected] (Spyridon Gogouvitis), [email protected] (Kleopatra Konstanteli), [email protected] (Stefan Waldschmidt), [email protected] (George Kousiouris), [email protected] (Gregory Katsaros), [email protected] (Andreas Menychtas), [email protected] (Dimosthenis Kyriazis), [email protected] (Theodora Varvarigou)

Preprint submitted to Future Generation Computer Systems, May 10, 2011

1. Introduction

With the advent of Service Oriented Architectures (SOA) [1] and technologies such as Grid [2] and Cloud [3] computing, there has been an increase in applications that move away from a monolithic approach towards a paradigm that emphasizes modular design, giving rise to the wider adoption of the Cloud Service Model [4], [5]. The model covers all layers of IT, including infrastructure, platform and application, hence the terms Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS) and Software-as-a-Service (SaaS) [6]. The IaaS provider offers raw machines on demand, possibly concealing the infrastructure through virtualization techniques. The PaaS provider provisions a development environment that allows for the adaptation of an application to a Service Oriented Infrastructure (SOI), as well as acting as the mediator between the SaaS and the IaaS providers. The SaaS provider offers an application as a service over the Internet, aiming to benefit from the opportunities this approach has to offer.

The growing availability of broadband Internet connections has enabled people to carry out tasks that were, up until now, parts of offline workflows through online collaborative systems, allowing for higher levels of interactivity. Emerging future Internet applications therefore involve a broad class of interactive and collaborative tools and environments, including concurrent design and visualization in the engineering sector, media production in the creative industries, and multi-user virtual environments in education and gaming. Many of these applications tend to use dedicated hardware in order to achieve the desired Quality of Service (QoS), greatly increasing the overall cost of maintaining the needed resources. This can be a major hindrance to small businesses and startup companies that want to make innovative solutions available easily. Adopting a Cloud solution alleviates this problem by providing the option of pay-per-use without the need to own expensive equipment. Clouds offer many other advantages such as elasticity, reliability and reduced time to market.

1.1. Real-time systems

The Quality of Service (QoS) requirements of these applications classify them as soft real-time, since they have stringent timing and performance needs. In general, a real-time system is a system that reacts to events within a limited amount of time [7].
More precisely, a real-time system is a system whose correctness is defined not only by its final result but also by the time at which this result is produced [8]. Traditionally, real-time tasks have been divided into two categories. Hard real-time tasks must complete before a certain deadline; otherwise critical errors occur that have detrimental effects on the system. Examples of hard real-time systems include military control systems, avionic devices and medical systems, a large portion of which are developed as embedded devices. Soft real-time tasks should complete before a certain deadline; if it is missed, the produced result degrades gracefully, offering a lower level of QoS, but the miss is not catastrophic for the system. Examples of such systems include video streaming applications, virtual reality systems and Internet-based telephony. Even though one could claim that all systems should be treated as hard real-time, this is not desirable, as hard real-time systems are designed under worst-case scenarios, which inevitably leads to overprovisioning of resources and therefore increased cost. Another misconception regarding real-time systems concerns their relation to performance. While a real-time system is generally expected to respond to events quickly, it is not speed that defines it, but its predictability and adherence to specific deadlines.
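The distinction between hard and soft deadlines can be illustrated with a simple utility function: a hard real-time task yields no usable result after its deadline, whereas a soft real-time task's value degrades gracefully. The sketch below is purely illustrative; the linear decay rate and the zero cutoff are hypothetical choices, not part of any formal real-time model referenced in this paper.

```python
def hard_utility(completion, deadline):
    """Hard real-time: full utility before the deadline, failure after it."""
    return 1.0 if completion <= deadline else float("-inf")

def soft_utility(completion, deadline, decay=0.5):
    """Soft real-time: utility degrades gracefully past the deadline.
    `decay` (a hypothetical parameter) sets how fast QoS drops per time
    unit of lateness; utility bottoms out at zero instead of failing."""
    if completion <= deadline:
        return 1.0
    return max(0.0, 1.0 - decay * (completion - deadline))

# A video frame finishing 0.4 time units late still has value (~0.8),
# whereas for a hard real-time task the same miss is a system failure.
late_frame = soft_utility(10.4, 10.0)
hard_miss = hard_utility(10.4, 10.0)
```

This also illustrates why treating everything as hard real-time is wasteful: resources must be provisioned so the worst case never crosses the cliff, whereas a soft deadline tolerates occasional small overruns at a bounded QoS cost.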


1.2. Soft Real-time systems and the Cloud

The sharing of resources in a cloud environment causes the level of performance experienced by each Virtual Machine (VM) to become unstable, as has been observed in Amazon EC2 [9], [10]. Therefore, a virtualized cloud environment targeting soft real-time distributed applications requires proper CPU and network scheduling technologies, coupled with proper performance modeling and runtime provisioning techniques. In the context of the IRMOS project [11] all resources (computing power, network and storage) are virtualized, and QoS is guaranteed for them by using a real-time scheduler that allows for temporal isolation of VMs [12], policing the virtual network links [13] and employing a QoS-aware storage solution [14]. The framework also provides tools for application modeling, benchmarking and performance modeling, which the platform needs in order to estimate the resources the application requires to achieve the desired QoS.

1.3. Paper contributions

In the context of such a Cloud ecosystem with stringent time requirements, the Workflow Management subsystem plays a central part [15], [16], [17], [18]. Companies will most likely be reluctant to outsource any part of a workflow if a certain quality of operation cannot be guaranteed or if the infrastructure is not able to dynamically react to changing conditions and adapt to them quickly. Therefore, any workflow management solution targeting the area of soft real-time applications must support the modeling of application modules or components into a complete workflow that takes QoS requirements into consideration, while being able to dynamically react to events triggered by either the user or the platform. In this paper we present a hierarchical Workflow Management System (WfMS) that consists of three different services: the Workflow Manager, the Workflow Enactor and the Evaluator.
The proposed system enables the execution of real-time applications on SOIs by providing the required QoS levels. In greater detail, the Workflow Manager service resides within the domain of a PaaS provider and is responsible for managing multiple workflows. The Workflow Enactor is deployed within a Virtualized Environment (VE) along with the application services in order to invoke them according to a workflow description document and handle events. The Evaluator service, which is also situated within a VE, monitors the adherence of the application to the QoS terms included in Service Level Agreements (SLAs) and generates events which are propagated to the Workflow Enactor for proper actions to be taken. This two-layer approach, with services residing on top of but also inside a virtualized environment, is of major importance since it minimizes communication delays between the VE and the platform services and allows for timing-constrained execution and management of real-time applications. Moreover, the performance of these services is benchmarked and modeled along with the application, and therefore the platform is able to provide strong QoS guarantees not only to the application itself but also to the management mechanisms.

1.4. Paper outline

Following this introduction the paper is structured as follows. Section 2 presents the requirements of a WfMS in a SOI that facilitates real-time applications, whereas Section 3 presents related work in the field of workflow management. The overall system architecture is described in detail in Section 4, followed by a detailed description of its implementation

in Section 5. In Section 6, we present an evaluation of the system based on a Film Post-production [19] application scenario, along with results that indicate the efficiency of the proposed approach. Finally, Section 7 concludes the paper.

2. Workflow Management System Requirements

In general, making use of the additional possibilities an external WfMS can offer leads to an application that better fulfills end-users' expectations and improves their experience with the application. Within the context of the IRMOS project, a Market and Technical Requirements Analysis [20] covering 42 organizations (26 of which were SMEs) was conducted to capture the requirements for adopting real-time applications on SOIs. The analysis led to some important points that are relevant to this paper:

• 60% of the end-users indicated that they use or plan to use SOA technologies to gain flexibility (allowing them a quicker and better adaptation to changing market trends) and easier services integration.

• The end users identified Security as the main barrier to adopting Cloud technologies (21%), followed by Performance (15%), Support (15%), Configurability (10%) and Speed to expand capacity (10%).

• 38% identified the need for real-time user interaction, while 8% answered that no such need exists for their companies and 54% did not answer the question.

These issues translate into requirements for our WfMS. Together with others identified in the literature on the same subject (e.g. [21], [22]), and while not claiming to be exhaustive, they complement the main aim of our system, which is to provide the mechanisms needed to offer strong QoS guarantees. They drove the design and implementation of the proposed WfMS and are summarized in the following paragraphs.

2.1. Real-time related requirements

Considering the special needs of soft real-time applications, we can derive a set of ideal features that a Workflow Management System should have in order to support them efficiently while guaranteeing a certain degree of performance. They are listed below.

Scalability. A WfMS needs to be able to handle multiple concurrent requests for workflow execution without impact on its performance. This is especially true for soft real-time systems, as these require the system to respond in a well-defined way. A centralized system is hard to model as workloads may vary. (Requirement 1, R1)

Interactivity. A WfMS should be flexible and dynamic, providing the user with the ability to interact with a running workflow. This means that the user should be able to change the control or data flow of a running application, as well as the QoS requirements of a workflow. The latter may be subject to the terms of the SLA that is in place; in any case, the WfMS should be able to change a running workflow if the requested changes do not conflict with the SLA terms. The need for interactivity relies heavily on the provision of a monitoring mechanism by the platform. Interactivity also implies that the requested operations are carried out fast enough that the user does not experience any delays. (R2)

Fault handling. Faults are bound to happen at both the hardware and the software level. The WfMS must first be able to detect these, and must also provide the application developer with the capability to define a set of corrective actions to be taken under certain circumstances in order to meet the QoS requirements of real-time applications. (R3)

2.2. General requirements for WfMS

Declaration of QoS requirements in user-friendly terms. There are numerous situations where an application user needs to be able to define QoS parameters either for the application workflow as a whole or for parts thereof. This should be done in terms the user is familiar with, such as keeping the dropped frames of a multimedia application under a specific threshold, and not in technical terms, such as network bandwidth and/or CPU speed. (R4)

Workflow Monitoring. The WfMS must be able to monitor the execution of every running workflow and be able to present the current state to the user. (R5)

Legacy code support. While most WfMS are targeted towards the SOA paradigm, it is also important to be able to execute tasks not developed in a service-oriented fashion. This is feasible by creating service wrappers around legacy code and is important for allowing fast integration of legacy applications into the SOA universe. (R6)

Security. Even though security tends to be neglected in areas such as scientific workflows, it is vital for business workflows. A WfMS needs to provide the appropriate level of security for all involved parties, having a sound infrastructure for authentication, authorization and secure message exchange between services. (R7)

3. Related Work

The area of WfMS is vast, and distributed solutions have been around for a while [23], [24], [25]. Moreover, various solutions dealing with QoS and service selection have been proposed, such as [26], [27] and [28].
In this section we present related work that focuses on the systems management space, and specifically on approaches dealing with QoS provision. See Figure 1 for an overview.

Taverna [29] is a WfMS that follows a centralized architecture, which raises questions about the scalability of the system. Taverna supports web services but does not provide any QoS guarantees. However, the system provides monitoring of running workflows and a friendly environment for users to manipulate them. As far as fault tolerance is concerned, it allows for the definition of either a retry operation or an alternate location of the same service.

The Askalon project [30] is mainly focused on performance-oriented applications. The project follows a decentralized architecture, but with a global decision-making mechanism. Users are able to specify high-level constraints and properties centered on execution time, and the workflow is scheduled based on performance prediction. Askalon provides monitoring of the executed workflow but does not allow for user interactivity. Checkpointing and migration techniques are used for fault tolerance.

The Amadeus environment [31] follows a centralized approach. QoS parameters concerning time and cost are supported and performance prediction is carried out in order to find the optimal resource, while SLAs are also provisioned for the agreement between the user and the provider of the service. It does not supply any form of fault tolerance or monitoring of the execution of the workflow.

The GrADS project [32] is based on the Globus Toolkit (GT) [33] and aims at applications with large computational and communication load. It supports the specification of workflows, which are analyzed so that the dependencies between the tasks are identified. This helps with parallelizing the tasks, and scheduling algorithms can then be applied. GrADS also supports QoS constraints by estimating the application execution time through historical data and analytical modeling.

The Kepler workflow system [34] is an open source application that extends the work of the Ptolemy II [35] system to support scientific applications using a dataflow approach. Its main characteristic is that it is based on processing steps, called "actors", which have well-defined input and output ports. Users are able to define workflows by selecting appropriate actors and connecting them within a visual user interface. A "director" component holds the overall execution and component interaction semantics of a workflow. The Kepler system provides various mechanisms for fault tolerance, the most important of which is the ability to define actors that are responsible for catching exceptions.

The workflow management service within the GRIDCC project [36] is tasked with optimizing workflows and ensuring that they meet the pre-defined QoS requirements specified for them. The project aims at utilizing instruments through Grid infrastructures. It also focuses on Web Services and SOA and implements a partner language to use with BPEL4WS. Instead of defining a new language for workflows with QoS requirements, or embedding QoS requirements within a language such as BPEL4WS, GRIDCC uses a standard BPEL4WS document along with a second document which points to elements within the BPEL4WS document and annotates them with QoS requirements. The end-to-end workflow pipeline takes a user's design and implements it within the Grid, through reservation services and a performance repository.
Workflows are defined through a web-based editor which allows the augmentation of QoS requirements by defining the user's expectations for the execution. The WfMS provides a mechanism for building QoS on top of an existing commodity BPEL4WS engine, thus allowing the provision of a level of QoS through resource selection from a priori information together with the use of advance reservation.

VLAM-G [22] is the workflow management system of the VL-e project. It is a decentralized, dataflow-driven workflow engine targeting the e-science community. The engine consists of a Run-Time Environment for workflow components and a Run-Time System Manager that controls and orchestrates the execution. Workflow components can be special software developed for VLAM-G, web services or interfaces to legacy software. Workflows are created by connecting components through data dependencies. The system allows for run-time monitoring of the execution as well as interactivity through parameters of the connected modules.

Heinzl et al. [37] propose a service-oriented architecture for multimedia analysis that uses Flexible SOAP with Attachments (Flex-SwA) to model data flows in BPEL. Flex-SwA allows references to be used to model data transfer, thereby minimizing communication costs between services that exchange large amounts of data. Moreover, the authors achieve scalability using a modified BPEL engine capable of handling both grid and Amazon EC2 resources. Their proposal does not include any notion of QoS enforcement or SLAs.

In [38] the authors propose an architecture for the automated provisioning of services in cloud computing environments which is able to install, configure, monitor, start and stop software components. The user is able to select predefined services and customize them

in a suitable way. Also, during runtime the platform is able to support reconfiguration of service components based on monitoring information. All configuration parameters are encoded into the service component descriptions. The solution, as admitted by the authors themselves, lacks fault tolerance mechanisms as well as SLA management and QoS assurance provisions.

Rodero-Merino et al. [39] propose Claudia, an abstraction layer on top of the infrastructure layer allowing for the control of the service lifecycle. Each service in Claudia is defined by a Service Description File that can include customization information, such as the need for the IPs of other components, that is not known before deployment. Moreover, the description contains automatic scalability rules, for example creating new instances of components, based on user-defined rules.

In [40] the authors investigate a rescheduling mechanism for workflow applications in multi-cluster Grids. The scheduling algorithm proposed uses real-time information during the enactment of a workflow to re-estimate the predictions used to map the different tasks of a workflow to the available resources.

An interesting proposal comes from the manufacturing domain. In [41] the authors propose the use of agent-based workflow management mechanisms in industrial automation. Equipment and smart objects are wrapped as agents and exposed as web services that contain real-time status information. These can then be used to form a workflow that describes a manufacturing process. The architecture allows for real-time monitoring and control of the process as well as reconfiguration.

Workflow techniques have also been used in embedded systems, as is the case with [42], where an embedded workflow framework is proposed for assistive devices. The authors propose SISARL-XPDL, which is an extension to XPDL.
It aims to be used in embedded systems and to execute processes such as obstacle avoidance for a robotic mechanism or tasks in intelligent medication carts. While the proposal is very interesting, it cannot be directly compared to the previous works.

The authors in [43] propose a framework that can be used to model meetings and embed them into workflows. A meeting broker then acts as a mediator between a workflow management system and a telecooperation system. This allows geographically dispersed teams to cooperate remotely and hold meetings, the outcomes of which can feed into the overall business process.

Our solution moves beyond the presented works in the area of workflow management as it is targeted towards a cloud environment. To this end, the WfMS presented in this paper is able to support the QoS requirements of real-time applications that are defined in a user-friendly way and enforced through SLA agreements. Moreover, the proposed approach for workflow management enables the execution of applications in a virtualized environment, which among other things allows for higher levels of security through the isolation it provides. The hierarchical architecture of our solution provides scalability benefits as well. This is achieved by allowing parts of the workflow management to be situated within the virtualized environment, taking advantage of the QoS guarantees that it provides.


Figure 1: Comparison of the different frameworks (Taverna, Askalon, Amadeus, GrADS, Kepler, GRIDCC, VLAM-G, Heinzl [37], Kirschnik [38], Claudia, Zhang [40], EMWF and the proposed approach) with respect to Scalability, Security, Virtualization, QoS definition, Monitoring, Interactivity, Fault handling and Legacy code support.

4. System Architecture

4.1. Design considerations

In order to meet the requirements presented in Section 2, several design decisions have been made. First of all, the system follows the SOA paradigm, and thus four different layers, and the actors associated with them, are considered, as depicted in Figure 2. The applications are split up into services forming Virtual Service Networks (VSNs).

In order to meet R1, a hierarchical approach was chosen for the WfMS. Part of the Workflow Management mechanism is real-time critical and is therefore hosted on a Virtual Environment (a cloud) whose capacity (computational, network, memory, storage) is able to expand and contract. The WfMS is thus treated the same way (when it comes to deployment and management by the cloud) as the application service components. Hence, the resource requirements of some of the platform components are not static but are calculated online, making the platform itself scale as the need arises. This not only guarantees the QoS provided to the user, but also means better efficiency for the platform provider.

R2 is made possible through a user interface that communicates with the WfMS and provides the ability to pause, stop or reconfigure running workflows, as long as the changes do not conflict with the SLA that is in place. Fault handling (R3) is realized through the definition of rules that are evaluated during run-time. Requirement 4 is met through a mechanism that is able to map high-level parameters to resource requirements. R5 is met through a monitoring service (as

depicted in Figure 1). Monitoring information is collected by the monitoring services in different virtualized environments and aggregated by the monitoring service that resides in the PaaS layer. It is important to note here that events are evaluated within the VE, prior to being aggregated on the Platform layer, which leads to better reaction times. Requirement 6 is met through the implementation of a service that is deployed with each application service component and makes possible the execution of legacy code.
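The locality argument above can be sketched in code: each virtualized environment evaluates monitoring samples against local thresholds immediately, and only aggregates travel up to the platform-level monitoring service. Class names, fields and the fps threshold below are hypothetical illustrations of this two-tier pattern, not the actual IRMOS interfaces.

```python
class VEMonitor:
    """Hypothetical per-VE monitor: reacts locally, forwards only aggregates."""

    def __init__(self, name, fps_threshold):
        self.name = name
        self.fps_threshold = fps_threshold
        self.samples = []

    def record(self, fps):
        self.samples.append(fps)
        # Local evaluation: a violation raises an event inside the VE,
        # avoiding a round-trip to the platform layer and improving
        # reaction time.
        if fps < self.fps_threshold:
            return {"ve": self.name, "event": "SLA_VIOLATION", "fps": fps}
        return None

    def aggregate(self):
        # Only this summary is pushed to the PaaS-level monitoring service.
        return {"ve": self.name, "avg_fps": sum(self.samples) / len(self.samples)}

platform_index = []  # stands in for the PaaS-layer monitoring index
ve = VEMonitor("render-ve", fps_threshold=24)
local_events = [e for fps in (30, 28, 20, 27) if (e := ve.record(fps)) is not None]
platform_index.append(ve.aggregate())
```

The key property is that the violation at 20 fps is detected at recording time inside the VE, while the platform index only ever sees the aggregated average.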

Figure 2: Layered Structure of the Platform

While security (R7) is not the main focus of this paper, a brief discussion is needed due to its great importance for commercial systems. Authentication is performed by the Portal service, which issues digital certificates to the end user that are then used for all interactions with the platform. Moreover, certificates are automatically issued by the Workflow Manager, which acts as a certificate authority, to all Application Service Components (ASCs) as well as the Framework Service components during their deployment within the Virtualized Environment. These certificates are used in all communications between services and are checked for validity by both the sender and the receiver. Also, the workflow language used by the system provides constructs that can be used to specify different levels of security, such as message or transport level security. Each application that is deployed as a workflow within the Virtualized Environment is in essence an isolated intranet using private IPs by default. Public IPs are only used if they are requested by the user. Therefore security is also realized through network isolation.

Another important point is that the proposed WfMS can be deployed in different cloud environments, as it is based on self-contained components. The only requirement is that communication with the system is done through the exposed interfaces. For instance, different evaluator components may be used as long as the events propagated to the WfMS follow the system's specification. Of course, in order to achieve the desired functionality, the underlying infrastructure needs to be real-time enabled as well.


4.2. Overall System Architecture

In the PaaS layer, a set of services is available to enable the design and deployment of applications on a Virtualized Environment. They are at the core of the platform, as they are responsible for provisioning and managing the execution of real-time services on request of the Application Layer within the Virtualized Environment. Examples of tasks these services fulfill are support for service engineering, service advertisement, fully automated SLA negotiation, mapping of high-level performance parameters to low-level ones, and discovery and reservation of the resources needed for the execution of an application.

The mapping of high-level application terms to resource-level attributes in the IRMOS framework is performed by introducing Artificial Neural Networks (ANNs) as mediators. As application terms we consider every parameter that is inserted into the SLA and is configurable by the customer. This may include either workload parameters (for example, the number of users connected to an e-learning application, the resolution of a streaming video, etc.) or application QoS outputs that denote the KPI levels of the component (like the achieved fps of the aforementioned streaming video). The purpose of the ANN is to correlate these terms with the hardware resources allocated to the running instance of the service. To do so, a benchmarking phase is introduced prior to the deployment of the service. Through this process, a sufficient dataset is collected in a parameter-sweep fashion, which is used to train the networks. For each ASC a corresponding ANN is created through a service-oriented framework [44]. Through this service framework, numerical software such as GNU Octave or Matlab can be directly used in the service lifecycle, and the created models are stored in a repository for access by all the involved services.
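The benchmark-then-train idea can be illustrated with a minimal sketch: a synthetic parameter sweep produces (resource, configuration) → KPI samples, and a tiny one-hidden-layer network learns the correlation. Everything here is invented for illustration — the fps ≈ 60·cpu/resolution relation, the network size, and the training setup are hypothetical stand-ins for the service-oriented ANN framework of [44], not the IRMOS models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "benchmarking phase": sweep CPU share and frame resolution,
# measure the resulting KPI (achieved fps). The relation is invented.
cpu = rng.uniform(0.2, 1.0, 300)          # fraction of a core
res = rng.uniform(0.5, 2.0, 300)          # "resolution units" per frame
fps = 60.0 * cpu / res + rng.normal(0.0, 0.5, 300)

X = np.column_stack([cpu, res])
y = fps.reshape(-1, 1)

# Standardize so plain gradient descent behaves well.
Xm, Xs = X.mean(0), X.std(0)
ym, ys = y.mean(), y.std()
Xn, yn = (X - Xm) / Xs, (y - ym) / ys

# Minimal 2-8-1 tanh network; sizes and learning rate are arbitrary.
W1 = rng.normal(0, 0.5, (2, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)

def forward(Z):
    H = np.tanh(Z @ W1 + b1)
    return H, H @ W2 + b2

for _ in range(5000):                     # full-batch gradient descent on MSE
    H, pred = forward(Xn)
    err = pred - yn
    dH = (err @ W2.T) * (1 - H ** 2)      # backprop through tanh
    W2 -= 0.1 * (H.T @ err) / len(Xn);  b2 -= 0.1 * err.mean(0)
    W1 -= 0.1 * (Xn.T @ dH) / len(Xn);  b1 -= 0.1 * dH.mean(0)

def predict_fps(cpu_share, resolution):
    """Query the trained model for a candidate resource allocation."""
    z = (np.array([[cpu_share, resolution]]) - Xm) / Xs
    return (forward(z)[1] * ys + ym).item()

estimate = predict_fps(0.8, 1.0)          # expected fps at 0.8 core, 1.0 units
```

Once trained, such a model is queried in the opposite direction during negotiation: given a target KPI in the SLA, the platform searches for the cheapest resource allocation whose predicted KPI meets it.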
After this phase, for each of the ASCs we obtain an algebraic rule, in the form of an ANN, that is specific to this ASC and can portray the effect of selecting a specific hardware resource and a specific ASC configuration on the ASC's performance levels, as these are expressed through the KPIs. The reason for applying this technique (ANNs) is that it fulfills many of the requirements that arise from the SOI paradigm. For example, very little knowledge is required about the infrastructure or the ASC, and none of it is confidential. This information is contained in the description of the Application Service Component (ASC) and consists mainly of the aforementioned application terms that are already included in the SLA. Furthermore, the inputs and outputs of the models can easily be re-arranged to include new parameters.

In the IRMOS framework, the real-time scheduler described in [12] runs on the physical machines that host the Virtual Machine Units (VMUs). The scheduler allows for the definition of a reservation in the form RSV = (Q, P), meaning that the processor will be assigned to a specific VMU for Q time units within every interval of P time units (where Q ≤ P). This allows for fine-grained manipulation of the computational resources granted to each VMU while also providing temporal isolation.

4.3. Workflow Management Components

As already mentioned, the WfMS follows a hierarchical design and consists of three components:

Workflow Manager. The Workflow Manager is the central authority for all the running workflows of the platform and is responsible for managing multiple Workflow Enactors that are deployed within virtualized environments. Furthermore, the user actions (e.g. start, stop, etc.) are communicated to the Workflow Enactor.

Workflow Enactor. The Workflow Enactor is responsible for executing the various tasks as they are described by the Workflow Description, as well as for reacting to events generated by the Evaluator. One dedicated Workflow Enactor is deployed within the virtualized environment for every VSN.

Evaluator. The Evaluator is responsible for checking the consistency of the monitoring parameters with the values specified in the SLA, as well as for receiving notifications from the IaaS provider. The Evaluator service is able to hold rules based on monitorable parameters that, when asserted, create events that are used by the Workflow Enactor to make changes to a running workflow.

4.4. PaaS Components

The PaaS, apart from the WfMS, includes the following modules:

Monitoring Service. The Monitoring Service is responsible for collecting the monitorable parameters of the executed services and reporting them to an index that can be queried by the Evaluator. The Monitoring Service also resides within the VSN and has therefore been benchmarked, so the resources it needs are tailored to each deployed application.

SLA Manager. The SLA Manager is responsible for presenting a valid SLA offer to the customer after a proper negotiation over the resources.

Discovery Service. The Discovery Service is responsible for finding available resources that comply with the QoS needs of the services that are to be deployed in the Virtualized Environment, along with their pricing. It is called by the SLA Manager, to whom it passes any results.

Portal Service. The Portal Service provides the necessary interface to the end user of the application in order to invoke the negotiation process as well as the reservation of resources. In addition, its functionality includes the starting, stopping and pausing of an interactive application execution.
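The RSV = (Q, P) reservation scheme described in the architecture overview implies a simple utilization-based admission test: each reservation consumes a CPU bandwidth of Q/P, and a set of reservations fits on one processor only if their bandwidths sum to at most 1. The following is an illustrative sketch of that test, not the actual IRMOS scheduler code.

```python
# Sketch of the utilization check implied by RSV = (Q, P) reservations:
# each VMU receives Q time units of CPU in every period of P time units.

def bandwidth(q, p):
    """CPU share of a reservation RSV = (Q, P), with 0 < Q <= P."""
    if not 0 < q <= p:
        raise ValueError("reservation requires 0 < Q <= P")
    return q / p

def admissible(reservations):
    """True if all (Q, P) pairs fit on one processor (total share <= 1)."""
    return sum(bandwidth(q, p) for q, p in reservations) <= 1.0

# Three VMUs: 20ms/100ms, 30ms/60ms, 10ms/50ms -> 0.2 + 0.5 + 0.2 = 0.9
print(admissible([(20, 100), (30, 60), (10, 50)]))              # True
print(admissible([(20, 100), (30, 60), (10, 50), (15, 100)]))   # False
```

The choice of P also matters for soft real-time workloads: a smaller period yields the same bandwidth with lower worst-case latency, at the price of more frequent context switches.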
The various components and their interfaces are shown in Figure 3. The components in yellow are situated within the Virtualized Environment.

Figure 3: Component Model

4.5. Phases

The process of deploying an application within a service-based platform and using it can be divided into three distinct phases. Even though the workflow management system does not participate in all of the phases, all are presented here for completeness, since various inputs produced in the earlier phases are provided to the WfMS during the execution phase.

4.5.1. Publication phase

In this phase the Application Provider models the application so that it can be deployed within a service-based platform. Each application is a workflow consisting of numerous services, each of which provides some discrete functionality and is called an Application Service Component (ASC). The application developer interacts with a development interface which enables the definition of the input and output interfaces of an ASC, as well as the required computing and network resources, which may depend on parameters of the input and output data actually used, as well as on timing constraints. For this process the UML Profile for Modeling and Analysis of Real-time and Embedded Systems (MARTE) [45] and UML for Modeling Quality of Service and Fault Tolerance Characteristics and Mechanisms [46] are used, creating an Application Service Component Description (ASCD) document. Using the ASCD, the developer can proceed to the specification of the workflow, defining the desired QoS for the application as a whole, as well as defining rules that are used by the Evaluator service to create events. Thereafter, the PaaS is able to benchmark the application and generate the rules needed for the mapping of the high-level parameters used by the developer to low-level parameters [47]. During this phase the Workflow Enactor, the Evaluator and the Monitoring services are also taken into consideration and are treated as part of the application. Therefore, resource requirements are also generated for these services and their execution is guaranteed by the IaaS provider.

4.5.2. Negotiation Phase

The user of an application creates a request towards the PaaS Provider containing the application workflow and high-level requirements, using templates created by the application developer. The document created also includes rules that are used by the Evaluator service to generate events. This is automatically transformed by the PaaS into a request containing low-level requirements towards the IaaS provider, in the form of a Virtual Service Network Description (VSND) document. If the request can be satisfied, a cost is returned to the user, and if this is accepted the required resources are reserved for use within the predefined time-frame. Moreover, SLAs are signed both between the customer and the PaaS provider (Application SLA, A-SLA) and between the PaaS provider and the IaaS provider (Technical SLA, T-SLA).

4.5.3. Execution phase

The execution phase starts when the user logs into the service-based platform portal and asks for the start of the application. All the components that make up the workflow have already been deployed within VMUs and configured according to the user's requests. From then on the user is able to use the application for the time that the resources have been reserved. The execution phase is described in more detail in the following paragraphs.

5. Implementation

The Workflow Management mechanism required development effort on all levels of the platform. It can be classified as a control-driven model, as the connections between the tasks represent the transfer of control from preceding to following tasks. Nevertheless, data transfers are also supported, since the output of a task can feed directly as input to following tasks. The modeling of an application into a workflow happens during the publication phase. The developer describes each component using the service engineering tools provided by the platform, as described previously. The outcome of the process contains the Workflow Description Document.

5.1. Workflow Description Document

The language used to describe the workflow is loosely derived from WS-BPEL. WS-BPEL is a language to model Web Service (WS) based business processes. It is based on XML and is used to represent the interactions between a process and its partners using Web Services that are represented as WSDL services. Moreover, BPEL is fast becoming the industry norm, as it is extensively used as a workflow execution language and uses exclusively WS interfaces.

5.1.1. Basic construct

The basic unit of the implemented specification language is the invoke construct (depicted in Listing 1), which is the invocation of a specific component of the application and is defined as follows: The name tag is used to identify the action. The partnerlink tag contains the ID the operation refers to. This is used by the Workflow Enactor to find the IP to use in order to invoke the service, as this is not known at design time and is made available by the IaaS.
The legacy tag is used to define whether the invoked component is implemented as a service or not. In the latter case it is assumed that a service wrapper is used. This means that a special mechanism has been implemented by which the application developer provides an executable script written in any language (Perl, shell script, etc.) which is executed by a service residing on the remote machine. This service is called, effectively wrapping the legacy code in a service-oriented manner. The security tag is optional and can be used to define the level of security desired for the communication between the Workflow Enactor and the service. If it is not defined, no security is used. The message option defines message-level security based on the WS-SecureConversation specification [48], while the transport option defines transport-level security, where TLS is used. The operation tag is used to define the operation to be executed on the called service; in the case of legacy code it must match the name of one of the provided scripts. The input and output tags are used for data management. The source and destination URLs are specified as GridFTP [49] URLs (for remote files) or as file URLs for files local to the service. The event tag is used to define events which are generated by the Evaluator and the actions that can be taken in such cases. A wide range of events can be defined, originating both from the application and from the Infrastructure provider. The Evaluator service is responsible for generating events that are known to the Workflow Enactor. The error tag is used to define how erroneous events are handled. Three choices are provided: i) retry specifies that the Workflow Enactor should try to invoke the service again (a maximum of two retries are allowed, after which the workflow fails), ii) fail specifies that the workflow should fail and the Workflow Manager should be notified, and iii) custom, which the application developer can use to specify custom handling steps. Currently, invocations of other services can be defined.

<invoke name="name of the invoke">
  <partnerlink>ID of the ASC to be executed</partnerlink>
  <legacy>true/false</legacy>
  <security>message/transport</security>
  <operation>operation to be called on service</operation>
  <input>
    <source>source of input</source>
    <destination>destination of input</destination>
  </input>
  <output>
    <source>source of output</source>
    <destination>destination of output</destination>
  </output>
  <event id="eventid">
    ...
    <error>retry/fail/custom</error>
  </event>
  <error>retry/fail/custom</error>
  ...
</invoke>

Listing 1: Invoke Construct
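Based on the tags described above, an enactor could turn an invoke element into an internal task description. The following is an illustrative sketch; the element layout follows the tag names in the prose, but the exact schema here (and the ASC ID and values) is hypothetical, not the IRMOS one.

```python
# Sketch: parse an invoke construct into a task dictionary.
# Tag names follow the description in the text; the schema is illustrative.
import xml.etree.ElementTree as ET

INVOKE_XML = """
<invoke name="StartEncoder">
  <partnerlink>DFT.DFP.encoder</partnerlink>
  <legacy>true</legacy>
  <security>transport</security>
  <operation>start</operation>
  <error>retry</error>
</invoke>
"""

def parse_invoke(xml_text):
    root = ET.fromstring(xml_text)
    task = {"name": root.get("name")}
    for tag in ("partnerlink", "legacy", "security", "operation", "error"):
        node = root.find(tag)
        task[tag] = node.text.strip() if node is not None else None
    return task

task = parse_invoke(INVOKE_XML)
print(task["partnerlink"], task["error"])  # DFT.DFP.encoder retry
```

At run time, the partnerlink ID would then be resolved to the concrete IP returned by the IaaS before the service is actually called.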

5.1.2. Control Flow Constructs

The implementation supports classic control flow mechanisms, such as "sequence", "while" and "wait", as well as more advanced ones, such as "flow".

Sequence. The sequence construct is used to define a set of tasks that should be executed in sequential order and can be nested into other constructs, such as flow or while.

<while>
  <condition>Boolean expression</condition>
  <componentID>componentID</componentID>
  ...
</while>

Listing 2: While Construct

<wait name="Wait1">
  <for>P0Y0M0DT0H0M0S</for>
</wait>

Listing 3: Wait Construct

While. The while construct is used to define a classic while loop. The componentID tag is used to specify the component of the workflow which holds the variable used in the condition tag. The syntax is depicted in Listing 2.

Wait. The wait construct (presented in Listing 3) is used to pause the execution of a workflow for a specified amount of time. The for tag has the format Years-Months-Days-Hours-Minutes-Seconds.

Flow. The flow construct (presented in Listing 4) is used to define a set of tasks that should be executed concurrently. The execution of the workflow pauses until all tasks defined within the flow have ended. Any type of construct can be nested within a flow, such as invoke, sequence and flow.

<flow>
  ...
  ...
</flow>

Listing 4: Flow Construct

Even though the platform relies heavily on QoS requirements, we chose not to include these in the Workflow Description language. The reason for this choice is primarily to make the Workflow Enactor as fast as possible. Thus, the responsibility for monitoring the adherence of the service to the QoS expectations has been moved to a separate service, namely the Evaluator service. This makes the Workflow Enactor service lighter and thus faster. Moreover, it allows different evaluation services to be used with the Workflow Enactor.

5.2. Service Implementation and Interactions

5.2.1. General Implementation

As described previously, the implementation follows a layered approach. In the PaaS layer, all components are hosted on dedicated physical hosts and have been realized as Globus Toolkit 4 (GT4) [33] services. GT4 is an open source Grid middleware that provides the necessary functionality required to build and deploy fully operational Grid Services. It includes software for security, information infrastructure, resource management, data management, communication, fault detection, and portability. It supports the core specifications that define the Web Services architecture. It also supports and implements WS-Security and other specifications relating to security, as well as the WSRF, WS-Addressing and WS-Notification specifications used to define, name, and interact with stateful resources. The proposed platform takes full advantage of the functionality provided by GT4 and ensures the compatibility of the platform with widely used specifications.

On the other hand, the IaaS layer follows a virtualization approach, as used by Cloud solutions, wherein all services and application components are deployed within VMUs that are hosted on physical machines running the Kernel-based Virtual Machine (KVM) virtualization engine [50], as well as a real-time scheduler. This allows the IaaS provider to manage the infrastructure in an efficient way, benefiting from the scalability, agility, automation, and resource-sharing features of a cloud platform. Moreover, the IaaS provider is able to provide customized and isolated execution environments to its clients. The network infrastructure is also virtualized, and requirements targeting bandwidth and latency can be met. Isolation also means greater security guarantees for the customer, as the VMUs can be configured to have only virtual IPs, which means that they are accessible only from VMUs within the virtual network.

5.2.2. Service Interactions

a) Publication Phase

The workflow description document is created during the publication phase in its abstract form. This means that it does not contain the concrete resources on which components will be deployed, but only their IDs.
b) Negotiation Phase

After the negotiation phase, the IaaS provider returns the information needed to make the workflow concrete to the SLA Manager through the IaaS gateway. The SLA Manager calls the Workflow Manager, passing the following information: a) the Workflow Description Document, b) the IPs of all the components of the workflow, along with the IPs of the Workflow Enactor and the Monitoring and Evaluation services that reside within the virtualized environment, c) configuration parameters for the ASCs, d) the SLA ID to which the workflow is bound, and e) the time when the reservation will be active. Upon receipt of this information the Workflow Manager updates the workflow, making it concrete, and creates a persistent WSRF resource which holds all the above information.

c) Execution Phase

Pre-execution sub-phase

Prior to the actual execution of the application is the pre-execution sub-phase, which is triggered based on the time that the reservation becomes active. During this phase all the VMUs are set up by the IaaS provider. The Workflow Manager calls the Workflow Enactor that is specific to the application and resides within the Virtualized Environment, passing to it the workflow description document as well as the configuration parameters for all the ASCs. The Workflow Enactor uses this information to configure all the ASCs, making them ready for execution.

Execution sub-phase

Figure 4: Publication Phase. (The A-SLA template includes the abstract workflow description or a link to its location.)

Figure 5: SLA Negotiation Phase

The execution phase starts when the user logs on to the platform through the Portal Service and provides the credentials needed for authentication. If authentication succeeds, the application user is able to request the execution of the workflow. The Portal Service propagates this request, along with the SLA ID to which the application refers, to the Workflow Manager. The Workflow Manager invokes the Monitoring Service, which is responsible for monitoring the execution of the application, and propagates to it a list of monitoring parameters. The Evaluator service is also invoked. When this is complete, the Workflow Manager invokes the Workflow Enactor and asks for the execution of the application. The Workflow Enactor uses the workflow description document to start the execution of the application.
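The enactment semantics of the sequence and flow constructs defined earlier can be sketched with a small thread-based illustration: sequence tasks run one after another, flow tasks run concurrently, and execution proceeds only once every branch of the flow has ended. This is a simplified stand-in for the actual enactor, and the ASC names are hypothetical.

```python
# Sketch: execute <sequence> and <flow> constructs.
# Illustrative only; not the IRMOS Workflow Enactor implementation.
import threading

def run_sequence(tasks, log):
    # <sequence>: tasks run strictly one after another.
    for task in tasks:
        task(log)

def run_flow(tasks, log):
    # <flow>: tasks run concurrently; the workflow pauses until all end.
    threads = [threading.Thread(target=task, args=(log,)) for task in tasks]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

def invoke(name):
    # Stand-in for invoking an ASC; a real enactor would call its endpoint.
    def task(log):
        log.append(name)
    return task

log = []
run_sequence([invoke("Start IPC")], log)
run_flow([invoke("Start IPU%d" % i) for i in range(1, 9)], log)
run_sequence([invoke("Start LB")], log)
print(log[0], log[-1], len(log))  # Start IPC Start LB 10
```

The join on every flow branch is what gives the construct its barrier semantics: no task after the flow can start while any branch is still running.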


Figure 6: Execution Phase

In order to provide interactivity for the end user, a running workflow can be paused or stopped. The user request is propagated from the Portal Service to the Workflow Manager, which contacts the Workflow Enactor. In the case of the pause operation, snapshots of all VMUs executing ASCs are taken and saved in a repository. On completion, the Workflow Enactor contacts the Virtualized Environment Gateway and asks that all VMUs be put in a paused state. The user is then able to resume execution at any point in time within the reservation period. The saved snapshots can also be used for fault tolerance, and are planned to be used for continuing the execution under a new reservation in future implementations of the system.

Moreover, the user may wish to change a running workflow. The user first pauses the running workflow. Then, using the tools described in the publication phase, the user is able to make changes to the parts of the workflow that have not yet reached the running state. A new VSND is created and, if the resource requirements do not change, the updated workflow description document is propagated to the Workflow Enactor, which also reconfigures the Monitoring and Evaluation Services. When this is complete, the user is able to resume workflow execution. This is done in order to keep the system consistent, keeping in mind that SLAs are in place for all the resources. The workflow needs to be paused in order to prevent a component from entering the running state while the user is in the process of changing the workflow. In case the resource requirements change, the system moves to a new SLA negotiation phase.

The Workflow Enactor service is also able to receive events generated by the Evaluator. This service constantly receives information from the Monitoring Service, which monitors the execution of the ASCs and collects information regarding parameters that have been defined as 'monitorable' in the SLA. The SLA also contains rules that the Evaluator uses to generate events. Currently, a simple rule engine is used to evaluate events. The rules can contain any monitorable parameter and simple comparison operators (e.g. =, >, <), as in the following example:

droppedFrames > 15

Listing 5: A simple rule generating a reconfigure event

The workflow description contains information on what steps should be taken for every known event. The steps can include the invocation of other services or changes to the configuration of already running ones. An example is the case of a service producing a video stream in a known format with a predefined streaming rate. If the number of dropped frames exceeds a limit, the service could be configured to produce output in another, less resource-consuming format, until a second service is added to the workflow to achieve the desired output. This provides the capability that a short violation of the QoS expectations will not be detrimental to the application.

The Workflow Enactor is also able to react to errors through the definition of error-handling steps in the workflow description. These can be simple, such as retry or fail, or more complex, defined by the application developer, and can contain the invocation of new services or the reconfiguration of running ones. It is important to distinguish here between events and errors. Events are known states of the system to which the Workflow Management system can react and which have been defined by the application developer, as in the earlier example of a low streaming rate. Errors, on the other hand, are unknown states, for example the inability to contact a specific service, which the application developer has not defined. In such cases generic steps can be taken, such as saving the state of all running VMUs, or roll-back actions. The application developer can express these steps with the constructs of the workflow definition presented earlier, as well as the special cases of retry and fail. Retry has been added because it is an easy way to overcome problems that might appear sporadically. The resolution of hardware errors is left to the IaaS Provider.

Figure 7: Application Event
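The Evaluator's rule checking described above can be sketched as follows: SLA rules built from monitorable parameters and simple comparison operators are matched against the latest monitoring values, and each asserted rule raises a named event for the Workflow Enactor. The rule representation and names here are illustrative, not the actual IRMOS rule engine.

```python
# Sketch: a minimal rule engine matching SLA rules against monitoring
# values and emitting named events. Illustrative only.

OPS = {">": lambda a, b: a > b,
       "<": lambda a, b: a < b,
       "=": lambda a, b: a == b}

def check_rules(rules, monitoring):
    """rules: list of (parameter, operator, threshold, event_id).
    monitoring: dict of the latest monitored values.
    Returns the IDs of the events to raise."""
    events = []
    for param, op, threshold, event_id in rules:
        value = monitoring.get(param)
        if value is not None and OPS[op](value, threshold):
            events.append(event_id)
    return events

rules = [("droppedFrames", ">", 15, "reconfigure"),
         ("fps", "<", 24, "fps_low")]
print(check_rules(rules, {"droppedFrames": 22, "fps": 24}))  # ['reconfigure']
```

Keeping this evaluation outside the Workflow Enactor, as the platform does, means the enactor only sees named events and never has to interpret monitoring data itself.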

Also, during execution of the workflow, the Workflow Enactor may need to make decisions based on variables held by the ASCs, for example in the case of a while loop. These variables are written to an Index Service that resides in the same VMU as each ASC. This Index Service is a flexible registry that publishes information as resource properties that can be retrieved by clients using the standard WSRF resource property query and subscription/notification interfaces. The Workflow Enactor is thus able to query the Index Service and retrieve the value of a particular variable that is referenced in the Workflow Description Document. The Workflow Enactor propagates information concerning the state of the workflow to the Workflow Manager, which subsequently provides this information through the Portal Service.

6. Evaluation

The presented WfMS has been validated in public demonstrations, such as the European Commission's ICT 2010 event [51], through three application scenarios, namely Interactive eLearning, Virtual and Augmented Reality, and Film Post-production. Although all three scenarios imposed requirements on the WfMS, the analysis that follows focuses on Post-production, as it was the use case with the most intensive requirements in terms of workflow execution.

6.1. Experimental setup

We ran our experiments at two sites: the University of Stuttgart and the National Technical University of Athens. Four P4 3GHz machines with 1GB RAM running Fedora Core 9 with kernel 2.6.27 were used to host the Platform Services, while four Athlon 64 machines (2x2.4GHz, 4GB RAM) and six Core2 Quad Q9400 machines (4x2.66GHz, 8GB RAM), also running Fedora Core 9 with kernel 2.6.27, were used to host the Virtual Machines. The virtualization software used was KVM.

<invoke name="Start IPC">
  <partnerlink>DFT.DFP.ipc</partnerlink>
  <legacy>true</legacy>
  <operation>/home/start</operation>
  <error>retry</error>
</invoke>
<while>
  <condition>ipcstate != ready</condition>
  <componentID>DFT.DFP.ipc</componentID>
</while>
<invoke name="Start IPU1">
  <partnerlink>DFT.DFP.ipu1</partnerlink>
  <legacy>true</legacy>
  <operation>/home/start</operation>
  <error>retry</error>
</invoke>
<invoke name="Start IPU2">
  ...
</invoke>
...
<invoke name="Start LB">
  <partnerlink>DFT.DFP.lb</partnerlink>
  <legacy>true</legacy>
  <operation>/home/start</operation>
  <error>retry</error>
</invoke>

Listing 6: Abstract Workflow Description Document for Post-production

6.2. Results

Our first experiment aimed at measuring the capability of the system to become aware of and evaluate events. For this purpose an ASC raised an event that was evaluated by the Evaluator Service and propagated to the Workflow Enactor, which in turn acknowledged its receipt to the ASC. The elapsed time was measured. In the first case the Workflow Enactor, the Evaluator and the ASC were part of the same VSN deployed on the virtualized environment. The measurement was conducted n = 1000 times. This is shown in Figure 8.


Figure 8: Round Trip Time Within one VSN (with QoS)

In the second case the Workflow Enactor, the Evaluator and the ASCs were deployed on physical machines and were geographically dispersed. The results are shown in Figure 9.

                          Virtualization   No virtualization
    Average                    56.25            242.07
    MIN                        49.57            211.40
    MAX                        80.77            463.69
    STDEV                       6.22             19.96
    95% Confidence Level        0.38              1.29
    90th Percentile            65.48            254.95

Table 1: Notification Times (in ms)

The above results are to be expected, since in the first case the event producer and the Evaluator are not only geographically close but also use a network with guaranteed characteristics. This does not hold in the second case, where all communication happens over an unreliable network. Therefore, if our hierarchical approach were not followed and the Workflow Enactor, Monitoring and Evaluator services did not reside within the IaaS layer, the reaction of the platform to events and faults could not be guaranteed.

Figure 9: Round Trip Time Over Long Distance (No QoS)

                          Virtualization   No virtualization
    Average                  1510.69           2156.23
    MIN                       625.06            783.13
    MAX                      2721.38           4831.85
    STDEV                     710.21           1116.54
    95% Confidence Level      168.80            265.38
    90th Percentile          2569.97           3950.62

Table 2: Execution Times (in ms)

Otherwise, dedicated networks would need to exist between all the remote sites that the platform would wish to manage.

Another experiment was conducted in order to evaluate the proposed architecture, in which the Workflow Enactor triggered the execution of a trivial application on a remote site multiple times, and the round-trip time needed for each command to reach the application and report back that it was executed was measured. In the first set of experiments both the Workflow Enactor and the VMU hosting the application were situated within the same VSN, which was benchmarked, and specific QoS guarantees were requested from the VE for the two VMUs and the interconnection between them. In the second set of experiments the Workflow Enactor was not part of the VSN but was deployed on a standalone physical machine. In both cases computationally intensive jobs were executed in parallel after some time, both on the physical machine hosting the Workflow Enactor and on the one hosting the VMU, in order to simulate the varying loads that are common in production systems. The experiment was conducted n = 70 times. The results are depicted in Figure 10.

Figure 10: Comparison of Execution Times between Virtualized and Non-virtualized Resources

Clearly, when no other task is executed in parallel, the two architectures show comparable results. However, when we simulate a production system the two systems behave differently. Our approach is unaffected by the variation in load, due to the QoS guarantees provided by the virtualized environment. On the other hand, in the case that the

Workflow Enactor is not situated within the virtualized environment, the performance of the system varies greatly, which could lead to not fulfilling the requirements of the application. It should also be noted that the performance of the system is affected in both cases by the use of security, which imposes a high overhead on the exchange of messages between the Workflow Enactor and the application wrapper. In our case, the overhead that the Workflow Enactor introduces to the system is taken into account, due to the benchmarking phase that has taken place, and thus the requirements of the application can be fulfilled.

6.3. Post-production Use Case

In this scenario an application is used that provides collaborative and distributed color grading as part of Film Post-production, based on the Bones Digital Dailies production system [52]. A post-production house is contracted to perform color grading on some film shots, while the film director and the producer review that footage. The post-production colorist uses a Color Correction Station to change parameters of the film shots, the effect of which the film director and the producer are able to view in real time on their geographically dispersed viewers. The application consists of the following ASCs: an Image Processing Controller (IPC) that keeps the color correction parameters derived from the colorist, eight Image Processing Units (IPUs), and a Load Balancer (LB) responsible for the delivery of the stream to the viewers. The ASCs have been modeled by the application developer, and the abstract workflow description document was produced in the publication phase, part of which can be seen in Listing 6. In this case the VSN consisted of 10 VMUs containing the application components, as well as a VMU containing the Workflow Enactor, Evaluator and Monitoring Service.
The overarching QoS requirement was that the application should deliver a total rate of 24 frames per second from the LB to all the connected viewers, which was achieved by the proposed architecture.

Figure 11: Architecture of the Post-production use case

Fault tolerance of the system was also evaluated. For this reason one of the physical machines hosting a VMU of an IPU was forcibly shut down. This resulted in two different events being generated. First of all, the application started to drop frames, which triggered the Evaluator to propagate this event to the Workflow Enactor. The workflow description document defined that in such a case a reconfiguration of the application should take place, so as to use a different codec. Moreover, the failure of the physical machine was acknowledged by the Infrastructure provider, which started the same VMU on a different host. The Evaluator service was notified of this event and propagated it to the Workflow Enactor in order to configure the VMU with the needed input and start the new ASC. Once the ASC was operational, it was set to raise a new event which signified that the application should be restored to its original state, after which a new round of reconfiguration took place. The total time needed by the complete process can be defined as:

T_total = max(T_evaluate + T_reconfigure, T_ack_failure + T_bootup + T_notify + T_configure_vmu + T_start_asc + T_reconfigure)

We measured T_total ≈ 149.1 sec, which can be broken down into T_evaluate ≈ 47 msec, T_reconfigure ≈ 1.9 sec, T_ack_failure ≈ 1.2 sec, T_bootup ≈ 79.8 sec, T_notify ≈ 41 msec, T_configure_vmu ≈ 1.9 sec and T_start_asc ≈ 64.3 sec. It is therefore evident that the process of booting up the VMU and starting the application component takes much longer than reconfiguring the application. These two tasks can be executed in parallel, and reconfiguring the application to work with a different codec helps mitigate the effects of the hardware failure. Moreover, the time needed for starting a new application component is application specific and could not be avoided in any case of failure, apart from using a live replica, with the resource costs that this incurs.
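The measured component times can be plugged directly into the formula above to confirm that the VMU-restart branch dominates the total recovery time (all values in seconds, taken from the measurements reported in the text):

```python
# Recompute T_total from the measured components: the two branches of the
# max() proceed in parallel, and the VMU restart branch dominates.
t_evaluate      = 0.047
t_reconfigure   = 1.9
t_ack_failure   = 1.2
t_bootup        = 79.8
t_notify        = 0.041
t_configure_vmu = 1.9
t_start_asc     = 64.3

branch_app = t_evaluate + t_reconfigure
branch_restart = (t_ack_failure + t_bootup + t_notify +
                  t_configure_vmu + t_start_asc + t_reconfigure)
t_total = max(branch_app, branch_restart)
print(round(t_total, 1))  # 149.1
```

The application-reconfiguration branch completes in under 2 seconds, which is why switching codecs masks most of the roughly 149-second window needed to bring the replacement VMU and ASC back online.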
Figure 12: Screenshots of the Post-production application: (a) Grading Station, (b) Viewing Station, (c) Monitoring.

More importantly, as has been discussed earlier, the aim of the system is to provide QoS guarantees for the running applications. In the case of the video post-production

scenario, the quality metric is the number of dropped frames. We therefore conducted a series of tests to determine whether the reconfiguration capability provided by the WfMS proves helpful. For this purpose one of the VMUs was forcibly shut down. In the first case, no actions were defined in the workflow and the error was left to the IaaS layer to resolve. In the second case, the workflow description contained commands to reconfigure the application to work with another codec. A representative run can be seen in Figure 13; the application reported the number of dropped frames per 5-second interval. As can be seen, in the reconfiguration case a burst of frames is dropped when the application is asked to change the codec it uses. This is due to the way the application works, as all buffers need to be flushed and the different components synchronized. After these short spikes the application returns to normal operation, with a small number of frames dropped in each 5-second interval. On the other hand, when the application continues to function with one of the IPUs missing, it constantly drops frames. We ran the experiment 50 times; on average, over a period of 160 seconds, the number of dropped frames was 225 in the first case and 586 in the second.

Figure 13: Dropped frames per 5 seconds (dropped frames per 5-second interval over time in sec, 0-170; series: No Reconfiguration vs. Reconfiguration).

7. Conclusion

In this paper, we presented a two-layered WfMS approach that resides both on the platform services layer and within the virtualized environment. Taking into consideration that applications tend to become service-oriented (thus consisting of many application service components), the primary objective of the proposed approach is to invoke the application services, monitor the execution and handle events with regard to the specific performance constraints set by future internet applications, i.e. interactive soft real-time multimedia applications, as stated in the corresponding application descriptions. Given the need to facilitate real-time interactivity, we presented specific constructs in the workflow description document and the enhancements proposed to allow for enactment of the application services. Key to the proposed WfMS is the two-layer approach being followed, which enables an instance of the Enactor to reside within the virtualized environment in order to manage the application service components and minimize interactions with the platform services, where the Workflow Manager resides. Moreover, key elements of the architecture are considered parts of the application and are benchmarked and modeled, thus allowing performance predictions to be made. Based on this approach, the operation of the proposed mechanism achieves the performance requirements set by soft real-time applications, since invocation of services and faults are handled within specific timing limits.
The experiments showed promising results; the performance of the mechanism is therefore considered well established, allowing its adoption in any heterogeneous and, especially, cloud-based system that seeks to provide QoS guarantees and facilitate real-time interactivity. Future work will focus on the scalability of the proposed approach in large-scale cloud computing systems and high-performance computing Grids by embracing hierarchical workflow management strategies.

Acknowledgement

This research has been partially supported by the IRMOS EU co-funded project (ICT-FP7-214777).