A QoS-Aware and Fault-Tolerant Workflow Composition for Grid

2008 Seventh International Conference on Grid and Cooperative Computing

Jiang Bian, Chuliang Weng, Jie Du, Minglu Li
Department of Computer Science, Shanghai Jiaotong University, Shanghai, China

Abstract-The continuously increasing demand for computing power in electric system computing exceeds the capacity of a single host and suggests an approach based on distributed computing. An emerging solution is grid technology, which allows organizations to make better use of existing computing resources by providing them with a single, transparent, aggregated source of computing power. A new generation of grid infrastructure, in which web services are the building blocks, allows the management of web services workflows. In this paper, a novel strategy, QoS-Aware and Fault-Tolerant Workflow Composition, is applied that uses web services and the Business Process Execution Language to address task assignment, security, flexibility and workflow management. The strategy is a lightweight, web services based architecture for sharing computing power. It is suitable for executing computing jobs that can run in batches, and it also addresses current issues in web services based computing applications such as system resilience, fault tolerance, efficiency of job scheduling and instability in congested network environments.

Keywords-Grid Workflow; Web Service; QoS; Fault Tolerant

I. INTRODUCTION

Grids [1] have moved from the obscurely academic to the highly popular. We read about Compute Grids [2], Data Grids, Science Grids, Access Grids, Knowledge Grids, Bio Grids, Sensor Grids, Cluster Grids, Campus Grids, Tera Grids [3], and Commodity Grids.

A Grid integrates and coordinates resources and users that live within different control domains (for example, the user's desktop vs. central computing, different administrative units of the same company, or different companies) and addresses the issues of security, policy, payment, membership, and so forth that arise in these settings. Otherwise, we are dealing with a local management system. A Grid allows its constituent resources to be used in a coordinated fashion to deliver various qualities of service, relating for example to response time, throughput, availability, and security, and/or co-allocation of multiple resource types to meet complex user demands, so that the utility of the combined system is significantly greater than that of the sum of its parts.

Service-Oriented Architectures (SOAs) have met with great success in both the Grid and the Web communities. Most recent middleware systems have adopted them in order to address interoperability and extensibility problems. Although SOAs have been widely adopted for middleware design, and despite their known advantages, they are less frequently encountered in the design of scientific applications.

For a large range of scientific applications, the second point is addressed by workflow managers. Scientific data processing procedures often require applying many data filtering, modeling, quantification and analysis procedures. Furthermore, large data sets often have to be processed. A workflow manager can describe the processing dependencies independently from the actual scientific codes involved. The associated workflow enactor can optimize the execution on a grid infrastructure by exploiting the data and code parallelisms intrinsically expressed in the workflow.

The goal of the Web Services effort is to achieve interoperability between applications by using Web standards. Web Services use a loosely coupled integration model to allow flexible integration of heterogeneous systems in a variety of domains including business-to-consumer, business-to-business and enterprise application integration. The following basic specifications originally defined the Web Services space: SOAP [SOAP 1.1], Web Services Description Language (WSDL) [WSDL 1.1], and Universal Description, Discovery, and Integration (UDDI) [UDDI]. SOAP defines an XML messaging protocol for basic service interoperability. WSDL introduces a common grammar for describing services. UDDI provides the infrastructure required to publish and discover services in a systematic way. Together, these specifications allow applications to find each other and interact following a loosely coupled, platform-independent model [4].

Systems integration requires more than the ability to conduct simple interactions by using standard protocols. The full potential of Web Services as an integration platform will be achieved only when applications and business processes are able to integrate their complex interactions by using a standard process integration model. The interaction model that is directly supported by WSDL is essentially a stateless model of request-response or uncorrelated one-way interactions.

978-0-7695-3449-7 2008 U.S. Government Work Not Protected by U.S. Copyright DOI 10.1109/GCC.2008.95


Figure 1. Grid VO

WS-BPEL defines a model and a grammar for describing the behavior of a business process based on interactions between the process and its partners. The interaction with each partner occurs through Web Service interfaces, and the structure of the relationship at the interface level is encapsulated in what is called a partnerLink. The WS-BPEL process defines how multiple service interactions with these partners are coordinated to achieve a business goal, as well as the state and the logic necessary for this coordination. WS-BPEL also introduces systematic mechanisms for dealing with business exceptions and processing faults. Moreover, WS-BPEL introduces a mechanism to define how individual or composite activities within a unit of work are to be compensated in cases where exceptions occur or a partner requests reversal.

In Grid environments, resources span across multiple administrative domains and are not under the control of the workflow management systems. Moreover, many users are competing for limited resources. Workflow execution failures may have many causes, such as a change of local resource policy or the failure of resources and network fabric. Thus, Grid workflow management systems should be able to handle failures flexibly and support reliable executions in the presence of concurrency and failures.

The rest of this paper is organized as follows: Section 2 briefly introduces related work on QoS-based and fault-tolerant workflow composition. In Section 3, an SOA-based grid application and workflow system is proposed. QoS-aware workflow composition and the fault-tolerance mechanism are explained in Section 4 and Section 5. The performance evaluation is analyzed in Section 6, and the conclusion is given in Section 7.

II. RELATED WORK

Building applications by assembling legacy codes for processing and analyzing data is very common. It allows code reuse without placing too high a load on the application developers. The logic of such a composed application, referred to as the application workflow, is described through a set of computation tasks to perform and constraints on the order of processing, such as data dependencies. Many workflow representation formats and execution managers have been proposed in the literature with very different properties [5]. The emblematic task-based workflow manager is the CONDOR Directed Acyclic Graph Manager (DAGMan) [6], on top of which the Pegasus system is built [7]. Based on the static description of such a workflow, many different optimization strategies for the execution have been proposed [8]. The service-based approach has been implemented in different workflow managers such as the Kepler system [9], the Taverna workbench [10], and Triana [11]. The main interest of using grid infrastructures is to exploit the potential application parallelism thanks to the availability of the grid resources. There are three different levels of parallelism that can be exploited in service-based workflows [12].

In Grid environments, there are a large number of similar or equivalent resources provided by different third parties. Grid users can select suitable resources and use them for their workflow applications. These resources may provide the same functionality but optimize different QoS measures. In addition, different users or applications may have different expectations and requirements. Therefore, it is not sufficient for a workflow management system to consider only the functional characteristics of the workflow. QoS requirements such as a time limit (deadline) and an expenditure limit (budget) for workflow execution also need to be managed by workflow management systems. Users must be able to specify their QoS expectations of the workflow at the design level. Then, the actions taken by workflow systems at run time must be chosen according to the initial QoS requirements.

III. QOS-BASED GRID FRAMEWORK

This section briefly introduces WS-PG [13] and some issues that should be studied further for improvement. WS-PG is based on the architecture of web services and inherits the characteristics of Service-Oriented Architecture (SOA), which consists of three participants: service requester, service provider and service broker. The three roles are described as follows:

A. The role of coordinator
The coordinator acts as a service broker to fairly mediate between the computing unit (service requester) and the computing requester (service provider). Its major function is to maintain a list that records the URL and requirement of each computing requester. An entry in this list is created when a computing requester publishes its Web service with the coordinator. If a computing unit asks for subtasks through the coordinator, the coordinator assigns the URL of a computing requester in the list to the computing unit by a round-robin mechanism. Afterwards, the computing unit uses the specified URL to communicate with the computing requester directly. In addition, account and auditing management is implemented on the coordinator side. This role corresponds to the role of UDDI in SOA.
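The round-robin assignment performed by the coordinator can be illustrated with a minimal Python sketch. The class name `Coordinator`, its methods and the example URLs are hypothetical and only illustrate the list-and-rotate behaviour described for the coordinator role; the paper does not give the actual WS-PG implementation.

```python
class Coordinator:
    """Toy service broker: stores published requester URLs and hands them out in turn."""

    def __init__(self):
        self._requesters = []  # (url, requirement) pairs in publication order
        self._next = 0         # position of the next requester to hand out

    def publish(self, url, requirement):
        """Called when a computing requester publishes its Web service."""
        self._requesters.append((url, requirement))

    def assign(self):
        """Return the next requester URL in round-robin order, or None if the list is empty."""
        if not self._requesters:
            return None
        url, _ = self._requesters[self._next % len(self._requesters)]
        self._next += 1
        return url


coordinator = Coordinator()
coordinator.publish("http://requester-a.example/ws", "cpu")
coordinator.publish("http://requester-b.example/ws", "memory")
first_three = [coordinator.assign() for _ in range(3)]
```

After an idle computing unit receives a URL from `assign`, it would contact that requester directly, so the coordinator stays out of the data path.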


B. The role of computing power requester
Requesters design their experiment processes and publish their requirements with the coordinator. In addition, the requester assigns the subtasks to the computing unit with the workflow control capability.

C. The role of computing unit
This role is responsible for executing computation. While it is idle, it asks the coordinator for a job. After receiving the URL of the requester's Web service, it negotiates with the requester to download the subtasks along with the required data. When a subtask is finished, it returns the result to the requester, and the whole procedure continues until all subtasks are completed.

The operational procedures of WS-PG are shown in Figure 2 with the following steps:

Step 1: Requester uses the BPEL online editor to design the entire task process.
Step 2: Requester registers its task with the coordinator.
Step 3: Coordinator confirms the request.
Step 4: Computing unit registers its computing power with the coordinator.
Step 5: Coordinator sends the requester's addresses to the computing units.
Step 6: Computing unit makes a contract with the requester.
Step 7: Requester transfers the tasks to the computing unit.
Step 8: Computing unit sends the detailed execution data to the coordinator.
Step 9: Computing unit sends back the result after execution.
Step 10: Requester combines all results when the job is completed.

[Figure 2 diagram labels. User Request: 1: Design workflow, 2: Register tasks, 7: Transfer tasks, 10: Combine results. Computing Resource: 4: Register computing power, 6: Contract with requestors, 8: Store task details, 9: Return results. Service Coordinator: 3: Confirm requests, 5: Dispatch task.]

Figure 2. PowerGrid framework

D. QoS-based workflow composition
In a specific Grid environment, QoS constraints and preferences are set by the user. A set of constraints (e.g. thresholds/restrictions on specific values of the parameters), along with possible user preferences that attach a weight attribute to the QoS parameters, is also passed into the workflow mapping mechanism.

As described in the workflow model overview, QoS constraints are taken into account in the definition and implementation of the concrete workflow and are prerequisites for end-to-end QoS provisioning. In addition, QoS parameters are vital to scientific research and computing in large distributed systems such as the Grid. In this section, we present the parameters that have been used in our model and are considered as initial QoS constraints.

In the following paragraphs, the QoS parameters are classified according to their logical categorization, which means that constraints in the same category share common properties. Currently, this work classifies QoS constraints into three major categories:

1) User-defined parameters: which relate to requirements/constraints that the user who initiates the workflow execution process would like to pose, such as cost restrictions (e.g. maximum overall cost).
2) Application parameters: which relate to the offered QoS from the application perspective. For example, the application configuration could play a significant role in the availability of the task to be executed.
3) Resource parameters: which relate to all types of resources, including computational, storage and network resources. For example, from that perspective the network infrastructure can be regarded as a set of interacting resources that offer a specific QoS level.

The above-mentioned parameters and their classification are shown in Figure 3.

Figure 3. QoS Constraints' categorization

Users who initiate a workflow execution state their QoS requirements in two ways: "hard limits", which are expressed as constraints on the requirements, and preferences, which are expressed as importance factors on those requirements. Preferences add extra value to the workflow mapping process, since the selection is also made considering them. To make this feasible, a weight attribute is attached to the QoS parameters. On that basis, and taking availability, cost and time as the initial indicative parameters, the following use cases are identified:


• Thresholds. In this case the user sets thresholds/restrictions on the values of the parameters. This means that the overall cost should not exceed a pre-defined budget, the availability should be at least at a specific level (expressed as a percentage value) and a deadline for time should be met. The thresholds may apply to one or more parameters at the same time.

• Availability optimization. In this case the user sets a high preference on the availability parameter of the services, and therefore availability is the major factor in the selection. The algorithm proceeds with the selection of the instances that are annotated with the highest availability values.

• Cost optimization. In this case the user sets a high preference on the cost parameter, and therefore cost is the major factor during the selection process. The output is a concrete workflow that achieves the lowest overall cost.

• Time optimization. Similar to the availability and cost optimization, the selection of the service instances is made with regard to their execution time in order to achieve the lowest overall execution time for the workflow.

• Optimum solution. In this case the user doesn't set a preference, and as a result the weight attributes for the four parameters are equal. The algorithm execution results in a concrete workflow where the selected service instances offer the optimum value of availability and execution time for their corresponding cost.
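The threshold and single-preference use cases can be illustrated with a short Python sketch. It makes simplifying assumptions that are not in the paper: the workflow consists of a single task, thresholds are checked per service instance rather than over the whole workflow, and the names `ServiceInstance`, `within_thresholds` and `select` are hypothetical. The equal-weight "optimum solution" case is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceInstance:
    name: str
    availability: float  # fraction in [0, 1]
    cost: float          # arbitrary currency units
    time: float          # expected execution time in seconds


def within_thresholds(s, max_cost=None, min_availability=None, deadline=None):
    """Hard limits: every threshold that is set must hold for the instance."""
    return ((max_cost is None or s.cost <= max_cost)
            and (min_availability is None or s.availability >= min_availability)
            and (deadline is None or s.time <= deadline))


def select(candidates, prefer, **thresholds):
    """Drop instances that violate a threshold, then pick by the preferred parameter."""
    feasible = [s for s in candidates if within_thresholds(s, **thresholds)]
    if not feasible:
        return None
    if prefer == "availability":
        return max(feasible, key=lambda s: s.availability)
    if prefer == "cost":
        return min(feasible, key=lambda s: s.cost)
    if prefer == "time":
        return min(feasible, key=lambda s: s.time)
    raise ValueError(f"unknown preference: {prefer}")


candidates = [
    ServiceInstance("a", availability=0.99, cost=10.0, time=50.0),
    ServiceInstance("b", availability=0.90, cost=4.0, time=80.0),
    ServiceInstance("c", availability=0.95, cost=6.0, time=30.0),
]
cheapest_reliable = select(candidates, "cost", min_availability=0.93)
```

Combining a threshold with a preference, as in the last line, shows how a hard limit can change the outcome: instance "b" is the cheapest overall, but it is excluded by the availability threshold.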

After analyzing every factor of the QoS-defined parameters and its correlation with user preference, this section introduces a novel task assignment strategy as well as its performance evaluation.

This strategy, which is a core part of PowerGrid, uses load records from the information retrieval strategy to assign each task to the most appropriate node. It integrates Ganglia to make decisions and goes through three phases, namely problem definition, transformation and calculation, to choose the best node. The following subsections explain the operation of these three phases.

E. Problem definition phase
Since this research models resource coordination as the problem of finding the best alternative according to the opinions of a group of experts, the alternative set A and the expert set E are defined as

A = {available computing nodes}
E = {CPU usage percentage; memory usage percentage; bandwidth usage percentage; average task-executing time}

Figure 4. QoS-based workflow description

F. Transformation phase
The purpose of this phase is to transform the different representations of preference given by experts with varied backgrounds into the same representation, so that the degree of consensus among them can be calculated. Since system performance indexes are treated as the experts in this research, and they are rational and intuitively represented by numerical values, the preference between two indexes is represented by ordering them according to their values.

The workflow model defines a workflow, including its task definition and structure definition. We use an abstract model, in which the workflow is specified without referring to specific Grid resources for task execution. In general, the user sends a request accompanied by a vector of QoS parameters and the weights for those parameters:

$$q^{task} = \langle q_1^{task}, q_2^{task}, \ldots, q_n^{task} \rangle$$

$$W = \langle w_1, w_2, \ldots, w_n \rangle, \quad 0 \le w_i \le 1, \quad \sum_{i=1}^{n} w_i = 1$$

Each weight shows the importance of the corresponding parameter. For example, if CPU is important for a task, the user sets the CPU weight to 1 and the others to zero.

G. Calculation phase
In this phase, a node selection algorithm is proposed to find an appropriate resource to which the task is dispatched. When the end user inputs the QoS parameters to the Grid system, the coordinator finds a specific resource that satisfies the user's requirement. The end user can also use the default QoS parameters, in which case equal weighting [1/4, 1/4, 1/4, 1/4] is initially applied to the QoS constraints for the calculation.
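The node selection step can be sketched in Python under stated assumptions: the score of a node is taken to be the weighted sum, over the QoS indexes, of the ratio between the value the resource offers and the value the task requests, and every index is assumed to be normalized so that a larger value is better. The function names and sample values are hypothetical, and the dynamic weighting adjustment strategy is not modelled.

```python
import math


def check_weights(weights):
    """Weights must each lie in [0, 1] and sum to 1."""
    if any(w < 0 or w > 1 for w in weights):
        raise ValueError("each weight must be in [0, 1]")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")


def score(resource_q, task_q, weights):
    """Weighted sum of offered/requested ratios over the QoS indexes."""
    return sum(r / t * w for r, t, w in zip(resource_q, task_q, weights))


def select_node(resources, task_q, threshold, weights=None):
    """Return the name of the highest-scoring node that reaches the threshold, or None."""
    n = len(task_q)
    if weights is None:
        weights = [1.0 / n] * n  # default: equal weighting
    check_weights(weights)
    scores = {name: score(q, task_q, weights) for name, q in resources.items()}
    eligible = {name: s for name, s in scores.items() if s >= threshold}
    if not eligible:
        return None
    return max(eligible, key=eligible.get)


resources = {
    "n1": [2.0, 1.0, 1.0, 1.0],
    "n2": [1.0, 1.0, 1.0, 1.0],
    "n3": [0.5, 0.5, 0.5, 0.5],
}
task_q = [1.0, 1.0, 1.0, 1.0]
best = select_node(resources, task_q, threshold=1.0)
```

With equal weights, node "n1" scores 1.25, "n2" scores 1.0 and "n3" scores 0.5, so "n1" is selected. Returning `None` marks the situation in which no node reaches the threshold, which is where a weight adjustment would have to take over.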

If no threshold is reached, the dynamic weighting adjustment strategy is called to recalculate the weights for those four indexes until at least one resource satisfies the QoS constraints. Given a task $Task_i$, a resource satisfies the constraints if

$$\sum_{k=1}^{n} \frac{q_k^{Resource}}{q_k^{Task_i}} \times w_k \ge Threshold$$

where Threshold is predefined by the user or specified by the application. Such a resource site is then regarded as an executable resource site for this task. The coordinator finds the resource with the largest value and dispatches the task to it for execution.

IV. FAULT-TOLERANT WORKFLOW MANAGEMENT SYSTEM

From a systemic point of view, every node in a peer-to-peer distributed computing infrastructure should have an equally important position. In other words, it should be possible for any node in the network to play any role. However, nodes in PowerGrid [14] communicate with each other through roles that are fixed in the system. The whole system would therefore stop operating if the coordinator role were not functioning. Backup or mirror sites should thus be able to take over from the failed machines and continue the operation when this situation occurs in the distributed computing system. A certain degree of flexibility and a fault tolerance mechanism are required.

Figure 5. Fault-tolerance methods

In Grid environments, resources span across multiple administrative domains and are not under the control of the workflow management systems. Moreover, many users are competing for limited resources. Workflow execution failures may have many causes, such as a change of local resource policy or the failure of resources and network fabric. Thus, Grid workflow management systems should be able to handle failures flexibly and support reliable executions in the presence of concurrency and failures.

As shown in Figure 5, workflow failure handling techniques can be divided into two levels, namely task-level and workflow-level. Task-level techniques mask the effects of the execution failure of tasks in the workflow, while workflow-level techniques manipulate the workflow structure, such as the execution flow, to deal with erroneous conditions.

Task-level techniques have been widely studied in parallel and distributed systems. They can be categorized into retry, alternate resource, checkpoint/restart and replication. The retry technique is the simplest failure recovery technique, as it simply tries to execute the same task on the same resource again after a failure. The alternate resource technique submits the failed task to another resource. The checkpoint/restart technique moves failed tasks transparently to other resources, so that the task can continue its execution from the point of failure. The replication technique runs the same task simultaneously on different Grid resources, so that the task completes provided that at least one of the replicas does not fail.
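Two of the task-level techniques, retry and alternate resource, combine naturally and can be sketched in Python as follows. The function `run_with_recovery`, its parameters and the resource names are hypothetical, and the task is assumed to signal failure by raising an exception; checkpoint/restart and replication are not shown.

```python
def run_with_recovery(task, resources, max_retries=1):
    """Run `task` with retry on the same resource, then fall back to the next resource.

    `task` is a callable that takes a resource name and returns a result.
    Returns (result, attempts), where attempts lists (resource, outcome) pairs.
    """
    attempts = []
    for resource in resources:               # alternate resource: move down the list
        for _ in range(1 + max_retries):     # retry: first try plus max_retries repeats
            try:
                result = task(resource)
            except Exception as exc:
                attempts.append((resource, f"failed: {exc}"))
                continue
            attempts.append((resource, "ok"))
            return result, attempts
    raise RuntimeError("task failed on every resource")


calls = []


def flaky_task(resource):
    """Example task that always fails on resource 'r1'."""
    calls.append(resource)
    if resource == "r1":
        raise IOError("resource down")
    return f"done on {resource}"


result, attempts = run_with_recovery(flaky_task, ["r1", "r2"], max_retries=1)
```

In this example the task is tried twice on "r1" (the initial attempt plus one retry) before it is submitted to "r2", where it succeeds.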

Workflow-level techniques include alternate task, redundancy, user-defined exception handling and rescue workflow. The first three approaches assume that there is more than one implementation of a certain computation, each with different execution characteristics. The alternate task technique executes another implementation of a certain task if the previous one failed, while the redundancy technique executes multiple alternative tasks simultaneously. User-defined exception handling allows users to specify a special treatment for a certain failure of a task in the workflow. The rescue workflow technique ignores the failed tasks and continues to execute the remainder of the workflow until no more forward progress can be made. Then, a rescue workflow description called a rescue DAG, which indicates the failed nodes with statistical information, is generated for later submission.

Figure 6. Fault-tolerant workflow model

In the event of a Resource WS failure, searches are simply rescheduled on other available servers. Queuing information is stored in the grid-status database, possibly to preserve the trace of failed jobs. The end-user agent is also able to show the route and the history of each job. The system also offers an interface to view the status of the grid, showing a map of the distributed resources, which can be selected to inspect their state, history, load, etc. The grid daemon is the component that manages failures. It periodically queries servers and stores information about their status in the database. Therefore, when the PowerGrid server asks for free resources, the ResourceAllocator web service uses the information stored in the database to exclude those that are unavailable. If a resource suddenly becomes unavailable while the PowerGrid server is using it, the PowerGrid server notifies the grid daemon of the failure and requests a new resource. A workflow description is stored in the database, as is every activity of the workflow. When a workflow activity starts, the PowerGrid server stores the corresponding information, and as the workflow executes, the status information is updated if needed. So, if the resource site where an activity is to execute is down, the daemon detects this, searches the resource center and finds an appropriate resource as a substitute.
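The interplay between the grid daemon, the grid-status database and the ResourceAllocator can be sketched in Python. This is only an in-memory illustration: the classes `GridStatusDB`, `GridDaemon` and `ResourceAllocator` and their methods are hypothetical stand-ins, the liveness probe is supplied by the caller, and a real daemon would invoke `poll` periodically rather than once.

```python
class GridStatusDB:
    """In-memory stand-in for the grid-status database."""

    def __init__(self):
        self.status = {}  # server name -> True if it was reachable when last checked


class GridDaemon:
    """Queries servers and records their status; also accepts failure reports."""

    def __init__(self, db, probe):
        self.db = db
        self.probe = probe  # callable: server name -> truthy if the server is alive

    def poll(self, servers):
        for server in servers:
            self.db.status[server] = bool(self.probe(server))

    def report_failure(self, server):
        """Called when a resource fails while it is in use."""
        self.db.status[server] = False


class ResourceAllocator:
    """Answers requests for free resources, excluding unavailable servers."""

    def __init__(self, db):
        self.db = db

    def free_resources(self):
        return sorted(s for s, up in self.db.status.items() if up)


alive = {"s1": True, "s2": False, "s3": True}
db = GridStatusDB()
daemon = GridDaemon(db, probe=alive.get)
allocator = ResourceAllocator(db)

daemon.poll(list(alive))
before = allocator.free_resources()
daemon.report_failure("s1")   # s1 fails while in use
after = allocator.free_resources()
```

Keeping the status in one shared store means that a failure reported by the server and a failure found by polling are handled the same way when the next resource is requested.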

V. PERFORMANCE EVALUATION

To better explain the significance of the proposed workflow composition and model, we conducted several experiments using the workflow-enabled PowerGrid [13] system as the test bed and an actual power service as the test application. Every task in the workflow requires a certain type of service for processing. Figure 7 shows the experimental results of QoS-based workflow composition compared with a QoS-disabled workflow model.

In our first experiment, we evaluate our proposed QoS-based workflow composition with the scheduling algorithm described in section 4. This scheduling algorithm always finds the node best suited to computing a specific type of task while searching for an answer. The experiment uses at most 140 workflow instances and measures the execution time for the two types of workflow composition. The results show that, as the number of service instances increases, workflow composition with QoS constraints performs better than QoS-disabled workflow composition.

[Figure 7 plot: Execution Time (sec) versus Service Instance Num, for QoS-disabled Workflow Composition and QoS-enabled Workflow Composition.]

Figure 7. Comparison of QoS-enabled VS QoS-disabled Workflow Composition

In another experiment, a comparison of fault-tolerant enabled and fault-tolerant disabled workflow composition is presented. Figure 8 shows that with fault tolerance enabled, the number of successfully executed service instances is larger than with fault tolerance disabled. The difference becomes sharper as the number of service instances increases.

[Figure 8 plot: Successfully Num versus Service Instance Num, for Fault-Tolerant Disabled and Fault-Tolerant Enabled.]

Figure 8. Comparison of Fault-Tolerant Enabled VS Disabled

VI. CONCLUSION AND FUTURE WORK

In this paper, a QoS-based and fault-tolerant workflow management system is proposed, which not only resolves issues in PowerGrid such as system resilience, fault tolerance, efficiency of job scheduling and instability in congested network environments, but also efficiently executes computation-intensive jobs. It makes full use of the characteristics of grid integration together with a bandwidth flow control mechanism based on weighted-priority task selection. The novel QoS-based workflow scheduling algorithm minimizes the turnaround time of execution while meeting the QoS requests. We also described task dependency and overall duplicate assignment for optimized execution planning and efficient run-time rescheduling. In future work, we will further enhance our scheduling method to handle more QoS parameters, such as pricing. We also want to implement new rules that support many scenarios for matching and scheduling and that dynamically select the best one each time according to environmental conditions.

ACKNOWLEDGEMENT

The work reported in this paper is supported by National Natural Science Foundation of China (No.90612018, No.90715030, and No.60503043).

REFERENCES

[1] I. Foster, C. Kesselman, The Grid: Blueprint for a Future Computing Infrastructure, Morgan Kaufmann Publishers, USA, 1999.
[2] W. Leinberger, V. Kumar, Information power Grid: The new frontier in parallel computing, IEEE Concurrency 7 (4) (1999) 75–84.
[3] Charlie Catlett, "The Philosophy of TeraGrid: Building an Open, Extensible, Distributed TeraScale Facility," ccgrid, p. 8, 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID'02), 2002.
[4] http://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.html
[5] J. Yu and R. Buyya. A taxonomy of scientific workflow systems for grid computing. ACM SIGMOD Record, 34(3):44–49, Sept. 2005.
[6] M. Livny, Direct Acyclic Graph Manager (DAGMan), http://www.cs.wisc.edu/condor/dagman.
[7] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, K. Blackburn, A. Lazzarini, A. Arbree, R. Cavanaugh, S. Koranda, Mapping Abstract Complex Workflows onto Grid Environments, Journal of Grid Computing (JGC) 1 (1) (2003) 9–23.
[8] J. Blythe, S. Jain, E. Deelman, Y. Gil, K. Vahi, A. Mandal, K. Kennedy, Task Scheduling Strategies for Workflow-based Applications in Grids, in: CCGrid, Cardiff, UK, 2005, pp. 759–767.
[9] B. Ludäscher, I. Altintas, C. Berkley, D. Higgins, E. Jaeger, M. Jones, E.A. Lee, J. Tao, Y. Zhao, Scientific Workflow Management and the Kepler System, Concurrency and Computation: Practice & Experience.
[10] T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M.R. Pocock, A. Wipat, P. Li, Taverna: A tool for the composition and enactment of bioinformatics workflows, Bioinformatics journal 17 (20) (2004) 3045–3054.
[11] I. Taylor, I. Wand, M. Shields, S. Majithia, Distributed computing with Triana on the Grid, Concurrency and Computation: Practice & Experience, 17 (1–18).
[12] T. Glatard, J. Montagnat, X. Pennec, Efficient services composition for grid-enabled data-intensive applications, in: IEEE International Symposium on High Performance Distributed Computing (HPDC'06), Paris, France, 2006, pp. 333–334.
[13] Jiang Bian, Jie Du, Chuliang Weng, Minglu Li. PowerGrid: a service-oriented grid application platform. Journal of Huazhong University of Science and Technology, Natural Science. Z2 (2007).
[14] E. Deelman et al, "Mapping Abstract Complex Workflows onto Grid Environments", Journal of Grid Computing, Vol. 1:25-39, 2003.
[15] T. Eilam et al, "A utility computing framework to develop utility systems", IBM System Journal, Vol. 43(1):97-120, 2004.
[16] T. Fahringer et al, "ASKALON: a tool set for cluster and Grid computing", Concurrency and Computation: Practice and Experience, 17:143-169, Wiley InterScience, 2005.
[17] A. Mayer et al, "ICENI Dataflow and Workflow: Composition and Scheduling in Space and Time", in UK e-Science All Hands Meeting, Nottingham, UK, IOP Publishing Ltd, Bristol, UK, September 2003.
[18] M. Surridge, S. Taylor, D. De Roure, E. Zaluska, Experiences with GRIA - industrial applications on a web services Grid, in: Proceedings of the First International Conference on e-Science and Grid Computing, IEEE Press, 2005, pp. 98–105.
[19] G. Bochmann, A. Hafid, Some Principles for Quality of Service Management, Technical Report, Université de Montreal, 1996.
[20] R.J. Al-Ali, K. Amin, G. von Laszewski, O.F. Rana, D.W. Walker, M. Hategan, N.J. Zaluzec, Analysis and provision of QoS for distributed Grid applications, Journal of Grid Computing (2004) 163–182.