Necessary and Sufficient Checkpoint Selection in Grid Workflow Systems
Jinjun Chen, Yun Yang
CITR – Centre for Information Technology Research
Faculty of Information and Communication Technologies
Swinburne University of Technology
PO Box 218, Hawthorn, Melbourne, Australia 3122
{jchen, yyang}@ict.swin.edu.au
Abstract. In grid workflow systems, existing representative checkpoint selection strategies, which select checkpoints for verifying fixed-time constraints at the run-time execution stage, often select some unnecessary checkpoints and ignore some necessary ones. Consequently, overall temporal verification efficiency and effectiveness can be severely impacted. In this paper, we propose a new strategy that selects only necessary and sufficient checkpoints dynamically along grid workflow execution. Specifically, we introduce a new concept of minimum time redundancy as a key reference value for checkpoint selection. We also describe how to obtain minimum time redundancy dynamically along grid workflow execution and investigate its relationships with fixed-time constraint consistency. Based on these relationships, we present our strategy and rigorously prove its necessity and sufficiency. The quantitative evaluation further demonstrates that, compared to existing representative strategies, our strategy can significantly improve overall temporal verification efficiency and effectiveness.
Covered topics: Process verification and validation, Workflow management systems
1 Introduction
In the grid architecture, a grid workflow system supports the modelling, redesign and execution of large-scale sophisticated e-science and e-business processes in a variety of complex scientific and business applications such as climate modelling, astrophysics, high energy physics, international finance and insurance [3, 15, 18]. These processes are modelled as grid workflow specifications at the build-time stage, which normally contain a large number of activities that are computation, data or transaction intensive [6, 12, 14]; they are then instantiated at the run-time instantiation stage by an instantiation grid service [6, 16], and finally executed at the run-time execution stage by grid services which are coordinated by the grid workflow engine [6, 16]. The grid workflow engine itself is a high-level grid service [6, 16]. In reality, complex scientific or business processes are often time constrained, such as those in applications of climate modelling, medical surgery, disaster recovery and international finance [3, 5, 9]. Consequently, when corresponding grid workflow specifications are defined at
build-time, fixed-time constraints are often set as well [3, 5, 9]. A fixed-time constraint at an activity is an absolute time value by which the activity must be completed [3, 5, 9]. For example, a climate modelling grid workflow must be completed by the scheduled time [3], say 6:00pm, so that the weather forecast can be broadcast on time later. Here, 6:00pm is a fixed-time constraint. After fixed-time constraints are set, temporal verification must be conducted so that we can identify any temporal violations and take proper handling actions in time. At the build-time and run-time instantiation stages, temporal verification is static because there are no specific execution times yet. Each fixed-time constraint needs to be verified only once, with consideration of all covered activities. Therefore, we need not decide at which activities we should conduct the verification. At the run-time execution stage, however, activity completion durations vary and, consequently, we may need to verify each fixed-time constraint many times at different activities. Conducting the verification at every activity is not efficient, as verification is not needed at some activities, such as those that can be completed within allowed time intervals. So where should we conduct the temporal verification? The activities at which we conduct the verification are called checkpoints [7, 8, 10, 17, 19]. Selecting them is the subject of the research field of CSS (Checkpoint Selection Strategies) [7, 8, 10, 17, 19]. Some representative checkpoint selection strategies have been proposed [7, 8, 10, 17, 19], which are detailed in Section 2. However, they often suffer from the limitations of selecting unnecessary checkpoints and ignoring necessary ones. Unnecessary checkpoints result in unnecessary temporal verification, which eventually impacts the overall verification efficiency. 
Ignored checkpoints mean that some necessary verification is omitted, which eventually impacts the overall verification effectiveness. Clearly, neither is desirable. Therefore, in this paper, we develop a new strategy that guarantees that the checkpoints selected dynamically along grid workflow execution are not only necessary but also sufficient. Specifically, Section 2 details related work and problem analysis for checkpoint selection. Section 3 presents some time attributes of grid workflows. Section 4 introduces minimum time redundancy, which serves as a key reference value for our strategy, and also develops a method for obtaining minimum time redundancy dynamically along grid workflow execution. Section 5 investigates the relationships between minimum time redundancy and fixed-time constraint consistency in depth and, based on these relationships, presents our new strategy and rigorously proves its necessity and sufficiency for checkpoint selection. Section 6 conducts a comprehensive comparison and evaluation to demonstrate that our strategy can achieve much better temporal verification efficiency and effectiveness than the existing representative strategies. Finally, Section 7 summarises our contributions and points out future work.
2 Related Work and Problem Analysis for Checkpoint Selection
Different representative checkpoint selection strategies have been proposed in the literature. [13] takes every activity as a checkpoint. We denote this strategy as CSS1. [19] sets checkpoints at the start time and end time of each activity. We denote this
strategy as CSS2. [17] takes the start activity as a checkpoint and adds a checkpoint after each decision activity is executed. We denote this strategy as CSS3. [17] also mentions another checkpoint selection strategy: user-defined static checkpoints such as user-prescribed static time points. We denote this strategy as CSS4. All of CSS1, CSS2, CSS3 and CSS4 predefine checkpoints before grid workflow execution. However, since activity completion durations vary, temporal verification may not be needed at some of these predefined checkpoints, such as those that can be completed within allowed time intervals. Therefore, CSS1, CSS2, CSS3 and CSS4 may select some unnecessary checkpoints and are not efficient enough for temporal verification. In addition, although CSS1 and CSS2 do not ignore any checkpoints, at the heavy cost of inefficiency, CSS3 and CSS4 may ignore some checkpoints, as we may need to conduct temporal verification at activities other than the checkpoints predefined by CSS3 or CSS4. Therefore, CSS3 and CSS4 are not effective enough for temporal verification. Our earlier work [7, 8, 10] has attempted to improve this situation, but still has some deficiencies. Specifically, [7] selects an activity as a checkpoint when its completion duration exceeds its maximum duration. We denote this strategy as CSS5. [8] selects an activity as a checkpoint when its completion duration exceeds its mean duration. We denote this strategy as CSS6. [10] introduces a minimum proportional time redundancy for each activity and selects an activity as a checkpoint when its completion duration is greater than its mean duration plus its minimum proportional time redundancy. We denote this strategy as CSS7. However, in Section 6, we will see that CSS5 may still ignore some necessary checkpoints while CSS6 and CSS7 may select some unnecessary ones. 
Hence, CSS5 is still not effective enough, and CSS6 and CSS7 are still not efficient enough, for temporal verification. Given the above limitations of the existing representative checkpoint selection strategies, we may ask: “Can we develop a checkpoint selection strategy that selects only necessary yet sufficient checkpoints in order to achieve efficient and effective temporal verification?” In this paper, we answer the question positively by presenting such a strategy.
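To make the difference between the three dynamic strategies concrete, the selection conditions of CSS5, CSS6 and CSS7 as described above can be sketched as simple predicates. This is only an illustrative sketch: the function and parameter names are hypothetical, and the minimum proportional time redundancy of [10] is taken as a given input because its computation is not described in this paper.

```python
# Illustrative sketch only; all names are hypothetical.
# rcd          : run-time completion duration of the activity
# max_duration : maximum duration of the activity
# mean_duration: mean duration of the activity
# min_prop_redundancy: minimum proportional time redundancy from [10],
#                      assumed to be computed elsewhere and passed in.


def css5_selects(rcd, max_duration):
    """CSS5: checkpoint when completion duration exceeds the maximum duration."""
    return rcd > max_duration


def css6_selects(rcd, mean_duration):
    """CSS6: checkpoint when completion duration exceeds the mean duration."""
    return rcd > mean_duration


def css7_selects(rcd, mean_duration, min_prop_redundancy):
    """CSS7: checkpoint when completion duration exceeds the mean duration
    plus the minimum proportional time redundancy."""
    return rcd > mean_duration + min_prop_redundancy
```

For the same activity, the CSS6 threshold is never above the CSS7 threshold (assuming a non-negative redundancy), so CSS6 selects at least every activity that CSS7 selects.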
3 Timed Grid Workflow Representation
According to [4, 9, 11, 13], based on the directed graph concept, a grid workflow can be represented by a grid workflow graph, where nodes correspond to the activities and edges correspond to the dependencies between them. Here, we assume that the grid workflow is well structured. Structure verification is outside the scope of this paper and is discussed in [1, 2]. We borrow some concepts from [9, 13, 17], such as maximum, mean and minimum durations, as a basis to represent activity time attributes. We denote the ith activity of a grid workflow as ai. For each ai, we denote its maximum duration, mean duration, minimum duration, run-time start time, run-time end time and run-time completion duration as D(ai), M(ai), d(ai), S(ai), E(ai) and Rcd(ai) respectively. The mean duration M(ai) indicates that, statistically, ai can be completed in around its mean duration. Other time attributes are self-explanatory. Normally, we
have d(ai) ≤ M(ai) ≤ D(ai) and d(ai) ≤ Rcd(ai) ≤ D(ai). The detailed discussion on how to obtain and set D(ai), M(ai) and d(ai) is outside the scope of this paper and can be found in [9, 13, 17]. If there is a fixed-time constraint at ai, we denote it as FTC(ai) and its value as ftv(ai). If there is a path from ai to aj (i ≤ j), we denote the maximum duration, mean duration, minimum duration and run-time completion duration between them as D(ai, aj), M(ai, aj), d(ai, aj) and Rcd(ai, aj) respectively [13, 17]. For convenience of discussion, we consider only one execution path in a grid workflow without loss of generality. In a selective or parallel structure, each branch is an execution path, so the results achieved in this paper apply to it. An iterative structure, from start to end, is still an execution path, so the results apply to it as well. Correspondingly, between ai and aj, D(ai, aj) is equal to the sum of activity maximum durations, M(ai, aj) is equal to the sum of activity mean durations, and d(ai, aj) is equal to the sum of activity minimum durations. Besides the above time attributes, four temporal consistency states have been identified and defined by [9]: SC (Strong Consistency), WC (Weak Consistency), WI (Weak Inconsistency) and SI (Strong Inconsistency). We summarise their definitions for the run-time instantiation and execution stages below, as our new checkpoint selection strategy is based on them and is related to those two stages. The definitions for the build-time stage and a detailed discussion of the four states can be found in [9].
Definition 1. At run-time instantiation stage, FTC(ai) is said to be of SC if D(a1, ai) ≤ ftv(ai) - S(a1), WC if M(a1, ai) ≤ ftv(ai) - S(a1) < D(a1, ai), WI if d(a1, ai) ≤ ftv(ai) - S(a1) < M(a1, ai), and SI if ftv(ai) - S(a1) < d(a1, ai).
Definition 2. At run-time execution stage, at checkpoint ap (p ≤ i), FTC(ai) is said to be of SC if Rcd(a1, ap) + D(ap+1, ai) ≤ ftv(ai) - S(a1), WC if Rcd(a1, ap) + M(ap+1, ai) ≤ ftv(ai) - S(a1) < Rcd(a1, ap) + D(ap+1, ai), WI if Rcd(a1, ap) + d(ap+1, ai) ≤ ftv(ai) - S(a1) < Rcd(a1, ap) + M(ap+1, ai), and SI if ftv(ai) - S(a1) < Rcd(a1, ap) + d(ap+1, ai).
According to [9], along grid workflow execution, for WI and SI, corresponding exception handling is triggered to adjust them to SC or WC. Therefore, checkpoint selection only needs to focus on selecting checkpoints for verifying previous SC and WC fixed-time constraints to check their current consistency.
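The four consistency states can be illustrated with a short Python sketch. It is not part of the original strategy description: the `Activity` class and the `consistency_state` function are hypothetical names introduced here, and a single execution path is assumed, so path durations are plain sums as stated in this section. Passing p = 0 with a zero completion duration is a convenience of this sketch that reproduces the conditions of Definition 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Activity:
    """Time attributes of one activity (hypothetical container)."""
    d: float  # minimum duration d(ai)
    M: float  # mean duration M(ai)
    D: float  # maximum duration D(ai)


def consistency_state(activities: List[Activity], p: int, i: int,
                      rcd_so_far: float, ftv: float, start: float) -> str:
    """Classify FTC(ai) at checkpoint ap, following Definition 2.

    activities : a1..an on one execution path (activities[0] is a1)
    p, i       : 1-based indices with p <= i; p = 0 means nothing has run yet
    rcd_so_far : Rcd(a1, ap), the run-time completion duration up to ap
    ftv        : ftv(ai), the fixed-time constraint value at ai
    start      : S(a1), the run-time start time of a1
    """
    if not 0 <= p <= i <= len(activities):
        raise ValueError("require 0 <= p <= i <= number of activities")
    remaining = activities[p:i]  # a(p+1) .. ai
    allowed = ftv - start        # ftv(ai) - S(a1)
    # Check the thresholds from the most to the least pessimistic estimate.
    if rcd_so_far + sum(a.D for a in remaining) <= allowed:
        return "SC"
    if rcd_so_far + sum(a.M for a in remaining) <= allowed:
        return "WC"
    if rcd_so_far + sum(a.d for a in remaining) <= allowed:
        return "WI"
    return "SI"
```

Because d ≤ M ≤ D for every activity, the three thresholds are ordered, so testing them in sequence yields the same result as the two-sided inequalities in the definitions.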
4 Minimum Time Redundancy
In this section, we introduce the concept of minimum time redundancy. It will serve as a key reference value for developing our new checkpoint selection strategy. According to Section 3, checkpoint selection is actually for SC and WC fixed-time constraint verification. Correspondingly, minimum time redundancy consists of minimum SC and WC time redundancy. The former is for SC fixed-time constraints and the latter is for WC ones. First, we introduce SC and WC time redundancy from one fixed-time constraint in Section 4.1. Then, we introduce minimum SC and WC time redundancy from multiple ones in Section 4.2. After that, in Section 4.3, we discuss how to obtain minimum SC and WC time redundancy dynamically along grid workflow execution.
4.1 SC and WC Time Redundancy
At run-time execution stage, considering an SC fixed-time constraint, say FTC(ai), at ap (p D(ap) + MTRSC(ap-1), then Rcd(a1, ap-1) + M(ap, aj)=Rcd(a1, ap-1) + M(ap) + M(ap+1, aj) ≤ Rcd(a1, ap-1) + D(ap) + M(ap+1, aj)