2015 IEEE International Conference on Web Services
Heuristic Recovery of Missing Events in Process Logs
Wei Song1, Xiaoxu Xia1, Hans-Arno Jacobsen2, Pengcheng Zhang3, Hao Hu4
1 School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China
2 Application and Middleware Systems Research Group, Technical University of Munich, Munich, Germany
3 College of Computer and Information, Hohai University, Nanjing, China
4 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
978-1-4673-7272-5/15 $31.00 © 2015 IEEE
DOI 10.1109/ICWS.2015.24

Abstract: Event logs are of paramount significance for process mining and complex event processing. Yet the quality of event logs remains a serious problem. Missing events in logs are usually caused by omitted manual recording, system failures, and hybrid storage of executions of different processes. It has been proved that the problem of minimum recovery based on an a priori process specification is NP-hard. The state-of-the-art approach still lacks efficiency because of its large search space. To address this issue, we leverage process decomposition and present heuristics that efficiently prune the unqualified sub-processes that fail to generate the minimum recovery. We employ real-world processes and their incomplete sequences to evaluate our heuristic approach. The experimental results demonstrate that our approach is as accurate as the state-of-the-art approach while being more efficient.

Keywords: missing events; heuristic recovery; Petri nets; process decomposition; trace replaying

I. INTRODUCTION

Nowadays, the business processes of enterprises are event-driven [1], [2]. For one thing, the executions of business processes are triggered and piloted by events. For another, business processes continuously generate large volumes of event data while they execute. Accordingly, event logs play an increasingly crucial role in today's companies. However, the quality of event logs remains a big problem, which seriously affects the analysis results of many business intelligence functionalities, e.g., complex event processing [3], provenance analysis [4], and process mining [5], to name a few.

Missing events are the major reason for low-quality event logs [6], [7]. In practice, several scenarios can cause events to go missing in process logs, such as omitted manual recording, temporary system failures, and hybrid storage of executions of different processes [6], [7], [8], [9]. Unsurprisingly, it is hard to recover the missing events without any predefined knowledge. In this paper, we study the problem of automatically recovering missing events in the light of an a priori process specification. Without loss of generality, we employ a special kind of Petri nets [10], namely workflow nets [11], to represent process specifications.

For an incomplete event sequence, there can be multiple recoveries because of parallel routings (i.e., concurrencies) and alternative routings (i.e., choices) in the process specification. If the specification involves iterative routings (i.e., loops), the number of recoveries is even infinite. Following the minimum change principle in improving data quality [12], [13], we focus on finding the optimal recovery, which is minimally different from the original incomplete sequence. This is a rational choice, because the probability of missing more events is much smaller than that of missing fewer events.

Unfortunately, finding the minimum recovery of an incomplete sequence in a predefined process specification is an NP-hard problem [7]. On the other hand, the process system continuously produces event sequences in real time. Hence, the generated sequences can be regarded as a kind of big data. For these reasons, it is of great importance to recover the incomplete sequences efficiently.

To achieve efficient recovery, the state-of-the-art approach (i.e., the branching approach [7]) first maps a process specification modeled as a Petri net into a homomorphic specification (called a branching net) that does not contain OR-joins (i.e., merges of different alternative branches), and then utilizes advanced indexing and pruning techniques to reduce the search space. Although the branching approach is much more efficient than exhaustive search, there is still much room for improvement.

In this paper, we present a more efficient approach for the minimum recovery. Since a process specification may be composed of several independent sub-processes, the minimum recovery of an incomplete sequence can only be in one of the sub-processes. With this in mind, we propose heuristics to identify the sub-process that can generate the minimum recovery, while other sub-processes that fail to generate the minimum recovery are pruned without further investigation. In this way, the search space is significantly reduced. For the qualified sub-process, we propose a linear-time trace replaying algorithm to find the minimum recovery.

This algorithm is promising in practice. For one thing, it avoids enumerating all possible event interleavings to find the minimum recovery. For another, it further decreases the time cost by combining conformance checking [5] and the subsequent sequence recovery into a single step.

To evaluate our approach, we perform experiments by applying our approach and the state-of-the-art approach (i.e., the branching approach [7]) to a data set that consists of 36 real-world process specifications and their incomplete event sequences. The experimental results demonstrate that our approach achieves high recovery accuracy and is more efficient than the branching approach.

In a nutshell, the key contribution of our work is threefold:
• We present a linear-time algorithm based on trace replaying that generates the minimum recovery of an incomplete event sequence from a qualified sub-process with no choices.
• We present a heuristic recovery approach for process specifications with choices and loops. More specifically, we first decompose the process into several sub-processes, and then present heuristics to prune unqualified sub-processes that fail to generate the minimum recovery, thus significantly accelerating the recovery process.
• We evaluate our approach by employing a real-world data set to compare our approach with the state-of-the-art approach. The experimental results show the effectiveness and efficiency of our approach.

The remainder of this paper is organized as follows. Section II reviews related work. Section III introduces the background of the research problem. Section IV proposes our sequence recovery approach. Section V shows our experimental results. Section VI concludes the paper and envisions future work.
II. RELATED WORK

Although data recovery is intensively studied in the field of data mining [14], recovering process logs is a novel research topic. Log recovery is highly relevant to conformance checking of business processes, so we first review related work on conformance checking.

As a kind of process mining technique, conformance checking focuses on calculating the degree of consistency (ranging from 0 to 1.0) between a process and its event log [15], [16], [17], [18]. It has two major applications. First, it is used to determine to what extent an event sequence conforms to a predefined process. Second, it is employed to evaluate the quality of a process discovered by a process mining technique. Different from conformance checking, event repairing focuses on how to recover an incomplete sequence using the process specification.

The alignment approach is an advanced conformance checking technique that is relevant to log recovery [18], [19]. This approach aims at aligning the events in a sequence to the events in the process. Although this is not mentioned in the original papers, we believe that this approach can also be used for log recovery. That is, once the optimal sequence alignment is found, the minimum recovery can be determined. Recently, an enhanced alignment approach was presented, which realizes the recovery by combining stochastic Petri nets, alignments, and Bayesian networks [6]. It uses path probabilities to determine the most likely missing events and leverages Bayesian networks to compute the most likely timestamp for each inserted event. Nevertheless, this approach is inefficient, because it enumerates and searches over all possible complete firing sequences in the specification.

To address the efficiency problem, Wang et al. propose in [7] a branching framework in which advanced indexing and pruning techniques are developed to reduce the search space. Different from this approach, the pruning granularity of our approach is sub-processes, and only one sub-process (without choices) can be left after pruning. This not only significantly reduces the search space, but also facilitates the use of trace replaying to find the minimum recovery. Both the time complexity analysis and the experimental results demonstrate that our approach is more efficient than the branching approach.

III. BACKGROUND

In this paper, we employ Petri nets to denote process specifications, so we first introduce some notions of Petri nets [10]. Based thereon, the problem of sequence recovery is formulated. Table I summarizes the notations frequently used in this paper.

TABLE I. FREQUENTLY USED NOTATIONS

Notation                          Description
$(P, T, F)$                       A Petri net with place set $P$, transition set $T$, and flow relation $F$
${}^{\bullet}n$, $n^{\bullet}$    The preset and postset of a place (transition) $n$
$M\,[t\rangle\,M'$                Firing $t$ leads a Petri net from marking $M$ to $M'$
$|\sigma|$, $|T|$                 The number of transitions (events) in $\sigma$, $T$
$S(\sigma)$                       The transition set of a sequence $\sigma$
$Append(\sigma, t)$               Add a transition $t$ at the end of a sequence $\sigma$
A. Preliminaries

Definition 1 (Petri nets). A Petri net is a triple $PN = (P, T, F)$, where $P$ is a finite set of places, $T$ is a finite set of transitions such that $P \cap T = \emptyset$, and $F \subseteq (P \times T) \cup (T \times P)$ is a set of directed arcs, called the flow relation. For any node $x \in P \cup T$, ${}^{\bullet}x = \{y \mid yFx\}$ and $x^{\bullet} = \{y \mid xFy\}$. $PN = (P, T, F)$ is a free-choice Petri net iff $\forall t_1, t_2 \in T$, ${}^{\bullet}t_1 \cap {}^{\bullet}t_2 \neq \emptyset$ implies ${}^{\bullet}t_1 = {}^{\bullet}t_2$.

Tokens are introduced to express the state of a Petri net. The distribution of tokens over the places is often called the state (marking) of the net. A transition is enabled if each of its input places has one token. An enabled transition $t$ can be fired, which involves consuming a token from each place of ${}^{\bullet}t$, performing the associated action, and producing a new token in each place of $t^{\bullet}$. Therefore, the firing of a transition $t$ leads a Petri net from a marking $M$ to a new marking $M'$. This step is denoted by $M\,[t\rangle\,M'$. Similarly, the notation $M\,[\sigma\rangle\,M'$ represents that the firing sequence $\sigma$ leads the Petri net from marking $M$ to marking $M'$.

Definition 2 (Process specification [9]). A process specification is a free-choice Petri net $S = (P, T, F)$ which has a unique source place $p_{source} \in P$ with ${}^{\bullet}p_{source} = \emptyset$, and a unique sink place $p_{sink} \in P$ with $p_{sink}^{\bullet} = \emptyset$. Each place or transition is on a path from $p_{source}$ to $p_{sink}$.

Definition 3 (Complete firing sequence). Given a process specification $S = (P, T, F)$ whose initial marking is $M_0$ and final marking is $M_f$, a complete firing sequence $\rho \in T^{*}$ of $S$ is a transition (event) sequence which leads $S$ from $M_0$ to $M_f$, i.e., $M_0\,[\rho\rangle\,M_f$.
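The firing rule and Definitions 1 to 3 can be illustrated with a minimal Python sketch. The encoding of a net as a dictionary from each transition to its preset and postset, the example net `NET`, and all function names are assumptions made for illustration; they do not come from the paper.

```python
from collections import Counter

# Hypothetical encoding: transition -> (preset places, postset places).
# Example net: A is an AND-split, B and C run in parallel, D is an AND-join.
NET = {
    "A": (("p0",), ("p1", "p2")),
    "B": (("p1",), ("p3",)),
    "C": (("p2",), ("p4",)),
    "D": (("p3", "p4"), ("p5",)),
}
M0 = Counter({"p0": 1})  # one token in the source place
MF = Counter({"p5": 1})  # one token in the sink place


def is_enabled(net, marking, t):
    """A transition is enabled if each of its input places holds a token."""
    return all(marking[p] >= 1 for p in net[t][0])


def fire(net, marking, t):
    """Return the marking M' such that M [t> M'."""
    if not is_enabled(net, marking, t):
        raise ValueError(f"transition {t!r} is not enabled")
    nxt = Counter(marking)
    for p in net[t][0]:
        nxt[p] -= 1  # consume one token from each input place
    for p in net[t][1]:
        nxt[p] += 1  # produce one token in each output place
    return +nxt  # unary plus drops places with zero tokens


def is_complete_firing_sequence(net, m0, mf, rho):
    """Check M0 [rho> Mf (Definition 3)."""
    marking = +Counter(m0)
    for t in rho:
        if t not in net or not is_enabled(net, marking, t):
            return False
        marking = fire(net, marking, t)
    return marking == +Counter(mf)


def is_free_choice(net):
    """Presets of two transitions are either disjoint or identical."""
    ts = list(net)
    for i, t1 in enumerate(ts):
        for t2 in ts[i + 1:]:
            a, b = set(net[t1][0]), set(net[t2][0])
            if a & b and a != b:
                return False
    return True
```

In this example net, both interleavings of the parallel transitions B and C are complete firing sequences, whereas a sequence that skips them is not.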
can be neglected from the recovery perspective, so these recoveries are regarded as equivalent.

Definition 6 (Equivalent recoveries). Given a process specification S = (P, T, F), two recoveries σ' and σ'' of an incomplete sequence σ are equivalent if and only if S(σ'') = S(σ') and |σ'| = |σ''|.

Note that one minimum recovery is sufficient for our purpose. Since the incomplete sequence σ only lacks some events, we can use a trace replaying technique to obtain the minimum recovery of σ. The main idea is to replay the sequence σ in S from the initial marking M0 to the final marking Mf. During replaying, if the event σ[i] cannot be replayed at the current marking Mc, any enabled transition of S can be selected and fired. After the final marking Mf is reached, all missing events of σ have been inserted in appropriate positions. The detailed recovery procedure is summarized in Algorithm 1.
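The replaying idea described above can be sketched in Python as follows. This is an illustrative sketch and not the paper's Algorithm 1: the net encoding, the example net `NET`, and the function names are assumptions, and the sketch assumes a causal net (no choices, no loops) in which every transition fires exactly once. One further assumption is made explicit in the code: when σ[i] is not enabled, the sketch fires an enabled transition that does not occur later in σ, so that no recorded event is duplicated.

```python
from collections import Counter

# Hypothetical causal net: transition -> (preset places, postset places).
NET = {
    "A": (("p0",), ("p1", "p2")),
    "B": (("p1",), ("p3",)),
    "C": (("p2",), ("p4",)),
    "D": (("p3", "p4"), ("p5",)),
}
M0 = {"p0": 1}
MF = {"p5": 1}


def enabled(net, marking, t):
    return all(marking[p] >= 1 for p in net[t][0])


def fire(net, marking, t):
    nxt = Counter(marking)
    for p in net[t][0]:
        nxt[p] -= 1
    for p in net[t][1]:
        nxt[p] += 1
    return +nxt  # drop places with zero tokens


def replay_recover(net, m0, mf, sigma):
    """Replay sigma from m0 to mf, inserting missing events on the way."""
    marking, target = +Counter(m0), +Counter(mf)
    remaining = list(sigma)  # events of sigma not yet replayed
    fired = set()            # each transition fires once in a causal net
    recovered = []
    while remaining or marking != target:
        if remaining and enabled(net, marking, remaining[0]):
            t = remaining.pop(0)  # the next recorded event can be replayed
        else:
            # Assumption of this sketch: insert an enabled transition that
            # is neither already fired nor still waiting in sigma.
            candidates = [u for u in sorted(net)
                          if u not in fired and u not in remaining
                          and enabled(net, marking, u)]
            if not candidates:
                raise ValueError("sigma cannot be recovered in this net")
            t = candidates[0]
        marking = fire(net, marking, t)
        fired.add(t)
        recovered.append(t)
    return recovered
```

Each loop iteration either consumes one event of σ or fires one new transition, so under the stated assumptions the number of iterations is bounded by the number of transitions. For instance, with the net above, the incomplete sequence A, D is completed by inserting B and C between the two recorded events.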
B. Problem Statement

Ideally, an event log records all complete firing sequences of a process. Unfortunately, some events may be missing from some complete firing sequences, and our objective is to recover them.

Definition 4 (Recovery). Given a process specification S = (P, T, F), a recovery of an incomplete sequence σ is a complete firing sequence σ' of S such that σ is a subsequence of σ'.

The subsequence requirement indicates that for any events σ[i], σ[j] (1≤i [...]

[...] f(σ'')+ f(σ'''') = 0.42.

V. EVALUATION

In this section, we evaluate the effectiveness and efficiency of our heuristic approach through experiments in which we compare our approach with the state-of-the-art approach, i.e., the branching approach [7]. All experiments are performed on a computer with an Intel(R) 2.5GHz CPU and 6GB memory running Windows 7 and JDK 1.7.

A. Experimental Setup

Data set. We employ a total of 36 process specifications (PNML format) of a Chinese company (Dongfang Boiler) [...]

B. Experimental Results and Analysis

Experimental results on causal nets. For specifications without choices and loops, our approach does not use heuristics. We therefore concentrate on comparing our trace replaying technique (Algorithm 1) with the backtracking technique (i.e., Algorithm 1 in [7]) of the branching approach. We illustrate the average accuracies of both approaches for various event missing rates in Figure 8(a) and the average runtime overheads of both approaches for various log sizes in Figure 8(b). From Figure 8(a), we observe that both approaches perform well for the recovery, that is, the recovery accuracy of either approach remains 100% regardless of the event missing rate. From Figure 8(b), we can see that both approaches scale well with the increase of log sizes. It is worth mentioning that the branching approach takes about twice as long as our approach; that is, the replaying algorithm of our approach is faster than the gap filling algorithm of the branching approach.

Experimental results on specifications with choices. For process specifications with choices but no loops, our approach leverages process decomposition and Criterion 1 as well as Algorithm 1 for the recovery. On the other hand, the branching approach applies advanced indexing and pruning techniques on the branching net to find the minimum recovery.
1 http://code.google.com/p/beehivez/
Figure 8. Performance on causal net specifications: (a) recovery accuracy versus event missing rate; (b) time (ms) versus the number of sequences. Both panels compare the branching approach and our approach.

Figure 9. Performance on specifications with choices: (a) recovery accuracy versus event missing rate; (b) time (ms) versus the number of sequences. Both panels compare the branching approach and our approach.
Thereby, the second experiment is conducted to determine which approach is more accurate and more efficient when process specifications involve choices. Figure 9(a) illustrates the average recovery accuracies of these two approaches on various missing rates of events. It shows that the accuracy results of both approaches are similar and the accuracies of both approaches decrease with the increase of event missing rate. Figure 9(b) illustrates the average runtime overheads of the two approaches on various log sizes. It reveals that our approach is much faster than the branching approach, because our approach employs heuristics (Criterion 1) to prune unqualified causal nets, which significantly reduces the search space. Experimental results on specifications with loops. For process specifications with loops, our approach leverages process decomposition and Criterion 2 as well as Algorithm 2 for the recovery. Figure 10(a) illustrates the average recovery accuracies of the branching approach and our approach on various missing rates of events. It can be seen that the accuracy results of these two approaches are similar. The accuracies of both approaches decrease with the increase of event missing rate. Figure 10(b) shows the average runtime overheads of both approaches on various log sizes. We observe that our approach is more efficient than the branching approach, because our approach employs heuristics (Criterion 2) for pruning unqualified subprocesses of the specification, which reduces the time overhead significantly.
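The accuracy figures reported above can be read with a small Python sketch in mind. The paper judges a recovery by whether it is equivalent (Definition 6) to the original complete sequence; the exact accuracy formula is not given in the text, so the assumption here is that accuracy is the fraction of sequences whose recovery is equivalent to the original. The function names are hypothetical.

```python
def equivalent(r1, r2):
    """Definition 6: same transition set and same length."""
    return set(r1) == set(r2) and len(r1) == len(r2)


def recovery_accuracy(recovered, originals):
    """Assumed metric: share of recoveries equivalent to their originals."""
    if len(recovered) != len(originals):
        raise ValueError("both logs must contain the same number of sequences")
    if not originals:
        raise ValueError("accuracy is undefined for an empty log")
    hits = sum(equivalent(r, o) for r, o in zip(recovered, originals))
    return hits / len(originals)
```

Under this reading, a recovery that differs from the original only in the interleaving of parallel events still counts as correct, while a recovery that picks a different alternative branch does not.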
C. Threats to Validity

Construct validity. In the experiments, we study the recovery capability of the branching approach and our approach by considering whether the recovered sequence is equivalent to the original complete sequence before a certain rate of events is removed. Since both approaches follow the minimum recovery principle, it is hard for them to recover the right sequence in the presence of loops, especially when the missing rate is high. For this reason, in the experiments, we generate incomplete sequences from complete ones that execute each loop at most once. To reduce this threat, we plan to study the recovery accuracy of both approaches with respect to the number of loop iterations in the incomplete sequences.

External validity. The threats to external validity come from the data set used for the experiments. We employ only 36 real-world process specifications and their incomplete sequences to conduct the experiments. The structures of these processes are not very complex, so the experimental results may not generalize to more complicated processes. To reduce this threat, we plan to evaluate our approach by using more processes, both real-world and synthetic.

VI. CONCLUSION
In this paper, we present a novel approach to recovering missing events in process logs. To reduce the redundant recoveries with regard to parallel routings, we present a linear-time algorithm which leverages trace replaying to efficiently find a minimum recovery. To reduce the non-optimal (non-minimum) recoveries with regard to alternative and iterative routings, we first decompose the process into different sub-processes, then utilize heuristics to prune unqualified sub-processes, and finally leverage trace replaying to find a minimum recovery. The experimental results on real-world processes demonstrate that our approach achieves high accuracy and is more efficient than the state-of-the-art approach.

Currently, our approach recovers missing events by using only process specifications. In the future, we plan to investigate how to incorporate recovery histories (recovery logs) into subsequent recoveries. We believe that leveraging both process specifications and recovery logs will further accelerate sequence recovery.

Figure 10. Performance on specifications with loops: (a) recovery accuracy versus event missing rate; (b) time (ms) versus the number of sequences. Both panels compare the branching approach and our approach.
ACKNOWLEDGMENT

The authors are very grateful to Dr. Lijie Wen from Tsinghua University for his helpful comments on this work. This work is supported in part by the National Natural Science Foundation of China under Grant No. 61202003, No. 61202097, and the “973” Program of China under Grant No. 2015CB352202.
REFERENCES

[1] A. Vera-Baquero, R. Colomo-Palacios, and O. Molloy, “Business Process Analytics Using a Big Data Approach,” IT Professional, 2013, 15(6), pp. 29-35.
[2] L. Baresi and S. Guinea, “Event-based Multi-level Service Monitoring,” In Proc. of ICWS, 2013, pp. 83-90.
[3] L. Ding, S. Chen, E. A. Rundensteiner, J. Tatemura, W.P. Hsiung, and K. S. Candan, “Runtime semantic query optimization for event stream processing,” In Proc. of ICDE, 2008, pp. 676-685.
[4] D. Deutch, Y. Moskovitch, and V. Tannen, “A Provenance Framework for Data-Dependent Process Analysis,” In Proc. of VLDB, 2014, pp. 457-468.
[5] W.M.P. van der Aalst, Process Mining: Discovery, Conformance and Enhancement of Business Processes. Berlin: Springer-Verlag, 2011.
[6] A. Rogge-Solti, R. S. Mans, W. M. P. van der Aalst, and M. Weske, “Improving Documentation by Repairing Event Logs,” In Proc. of PoEM, 2013, pp. 129-144.
[7] J.M. Wang, S.X. Song, X.C. Zhu, X. M. Lin, “Efficient Recovery of Missing Events,” In Proc. of VLDB, 2013, pp. 841-852.
[8] X.M. Liu, H. Liu, and C. Ding, “Incorporating User Behavior Patterns to Discover Workflow Models from Event Logs,” In Proc. of ICWS, 2013, pp. 171-178.
[9] X.M. Liu, “Unraveling and Learning Workflow Models From Interleaved Event Logs,” In Proc. of ICWS, 2014, pp. 193-200.
[10] T. Murata, “Petri Nets: Properties, Analysis and Applications,” Proceedings of the IEEE, 1989, 77(4), pp. 541-580.
[11] W.M.P. van der Aalst, “The Application of Petri Nets to Workflow Management,” Journal of Circuits, Systems, and Computers, 1998, 8(1), pp. 21-66.
[12] P. Bohannon, M. Flaster, W. Fan, and R. Rastogi, “A Cost-based Model and Effective Heuristic for Repairing Constraints by Value Modification,” In Proc. of SIGMOD, 2005, pp. 143-154.
[13] S. Kolahi and L. V. S. Lakshmanan, “On Approximating Optimum Repairs for Functional Dependency Violations,” In Proc. of ICDT, 2009, pp. 53-62.
[14] M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2000.
[15] A. Rozinat and W.M.P. van der Aalst, “Conformance Checking of Processes Based on Monitoring Real Behavior,” Information Systems, 2008, 33(1), pp. 64-95.
[16] M. Weidlich, A. Polyvyanyy, N. Desai, J. Mendling, and M. Weske, “Process Compliance Analysis Based on Behavioural Profiles,” Information Systems, 2011, 36(7), pp. 1009-1025.
[17] A. Adriansyah, B.F. van Dongen, and W.M.P. van der Aalst, “Conformance Checking Using Cost-Based Fitness Analysis,” In Proc. of EDOC, 2011, pp. 55-64.
[18] M. de Leoni, F. M. Maggi, and W. M. P. van der Aalst, “Aligning Event Logs and Declarative Process Models for Conformance Checking,” In Proc. of BPM, 2012, pp. 82-97.
[19] W.M.P. van der Aalst, A. Adriansyah, and B.F. van Dongen, “Replaying History on Process Models for Conformance Checking and Performance Analysis,” Wiley Interdisc. Rew.: Data Mining and Knowledge Discovery, 2012, 2(2), pp. 182-192.
[20] J.Q. Li, Y.S. Fan, and M.C. Zhou, “Timing Constraint Workflow Nets for Workflow Analysis,” IEEE Trans. Systems, Man, and Cybernetics, 2003, 33(2), pp. 179-193.
[21] J.Q. Li, Y.S. Fan, and M.C. Zhou, “Performance Modeling and Analysis of Workflow,” IEEE Trans. Systems, Man, and Cybernetics, 2004, 34(2), pp. 229-242.
[22] W. Song, X.X. Ma, C.Y. Ye, W.C. Dou, and J. Lü, “Timed Modeling and Verification of BPEL Processes Using Time Petri Nets,” In Proc. of QSIC, 2009, pp. 92-97.
[23] A. Polyvyanyy, J. Vanhatalo, and H. Völzer, “Simplified Computation and Generalization of the Refined Process Structure Tree,” In Proc. of WS-FM, 2010, pp. 25-41.
[24] H. M. W. Verbeek, J. C. A. M. Buijs, B. F. van Dongen, W. M. P. van der Aalst, “XES, XESame, and ProM 6,” In Proc. of CAiSE Forum 2010, pp. 60-75.