
Appl Intell (2010) 33: 54–66 DOI 10.1007/s10489-010-0240-5

Time-interval process model discovery and validation—a genetic process mining approach Chieh-Yuan Tsai · Henyi Jen · Yi-Ching Chen

Published online: 2 July 2010 © Springer Science+Business Media, LLC 2010

Abstract Process mining, a process management technique, has received much attention recently. Process mining can extract organizational or social structures from event logs recorded in an information system. However, when constructing process models, most process mining research considers only the topology information among events and ignores the time information. To overcome this drawback, a time-interval genetic process mining framework is proposed. First, time-intervals between events are derived for all event sequences, and a discretization procedure is developed to transform the time-interval data from continuous type to categorical type. Second, the genetic process mining method, which is based on a global search strategy, is applied to generate time-interval process models. Finally, a precision measure is defined to evaluate the quality of the generated models. With this measure, managers can select the best process model among a set of candidate models without human involvement.

Keywords Process mining · Genetic algorithms · Time-interval · Model quality

C.-Y. Tsai · H. Jen · Y.-C. Chen: Department of Industrial Engineering and Management, Yuan Ze University, Jhongli City, Taiwan, R.O.C.

1 Introduction

With the progress of information and web technology, many business functions and activities (tasks) are now integrated in a process model. A process model explicitly expresses not only which tasks are involved but also how the tasks communicate with each other. By monitoring the process model, managers can easily monitor and control the progress of daily operations and projects in enterprises. Although process models offer many advantages, building complex process models such as those in Enterprise Resource Planning (ERP) and Supply Chain Management (SCM) systems is a time-consuming job. In practice, process modeling is conducted by a small group of people such as clerks, IT specialists, or outside consultants. They may not fully understand the requirements and constraints of all activities, so repeated modification and correction are unavoidable. This makes the modeling work long and tedious.

For these reasons, a process management technique called process mining has been studied recently. The basic idea is to extract process, control, data, organizational, and social structures from event logs recorded by an information system without human involvement. Cook et al. were the first to mine a model from an event log, in the context of software engineering [6]. They introduced the RNet, KNet and Markov methods for process discovery. Subsequently, Cook and Wolf extended their study to discover process models of concurrent processes [7]. They used specific metrics (entropy, event type counts, periodicity, and causality) to discover process models from event streams. Agrawal et al. were the first to apply process mining in the business field [1]. Their process models are based on workflow graphs in which nodes represent activities and edges represent the relations between activities. Van Der Aalst et al. [16, 18] proposed the α-algorithm, a major process mining approach that discovers process models (Petri nets) from the ordering relations between events (tasks) in the event logs. De Medeiros et al. [8] pointed out the limitations of the α-algorithm, such as short loops, invisible tasks, duplicate tasks, noise, implicit places, and non-free choice. De Medeiros et al. [2] proposed an extension of the α-algorithm, called the α+-algorithm, to handle short loops. Wen et al. proposed the α++-algorithm to deal with non-free-choice constructs [19]. Additionally, Ren et al. [13] proposed the β-algorithm to deal with the short-loop limitation of the α-algorithm. Van Der Aalst et al. [17] surveyed process mining and introduced some typical process mining approaches and tools.

Although the above process mining approaches can construct process models efficiently, three major problems are identified. First, the process models generated by those approaches reveal only the topology information but not the time information. That is, how much time elapses between tasks is unknown. However, tasks that appear in the same sequence but with different time-intervals should be considered different behaviors. For example, customers who bought product A and then product B within a week should be regarded as behaving differently from customers who bought product A and then product B within three to four months. Without time-interval information between tasks, the process models cannot reveal the information that managers need to take the right actions at the right time. Second, most process mining techniques have difficulty dealing with noisy logs. To construct a process model from an event log, most process mining techniques adopt a local search strategy because of computational complexity constraints [18]. Although a local search strategy shortens the modeling time, the problems of invisible tasks, duplicate tasks, non-free choice, and loops cannot be handled well [8]. Third, existing process mining research does not discuss how to evaluate process model quality. In fact, most process mining techniques require users to specify a set of parameters in order to generate a process model, and different parameter settings generate different process models. Without a clear definition for evaluating the goodness of a process model, process model selection becomes a difficult task.

To overcome the drawbacks mentioned above, a time-interval genetic process mining framework is proposed. Time-interval information between tasks is considered during model construction so that behaviors with different time relationships can be distinguished. In addition, a process mining algorithm based on a global search strategy, called genetic process mining [3, 9], is applied to avoid the problems of local search. Finally, a precision measure is defined and used to evaluate the quality of process models. With this measure, managers can select the best process model among a set of candidate models without human involvement.

The remainder of this paper is organized as follows. Section 2 introduces the basic concepts of the Petri net. Section 3 details the proposed time-interval genetic process mining framework. Section 4 provides an implementation case to demonstrate the feasibility of the proposed framework, together with a set of experiments and an in-depth discussion. The conclusion and further research of this study are given in Sect. 5.

Fig. 1 An example Petri net

2 Petri net

This section introduces the Petri net notation used to explain the semantics of the representation in the genetic process mining method of Sect. 3.2. A Petri net (PN) is a common graphical and mathematical tool for describing process models. It is well suited to modeling the states of concurrent, parallel, asynchronous, and distributed control in systems [10, 12]. By analyzing Petri nets, the structure and behavior of a system can be easily monitored and controlled [4]. The basic components of a Petri net are places, transitions, arcs, and tokens. Formally, a Petri net is a triple (P, T, F), where P is a finite set of places, T is a finite set of transitions (P ∩ T = ∅), and F ⊆ (P × T) ∪ (T × P) is a set of arcs [3]. For example, the Petri net shown in Fig. 1 is composed of four transitions (X, Y, Z, and W), five places, and nine arcs. The structures of a Petri net can be classified into four types according to the relations among the components (places, transitions, and arcs): causality (sequential), parallelism (parallel), choice (conditional), and iteration (loop). The parallelism structure contains the AND-split and AND-join relations, and the choice structure contains the OR-split and OR-join relations. The four structure types are shown in Fig. 2. A transition is enabled when every one of its input places contains at least one token, and an enabled transition can fire.

Fig. 2 The four structure types in a Petri net

3 The time-interval genetic process mining framework

The proposed time-interval genetic process mining framework starts with the time-interval analysis of an event log. The analysis discretizes the time information in the log from a continuous data type to a categorical data type. After the time-interval analysis is completed, the event sequences are divided into a training dataset and a testing dataset. The training dataset is used to construct a time-interval process model, while the testing dataset is used to evaluate the quality of the constructed process model. To avoid the local search problem occurring in most process mining techniques, a genetic process mining method using a global search strategy is applied to construct the process model. Then, to evaluate the quality of the constructed process models, a precision measure composed of two measurements is developed. The proposed framework is shown in Fig. 3.

Fig. 3 The proposed time-interval genetic process mining framework

3.1 Time-interval analysis

An event log records a series of events in which each event (task) is represented by a case identifier, a task identifier, and a timestamp. The events with the same case identifier are aggregated into an event sequence $((a_1, t_1), (a_2, t_2), (a_3, t_3), \ldots, (a_n, t_n))$, where $a_j$ is a task that occurs at timestamp $t_j$, $1 \le j \le n$, and $t_{j-1} \le t_j$ for $2 \le j \le n$. The time-interval between task $i$ and task $i + 1$ is obtained by $\Delta t_i = t_{i+1} - t_i$. The event sequence is then transformed into a time-interval event sequence $(a_1, \Delta t_1, a_2, \Delta t_2, \ldots, \Delta t_{n-1}, a_n)$ [15]. At the same time, the maximum time-interval (max_time) and the minimum time-interval (min_time) are found from all $\Delta t$.

Next, the continuous time-interval data are discretized into categorical time-interval data. This discretization is important for two reasons. First, the data volume is dramatically reduced, so the process mining procedure is simplified. Second, the generated process model is easy to interpret and manage. In this study, the continuous time-interval data are discretized into $k + 2$ categories, $TI = \{I_0, I_1, \ldots, I_k, I_\infty\}$, by the following rules:

\[
\begin{aligned}
I_0 &: \; 0 \le \Delta t \le \mathit{min\_time},\\
I_1 &: \; \mathit{min\_time} < \Delta t \le \mathit{min\_time} + r,\\
I_j &: \; \mathit{min\_time} + (j - 1) \times r < \Delta t \le \mathit{min\_time} + j \times r,\\
I_k &: \; \mathit{max\_time} - r < \Delta t \le \mathit{max\_time},\\
I_\infty &: \; \mathit{max\_time} < \Delta t < \infty,
\end{aligned}
\]

where $k$ is set by the user and $r = (\mathit{max\_time} - \mathit{min\_time})/k$.

For example, four event sequences with their original timestamps are shown in Table 1. All time-intervals between adjacent events are evaluated first. Among all time-interval values, the maximum and the minimum are 65 and 18 minutes respectively. If k is set as 4, r = (65 − 18)/4 = 11.75. Thus, the discretized time-intervals are I0: 0 ≤ Δt ≤ 18, I1: 18 < Δt ≤ 29.75, I2: 29.75 < Δt ≤ 41.5, I3: 41.5 < Δt ≤ 53.25, I4: 53.25 < Δt ≤ 65, and I∞: 65 < Δt < ∞. Based on this definition, the event sequences with discretized time-intervals are given in Table 2.

Table 1 The example event sequences

Case id   Event sequence
case 1    ((A, 08:30), (B, 09:16), (C, 09:58), (D, 10:18), (F, 10:57))
case 2    ((A, 08:45), (B, 09:50), (C, 10:15), (D, 10:33), (F, 11:12))
case 3    ((A, 09:15), (C, 09:47), (B, 10:30), (D, 11:03), (F, 11:30))
case 4    ((E, 10:50), (J, 11:15))

Table 2 The discretized time-interval event sequences

Case id   Event sequence
case 1    (A, I3, B, I3, C, I1, D, I2, F)
case 2    (A, I4, B, I1, C, I0, D, I2, F)
case 3    (A, I2, C, I3, B, I2, D, I1, F)
case 4    (E, I1, J)
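The discretization of Sect. 3.1 can be illustrated with a minimal Python sketch. The function names (`intervals_minutes`, `discretize`, `to_interval_sequence`) and the label `Iinf` for I∞ are hypothetical choices made for this illustration and are not part of the original method description; timestamps are assumed to fall on a single day and intervals are measured in minutes. Under these assumptions the sketch takes the four sequences of Table 1 and should yield the sequences of Table 2.

```python
from datetime import datetime

def intervals_minutes(sequence):
    """Gaps, in minutes, between consecutive (task, 'HH:MM') events."""
    times = [datetime.strptime(t, "%H:%M") for _, t in sequence]
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]

def discretize(dt, min_time, max_time, k):
    """Map one interval to a label I0..Ik, or Iinf above max_time."""
    r = (max_time - min_time) / k
    if dt <= min_time:
        return "I0"
    if dt > max_time:
        return "Iinf"
    for j in range(1, k + 1):
        # I_j covers (min_time + (j-1)*r, min_time + j*r]
        if dt <= min_time + j * r:
            return f"I{j}"
    return f"I{k}"  # guard against floating-point rounding at max_time

def to_interval_sequence(sequence, min_time, max_time, k):
    """Turn ((a1,t1),...,(an,tn)) into [a1, I.., a2, I.., ..., an]."""
    tasks = [a for a, _ in sequence]
    labels = [discretize(dt, min_time, max_time, k)
              for dt in intervals_minutes(sequence)]
    result = [tasks[0]]
    for label, task in zip(labels, tasks[1:]):
        result += [label, task]
    return result

# Event sequences of Table 1.
log = {
    "case 1": [("A", "08:30"), ("B", "09:16"), ("C", "09:58"), ("D", "10:18"), ("F", "10:57")],
    "case 2": [("A", "08:45"), ("B", "09:50"), ("C", "10:15"), ("D", "10:33"), ("F", "11:12")],
    "case 3": [("A", "09:15"), ("C", "09:47"), ("B", "10:30"), ("D", "11:03"), ("F", "11:30")],
    "case 4": [("E", "10:50"), ("J", "11:15")],
}

all_dt = [dt for seq in log.values() for dt in intervals_minutes(seq)]
min_time, max_time = min(all_dt), max(all_dt)
table2 = {case: to_interval_sequence(seq, min_time, max_time, k=4)
          for case, seq in log.items()}
```

Because min_time and max_time are taken from the same log that is being discretized, the category I∞ stays empty here; it only becomes relevant for intervals in unseen data that exceed max_time.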

3.2 The genetic process mining method

The genetic process mining method uses genetic algorithms to discover process models from event logs [3, 9]. Using genetic algorithms can overcome the limitations of current process mining techniques that are based on local search [11, 14]. The main steps of the genetic process mining method are shown in Fig. 4 and described in the following sections.

3.2.1 Initial population and individuals

The initial population is composed of randomly generated individuals. Each individual is encoded as a causal matrix [9]. A causal matrix is built from a causality relation C. The causality relation represents the relation between two tasks that are executed in order, from task a1 to task a2, where a1 and a2 are tasks in the event log. In the causal matrix of an individual, the INPUT(a) set indicates which tasks directly precede task a, and the OUTPUT(a) set indicates which tasks directly follow task a. Table 3 shows the causal matrix that represents the process model in Fig. 5. As shown in Table 3, task A is the first task, so INPUT(A) = {}. In addition, OUTPUT(A) = {{B, C, G}} indicates that only one of B, C, or G is executed after task A finishes. The causality relation C in a causal matrix is derived from the dependency measure. The idea of the dependency measure is to estimate how strongly two tasks depend on each other from how often the sequences "a1 a2" and "a2 a1" appear in the log. The dependency measure is defined in (1) [3, 9].

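\[
D(a_1, a_2, L) =
\begin{cases}
\dfrac{\mathit{L2L}(a_1, a_2, L) + \mathit{L2L}(a_2, a_1, L)}{\mathit{L2L}(a_1, a_2, L) + \mathit{L2L}(a_2, a_1, L) + 1} & \text{if } a_1 \neq a_2 \text{ and } \mathit{L2L}(a_1, a_2, L) > 0,\\[2ex]
\dfrac{\mathit{follows}(a_1, a_2, L) - \mathit{follows}(a_2, a_1, L)}{\mathit{follows}(a_1, a_2, L) + \mathit{follows}(a_2, a_1, L) + 1} & \text{if } a_1 \neq a_2 \text{ and } \mathit{L2L}(a_1, a_2, L) = 0,\\[2ex]
\dfrac{\mathit{follows}(a_1, a_2, L)}{\mathit{follows}(a_1, a_2, L) + 1} & \text{if } a_1 = a_2
\end{cases}
\tag{1}
\]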

where L is an event log; a1 and a2 are tasks appearing in the event log; L2L(a1, a2, L) is the number of times the substring "a1 a2 a1" (a length-two loop) appears in the event log L; and follows(a1, a2, L) is the number of times the substring "a1 a2" appears in the event log L.
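As a worked illustration of the dependency measure in (1), the following Python sketch counts the substrings directly. It is a minimal sketch under stated assumptions: a log is represented as a list of task lists, overlapping occurrences of a substring are counted separately, and the helper names (`count_substring`, `follows`, `l2l`, `dependency`) are hypothetical. The example log reuses the task orders of Table 1 with the time-intervals left out.

```python
def count_substring(trace, pattern):
    """Number of (possibly overlapping) occurrences of pattern in trace."""
    n = len(pattern)
    return sum(1 for i in range(len(trace) - n + 1)
               if tuple(trace[i:i + n]) == tuple(pattern))

def follows(a1, a2, log):
    """How often the substring 'a1 a2' appears in the log."""
    return sum(count_substring(trace, (a1, a2)) for trace in log)

def l2l(a1, a2, log):
    """How often the length-two loop 'a1 a2 a1' appears in the log."""
    return sum(count_substring(trace, (a1, a2, a1)) for trace in log)

def dependency(a1, a2, log):
    """Dependency measure D(a1, a2, L), following the three cases of (1)."""
    if a1 != a2:
        if l2l(a1, a2, log) > 0:
            loops = l2l(a1, a2, log) + l2l(a2, a1, log)
            return loops / (loops + 1)
        f12, f21 = follows(a1, a2, log), follows(a2, a1, log)
        return (f12 - f21) / (f12 + f21 + 1)
    f = follows(a1, a1, log)
    return f / (f + 1)

# Task orders of the four cases in Table 1 (time-intervals omitted).
log = [list("ABCDF"), list("ABCDF"), list("ACBDF"), list("EJ")]

d_ab = dependency("A", "B", log)   # A is followed by B twice, never the reverse
d_bc = dependency("B", "C", log)   # B->C twice, C->B once
d_cb = dependency("C", "B", log)
```

With this log, D(A, B, L) = 2/3, D(B, C, L) = 1/4, and D(C, B, L) = −1/4: the measure is positive for the dominant direction and negative for the opposite one, and it approaches 1 only when a direction is observed many times.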


Fig. 4 The main steps in the genetic process mining method

Table 3 The causal matrix of an individual

TASK   INPUT(a)        OUTPUT(a)
A      {}              {{B, C, G}}
B      {{A}}           {{D}, {E}}
C      {{A}}           {{H}}
D      {{B}}           {{F}}
E      {{B}}           {{F}}
F      {{D}, {E}}      {{H}}
G      {{A}}           {{H}}
H      {{F, C, G}}     {}
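One possible in-memory representation of the causal matrix in Table 3 is sketched below in Python. The dictionary layout and the helper names (`causality_relation`, `is_consistent`, `start_tasks`) are assumptions made for illustration only. Tasks inside one subset are alternatives, as the text states for OUTPUT(A); reading separate subsets such as {{D}, {E}} as parallel branches is the usual causal-matrix interpretation and is assumed here rather than taken from the text.

```python
# Causal matrix of Table 3: each task maps to its INPUT and OUTPUT sets,
# stored as lists of subsets.
causal_matrix = {
    "A": {"input": [],                 "output": [{"B", "C", "G"}]},
    "B": {"input": [{"A"}],            "output": [{"D"}, {"E"}]},
    "C": {"input": [{"A"}],            "output": [{"H"}]},
    "D": {"input": [{"B"}],            "output": [{"F"}]},
    "E": {"input": [{"B"}],            "output": [{"F"}]},
    "F": {"input": [{"D"}, {"E"}],     "output": [{"H"}]},
    "G": {"input": [{"A"}],            "output": [{"H"}]},
    "H": {"input": [{"F", "C", "G"}],  "output": []},
}

def causality_relation(cm):
    """All pairs (a1, a2) such that a2 appears in OUTPUT(a1)."""
    return {(a, b) for a, row in cm.items()
            for subset in row["output"] for b in subset}

def is_consistent(cm):
    """Check that b is in OUTPUT(a) exactly when a is in INPUT(b)."""
    from_inputs = {(b, a) for a, row in cm.items()
                   for subset in row["input"] for b in subset}
    return causality_relation(cm) == from_inputs

def start_tasks(cm):
    """Tasks with an empty INPUT set, i.e. possible first tasks."""
    return [a for a, row in cm.items() if not row["input"]]

relation = causality_relation(causal_matrix)
```

The consistency check is useful after crossover or mutation, because a change to INPUT(a) or OUTPUT(a) has to be mirrored in the related tasks.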

The pseudo-code for setting the causality relation is shown in Fig. 6 [3, 9]. In the pseudo-code, p is a power value that affects the probability that two tasks are given a causality relation. If the randomly selected value r is smaller than $D(a_1, a_2, L)^p$, where D is the dependency measure, there is a causality relation from a1 to a2, denoted (a1, a2) ∈ C. After the causality relation is obtained, the elements of INPUT(a) and OUTPUT(a) are randomly combined into Boolean expressions to create the individuals.

3.2.2 Fitness measure

The fitness measure estimates how well an individual fits the event log. It is evaluated with two methods: the "completeness" fitness method PF_complete and the "preciseness" fitness method PF_precise. The "completeness" fitness method obtains the fitness value by comparing each individual (a possible process model) with the event sequences in the log. It also takes into account the partial accuracy of an individual and the number of problems that occur in the individual. A problem occurs when a split or join in the individual is of the wrong type: an AND-split where an OR-split is required, an OR-join where an AND-join is required, an OR-split where an AND-split is required, or an AND-join where an OR-join is required. When an individual is compared with an event sequence, the "completeness" fitness method determines which tasks fit the individual and which tasks cause problems. The "completeness" fitness is defined in (2) [3, 9]. The value range of PF_complete is (−∞, 1].

\[
PF_{complete}(L, \mathit{CM}) = \frac{\mathit{allParsedActivities}(L, \mathit{CM}) - \mathit{punishment}}{\mathit{numActivitiesLog}(L)}
\tag{2}
\]

with

\[
\mathit{punishment} = \frac{\mathit{allMissingTokens}(L, \mathit{CM})}{\mathit{numTracesLog}(L) - \mathit{numTracesMissingTokens}(L, \mathit{CM}) + 1}
+ \frac{\mathit{allExtraTokensLeftBehind}(L, \mathit{CM})}{\mathit{numTracesLog}(L) - \mathit{numTracesExtraTokensLeftBehind}(L, \mathit{CM}) + 1}
\]


Fig. 5 The process model based on the individual in Table 3

input: An event log L, a power value p, the dependency function D.
output: A causality relation C.
(1) S ← set of tasks in L.
(2) C ← ∅.
(3) For each tuple (a1, a2) in S × S do:
    (a) Randomly select a number r between 0 (inclusive) and 1.0 (exclusive).
    (b) IF r < D(a1, a2, L)^p then: C ← C ∪ {(a1, a2)}.
(4) Return the causality relation C.

Fig. 6 The steps of generating the causality relation
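The steps of Fig. 6 can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function name `build_causality_relation`, the callable `dependency(a1, a2)`, and the toy dependency values are hypothetical. One deviation is stated explicitly: pairs with a non-positive dependency value are skipped before the power is taken, which gives the same result as the test r < D^p for odd integer p (since r ≥ 0) and avoids complex numbers for fractional p.

```python
import random

def build_causality_relation(tasks, dependency, p, rng):
    """Randomly build a causality relation C from dependency values.

    tasks      -- iterable of task names (the set S in Fig. 6)
    dependency -- callable (a1, a2) -> D(a1, a2, L)
    p          -- power value; larger p makes weak dependencies less likely
    rng        -- random.Random instance, passed in for reproducibility
    """
    tasks = list(tasks)
    relation = set()
    for a1 in tasks:
        for a2 in tasks:
            r = rng.random()              # uniform in [0.0, 1.0)
            d = dependency(a1, a2)
            # Assumption: skip d <= 0 instead of raising it to the power p.
            if d > 0 and r < d ** p:
                relation.add((a1, a2))
    return relation

# Toy dependency values, for illustration only. A value of exactly 1.0
# cannot come from (1); it is used here to make the outcome deterministic.
toy_d = {("A", "B"): 1.0, ("B", "A"): -0.5}
relation = build_causality_relation(
    ["A", "B"], lambda x, y: toy_d.get((x, y), 0.0), p=1, rng=random.Random(0))
```

Because each pair is accepted with probability D^p, repeated calls with different random seeds yield different causality relations, which is what gives the initial population its diversity.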

where L is an event log; CM is the causal matrix (an individual); allParsedActivities(L, CM) is the total number of tasks in the event log L that are parsed by the causal matrix CM without problems; numActivitiesLog(L) is the number of tasks in L; allMissingTokens(L, CM) is the number of missing tokens in all event sequences; allExtraTokensLeftBehind(L, CM) is the number of tokens that were not consumed after the parsing has stopped plus the number of tokens in the end place minus 1 (because of proper completion); numTracesLog(L) is the number of sequences in L; and numTracesMissingTokens(L, CM) and numTracesExtraTokensLeftBehind(L, CM) are the numbers of event sequences in which tokens were missing and in which tokens were left behind during the parsing, respectively. As shown in (2), allMissingTokens penalizes a process model that uses an OR-split where an AND-split is required or an AND-join where an OR-join is required. Similarly, allExtraTokensLeftBehind penalizes a process model that uses an OR-join where an AND-join is required or an AND-split where an OR-split is required.
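The arithmetic of (2) can be checked with a short Python sketch. It only evaluates the formula from counts that are assumed to be already available; it does not replay tokens on a model. The function name `pf_complete`, its argument names, and the counts in the second example are hypothetical. The log size (4 sequences, 17 tasks) corresponds to the sequences of Table 1.

```python
def pf_complete(all_parsed, num_activities, all_missing, all_extra,
                num_traces, traces_missing, traces_extra):
    """Completeness fitness of (2), computed from replay counts."""
    punishment = (all_missing / (num_traces - traces_missing + 1)
                  + all_extra / (num_traces - traces_extra + 1))
    return (all_parsed - punishment) / num_activities

# A model that parses every task of the 4 sequences (17 tasks) cleanly.
perfect = pf_complete(all_parsed=17, num_activities=17,
                      all_missing=0, all_extra=0,
                      num_traces=4, traces_missing=0, traces_extra=0)

# Hypothetical counts: 15 tasks parsed, 2 missing tokens and 1 token
# left behind, each problem confined to a single sequence.
flawed = pf_complete(all_parsed=15, num_activities=17,
                     all_missing=2, all_extra=1,
                     num_traces=4, traces_missing=1, traces_extra=1)
```

In the second case the punishment is 2/4 + 1/4 = 0.75, so the fitness is (15 − 0.75)/17 ≈ 0.838. The denominators grow when the problems are concentrated in few sequences, so the same number of missing tokens is punished less when it affects only one sequence than when it is spread over many.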

Although an individual may fit the event sequences, it may also contain undesired extra behavior. Extra behavior means that the process model allows more event sequences than those recorded in the event log. Therefore, the "preciseness" fitness method PF_precise is used to determine how much extra behavior an individual allows by comparing the individual with the other individuals in the current population. In the "preciseness" fitness method, the number of enabled tasks at every reachable marking is counted to detect extra behavior. A reachable marking is a distribution of tokens that is reached while the log is parsed, and it determines which tasks of the model are enabled. The reason for counting enabled tasks is that the more extra behavior an individual allows, the more tasks are enabled. The "preciseness" fitness is defined in (3) [3, 9].

\[
PF_{precise}(L, \mathit{CM}, \mathit{CM}[]) = \frac{\mathit{allEnabledActivities}(L, \mathit{CM})}{\max(\mathit{allEnabledActivities}(L, \mathit{CM}[]))}
\tag{3}
\]

where CM[] is a population that contains the causal matrix CM; allEnabledActivities(L, CM) is the number of activities that were enabled during the parsing of the log L by the causal matrix (individual) CM; allEnabledActivities(L, CM[]) applies allEnabledActivities(L, CM) to every causal matrix in the population CM[]; and max(allEnabledActivities(L, CM[])) returns the maximum number of enabled tasks that any individual in the population CM[] had while parsing the log L. The "completeness" and "preciseness" fitness methods are then integrated into the fitness measure of genetic process mining. Because knowing how well an individual fits the event sequences is more important than knowing how much extra behavior the individual allows, the fitness measure is defined in (4), where ζ ∈ (0, 1] is a punishment weight for extra behavior:

\[
\mathit{Fitness}(L, \mathit{CM}, \mathit{CM}[]) = PF_{complete}(L, \mathit{CM}) - \zeta \times PF_{precise}(L, \mathit{CM}, \mathit{CM}[])
\tag{4}
\]
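How (3) and (4) combine can be seen in the following Python sketch. It assumes that the completeness value and the enabled-activity counts have already been obtained by parsing the log; the function names and the three counts are hypothetical. The weight ζ = 0.02 is the value used later in the case study.

```python
def pf_precise(enabled, population_enabled):
    """Preciseness of (3): enabled activities relative to the population maximum."""
    return enabled / max(population_enabled)

def fitness(pf_complete_value, enabled, population_enabled, zeta):
    """Fitness of (4): completeness minus the weighted preciseness term."""
    return pf_complete_value - zeta * pf_precise(enabled, population_enabled)

# Hypothetical population of three individuals that all parse the log
# completely (PF_complete = 1.0) but enable different numbers of tasks.
population_enabled = [40, 50, 80]
scores = [fitness(1.0, e, population_enabled, zeta=0.02)
          for e in population_enabled]
```

The three scores are 0.99, 0.9875, and 0.98. With equal completeness, the individual that enables the fewest tasks, and therefore allows the least extra behavior, receives the highest fitness, while the small value of ζ keeps completeness the dominant term.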


input: Two parents (parent1 and parent2), crossover rate.
output: Two possibly recombined offspring (offspring1 and offspring2).
(1) offspring1 ← parent1 and offspring2 ← parent2.
(2) With probability "crossover rate" do:
    (a) Randomly select a task a to be the crossover point of the offspring.
    (b) Randomly select a swap point sp1 for I1(a). The swap point goes from position 0 to n − 1, where n is the number of subsets in the condition function I1(a).
    (c) Randomly select a swap point sp2 for I2(a).
    (d) remainingSet1(a) equals the subsets in I1(a) that are between position 0 and sp1 (exclusive).
    (e) swapSet1(a) equals the subsets in I1(a) whose position is equal to or greater than sp1.
    (f) Repeat steps 2d and 2e but respectively use remainingSet2(a), I2(a), sp2 and swapSet2(a) instead of remainingSet1(a), I1(a), sp1 and swapSet1(a).
    (g) For every subset S2 in swapSet2(a) do:
        (i) With equal probability perform one of the following steps:
            (A) Add S2 as a new subset in remainingSet1(a).
            (B) Join S2 with an existing subset X1 in remainingSet1(a).
            (C) Select a subset X1 in remainingSet1(a), remove the elements of X1 that are also in S2 and add S2 to remainingSet1(a).
    (h) Repeat step 2g but respectively use S1, swapSet1(a), X2 and remainingSet2(a) instead of S2, swapSet2(a), X1 and remainingSet1(a).
    (i) I1(a) ← remainingSet1(a) and I2(a) ← remainingSet2(a).
    (j) Repeat steps 2b to 2h but use O(a) instead of I(a).
    (k) Update the tasks related to a.
(3) Return offspring1 and offspring2.

Fig. 7 The pseudo-code of the crossover procedure

3.2.3 Stopping criteria

The stopping criteria determine when the algorithm terminates. In the genetic process mining method, the algorithm stops when (i) an individual with a fitness equal to 1 is found; (ii) the maximum number of generations n has been executed; or (iii) the best fitness value among half of the population has not changed. If none of the stopping criteria is met, the genetic operators of the algorithm are used to create a new population.

3.2.4 Tournament selection

Darwin's theory of evolution indicates that the fittest individuals are preserved and should be used to create new offspring. Therefore, a percentage of the best individuals (the elite) is copied directly to the next population, while the other individuals in the population are generated via crossover and mutation. Two parents produce two offspring. To select each parent, a tournament is played in which a number of individuals are randomly drawn from the population and the fittest one always wins.
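Elitism combined with tournament selection can be sketched in Python as below. The function names, the parameter names `elite_fraction` and `tournament_size`, and the toy population of integers are assumptions for illustration; the text does not specify the elite percentage or the tournament size.

```python
import random

def tournament_select(population, fitness_of, size, rng):
    """Draw `size` distinct individuals at random; the fittest one wins."""
    contestants = rng.sample(population, size)
    return max(contestants, key=fitness_of)

def select_elite_and_parents(population, fitness_of, elite_fraction,
                             tournament_size, rng):
    """Return the elite (copied unchanged) and the parent pairs for the rest."""
    ranked = sorted(population, key=fitness_of, reverse=True)
    elite = ranked[:int(len(population) * elite_fraction)]
    pairs = []
    # Each pair of parents yields two offspring via crossover and mutation.
    while len(elite) + 2 * len(pairs) < len(population):
        pairs.append((tournament_select(population, fitness_of, tournament_size, rng),
                      tournament_select(population, fitness_of, tournament_size, rng)))
    return elite, pairs

# Toy population: integers stand in for individuals, fitness grows with value.
population = list(range(10))
fitness_of = lambda individual: individual / 10
elite, pairs = select_elite_and_parents(
    population, fitness_of, elite_fraction=0.2, tournament_size=3,
    rng=random.Random(42))
```

A larger tournament size increases the selection pressure: when the tournament covers the whole population, the fittest individual is always chosen, whereas a tournament of size 1 reduces to uniform random selection.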

3.2.5 Crossover

To generate better individuals, crossover combines existing individuals in the current population with a probability given by the crossover rate. In the crossover procedure, the sets of causality relations in the individuals are recombined to generate new individuals. Therefore, the search space contains any combination of the causality relations of all individuals. An individual is allowed to gain or lose tasks in its INPUT(a) and OUTPUT(a) sets and to exchange causality relations with other individuals. First, a task a is selected randomly to be the crossover point. Second, two individuals (offspring1 and offspring2) are copied from the parents (parent1 and parent2). Third, swap points sp1 and sp2 are randomly selected in the INPUT(a) sets of offspring1 and offspring2 respectively. The crossover point (the selected task) is recombined by exchanging the subsets divided at the swap point, where remainingSet(a) contains the subsets of INPUT(a) positioned before the swap point and swapSet(a) contains the subsets of INPUT(a) from the swap point to the end. The same procedure is then applied to the OUTPUT(a) sets of offspring1 and offspring2. The detailed pseudo-code of the crossover procedure is shown in Fig. 7 [3, 9]. Note that the subsets in remainingSet(a) and swapSet(a) are denoted X and S respectively.

input: An individual, mutation rate.
output: A possibly mutated individual.
(1) For every task a in the individual do:
    (a) With probability "mutation rate" do one of the following operations for the condition function I(a):
        (i) Select a subset X in I(a) and add a task a' to X, where a' belongs to the set of tasks in the individual.
        (ii) Select a subset X in I(a) and remove a task a' from X, where a' belongs to X. If X is empty after the removal, exclude X from I(a).
        (iii) Redistribute the elements in I(a).
    (b) Repeat step 1a, but use the condition function O(a) instead of I(a).

Fig. 8 The pseudo-code of the mutation procedure
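The mutation procedure of Fig. 8 can be sketched in Python as follows. This is a simplified sketch under stated assumptions: an individual is a dictionary that maps each task to lists of subsets for its INPUT and OUTPUT sets, the three operations are chosen with equal probability, "redistribute" is interpreted as randomly regrouping the same elements into a random number of subsets, and the related tasks are not re-synchronized afterwards. The names `mutate_condition` and `mutate` and the four-task individual are hypothetical.

```python
import random

def mutate_condition(subsets, all_tasks, rng):
    """Apply one randomly chosen mutation operation to a list of subsets."""
    subsets = [set(s) for s in subsets]          # work on a copy
    if not subsets:
        return subsets                           # e.g. INPUT of the start task
    operation = rng.choice(["add", "remove", "redistribute"])
    if operation == "add":
        rng.choice(subsets).add(rng.choice(all_tasks))
    elif operation == "remove":
        target = rng.choice(subsets)
        target.remove(rng.choice(sorted(target)))
    else:
        # Regroup the same elements into 1..n non-empty subsets.
        elements = sorted(set().union(*subsets))
        rng.shuffle(elements)
        n_groups = rng.randint(1, len(elements))
        subsets = [set(elements[i::n_groups]) for i in range(n_groups)]
    return [s for s in subsets if s]             # exclude emptied subsets

def mutate(individual, mutation_rate, rng):
    """Return a possibly mutated copy of the individual."""
    tasks = sorted(individual)
    mutated = {}
    for task, row in individual.items():
        mutated[task] = {}
        for key in ("input", "output"):
            if rng.random() < mutation_rate:
                mutated[task][key] = mutate_condition(row[key], tasks, rng)
            else:
                mutated[task][key] = [set(s) for s in row[key]]
    return mutated

# Hypothetical four-task individual used only to exercise the sketch.
individual = {
    "A": {"input": [],           "output": [{"B", "C"}]},
    "B": {"input": [{"A"}],      "output": [{"D"}]},
    "C": {"input": [{"A"}],      "output": [{"D"}]},
    "D": {"input": [{"B", "C"}], "output": []},
}
unchanged = mutate(individual, 0.0, random.Random(1))
mutant = mutate(individual, 1.0, random.Random(1))
```

A complete implementation would also update the OUTPUT or INPUT sets of the tasks that were added or removed, so that the causal matrix stays consistent; that step is left out here.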

Table 4 A testing dataset

Case id   Event sequence
case 1    (A, I3, D, I2, E)
case 2    (A, I5, D, I3, C, I1, E)
case 3    (A, I1, B, I2, E)
case 4    (A, I3, C, I2, D, I2, E)
case 5    (A, I2, B, I5, D, I2, E)

3.2.6 Mutation

The mutation procedure inserts new characteristics into individuals to create variation. Mutation is important to prevent the population from stagnating at a local optimum. During mutation, the existing causality relations of a population may be changed. Each task of an individual (an offspring) is mutated with a probability equal to the mutation rate. The mutation point is a task in an individual. The mutation operations are applied to the INPUT(a) and OUTPUT(a) sets of the mutation point by randomly selecting one of the following operations [5]: (i) add a task to a randomly selected subset, (ii) remove a task from a randomly selected subset, or (iii) randomly redistribute the elements of the set. The detailed steps for mutating an individual are summarized in Fig. 8, where the INPUT(a) and OUTPUT(a) sets of the offspring are denoted I(a) and O(a) respectively [3, 9].

3.3 The quality evaluation for process model

Different genetic parameters such as population size, the maximum number of generations, crossover rate, and mutation rate generate different process models. To select the best model among them, this research defines a precision measure to evaluate the quality of the constructed process models. The basic idea behind the precision measure is that a high-quality process model should be able to correctly predict the behavior of unseen data. Therefore, if the prediction accuracy of a process model is high, the quality of the process model is high.

The prediction accuracy of a process model is composed of two parts. The first part evaluates whether the segments of the event sequences are predicted correctly by the constructed process model, while the second part evaluates whether whole event sequences are predicted correctly. For this purpose, a segment is defined as a substring "a1 Ii a2", where a1 and a2 are tasks in an event sequence and Ii is the discretized time-interval between the two tasks. The segment precision of a process model is defined in (5):

\[
PS_{segment} = \frac{\mathit{SumAccuracyTraces}(\mathit{Accuracy})}{\mathit{numTracesLog}(L_{testing})}
\tag{5}
\]

with

\[
\mathit{Accuracy} = \frac{\mathit{numParsedSubstring}(a_1, I_i, a_2, \mathit{PM})}{\mathit{numSubstring}(a_1, I_i, a_2, \mathit{PM})}
\tag{6}
\]

where a1 and a2 are tasks appearing in the event sequence; Ii is the discretized time-interval between the tasks; PM is the process model constructed from the training dataset; L_testing is the set of event sequences in the testing dataset; numParsedSubstring(a1, Ii, a2, PM) is the number of substrings "a1 Ii a2" of an event sequence that are parsed by the constructed process model PM; numSubstring(a1, Ii, a2, PM) is the number of substrings "a1 Ii a2" in that event sequence of the testing dataset; Accuracy is the ratio of fitted segments in an event sequence; SumAccuracyTraces(Accuracy) is the sum of the accuracies of all event sequences; and numTracesLog(L_testing) is the total number of cases in the testing dataset.

For example, Table 4 shows a testing dataset and Fig. 9 is the time-interval process model constructed by the genetic process mining method. The substrings (A, I3, D) and (D, I2, E) can be retrieved from case 1. Since the segment (D, I2, E) is the only substring parsed by the process model, numParsedSubstring(a1, Ii, a2, PM) = 1 and numSubstring(a1, Ii, a2, PM) = 2 for case 1. For case 2, numSubstring(a1, Ii, a2, PM) = 3 since the three segments (A, I5, D), (D, I3, C), and (C, I1, E) are retrieved. However, no segment in case 2 fits the process model. Therefore, numParsedSubstring(a1, Ii, a2, PM) = 0. After all cases are processed in the same way, the Accuracy values of the five sequences are 1/2, 0, 1/2, 3/3, and 1/3 respectively. In addition, SumAccuracyTraces(Accuracy) = 1/2 + 0 + 1/2 + 3/3 + 1/3 = 7/3 and numTracesLog(L_testing) = 5. Finally, PS_segment = (7/3)/5 ≈ 0.4667.

Fig. 9 An example time-interval process model

The other measurement evaluates the process model using whole event sequences. That is, an event sequence is regarded as one unit when evaluating the quality of the process model. If any segment of an event sequence cannot be found in the process model, the evaluation of that sequence stops and the sequence is counted as not parsed. The whole sequence precision of a process model is defined in (7):

\[
PC_{complete} = \frac{\mathit{allParsedSequences}(L_{testing}, \mathit{PM})}{\mathit{numTracesLog}(L_{testing})}
\tag{7}
\]

where PM is the process model constructed from the training dataset; L_testing is the set of event sequences in the testing dataset; allParsedSequences(L_testing, PM) is the number of event sequences that are parsed by the constructed process model; and numTracesLog(L_testing) is the total number of event sequences in the testing dataset. After the segment precision and the whole sequence precision are derived, they are integrated into a single precision measure to evaluate the quality of the process model. The precision measure is defined in (8):

\[
\mathit{Precision} = PS_{segment} \times (1 - \mu) + PC_{complete} \times \mu
\tag{8}
\]

where μ is the importance weight of PC_complete and takes a value in [0, 1]. It is suggested that the weight of PS_segment be set higher than the weight of PC_complete, since fitting a whole sequence to a process model is more difficult than fitting its segments.

4 Implementation

This section presents a case study to demonstrate how the proposed time-interval process model improves process management. In addition, a set of experiments is conducted to study the performance and the effects of different parameter settings.

Table 5 The time-interval event sequences in the Wild West area

Visitor id   Event sequence
1            (22, 20.00, 25, 25.00, 26)
2            (26, 21.54, 25, 17.69, 22, 26.32, 27)
3            (25, 21.11, 22, 33.57, 27)
4            (22, 16.67, 25, 20.00, 26)
5            (22, 15.00, 25, 17.91, 27)
6            (22, 14.35, 25, 16.87, 27)
7            (22, 22.48, 27)
8            (26, 27.89, 25, 25.26, 22, 31.16, 27)
...          ...
27           (22, 14.35, 25, 16.87, 27)
28           (26, 18.82, 25, 15.88, 22, 22.48, 27)
29           (22, 21.11, 25, 26.67, 26)
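Before the precision measure of Sect. 3.3 is applied to the case study, its arithmetic in (5) to (8) can be checked against the testing dataset of Table 4 with a small Python sketch. Since the model of Fig. 9 is not reproduced here, the sketch assumes that a process model can be approximated by the set of segments it accepts; the set `allowed_segments` is a hypothetical stand-in chosen so that the per-case accuracies match the values 1/2, 0, 1/2, 3/3, and 1/3 quoted in the text. Treating a sequence as parsed when all of its segments are accepted is a further simplification, so the resulting whole-sequence precision and overall Precision follow from these assumptions and are not results of the paper.

```python
def segments(sequence):
    """Split (a1, I, a2, I, a3, ...) into (task, interval, task) triples."""
    return [tuple(sequence[i:i + 3]) for i in range(0, len(sequence) - 2, 2)]

def precision(testing_log, allowed_segments, mu):
    """Return (PS_segment, PC_complete, Precision) as in (5), (7) and (8)."""
    accuracies, parsed_sequences = [], 0
    for sequence in testing_log:
        segs = segments(sequence)
        hits = sum(1 for s in segs if s in allowed_segments)
        accuracies.append(hits / len(segs))          # Accuracy of (6)
        parsed_sequences += hits == len(segs)        # whole sequence parsed
    ps_segment = sum(accuracies) / len(testing_log)
    pc_complete = parsed_sequences / len(testing_log)
    return ps_segment, pc_complete, ps_segment * (1 - mu) + pc_complete * mu

# Testing dataset of Table 4.
testing_log = [
    ["A", "I3", "D", "I2", "E"],
    ["A", "I5", "D", "I3", "C", "I1", "E"],
    ["A", "I1", "B", "I2", "E"],
    ["A", "I3", "C", "I2", "D", "I2", "E"],
    ["A", "I2", "B", "I5", "D", "I2", "E"],
]

# Hypothetical stand-in for the model of Fig. 9 (assumption, see lead-in).
allowed_segments = {("A", "I3", "C"), ("C", "I2", "D"),
                    ("D", "I2", "E"), ("A", "I1", "B")}

ps_segment, pc_complete, overall = precision(testing_log, allowed_segments, mu=0.4)
```

Under these assumptions PS_segment = 7/15 ≈ 0.4667, only case 4 is parsed completely so PC_complete = 1/5, and with μ = 0.4 the combined value is 0.4667 × 0.6 + 0.2 × 0.4 = 0.36.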

4.1 Case description

The visitor behavior in the Leofoo Village Theme Park, one of the largest amusement parks in Taiwan, is studied in this research. There are 38 major amusement facilities in the park, which is divided into four areas: Wild West, South Pacific, Arabia Kingdom and African Safari. For simplicity, only the visiting behavior in the Wild West area is discussed in the following sections. When a visitor visits a facility, an event occurs. Therefore, an event sequence is the route along which a visitor visits a set of amusement facilities inside the Wild West area. Table 5 shows part of the time-interval event sequences in the area. From the time-interval information in Table 5, max_time is 45.36 and min_time is 13.33. If k is set as 5, the range between min_time and max_time is split into five intervals of width (45.36 − 13.33)/5 = 6.406, and the set of discretized time-intervals TI = {I0, I1, ..., I5, I∞} is obtained, where I0: 0 ≤ t ≤ 13.33, I1: 13.33 < t ≤ 19.736, I2: 19.736 < t ≤ 26.142, I3: 26.142 < t ≤ 32.548, I4: 32.548 < t ≤ 38.954, I5: 38.954 < t ≤ 45.36 and I∞: 45.36 < t < ∞.

The solution quality of the proposed method might be affected by four GA parameters: population size, the maximum number of generations, crossover rate, and mutation rate. Therefore, a set of 5-fold cross-validation experiments is conducted to find the best parameter combination for the genetic process mining method. Following the suggestion of [19], the initial population size is set as 100, the maximum number of generations as 1000, the crossover rate as 0.8, and the mutation rate as 0.2. In addition, the punishment weight ζ in (4) and the importance weight μ in (8) are set as 0.02 and 0.4, respectively. As shown in Fig. 10(a), Precision decreases when the population size is smaller or larger than 150. Therefore, the population size is set as 150. Figure 10(b) shows that the Precision value is steady once the maximum number of generations is greater than or equal to 800. Therefore, the maximum number of generations is set as 800 in the following discussion. Figure 10(c) shows that the Precision value decreases when the crossover rate is higher or lower than 0.3. Therefore, the crossover rate is set as 0.3. Figure 10(d) shows that the higher the mutation rate is, the worse the Fitness and Precision are. The reason is that the mutation procedure in genetic process mining adds events to or removes events from individuals. Based on Fig. 10(d), the mutation rate is set as 0.01. It is interesting that the Fitness value is always higher than the Precision value in all figures. The reason is that the Fitness value is estimated on the training dataset, whereas the Precision value is estimated on the testing dataset. Therefore, a process model with a higher Fitness value is not guaranteed to have a higher Precision value. Figure 11 shows the time-interval process model with the highest prediction accuracy (highest Precision value) when the population size is 150, the maximum number of generations is 800, the crossover rate is 0.3, and the mutation rate is 0.01.

Fig. 10 The experiments for the four parameters in the genetic process mining method

Fig. 11 The time-interval process model having the highest precision

Table 6 The fitness and precision of the constructed process models

k          Fitness    Precision
1          0.97924    1.00000
3          0.98310    0.90000
5          0.98310    0.90000
7          0.96259    0.86667
9          0.96259    0.86667
Average    0.97412    0.90667

4.2 Time-interval value setting

To observe how the time-interval value k affects the result, a set of experiments in which k is varied from 1 to 9 is conducted. Figure 12 shows the generated process model when k is 1, Fig. 11 shows the process model when k is 3 or 5, and Fig. 13 shows the process model when k is 7 or 9. The fitness and precision values of these process models are summarized in Table 6.
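How the interval boundaries change with k can be sketched as follows. This is a minimal illustration, assuming the equal-width rule that is consistent with the k = 5 boundaries listed in Sect. 4.1; the function names are hypothetical and the boundaries are not rounded.

```python
from bisect import bisect_left


def interval_bounds(min_time: float, max_time: float, k: int) -> list[float]:
    """Upper bounds of I0..Ik: I0 ends at min_time and Ik ends at max_time."""
    if k < 1 or max_time <= min_time:
        raise ValueError("need k >= 1 and max_time > min_time")
    width = (max_time - min_time) / k  # assumption: equal-width intervals
    return [min_time + i * width for i in range(k)] + [max_time]


def interval_label(t: float, bounds: list[float]) -> str:
    """Label of the interval containing t; intervals are closed on the right."""
    i = bisect_left(bounds, t)  # index of the first bound that is >= t
    return f"I{i}" if i < len(bounds) else "Iinf"


# Boundaries for several values of k, using min_time and max_time from Table 5.
for k in (1, 5, 7):
    print(k, [round(b, 3) for b in interval_bounds(13.33, 45.36, k)])

# Labels of a few time-intervals from Table 5 when k = 5.
bounds_k5 = interval_bounds(13.33, 45.36, 5)
print([interval_label(t, bounds_k5) for t in (14.35, 21.11, 31.16)])
```

With k = 1 every observed time-interval above min_time falls into the single label I1, which is why a small k leaves little room to distinguish behaviors.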
As shown in Table 6, the highest Precision is obtained when k is 1. However, when k is 1, the process model considers only three time-intervals (I0, I1, I∞), so it is hard to understand the visitor behaviors. For example, as shown in Fig. 12, there is only one time-interval I1 (13.33 < t ≤ 45.37) between facilities 25 and 26. In contrast, when k is 7 or 9, as shown in Fig. 13, there are three time-intervals I1 (13.33 < t ≤ 17.91), I2 (17.91 < t ≤ 22.49) and I3 (22.49 < t ≤ 27.07) between facilities 25 and 26. The model in Fig. 13 therefore provides more information than the model in Fig. 12 about how much time visitors take to move from facility 26 to facility 25. In summary, if a smaller k is set, the prediction accuracy (Precision value) is high, but the process model cannot differentiate behaviors in detail. On the other hand, if a larger k is set, the prediction accuracy is low. Therefore, how to set the time-interval value depends on how closely a manager wants to examine the visiting behaviors of visitors.

4.3 Discussion

To show the benefits of the proposed method, a process model without time-interval information is built using the same experiment setting and dataset, as shown in Fig. 14. Although the topology information among events can be found in Fig. 14, no time-interval information between events is disclosed. However, time information between events is a very important factor for improving service quality in our implementation. For example, as shown in Fig. 11, the time-interval between facilities 22 and 27 is I2 or I4. When park managers further checked the physical layout between facilities 22 and 27, they found that a normal path from facility 22 to facility 27 should take only I2. The reason that visitors took I4 to move from facility 22 to facility 27 is an inappropriate direction sign design: the confusing sign guided visitors onto a much longer path to facility 27. Therefore, the direction sign between the two facilities was redesigned to avoid redundant walking. With the proposed time-interval model, visitors save valuable time and the service satisfaction of the park increases.

The proposed time-interval model can also help visitors complete their trip on time. Take the model in Fig. 11 as an example. The visiting sequences (22, {I1, I2}, 25, I1, 27) and (25, {I1, I2, I3}, 22, {I2, I4}, 27) can be found in the model. If a visitor decides to follow the sequence of Facility 22, Facility 25, and Facility 27, the minimum total time is in the range of 33.066 to 45.878 and the maximum total time is in the range of 39.482 to 52.878. However, if a visitor decides to follow the sequence of Facility 25, Facility 22, and Facility 27, the minimum total time is in the range of 33.066 to 45.878 (I1 followed by I2) and the maximum total time is in the range of 58.69 to 71.502 (I3 followed by I4). With this total time information, visitors can choose their visiting sequence if the total visiting time is limited and known in advance.
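The total-time bounds for a visiting sequence can be sketched in Python. This is an illustration under stated assumptions: the interval bounds are those listed for k = 5 in Sect. 4.1, the shortest case takes the earliest allowed interval on every leg and the longest case the latest one, and only moving times between facilities are summed. The function name is hypothetical. For the sequence of Facility 25, Facility 22, and Facility 27, the sketch reproduces the shortest-case range of 33.066 to 45.878 and the longest-case lower bound of 58.69.

```python
# Interval bounds for k = 5 (lower bound exclusive, upper bound inclusive).
INTERVALS = {
    "I1": (13.33, 19.736),
    "I2": (19.736, 26.142),
    "I3": (26.142, 32.548),
    "I4": (32.548, 38.954),
    "I5": (38.954, 45.36),
}


def total_time_ranges(legs):
    """Return ((lo, hi) of the shortest case, (lo, hi) of the longest case).

    `legs` holds, for each move between two facilities, the set of
    interval labels that the process model allows for that move.
    """
    short_lo = sum(min(INTERVALS[label][0] for label in leg) for leg in legs)
    short_hi = sum(min(INTERVALS[label][1] for label in leg) for leg in legs)
    long_lo = sum(max(INTERVALS[label][0] for label in leg) for leg in legs)
    long_hi = sum(max(INTERVALS[label][1] for label in leg) for leg in legs)
    return (short_lo, short_hi), (long_lo, long_hi)


# Sequence 25 -> 22 -> 27 with allowed labels {I1, I2, I3} and {I2, I4}.
shortest, longest = total_time_ranges([{"I1", "I2", "I3"}, {"I2", "I4"}])
print([round(v, 3) for v in shortest], [round(v, 3) for v in longest])
```

A visitor with a fixed time budget could compare the longest-case upper bound of each candidate sequence against that budget.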


Fig. 12 The process model constructed with 1 time-interval

Fig. 13 The process model constructed with 9 time-intervals

Fig. 14 The process model without time-interval information


5 Conclusions

Process mining technology plays an important role in exploring complicated process models. Most current process mining algorithms consider only the topology information among events and do not include time information. To overcome this drawback, a time-interval genetic process mining framework is proposed. Time-interval information between tasks is included during process modeling so that behaviors with different time relationships can be distinguished. In addition, a genetic process mining algorithm based on a global search strategy is applied to avoid the problems caused by local search. Finally, a precision measure is defined and used to evaluate the quality of time-interval process models. With this measure, managers can select the best time-interval process model among a set of candidate models and make the right decision at the right time.

The proposed time-interval process mining framework can be improved in the following ways. First, different precision measures should be explored for evaluating the quality of time-interval process models. Second, although the genetic process mining method is used to construct the process models, some invisible tasks appear in the resulting models. Why invisible tasks arise and how to deal with them is an interesting topic. Third, the time-interval value k is currently set manually. Further research should focus on developing an automatic method to balance the tradeoff between model prediction accuracy and model usage convenience.

References

1. Agrawal R, Gunopulos D, Leymann F (1998) Mining process models from workflow logs. In: Proceedings of the 6th international conference on extending database technology: advances in database technology, Valencia, pp 469–483
2. Alves de Medeiros AK, Van Dongen BF, Van Der Aalst WMP, Weijters AJMM (2004) Process mining: extending the α-algorithm to mine short loops. BETA Working Paper Series (WP 113), Eindhoven University of Technology, Eindhoven
3. Alves de Medeiros AK, Weijters AJMM, Van Der Aalst WMP (2004) Using genetic algorithms to mine process models: representation, operators and results. BETA Working Paper Series (WP 124), Eindhoven University of Technology, Eindhoven
4. Anisimov N, Kovalenko A, Postupalski P (1994) Compositional Petri net environment. In: Proceedings of the 1994 IEEE symposium on emerging technologies & factory automation (ETFA’94), Tokyo, Japan, pp 420–427
5. Cook JE, Wolf AL (1998) Discovering models of software processes from event-based data. ACM Trans Softw Eng Method 7(3):215–249
6. Cook JE, Wolf AL (1998) Event-based detection of concurrency. In: Proceedings of the 6th ACM SIGSOFT international symposium on foundations of software engineering. Lake Buena Vista, Florida, USA, pp 35–45
7. Cook JE, Du Z, Wolf AL (2004) Discovering models of behavior for concurrent workflows. Comput Ind 53(3):297–319
8. De Medeiros AKA, Van Der Aalst WMP, Weijters AJMM (2003) Workflow mining: current status and future directions. In: Lecture notes in computer science, vol 2888. Springer, Berlin, pp 389–406
9. De Medeiros AKA, Weijters AJMM, Van Der Aalst WMP (2007) Genetic process mining: an experimental evaluation. Data Min Knowl Discov 14(2):245–304
10. Kurkovsky S, Loganantharaj R (2005) Extension of Petri nets for representing and reasoning with tasks with imprecise durations. Appl Intell 23(2):97–108
11. Lee KK, Yoon WC, Baek DH (2006) A classification method using a hybrid genetic algorithm combined with an adaptive procedure for the pool of ellipsoids. Appl Intell 25(3):293–304
12. Murata T (1989) Petri nets: properties, analysis and applications. Proc IEEE 77(4):541–580
13. Ren C, Wen L, Dong J, Ding H, Wang W, Qiu M (2007) A novel approach for process mining based on event types. In: Proceeding of the IEEE international conference on services computing (SCC 2007), pp 721–722
14. Tsai CY, Liou JJH, Huang TM (2008) Using a multiple-GA method to solve batch picking problem: considering travel distance and order due time. Int J Prod Res 46(2):6533–6555
15. Tsai CY, Shieh YC (2009) A change detection method for sequential patterns. Decis Support Syst 46(2):501–511
16. Van Der Aalst WMP, Van Dongen BF (2002) Discovery workflow performance models from timed logs. In: Lecture notes in computer science, vol 2480. Springer, Berlin, pp 45–63
17. Van Der Aalst WMP, Van Dongen BF, Herbst J, Maruster L, Schimm G, Weijters AJMM (2003) Workflow mining: a survey of issues and approaches. Data Knowl Eng 47(2):237–267
18. Van Der Aalst WMP, Weijters T, Maruster L (2004) Workflow mining: discovering process models from event logs. IEEE Trans Knowl Data Eng 16(9):1128–1142
19. Wen L, Van Der Aalst WMP, Wang J, Sun J (2007) Mining process models with non-free-choice constructs. Data Min Knowl Discov 15(2):145–180

Chieh-Yuan Tsai is a professor in the Department of Industrial Engineering and Management at Yuan Ze University, Taiwan. He received his M.S. and Ph.D. degrees in Department of Industrial and Manufacturing Systems Engineering from the University of Missouri-Columbia. His research activities include data mining, product data management (PDM), customer relationship management (CRM), RFID applications, and e-Commerce.

Henyi Jen is an assistant professor in the Department of Industrial Engineering and Management at Yuan Ze University, Taiwan. He received his M.S. degree from University of Texas at Austin and Ph.D. degree from Purdue University. His research interests involve process modeling, simulation optimization and service process design.
