Process Mining as First-Order Classification Learning on Logs with Negative Events

Stijn Goedertier1, David Martens1, Bart Baesens1,2, Raf Haesen1,3, and Jan Vanthienen1

1 Department of Decision Sciences & Information Management, Katholieke Universiteit Leuven, Belgium
  {myFirstName.myLastName}@econ.kuleuven.be
2 School of Management, University of Southampton, United Kingdom
3 Vlekho Business School, Belgium

Abstract. Process mining is the automated construction of process models from information system event logs. In this paper we identify three fundamental difficulties related to process mining: the lack of negative information, the presence of history-dependent behavior, and the presence of noise. These difficulties can be elegantly dealt with when process mining is represented as first-order classification learning on event logs supplemented with negative events. A first set of process discovery experiments indicates the feasibility of this learning technique.

1 Introduction

Event logs of information systems such as ERP, Role-Based Access Control, and Workflow Management systems conceal an untapped reservoir of knowledge about the way people conduct every-day business transactions. The vast quantity of available events, however, makes it difficult to analyze event logs using only descriptive statistics. Process mining, in contrast, is the automated construction of process models from event logs [1,2]. Process models that have been discovered through process mining enable organizations to compare the behavior in the event log with the business conduct they would expect from their employees and other stakeholders. The latter can be helpful in the context of regulatory compliance or in the context of business process redesign and optimization. Many algorithms have been developed to describe or predict control-flow, data, or resource-related aspects of processes. An important but difficult learning task in process mining is the discovery of sequence constraints from event logs, referred to as Process Discovery [3,4]. Other process learning tasks involve, for instance, learning allocation policies [5] and social networks [6].

Process mining faces many difficulties. One difficulty is that it is often limited to the much more difficult setting of unsupervised learning, because negative information about state transitions that were prevented from taking place is often not available in the event log and consequently cannot guide the search. Moreover, much of the behavior displayed in processes is non-local, history-dependent behavior. While a history of related events is a potentially strong predictor and is readily available in process logs, the inclusion of such non-local, historic events in the hypothesis space of process mining algorithms poses many difficulties with regard to search space complexity and hypothesis visualization. Another difficulty is that process mining algorithms often overfit the noise in event logs.

In this paper these difficulties are addressed by representing process mining as first-order classification learning on event logs supplemented with negative events. We describe a technique to add artificial negative examples to a process log. Additionally, we show how first-order classification learners make it possible to search for patterns among multiple event rows in the event log and thus to detect history-dependent behavior. The proposed representation is expressive enough to cover many learning tasks in process mining, including Process Discovery.

The remainder of this article is structured as follows. First, an introduction is provided to first-order classification and it is shown how process mining can be represented as a binary classification problem. In Section 3 the problem of lacking negative information is discussed and an algorithm is proposed to supplement event logs with artificial negative examples. In Section 4 the proposed technique is applied to Process Discovery. Section 5 provides a brief overview of related work.

2 First-Order Classification Learners

Classification learning is learning how to assign an instance to a predefined class or group according to its known characteristics. The result of classification learning is a model that makes it possible to classify future instances in an automated way, based on a set of specific characteristics. Classification techniques are often used for credit scoring [7,8] and medical diagnosis. In process mining, classification learning has, for instance, been used for "Decision Mining" [9] and Process Discovery [10].

In this article process mining is represented as a classification problem that models the conditions under which an event can take place (a positive event) or not (a negative event). In this respect, it is useful to think of a process instance as a trajectory in a state space that is spanned by the domains of the different possible activities, events, and business concepts. Declarative classification rules can be used to classify whether a state transition at a given state is allowed (a positive event) or not (a negative event). Each activity in a process instance can undergo a number of distinct state transitions that are recorded as events, for instance:

– create(AId,BId,PId): creates a new activity instance AId with business identifiers BId in the context of the parent activity PId. As a result a created event is added to the state of the process instance.
– assign(AId,AgentId): the assignment of activity AId to an agent AgentId, recorded as an assigned event.
– addFact(AId,F), removeFact(AId,F), updateFact(AId,F1,F2): add, remove, or update a business fact F in the state space. This is recorded as a factAdded, a factRemoved, or a factUpdated event, respectively.
– complete(AId): requests the completion of activity AId, recorded as an event of the type completed.

To use first-order learners on an event log, the log has to be represented as a logic program of ground facts. In our experiments, an activity event is represented as an atom event(AId,AT,BId,ET,AgentId,PL,TS), with the following arguments:

– AId: a unique non-business identifier for the activity
– AT: the activity type (e.g. applyForLicense)
– BId: a unique business identifier of the activity
– ET: the event type (e.g. created, assigned, completed, ...)
– AgentId: the worker that brings about the activity state transition
– PL: a list of parameters that pertain to the event
– TS: a time stamp

In this article we use an extended version of the "Driver's License" example [4]. This example is a non-trivial Process Discovery problem with non-local non-free choice that has been extended with a parallel task 'obtainSpecialInsurance' and a loop over 'applyForLicense', as displayed in Fig. 1(a). The sample event log below represents the activity life cycle of an 'applyForLicense' activity.

event(act92,applyForLicense,driver2,created,worker1,[concept(act92,hasParent,act90)],1).
event(act92,applyForLicense,driver2,factAdded,worker1,[concept(driver2,hasRole,driver)],3).
event(act92,applyForLicense,driver2,assigned,worker1,[concept(act92,assignedTo,driver2)],6).
event(act92,applyForLicense,driver2,factAdded,driver2,[concept(driver2,hasAge,26)],9).
...
event(act92,applyForLicense,driver2,completed,driver2,[],17).
...
event(act99,doTheoreticalExam,driver2,created,worker1,[concept(act96,hasParent,act90)],56).
...

Most classification learners are called propositional or uni-relational classification learners, because they can only perform classification based on the information within a single row of a dataset. In contrast, first-order or multi-relational classification learners can learn classification patterns based on multiple rows within one or more tables of a dataset. For the purpose of discovering history-dependent patterns, this multi-relational property is much desired, as it allows learning based on global information in the event log. Alternatively, the event history of an event log instance could in part be represented as extra propositions (extra columns), for instance by including all immediately preceding event information as extra columns in the event log. However, if we want to relate an event to any earlier event within a process instance, it is no longer possible to represent these historic events as extra columns of an event table, as the dimensions of the input space would exponentially increase. High-dimensional input spaces are typically hard to handle for classical data mining techniques, a problem known as 'the curse of dimensionality' [11].
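As a minimal illustration of this multi-relational property (our own sketch, not code from the paper), a single Prolog clause can relate the classification of the current point in time to any earlier event/7 fact of the same process instance, without adding columns; the predicate name hasCompleted is hypothetical:

% Sketch: a history-dependent feature over the event/7 facts above.
% It succeeds if process instance BId has completed an activity of
% type AT at some time before Now -- a condition that relates two
% different rows of the event log.
hasCompleted(AT, BId, Now) :-
    event(_AId, AT, BId, completed, _AgentId, _Params, TS),
    TS < Now.

A propositional learner would need a separate column for every possible predecessor event to express the same test.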

[Petri net diagram not recoverable from the text extraction; its routing constraints correspond to the NS preconditions listed in part (b).]

(a) A Petri net representation

activity                            precondition
A0  start                           true
A1  applyForLicense                 NS(A0,A1) ∨ (count(A1^started) < 3 ∧ NS(A7,A1) ∧ NS(A7,A8) ∧ NS(A7,A10))
A2  attendClassesDriveCars          NS(A1,A2) ∧ NS(A1,A3)
A3  attendClassesRideMotorBike      NS(A1,A2) ∧ NS(A1,A3)
A4  doTheoreticalExam               NS(A2,A4) ∨ NS(A3,A4)
A5  doPracticalExamDriveCars        NS(A4,A5) ∧ NS(A4,A6) ∧ NS(A9,A5) ∧ NS(A9,A6) ∧ NS(A2,A5)
A6  doPracticalExamRideMotorBike    NS(A4,A5) ∧ NS(A4,A6) ∧ NS(A9,A5) ∧ NS(A9,A6) ∧ NS(A3,A6)
A7  getResult                       NS(A5,A7) ∨ NS(A6,A7)
A8  receiveLicense                  NS(A7,A1) ∧ NS(A7,A8) ∧ NS(A7,A10)
A9  obtainSpecialInsurance          NS(A2,A9) ∨ NS(A3,A9)
A10 end                             NS(A8,A10) ∨ (count(A1^started) >= 3 ∧ NS(A7,A1) ∧ NS(A7,A8) ∧ NS(A7,A10))

(b) A representation with activity preconditions

Fig. 1. An extended version of the Driver's License example [4]
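To make the notation of Fig. 1(b) concrete, the precondition of A8 (receiveLicense) can be written as a single Prolog clause over the ns/4 operator defined later in this section; the predicate name canStart is our own illustration:

% Precondition of A8 (receiveLicense) from Fig. 1(b):
% NS(A7,A1) AND NS(A7,A8) AND NS(A7,A10)
canStart(receiveLicense, BId, Now) :-
    ns(getResult, applyForLicense, BId, Now),   % NS(A7, A1)
    ns(getResult, receiveLicense,  BId, Now),   % NS(A7, A8)
    ns(getResult, end,             BId, Now).   % NS(A7, A10)

This is exactly the condition that the decision tree learned in Section 4 recovers.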


The hypotheses to be tested by first-order learners are described in terms of language constructs and constraints, together called the language bias L of the learning task. The effectiveness with which a multi-relational learner can be applied to a learning task depends in part on the chosen language bias. When searching for a hypothesis, multi-relational learners refine the current hypothesis using the information of the language bias. Too simple refinements result in new hypotheses that have little or no extra explanatory power. Too complex refinements might result in too large a hypothesis space, making search inefficient. Another requirement for the language bias is that expressions in the chosen language can be transformed into graphical models such as Petri nets. Therefore, we use a simple event operator that, in combination with conjunction (,), disjunction (;) and negation-as-failure (not), provides a reasonably expressive language bias that yields good results in learning non-local classification problems. This operator is called the NS operator and is defined as follows in Prolog:

ns(AT1,AT2,BId,Now) :-
    event(_AId,AT1,BId,completed,_AgentId,_Parameters,Time1),
    Time1 < Now,
    not(eventFromTill(AT2,BId,completed,Time1,Now)).

eventFromTill(AT,BId,ET,From,Till) :-
    event(_AId,AT,BId,ET,_AgentId,_Parameters,Time),
    From < Time,
    Time < Till.

Each transition is characterized as an activity type – event type pair AT-ET. The statement ns(AT1,AT2,BId,Now), abbreviated as NS(AT1, AT2), evaluates to true when for a given process instance BId at time Now an AT1-completed transition has taken place, but has not (yet) been followed by an AT2-completed state transition. For instance, the expression NS(A0, A1) is true when for a given process instance the activity A0 has completed at the time of inspection without the activity A1 being completed. Notice that the NS operator can be extended to other kinds of state transitions that record, for instance, the assignment (assigned), start (started) or skipping (skipped) of activities. Figure 1(b) shows how the preconditions in the Petri net can be represented as conjunctions and disjunctions of NS atoms. The conversion from NS preconditions to Petri nets requires the conditions to refer to local, immediately preceding events as much as possible.
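For instance, against the sample log of Section 2 (and assuming, since part of that log is elided, that no interfering events occur between time stamps 17 and 30), the operator can be queried as follows:

?- ns(applyForLicense, doTheoreticalExam, driver2, 30).
true.
% applyForLicense completed for driver2 at time 17 < 30, and no
% doTheoreticalExam-completed event occurs between times 17 and 30.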

3 Inducing Artificial Negative Events

Without negative information, learning can be much harder. For instance, a two-year-old will have more difficulty learning a precise definition of the concept 'balloon' when shown only a balloon than when presented with both a ball and a balloon and pointed to their difference. Event logs rarely contain such negative information that makes it possible to identify the distinguishing properties that characterize the underlying process model. Because of this lack of negative information, many learning tasks in process mining are in principle limited to the more difficult setting of unsupervised learning, to which classification learners cannot be applied.


To make process mining a supervised learning problem suitable for classification, we propose to include negative information in the event log in the form of negative events. A negative event reports that a state transition could not take place. For each positive activity event type one can think of a negative one. For instance, for the event types created and assigned the event types createRejected and assignRejected can be conceived. Learning the classification rules that predict whether, given the state of a process instance, a particular state transition can occur then boils down to learning the classification rule that predicts when either a positive or a negative event occurs. In this way, we have formulated process mining tasks such as Process Discovery and authorization rule learning as classification problems.

Sometimes, process logs naturally contain negative events. An access log, for instance, contains information about the workers that have obtained authorization, and information about the workers who were refused authorization to perform a particular task. In many cases, however, information systems do not reveal their internal functioning in terms of negative events. For instance, when a WfMS creates a number of work items and assigns them to several work trays, it will not expose the work items it did not create or provide information about the work trays to which it could not allocate a work item.

Negative examples can be introduced by replaying the positive events of each process instance event trace ti and by checking whether a state transition of interest τ could occur. At each event e(i,k) ∈ ti, it is tested for each possible activity state transition of interest τ whether there exists a similar trace tj — one for which ∀l < k: similar(e(i,l), e(j,l)) — in the event log in which at that point a state transition e(j,k) has taken place that is similar to τ, as denoted by a similarity operator similar(e(j,k), τ). If such a state transition does not occur in any similar trace, this is an indication that the state transition should be prevented from occurring. Consequently, a negative event can be added at this point k in the event trace ti. On the other hand, if a similar trace is found in which the state transition τ does occur, this behavior is present in the event log and no negative event is generated. More formally, this process of adding negative examples can be described as follows:

For each process instance ti in the event log
  For each event e(i,k) in ti
    For each activity state transition τ of interest
      if ∄ tj : (∀l < k : similar(e(i,l), e(j,l))) ∧ similar(τ, e(j,k))
      then recordNegativeEvent(ti, k, τ, π)
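The following is a minimal executable sketch of this procedure (our own reading, not code from the paper): similarity is reduced to equality of completed activity types, only completed transitions are considered, the transitions of interest are assumed to be given as activity_type/1 facts, and the caller supplies a bound instance identifier BId:

% A process-instance trace as the time-ordered list of its
% completed activity types, reconstructed from the event/7 facts.
trace(BId, ATs) :-
    findall(TS-AT, event(_, AT, BId, completed, _, _, TS), Pairs0),
    msort(Pairs0, Pairs),                    % order events by time stamp
    findall(AT, member(_-AT, Pairs), ATs).

% instance(BId): enumerates the process instances in the log
% (duplicate solutions are harmless inside the negation below).
instance(BId) :- event(_, _, BId, _, _, _, _).

% negativeEvent(+BId, -K, -AT): a negative AT event can be recorded
% at position K of the given instance BId, because no instance with
% the same length-K prefix of completed activity types performs AT
% as its next completed activity. Since an instance is similar to
% itself, positions where AT actually occurs never yield a negative
% event.
negativeEvent(BId, K, AT) :-
    trace(BId, ATs),
    activity_type(AT),                       % transitions of interest (assumed facts)
    nth0(K, ATs, _),                         % each position K in the trace
    length(Prefix, K),
    append(Prefix, _, ATs),                  % Prefix: first K activity types of ti
    \+ ( instance(Other),
         trace(Other, OtherATs),
         append(Prefix, [AT|_], OtherATs) ). % no similar trace does AT at K

The injection probability π could be added by gating the final goal with a random draw (e.g. random(R), R < Pi in SWI-Prolog).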

To avoid an imbalance in the proportion of negative versus positive events, the addition of negative events can be controlled with a negative event injection probability π. In the above pseudo code it is possible that i = j: evidently, ti is similar to itself. If at point k in ti a state transition of interest τ occurs, this provides enough evidence not to include τ as a negative event in ti.

The procedure for injecting negative events does not supplement the event log with noisy negative events in the presence of noisy positive ones. Noise represents additional, low-frequent behavior that originates from log errors or the occurrence of exceptions outside the scope of the process mining task. Although, compared to noise-free logs, fewer negative events are added because of the additional noisy behavior, our negative event induction technique does not lead to the addition of noisy negative events.

Supplementing event logs with artificial negative events adds the completeness assumption [3] that all possible trajectories in the process model to be learned have corresponding process instances in the event log. Formulated differently, adding artificial negative examples to an event log on which classification is later performed forces a classification learner to conclude that trajectories that do not occur in the original event log should not occur in the induced process model. This is a much desired property, as it is the intention of process mining to induce a process model that only portrays the behavior in the event log. For instance, when mining a control-flow model for the purpose of delta-analysis, the induced process model should preferably cover all the presented process instances, but no more than the presented ones. Without this completeness assumption as inductive bias, it is not possible to learn from positive examples only, as any model in which the preconditions impose no constraints would cover all process instances in the event log.

The completeness assumption also results in requiring a large number of process instances. This is particularly the case when the underlying process model contains a lot of concurrent (parallel) activities. A possible solution is to limit the number of possible activity events in the log, for instance by only considering activity completed events. Another solution is to leave out or regroup a number of concurrent tasks in the event log. In order to capture the behavior of the underlying process, this behavior must be present in the event log. The requirement for sufficient data is similar to the data requirements of any other process mining technique.

In the above procedure, similarity is a relative notion that depends on the learning task at hand. Three factors play a role in determining whether two process instances ti and tj are similar: the contents of the event traces ti, the definition of the event similarity operator similar(e(i,l), e(j,l)), and the depth k. It is important to notice that the event traces ti should only include those events that are relevant to the learning task at hand. For instance, when learning sequence constraints among activities, the events involving scheduling, assignment, and data manipulation are likely to be left out of consideration with regard to the induction of negative events. Likewise, the event similarity operator similar(e(i,l), e(j,l)) is task-dependent. For instance, in the context of Process Discovery it might be acceptable to say that the events


compared to noise-free logs, less negative events are added because of the additional noisy behavior, our negative event induction technique does not lead to the addition of noisy negative events. Supplementing event logs with artificial negative events, adds the completeness assumption [3] that all possible trajectories in the process model to be learned have corresponding process instances in the event log. Formulated differently, adding artificial negative examples to an event log on which later on classification is performed, forces a classification learner to conclude that trajectories that do not occur in the original event log, should not occur in the induced process model. This is a much desired property, as it is the intention of process mining to induce a process model that only portrays the behavior in the event log. For instance, when mining a control-flow model for the purpose of delta-analysis, the induced process model should preferably cover all the presented process instances, but no more than the presented ones. Without this completeness assumption as inductive bias, it is not possible to learn from positive examples only as any model in which the preconditions impose no constraints would cover all process instances in the event log. The completeness assumption also results in requiring a large number of process instances. This is particularly the case when the underlying process model contains a lot of concurrent (parallel) activities. A possible solution is to limit the number of possible activity events in the log, for instance, by only considering activity completed events. Another solution is to to leave out or regroup a number of concurrent tasks in the event log. In order to capture the behavior in the event log, this behavior must be present in the event log. The requirement for sufficient data is similar to the data requirements in any other process mining technique. In the above procedure, similarity is a relative notion that depends on the learning task at hand. Three factors play a role in determining whether two process instances ti and tj are similar: the contents of the event traces ti , the definition of the event similarity operator similar(e(i,l) , e(j,l) ) and the depth k. It is important to notice that the event traces ti should only include those events that are relevant to the learning task at hand. For instance, when learning sequence constraints among activities, the events involving scheduling, assignment and data manipulation are likely to be left out of consideration with regard to the induction of negative events. Likewise, the event similarity operator similar(e(i,l) , e(j,l) ) is task-dependent. For instance, in the context of Process Discovery it might be acceptable to say that the events event(act92,applyForLicense,driver1,completed,driver2,[],17) event(act403,applyForLicense,driver3,completed,driver3,[],17)

are similar. However, in the context of authorization rule learning this might not be the case, as in the first event the activity is completed by a different agent (driver2) than indicated by the business id (driver1) of the case. The depth k is another factor in determining whether two process instances are similar. Arguably, the above procedure, which requires two process instances to have exactly the same trace up to point k, imposes too strong a similarity requirement. In some cases, it might be sufficient to have particular chunks (part of a loop) of the traces in common. This can be solved by a careful preprocessing of the event log or by an adaptation of the procedure. Both adaptations are likely to be problem-dependent.
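As an illustration of this task dependence (our own sketch), an event similarity operator for Process Discovery might compare only the activity type and the event type of two event/7 terms, ignoring identifiers, agents, parameters and time stamps:

% Possible similarity operator for Process Discovery: two events
% are similar iff they agree on activity type (2nd argument) and
% event type (4th argument); everything else is ignored.
similar(event(_, AT, _, ET, _, _, _),
        event(_, AT, _, ET, _, _, _)).

Under this definition the two applyForLicense events shown above are similar; for authorization rule learning the operator would additionally have to compare the executing agent with the business identifier of the case.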

4 Process Discovery as Learning Preconditions

Process Discovery involves the discovery of the process control flow from event logs [3,4] and has been the main focus of process mining. Several algorithms have been proposed, such as the α algorithm [3] and a genetic algorithm [4]. In this section, we make use of Tilde [12], a first-order decision tree learner available in the ACE-ilProlog data mining system [13]. Tilde is a first-order generalization of the well-known C4.5 algorithm for decision tree induction [14]. Like C4.5, Tilde [12,13] obtains classification rules by recursively partitioning the dataset according to logical conditions that can be represented as nodes in a tree. This top-down induction of logical decision trees (Tilde) is driven by refining the node criteria according to the provided language bias L. Unlike C4.5, Tilde is capable of inducing first-order logical decision trees (FOLDTs). A FOLDT is a tree that holds logical formulae containing variables instead of propositions. Below, a decision tree is depicted that represents the learned precondition of the activity receiveLicense. To learn such preconditions, the language bias L of Tilde was restricted to the event operator ns(AT1,AT2,BId,Now) (NS) and the aggregate operator count(AT,BId,Now), which counts the number of occurrences of an activity type within a specific process instance. Tilde's C4.5 gain ratio was used as a heuristic for selecting the best branching criterion. In addition, Tilde's C4.5 post-pruning method was used with a standard confidence level of 0.25.

canStartReceiveLicense(BId,Time,-C)
ns(getResult,applyForLicense,BId,Time) ?
+--yes: ns(getResult,receiveLicense,BId,Time) ?
|       +--yes: ns(getResult,end,BId,Time) ?
|       |       +--yes: [started]
|       |       +--no:  [startRejected]
|       +--no:  [startRejected]
+--no:  [startRejected]
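The tree above only needed the NS operator. The count aggregate is not defined in the paper; a plausible Prolog reading (an assumption on our part, including the four-argument signature) that matches its use in Fig. 1(b) is:

% count(AT, BId, Now, N): N is the number of 'started' events of
% activity type AT recorded for process instance BId before time
% Now (assumed reconstruction; the paper writes count(AT,BId,Now)).
count(AT, BId, Now, N) :-
    findall(TS, (event(_, AT, BId, started, _, _, TS), TS < Now), TSs),
    length(TSs, N).

With this reading, the loop guard count(A1^started) < 3 of Fig. 1(b) becomes count(applyForLicense, BId, Now, N), N < 3.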

The artificial event log was generated with 450 process instances from the process model in Fig. 1(a), with a maximum of three allowed loops. As is common, learning was performed on a training set, whereas the reported performance was measured on a test set (out-of-sample performance), so as to provide an objective measure of the predictive performance on new, unseen examples. The test log is created as follows. The entire event log, consisting of about 7300 activity completed events, is first supplemented with about 7000 negative completeRejected events by applying the above-described procedure with a negative event injection probability π of 100%. After this procedure the first 350 process instances (the first 350 drivers) are removed from the event log to retain a test set of 100 process instances. To correctly evaluate the proposed learning technique, it is important that the negative events in the test set accurately indicate the state transitions that are not present in the event log. For this reason, the negative events in the test log are created with information from the entire event log. Should the negative event injection procedure be applied to the 100 selected process instances only, additional, erroneous negative events could be injected, because some behavior might not be present in the test set. To avoid that the injected negative events in the test set become dependent on the sampling policy used, the negative events in the test set are generated using all information in the event log.

The same procedure cannot be applied to come up with the training log. The training log is composed of the first 350 process instances. This log, consisting of some 5300 completed events, was supplemented with some 4400 negative completeRejected events on the basis of training log events only. To test the performance of first-order activity precondition learning under noise, the training set has been modified with different types of noise. After adding noise, the noisy training sets were supplemented with negative events with a negative event injection probability π of 10%. Alves de Medeiros et al. describe six noise types [4, p. 41]: missing head, missing body, missing tail, swap tasks, remove task, and mix all. For reasons of brevity we report performance results with swap tasks, identified as being the most difficult [4], and mix all, which is a combination of all other noise types. The noise levels used, 10% and 30%, are higher than the 5% and 10% levels reported in [4].

In Process Discovery it is important that the discovered preconditions allow almost every event trace in the log (completeness) but preferably no event traces that do not occur in the log (preciseness) [4]. Rather than using accuracy as a performance measure, we therefore propose two performance measures that are more suitable to the problem domain of Process Discovery:

– true positive rate TP or completeness: the frequency of correctly classified positive events in the test set. This probability can be estimated as TP = E^+_positive / E^total_positive, where E^+_positive is the number of correctly classified positive events and E^total_positive is the total number of positive events.
– true negative rate TN or preciseness: the frequency of correctly classified negative events in the test set. This probability can be estimated as TN = E^-_negative / E^total_negative, where E^-_negative is the number of correctly classified negative events and E^total_negative is the total number of negative events.

Notice that the true negative rate gives an accurate idea of the preciseness of the learned preconditions, as negative events are precisely representatives of traces that are not in the sample log. In Table 1 we report these evaluation measures for each precondition learned under different noise circumstances. Under zero-noise conditions, one can observe perfect completeness and preciseness for each activity precondition in Table 1. However, rather than favoring local preconditions, the decision tree induction algorithm often favors preconditions with immediate discriminating power. For the moment, this non-preference for local conditions complicates the construction of a graphical model from the learned preconditions.


Table 1. Out-of-sample performance of the learned preconditions. Completeness TP and preciseness TN are given in the pattern TP;TN.

Activity Type                     no noise   10%        10%        30%        30%
                                             mix all    swap tasks mix all    swap tasks
A0  start                         1.00;1.00  1.00;1.00  1.00;1.00  1.00;1.00  1.00;1.00
A1  applyForLicense               1.00;1.00  1.00;0.91  1.00;0.91  1.00;0.91  1.00;0.91
A2  attendClassesDriveCars        1.00;1.00  1.00;1.00  1.00;1.00  1.00;1.00  1.00;0.83
A3  attendClassesMotorBikes       1.00;1.00  1.00;1.00  1.00;1.00  1.00;1.00  1.00;0.83
A4  doTheoreticalExam             1.00;1.00  1.00;1.00  1.00;1.00  1.00;0.91  1.00;0.82
A5  doPracticalExamDriveCars      1.00;1.00  1.00;1.00  1.00;1.00  1.00;0.92  1.00;1.00
A6  doPracticalExamMotorBikes     1.00;1.00  1.00;1.00  1.00;1.00  1.00;0.92  1.00;1.00
A7  getResult                     1.00;1.00  1.00;1.00  1.00;0.83  1.00;0.92  1.00;0.83
A8  receiveLicense                1.00;1.00  1.00;1.00  1.00;0.91  1.00;1.00  1.00;0.82
A9  obtainSpecialInsurance        1.00;1.00  1.00;0.91  1.00;1.00  1.00;0.72  1.00;0.69
A10 end                           1.00;1.00  1.00;1.00  1.00;1.00  1.00;1.00  1.00;0.82

Under conditions of noise, it is observed with regard to the completeness criterion that every induced precondition shows perfect recall of the positive events. With respect to the preciseness criterion, it is observed that the preconditions relax, allowing negative events to take place and thus scoring lower. For example, under 30% swap tasks noise, the preciseness of the induced activity precondition for the parallel task obtainSpecialInsurance deteriorates to 0.69, indicating that 31% of the negative events are not classified correctly. The reason is that the extra behavior introduced by the added noise has in part been included in the preconditions. 30% is nonetheless a high noise level, and it can be seen that under the 10% noise level (the highest noise level reported in [4]) the learned preconditions still have an almost perfect recall of the negative events. This robustness to noise can be attributed to the robustness of Tilde's C4.5 tree induction algorithm.

The eleven preconditions could always be learned in under half an hour. In general, first-order classification problems potentially have an extremely large search space. We have, however, limited the hypothesis space by restricting the language bias L to the two aforementioned language constructs. The greedy search strategy of Tilde's C4.5-style top-down induction of decision trees also contributes to this computational efficiency.

5 Related Work

Process mining can be seen as an application of the machine learning of grammars [15,16]. Gold has shown that important classes of recursively enumerable languages cannot be learned from positive examples only [15]. Instead, a complete presentation of both positive and negative examples is required for grammar learning, to distinguish the target grammar from an infinite number of grammars that fit the positive examples. In grammar learning, the hypothesis space is often expressed as production rules, automata or regular expressions. In this paper, however, a different hypothesis space is used. Moreover, the possibility of noise is taken into account.

Several authors have represented process mining as classification learning. For instance, Maruster et al. [10] were among the first to investigate the use of rule induction for Process Discovery. The authors use propositional rule induction techniques on a table of direct metrics for each process task in relation to the other process tasks, which is generated in a pre-processing step. This transformation is needed to deal with the absence of negative examples and to use the uni-relational classification learner RIPPER [17]. In contrast, the multi-relational nature of first-order classification learners makes it possible to perform classification directly on the event log and to deal with non-local dependencies. Rozinat et al. [9] discuss the use of uni-relational classification for the purpose of "decision mining". In decision mining, so-called decision points are semi-automatically identified in process logs, and the classification problem consists of determining which case data properties lead to taking certain paths in the processes. Ferreira and Ferreira apply a combination of ILP learning and partial-order planning techniques to process mining [18]. Rather than generating artificial negative events, negative examples are collected from the users, who indicate whether a proposed execution plan is feasible or not. By iteratively combining planning and learning, a process model is discovered that is represented in terms of the case data preconditions and effects of its activities. In addition to this new process mining technique, the contribution of this work lies in the truly integrated BPM life cycle of process generation, execution, re-planning and learning. Alves de Medeiros et al. [4] point out the difficulties that process mining algorithms have when only taking into account local information. The authors have implemented a genetic algorithm for Process Discovery. No negative examples are introduced; instead this problem is circumvented by incorporating both a completeness and a preciseness measure in the fitness function that drives the genetic algorithm towards suitable models.

6 Conclusion

In this paper we have demonstrated the feasibility of process mining as first-order classification learning on event logs supplemented with artificial negative events. A first set of Process Discovery experiments has shown promising results on a non-trivial learning problem with loop, parallelism and non-local non-free choice constructs. In the experiment without noise, a model was discovered with perfect completeness and preciseness, indicating the suitability of the proposed language bias for Process Discovery. Additional experiments have indicated that the technique is robust to noise. With this paper we certainly do not claim to have solved all the important problems, but we believe we have pointed out the potential of applying first-order learners and negative event induction to process mining in general.


References

1. van der Aalst, W.M.P., van Dongen, B.F., Herbst, J., Maruster, L., Schimm, G., Weijters, A.J.M.M.: Workflow mining: A survey of issues and approaches. Data & Knowledge Engineering 47(2), 237–267 (2003)
2. van der Aalst, W., Reijers, H., Weijters, A., van Dongen, B., de Medeiros, A.A., Song, M., Verbeek, H.: Business process mining: An industrial application. Information Systems 32(5), 713–732 (2007)
3. van der Aalst, W., Weijters, A., Maruster, L.: Workflow mining: Discovering process models from event logs. IEEE Transactions on Knowledge and Data Engineering 16(9), 1128–1142 (2004)
4. de Medeiros, A.A., Weijters, A.J., van der Aalst, W.M.: Genetic process mining: An experimental evaluation. Data Mining and Knowledge Discovery 14(2), 245–304 (2007)
5. Ly, L.T., Rinderle, S., Dadam, P., Reichert, M.: Mining staff assignment rules from event-based data. In: Bussler, C., Haller, A. (eds.) BPM 2005. LNCS, vol. 3812, pp. 177–190. Springer, Heidelberg (2006)
6. van der Aalst, W., Reijers, H., Song, M.: Discovering social networks from event logs. Computer Supported Cooperative Work 14(6), 549–593 (2005)
7. Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., Vanthienen, J.: Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the Operational Research Society 54(6), 627–635 (2003)
8. Martens, D., Baesens, B., Van Gestel, T., Vanthienen, J.: Comprehensible credit scoring models using rule extraction from support vector machines. European Journal of Operational Research 183(3), 1466–1476 (2007)
9. Rozinat, A., van der Aalst, W.M.P.: Decision mining in ProM. In: Dustdar, S., Fiadeiro, J.L., Sheth, A.P. (eds.) BPM 2006. LNCS, vol. 4102, pp. 420–425. Springer, Heidelberg (2006)
10. Maruster, L., Weijters, A.J.M.M., van der Aalst, W.M.P., van den Bosch, A.: A rule-based approach for process discovery: Dealing with noise and imbalance in process logs. Data Mining and Knowledge Discovery 13(1), 67–87 (2006)
11. Tan, P.N., Steinbach, M., Kumar, V.: Introduction to Data Mining. Addison-Wesley, Reading (2005)
12. Blockeel, H., De Raedt, L.: Top-down induction of first-order logical decision trees. Artificial Intelligence 101(1-2), 285–297 (1998)
13. Blockeel, H., Dehaspe, L., Demoen, B., Janssens, G., Ramon, J., Vandecasteele, H.: Improving the efficiency of inductive logic programming through the use of query packs. Journal of Artificial Intelligence Research 16, 135–166 (2002)
14. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA (1993)
15. Gold, E.M.: Language identification in the limit. Information and Control 10(5), 447–474 (1967)
16. Angluin, D.: Inductive inference of formal languages from positive data. Information and Control 45(2), 117–135 (1980)
17. Cohen, W.: Fast effective rule induction. In: Prieditis, A., Russell, S. (eds.) Proceedings of the 12th International Conference on Machine Learning, Tahoe City, CA, pp. 115–123. Morgan Kaufmann, San Francisco (1995)
18. Ferreira, H., Ferreira, D.: An integrated life cycle for workflow management based on learning and planning. International Journal of Cooperative Information Systems 15(4), 485–505 (2006)
