Big Data meets Process Mining: Implementing the Alpha Algorithm with Map-Reduce

Joerg Evermann
Memorial University of Newfoundland
[email protected]

Ghazal Assadipour
Memorial University of Newfoundland
[email protected]

ABSTRACT

Process mining is an approach to extract process models from event logs. Given the distributed nature of modern information systems, event logs are likely to be distributed across different physical machines. Map-Reduce is a scalable approach for efficient computations on distributed data. In this paper we present the design of a Map-Reduce implementation of the Alpha process mining algorithm to take advantage of the scalability of the Map-Reduce approach. We provide experimental results that show the performance and scalability of our implementation.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications—data mining; H.3.4 [Information Storage and Retrieval]: Systems and Software—distributed systems; H.4.1 [Information Systems Applications]: Office Automation—workflow management

Keywords

Map-Reduce, Process Mining, Alpha Algorithm, Workflow Management

1. INTRODUCTION

Modern information systems are increasingly distributed, with replicated instances deployed on multiple physical machines. They produce event logs that capture the actions of their users, for example page requests of web-servers and business-object method calls in ERP systems. Process mining is the analysis of event logs. Specifically, process discovery is the area of process mining that deals with the identification of the processes followed by system users, for example the process of ordering a product on an e-commerce web-site, or the process of scheduling a manufacturing order in an ERP system [3]. Given the distributed nature of event logs, it is natural to look for a distributed way to mine these for processes. In this paper, we describe how the Alpha algorithm [4] can be implemented using Map-Reduce. Such an implementation takes advantage of the natural fit between distributed event logs and distributed computations.

2. THE ALPHA ALGORITHM

Event logs minimally contain information about events referring to a case (process instance), the event or activity type (an instance of a task within a case), and a timestamp. The Alpha process mining algorithm rediscovers one possible workflow net [4] from an event log under the assumption of no noise.

Table 1: An example workflow log

Case | Activity | Time Stamp
  1  |    A     |     1
  2  |    A     |     2
  3  |    A     |     3
  3  |    B     |     4
  1  |    B     |     5
  1  |    C     |     6
  2  |    C     |     7
  4  |    A     |     8
  2  |    B     |     9
  2  |    D     |    10
  5  |    A     |    11
  4  |    C     |    12
  1  |    D     |    13
  3  |    C     |    14
  3  |    D     |    15
  4  |    B     |    16
  5  |    E     |    17
  5  |    D     |    18
  4  |    D     |    19

Definition 1. (Trace) A trace σ = t1 . . . tn is a temporally ordered sequence of events for one case (process instance) in an event log.

Our example log in Table 1 contains five traces, of which only three are unique: σ1 = σ3 = ABCD, σ2 = σ4 = ACBD, σ5 = AED.

The Alpha algorithm first identifies "causal" log relationships between activities [4], based on four types of ordering relations.

Definition 2. (Log-based ordering relations) Let T be a set of tasks and W be an event log over T. Let a, b ∈ T:

• a >W b iff there is a trace σ = t1 t2 t3 . . . tn−1 and some i ∈ {1, . . . , n − 2} such that σ ∈ W and ti = a and ti+1 = b



• a →W b iff a >W b and b ≯W a

• a ∥W b iff a >W b and b >W a

• a #W b iff a ≯W b and b ≯W a


The first relation (>W) is the basic temporal order in the event logs (the sequence of activities), from which the other relations are computed. The second relation (→W) represents a possible causal order between activities. The third relation (∥W) represents potential parallelism. The last relation (#W) represents activities that never follow each other directly. From these log relations, the Alpha algorithm generates a workflow net (PW, TW, FW). Only the computation of the log-based ordering relations occurs on the event logs. The subsequent computations of PW, TW and FW are performed on data sets whose upper bound is a function of the number of distinct activity types in the event logs, rather than the size of the event logs themselves. Thus, it is primarily the computation of the ordering relations that can benefit from the distributed computation model provided by Map-Reduce.
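To make Definition 2 concrete, the following is a minimal, single-machine sketch that computes the ordering relations for the traces of Table 1. It assumes the whole log fits in memory; the string encoding of traces and all names are ours, not part of the paper's implementation.

```python
# A minimal single-machine sketch of Definition 2, assuming the whole log
# fits in memory. Traces are encoded as strings of activity names.
from itertools import product

traces = ["ABCD", "ABCD", "ACBD", "ACBD", "AED"]  # the five traces of Table 1
tasks = sorted({t for sigma in traces for t in sigma})

# a >W b iff a is directly followed by b in some trace
directly_follows = {(s[i], s[i + 1]) for s in traces for i in range(len(s) - 1)}

relations = {}
for a, b in product(tasks, repeat=2):
    ab = (a, b) in directly_follows
    ba = (b, a) in directly_follows
    if ab and not ba:
        relations[(a, b)] = "->"  # causal order (a ->W b)
    elif ab and ba:
        relations[(a, b)] = "||"  # potential parallelism (a ||W b)
    elif not ab and not ba:
        relations[(a, b)] = "#"   # never directly follow each other (a #W b)

print(relations[("A", "B")])  # '->': A causally precedes B
print(relations[("B", "C")])  # '||': B and C appear in both orders
```

This direct computation requires the complete log on one machine, which is exactly the assumption the Map-Reduce implementation below removes.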

3. IMPLEMENTATION

Map-Reduce is a programming approach for large scale data processing in a distributed computing environment. In the first phase, the map() function accepts a series of (InKey, InValue) pairs from an input reader and provides as output a series of (OutKey, OutValue) pairs. In the intermediate phase, this output is presented to shuffle(), which sorts it by key and collects the values for each key, providing output as a series of tuples of the form (OutKey, OutValue1, OutValue2, ...). In the second phase, the reduce() function takes this as input and provides as output a series of (Out2Key, Out2Value) pairs. The important aspect in implementing a Map-Reduce based algorithm is the combination of suitable sequences of map() and reduce() functions with appropriate data types for keys and values. To compute the log-based ordering relations for the Alpha algorithm, we require two sets of mappers and reducers, one following the other.
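As an illustration of this contract, here is a toy, in-memory simulation of the map/shuffle/reduce pipeline, using word counting as the computation. The run_mapreduce helper and all names are our own assumption; nothing here is Hadoop-specific.

```python
# A toy, in-memory illustration of the two Map-Reduce phases described above.
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # first phase: map() turns each input record into (key, value) pairs
    mapped = [kv for rec in records for kv in map_fn(rec)]
    # intermediate phase: shuffle() groups all values by key
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # second phase: reduce() turns each key and its value list into output pairs
    return [out for key, vals in groups.items() for out in reduce_fn(key, vals)]

lines = ["A B C D", "A C B D"]
counts = run_mapreduce(
    lines,
    map_fn=lambda line: [(tok, 1) for tok in line.split()],
    reduce_fn=lambda key, vals: [(key, sum(vals))],
)
print(sorted(counts))  # [('A', 2), ('B', 2), ('C', 2), ('D', 2)]
```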

3.1 Map-Reduce for >W and ≯W

The first Map-Reduce stage computes the traces and the log relation >W. As Definition 2 shows, further computations also require information about pairs of activities that are not in >W, i.e. are in the relation ≯W. The mapper cannot assume complete case information, as the information for a case may not be entirely contained on a single Map-Reduce node. Hence, the absence of an activity pair in >W on one node does not imply that it is in ≯W. This is why we explicitly keep track of activity pairs in ≯W on each node. The input is a log file in the form of Table 1. The TextInputFormat presents a line of the file to map(), which is parsed and emitted as a series of tuples. Formally:

map1 : (Int, Text) → set(CaseID, (Event, TimeStamp))

The set of tuples is input to shuffle(), which collects the set of activities for each case ID, i.e. a trace (Def. 1), as input to reduce(). We assume that a trace is small enough for the reducer to operate on it as a whole. Time stamp information is retained because shuffle() makes no guarantees about the order of the values presented to reduce(). Formally:

shuffle1 : set(CaseID, (Event, TimeStamp)) → (CaseID, set(Event, TimeStamp))

Figure 1: The first Map-Reduce logical data flow

The reduce() function determines the log order of each pair of activities in the trace. By determining, for every two activities, whether they directly follow each other, reduce() computes the first log relation in Def. 2, >W. For example, consider the trace for case one in our example as emitted by shuffle(): (1, ((A, 1)(B, 5)(C, 6)(D, 13))). The trace consists of four activities and hence sixteen (4×4) possible event pairs. We assume that event IDs are ordered, so that (A, B) is different from (B, A). We can then write each pair of event IDs in a canonical form, e.g. lowest first. Thus, we need to consider only 10 distinct event pairs: (A, A), (A, B), (A, C), (A, D), (B, B), (B, C), (B, D), (C, C), (C, D), (D, D). For each pair, we must determine whether it (e.g. (A, B)) is in >W or ≯W, and whether its "inverse" (e.g. (B, A)) is in >W or ≯W. The reason for associating the information about >W and ≯W with a single pair like (A, B), rather than with (A, B) and (B, A) separately, is to allow the reducer in the next stage of our implementation to operate on the combined information, as this event pair will form the key for a subsequent shuffle(). Consider the first event pair (A, B). According to the trace, A is immediately followed by B, while the opposite is not the case, so we write (AB, (F+, NF−)). In our notation, 'F' stands for "follows" and 'NF' for "not follows", expressing the relations >W and ≯W respectively. We use the superscripts + and − to indicate the order of the pair in the relation: the superscript + refers to whether the second event follows (or does not follow) the first one, while − refers to whether the first event follows (or does not follow) the second one (i.e. it signifies the "inverse" of the event pair). Thus, (AB, (F+, NF−)) means that (A, B) is in >W and (B, A) is in ≯W. The reduce() function computes the log relations for each case (trace) and emits them as key-value pairs, where the composite key is the event pair and the value is a log relation, either F+, F−, NF+ or NF− (Fig. 1). Formally:

reduce1 : (CaseID, set(Event, TimeStamp)) → set((Event, Event), (Boolean, Boolean))
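The following single-machine sketch mirrors this first stage. The "CaseID Activity Timestamp" line format follows Table 1 and the F+/F−/NF+/NF− string encoding follows the text above, but the code itself is our illustration, not the paper's implementation.

```python
# A single-machine sketch of the first stage (map1, shuffle1, reduce1).
from collections import defaultdict
from itertools import combinations_with_replacement

lines = ["1 A 1", "2 A 2", "1 B 5", "1 C 6", "2 C 7", "2 B 9", "2 D 10", "1 D 13"]

def map1(line):
    case, event, ts = line.split()
    yield case, (event, int(ts))

# shuffle1: collect the (event, timestamp) pairs of each case into a trace
groups = defaultdict(list)
for line in lines:
    for case, pair in map1(line):
        groups[case].append(pair)

def reduce1(pairs):
    # order events by timestamp, since shuffle() gives no ordering guarantee
    trace = [e for e, _ in sorted(pairs, key=lambda p: p[1])]
    follows = {(trace[i], trace[i + 1]) for i in range(len(trace) - 1)}
    # canonical event pairs, lowest event ID first, including (A, A) etc.
    for a, b in combinations_with_replacement(sorted(set(trace)), 2):
        yield (a, b), ("F+" if (a, b) in follows else "NF+",
                       "F-" if (b, a) in follows else "NF-")

for case in sorted(groups):
    print(case, sorted(reduce1(groups[case])))
# case 1 yields ((A, B), (F+, NF-)), ((A, C), (NF+, NF-)), ... as in the text
```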

3.2 Map-Reduce for ∥W, #W and →W

The multiple reduce nodes in the first Map-Reduce stage leave the information about each pair of event IDs scattered across different nodes. Hence, we use an identity map() function to read this information and a combine() function to merge the log ordering information for each pair of event IDs and to remove redundant information:

map2 : ((Event, Event), (Boolean, Boolean)) → ((Event, Event), (Boolean, Boolean))


combine2 : ((Event, Event), (Boolean, Boolean)) → ((Event, Event), (Boolean, Boolean))

Consider the following information that is passed unchanged by map() to combine():

(AB, F+), (AB, NF−), (AB, NF+), (AB, NF−), (AB, F+), . . .

Removing the duplicate log relation values using combine(), the intermediate shuffle phase collects information for the same key (i.e. event pair) and provides the following input to reduce():

(AB, (F+, NF−, NF+)) . . .

The reduce() function now operates on this list of values for each event-pair key and computes and emits the log relations →W, #W and ∥W according to Definition 2. Formally:

reduce2 : ((Event, Event), set(Boolean, Boolean)) → set((Event, Event), LogRelation)

In the example we have: (AB, →W) . . .

Figure 2: The second Map-Reduce logical data flow

The second data flow is illustrated in Figure 2. The output of reduce() provides the input to the later stages of the Alpha algorithm, which construct the workflow net. Two options can be considered for this stage. First, a single reducer in this phase ensures that this reducer "sees" all event pairs and thus that the output is complete. However, this limits the scalability of the implementation. Second, multiple reducers maintain the scalability of the solution, but may result in outputs that are each incomplete and may contain redundant information. Thus, these outputs need to be merged. Again, two options can be considered. If the reducer outputs are sets of log relations, the small data size (on the order of O(n²), where n is the number of unique events) makes it easy to remove duplicates and merge the sets. A second option is to have the reducers output not the log relations but workflow nets based on these relations. In this case, process model merging consolidates a collection of process variants into a single model [2].
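The second stage can likewise be sketched on a single machine: an identity map2, a combine2 that collapses duplicate flags, and a reduce2 that derives →, ∥ and # per Definition 2. The input below mimics stage-1 output for the (A, B) and (B, C) pairs of Table 1; the code is our illustration under those assumptions.

```python
# A single-machine sketch of the second stage (map2, combine2, reduce2).
from collections import defaultdict

stage1_output = [
    (("A", "B"), ("F+", "NF-")),   # case 1: A directly followed by B
    (("A", "B"), ("NF+", "NF-")),  # case 2: A and B never adjacent
    (("A", "B"), ("F+", "NF-")),   # case 3: same information as case 1
    (("B", "C"), ("F+", "NF-")),   # case 1: B directly followed by C
    (("B", "C"), ("NF+", "F-")),   # case 2: C directly followed by B
]

# map2 is the identity; combine2/shuffle collect the distinct flags per pair
groups = defaultdict(set)
for pair, flags in stage1_output:
    groups[pair].update(flags)     # duplicates collapse in the set

def reduce2(pair, flags):
    fwd = "F+" in flags            # second event directly follows the first
    bwd = "F-" in flags            # first event directly follows the second
    if fwd and bwd:
        return pair, "||"          # potential parallelism
    if fwd:
        return pair, "->"          # causal order
    if bwd:
        return pair, "<-"          # causal order, inverse direction
    return pair, "#"               # never directly follow each other

for pair in sorted(groups):
    print(reduce2(pair, groups[pair]))
# (('A', 'B'), '->')   (('B', 'C'), '||')
```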

4. PERFORMANCE

We used the PLG process log generator (http://www.processmining.it/sw/plg) to create a process model containing 47 tasks. We then used PLG to create an event log with 10,000 randomly created traces from this model. We replicated this event log 500 times. The total log file size was approximately 80GB, a size where the use of the Map-Reduce approach becomes viable. We processed these logs using the Amazon Elastic MapReduce (EMR) service, provisioned with 10 medium instances (2 threads, 3.75GB) as Hadoop task nodes. Input, intermediate results and outputs were stored on Amazon S3. By default, EMR uses Snappy compression of the map output and configures 2 map task slots and 1 reduce task slot per node. The results and the number of processed records at each stage are shown in Table 2. The total job execution time was 3 hours and 6 minutes. While the job execution time can be reduced by using a larger number of parallel map and reduce tasks, each reduce task in stage 2 produces a workflow net, which must be merged with the others. In practice, a trade-off must be found between the number of final reducers and the number of required merges.

Table 2: Performance results

Phase     | CPU time (h:m:s.ms) | # output records | # tasks | # slots
map 1     | 2:15:26.395         | 184,464,000      | 509     | 40
reduce 1  | 6:32:9.181          | 4,919,394,000    | 18      | 10
map 2     | 40:10:1.562         | 4,919,394,000    | 1125    | 40
combine 2 | —                   | 52,141,510       | —       | —
reduce 2  | 19:17:10.039        | 16,864           | 25      | 10

5. RELATED WORK

Reguieg et al. [1] also applied Map-Reduce to process discovery, but their aim is to discover "event correlations" in systems where events are not explicitly associated with cases. The actual process discovery from correlated event logs is outside the scope of their approach. Their approach is complementary to ours: it can perform event correlation if required, followed by our own approach to mine workflow models.

6. CONCLUSION

Map-Reduce offers a scalable model for distributed computation across multiple cluster nodes and is a natural fit to the distributed nature of modern information systems and their event logs. This is the first work that applies the Map-Reduce approach to the problem of process discovery. Our future work will investigate how the Map-Reduce framework can be applied to other mining algorithms, such as the heuristic miner, and how it can be usefully combined with other techniques, such as log partitioning or trace clustering.

References

[1] H. Reguieg, F. Toumani, H. R. Motahari-Nezhad, and B. Benatallah. Using MapReduce to scale events correlation discovery for business processes mining. In Business Process Management, pages 279–284. Springer, 2012.

[2] M. La Rosa, M. Dumas, R. Uba, and R. Dijkman. Business process model merging: An approach to business process consolidation. ACM Trans. Softw. Eng. Methodol., 22(2):11:1–11:42, 2013.

[3] W. van der Aalst. Process mining: Overview and opportunities. ACM Trans. Manage. Inf. Syst., 3(2):7:1–7:17, 2012.

[4] W. van der Aalst, T. Weijters, and L. Maruster. Workflow mining: Discovering process models from event logs. IEEE Transactions on Knowledge and Data Engineering, 16(9):1128–1142, 2004.
