2010 Second International Conference on Communication Software and Networks
AN EFFECTIVE ALGORITHM FOR BUSINESS PROCESS MINING BASED ON MODIFIED FP-TREE ALGORITHM
Gun-Woo Kim
Seung Hoon Lee
Dept. of Computer Science and Engineering Hanyang University 1271, Sa 1-dong, Sangnok-gu, Ansan, South Korea e-mail:
[email protected]
Dept. of Computer Science and Engineering Hanyang University 1271, Sa 1-dong, Sangnok-gu, Ansan, South Korea e-mail:
[email protected]
Jae Hyung Kim
Jin Hyun Son
DBMS R&D Department Altibase Corporation 182-13, Guro-dong, Guro-gu, Seoul, South Korea e-mail:
[email protected]
Dept. of Computer Science and Engineering Hanyang University 1271, Sa 1-dong, Sangnok-gu, Ansan, South Korea e-mail:
[email protected]
Abstract: A number of business organizations are beginning to realize the importance of business process management. However, processes often deviate from the way they were initially designed, or an inefficient process model may be designed. To solve this problem, business process mining, which can be used as the basis for business process re-engineering, has been recognized as an important concept. However, current research in the domain of process mining has focused only on extracting a workflow-based process model from completed process logs. Thus, there is a limitation in expressing various types of business processes, and moreover, process discovery and log scanning take a considerable amount of time. In this paper, we present a modified FP-tree algorithm for business process mining, based on the FP-tree that is used for association analysis in data mining. Our modified algorithm supports the discovery of a process model at the level the user needs, without rescanning all the process logs when they are updated.
Keywords: Process Mining, Data Mining, Business Process
I. INTRODUCTION
Recent advances in the field of information technology have yielded effective processes that can be employed in other industries to improve productivity and to reduce production costs [1]. In particular, in industries where the competition between companies has intensified and the need for innovative business processes has increased, a number of business organizations have begun to realize the importance of Business Process Management (BPM) [1, 3]. However, because of a communication gap between business analysts and system developers, it is difficult to design business process models that are complete and consistent from the very beginning, even with the workflow systems used to support and design them. To resolve this problem, process mining, which forms the basis for Business Process Reengineering (BPR), is used to improve the productivity of companies [2]. Process mining involves the extraction of information from event logs recorded by an information system and the creation of a process model that will be used as the basis of BPR [1, 2, 3].
However, current research in the domain of process mining has focused only on extracting workflow-based process model information from event logs in a workflow-based system. In a workflow-based system, a Petri net, a graphical tool, is used to describe an asynchronous system. However, some gateways cannot be represented in a Petri net (particularly OR gateways, which represent the “Multiple Choice” and “Synchronizing Merge” patterns). Thus, process mining has limitations in expressing various types of business processes. Another disadvantage is that process discovery and log scanning are very time consuming because the process logs have to be rescanned. In this paper, we present a modified Frequent Pattern Tree (FP-tree) algorithm for business process mining, based on the FP-tree that is used for association analysis in data mining. Our modified algorithm selects the appropriate level of the process model as per the user’s requirement and also supports OR gateways without rescanning all the updated process logs.
978-0-7695-3961-4/10 $26.00 © 2010 IEEE DOI 10.1109/ICCSN.2010.77
II. RELATED WORK
A. Process Mining
Process mining is closely related to Business Process Intelligence (BPI) and data/workflow mining [1, 3]. Unlike classical data mining techniques, process mining focuses on the discovery of process models from event logs, which are recorded by an information system [1]. Process mining can be categorized into three techniques [1]. The first technique is “Process Discovery”. This technique is used to extract information from the existing process to find a new process. The second technique, “Conformance Checking”, uses the process model that is discovered from the information obtained from the event log. It tests how well this model fits the intent of the original process in order to evaluate its suitability. The third technique is known as “Extension”. If the discovered process has low suitability, the extension technique is used to increase its suitability by extending the discovered process model using additional information from the event log. Among the three process mining techniques, the most important is “Process Discovery”, because the other techniques are based on the process model discovered using this technique. Therefore, current research in the domain of process mining is mostly focused on “Process Discovery”.
1. If there is no direct causal relation, the bit value is set as 0 in the direct causal matrix. The start/end node always exists in each case [5,6].
B. FP-tree (Frequent Pattern Tree)
An FP-tree is used for association analysis in data mining to search for frequent patterns [4]. The FP-tree has an extended prefix-tree structure and stores crucial, quantitative information about frequent patterns [4]. Only frequent length-1 items have nodes in the tree, and the tree nodes are arranged in such a way that more frequently occurring nodes have a better chance of sharing nodes than less frequently occurring ones. The FP-tree is a compressed representation of the input data. It is constructed by reading the data set one transaction at a time and mapping each transaction onto a path in the FP-tree. Since different transactions can have several items in common, their paths may overlap. The more the paths overlap with one another, the more compression we can achieve using the FP-tree structure. If the FP-tree is small enough to fit into the main memory, we can extract frequent item-sets directly from the structure in memory instead of making repeated passes over the data stored on the disk. The size of an FP-tree is typically smaller than the size of the uncompressed data because many transactions in market basket data share items in common. In the best-case scenario, where all transactions have the same set of items, the FP-tree contains only a single branch of nodes. The size of an FP-tree also depends on the ordering of the items. In this paper, we propose a modified FP-tree based on the FP-tree. Our modified FP-tree has the same advantages and construction method as the FP-tree.
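The construction of a classic FP-tree described above can be illustrated with a minimal Python sketch. It follows the general two-pass scheme of [4] (count item frequencies, then insert each transaction as a path ordered by descending frequency); the names FPNode, build_fp_tree, and count_nodes, and the min_support parameter, are illustrative assumptions and not taken from this paper.

```python
from collections import Counter


class FPNode:
    """One node of a classic FP-tree: an item, a count, and child links."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> FPNode


def build_fp_tree(transactions, min_support=1):
    # First pass: count in how many transactions each item occurs.
    freq = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, c in freq.items() if c >= min_support}

    root = FPNode(None)
    # Second pass: map each transaction onto a path, most frequent item
    # first, so that transactions with common items share a prefix.
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(c) for c in node.children.values())
```

When every transaction contains the same items, count_nodes returns the number of distinct items, which corresponds to the single-branch best case mentioned above.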
Figure 1. Direct Causal Matrix
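As an illustration of what a direct causal matrix holds, the following Python sketch builds one from a list of cases: the entry for (a, b) is 1 if task b directly follows task a in at least one case and 0 otherwise. The function name direct_causal_matrix and the list-of-lists representation are assumptions made for this sketch; the paper does not specify how the matrix is stored.

```python
def direct_causal_matrix(cases):
    """Build a 0/1 matrix of direct causal relations from event-log cases.

    Each case is a list of task names that includes the start and end nodes.
    Returns the task order used for the rows/columns and the matrix itself.
    """
    tasks = []
    for case in cases:
        for task in case:
            if task not in tasks:
                tasks.append(task)  # keep first-seen order
    index = {task: i for i, task in enumerate(tasks)}

    matrix = [[0] * len(tasks) for _ in tasks]
    for case in cases:
        # Every pair of consecutive tasks is a direct causal relation.
        for a, b in zip(case, case[1:]):
            matrix[index[a]][index[b]] = 1
    return tasks, matrix
```

For a hypothetical case start, A, B, C, D, end, the entries for (A, B), (B, C), and (C, D) are 1, while the entry for (A, D) stays 0.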
III. DESIGN AND CONSTRUCTION OF A MODIFIED FP-TREE FOR EXTRACTING THE BUSINESS PROCESS MODEL
In this paper, we present a modified FP-tree algorithm. Business process mining relates to a different aspect of data mining, and thus, we cannot directly apply the FP-tree data structure to it. This section describes the essential prerequisites for constructing a modified FP-tree and gives the definition of our modified FP-tree.
A. Essential prerequisites for constructing a modified FP-tree
There are some essential prerequisites that need to be fulfilled by our algorithm. These prerequisites are as follows.
• All cases of the event log must have the start/end node.
• The created process model must not have a duplicated task that has the same task name.
• All cases of the event log should be a part of a well-structured process.
If the above prerequisites are fulfilled, our modified algorithm can be applied to business process mining.
B. Design and construction of a modified FP-tree
To design our modified FP-tree, we use the same design and construction strategy as that used for an FP-tree employed for association analysis in data mining. However, there are three differences between our modified FP-tree and the FP-tree data structure. First, our modified FP-tree has a start/end node. Second, the input sequences used for constructing a modified FP-tree are non-ordered sequences. Finally, the internal node links of a modified FP-tree form a doubly linked list.
C. Direct Causal Matrix
In order to implement our algorithm, we need to represent the relation between each task listed in the log information. Each relation corresponds to a possible process branch in the FP-tree. A direct causal matrix was created to serve several purposes in our modified FP-tree algorithm. First, it was used to evaluate the combinations of gateways in the process model. Next, it was used as a basis for implementing the merge algorithm. Figure 1 shows the relation of each task in the cases considered by our modified FP-tree. This direct causal matrix defines the causal relations between the tasks and the start/end node. For example, let us consider case 1 of Figure 1. There is a direct causal relation between A and B. Further, B and C, as well as C and D, have a direct causal relation. These relations are represented with a bit value of
Figure 2. A modified FP-tree
Figure 2 shows the complete structure and construction order of a modified FP-tree. This figure is created using the information from Figure 1.
    // Create the tree on first use.
    if (start == null)
    {
        create Tree();
        start = c;
        c->next_nodelink = C;
        node_count++;
    }
    // Walk the case; reuse a matching child or create a new node.
    for (int node_depth = 0; node_depth < LogLength; node_depth++) {
        if (N.task_name == c.task_name)
        {
            N.node_count++;
            updateTree();
        }
        else
        {
            temp = create_node();
            temp.node_count = 1;
            N = temp.pre_nodelink;
            updateTree();
        }
    }
    insert([c|C], T);
}
Definition 1 (modified FP-tree). A modified FP-tree is a tree structure defined as follows.
1) It consists of two root nodes labeled “start” and “end”, branches of nodes, and a node header table.
2) Each node has several parameters: node_count, task_name, nodelink, node_depth, and node_type. task_name stores details pertaining to the task that the node represents, node_count stores the number of executions represented by the portion of the path reaching this node, nodelink links the node to the next node in the modified FP-tree having the same task_name, node_depth represents the current depth of the node in the modified FP-tree, and node_type represents the type of the node, such as the AND/OR/XOR-split/join gateway type.
3) Each entry in the header table consists of two fields, task_name and head of nodelink; the latter field points to the first node in the modified FP-tree that stores the task_name.
Figure 4. Algorithm for inserting data into the modified FP-tree
1) Extract each case in the event log.
2) Create the initial modified FP-tree, which has a “start/end” node. For each case, the function insert([c|C], T) is executed as follows: “c” is the first element and “C” is the remaining list of the case, and “T” is the initial modified FP-tree. If “T” has a child “N” such that N.task_name = c.task_name, then increment N’s node_count by 1; otherwise, create a new node “N”, initialize its node_count to 1, link its parent link to “T”, and link its nodelink to the nodes with the same task_name using the nodelink structure. If “C” is nonempty, call the insert function recursively for the remaining list “C”.
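The insertion steps above can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the recursion is written as a loop, each case is given as the task names between the start and end nodes, an "end" node is appended to every path, and the class name ModifiedFPTree, the _tail helper table, and the nodes_named method are hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(eq=False)
class Node:
    task_name: str
    node_depth: int = 0
    node_count: int = 0
    parent: Optional["Node"] = None
    children: Dict[str, "Node"] = field(default_factory=dict)
    pre_nodelink: Optional["Node"] = None   # previous node with the same task_name
    next_nodelink: Optional["Node"] = None  # next node with the same task_name


class ModifiedFPTree:
    def __init__(self):
        self.start = Node("start")
        self.header = {}  # task_name -> first node with that name
        self._tail = {}   # task_name -> last node with that name (helper)

    def insert(self, case):
        """Insert one case, given as the task names between start and end."""
        self.start.node_count += 1
        node = self.start
        for task in list(case) + ["end"]:
            child = node.children.get(task)
            if child is None:
                # No matching child: create a node and add it to the nodelinks.
                child = Node(task, node.node_depth + 1, parent=node)
                node.children[task] = child
                self._link(child)
            child.node_count += 1
            node = child

    def _link(self, node):
        # Append the node to the doubly linked list of same-named nodes.
        tail = self._tail.get(node.task_name)
        if tail is None:
            self.header[node.task_name] = node
        else:
            tail.next_nodelink = node
            node.pre_nodelink = tail
        self._tail[node.task_name] = node

    def nodes_named(self, task_name):
        """Yield all nodes with this task_name by following the nodelinks."""
        node = self.header.get(task_name)
        while node is not None:
            yield node
            node = node.next_nodelink
```

Inserting the hypothetical cases A, B, D and A, C, D produces two separate D nodes joined by a nodelink, which is the kind of duplication that the removal algorithm in Section IV addresses.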
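struct node
{
    int node_count;        // number of executions passing through this node
    int node_depth;        // depth of the node in the modified FP-tree
    keytype task_name;     // task that the node represents
    index pre_nodelink;    // previous node with the same task_name
    index next_nodelink;   // next node with the same task_name
    index[] parentnodes;   // links to the parent nodes
    index[] childnodes;    // links to the child nodes
}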
IV. PROCESS DISCOVERY USING A MODIFIED FP-TREE
This section provides information about the process discovery algorithm based on the modified FP-tree. We first explain the purpose of our modified algorithm and the reason for applying the algorithm. We then describe our strategy to discover the process model.
Figure 3. Data structure of a modified FP-tree
A. Purpose of the modified FP-tree algorithm
“The created process model must not have a duplicated task that has the same task name.” This essential prerequisite, stated in Section III, must be fulfilled to obtain correct results for process discovery. However, when we inserted actual event log data into a modified FP-tree, many duplicate tasks were represented as nodes in the modified FP-tree. The duplicated tasks were present because the process model had split/merge gateways. If all of the duplicate tasks are removed and a gateway type is assigned to the split nodes in the modified FP-tree, process discovery ends with correct results. Therefore, the purpose of our modified FP-tree algorithm is to remove the duplicate tasks and assign a gateway type to the split nodes. The algorithms for removing duplicated tasks and for assigning a gateway type are explained below:
Figure 3 shows the data structure of our modified FP-tree and its pseudocode algorithm.
C. Algorithm for data insertion in the modified FP-tree
To complete the construction of the modified FP-tree shown in Figure 2, we use an algorithm to insert actual event log data into the modified FP-tree. The algorithm is as follows.
Algorithm 1 (Insertion of data into the modified FP-tree)
Input: Each case of the event log.
Output: An initial modified FP-tree that has only the inserted event log data.
Method: The modified FP-tree is constructed using the following steps.
insert([c|C], T)
{
    Node start, end;
    PatternSet c = Null; C = Null;
    N = childnodes;
    Tree T;
B. Algorithm for removal of duplicated tasks from the modified FP-tree
As mentioned in Section III, our modified FP-tree algorithm has several essential prerequisites. To satisfy the second prerequisite, which forbids duplicated tasks, we designed the following algorithm:
node_depth will be the sourceNode. This is because the correct location of a node is the one nearest to the start node in the modified FP-tree.
3) If more duplicate tasks are found, then the function merge(sourceNode, targetNode, T) is performed recursively.
Algorithm 2 (Removal of duplicate tasks)
Input: Initial modified FP-tree that has only the data inserted from the event log.
Output: Intermediate modified FP-tree from which duplicated tasks have been removed.
Method: All nodes that have the same task_name are removed using the following steps:
Figure 6 shows the removal of duplicate tasks from our modified FP-tree using algorithm 2.
checkingMerge()
{
    // Walk the nodelinks of a header-table entry.
    for (nodeCount = 0; nodeCount < n; nodeCount++)
    {
        if ((n-1)th.task_name == the head of nodelinks.task_name)
        {
            exitMerge();
        }
        else
        {
            merge(sourceNode, targetNode, T);
        }
    }
}
Figure 6. Removing duplicated tasks from a modified FP-tree
Using Algorithm 2, we obtain an intermediate modified FP-tree. Generally, the intermediate modified FP-tree is small enough to be kept in the main memory. We update this tree when additional log information is input. Finally, we assign a gateway type to the split nodes to create the final modified FP-tree.
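The idea behind the removal of duplicate tasks can be sketched in Python as follows. This is a simplified reading of Algorithm 2 under stated assumptions, not the authors' implementation: nodes keep explicit parent and child lists instead of nodelinks, the direct causal matrix is represented as a set of (task, task) pairs, and the names link, unlink, and remove_duplicates are hypothetical. For each task name, the node nearest to the start node is kept as the target, and every deeper node is folded into it as a source.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # identity-based equality keeps linked nodes comparable
class Node:
    task_name: str
    node_depth: int
    node_count: int = 1
    parents: List["Node"] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def link(parent, child):
    if child not in parent.children:
        parent.children.append(child)
        child.parents.append(parent)


def unlink(parent, child):
    parent.children.remove(child)
    child.parents.remove(parent)


def merge(source, target, causal):
    """Fold `source` into `target`; keep only links the causal relation allows."""
    target.node_count += source.node_count
    for parent in list(source.parents):
        unlink(parent, source)
        if (parent.task_name, target.task_name) in causal:
            link(parent, target)
    for child in list(source.children):
        unlink(source, child)
        if (target.task_name, child.task_name) in causal:
            link(target, child)


def remove_duplicates(nodes, causal):
    """Merge all nodes sharing a task_name; return the surviving nodes."""
    groups = {}
    for node in nodes:
        groups.setdefault(node.task_name, []).append(node)
    kept = []
    for same_name in groups.values():
        same_name.sort(key=lambda n: n.node_depth)  # nearest to start first
        target = same_name[0]
        for source in same_name[1:]:
            merge(source, target, causal)
        kept.append(target)
    return kept
```

After merging, a task that appeared on two branches becomes a single node with two parents, that is, a join node whose node_count is the sum of the merged counts.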
merge(sourceNode, targetNode, T)
{
    Node A, B, C;
    A = targetNode.pre_nodelink;
    B = targetNode.next_nodelink;
    C = temp_nodelink;
    // Copy link information only if the direct causal matrix allows it.
    if (check_CausalMatrix(A, B))
    {
        copy_info(A, B, C);
    }
}
C. Algorithm for assigning a gateway type to the split nodes in a modified FP-tree
To construct a complete modified FP-tree, the algorithm used to assign a gateway type to the split nodes of the modified FP-tree is as follows:
Figure 5. Algorithm for removal of duplicated tasks from the modified FP-tree
Algorithm 3 (Assign a gateway type to the split nodes)
Input: Intermediate modified FP-tree.
Output: Final modified FP-tree with a gateway type assigned to all split nodes.
Method: The split/join nodes are assigned a gateway type.
1) Search for duplicate tasks using the header table in the modified FP-tree. For this, checkingMerge() performs a sequential search from the first task to the final task in the header table. If the n-th node, reached by traversing the nodelinks from the head of the nodelinks, has an (n-1)-th node such that (n-1)-th.task_name = the head of nodelinks.task_name, then there are no two nodes with the same task_name; otherwise, the function merge(sourceNode, targetNode, T) is executed as follows:
2) The sourceNode is removed from the modified FP-tree using the function merge(sourceNode, targetNode, T). The targetNode, which has the same data as the sourceNode, receives the information of the sourceNode in merge(sourceNode, targetNode, T). The node_count of the sourceNode is added to the node_count of the targetNode. The links to the parent node “A” and the target node “B” of the nodelinks in the sourceNode are copied to the nodelinks of the child node “C” of the sourceNode. However, the direct causal matrix is used as the standard while copying the nodelinks and links: if there is a direct causal relation between node “A” and “C” or between node “B” and “C”, then the details are copied. Note that if there are many nodes that have the same task_name, the node which has the largest
assignGatewayType(node, T)
{
    Node A, B, C;
    int n1, n2;
    A = node;
    B = node.childnodes[0];
    C = node.childnodes[1];
    ……..
    n1 = B.node_count % A.node_count;
    n2 = C.node_count % A.node_count;

    if (n1 + n2 == A.node_count)
    {
        A.node_type = XOR;
    }
    else if (((B.node_count + C.node_count) / A.childCount()) == A.node_count)
    {
        A.node_type = AND;
    }
    else if (((B.node_count + C.node_count) / A.childCount()) != A.node_count)
    {
        A.node_type = OR;
    }
}
Figure 7. Gateway type granting algorithm in modified FP-tree
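The decision rule of the gateway type granting algorithm can be written as a small Python function. This is a sketch under assumptions: the function name assign_gateway_type is hypothetical, the rule is generalized from two children to any number of children, and the AND test is written as a multiplication, which is equivalent to the division in the pseudocode for exact arithmetic but avoids integer rounding.

```python
def assign_gateway_type(parent_count, child_counts):
    """Return "XOR", "AND", or "OR" for a split node.

    parent_count is the node_count of the split node (must be positive);
    child_counts holds the node_count of each of its child nodes.
    """
    remainders = [count % parent_count for count in child_counts]
    # XOR: every execution takes exactly one branch, so the counts add up.
    if sum(remainders) == parent_count:
        return "XOR"
    # AND: every execution takes all branches, so the average child count
    # equals the parent count.
    if sum(child_counts) == parent_count * len(child_counts):
        return "AND"
    # OR: some executions take more than one branch, but not all of them do.
    return "OR"
```

For a hypothetical split node executed 10 times, child counts of 6 and 4 give XOR, 10 and 10 give AND, and 7 and 6 give OR.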
1) Search for split nodes. Nodes that have two or more child nodes are called split nodes. Nodes that have two or more parent nodes are called join nodes. The depth-first search technique is used.
2) When a split node is found in the modified FP-tree, the function assignGatewayType(node, T) is performed. The node_count is the basis for assigning the gateway type. If a parent node “A” has two child nodes “B” and “C”, the gateway type assigned to node “A” is decided as follows. Let n1 and n2 be defined as

n1 = (B’s node-count) % (A’s node-count)
n2 = (C’s node-count) % (A’s node-count)

Using n1 and n2, the gateway type is assigned using the following table.

TABLE I. CRITERIA FOR DECIDING GATEWAY TYPES
Sources for Decision | Gateway Type
(n1 + n2) = A’s node-count | XOR
(B’s node-count + C’s node-count) / (number of A’s child nodes) = A’s node-count | AND
(B’s node-count + C’s node-count) / (number of A’s child nodes) != A’s node-count | OR

3) The join node is paired with the split node. If there is a structure that contains a split-join node within a split-join node, the function assignGatewayType(node, T) is performed recursively.

Using the above algorithm, we created the final modified FP-tree. The final modified FP-tree is a complete process model, and the execution of the algorithm ends.

V. MAPPING MODIFIED FP-TREE TO BUSINESS PROCESS MODELING NOTATION (BPMN) PROCESS MODEL
The result of the discovery algorithm can be mapped to a general modeling notation. In this paper, we use the Business Process Modeling Notation (BPMN) for business process modeling using a standard representation [7].
Figure 8. Mapping modified FP-tree to BPMN-based model
Figure 8 shows the mapping between the modified FP-tree and the BPMN-based process model. The previously described node characteristics of the modified FP-tree are used to determine the notation to be used by the BPMN-based process model. However, the BPMN notation used as per the BPMN specification is insufficient to represent the business process from the modified FP-tree. Thus, there is scope for future work.

VI. CONCLUSION AND FUTURE WORK
In this paper, we propose an adaptive and compact data structure and an algorithm for process discovery based on this structure. This structure is called a modified FP-tree and is based on the existing FP-tree structure. However, if the event log contains loop patterns, the results of process discovery are not good. Moreover, if the original event log model has many parallel nodes, the modified FP-tree will have n x n node branches. In future work, we shall focus on extracting loop patterns in the business process model. This is of great importance in business processes because loop patterns appear frequently in these processes [8]. Currently, we are finalizing the extraction of activity loop patterns in which only one task appears repeatedly. We use a multi-level mapping table to resolve this problem.

ACKNOWLEDGMENT
This work was supported by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korea government (MOST) (No. R01-2007-000-20135-0). This work was supported by the IT R&D program of MKE/IITA [2008-F038-01, Development of Context Adaptive Cognition Technology].

REFERENCES
[1] W.M.P. van der Aalst and A.J.M.M. Weijters, “Process Mining: A Research Agenda”, Computers in Industry, vol. 53(3), pp. 231-244, 2004.
[2] Xie Yi-wu, Li Xiao-wan, and Chen Yan, “The Research on the usage of Business Process Mining in the implementation of BPR”, Proceedings of the 2007 IFIP International Conference on Network and Parallel Computing Workshops, pp. 995-1000, 2007.
[3] Wil M.P. van der Aalst, “Trends In Business Process Analysis: From Verification to Process Mining”, Proceedings of the 9th International Conference on Enterprise Information Systems (ICEIS 2007), Madeira, Portugal, pp. 12-22, 2007.
[4] Jiawei Han, Jian Pei, and Yiwen Yin, “Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach”, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD’00), Dallas, TX, pp. 1-12, 2000.
[5] W.M.P. van der Aalst, “Genetic Process Mining”, 26th International Conference on Applications and Theory of Petri Nets (ICATPN 2005), G. Ciardo and P. Darondeau, LNCS 3536, pp. 48-69, 2005.
[6] W.M.P. van der Aalst, “Finding Structure in Unstructured Processes: The Case for Process Mining”, Proceedings of the 7th International Conference on Applications of Concurrency to System Design (ACSD 2007), pp. 3-12, Bratislava, Slovak Republic, 2007. IEEE Computer Society Press, Los Alamitos, California.
[7] Object Management Group/Business Process Management Initiative, 2008, BPMN 1.1: OMG Specification.
[8] Ana Karla Alves de Medeiros, “Process Mining Based on Clustering: A Quest for Precision”, BPM 2007 Workshops, LNCS 4928, pp. 17–29, 2008.