This is a pre-print version of the conference paper. Please cite this paper as: Zakarija, I., Škopljanac-Mačina, F., Blašković, B., Discovering Process Model from Incomplete Log using Process Mining, Proceedings of ELMAR-2015 57th International Symposium, 2015, pp. 117-120
Discovering Process Model from Incomplete Log using Process Mining
Ivona Zakarija1, Frano Škopljanac-Mačina2, Bruno Blašković2
1 University of Dubrovnik, Ćira Carića 4, Dubrovnik, Croatia
2 Faculty of Electrical Engineering and Computing, University of Zagreb, Unska 3, Zagreb, Croatia
[email protected]
Abstract - This paper gives an overview of relevant research in the area of process mining. Process mining techniques extract knowledge from event logs. The major objective of process mining is to discover, monitor and improve real processes. Process mining aims to exploit event data in a meaningful way to identify and anticipate problems, and to recommend countermeasures. Additionally, process mining places the existing massive volumes of data in the context of processes. Since extracting data is an integral part of any process mining procedure, data preparation (pre-processing) requires considerable effort. Examples are given to indicate how the chosen process mining technique deals with incompleteness in the event log data. Experiments were made on real data collected from an information system for accommodation services.
Keywords - Process Mining; Big Data; Event Log; Incomplete Data
I. INTRODUCTION
Process mining is a novel discipline developed in the last decade; its comprehensive set of tools and techniques enables process analysis from event logs. The basic idea is to discover processes by mining event logs for knowledge in order to improve and enhance business processes. In an event log each event represents one activity (i.e. a well-defined step in the process) and belongs to a certain case (a process instance). In other words, a case (trace, process variant) is a specific activity sequence. Process mining builds on approaches from data mining and from process modeling and analysis, and it places the present vast amount of data in a process context [1]. The starting point for any process mining analysis is the set of events recorded in the log. This does not imply that events are stored in dedicated log files. They may be stored in different data sources such as database tables, message logs, transaction logs, server logs, and social media data [2].

There are vast quantities of data available nowadays. Such data are usually poorly structured, which makes them incomplete and effectively unavailable for analysis. The term Big data is used for massive volumes of structured and unstructured data, so large and so complex that it is practically impossible to process them with traditional data management tools and applications. The majority of Big data technologies rely on the open-source project Apache Hadoop [3]. Big data are characterized by four dimensions called the “4V” [4] [5] [6]. It is practically impossible to present a complete picture of the very diverse Big data field. Nevertheless, we recommend some further reading on important new research work in [7-20].

The second section gives an overview of relevant research. Examples of the application of process mining are presented in the third section. Finally, the paper is concluded in the fourth section.
II. RELATED WORK
In this section we give an overview of previous research in the area of process mining. The research papers described here cover algorithms, approaches and tools used in the area. An overview of more complex issues representing research challenges is presented in [21]. The challenging issues include: hidden tasks, duplicate tasks, non-free-choice constructs, loops, timing information, mining various perspectives, noise in the event log, incompleteness in log data, gathering data from heterogeneous sources, visualization of results, and delta analysis. A comparison of different approaches to process mining, techniques and tools, as well as an overview of the outstanding issues, is presented in [22]. That paper compares the approaches from nine aspects: structure, time, non-free choice, basic loops, arbitrary loops, basic parallelism, hidden tasks, duplicate tasks and noise.

The authors in [23] propose a directed graph technique which can represent causal and cyclic relationships among the activities from event logs. In [24][25] the α algorithm, based on Petri nets, is presented. It is explained how the α algorithm can analyze any workflow represented by a structured workflow net (SWF-net). In [26] limitations of process mining algorithms are given. To gain better insight into these limitations, the authors analyzed a representative algorithm, the α algorithm. They established under which conditions and for which process constructs the algorithm works properly. A classification of the process constructs that are difficult to handle for this type of algorithm is also presented. The paper [27] proposes a new approach for automatic reconstruction of a stochastic workflow model from event logs. The authors recommend utilizing this method for social network mining. The process mining technique proposed in [28] can cope with noise and can be used to validate workflow processes by detecting and measuring discrepancies between prescriptive models and actual process execution.
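As a minimal illustration of the kind of ordering information such discovery techniques start from, the sketch below derives the directly-follows pairs of a small log and classifies each activity pair with the log-based relations commonly used to introduce the α algorithm (causality, parallelism, no relation). The toy log and the function names are hypothetical and are not taken from [23]-[25]; the sketch covers only this first step, not the construction of a Petri net.

```python
from itertools import product


def directly_follows(traces):
    """Return the set of pairs (a, b) where b immediately follows a in some trace."""
    pairs = set()
    for trace in traces:
        pairs.update(zip(trace, trace[1:]))
    return pairs


def footprint(traces):
    """Classify every ordered activity pair by its log-based ordering relation."""
    follows = directly_follows(traces)
    activities = sorted({a for trace in traces for a in trace})
    relation = {}
    for a, b in product(activities, repeat=2):
        forward = (a, b) in follows
        backward = (b, a) in follows
        if forward and not backward:
            relation[(a, b)] = "->"  # a is followed by b, never the reverse
        elif backward and not forward:
            relation[(a, b)] = "<-"  # b is followed by a, never the reverse
        elif forward and backward:
            relation[(a, b)] = "||"  # both orders observed
        else:
            relation[(a, b)] = "#"   # never directly adjacent
    return relation


# Hypothetical toy log: B and C occur in either order between A and D.
toy_log = [
    ("A", "B", "C", "D"),
    ("A", "C", "B", "D"),
]
fp = footprint(toy_log)
```

In this toy log, `fp[("A", "B")]` is a causal relation while `fp[("B", "C")]` is classified as parallel, because both orders of B and C were observed. If the second trace were missing from the log, B and C would instead appear causally related, which is one way incompleteness in log data can distort a discovered model.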
The usage of process mining for the maintenance of web pages and the improvement of their functionality is one of the significant research areas, as presented in [29] and [30]. Paper [31] presents the LTL (Linear Temporal Logic) Checker, a language and tool that enable verification of properties on the basis of logs. The proposed approach is based on the standard XML format and temporal logic. Also, the research in [32] focuses on analyzing students' performance data, which could be used for generating personalized learning processes. The work reported in [33] presents algorithms for discovering process models based on streaming event data. The simplest way to adapt a process mining algorithm for stream mining is to collect events during specific observation periods and then apply the batch version of the algorithm to the current log.
III. EXPERIMENTS AND RESULTS
Extracting event data suitable for process mining still takes substantial effort. Event data may be incomplete, since logs contain only sample behavior. On the other hand, the data may be too detailed [2]. To explain the problem in more detail, we describe an application of process mining to an example process, guest stay monitoring in a hotel. The example was carried out using the Disco tool [34]. The experiments were made on real log data obtained from the Property Management System (PMS), an information system for accommodation services.
A. Guest Stay Monitoring Process
The process of guest stay monitoring begins with booking a room (New reservation) for a certain period. After the guest's arrival there is the Check In, from which data on board services and the period of stay at the hotel are extracted. During the stay, along with the board services, the guest may use any extra services offered by the hotel, such as telephone, room service, mini bar, etc. The costs for extra services are either recorded automatically on the guest's bill by connected systems, such as the telephone switchboard or the fiscal cash register, or they are recorded at the reception desk during the guest's stay. Extra service costs can also be recorded during the issuance of the bill (Bill Payment). With the Bill Payment the guest checks out and the process is completed. The process is shown in Fig. 1.
Figure 1. Guest stay monitoring process
B. Data Preparation
As stated before, we used actual log data obtained from the PMS. The PMS is an information system for the management of accommodation facilities such as hotels, resorts, condominiums and other types of facilities that provide services organized around accommodation. This information system is integrated with a variety of external systems within the telecommunication network. The usual offer of a contemporary hotel includes telephone and internet traffic, intelligent rooms, the One Card System, etc. The log contains instances of the guest stay monitoring process, which represent a part of the business activities of the hotel reception desk. Data streams captured from the telephone exchange have been incorporated into the collected log data. Events are extracted from a variety of database tables. There are three logs at different abstraction levels (less or more detailed): log1, log2, and log3. Overview information about the logs is shown in Fig. 2, Fig. 3, and Fig. 4.
Figure 2. Global statistics - overview information about log1
Figure 3. Global statistics - overview information about log2
Figure 4. Global statistics - overview information about log3
Let us consider the collected log data. Log1 is incomplete because the event Check Out is missing. Log2 contains the Check Out event, but log3 is much more detailed. Event data are loaded from a log exported as a CSV file, and then the attributes are mapped to identify Case ID, Activity, Resource and Timestamp. Additional preprocessing of the data is not required because the Disco tool supports the CSV format, so the file can be imported directly. Furthermore, the timestamp does not need to have a specific format; it is detected automatically. The Resource attribute contains 12 users who perform tasks; some of them are employees at the reception desk, while others represent the connected devices. All cases are considered completed, so filtering of the log data was not necessary.
C. Analysis and Discussion
The Disco tool is used for mining the control flow. Since Disco is based on the framework of the Fuzzy Miner, the Fuzzy Miner algorithm [35] has been applied to all three logs; the analysis results are shown in Fig. 5. After the analysis, we discussed the results with the information system developers, who implemented this PMS at various hotels. In principle, the guest cannot check out until all costs recorded on the guest's bill are paid. Furthermore, Bill Payment should not be performed after the Check Out event. Fig. 6, Fig. 7, and Fig. 8 show how many different traces (variants) there are in the underlying process.
There are 28 traces in log1; 76.89% of all cases (437 in total) fall into the first three variants. The most frequent traces are:
• Variant1 (New reservation, Check In)
• Variant2 (New reservation, Check In, Bill Payment)
• Variant3 (New reservation, Check In, Extra service added, Bill Payment)
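The attribute mapping and the variant statistics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how the Disco tool works internally: the CSV column names (`case_id`, `activity`, `resource`, `timestamp`), the ISO-style timestamp format and the sample rows are all hypothetical and are not taken from the PMS logs.

```python
import csv
import io
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical CSV export; column names and rows are invented for illustration.
SAMPLE_CSV = """case_id,activity,resource,timestamp
1,New reservation,user1,2015-01-02 10:00:00
1,Check In,user1,2015-01-05 14:00:00
2,New reservation,user2,2015-01-03 09:00:00
2,Check In,user1,2015-01-06 13:00:00
2,Bill Payment,user1,2015-01-08 09:30:00
3,Check In,user2,2015-01-07 15:00:00
3,New reservation,user2,2015-01-04 11:00:00
"""


def read_traces(csv_text):
    """Group events by case and order each case by timestamp."""
    events_by_case = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        when = datetime.fromisoformat(row["timestamp"])
        events_by_case[row["case_id"]].append((when, row["activity"]))
    return {
        case: tuple(activity for _, activity in sorted(events))
        for case, events in events_by_case.items()
    }


def variant_coverage(traces, top_k):
    """Return variants ranked by frequency and the percentage of cases in the top_k variants."""
    ranked = Counter(traces.values()).most_common()
    covered = sum(count for _, count in ranked[:top_k])
    return ranked, 100.0 * covered / len(traces)


traces = read_traces(SAMPLE_CSV)
ranked, share = variant_coverage(traces, top_k=1)
```

Here each distinct activity sequence is one variant, and `share` is the percentage of cases that fall into the `top_k` most frequent variants, which is the kind of figure quoted for the first three variants of log1. Sorting by timestamp inside each case matters because rows in an export are not guaranteed to be in execution order, as case 3 in the sample shows.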
There are 30 traces in log2; 78.97% of all cases (437 in total) fall into the first three variants. The most frequent traces are:
• Variant1 (New reservation, Check In, Check Out)
• Variant2 (New reservation, Check In, Bill Payment, Check Out)
• Variant3 (New reservation, Check In, Extra service added, Bill Payment, Check Out)
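Two observations about these traces can be checked mechanically. The sketch below is a hedged illustration rather than part of the Disco analysis; the function names are hypothetical, and only the activity names and the listed variants come from the text. It first removes Check Out from the three most frequent log2 variants and compares the result with the three most frequent log1 variants, and then tests the rule stated earlier that Bill Payment should not be performed after Check Out.

```python
# Most frequent variants as listed in the text for log1 and log2.
LOG1_TOP = [
    ("New reservation", "Check In"),
    ("New reservation", "Check In", "Bill Payment"),
    ("New reservation", "Check In", "Extra service added", "Bill Payment"),
]
LOG2_TOP = [
    ("New reservation", "Check In", "Check Out"),
    ("New reservation", "Check In", "Bill Payment", "Check Out"),
    ("New reservation", "Check In", "Extra service added", "Bill Payment", "Check Out"),
]


def drop_activity(trace, activity):
    """Project a trace onto all activities except the given one."""
    return tuple(a for a in trace if a != activity)


def pays_after_check_out(trace):
    """Return True if a Bill Payment occurs after a Check Out in the trace."""
    checked_out = False
    for activity in trace:
        if activity == "Check Out":
            checked_out = True
        elif activity == "Bill Payment" and checked_out:
            return True
    return False


# Removing Check Out from the log2 variants yields the log1 variants.
projected = [drop_activity(trace, "Check Out") for trace in LOG2_TOP]
same_as_log1 = projected == LOG1_TOP

# A trace of the kind reported for log3, with payment after check-out.
late_payment = ("New reservation", "Check In", "Check Out", "Bill Payment")
flagged = pays_after_check_out(late_payment)
```

For the listed variants the projection matches, which is consistent with log1 being the same behavior recorded without the Check Out event. None of the log2 variants above is flagged by the rule, while the trace with Bill Payment after Check Out is; whether such a trace is an error or legitimate behavior is a question for domain experts, not something the check can decide.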
There are 29 traces in log3; 89.93% of all cases (437 in total) fall into the first seven variants. The most frequent traces are:
• Variant1 (New reservation, Check In, Bill Payment, Check Out)
• Variant2 (New reservation, Check In, Extra service added, Bill Payment, Check Out)
• Variant3 (New reservation, Check In, Bill Payment, Bill Payment, Check Out)
• Variant4 (New reservation, Check In, Check Out, Bill Payment)
• Variant5 (New reservation, Check In, Extra service added, Extra service added, Bill Payment, Check Out)
• Variant6 (New reservation, Check In, Extra service added, Bill Payment, Bill Payment, Check Out)
• Variant7 (New reservation, Check In, Bill Payment, Extra service added, Bill Payment, Check Out)
Figure 5. Analysis results for log1, log2, and log3
Figure 6. Log1 case variants
Figure 7. Log2 case variants
Figure 8. Log3 case variants
In log3, traces Variant1 and Variant3 are almost the same; they differ only in the repeated event Bill Payment. The same applies to the traces Variant2, Variant5 and Variant6; Variant5 additionally contains the repeated event Extra service added. The log contains data for travel agency guests, whose costs are invoiced afterwards; accordingly, the most frequent trace, Variant1, in all three logs is in line with the previously described process. For these guests it is indicated that the payment will be made through the travel agency. This can be seen in log3, since those events are collected separately. Traces Variant2 and Variant3 in log1 and log2 are aligned with the prescribed process model. Furthermore, traces of the form (New reservation, Check In, Extra service added, Bill Payment, Check Out) in log3 with repeated events Extra service added and Bill Payment are feasible: payment may be divided into multiple bills, and the guest can have several extra services. During the guest's stay there may be changes of board services, the costs may be paid by another guest, or, in the case of business trips, the invoice will be issued to the company; therefore it should be indicated who covers the costs. Accordingly, Variant4 (in log3) and other traces that end with the event Bill Payment after Check Out are valid. Such a trace could mean that the guest has paid the costs for extra services and checked out, whereas the board services will be paid by the company.

The resulting process models obtained by the analysis indicate that the initial predefined model should be modified in order to be more realistic. Analysis of log1 revealed that the collected data do not contain entries for the guest Check Out, which ends the guest's stay at the hotel. Analysis of log2 shows that the results would be more precise if the data were more detailed, indicating travel agency guests, who pays the costs (e.g. a company or some other guest), the payment method, etc. Log3, however, contains such detailed event data. Accordingly, the model obtained by applying the Fuzzy Miner algorithm to the event data in log3 corresponds best with the actual execution of events and is in line with reality, although for further analysis the data could be even more detailed. If we save logs for further use, back-up facilities will be overcrowded with data. However, using a Big data approach we will have only the relevant data.
IV. CONCLUSION
The quality of a process mining result depends heavily on the quality of the data in the event logs, and therefore the quality of such event logs is important. Unfortunately, in practice event logs are often merely a byproduct, although they contain hidden knowledge about the real performance of the process. Logs are filled with data, but are analyzed only in cases of incidents, for example when attempting to determine whether an error occurred due to a faulty information system or a specific program module, or whether the human factor is to blame. We have shown, on the example of the guest stay monitoring process in a hotel, how the chosen process mining technique deals with incompleteness in the event log data. Experiments were made on real data collected from an information system for accommodation services. By comparing the process models obtained from the experiments and inspecting the traces in three logs at different abstraction levels, we have shown that the initial predefined model should be modified in order to be better aligned with reality.
REFERENCES
[1] W. M. Van der Aalst, Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer Berlin Heidelberg, 2011.
[2] W. Van Der Aalst, A. Adriansyah, A. K. A. de Medeiros, F. Arcieri, T. Baier, T. Blickle, J. C. Bose, P. van den Brand, R. Brandtjen, J. Buijs et al., “Process mining manifesto,” in Business process management workshops. Springer, 2012, pp. 169–194.
[3] “Apache hadoop,” http://hadoop.apache.org/, April 2015.
[4] S. Sagiroglu and D. Sinanc, “Big data: A review,” in Collaboration Technologies and Systems (CTS), 2013 International Conference on, 2013, pp. 42–47.
[5] “Gartner says solving ’big data’ challenge involves more than just managing volumes of data,” http://www.gartner.com/newsroom/id/1731916, April 2013.
[6] http://www.ibmbigdatahub.com/infographic/four-vs-big-data, April 2013.
[7] L. Banić, A. Mihanović, and M. Brakus, “Using big data and sentiment analysis in product evaluation,” in Proceedings of the Business Intelligence Systems, MIPRO 2013, May 2013, pp. 1444–1449.
[8] T. Kraska, “Finding the needle in the big data systems haystack,” Internet Computing, IEEE, vol. 17, no. 1, pp. 84–86, 2013.
[9] K. Michael and K. Miller, “Big data: New opportunities and new challenges,” Computer, vol. 46, no. 6, pp. 22–24, 2013.
[10] J. Pitt, A. Bourazeri, A. Nowak, M. Roszczynska-Kurasinska, A. Rychwalska, I. Santiago, M. Sanchez, M. Florea, and M. Sanduleac, “Transforming big data into collective awareness,” Computer, vol. 46, no. 6, pp. 40–45, 2013.
[11] M. Wigan and R. Clarke, “Big data’s big unintended consequences,” Computer, vol. 46, no. 6, pp. 46–53, 2013.
[12] E. Ewing, S. Gad, and N. Ramakrishnan, “Gaining insights into epidemics by mining historical newspapers,” Computer, vol. 46, no. 6, pp. 68–72, 2013.
[13] A. Cron, H. L. Nguyen, and A. Parameswaran, “Big data,” XRDS, vol. 19, no. 1, pp. 7–8, Sep. 2012.
[14] J. Nelson, “Sketching and streaming algorithms for processing massive data,” XRDS, vol. 19, no. 1, pp. 14–19, Sep. 2012.
[15] A. Machanavajjhala and J. P. Reiter, “Big privacy: protecting confidentiality in big data,” XRDS, vol. 19, no. 1, pp. 20–23, Sep. 2012.
[16] R. Rubinfeld, “Taming big probability distributions,” XRDS, vol. 19, no. 1, pp. 24–28, Sep. 2012.
[17] V. R. Borkar, M. J. Carey, and C. Li, “Big data platforms: What’s next?” XRDS, vol. 19, no. 1, pp. 44–49, Sep. 2012.
[18] J. Heer and S. Kandel, “Interactive analysis of big data,” XRDS, vol. 19, no. 1, pp. 50–54, Sep. 2012.
[19] V. Dhar, “Data science and prediction,” Commun. ACM, vol. 56, no. 12, pp. 64–73, Dec. 2013.
[20] G.-H. Kim, S. Trimi, and J.-H. Chung, “Big-data applications in the government sector,” Commun. ACM, vol. 57, no. 3, pp. 78–85, Mar. 2014.
[21] W. van der Aalst, “Discovering coordination patterns using process mining,” in Workshop on Petri Nets and Coordination, Bologna, Italy, 2004, pp. 49–64.
[22] W. van der Aalst, B. van Dongen, J. Herbst, L. Maruster, G. Schimm, and A. Weijters, “Workflow mining: A survey of issues and approaches,” Data & Knowledge Engineering, vol. 47, no. 2, pp. 237–267, 2003.
[23] R. Agrawal, D. Gunopulos, and F. Leymann, “Mining process models from workflow logs,” in Advances in Database Technology EDBT’98, ser. Lecture Notes in Computer Science, H.-J. Schek, G. Alonso, F. Saltor, and I. Ramos, Eds. Springer Berlin Heidelberg, 1998, vol. 1377, pp. 467–483.
[24] W. van der Aalst, A. Weijter, and L. Maruster, “Workflow mining: Discovering process models from event logs,” IEEE Transactions on Knowledge and Data Engineering, vol. 16, p. 2004, 2003.
[25] W. Van der Aalst, T. Weijters, and L. Maruster, “Workflow mining: Discovering process models from event logs,” Knowledge and Data Engineering, IEEE Transactions on, vol. 16, no. 9, pp. 1128–1142, 2004.
[26] A. de Medeiros, W. van der Aalst, and A. Weijters, “Workflow mining: Current status and future directions,” in On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE, volume 2888 of Lecture Notes in Computer Science. Springer-Verlag, 2003, pp. 389–406.
[27] H. Hu, J. Xie, and H. Hu, “A novel approach for mining stochastic process model from workflow logs,” Journal of Computational Information Systems, vol. 7, no. 9, pp. 3113–3126, 2011.
[28] A. Weijters and W. Van der Aalst, “Process mining: discovering workflow models from event-based data,” in Proceedings of the 13th Belgium-Netherlands Conference on Artificial Intelligence (BNAIC 2001), 2001, pp. 283–290.
[29] I. Mihai, “Web mining in e-commerce.” Annals of the University of Oradea, Economic Science Series, vol. 18, no. 4, 2009.
[30] N. Poggi, V. Muthusamy, D. Carrera, and R. Khalaf, “Business process mining from e-commerce web logs,” in Business Process Management, ser. Lecture Notes in Computer Science, F. Daniel, J. Wang, and B. Weber, Eds. Springer Berlin Heidelberg, 2013, vol. 8094, pp. 65–80.
[31] W. Aalst, H. Beer, and B. Dongen, “Process mining and verification of properties: An approach based on temporal logic,” in On the Move to Meaningful Internet Systems 2005: CoopIS, DOA, and ODBASE, ser. Lecture Notes in Computer Science, R. Meersman and Z. Tari, Eds. Springer Berlin Heidelberg, 2005, vol. 3760, pp. 130–147.
[32] F. Skopljanac-Macina, B. Blaskovic, and Z. Skocir, “Using formal concept analysis for student assessment,” in ELMAR (ELMAR), 2014 56th International Symposium, Sept 2014, pp. 285–288.
[33] A. Burattin, “Applicability of process mining techniques in business environments,” Ph.D. dissertation, University of Bologna, 2013.
[34] “Disco,” http://fluxicon.com/disco/, April 2015.
[35] C. W. Günther and W. M. Van Der Aalst, “Fuzzy mining–adaptive process simplification based on multi-perspective metrics,” in Business Process Management. Springer, 2007, pp. 328–343.