The VLDB Journal DOI 10.1007/s00778-010-0203-9
REGULAR PAPER
Event correlation for process discovery from web service interaction logs Hamid Reza Motahari-Nezhad · Regis Saint-Paul · Fabio Casati · Boualem Benatallah
Received: 19 July 2009 / Revised: 12 July 2010 / Accepted: 30 August 2010 © Springer-Verlag 2010
Abstract Understanding, analyzing, and ultimately improving business processes is a goal of enterprises today. These tasks are challenging as business processes in modern enterprises are implemented over several applications and Web services, and the information about process execution is scattered across several data sources. Understanding modern business processes entails identifying the correlation between events in data sources in the context of business processes (event correlation is the process of finding relationships between events that belong to the same process execution instance). In this paper, we investigate the problem of event correlation for business processes that are realized through the interactions of a set of Web services. We identify various ways in which process-related events could be correlated, and we investigate the problem of discovering event correlation (semi-)automatically from service interaction logs. We introduce the concept of process view to represent the process resulting from a certain way of event correlation, and that of process space, referring to the set of possible process views over process events. Event correlation is a challenging problem as there are various ways in which process events could be correlated, and in many cases, it is subjective. Exploring all the possibilities of correlation is computationally expensive, and only some of the correlated event sets result in process views that are interesting. We propose efficient algorithms and heuristics to identify correlated event sets that potentially lead to interesting process views. To account for its subjectivity, we have designed the event correlation discovery process to be interactive, enabling users to guide it toward process views of their interest, and we organize the discovered process views into a process map that allows users to effectively navigate through the process space and identify the views of interest. We report on experiments performed on both synthetic and real-world datasets that show the viability and efficiency of the approach.

Keywords Business processes · Event correlation · Process views · Process spaces

H. R. Motahari-Nezhad (B)
HP Labs, Palo Alto, CA, USA
e-mail: [email protected]

H. R. Motahari-Nezhad · B. Benatallah
University of New South Wales, Sydney, NSW, Australia
e-mail: [email protected]

B. Benatallah
e-mail: [email protected]

R. Saint-Paul
CREATE-NET, Trento, Italy
e-mail: [email protected]

F. Casati
University of Trento, Trento, Italy
e-mail: [email protected]
1 Introduction

Business processes (BPs) are central to the operation of any organization [10,44]. A business process is a set of coordinated activities for achieving a business objective. The quality and efficiency of the services that organizations provide to customers, citizens, and employees, and the competitiveness of organizations, hinge on continuous business process improvement. In the nineties, the focus of process improvement was on automation. Workflow management systems (WfMS) and other middleware technologies were used to reduce cost and improve efficiency by providing better system integration and automated enactment of operational business processes [16]. Recently, the process improvement focus has shifted to process analysis, where the goal is
to understand how a business process is performed and to identify opportunities for improvement. However, since wide-scale automation has led to business processes being implemented over many systems, answering questions such as: What is the status of the hiring of John Smith? What is the average hiring delay due to visa procedures for EU workers? How many people are typically involved in a hiring process? Which business units have the highest hiring costs per person?, which was once easier (when all processes were implemented using one or more WfMSs) [17], becomes difficult at best. The main barrier to answering questions like those above is that the information about process execution is scattered across several systems and data sources, and in many cases, there is no well-documented information on how these pieces of information relate to each other and to the overall business process of the enterprise. In this context, a key problem is how to identify such relationships, i.e., how to correlate information elements (and in particular events related to process executions) in data sources to understand which information elements belong to the same execution (instance) of a process (e.g., detecting that a data entry in SAP and a message sent over an enterprise service bus both relate to purchase order no. 325). Note that here we do not focus on the heterogeneity of the data sources containing the process execution information. Rather, we assume that the process-related data is integrated by relying on existing data integration approaches [38]. We focus on the problem of event correlation in the context of service-based processes (those business processes realized through the interaction of a set of Web services). Event correlation is a challenging problem for several reasons. First, there are various ways in which the correlation of process events could be defined.
Indeed, the decision on how to correlate events is made independently and differently from one application and one domain to another. For instance, in a supply chain scenario, all events of a process instance for ordering goods may have an attribute called OrderID, and therefore events are correlated based on the value of this field. However, there is no general rule or standard, and not all messages related to the same order in different IT systems have an attribute called OrderID: events related to the same order may first be identified by a quotation number, then by the actual order number, and finally by the invoice number. In addition, more than one attribute may be needed at the same time (e.g., both OrderID and CustomerID). Therefore, the number of attributes of messages to be used as correlators, and the number of possible combinations of such attributes across messages, are potentially high. Furthermore, event correlation is subjective: depending on the person interested in the correlation, the same set of events may be seen as related to different process instances and process views. For example, events related to the shipping of some goods may be related to a given shipping process
instance from the perspective of a warehouse manager; however, if the goods are the results of different purchase orders, they belong to multiple purchase order instances from the perspective of the ordering manager. As a consequence, it is hard to define what a "good" or "optimal" correlation is and to devise an automated algorithm that discovers such "good" correlations. Considering the subjectivity of the correlation, another challenge is how to find interesting correlations for a user (e.g., a business process analyst), or how to guide her through the search space of ways of event correlation to identify the ones that are interesting for her. In this paper, we introduce abstractions, algorithms, and a tool for the semi-automated correlation of process events. We make the following contributions:

– We characterize the problem of event correlation in service-based processes and introduce the notion of correlation condition, which defines which sets of events in the service log belong to the same instance of a process. A correlation condition thereby partitions (a subset of) the events in the log into a set of process instances.

– We introduce the notion of process view, referring to the process resulting from a specific way of correlating events in service logs based on a given correlation condition. A process view is represented by the process model of the resulting process instances. Given a set of process instances, there exist automated process discovery approaches (see [45] for a survey) to infer a process model that generates those instances; we use existing work (our previous work [31]) to discover the process model corresponding to a set of process instances. In addition, we define the notion of process space to model the possible process views over the events in the data source.
The notions of process view and process space are introduced to help users understand the results of event correlation at the process level.

– We present algorithms for the semi-automated discovery of interesting process views from process event logs. The proposed approach starts by discovering simple conditions (defined on a pair of attributes), followed by discovering composite conditions (comprising two or more conditions), adopting a levelwise approach [28]. We define heuristic-based criteria and objective metrics on the resulting process instances (e.g., the size or the number of process instances) to find potentially interesting process views for users. The heuristics and metrics help in pruning the search space of possible correlation conditions, which is crucial to reduce an otherwise large search space. To account for the subjectiveness of identifying interesting process views, we design the discovery of event correlations as an
interactive process where user inputs are taken into account in two ways: (i) heuristic-based measures provide configurable parameters that can be set before starting each discovery step, and (ii) user feedback is sought after each step of the discovery process and is taken into account to further prune the space and direct the search toward process views that are interesting for the user.

– To further account for the subjectiveness of event correlation, as well as to enable the user to better understand a potentially large space of automatically discovered process views, we define a conceptual model for organizing sets of candidate views in a process map. A process map allows users to explore the process space and navigate through the set of discovered process views to identify the ones that fit their needs or to refine the results.

– We have implemented the proposed approach in a tool called Process Spaceship [30]. We have conducted experiments that show the viability and efficiency of the proposed approach on both synthetic (a supply chain scenario) and real-world (a real-time game service) datasets.

The rest of the paper is structured as follows: Sect. 2 gives a motivating example, the definitions, and the description of the event correlation and process views discovery problem. Section 3 presents an overview of the Process Spaceship system. Section 4 describes the back-end components of Process Spaceship, including heuristics and a set of algorithms for event correlation. Section 5 explains the front-end component of Process Spaceship, consisting of the process map and a visual, interactive environment for the exploration and refinement of discovered process views. In Sect. 6, we present the implementation of Process Spaceship and experiments. We discuss related work in Sect. 7 and conclude and outline future work in Sect. 8.
2 Concepts and problem definition

2.1 Motivating example

Modern business processes are rarely supported by a single, centralized workflow engine. Instead, the process is realized using a number of autonomous systems and Web services. As an example, consider the purchase order scenario depicted in Fig. 1. An order is first received by the company through a B2B hub. This hub can be a monitoring software infrastructure, an EDI receptor, or an e-commerce application such as Ariba.1 Its task is to log the reception of the new order event and verify its conformance. Once verified, the order is routed (through, e.g., a message broker) to the workflow management system that initiates the approval process.
www.ariba.com.
Fig. 1 Business processes in modern enterprises
The approval requires human interaction with the workflow system but is also inevitably characterized by email and document exchanges among people as part of the decision process. This may require interactions with other systems such as the ERP or the CRM. Once the order has been approved, it is sent to the invoicing and payment systems. During the process, documents (e.g., the purchase order and approval documents) may also be stored in a document management system to facilitate their collaborative editing. In the above example, the information related to the purchasing process is scattered across several independent systems and data sources. In this scenario, understanding and tracking process execution is challenging: often there is no information on how data in one system is related (correlated) to another and to the overall business process of the enterprise. For instance, the accounting system may keep track of invoicing and related payments through the invoice number, while the supply workflow may use an internal process instance numbering. Furthermore, there is a push in the enterprise for looking at processes from the perspective of users and thus at various levels of abstraction. Therefore, the key challenges in modern enterprises include finding the correlation of process-related events in data sources in order to understand them in the context of business processes, as well as enabling users to look at process definitions from the perspective of various users and at multiple levels of abstraction.

2.2 Definitions

We present the following layers for understanding process-related data and abstractions in modern enterprises: the data sources layer, process instances layer, process model layer, and finally the user interface/visualization layer, as illustrated in Fig. 2.

Data sources layer. This layer represents data sources and systems that capture and maintain information related to process executions.
At this layer, the main concept is that of process information item, which is the process-relevant unit of information stored by the source and that can be for
example a row in a database, an event in a log, a Word document (purchase order) in a document repository, a SOAP message exchanged between two services, an email, and so on. In this paper, we consider service events, which are related to the exchange of messages between a set of Web services realizing a business process, where such exchange is monitored and captured in an event log. An event identifies the exchange (arrival or send) of a message, and therefore, we use the terms "message" and "event" interchangeably. We call a log containing events related to message exchanges in the context of process execution a process event log. We assume that the content of messages is available in the log and define a process event log as follows:

Definition 1 (Process event log) A process event log is a set of messages L = {m1, m2, ..., mn}, where each message mi is represented by a tuple in A1 × A2 × ... × Ak. The attributes A1, ..., Ak represent the union of all the attributes contained in all messages; each single message typically contains only a subset of these attributes and will therefore have many of its attributes undefined in its representation in L. We denote by mx.Ai the value of attribute Ai in message mx. We assume that each message mx has an attribute, denoted mx.τ, that records the timestamp at which the event (related to the exchange of mx) occurred. Messages in L are ordered by their timestamp.

Process instances layer. A process instance refers to one execution of a particular process from the beginning to the end; in other words, it refers to the execution of a set of process tasks [46]. For example, one instance of the purchasing process consists of filing a specific purchase order, receiving the corresponding invoice, making the payment, and shipping the goods. A process instance can be represented by a partially ordered set of observable events corresponding to the execution of process tasks. We define a process instance formally as follows:

Definition 2 (Process instance) A process instance p is a sequence of messages p = m1, m2, ..., mn corresponding to a subset of messages of the log L, partially ordered by their timestamp. For example, the process instance corresponding to the above example would be represented as the sequence p = sendOrder, getInvoice, makePayment, ship.

Fig. 2 Process-related layers in modern enterprises

Process model layer. A process model is the abstract representation of a process in the form of a graph showing all valid orders of execution of process tasks. Process models can be expressed in languages such as BPMN, Petri nets, and state machines. In this paper, we assume a finite state machine representation of process models for interactions between services, relying on the definition given in [7,31]:
Definition 3 (Process model) A business process P is a deterministic state machine represented as a tuple P = (S, s0, F, M, T), where S is the set of states of the process, M is the set of messages supported by the services (qualified by their name), T ⊆ S × S × M is the set of transitions, s0 is the initial state, and F represents the finite set of final states. A transition from state s to state s′ triggered by message m is denoted by the triplet (s, s′, m).

The sequence of messages generated by traversing the state machine from s0 to a final state corresponds to a specific process instance. A process model allows the generation of all valid (acceptable) process instances of a process implemented by a service or a set of services. Therefore, given a set of process instances observed from the interaction of services, a process model can be devised to represent them (e.g., using process mining approaches [45]). We can thus also define a process model P for a set of process instances PI = {p1, p2, ..., pn} as a finite state machine in which every transition is labeled with a message m ∈ L, and every message sequence p = m1, m2, ..., mn in PI is accepted by P. It should be noted that this definition of P may allow the generation of process instances that are not in PI. This is acceptable in process mining applications, as not all valid process instances may be present in the log [31,45]. Note also that in this paper, we use a state machine-based formalism to represent the discovered model of the external interactions of services. This formalism is capable of representing sequential interactions between services (see [7] for a discussion of the suitability of this formalism for describing the external behavior of services).
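Definition 3 can be read operationally: a message sequence is a valid process instance exactly when the state machine accepts it. A minimal sketch of this acceptance test follows; the class, state names, and the toy ordering process are ours, for illustration only:

```python
# Minimal sketch of Definition 3: a deterministic state machine over
# message names. The class and the example data are illustrative.
class ProcessModel:
    def __init__(self, s0, finals, transitions):
        self.s0 = s0                                  # initial state s0
        self.finals = set(finals)                     # final states F
        # T as a dict: (state, message) -> next state (deterministic)
        self.delta = {(s, m): t for (s, t, m) in transitions}

    def accepts(self, instance):
        """True iff the message sequence drives s0 into a final state."""
        state = self.s0
        for m in instance:
            if (state, m) not in self.delta:
                return False                          # no such transition
            state = self.delta[(state, m)]
        return state in self.finals

# A toy ordering process: sendOrder -> getInvoice -> makePayment -> ship
P = ProcessModel(
    s0="s0", finals=["s4"],
    transitions=[("s0", "s1", "sendOrder"), ("s1", "s2", "getInvoice"),
                 ("s2", "s3", "makePayment"), ("s3", "s4", "ship")])

assert P.accepts(["sendOrder", "getInvoice", "makePayment", "ship"])
assert not P.accepts(["sendOrder", "ship"])           # invalid instance
```

Note that `accepts` rejects both sequences with no matching transition and sequences that stop before a final state, mirroring the definition of valid instances.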
Nevertheless, our event correlation framework is generic and allows adopting other, more expressive process models, such as Petri nets, for representing the discovered process model of the interactions of services (e.g., as used in [14]).

Process views. The notion of process view in this paper refers to the representation of a process that results from understanding a set of process events (in a log) from a specific perspective. A process view consists of a set of process events, the set of process instances formed on these events
Fig. 3 Part of process map for SCM (Supply Chain Management) dataset
(the result of grouping events in a specific way), and their respective process model. Formally, we define a process view as follows:

Definition 4 (Process view) A process view is a tuple v = (Lv, PIv, Pv) in which
– Lv ⊂ L is the set of events related to process execution;
– PIv is the set of process instances of v corresponding to a grouping of the events Lv via a specific method of correlating events;
– Pv is the process model that generates the process instances PIv.

A process view provides a high-level representation of process executions in an enterprise. Process views may represent the processes of the enterprise at various abstraction levels (e.g., the whole enterprise, a department within the enterprise, or the activities of an individual user). Therefore, process views may have relationships with each other (e.g., the process view of a whole enterprise includes that of a department). We represent the set of process views corresponding to a process event log as V. As an example, Fig. 3 shows a set of process views defined in a supply chain management scenario. The scenario includes views corresponding to the purchase order (OS), payment (PS), and customer relationship (CRS) systems. The symbol PO is the short form of the OS:submitPO operation, i.e., the operation submitPO of the OS system. The extended forms of the other symbols (CO, Inv, Pay, NP, and SR) are shown in Fig. 3. Note that the notion of process view is defined differently here compared to its conventional use in the literature (e.g., in [9,26,43,48]), where a base process model is assumed and process views represent the same process model at various levels of abstraction (e.g., by applying operations such as
aggregation of nodes) or different portions of it (e.g., corresponding to different users/roles). In our context, process views include a set of events, the respective process instances, and a process model. Several process views can be defined on the same set of events, corresponding to various ways of understanding events in the context of process execution (i.e., various ways of correlating them). However, similarly to existing work, the process model of a process view may be defined at various levels of abstraction, or views may cover different parts of a process (e.g., corresponding to that of a particular system or a user/role).

Process views relationships and process map. Process views represent different perspectives of the process executions over the same set of events in the log. Therefore, they may have relationships. For example, a view may only represent a subset of the process defined by another view. The relationships between views correspond to the relationships between their respective sets of process instances (and therefore their respective process models). In particular, since we assume sequential business processes related to the interaction of services, the most relevant relationships between process instances (as sequences of events) are "part-of" and "subsumption", defined as follows2:

Definition 5 (Subsumption) Process X is subsumed by process Y if PIX ⊆ PIY. This relationship allows specifying that one process is more specific than another (e.g., the Retailer and CRM views are subsumed by the SCM view in Fig. 3).

Definition 6 (Part-of) Process X is part-of process Y if any given instance p of X is part-of some instance p′ of process Y. An instance p of process X is part-of instance p′ of process Y if all messages in p appear in the same order in p′. For example, instances m1, m2 and m2, m4 are part-of instance m1, m2, m3, m4. The processes of the OS and PS views in Fig. 3 are part-of that of the Retailer view.
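At the instance level, Definition 6 is an ordered-subsequence test and can be sketched directly; the function name is ours:

```python
def is_part_of(p, p_prime):
    """Definition 6 at the instance level: every message of p occurs in
    p_prime, in the same relative order (ordered-subsequence test).
    Sketch code; the function name is ours, not the paper's."""
    it = iter(p_prime)
    # 'm in it' consumes the iterator up to the match, so order is enforced
    return all(m in it for m in p)

# The paper's example: m1, m2 and m2, m4 are part-of m1, m2, m3, m4
assert is_part_of(["m1", "m2"], ["m1", "m2", "m3", "m4"])
assert is_part_of(["m2", "m4"], ["m1", "m2", "m3", "m4"])
assert not is_part_of(["m4", "m2"], ["m1", "m2", "m3", "m4"])  # wrong order
```

The single-iterator idiom makes the order requirement explicit: once a message of p′ has been consumed, earlier messages can no longer be matched.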
This relationship highlights that a process is part of a larger (composite) service interaction model. In order to allow organizing the process views for easier exploration and understanding of their relationships, we introduce the notion of process map, defined as follows:

Definition 7 (Process map) A process map M = (N, A) is a labeled, directed graph (digraph) in which (i) each node N1 ∈ N represents a process view v1 ∈ V, and (ii) for N1, N2 ∈ N, if P2 (of N2) is "subsumed-by" or is "part-of" P1 (of N1), there is an arc from N1 to N2 labeled "subsumed-by" or "part-of", respectively.
2 In general, relationships between process models could be more complex, as [47] witnesses, depending on the expressiveness of the chosen business process model.
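Definition 7 can be materialized as a labeled arc list once the pairwise relationships between views are known. A sketch follows; the function, the predicates, and the toy relationships (loosely modeled on Fig. 3) are all ours:

```python
def build_process_map(views, subsumed_by, part_of):
    """Sketch of Definition 7: a labeled digraph over process views.
    `views` is a list of view names; `subsumed_by(x, y)` and
    `part_of(x, y)` are placeholder predicates saying that the process
    of view x stands in that relation to the process of view y.
    An arc (N1, N2, label) goes from the more general to the more
    specific view, as in the definition."""
    arcs = []
    for n1 in views:
        for n2 in views:
            if n1 == n2:
                continue
            if subsumed_by(n2, n1):
                arcs.append((n1, n2, "subsumed-by"))
            elif part_of(n2, n1):
                arcs.append((n1, n2, "part-of"))
    return arcs

# Toy relationships: Retailer subsumed by SCM; OS and PS part-of Retailer
subs = {("Retailer", "SCM")}
parts = {("OS", "Retailer"), ("PS", "Retailer")}
arcs = build_process_map(
    ["SCM", "Retailer", "OS", "PS"],
    lambda x, y: (x, y) in subs,
    lambda x, y: (x, y) in parts)
```

Here `arcs` contains one "subsumed-by" arc from SCM to Retailer and "part-of" arcs from Retailer to OS and PS, mirroring a fragment of the map in Fig. 3.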
We propose a method for the spatial organization of process views in a process map in Sect. 5.1 to facilitate the navigation of the process views.

Process space. A process space defines the scope for process views over the set of process events in an enterprise. It covers a set of data sources (i.e., process events) plus the set of process views, organized in a process map, that enable the interpretation of these events in terms of process models and event correlation. As such, a process space is the world, and the process map is its model, through which we can perform process exploration and analysis. We define a process space as follows:

Definition 8 (Process space) A process space S is a tuple S = (L, V, M), in which L is the event log, V is the set of process views defined on top of the events in L, and M is the process map for the organization of the process views in V.

In the following, we focus on the problem of discovering the process views of an enterprise to build its process space, starting from the process event logs.

2.3 Process views discovery problem

The goal of process views discovery is to derive a process map M (a set of process views) starting from the set of events in L. This implies addressing three sub-problems: (i) how to correlate events into process instances for a given view; (ii) how to derive process models from a set of instances for a view (known as the process mining problem [45]); and (iii) how to organize process views into a map, that is, deciding which views to include and at which level of abstraction. We recognize that the heart of the problem is finding the correlation between events and hence being able to transform L into a set of PIs. In this paper, we do not focus on the second sub-problem (process mining).
Indeed, for a given set of process instances PI, we can leverage one of the many existing algorithms for process mining [45], including our prior work [31], to discover a process model P for each PI, depending on which kind of process we aim at discovering or the assumptions made about the input dataset. In this paper, we also do not deal with the heterogeneity of data formats in logs containing the process execution events. Instead, we rely on existing approaches in data integration [38] and assume that data are collected from the source systems and transformed into a homogeneous event format. In the following, we focus on the first and third sub-problems, i.e., event correlation and organizing process views into a process map.

Event correlation problem. The event correlation problem poses a number of interesting challenges and questions: first, how to define correlation between events in L? For example, considering the SCM scenario in Fig. 3, we want to specify that PO and CO items belong to the same process instance
in the OS view, but that NP does not. Second, how to identify, among the many possible views, which ones are interesting and lead to a map the analyst finds useful for analysis purposes? Third, how to efficiently search the space of event correlations, and their corresponding sets of process instances, that potentially lead to interesting process views? We discuss these items further as follows:

Correlation condition language. Addressing the issue of how to define correlation between events translates to defining methods or a language that specifies event correlation according to the values of event attributes. We define a correlation condition ψ as a predicate over the attributes of events that can verify whether two events belong to the same instance. For instance, looking at the attributes of PO and CO, we may observe that, e.g., they share the same value for the order number attribute oID. In this case, the correlation condition can be expressed as "having a common value on oID". We need to identify the possible forms that correlation conditions can take in service-based processes.

Interestingness of process instances. As mentioned earlier, interestingness is subjective: the interestingness of a process view (representing a given way of event correlation) depends on what we want to analyze and on the perspective from which we look at the domain. While identifying interestingness is subjective in general, there are certain correlations of events that may not make sense from a process perspective. The issue is how to capture the properties of such correlations and use them to define objective measures for excluding such PIs. This approach is also called identifying interestingness through finding what is not interesting [29,39]. For example, grouping events based on the itemColor attribute of an order may not yield an interesting process view.
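As an illustration of such objective measures, the sketch below rejects two degenerate partitions: one in which almost no events are correlated (nearly all instances are singletons) and one in which almost all events collapse into a single instance (e.g., grouping by an attribute like itemColor). The function and the threshold values are ours, not the paper's actual criteria:

```python
def looks_interesting(instances, log_size, min_avg_len=2.0,
                      max_instance_ratio=0.8):
    """Heuristic filter over a candidate partition (a set of instances).
    Sketch only; thresholds are illustrative, not the paper's criteria.
    Rejects partitions whose instances are mostly trivial (near-singleton)
    and partitions where one instance swallows most of the log."""
    n = len(instances)
    if n == 0:
        return False
    avg_len = sum(len(pi) for pi in instances) / n
    if avg_len < min_avg_len:
        return False                      # under-correlation: singletons
    if max(len(pi) for pi in instances) > max_instance_ratio * log_size:
        return False                      # over-correlation: one giant group
    return True

# Ten uncorrelated events -> rejected (all singletons)
assert not looks_interesting([["m%d" % i] for i in range(10)], log_size=10)
# All events lumped together (e.g., grouped by itemColor) -> rejected
assert not looks_interesting([["m%d" % i for i in range(10)]], log_size=10)
# A few multi-event instances -> kept as a candidate view
assert looks_interesting([["m1", "m3"], ["m2", "m4"], ["m5", "m6"]],
                         log_size=6)
```

In the actual system, such filters are parameterized by the user before each discovery step (see the interactive design described in the contributions), rather than being fixed constants.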
Heuristics help in identifying what is not interesting for all users; however, the issue of subjectiveness of process views has to be considered from the perspective of individual users who will be looking at the process map. Therefore, it is crucial to formulate heuristics-based interestingness measures so that user domain knowledge, input, and feedback are taken into account. Efficient discovery of interesting process views. There are many ways to correlate events based on different correlation conditions leading to different process instances and process models. Indeed, the space of correlation conditions is large because there are many attributes of events that can be considered as correlators and because all their combinations may form correlation conditions leading to potentially interesting process views (see Sect. 4.3). For example, both oID and customerID may be required to correlate messages of a view related to the ordering system into process instances corresponding to the individual purchase of each customer. There are three related challenges and questions to address: (i) how to efficiently search the space of possible
correlation conditions while avoiding an exponential explosion in processing time; (ii) how to design an approach that takes into account user inputs along with heuristics to build a good process map (user inputs could be used to set values for the properties of the process instances that are likely of interest or, alternatively, to exclude sets of process instances that are unlikely to lead to interesting process views); and (iii) how to support the user in providing feedback and steering the discovery of process views during event correlation.

Organizing and exploring process views. The purpose of efficient discovery of process views is to prune the search space of possible process views, based on heuristics and user inputs, so that a small number of potentially interesting process views are discovered. Nevertheless, a challenge is how to facilitate the users' job of exploring the set of discovered views, which in practice could be large, in a process map. In particular, one challenge is identifying the relationships between the process views, and another is arranging them in the space (page) so that they can be easily explored. Another purpose of process maps is to support the refinement of the discovered process views based on user feedback. In the following, we present a system, called the process views discovery system, to semi-automatically discover process views over a process event log and address the above-mentioned challenges.
3 Process views discovery system

We propose the design and development of a system, called the process views discovery system (PVDS), that takes a set of process events as input and enables the explorative discovery of process views over them. The core functionality of a PVDS is discovering the different ways in which the input process events can be correlated into process instances, thereby identifying different process views. In the following, we first characterize concrete forms of correlation conditions in the context of service-based processes. Then, we present an overview of the Process Spaceship architecture, a PVDS for the discovery of correlation conditions for service-based processes.
3.1 Correlation conditions in service-based processes Events related to the execution of business processes, and particularly those implemented over Web services, are often correlated into process instances using methods prescribed by process-related standard proposals for Web services such as BPEL, WS-Conversation, and WS-CDL [4], or by industrial software products such as IBM WebSphere Process Manager [23]. These specifications propose
Table 1 Snapshots of example service logs, (a) and (b). The events in rows shaded the same way are part of the same process instance
correlation methods based on either attributes of the message payload or attributes in the message header (message metadata). We assume that the information needed to perform the correlation is available in the log. This assumption is reasonable since messages are indeed correlated by the recipient services and, therefore, the correlating attributes need to be present. We define a correlation condition, i.e., a method to correlate events, as follows: Definition 9 (Correlation condition) A correlation condition is a binary predicate defined over attributes of two messages m_x and m_y, denoted by ψ(m_x, m_y). This predicate is true when m_x and m_y are correlated and false otherwise. For example, Table 1a shows a snapshot of an event log based on Definition 1. One possible correlation condition is ψ(m_x, m_y) : m_x.CID = m_y.CID. Using a correlation condition ψ, it is possible to partition a log L into a set of process instances. We can now define the following two properties on the message sequence of a process instance c = m_1, m_2, . . .: (i) any message m_x ∈ c is directly correlated with at least one other message m_y ∈ c, y ≠ x, i.e., ψ(m_x, m_y) holds for some condition ψ, and (ii) all the messages m ∈ L correlated with at least one message of c are also in c (i.e., c is a maximal subset with respect to the correlation condition). To better see how a correlation condition ψ partitions the log into process instances, let us denote by R_ψ the set of correlated message pairs in L under condition ψ. We have R_ψ = {(m_x, m_y) ∈ L² | ψ(m_x, m_y)}. For instance, for condition ψ : m_x.oID = m_y.OrdRef in Table 1b, we have R_ψ = {(m_1, m_3), (m_2, m_4), (m_3, m_5), (m_5, m_6)}.³ In this case, the set of instances PI_ψ = {{m_1, m_3, m_5, m_6}, {m_2, m_4}} consists of two instances.
As mentioned before, since the set of instances can be obtained by applying a correlation condition ψ to the events in log L, we can say that a process view is characterized by its correlation condition ψ. ³ When ψ is commutative, a pair (a, b) also implies (b, a). For brevity, we assume this property and do not show all pairs in this example.
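Partitioning a log under a correlation condition amounts to building the graph of correlated message pairs and taking its connected components (each component is one process instance; the paper later refers to this as finding the connected components of G_ψ(L)). Below is a minimal Python sketch using union-find; the attribute values are illustrative, chosen to reproduce the pairs R_ψ listed above for Table 1b.

```python
from collections import defaultdict

def partition_log(messages, psi):
    """Partition a log into process instances: union messages related by the
    (symmetric) predicate psi, then return the connected components."""
    n = len(messages)
    parent = list(range(n))  # union-find forest over message indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for i in range(n):
        for j in range(i + 1, n):
            if psi(messages[i], messages[j]):
                union(i, j)

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(messages[i])
    return list(groups.values())

# Illustrative values consistent with the reference-based condition
# psi: m_x.oID = m_y.OrdRef and the pairs R_psi given in the text.
log = [
    {"id": 1, "oID": "o1", "OrdRef": None},
    {"id": 2, "oID": "o2", "OrdRef": None},
    {"id": 3, "oID": "o3", "OrdRef": "o1"},
    {"id": 4, "oID": None, "OrdRef": "o2"},
    {"id": 5, "oID": "o5", "OrdRef": "o3"},
    {"id": 6, "oID": None, "OrdRef": "o5"},
]

def psi_ref(mx, my):
    return mx["oID"] is not None and mx["oID"] == my["OrdRef"]

instances = partition_log(log, lambda a, b: psi_ref(a, b) or psi_ref(b, a))
# Two instances: {m1, m3, m5, m6} and {m2, m4}, as in the text.
```

The quadratic pairwise loop is for clarity only; the self-join query discussed in Sect. 4.2.1 computes the same pair set R_ψ inside the database.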
In the following, we present the common correlation methods identified by studying standardization proposals and industrial process management tools. These methods indicate families of correlation conditions that have to be investigated for the discovery of process views. While we focus here on the Web service context, these methods for correlating messages or events are actually generic and found in other contexts as well. 3.1.1 Process instances using a single correlation method We first examine the case of instances where the correlation method is the same throughout the entire instance. These methods can be classified into three families: key-based, reference-based, and composite. Key-based correlation. In the simplest case, a unique value is used in each message to directly identify the instance to which it belongs. This value acts as a key that uniquely identifies the instance. For instance, the attribute ConvID in the log of Table 1a acts as a key, since messages m_1, m_3, and m_6 all have the value “1” for this attribute. The corresponding correlation condition is ψ_key : m_x.ConvID = m_y.ConvID, and we have PI_ψkey(L) = {{m_1, m_3, m_6}, {m_2, m_4, m_5}}. This method of correlation is called key-based correlation. Reference-based correlation. In other cases, the messages of an instance are correlated through a reference to a previous message in the instance. In this case, any message (except the first) carries a reference attribute whose value equals that of an identifier attribute in a previous message. For example, the messages in Table 1b are correlated using this method, called reference-based correlation, by the correlation condition ψ_ref : m_x.oID = m_y.OrdRef. Both the key-based and reference-based correlation methods can be modeled through the same family of correlation conditions, expressing equality of attribute values between pairs of messages. We refer to conditions in this family as atomic correlation conditions.
They are defined as follows: Definition 10 (Atomic correlation condition) An atomic correlation condition ψ specifies that two messages are correlated if they have the same value on two of their attributes A_i and A_j, i.e., ψ : m_x.A_i = m_y.A_j. Note that we might have i = j, as in the case of key-based correlation. Similar to the concept of composite keys in databases, where a key may consist of more than one attribute, the method used for correlating messages may rely on several attributes. For instance, the messages of an instance may be correlated using the values of attributes customer ID (ψ_1 : m_x.CID = m_y.CID) and survey ID (ψ_2 : m_x.SID = m_y.SID), as for messages NP and SR (Fig. 3, CRM view). Table 1a shows a log corresponding to this scenario (assuming attribute ConvID is not present). In this case, the correla-
Table 2 A snapshot of the log for Retailer view in Fig. 3 (message CO is not considered for brevity purposes)
The events in rows shaded the same way are part of the same process instance. We define ψ_1 : m_x.oID = m_y.oID, ψ_2 : m_x.invID = m_y.invID
tion condition can be denoted as ψ_c = ψ_1∧2 = ψ_1 ∧ ψ_2. We refer to correlation conditions in this second family as composite conjunctive (for short, conjunctive). In this example, ψ_1∧2 partitions the log into three instances, i.e., PI_ψ1∧2(L) = {{m_1, m_4}, {m_2, m_5}, {m_3, m_6}}. Definition 11 (Conjunctive correlation condition) A conjunctive correlation condition is a conjunction of more than one atomic condition. It follows the general form ψ : ψ_1 ∧ ψ_2 ∧ · · · ∧ ψ_v, where the ψ_i, 1 ≤ i ≤ v, are atomic conditions. 3.1.2 Process instances using multiple correlation methods When a process spans multiple systems, it is not rare that the method used for correlating messages differs from one system to another. Even different pairs of messages of an instance within one system may use different correlation methods. In these cases, several correlation conditions are needed to correlate the messages of the same instance. For example, consider the log in Table 2 (the corresponding model is illustrated in the Retailer view, Fig. 3). Messages of types PO and Inv are correlated using the condition ψ_1 : m_x.oID = m_y.oID, and messages Inv and Pay are correlated using the condition ψ_2 : m_x.invID = m_y.invID, but all are part of the same instance in this view. For such instances, messages m_x and m_y are correlated if they satisfy either ψ_1 or ψ_2 (or both). These three message types form a unique instance under condition ψ_d = ψ_1∨2 = ψ_1 ∨ ψ_2. We call such conditions composite disjunctive conditions (or “disjunctive conditions” for short). They are defined as follows: Definition 12 (Disjunctive correlation condition) A disjunctive correlation condition is a disjunction of more than one atomic or conjunctive condition. Conditions of this family follow the general form ψ : ψ_1 ∨ ψ_2 ∨ · · · ∨ ψ_u, where the ψ_i, 1 ≤ i ≤ u, are either atomic or conjunctive conditions.
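Viewed through the correlated pair sets R_ψ, composite conditions are easy to picture: a conjunction intersects the pair sets of its atomic conditions, while a disjunction unions them. A small Python illustration follows; the log and its attribute values are hypothetical.

```python
def pairs(messages, psi):
    """R_psi: the set of correlated message pairs (by id), looking forward
    only, under the symmetric predicate psi."""
    return {(a["id"], b["id"])
            for i, a in enumerate(messages)
            for b in messages[i + 1:]
            if psi(a, b) or psi(b, a)}

# Hypothetical log with customer-ID and survey-ID attributes.
log = [
    {"id": 1, "CID": "c1", "SID": "s1"},
    {"id": 2, "CID": "c1", "SID": "s2"},
    {"id": 3, "CID": "c2", "SID": "s1"},
    {"id": 4, "CID": "c1", "SID": "s1"},
]
r1 = pairs(log, lambda x, y: x["CID"] == y["CID"])  # psi1: same customer
r2 = pairs(log, lambda x, y: x["SID"] == y["SID"])  # psi2: same survey
r_conj = r1 & r2  # psi1 AND psi2: both attributes must match -> {(1, 4)}
r_disj = r1 | r2  # psi1 OR psi2: either attribute may match
```

Intersecting shrinks the pair set (fewer, stricter correlations, hence more and shorter instances), while unioning grows it (merging instances); this is exactly the monotonicity that Sect. 4 exploits for pruning.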
Note that disjunctive correlation conditions can also be used to express the correlation condition of processes whose instances fall into different groups, each group correlated using
different correlation conditions. For instance, the model illustrated in the SCM view of Fig. 3 allows the instances of the CRM view, correlated using condition ψ_c, as well as the instances of the Retailer view, correlated with ψ_d (see above). The correlation condition for the SCM view is expressed using the disjunctive condition ψ_c ∨ ψ_d, as its set of instances is the union of those of the Retailer and CRM views. 3.1.3 Other correlation methods
Fig. 4 The conceptual architecture of Process Spaceship
The correlation methods above do not cover the entire scope of possible methods. For instance, time constraints can be used as part of correlation condition definitions. The Choreography Description Language (WS-CDL) [4] allows defining a time limit for an instance. In terms of correlation conditions, this translates into an additional constraint on the time difference between two messages, yielding conditions of the form ψ : m_x.A_i = m_y.A_j ∧ |m_x.τ − m_y.τ| ≤ MaxDuration, where MaxDuration expresses the time constraint. In the following, we focus on the families of correlation conditions identified above. We believe they reasonably cover the methods most commonly used in service-oriented environments. However, the overall approach would remain the same when extending the discovery to other families of conditions or to entirely different contexts (e.g., software execution traces or EDI). 3.2 Process Spaceship: overview and architecture We propose Process Spaceship as a PVDS that enables the discovery of process views and their organization into a process map, starting from a log of process events. We presented the Process Spaceship prototype in [30]. This paper presents a framework (extended from [30]) and the set of algorithms for process view discovery, described in detail in the following sections, to address the challenges listed in Sect. 2.3. As shown earlier, the discovery of process views corresponds to identifying the various ways in which process events form process instances. Therefore, it maps into the exploration of the space of possible correlation conditions. Considering the types of correlation conditions listed in Sect. 3.1, the number of potential correlation conditions, from a purely combinatorial point of view, is large: one might first try each atomic condition, that is, the a = k²/2 possible pairs of attributes (in a log with k attributes).
Then, one might attempt to combine these atomic conditions to form conjunctive and disjunctive composite conditions. In theory, there can be c = 2^a − 1 conjunctive conditions to explore and, finally, 2^(a+c) − 1 disjunctive conditions. An exhaustive search would not scale. Moreover, many of the correlation conditions produced are not interesting to the user. For instance, grouping
messages based on the total amount of a purchase is unlikely to produce an interesting process view. To explore the space of possible correlation conditions efficiently, we adopt a level-wise approach [28] and use a set of heuristic criteria to reduce the space of possible correlation conditions. In this approach, the set of candidate conditions is grown from atomic to composite (conjunctive and disjunctive), and at each level, process views that do not satisfy objective interestingness measures are pruned. Figure 4 shows the architecture, organized into two components: a back-end, responsible for discovering interesting views (presented in Sect. 4), and a front-end, for visualization and user-driven refinement (see Sects. 5 and 6.3). In Process Spaceship, we limit the information items to process events according to Definition 1. For process models, we use finite state machines according to Definition 3. Process Spaceship aims to discover the families of correlation conditions identified in Sect. 3.1 (atomic, conjunctive, and disjunctive). We use the algorithm presented in [31] for discovering process models from a set of process instances (of a process view) that result from correlating events using a given correlation condition. We introduce heuristics to filter out potentially non-interesting correlation conditions. The process views discovery is an interactive and user-driven process, which complements the heuristic-based pruning of non-interesting correlation conditions: at the beginning/end of each phase (atomic condition discovery, etc.), the user has the opportunity to provide input and feedback that are taken into account for guiding the next phase.
In particular, (1) each discovery phase provides a set of configurable parameters and thresholds that the user can adjust before each discovery step; (2) the tool presents the results of each step (process views discovered based on correlation conditions) and enables the user to provide feedback by selecting interesting correlation conditions (by inspecting the resulting process views), thereby directing the event correlation process in the next phase. Finally, all potentially interesting process views are organized and presented in a process map, which
is a visual way to navigate views based on their relationships and levels of abstraction. Exploring and browsing a process map enables users to assess the interestingness of discovered process views and to initiate an iterative refinement of the discovered views.
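The combinatorial bounds given in Sect. 3.2 explain why such pruning and user guidance are essential; a quick computation (function name is illustrative) shows how the space explodes even for tiny logs:

```python
def condition_space_sizes(k):
    """Upper bounds on candidate correlation conditions for a log with
    k attributes, following the purely combinatorial counts in the text."""
    a = k * k // 2           # atomic conditions: a = k^2 / 2 attribute pairs
    c = 2 ** a - 1           # conjunctive combinations of atomic conditions
    d = 2 ** (a + c) - 1     # disjunctive combinations of both
    return a, c, d

# Even 4 attributes make a naive exhaustive search hopeless:
a, c, d = condition_space_sizes(4)
assert (a, c) == (8, 255)
assert d == 2 ** 263 - 1     # astronomically large
```

This is why the back-end only ever materializes conditions that survive the interestingness criteria of Sect. 4.1 at each level.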
4 Semi-automated process views discovery In this section, we present the back-end algorithms used in Process Spaceship. We first specify the notion of interestingness for correlation conditions. Then, we show how this notion allows exploring the space of possible correlation conditions in a tractable way.
4.1 Interestingness of correlation conditions The notion of interestingness is eminently subjective, and a variety of metrics have been proposed and used to characterize it [29]. The following observations guide our choice: First, our objective is to support the exploration of the process space and, in turn, the space of possible correlation conditions. We aim at using objective metrics rather than subjective metrics, as the latter may lead to the premature rejection of correlation conditions. We prefer high recall (implying visiting more correlation conditions) in order to preserve possibly interesting conditions. Second, we aim at designing metrics that capture the domain knowledge of users and their requirements for finding interesting process views. 4.1.1 Non-interestingness criteria for condition selection In our work, we take the approach of specifying the interestingness of correlation conditions on the basis of what is not interesting [39]. While the interestingness of a correlation condition depends on the user viewpoint, this choice relies on the observation that there exist objective criteria allowing us to reject correlation conditions that are clearly, regardless of the user, non-interesting in the context of service-oriented processes. These criteria are detailed hereafter. (A) Globally unique values are not correlators: In this paper, we focus on correlation conditions defined based on equality of values in some attributes of two or more messages (this is the most common method in the service-oriented context). Hence, an attribute is a possible correlator only if it contains values that are not globally unique, i.e., they can be found in another message, whether on the same attribute (key-based correlation) or on a different one (reference-based correlation). Attributes with unique values, i.e., values that are not repeated in any attribute of any other tuple, can be tagged as non-interesting.
Conversely, attributes with very
small domains (e.g., Boolean) are not interesting either, since each value will be repeated on a large number of messages and lead to a few trivial partitions. To characterize these properties, we define the following two measures: distinct_ratio(A_i): for a key-based condition on attribute A_i, it is defined as the number of distinct values of A_i (distinct(A_i)) with regard to the number of non-null values in A_i (nonNull(A_i)), i.e.,

distinct_ratio(A_i) = distinct(A_i) / nonNull(A_i)    (1)
shared_ratio(ψ): for a reference-based condition between attributes A_i and A_j, this ratio represents the number of values shared by A_i and A_j with respect to the larger number of distinct values of the two attributes, i.e.,

shared_ratio(ψ) = |distinct(A_i) ∩ distinct(A_j)| / max(|distinct(A_i)|, |distinct(A_j)|)    (2)
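Both ratios can be computed in one pass over the log columns. The sketch below is illustrative (column values are hypothetical); here distinct(·) is taken over non-null values, as in Eqs. (1) and (2).

```python
def distinct_ratio(column):
    """distinct_ratio(Ai): distinct values over non-null values (Eq. 1)."""
    vals = [v for v in column if v is not None]
    return len(set(vals)) / len(vals) if vals else 0.0

def shared_ratio(col_i, col_j):
    """shared_ratio(psi): values shared by Ai and Aj over the larger
    number of distinct values of the two attributes (Eq. 2)."""
    di = {v for v in col_i if v is not None}
    dj = {v for v in col_j if v is not None}
    if not di or not dj:
        return 0.0
    return len(di & dj) / max(len(di), len(dj))

# Illustrative columns: CID repeats (candidate key), pID is globally unique,
# oID/OrdRef form a reference-based pair.
cid = ["c1", "c1", "c2", "c2", "c1"]
pid = ["p1", "p2", "p3", "p4", "p5"]
oid = ["o1", "o2", "o3", None, None]
ordref = [None, None, "o1", "o2", "o3"]

assert distinct_ratio(cid) == 0.4        # 2 distinct / 5 non-null: kept
assert distinct_ratio(pid) == 1.0        # globally unique: pruned
assert shared_ratio(oid, ordref) == 1.0  # every oID value is referenced
```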
Moreover, categorical attributes (e.g., those containing error codes or currencies) that are not used for correlation can be characterized in the log by the fact that their number of distinct values does not vary much with the size of the dataset. Conversely, an attribute used for correlation would have more distinct values as the dataset grows, since the dataset would contain more instances.⁴ An attribute whose values are reused periodically exhibits properties similar to categorical attributes; we discuss the treatment of such attributes in Sect. 5.2. We use this property to further filter out non-correlator attributes, by comparing their value distributions on samples of the dataset of varying sizes. Thus, if the highest number of distinct values observed in this way for a categorical attribute A_i is denoted by distinct_max(A_i), we define the threshold α as

α = |distinct_max(A_i)| / |L|    (3)
and prune key-based conditions whose ratio of distinct values is smaller than α (a very small value, close to zero). Based on a similar reasoning, we can prune reference-based conditions with shared_ratio(ψ) < α. Finally, key-based conditions with distinct_ratio(A_i) = 1 are also considered non-interesting: with a key-based condition on such an attribute, no two messages can be correlated with each other. We refer to this criterion as the nonRepeatingValues criterion. (B) Conditions partitioning the log into instances with one or two messages, or into very few instances, are not
⁴ A possible exception to the aforementioned properties is attributes whose values are reused periodically (e.g., each day).
interesting: A process instance is a sequence of at least two messages, and we expect several instances to be present in the log. A correlation condition ψ is considered not interesting if it partitions the log either into very few long instances or into a very high number of short instances (instances with too few messages, e.g., one or two). We define measures on the length and the number of instances to recognize conditions leading to such instances. To be able to reason on the number of instances, we define PI_ratio(ψ) as the ratio of |PI_ψ| to the number of messages for which attributes A_i and A_j of condition ψ are defined (i.e., they are not null):

PI_ratio(ψ) = |PI_ψ| / nonNull(ψ)    (4)
We require that the majority of instances have a length of at least 2, and therefore PI_ratio(ψ) can be expected to be smaller than or equal to 0.5 (it is exactly 0.5 when all instances are of length 2). We define a threshold β that can be set to 0.5 or higher. We can safely tag correlation conditions that lead to PI_ratio(ψ) > β as non-interesting. To complement this measure, we also define a measure based on the length of the instances of a condition. In fact, a condition that yields many instances of length 1 (isolated messages) is not interesting. We can identify such process instance sets by examining the median of the distribution of instance lengths and tag as non-interesting those with a median equal to 1. Note that, when forming instances according to condition ψ, messages in the log are considered only if their attributes A_i and A_j have non-null values (e.g., for ψ_1 in Table 2, only messages of types PO and Inv have non-null values for attribute oID, i.e., there are 4 messages with non-null values). The above two measures are complementary, and for a non-interesting condition both of them may hold. On the other hand, we require that at least 2 instances are formed in the log for a given condition. In practice, however, interesting conditions lead to a higher number of instances (proportional to the number of messages in the log). Based on a heuristic observation, we expect interesting conditions to have PI_ratio(ψ) > α. This threshold is small, so that a wide range of process instance sets is allowed. We also set a complementary measure on the length of non-interesting instances in this category. In particular, if there is an instance whose length equals half of the number of messages having a non-null value for the attributes of condition ψ, then the condition is not interesting.
These two measures are also complementary and overlapping, and it is sufficient that one of them holds for a condition to be considered non-interesting. In the remainder of this paper, we refer to this criterion as the imbalancedPI criterion.
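The checks making up the imbalancedPI criterion can be sketched as follows; the function name, default thresholds, and the list-of-lists representation of PI_ψ are illustrative choices, not the paper's exact implementation.

```python
from statistics import median

def imbalanced_pi(instances, non_null, beta=0.5, alpha=0.05):
    """imbalancedPI criterion: flag a condition whose instances are too
    many and too short, mostly isolated messages, or too few and too long.
    `instances` is the partition PI_psi; `non_null` is the number of
    messages with non-null values for the condition's attributes."""
    if len(instances) < 2:
        return True                      # at least 2 instances required
    pi_ratio = len(instances) / non_null
    if pi_ratio > beta:
        return True                      # mostly length-1 or length-2 instances
    if pi_ratio < alpha:
        return True                      # too few instances for the log size
    lengths = [len(inst) for inst in instances]
    if median(lengths) == 1:
        return True                      # isolated messages dominate
    if max(lengths) >= non_null / 2:
        return True                      # one instance swallows half the messages
    return False

# Three balanced instances over 7 correlated messages: kept.
assert not imbalanced_pi([[1, 2, 3], [4, 5], [6, 7]], non_null=7)
# Every message isolated: pruned.
assert imbalanced_pi([[1], [2], [3], [4]], non_null=4)
```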
4.1.2 User input and feedback The goal of our approach is to enable the discovery of potentially interesting process views for a given user. Therefore, we give users the opportunity to provide input and feedback during the various discovery steps regarding the processes they want to analyze. We leverage this information to guide the search for correlation conditions and to increase the likelihood that the discovered process views are interesting for the user. For instance, feedback on which attributes or service messages are related to the processes the user is interested in, as well as an estimate of the number of instances per day and of the length and/or duration (i.e., time elapsed between the first and last message) of instances, can help direct the search toward process views representing such process instances. For instance, a salesman may know that, on average, 100 orders are filed daily by customers. In particular, we enable users to specify the following three interdependent criteria: the average number of instances (denoted by avg_num), their average length, and their average duration (denoted by avg_len and avg_dur, respectively). We show how this information, if available, can be used during the discovery process, and also after discovering the views, to navigate through them effectively. It should be noted that user input and feedback are optional. The tool can run using default settings. These default settings (for the thresholds) favor recall at the expense of precision, meaning that the tool may discover a rather large process map. This is because user input/feedback is used to prune the subset of the correlation results that the user finds irrelevant or non-interesting. However, the tool provides functionalities for the exploration, navigation, and refinement of the process map.
Therefore, users are able to find the processes of their interest even if they have not provided input/feedback during the discovery process, though they may spend more time exploring. 4.2 Discovery of correlation conditions As depicted in Fig. 4, we propose to explore the space of correlation conditions in three steps: first discovering atomic conditions, then conjunctive conditions (formed by applying the conjunction operator to atomic conditions), and finally disjunctive conditions (formed by applying the disjunction operator to atomic and conjunctive conditions). Figure 5 illustrates how these three steps are applied to the logs presented in Table 1a (excluding the ConvID attribute) and Table 2, which correspond to the CRM and Retailer views in Fig. 3, respectively. In this figure, n denotes the number of instances resulting from applying a condition to the log, and len gives the distribution of the lengths of the respective instances. This example is used for illustration throughout this section.
Algorithm 1 Generation and pruning of atomic conditions
Fig. 5 Condition discovery for Retailer dataset
4.2.1 Discovering atomic conditions The approach for discovering atomic conditions is depicted in Algorithm 1. The algorithm consists of three steps: (i) Generating candidate atomic conditions (line 1). In this step, from the set of attributes in L, we generate the set of possible candidate atomic conditions ψ : m_x.A_i = m_y.A_j, e.g., m_x.CID = m_y.CID, m_x.CID = m_y.SID, etc. (ii) Pruning non-interesting conditions based on the nonRepeatingValues criterion (lines 2 to 9). In this step, first, non-interesting key-based conditions are identified and pruned (lines 2 to 5). For this purpose, distinct_ratio(A_i) is computed for all attributes participating in a key-based condition, and then the nonRepeatingValues criterion is applied. For example, in Fig. 5, condition ψ_5 is pruned because all of its values are unique, i.e., we have distinct_ratio(pID) = 1. Next, non-interesting reference-based conditions are identified and pruned (lines 6 to 9). For this purpose, shared_ratio(ψ) is computed for the attribute pairs participating in reference-based conditions, and then the nonRepeatingValues criterion is applied. (iii) Pruning non-interesting conditions based on the imbalancedPI criterion (lines 10 to 16). This step requires the following: (i) computing the set of correlated message pairs (R_ψ) for all conditions ψ (line 11), (ii) computing the set of instances PI_ψ formed by the correlated message pairs in R_ψ (line 12), and (iii) applying the imbalancedPI criterion to PI_ψ (lines 14 to 16). These sub-steps are explained in the following. Computing the set of correlated message pairs for a condition ψ (line 11). The set of correlated message pairs R_ψ can be computed using a standard SQL query (a self-join of log L) whose WHERE clause applies the condition ψ, e.g.:
Input: A: the set of attributes A_i ∈ L, 1 ≤ i ≤ k
Output: AC: the set of atomic conditions
1: AC ← the set of conditions ψ : m_x.A_i = m_y.A_j
2: for conditions ψ_ii : m_x.A_i = m_y.A_i do
3:   distinct_ratio(A_i) = distinct(A_i) / nonNull(A_i)
4: end for
5: AC ← AC − {ψ_ii | distinct_ratio(A_i) < α or distinct_ratio(A_i) = 1}
6: for conditions ψ : m_x.A_i = m_y.A_j, i ≠ j do
SELECT a.id, b.id FROM L a, L b WHERE a.A_i = b.A_j AND b.id > a.id

Here, it is assumed that a message identifier (id) allows the unique identification of each message, and that identifiers are assigned to messages so that they are ordered by timestamp. Given that the correlation between messages is undirected, it is enough to look forward from each message (the condition b.id > a.id achieves this). Doing so, we make sure that the relationships between any previous message and the
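As a concrete illustration, the self-join can be run with SQLite from Python; the table layout and the values below are hypothetical, following the reference-based condition ψ : m_x.oID = m_y.OrdRef of Table 1b (an ORDER BY is added so the result order is deterministic).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (id INTEGER, oID TEXT, OrdRef TEXT)")
conn.executemany("INSERT INTO log VALUES (?, ?, ?)", [
    (1, "o1", None), (2, "o2", None), (3, "o3", "o1"),
    (4, None, "o2"), (5, "o5", "o3"), (6, None, "o5"),
])

# Self-join computing R_psi; b.id > a.id looks forward only, since
# correlation between messages is undirected.
pairs = conn.execute("""
    SELECT a.id, b.id
    FROM log a, log b
    WHERE a.oID = b.OrdRef AND b.id > a.id
    ORDER BY a.id, b.id
""").fetchall()

assert pairs == [(1, 3), (2, 4), (3, 5), (5, 6)]  # R_psi from Table 1b
```

Note that NULLs never satisfy the equality, so messages with undefined attributes are excluded automatically, matching the nonNull(ψ) convention used in Eq. (4).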
4.2.2 Discovering conjunctive conditions A conjunctive condition ψ_1∧2 is computed by applying the conjunction operator to atomic conditions ψ_1 and ψ_2, i.e., ψ_1∧2 = ψ_1 ∧ ψ_2. If the set of atomic correlation conditions computed in the previous step is AC = {ψ_1, ψ_2, ψ_3}, then the set of possible candidate conjunctive conditions is CC = {(ψ_1 ∧ ψ_2), (ψ_1 ∧ ψ_3), (ψ_2 ∧ ψ_3), (ψ_1 ∧ ψ_2 ∧ ψ_3)}. This corresponds to exploring the set containment lattice of AC. It is possible that some of these combinations, built using more atomic conditions than simpler ones (with fewer atomic conditions), lead to the same set of instances as that of
Algorithm 2 Generation and pruning of conjunctive conditions
Input: AC: the set of atomic conditions
Output: CC: the set of atomic and conjunctive conditions
1: L_0 ← {}; L_1 ← AC
2: ℓ ← 1
3: while L_ℓ ≠ {} do
4:   for condition ψ ∈ L_ℓ do
5:     compute R_ψ1∧2 ← R_ψ1 ∩ R_ψ2
6:     PI_ψ ← FindInstances(R_ψ, L)
7:   end for
8:   if ψ has imbalancedPI (based on PI_ψ) or notMon(ψ_1∧2) then
9:     L_ℓ ← L_ℓ − {ψ}
10:  end if
11:  CC ← CC ∪ L_ℓ
12:  for conditions ψ_1, ψ_2 ∈ L_ℓ do
13:    ψ_1∧2 ← ψ_1 ∧ ψ_2
14:    if def(ψ_1∧2) or notInc(R_ψ1, R_ψ2) then
15:      L_ℓ+1 ← L_ℓ+1 ∪ {ψ_1∧2}
16:    end if
17:  end for
18:  ℓ ← ℓ + 1
19: end while
the simpler ones. In such cases, it is enough to find only the minimal conjunctive conditions, defined as follows: Definition 13 (Minimal conjunctive condition) A conjunctive condition ψ is minimal if no other conjunctive condition formed as a conjunction of fewer atomic conditions partitions the log into the same set of instances. For example, assume that, in the set AC above, the conjunctive conditions ψ_1∧2 and ψ_1∧2∧3 partition the log into the same set of instances; then ψ_1∧2∧3 is not minimal and ψ_1∧2 is preferred, as it is easier to compute. Hence, there are two requirements for an automated approach to discover conjunctive conditions: (i) efficiently explore the set containment lattice of atomic conditions, by discovering only interesting conditions, and (ii) discover only minimal conjunctive conditions. To fulfill these requirements, we propose an algorithm that adopts a level-wise iterative approach [28]. At each level L_i, more complex conjunctive conditions (i.e., formed using a larger number of atomic conditions) are grown from the simpler conditions (i.e., formed using fewer atomic conditions) of the previous level L_i−1. The proposed algorithm is depicted in Algorithm 2. Each iteration of the algorithm has three phases: (i) applying conditions ψ to partition the log into instances (lines 4 to 7), (ii) candidate condition pruning (lines 8 to 10), and (iii) generation of candidate conditions for the next level (lines 12 to 17). The algorithm ensures that only minimal conjunctive conditions are discovered, as explained in the following. (i) Partitioning the log into instances for a conjunctive condition ψ_1∧2. For a candidate conjunctive condition ψ_1∧2, the first step is to compute the set of correlated message pairs R_ψ1∧2. This is defined as the intersection of the correlated
message pairs of ψ_1 and ψ_2, as follows: (m_x, m_y) ∈ R_ψ1∧2 ⇔ (m_x, m_y) ∈ R_ψ1 ∧ (m_x, m_y) ∈ R_ψ2 ⇔ (m_x, m_y) ∈ R_ψ1 ∩ R_ψ2. This means that messages m_x and m_y have the same values for the attribute pairs in both ψ_1 and ψ_2. R_ψ1∧2 can be computed as an SQL query using the INTERSECT operator over R_ψ1 and R_ψ2. Computation of PI_ψ1∧2(L) is done by finding the connected components of G_ψ1∧2(L), as explained in Sect. 4.2.1. (ii) Pruning candidate conjunctive conditions. In this phase, non-interesting conjunctive conditions are identified and pruned based on the following criteria: (1) Criterion imbalancedPI (lines 8 to 10 of Algorithm 2). The number of instances for ψ_1∧2 is necessarily equal to or greater than that of both ψ_1 and ψ_2 (e.g., consider condition ψ_c in Fig. 5, where n = 3, which is greater than those of ψ_1 and ψ_2). We check whether the condition PI_ratio(ψ_1∧2, L) < β is satisfied (i.e., whether the conjunctive condition is potentially interesting). If user knowledge is provided, e.g., any of avg_num, avg_len, or avg_dur, we use the following approach to identify non-interesting conditions: if |PI_ψ| > avg_num, or if the average length of instances is smaller than avg_len, or if the average duration of instances is smaller than avg_dur, then it is not interesting to explore further conditions formed by the conjunction of this condition with others. This is because more conjunctions result in a higher number of instances of shorter length and smaller duration. (2) Monotonicity of the number and the length of instances with respect to the conjunction operator: As mentioned before, we expect the number of instances for ψ_1∧2 to be greater than that of both ψ_1 and ψ_2. This also implies that (most of the) instances in PI_ψ1∧2(L) are of smaller length than those of ψ_1 and ψ_2.
Therefore, if the number of instances does not increase, or the lengths of (at least some) instances in PI_ψ1∧2(L) do not decrease compared to those of PI_ψ1(L) and PI_ψ2(L), then ψ_1∧2 is not interesting. This is because, in this case, ψ_1∧2 does not create a new interesting process view with respect to ψ_1 and ψ_2. A slight change in the number or length of instances (while PI_ψ1∧2(L) ≈ PI_ψ1(L) and PI_ψ1∧2(L) ≈ PI_ψ2(L)) can be due to imperfections in the log. This criterion is referred to as notMon(ψ_1∧2) in line 8 of Algorithm 2. Generating candidate conjunctive conditions. In this phase, the set of candidate conditions for the next level is generated (lines 12 to 17), using the non-pruned (selected) correlation conditions of the current level. In fact, if a condition ψ_1 is pruned, i.e., it fails to satisfy the interestingness measures (e.g., the number of instances is too high or instances are too short), then the conjunctive condition built using this
and any other condition will also fail to satisfy these criteria, since the resulting instances will necessarily be shorter or, at best, of the same length. Therefore, only selected conditions from the previous level are used. We use the following criteria to predict that some candidate conditions are non-interesting without having to compute how they partition the log, which is computationally expensive (for the computational complexity of this operation, see Sects. 4.3 and 6): (1) Attribute definition constraint: Consider two atomic conditions ψ1 and ψ2. Each of the attributes used in ψ1 (e.g., Ai1 and Aj1) and ψ2 (e.g., Ai2 and Aj2) may be undefined for some messages (consider ψ1 and ψ2 in Table 2). However, when the two are considered together in a conjunction, a new constraint appears: the attributes of ψ2 have to be defined whenever the attributes of ψ1 are defined. If we denote by ψ1∧2 the condition formed by the conjunction of ψ1 and ψ2, it has the following form:

ψ1∧2 : mx.Ai1 = my.Aj1 ∧ mx.Ai2 = my.Aj2.   (5)
Hence, for any message of the log, attribute Ai1 (resp. Aj1) is defined if and only if Ai2 (resp. Aj2) is also defined. Therefore, we can verify whether ψ1 and ψ2 have defined values for the same set of messages. If not, this conjunction can be safely discarded, and we can avoid computing its corresponding log partitioning into instances. This criterion is referred to as def(ψ1∧2) in line 14 of Algorithm 2. (2) Inclusion property. If the set of messages correlated by ψ1 is included in that of ψ2 (i.e., if we have Rψ1 ⊆ Rψ2), then we have Rψ1∧2 = Rψ1. Therefore, ψ1∧2 is not minimal (since it boils down to ψ1), and we can avoid computing its instances. Furthermore, if we have Rψ1 = Rψ2, then ψ1 and ψ2 partition the log in the same way (they produce the same view), and it is enough to consider only one of them in all later computations. This criterion is referred to as notInc(Rψ1, Rψ2) in line 14 of Algorithm 2.

4.2.3 Discovering disjunctive conditions

Similar to the discovery of conjunctive conditions, discovering disjunctive conditions consists in finding the set of interesting minimal disjunctive combinations of candidate correlation conditions, defined as follows:

Definition 14 (Minimal disjunctive condition) A disjunctive condition ψ is minimal if no other disjunctive condition formed using fewer disjunctions of atomic conditions partitions the log into the same set of instances.
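The instance-finding step shared by the conjunctive and disjunctive cases (Subsect. 4.2.1) can be sketched concretely: given the set of correlated message pairs Rψ, instances are the connected components of the correlation graph. The union-find sketch below is our own illustration under that definition, not the paper's implementation (which uses a graph library for the decomposition):

```python
def find_instances(R, messages):
    """Partition a log into process instances: the connected components
    of the correlation graph induced by the correlated pairs R."""
    parent = {m: m for m in messages}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path compression
            m = parent[m]
        return m

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for mx, my in R:
        union(mx, my)

    instances = {}
    for m in messages:
        instances.setdefault(find(m), []).append(m)
    return list(instances.values())
```

For example, the pairs {(m1, m2), (m2, m3), (m4, m5)} over five messages yield the two instances {m1, m2, m3} and {m4, m5}.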
Algorithm 3 Generation and pruning of disjunctive conditions
Input: CC: the set of atomic and conjunctive conditions
Output: CC: the set of atomic, conjunctive and disjunctive conditions
1: L0 ← {}; L1 ← CC
2: ℓ ← 1
3: while Lℓ ≠ {} do
4:   for condition ψ ∈ Lℓ do
5:     compute Rψ1∨2 ← Rψ1 ∪ Rψ2
6:     PIψ1∨2(L) ← FindInstances(Rψ1∨2, L)
7:   end for
8:   if ψ has imbalanced PI (based on PIψ1∨2(L)), or notMon(ψ1∨2) or TrivUnion(Rψ1, Rψ2) then
9:     Lℓ ← Lℓ − {ψ}
10:  end if
11:  CC ← CC ∪ Lℓ
12:  for conditions ψ1, ψ2 ∈ Lℓ do
13:    ψ1∨2 ← ψ1 ∨ ψ2
14:    if notAssoc(ψ1∨2) or notInc(Rψ1, Rψ2) then
15:      Lℓ+1 ← Lℓ+1 ∪ {ψ1∨2}
16:    end if
17:  end for
18:  ℓ ← ℓ + 1
19: end while
For example, if the set of correlation conditions (atomic or conjunctive) computed in the previous steps is CC = {ψ1, ψ2, ψ3}, then the set of possible candidate disjunctive conditions is MC = {(ψ1∨ψ2), (ψ1∨ψ3), (ψ2∨ψ3), (ψ1∨ψ2∨ψ3)}. This corresponds to exploring the set containment lattice of CC. We again adopt a level-wise approach [28] to search the space of possible disjunctive conditions, similar to Algorithm 2 for the discovery of conjunctive conditions. This is performed in an iterative process comprising three phases: finding instances, candidate pruning, and next-level candidate generation. The respective algorithm is depicted in Algorithm 3. The input of this algorithm is CC, which contains both the atomic and conjunctive conditions that were selected as interesting in the previous two steps. It ensures that all minimal disjunctive conditions are discovered. This step has the following three phases: (i) Finding instances for a disjunctive condition ψ1∨2 (lines 4 to 7). For a disjunctive condition ψ1∨2, we have Rψ1∨2 = Rψ1 ∪ Rψ2, i.e., the set of correlated message pairs of ψ1∨2 is the union of the message pairs of ψ1 and those of ψ2: (mx, my) ∈ Rψ1∨2 ⇔ (mx, my) ∈ Rψ1 ∨ (mx, my) ∈ Rψ2 ⇔ (mx, my) ∈ Rψ1 ∪ Rψ2. Rψ1∨2 is computed using an SQL query based on the UNION operator over Rψ1 and Rψ2 (line 5). Then, the set of instances PIψ1∨2(L) is computed by finding the connected components of its correlation graph Gψ1∨2(L) using a graph decomposition algorithm (line 6), as discussed in Subsect. 4.2.1.
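The two set operations at the core of composite-condition discovery — computed in the paper as SQL INTERSECT and UNION queries over the Rψ tables — reduce, in memory, to plain set algebra over message pairs. A minimal sketch of ours (function names are illustrative):

```python
def conjunctive_pairs(Ra, Rb):
    # SQL INTERSECT analogue: pairs correlated by BOTH conditions
    return set(Ra) & set(Rb)

def disjunctive_pairs(Ra, Rb):
    # SQL UNION analogue: pairs correlated by EITHER condition
    return set(Ra) | set(Rb)

# Correlated pairs of two hypothetical conditions psi1 and psi2
R1 = {(1, 2), (3, 4)}
R2 = {(3, 4), (5, 6)}
```

Here conjunctive_pairs(R1, R2) yields {(3, 4)} (fewer, shorter instances), while disjunctive_pairs(R1, R2) yields all three pairs (fewer, longer instances once the components are merged), mirroring the monotonicity arguments used for pruning.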
(ii) Pruning candidate disjunctive conditions. In this phase, the following three criteria are used to prune non-interesting correlation conditions: (1) Criterion imbalanced PI. Instances formed based on the disjunction of ψ1 and ψ2 are always less numerous than (or equal in number to) those of ψ1 and ψ2 (e.g., consider condition ψd in Fig. 5, where n = 2, which is equal to those of ψ1 and ψ2). Hence, it suffices to check whether a candidate condition ψ1∨2 satisfies PI_ratio(ψ1∨2) > α (line 8). If user knowledge is provided, e.g., any of avg_num, avg_len, or avg_dur, we use the following approach to identify non-interesting conditions: if |PIψ| < avg_num, or if the average length of instances is greater than avg_len, or if the average duration of instances is greater than avg_dur, then it is not interesting to explore any more conditions built by the disjunction of this condition and others. This is because more disjunctions result in fewer instances of longer lengths and greater durations. (2) Monotonicity of the number and the length of instances with respect to the disjunctive operator: The number of instances resulting from a condition ψi∨j is smaller than that of ψi or ψj, and each instance is at least as long as, but expected to be longer than, those of ψi or ψj, since disjunctions add to the connectivity of the correlation graph (see, for instance, ψd = ψ3 ∨ ψ4 in Fig. 5, whose instances are longer than those of ψ3 and ψ4). Therefore, if for a given disjunctive condition the number of instances does not decrease, or the lengths of (at least some) instances do not increase, then such a disjunctive condition is not interesting. This is because, as for conjunctive conditions, in this case ψ1∨2 does not create a new interesting process view with respect to ψ1 and ψ2. The slight change in the number or the length of instances (since PIψ1∨2(L) ≈ PIψ1(L) and PIψ1∨2(L) ≈ PIψ2(L)) can be due to imperfections in the log.
We use this observation to identify and prune non-interesting disjunctive conditions. This criterion is referred to as notMon(ψ1∨2) in line 8 of Algorithm 3. (3) Avoiding trivial unions of instance sets. Consider the PS and CRM views in Fig. 3, which represent partial views of the process compared to the SCM view. Applying the disjunctive operator on the conditions of these views leads to a new view, namely PS-CRM (not shown in Fig. 3), whose corresponding instance set is the union of the instances in the PS and CRM views. Such views are mainly interesting at the highest levels (e.g., for the SCM business service), where the most complete view of the interaction is desirable. Other intermediate nodes such as PS-CRM would not add any information and are therefore discarded. Note that users can nonetheless explore unions of any two or more views, if interested, using the exploration tool after completion of the discovery process. Indeed, the set of correlated messages in the Retailer view is the union of those in the OS and PS views, that is, we have Rψ4 = Rψ1 ∪ Rψ2, assuming that ψ1, ψ2, and ψ4 represent the
conditions of the OS, PS, and Retailer views, respectively. However, we do not have the same relationship between their instances, i.e., PIψ4(L) ≠ PIψ1(L) ∪ PIψ2(L). The reason is that instances in the OS and PS views connect and make (new) longer instances. For instance, if we have instances ⟨PO, CO⟩ and ⟨Inv, Pay⟩ in the OS and PS views, respectively, then the new instance in the Retailer view is ⟨PO, CO, Inv, Pay⟩. This is possible based on condition ψ1. But in the former case (the PS-CRM view), none of the conditions allows the instances in the two sets to join: its instances are the result of a trivial union of the instances in the PS and CRM views. This criterion is referred to as TrivUnion(Rψ1, Rψ2) in line 8 of Algorithm 3. Generating candidate disjunctive conditions. In this phase, the set of candidate disjunctive conditions for the next level is generated (lines 12 to 17). This set is built by combining selected conditions from the previous level, i.e., the ones that were not pruned. We avoid computing disjunctions composed of conditions that have been tagged as non-interesting in the previous level, since new disjunctive conditions built using such conditions are necessarily not interesting. We define the following set of criteria to predict which candidate conditions are not potentially interesting, and so avoid computing the set of instances for them, which is a computationally expensive operation (see Sects. 4.3 and 6): (1) Associativity of conjunction and disjunction: A condition that combines disjunction and conjunction of the same atomic condition can be simplified into a condition previously explored. For example, the correlation condition ψ2 ∨ (ψ2 ∧ ψ3) is equivalent to the correlation condition ψ2 (Rψ2 ∪ (Rψ2 ∩ Rψ3) = Rψ2). It is not useful to compute such combinations, since the instance sets for the two conditions are the same. This criterion is referred to as notAssoc(ψ1∨2) in line 14 of Algorithm 3.
(2) Inclusion property: If Rψ1 is included in Rψ2, i.e., Rψ1 ⊂ Rψ2, then ψ1 ∨ ψ2 = ψ2. Therefore, ψ1 ∨ ψ2 is not minimal (since it boils down to ψ2), and we can avoid computing it. Moreover, when Rψ1 = Rψ2, then ψ1 and ψ2 partition the log identically (they produce the same views), i.e., Rψ1∨ψ2 = Rψ1 = Rψ2, and only one of them needs to be explored. This criterion is referred to as notInc(Rψ1, Rψ2) in line 14 of Algorithm 3.

4.3 Complexity analysis

As evaluated in Sect. 3.2, the theoretical number of possible correlation conditions that should be explored using a brute-force approach is N = O(2^(k²/2 + k/2)), assuming that k is the number of attributes in the dataset. The time complexity of exploring the space of all possible correlation conditions is O(N·p), in which p is the time complexity
of partitioning log L into instances for a condition ψ. The log partitioning consists of (i) computing the set of correlated message pairs Rψ, (ii) building the correlation graph Gψ(L), and (iii) decomposing it. The worst-case time complexity of computing Rψ is O(|L|²), which is the case when comparing each message in the log L to all messages after it. Building the graph Gψ from Rψ has a complexity of O(|L|²) as well (adding an edge for each pair of correlated messages in the graph). Finally, the time complexity of the graph decomposition algorithm is O(|V| + |E|), where |V| and |E| represent the number of vertices and edges in the graph; this is equivalent to O(|L| + |L|²) in the worst case. Summing up, we get p = O(|L|²). Therefore, the time complexity of solving the problem using a brute-force approach is O(N·|L|²). This complexity is exponential in the number of attributes in the log. In the following, we analyze the complexity of the proposed correlation condition discovery approach both in worst-case scenarios and in practical cases. Atomic condition discovery. The time complexity of atomic condition discovery depends on the number of attributes (k) in the dataset and the number of messages in the log (|L|). In the worst case, the time complexity is O(k²·p), i.e., O(k²·|L|²). Since k ≪ |L|, and its growth rate is also much smaller than that of |L|, the time complexity is O(|L|²). At any given time, Rψ and its corresponding graph for ψ are in memory. Therefore, the space complexity, in the worst case, is O(|L|²). This is the case for a key-based condition that has only one value, so that each message in the log is connected to all other messages. However, in practice, due to the definition of correlations between messages in the information systems of an enterprise, only a small part of all possible combinations makes sense.
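To make the blow-up of the brute-force search concrete, a small sketch of ours (it assumes, matching the bound discussed above, that the candidate space is the power set of the k(k+1)/2 unordered attribute pairs):

```python
def brute_force_candidates(k):
    """Upper bound on the number of correlation conditions over k
    attributes: every subset of the k*(k+1)//2 unordered attribute
    pairs (including an attribute paired with itself) is a candidate."""
    pairs = k * (k + 1) // 2  # = k^2/2 + k/2
    return 2 ** pairs
```

Already for k = 4 this gives 2^10 = 1,024 candidates, and for k = 28 (the SCM dataset) the bound is astronomically large, which is why the pruning criteria above are essential.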
Due to the effective pruning criteria and efficient implementation techniques, the time and space complexity are significantly smaller than the worst-case analysis suggests. The number of attributes with repeating values in real-world datasets is always much smaller than k. In addition, not all messages in a log are correlated with each other; in particular, in chain-based correlations, each message is only correlated with one other message in the instance. Moreover, looking forward from each message in the log for the messages correlated with it reduces the worst-case number of correlated pairs by half. Thanks to the combination of these facts, and to the use of database queries and indexing techniques (e.g., B-trees in our implementation) to compute Rψ, the time complexity in most practical cases is almost linear with respect to the database size (see Sect. 6 for experimental results). Conjunctive condition discovery. The time complexity of the conjunctive condition discovery algorithm depends on the number of conjunctive conditions for which the respective set of instances is computed. Let s be the number of such conditions. In the worst case, s = O(2^|A|), in which A represents
the set of atomic conditions (in the worst case, |A| = k²/2). During the computation, s conditions are discovered, so the time complexity for computing the conjunctive conditions is O(s·p), in which p is the time complexity of partitioning the log into instances according to a conjunctive condition ψ1∧2. This time complexity is the same as that for atomic conditions (see above). Therefore, in the worst-case scenario, the time complexity is O(2^|A|·|L|²). However, in practice, due to the pruning criteria, s can be significantly smaller than the worst-case analysis shows. In addition, in many systems, no more than a few (e.g., 4) atomic conditions are used in forming a conjunctive condition to uniquely represent instances; therefore, no more than a few levels have to be explored. Moreover, many of the possible candidates are pruned before computing their respective sets of instances, and many others are not even considered as candidates because their parent conditions are pruned. Hence, in practice, the complexity is much smaller than the worst-case analysis (see the experiments in Sect. 6). There is one set of correlated messages, Rψ1∧2, and its corresponding correlation graph in memory at a time. Hence, the space complexity, in the worst case, is O(|L|²). In most practical cases, the space complexity is significantly smaller than the worst-case scenario shows. Indeed, the space complexity depends on the size of the longest instance correlated using a key-based approach (denoted by cn_max). Even if all the other instances are of the same length (the worst case), there are approximately |L|/cn_max instances in the log. In this case, the space complexity is O((|L|/cn_max)·cn_max²) = O(cn_max·|L|). In most practical cases, cn_max ≪ |L|. The experiments validate this estimation (see Sect. 6). Disjunctive condition discovery.
The analysis of the worst-case and practical-case time and space complexities for disjunctive condition discovery is similar to that for conjunctive condition discovery. The time complexity, in the worst-case scenario, is O(s·p) = O(2^|CC|·|L|²), in which CC represents the union of the sets of atomic and conjunctive conditions discovered in the previous steps. However, in practical cases, s is significantly smaller due to the extensive set of pruning criteria applied to keep only the interesting conditions; this number is much smaller than that of the possible candidates. This is validated by the experiments reported in Sect. 6.
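As a numeric sanity check of the practical space estimate (an illustration of ours, not code from the paper): under a key-based condition, a group of n messages sharing a key value forms a clique of n(n−1)/2 correlated pairs, so |Rψ| stays far below the worst-case |L|² whenever instances are short:

```python
from collections import Counter

def key_based_pair_count(log, key):
    """Number of correlated pairs under a key-based condition: each
    group of n messages sharing a key value yields n*(n-1)//2 pairs."""
    groups = Counter(key(m) for m in log)
    return sum(n * (n - 1) // 2 for n in groups.values())

# A hypothetical log of 6 messages forming instances of sizes 3, 2, 1:
log = [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5), ("c", 6)]
pairs = key_based_pair_count(log, key=lambda m: m[0])
# 3 + 1 + 0 = 4 pairs, well below the cn_max * |L| = 3 * 6 bound
```

Here the pair count is bounded by roughly cn_max·|L|, in line with the O(cn_max·|L|) estimate discussed in the complexity analysis.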
5 Visual exploration and refinement

In this section, we present the front-end of Process Spaceship. We describe how the tool organizes the discovered process views into a process map, facilitates process view exploration, and supports the supervision of process view discovery and the refinement of results. Section 6.3 presents the user interface and reports on the experience of using the tool.
Fig. 6 A screenshot of Process Spaceship showing part of the process map discovered for the SCM business service
The role of the front-end and visualization is important, as understanding the results of automated event correlation by looking only at the set of discovered correlation conditions, or at the set of process instances for each correlation condition, would be difficult. To facilitate understanding the results, as shown in Fig. 4, we propose to discover the process model corresponding to a given set of process instances (for a given correlation condition). As mentioned in Sect. 3.2, we use existing process mining techniques (in particular [31]) for this purpose. As also mentioned earlier, a set of events, a correlation condition, the corresponding process instances, and the process model form a process view. We propose to organize the process views in a process map considering the relationships between process views. In the following, we explain how the process view relationships are discovered and how the process map is built.

5.1 Process map

We introduced the notion of a process map, a metaphor for organizing and navigating the discovered process views, in Sect. 2.2. A process map organizes the various process views based on the relationships that exist between the business processes of each view. Figure 3 shows a part of the process map for the SCM business service. Figure 6 shows a screenshot of Process Spaceship, which illustrates a larger portion of the process map discovered for the SCM business service. A process map consists of two visual elements: nodes (process views) and links (relationships between views). Each process view is represented as a node in the map. Building the process map consists of finding the relationships between process views and organizing the map in the 2-D space, as described in the following. Finding relationships between process views. A first step in building the process map is finding the relationships
between process views. In this paper, we consider only two relationships between process views, i.e., “part-of” and “subsumption” (see Definitions 5 and 6). Note that in general such relationships could be more diverse (see [47]). The approach that we take for efficiently computing the above-mentioned relationships between process views is to leverage the computation that took place during the condition discovery step. That is, we consider the relationships at the process instance level rather than between the process models of each process view. This is important, as otherwise we would have to compute process model relationships between each pair of process views, which is a computationally very expensive task. The two approaches are equivalent: a subsumption of process models translates into an inclusion property between the sets of process instances of each process view. During the computation of composite conditions from Rψ1 and Rψ2, the inclusion between these two is tested (see Sect. 4.2). In disjunctive condition discovery, by default, using the trivial-union criterion ensures that the subsumption relationship is only allowed between the process views in the highest level and those in the immediately lower level. A part-of relationship exists between a composite condition ψ and the conditions of its parent nodes in the map. In fact, instances corresponding to a correlation condition ψ1 are part of instances built using a condition ψ1∨2. Conversely, instances built using a condition ψ1∧2 are part of instances built using ψ1. This approach enables us to find the relationships between process views at almost no additional cost. Organization of the process map. Another aspect of building the process map is the layout of the map and its organization in the 2-D space.
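Because inclusion between the Rψ sets is already tested during condition discovery, the map's subsumption edges come almost for free. A simplified sketch of ours (view names and the pair-set representation are hypothetical; the paper works with the relations computed in Sect. 4.2):

```python
def subsumption_links(views):
    """views: dict mapping a view name to its set of correlated
    message pairs (R_psi). Returns (child, parent) edges where the
    parent's pair set strictly includes the child's, i.e., the
    parent view subsumes the child view."""
    links = []
    for child, Rc in views.items():
        for parent, Rp in views.items():
            if child != parent and Rc < Rp:  # strict set inclusion
                links.append((child, parent))
    return links

views = {
    "OS": {(1, 2)},
    "PS": {(3, 4)},
    "Retailer": {(1, 2), (3, 4)},  # union of the OS and PS pairs
}
```

With these hypothetical pair sets, both OS and PS are linked upward to Retailer, mirroring the Fig. 3 example where the Retailer view's correlated messages are the union of those of OS and PS.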
To be consistent with having process views at different levels of abstraction, we organize the process views in levels such that the highest level contains the largest and most generic processes (i.e., processes neither subsumed by nor part of any other). Correspondingly, the lower levels represent processes that correspond to more specific and precise interactions. For the layout, we propose to place the process views corresponding to atomic and conjunctive correlation conditions at the bottom level. The other process views, which correspond to disjunctive conditions, are placed in higher levels based on the number of atomic or conjunctive conditions that are used in the disjunction. In this sense, the highest node(s) have the largest number of atomic/conjunctive conditions. This layout makes it possible to have more specific, single-system or user-specific views placed in the lower levels, while more inclusive, abstract, and larger process views are at the higher levels. This is supported by having the process views that subsume other process views at a higher level of the map. With this layout, from any given process view, users can navigate the map by traversing the links up toward larger, more abstract views and down toward finer grained, less abstract
views. It should be noted that other layouts could also be used to present the process map. Process view metadata. Each node in the map is associated with metadata that characterize the corresponding process view: (i) Metrics: these provide high-level information on the process instances of the view, such as the number of instances, their minimum, average, and maximum lengths (i.e., the number of messages in an instance), and their durations (the time difference between the first and the last messages of an instance); (ii) Correlation approach: this indicates the approach used for correlating messages, i.e., key-based, reference-based, or a combination of these; (iii) Business process: a graphical model of the business process inferred from the instances of the view by applying the process discovery algorithm [31,33]; (iv) Service list: if the WSDL interfaces corresponding to the Web services involved are known, we can list the services that are concerned with the view by examining the different types of messages involved in the corresponding instances. This allows identifying the dependencies among services in the given view [36].

5.2 User-driven discovery and refinement

Our approach is user-driven to account for the subjectivity of event correlation and process view discovery. We enable users to supervise the correlation discovery process. This is important as (i) users can lead the search toward views that interest them (to account for the subjectivity of process view discovery), and (ii) although the goal of the automated approach is to minimize the risk of false positives (i.e., the inclusion of irrelevant views) and false negatives (i.e., the exclusion of an interesting view), this risk cannot be entirely avoided. In addition, there are exceptions to the heuristics discussed in Subsect. 4.1.1. For example, in some applications, customerID and countryCode may be used as correlators.
However, countryCode is a categorical attribute with a relatively small domain and is therefore likely to be pruned. In such situations, user knowledge and input can ensure proper results. We allow users to supervise all steps of the discovery process, from correlator attribute selection to atomic, conjunctive, and disjunctive condition discovery. Before (resp. after) each step, the user can inspect the input (resp. output) and the corresponding views, and refine them by adding/removing candidate views. This facility proved effective in practice (see Sect. 6). The operations AddView and RemoveView allow users to instruct the tool to further explore some directions, after automated discovery is finished, or to remove unrelated results.

6 Experiments

Implementation. The components of Process Spaceship have been implemented as Eclipse plug-ins, using Java 5.0 as the programming language and PostgreSQL 8.2 as the database management system. For the implementation of the graph decomposition algorithm, we used the JGraphT library (release 0.7.2, http://jgrapht.sourceforge.net), a free Java graph library that provides mathematical graph-theory objects and algorithms. All experiments have been performed on a notebook machine with a 2.3 GHz Duo Core CPU and 2 GB of memory.

6.1 Datasets

We carried out experiments on three datasets, whose characteristics are summarized in Table 3.

Table 3 Characteristics of the datasets

  Dataset             SCM     Robostrike   PurchaseNode
  Service operations  14      32           26
  Messages in log     4,050   40,000       34,803
  Attributes          28      98           26

SCM. This dataset is the interaction log of an SCM business service, developed based on the supply chain management scenario provided by WS-I (the Web Service Interoperability organization, http://www.ws-i.org), for which a simplified business process is depicted in the SCM view of Fig. 3 (in fact, this figure shows a part of its process map). There are eight Web services realizing this business service. The interaction log of the Web services with clients was collected using a real-world commercial logging system for Web services, HP SOA Manager (http://managementsoftware.hp.com/products/soa). The services in the SCM scenario are implemented in Java and use Apache Axis as the SOAP engine and Apache Tomcat as the Web application server. The log has 4,050 tuples, each corresponding to an operation invocation. The process of SCM has three paths: the instances of one path are correlated using disjunctive conditions (the same as those of the Retailer view in Fig. 3), another using an atomic condition (not shown in the figure), and the last using a conjunctive condition (the same as those of the CRM view in Fig. 3). HP SOA Manager records metadata about message exchanges in 13 attributes, and we extracted 15 attributes from the messages in this dataset. This dataset mainly provides an example of a system whose instances are correlated in a chain-based manner.

Robostrike. This is the interaction log of a multi-player on-line game service called Robostrike (http://www.robostrike.com). In this game, clients (players) exchange XML messages with the game service to perform various operations, e.g., designing new games and playing. Each session of a player may include several game plays or game creations. The log contains 40,000
messages (Table 3), which correspond to one day of activity of the game service. In a pre-processing step, we extracted all the attributes of the messages to present them as a single relation. This dataset represents a system whose instances are correlated using a key-based approach and are very long.

PurchaseNode. This process log was produced by a workflow management system supporting a purchase order management service called PurchaseNode (PN). The PN dataset contains 34,803 tuples corresponding to task executions within workflow instances (Table 3). It is the private process log of a service in which all messages are correlated using atomic conditions. This dataset was originally organized into two tables: one for the workflow definitions and the other for the workflow instances (execution data). The workflow definition table defines 14 workflows using different combinations of the 26 workflow tasks. For this experiment, we joined the two tables based on the workflow identifier. By using this dataset, we also test the applicability of the approach to process logs. This dataset is an example of a system whose instances are correlated using a key-based approach.

6.2 Evaluation of the correlation discovery approach

Evaluation criteria. We evaluate our proposition along three dimensions: (i) the quality of the discovered views, (ii) the execution time, and (iii) the contribution of the proposed criteria to pruning the search space, as described in the following subsections.

6.2.1 The quality of discovered views

The quality of the results is assessed using the classical precision and recall metrics. Precision is defined as the percentage of discovered views that are actually interesting. Recall is computed as the percentage of interesting views that have actually been discovered. Evaluation approach.
The approach we have taken for evaluating the interestingness of process views is to have human users, who have knowledge about the dataset and the related process, look at the discovered process views and identify what they consider relevant and interesting from a business perspective, i.e., the interesting process views must represent (part of) a meaningful business process in the context of the work of the user(s). In this setting, the event correlation process is carried out in supervised mode to take into account the inputs of the user at each step.

Robostrike. For the Robostrike dataset, the set of discovered process views was evaluated by the game service developers and the founder. In this case, the recall was 90% (9 out of 10 expected process views were discovered). One of the expected atomic conditions was not among the results.
Looking back at the log, we found that the attribute of this condition is a categorical attribute that was pruned due to its small number of distinct values. In total, 12 process views were discovered. Among these, 7 are based on key-based atomic conditions, which correspond to views representing instances of individual games, individual user sessions (that include several games), multiple sessions of the same player, etc. Three of the remaining conditions are conjunctive, each formed using two atomic conditions. One of these three corresponds to a view that refers to the behavior of individual users in different games, so it is interesting. Finally, the remaining two conditions are disjunctive, each consisting of two atomic conditions. One of them corresponds to an unexpected view that was surprising for the dataset owners: it refers to the correlation of messages based on private chat conversations among players. This view highlights the behavior of communities of players that are talking about a given game. In summary, for this dataset, the precision was 75% (9 out of the 12 discovered views were interesting).

PurchaseNode. The dataset owner advised that this dataset contains events related to one process, and that these events are correlated in their application using the attribute flowInstanceId present in the log. The smaller number of selected conditions and views in this dataset makes it possible to report the experiments in more detail. In the first step, i.e., correlator attribute identification, 4 attributes out of 26 are selected. Most pruned attributes are filtered based on the non-RepeatingValues criterion, except one that is pruned for having all unique values in the dataset (a unique identifier for each message). The three remaining attributes are flowInstanceId, startTime, and endTime.
In particular, attributes startTime and endTime are selected because some tasks were started or finished at the same time, so these two attributes have repeated values. Four conditions are defined based on these three attributes (three key-based, and one reference-based between startTime and endTime). By applying criterion imbalanced PI to the number of instances formed using each condition, the only condition that remains is the one based on the attribute flowInstanceId. Since, in this specific case, the only discovered condition is also the only interesting correlation, this results in a recall of 100% and a precision of 100%.

SCM. This is a synthetic dataset that we created based on WS-I and our knowledge of the supply chain of one of HP's customers. We designed it to test scenarios that are not covered by the Robostrike and PurchaseNode datasets. The expectation for the evaluation on this dataset was to test the capability of the approach in discovering process views covering the business processes of more than one system. In the following, we discuss the details of this experiment. In the attribute selection phase, after pruning based on the criterion non-RepeatingValues, there are 9 attributes
identified, which are customerID, quoteID, oID, invID, payID, shipID, custID, surveyID, and rfpID. These lead to forming 9 key-based atomic conditions. The atomic condition based on customerID is used to correlate messages in the catalog system. Atomic conditions based on quoteID, oID, invID, payID, and shipID are used to correlate messages in the quoting, ordering, invoice, payment, and shipment systems, respectively. There is one conjunctive condition discovered, formed from the conjunction of customerID and surveyID. This condition correlates messages in the customer relationship management (CRM) system. In fact, customerID is used for correlation in two systems: in the catalog system and in the CRM system, where it is used as part of a conjunctive condition. There are 11 process views discovered with disjunctive conditions. One of these conditions is the disjunction of the atomic conditions based on customerID, quoteID, oID, invID, payID, and shipID. This condition corresponds to the Retailer business service offering all purchase order management services in the enterprise (see Fig. 3). Another disjunctive condition, formed from the atomic conditions on custID and rfpID, is related to the view corresponding to the product system. The view with the highest number of conditions is formed using the disjunction of all the above conditions. This view represents the model of interactions in the whole SCM scenario. There are 9 disjunctive conditions that represent intermediate views in the process map. For instance, they represent the interactions between the quoting and ordering systems, the ordering and invoice systems, the invoice and payment systems, and so on. The results demonstrate the capability of the approach in discovering process views of individual systems as well as that of the enterprise.
It should be noted that in addition to the attributes of the content of the messages (introduced above), there are attributes in this log that are meta-data attributes recorded by HP SOA Manager for each event. For instance, in the case of SCM, some of these attributes exceed the attribute pruning thresholds but are not relevant to the process. The non-relevant attributes that meet the threshold in the SCM dataset are RequestSize and ResponseSize. Also, some process-relevant attributes are pruned as they do not meet the non-RepeatingValues criterion (e.g., in the case of Robostrike). We learned that attribute selection is an important step in event correlation, and if it is done purely based on statistical properties and heuristics, without taking user input into account as well, it could have a negative effect on the quality of results and the amount of computation (by not effectively pruning the search space). We facilitate the job of users in providing feedback on the step of identifying relevant correlator attributes by offering attribute meta-data information in the tool, as described in Sect. 6.3. In the case of SCM, as we run in the supervised mode, we can filter the above-mentioned two attributes from the results based
on the feedback mechanism. However, if they are not filtered by users at this step, they lead to two atomic conditions and therefore to two process views with no business semantics. In summary, for SCM, since all the expected process views are among the discovered views, the results lead to a recall of 100% and a precision of 91%, assuming the two views related to RequestSize and ResponseSize are not filtered by the user.
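One possible reading of the imbalancedPI criterion applied in these experiments can be sketched as follows. The function name, the thresholds, and the exact tests are illustrative assumptions, not the paper's definition: the idea is simply to prune a condition when its partition into process instances is degenerate.

```python
def imbalanced_pi(instance_sizes, total_messages,
                  max_instance_frac=0.8, min_nonsingleton_frac=0.1):
    """Return True if the partition into process instances is degenerate.

    A condition is pruned when a single instance swallows most of the
    log, or when almost every instance is a singleton. Thresholds are
    illustrative assumptions.
    """
    if not instance_sizes:
        return True
    if max(instance_sizes) / total_messages > max_instance_frac:
        return True                      # one giant instance
    nonsingleton = sum(s for s in instance_sizes if s > 1)
    if nonsingleton / total_messages < min_nonsingleton_frac:
        return True                      # nearly all singletons
    return False
```

For PurchaseNode, a test of this kind would keep the flowInstanceId-based condition (many mid-sized instances) while pruning conditions that lump most messages into one instance or break the log into singletons.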
6.2.2 Execution time

PurchaseNode. Figure 7a shows the execution time of the approach on the PurchaseNode dataset for 6 different sizes of the database. In this case, only atomic conditions are discovered. Hence, only the execution time of atomic conditions discovery is presented. This chart allows comparing the amount of time spent on different sub-steps, i.e., computing Rψ, Gψ, PIψ, and applying pruning criteria (denoted by Other). It can be seen that more than 90% of the time is spent on computing Rψ. This step is performed as an SQL query over the PurchaseNode database, while the other steps are mainly performed in memory. Rψ for the condition flowInstanceID = flowInstanceID takes the longest time among all conditions, due to the longer length of its instances. Rψ for this condition has 48,476 message pairs in the dataset with 15,000 messages. The selected attributes and discovered conditions are the same for all dataset sizes. The time denoted by Other represents the time spent on applying pruning conditions (here, the non-RepeatingValues and imbalancedPI criteria). This time is in the same range as that of computing Gψ, but less than those of computing PIψ and Rψ. The time increase is nearly linear as the dataset size increases. Robostrike. The overall execution time of the approach on the Robostrike dataset is shown in Fig. 7b. The size of the dataset is varied from 2,500 to 15,000 messages in the log. This chart also compares the execution time for the three steps: atomic, conjunctive, and disjunctive conditions discovery. On average, 36% of the time is spent on discovering atomic conditions, 20% on conjunctive conditions, and 44% on disjunctive conditions. Figure 7c shows the execution time of atomic conditions discovery for this dataset. On average, the shares of computing Rψ, Gψ, PIψ, and applying pruning criteria (Other) are 81, 14, 1.5, and 3.5% of the time, respectively.
The reason computing Rψ takes 81% of the time is that the atomic conditions in Robostrike are key-based, and there are very long instances in this dataset; e.g., the maximum length of instances for an atomic condition in a dataset with 15,000 messages is 821 messages, with Rψ having 2,140,502 message pairs. Having long instances is normal since, during a game, many messages are usually exchanged between the service and the player's application.
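As noted, Rψ is computed as an SQL query over the log database. A minimal sketch of this self-join for a key-based atomic condition follows; the table and column names are assumptions for illustration.

```python
import sqlite3

# Toy log: message id plus the correlator attribute of the condition.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (msg_id INTEGER, flowInstanceId TEXT)")
conn.executemany("INSERT INTO log VALUES (?, ?)",
                 [(1, "f1"), (2, "f1"), (3, "f2"), (4, "f1"), (5, "f2")])

# Rψ for the condition flowInstanceId = flowInstanceId: every unordered
# pair of messages agreeing on the attribute, produced once via
# m1.msg_id < m2.msg_id. An instance of length n contributes n*(n-1)/2
# pairs, which is why long instances dominate the running time.
r_psi = conn.execute("""
    SELECT m1.msg_id, m2.msg_id
    FROM log m1 JOIN log m2
      ON m1.flowInstanceId = m2.flowInstanceId
     AND m1.msg_id < m2.msg_id
""").fetchall()
```

For this toy log, Rψ contains the pairs (1,2), (1,4), and (2,4) for instance f1, and (3,5) for f2.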
Fig. 7 The evaluation results of the approach on three datasets. a The execution time of the approach on the PurchaseNode dataset. b The execution time of the approach on the Robostrike dataset—all steps. c The execution time of the approach on the Robostrike dataset—atomic conditions discovery. d The execution time of the approach on the Robostrike dataset—conjunctive conditions discovery. e The execution time of the approach on the Robostrike dataset—disjunctive conditions discovery
Figure 7d shows the details of the execution time of the conjunctive conditions discovery on this dataset for various sizes. In this phase, on average, 48% of the time is spent on computing Rψ, 11% on computing Gψ, 1% on computing PIψ, and 40% on applying pruning criteria. The reason that applying pruning criteria has a significant share of the execution time is that applying some of the criteria, e.g., checking for inclusion, requires executing queries on the database. The correlation graph in this case is less connected (having smaller instances), so the time for computing connected components is small. Finally, Fig. 7e shows the details of the execution time of the disjunctive conditions discovery on this dataset for various numbers of messages. This chart shows that 25, 48, 4, and 23% of the time, on average, is spent on computing Rψ, Gψ, PIψ, and applying pruning criteria, respectively. It can be seen that most of the time is spent on computing the correlation graph (the graph is built by parsing Rψ and creating an edge for each correlated message pair in Rψ). The reason for this significant time is that the correlation graphs for disjunctive conditions can be very large in this dataset (since the set of edges of a correlation graph for a disjunctive condition is the union of those of its parent conditions), given the key-based correlation and long instances. Experiments show that the discovered conditions are the same for the datasets with 7,500 messages and higher. The reason is that in the Robostrike dataset, playing sessions of users typically last between 1 and 4 hours, that is, about 7,000 messages in the dataset. Hence, if we have 7,000 or more messages, then we have complete instances in the log. Identifying the appropriate size of a dataset in which there exist complete instances is a domain-specific task. Our tool allows users to specify the size of the dataset either as the number of messages or in terms of the average instance duration. SCM. Table 4 shows the execution time of the approach on the SCM dataset. This dataset represents a case where instances are correlated using a chain-based approach.
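The construction of the correlation graph from Rψ and the grouping of messages into process instances as its connected components can be sketched with a union-find structure. This is an illustrative sketch, not the paper's implementation.

```python
class UnionFind:
    """Disjoint sets over message ids (sketch, assumed data structure)."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def instances_from_pairs(r_psi):
    """Group messages into process instances: the connected components
    of the correlation graph whose edges are the pairs in Rψ."""
    uf = UnionFind()
    for a, b in r_psi:
        uf.union(a, b)
    groups = {}
    for node in list(uf.parent):
        groups.setdefault(uf.find(node), set()).add(node)
    return list(groups.values())
```

For a disjunctive condition, the edge set is the union of the parents' Rψ sets, so the same routine applies to the merged pair list; the growth of that merged edge set is what makes this step dominant for key-based correlations with long instances.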
Table 4 The execution time of the approach on the SCM dataset (in s)

              Rψ       Gψ       PIψ      Other    Total
Atomic        0.562    0.016    0.031    0.25     0.859
Conjunctive   0.079    0.002    0.001    0.34     0.422
Disjunctive   0.606    0.422    0.205    1.22     2.453
Hence, in this case, the execution time of the atomic conditions discovery is small (0.859 s), compared to 63 s for a sample of the same size from the Robostrike dataset. For this dataset, 66% of the time is spent on computing disjunctive conditions, 11% on computing conjunctive conditions, and 23% on computing atomic conditions. Discussion. The above shows that the proposed approach is efficient, especially in cases where the correlation follows a chain-based approach. The reason is that in such cases not all messages in an instance are correlated pairwise, so the set of correlated messages (Rψ) for such conditions is relatively small. In addition, the proposed approach is very effective in pruning the search space for computing composite conditions. Indeed, in composite conditions discovery, Rψ, Gψ, and PIψ are mainly computed for relevant conditions. In cases of key-based correlation with long instances (Robostrike), a significant time is spent on computing Rψ in all steps, and also on Gψ in computing disjunctive conditions. One way to improve this time complexity is to use approximate approaches for computing Rψ and Gψ. Investigation into these optimization techniques is proposed as part of future work.

6.2.3 Search space pruning

As discussed above, the proposed criteria proved effective in identifying and pruning non-interesting candidates. The analysis of the proposed criteria in pruning the search space in the discovery of conjunctive conditions for the Robostrike dataset shows that the imbalancedPI criterion is responsible for 31%, inclusion for 25%, monotonicity for 25%, and attribute definition constraints for 18% of the pruning. These numbers show that almost all criteria are equally important for search space pruning in discovering conjunctive conditions for this dataset. In the discovery of disjunctive conditions, the imbalancedPI criterion has a share of 4%, inclusion 13%, monotonicity 50%, associativity 9%, and finally the trivial-union criterion 23%.
This shows that the monotonicity and trivial-union criteria are the most useful criteria for pruning the search space in discovering disjunctive conditions for systems where messages are mainly correlated using key-based conditions.

6.2.4 Results discussion

Looking at the evaluation, we highlight the following observations:
– We have designed the evaluation so that the approach is tested using different types of process event logs, i.e., a real-world workflow log (PurchaseNode) as a single-process log, a multi-service interaction log (SCM) as a multi-process log, and Robostrike, whose traces correspond to the complex logic of a real-world, online game service. These datasets cover typical logs found in the interactions of services. Given that the approach enables successful discovery of correlations in the above scenarios and for different users, we conclude that the approach is generic and applicable to various scenarios. This conclusion is also supported by a successful application of this approach and the toolset for discovering correlations and process views on the log of a customer-facing application at HP, and plans are underway to use it in one of HP's products.
– The evaluation shows that the approach performs very well in terms of search space pruning as well as in achieving a high quality of the discovered process views (as discussed earlier).
– The evaluation shows a polynomial (nearly linear) increase in the execution time of the algorithms with respect to the dataset size.
– Based on the lessons learned, we believe the execution time of the algorithms can be improved by applying more efficient techniques, such as approximate approaches for estimating the size of conversations (discussed further under future work).
– We have learned that in some applications the correlation conditions may not always be expressed as the equality of attributes; further investigation into this aspect is planned for future work.

Application scenarios. The event correlation approach in this paper is designed for and applicable to service-based processes, i.e., it operates on the log of interactions of a set of services that participate in one or more business processes. The approach is a reverse-engineering exercise that starts by looking at the events related to messages that are exchanged among services.
We make the assumption that the information that allows us to correlate service events is captured and present in the log. The application of the approach in this paper becomes more important considering that the relationships of services in the enterprise are becoming more dynamic, and therefore the involvement of services in various processes may dynamically change. Using our approach enables us to identify such evolving relationships by looking at the log of service interactions. In particular, the application scenarios of this work include the following: (i) discovering the process space of a department of an enterprise (the set of processes over different services), superimposed over a set of interacting services, (ii) finding all the processes that a given service is involved in, and (iii) finding the process views that a particular user
is interested in, within a department, over the interactions of a set of services. We are witnessing a growing need for such approaches in industrial environments. For example, at Hewlett Packard, within the context of the Business Availability Center application, we have seen the need for identifying the relationships between services in the context of tracing complex business transactions across heterogeneous environments involving services. As an application scenario, we are in the process of using the methods introduced in this paper for this purpose. Limitations. Currently, the application of the tool is limited to the following settings: (i) the supported language for correlation is based on the equality of values of event attributes; if the correlation of service events in a log is defined using other methods, this is not supported (see Sect. 8 for a discussion of possible directions to extend the language); (ii) in the case of a heterogeneous setting in the enterprise, where the log data formats of services are not homogeneous from a syntactic or semantic point of view, we assume that the integration of log data has already been performed and that the input to the correlation discovery process is in the format defined in Sect. 2.2; and (iii) we assume that the data in the log is not noisy or incomplete (i.e., no service events related to process instances are missing from the logs). The approach in this paper needs to be extended to deal with noisy or incomplete service logs.

6.3 Experience in using Process Spaceship

Process Spaceship [30] (see Fig. 6) implements the proposed approach and provides visual facilities to discover and refine process views. The tool can be used by the process architect in the enterprise, the actor responsible for the design and definition of process models in the enterprise. The capabilities of the tool are described in [30]. Here, we give a brief overview of how Process Spaceship simplifies the job of the process architect using the SCM dataset.
The process architect starts from the integrated event log and can operate the tool in two modes: automated or semiautomated. In the automated mode, the architect instructs the tool (using buttons in the top-left in Fig. 6) to discover the set of potentially interesting process views based on heuristics. The discovered process views are organized in a process map, similar to the one in Fig. 3. For SCM, 7 out of 9 bottom-level views correspond to views of individual systems, which are interesting to keep in the map. These can be quickly identified by looking at the map, as nodes in the map are labeled with the names of systems that they represent. In addition, when a view is selected, the meta-data related to it (e.g., its process model and statistical meta-data) are displayed in the bottom-right frame. Finally, the architect may look at the highest level view, which corresponds to the SCM view.
In the semi-automated mode, the architect can supervise the process view discovery. This is conducted in three steps: candidate attribute selection, simple condition discovery, and composite condition discovery. The tool can capture the architect's knowledge in terms of the expected correlation pattern (key-based or reference-based) and the average number, duration, or length of instances for various systems. This information is used to direct the search toward desired process views. In addition, before and after each step, the architect is provided with a set of meta-data (including statistical meta-data and process models) that helps in making informed decisions, e.g., to keep a condition for consideration or to remove it. This interactive discovery and refinement allows users to effectively find interesting views and avoid the discovery of unrelated views. User study. The Process Spaceship tool has been used by a set of test users who are students and university colleagues working in the same area. We asked these users to report on the ease of use and the amount of effort needed to run the tool in the two modes. Before the evaluation, we provided a brief tutorial about the context and the data sources, so that they had domain knowledge about the logs. We asked the users to rate each step of the correlation discovery process from 1 (very easy) to 5 (difficult). The evaluation results for the semi-automated mode show that the steps of data source selection, service selection, and attribute selection, as well as the wizards for discovering atomic, conjunctive, and disjunctive conditions, are rated 1 (very easy and intuitive). The attribute selection step also includes an area for advanced users to adjust the default values of thresholds. This threshold selection page is rated 3 due to the lack of a description in the page explaining the implications of the adjustments on the correlation results.
The automated mode with the default settings has been rated 1 (very easy and intuitive), as the user directly arrives at the process map. It also includes a page for advanced users to select the thresholds used in the various steps of the approach. This threshold selection in the automated mode was also rated 3, due to the need for an explanation of how changing the thresholds affects the results. We explained to the users that (1) user input and feedback, and specifically threshold adjustment, are optional and provide them with the opportunity to guide the tool toward discovering the processes that they are interested in; and (2) their input and feedback in various steps, and especially in the candidate attribute selection and candidate condition selection steps, override the selections made automatically through applying thresholds. Therefore, the values of the thresholds do not have a fundamental impact on the results, as long as the user takes the opportunity to review the candidate attributes/conditions before each step and provides feedback on the results after running each step. Nevertheless, the values of the thresholds are set based on heuristics that proved effective in the experiments.
Finally, although this usability evaluation was not professionally designed, the experience of these users, who were average end users, is still significant in demonstrating the usability of the tool and the reasonably low effort needed to run it. In summary, the tool facilitates saving considerable time and effort compared to what would have to be done without tool support. Integrating this tool with available process analysis and tracking tools (e.g., [6,21,22]) simplifies the job of process analysts and end users. We are in the process of integrating this tool into the HP (Mercury) process and service interaction analysis toolkit.
7 Related work

We discuss related work in two categories: (i) business processes and (ii) process event correlation.

7.1 Business processes

Workflow and business process management. As discussed earlier, WfMSs support the definition, development, execution, and maintenance of business processes. BPMSs, as an extension of classical WfMSs, focus on the analysis, prediction, and tracking of business processes. WfMSs and BPMSs cover only operational business processes, i.e., the ones that are explicitly designed and modeled. In contrast, process spaces aim at providing complementary support to look at process execution from the perspectives of various people and systems. Moreover, the process space aims to identify the relationships of various systems in terms of process executions and provides an opportunity to explicitly define them. An additional important difference is that in a WfMS it is assumed that the correlation of events into process instances is predefined, whereas we propose to discover it from the information items in the enterprise. Another key distinguishing feature is that in process spaces we recognize the independence of the underlying IT systems that execute the process, and we intend mainly to provide an understanding of process execution on the aggregate of existing systems, whereas a BPMS (WfMS) assumes full control of the systems (data sources) that execute the underlying process. Business activity monitoring (BAM). Available solutions for BAM, e.g., Oracle BAM [34], focus on processing real-time events at the middleware level and providing performance indicators in a dashboard. These approaches take advantage of event processing systems, e.g., [13]. However, BAM systems enable business process analysis based on a data-centric approach rather than a process-centric one. Nevertheless, a PVDS is complementary to event processing systems and BAM tools where event correlation is needed in the context of process executions.
Process views. The notion of "process view" has been used in the literature with a different meaning, referring to an abstract representation of a process model used to reduce the complexity of process presentation or visualization (e.g., [9,26,43,48]). For instance, in [9], a process view for a given process model is derived from the process model by reducing and/or aggregating tasks in the original process model. In [26,48], process views are derived from a process model based on the role of a user. In contrast, we do not assume a known process model out of which process views are derived. Our definition of a process view consists of a set of events, the respective process instances, and the process model that results from the correlation of process events using a certain correlation condition. Different process views may be defined on the same set of events, corresponding to different correlation conditions. Nevertheless, process views in our context may also be presented at various levels of abstraction and may represent only part of another coarser-grained process view or be subsumed by it. Process mining techniques. Process mining techniques enable discovering the process model followed by a given set of process instances. There exists an extensive body of work and algorithms for process mining (for a survey, see [45]), including our prior work [31]. The problem that we focus on in this paper is that of event correlation, to group process events into process instances. Therefore, our work is complementary to process mining techniques, as we enable grouping events in the log into process instances that are then input to process mining algorithms. In particular, to help users understand the process exhibited by a given way of correlation in a visualized manner, we apply a process mining algorithm [31] to discover the corresponding process models and show them to the user. Therefore, we use these techniques in this paper but do not innovate in this space.
From the process mining perspective, at a conceptual level, the innovation of our work is discovering various process views from the log based on various ways to correlate the events.

7.2 Process event correlation

In the following, we show why the problem of process event correlation cannot be addressed by formulating it as one of the classical problems and discuss related areas in process event correlation. Functional dependencies. The problem of event correlation can be seen as related to that of discovering functional dependencies in databases. In functional dependency inference [25,37], the problem is to identify properties of the form A → B, where A and B are sets of attributes of the relation, that hold for a significant subset of the tuples of the relation. Many types of functional dependencies (e.g., nested, multivalued, join) have been explored [19]. In contrast to a functional dependency, a correlation condition is not expressed
over the tuples of a database but among these tuples. That is, a functional dependency considers sets of tuples that agree on a subset of their values, while correlation discovery considers sets of tuples that may agree on a single value, only pairwise, and this value may be common to only a pair of tuples or a small subset of them (up to a process instance length). Moreover, there may exist valid functional dependencies between attributes that are not indicative of valid correlation conditions. The knowledge of a strict dependency between attributes could be used to restrict the search space for correlation conditions by avoiding the evaluation of conditions that are potentially equivalent, but our algorithm already avoids generating equivalent conditions through the more general inclusion property (see the generation of candidate conditions in Subsects. 4.2.2 and 4.2.3). Composite and foreign key discovery. The problem of discovering (composite) keys in relational databases [40] consists of identifying a subset of attributes that uniquely identify individual tuples. Unlike the event correlation problem, it is possible to objectively assess the validity of a composite key (it has to uniquely identify each tuple). The problem of foreign key discovery [12] is concerned with finding pairs or subsets of attributes such that a join operation can be performed using these attributes. A correlation condition characterizes a join, and a process instance can be computed as the transitive closure of this join. Event correlation discovery can benefit from foreign key discovery methods and proposals for the rapid estimation of join sizes (e.g., [3]) to improve the performance of the discovery of atomic conditions.
However, neither the availability of join-size estimates nor foreign key discovery algorithms are sufficient to find correlation conditions, as criteria such as the transitive closure size and the resulting partitioning in terms of process instances have to be examined for each condition. The exploration of the solution space for correlation discovery can be achieved by leveraging properties (such as inclusion or monotonicity), as done in our approach. Association rule mining. Association rule mining techniques identify values that co-occur frequently in the tuples of a dataset [2,20]. However, frequent co-occurrence of values is not a sufficient criterion to identify all the correct process views. For instance, in the reference-based correlation method, a value used for correlating two messages appears only twice in the entire dataset (once in each of the two messages). Moreover, while association rule mining looks for values that co-occur in the same tuple, message correlation is concerned with values that occur in different messages of the dataset. Classification. Since instances are drawn from partitioning the log, building them could be seen as a classification problem [1,18] where each instance is a distinct class. However, classification approaches assume a fixed (and rather small) number of classes, while instances come in an unbounded and unknown number, depending mainly on the size of the log
considered. Moreover, messages are sometimes correlated by reference to each other rather than by reference to the class (instance), making it impossible to define a classification function on a message-by-message basis. Furthermore, classification approaches rely on pre-classified instances to infer the classification function. In our work, process instances to be used as training examples are not available. Note that inferring the correlation conditions from a collection of instances, if available, would be an interesting and complementary problem to explore. Clustering. One might also argue that correlation could be formulated as a clustering problem [18,24]. In clustering, the relationship between members of a cluster is typically assessed by their relative proximity according to, e.g., some distance measure. However, messages of the same instance may be very different (e.g., a purchase order and a payment message), while messages of two distinct instances may be very similar (e.g., two purchase orders for the same products). In fact, two messages of the same instance may well have nothing in common, due to the transitive nature of the correlation mechanism. Hence, clustering approaches, as well as other similarity-based approaches such as record linkage [15], could only be used provided a suitable distance measure is defined. Defining this measure is equivalent to identifying how to correlate messages, which is precisely what is done in this paper. Session reconstruction. Web usage mining has raised the problem of session reconstruction [41]. A session represents all the activities of a user on a Web site during a single visit. Identification of users is usually achieved using cookies and IP addresses, or through heuristics on the duration and behavior of the user. A time-based approach for identifying Web service sessions is presented in [14].
The proposed approach heuristically assumes a session duration threshold and then evaluates the quality of the sessions based on the assumption that the resulting sessions should be similar in terms of the number of services consumed, the order of services consumed, and the like. The threshold is then updated until sessions of satisfactory quality are found. This approach enables discovering sessions only when all sessions are well separated, i.e., there are no concurrent sessions with the same service, and when sessions are similar. These conditions cover a small portion of real cases. Moreover, according to our study and [5], time is not enough for message correlation in services, and messages in long-running service transactions are often correlated based on their content. Application dependency. Correlation is often cited in the context of dependency discovery, where the task is to identify whether some events may depend on some others. However, correlation in that context bears a different meaning than the one intended in this paper. It refers to a temporal dependency between events where, for example, an event is
a cause that triggers one or more subsequent events. Examples of approaches in this category include several statistical approaches to numerical time series correlation [27] and event correlation for root cause analysis [42]. It is possible that events that are identified as dependent by the above approaches turn out not to be part of the same process instance. Conversely, events of the same process instance may not be recognized as dependent by these approaches if they occur very far from each other. Correlation in web services. The problem of correlation in Web services and the need for automated approaches was first reported in our earlier work [8], where a real situation concerning how to correlate service messages is presented. Correlation patterns in Web service workflows are studied in [5], where three categories of function-based, chain-based, and aggregation functions are identified. This categorization covers the correlation classes discussed in this paper; however, the authors do not provide automated support for message correlation. The need for automated approaches for the correlation of service messages is also raised in [35]. This work proposes an approach to discover the correlation between message pairs (e.g., the PurchaseOrder and Invoice message pair) from the log of service interactions. In addition to the fact that our approach for message correlation for Web services (found in [32]) is the first work reported in this space, our work is more advanced in several respects: (i) they identify correlations between pairs of message types, while we reason at the instance level and use many properties of instances to identify interesting correlation conditions; (ii) our approach covers advanced classes of correlation conditions (atomic and composite), while they only consider atomic conditions; and (iii) we introduce the notions of process views and process spaces to account for the fact that there is more than one possible way of correlating messages into instances.
Moreover, with the pairwise approach of [35], it is not clear how messages are related at the instance and process levels.
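To make the instance-level reasoning above concrete, the following sketch (with hypothetical message attributes and function names, not the paper's implementation) correlates messages under an atomic key-based condition and then groups them into process instances as the connected components of the resulting correlation graph:

```python
# Illustrative sketch of correlating messages into process instances.
# An atomic key-based condition correlates two messages when they agree
# on a key attribute; instances are the connected components of the
# graph whose edges are the correlated message pairs.

def correlated_pairs(messages, attr):
    """All pairs of messages that agree on attribute `attr`."""
    by_value = {}
    for msg_id, attrs in messages.items():
        by_value.setdefault(attrs.get(attr), []).append(msg_id)
    pairs = []
    for ids in by_value.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.append((ids[i], ids[j]))
    return pairs

def process_instances(messages, pairs):
    """Connected components of the correlation graph (union-find)."""
    parent = {m: m for m in messages}
    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m
    for a, b in pairs:
        parent[find(a)] = find(b)
    components = {}
    for m in messages:
        components.setdefault(find(m), []).append(m)
    return sorted(sorted(c) for c in components.values())

# Hypothetical log: four messages of one purchase-order instance and
# two of another, correlated on an `orderID` attribute.
log = {
    1: {"type": "PurchaseOrder", "orderID": "PO-7"},
    2: {"type": "Invoice", "orderID": "PO-7"},
    3: {"type": "Payment", "orderID": "PO-7"},
    4: {"type": "Shipment", "orderID": "PO-7"},
    5: {"type": "PurchaseOrder", "orderID": "PO-8"},
    6: {"type": "Invoice", "orderID": "PO-8"},
}
print(process_instances(log, correlated_pairs(log, "orderID")))
# → [[1, 2, 3, 4], [5, 6]]
```

Because instance membership is transitive, any spanning subset of the correlated pairs (e.g., just (1, 2), (1, 3), (1, 4), and (5, 6)) yields the same grouping.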
8 Conclusion and future work
In this paper, we have presented a set of novel concepts, techniques, and a tool for the correlation of process events in the context of service-based processes. We have characterized the problem of event correlation as one of discovering process views, each corresponding to a different way of grouping messages into service instances. To the best of our knowledge, this is the first work to introduce the concept of process spaces, together with a process space discovery system called Process Spaceship [30], empowered by a semi-automated approach for discovering the set of process views of an enterprise from process event logs. The main contributions, besides the framing of the problem, lie in (i) the introduction of the notions of process views,
process map, process spaces, and process view discovery systems, (ii) the identification of correlation conditions for service-based processes, (iii) the presentation of an efficient and viable approach for the discovery of process views by adopting a level-wise approach, and (iv) the organization of process views into a process map, labeled and structured in terms of the instances and models implied by the correlations associated with each node in the map.
Research in the area of process spaces is just starting. Further steps in this direction involve defining and discovering the correlation of heterogeneous and distributed data (relational data, XML, MS Word documents, and so on), as well as runtime components that allow tracking executions at different abstraction levels based on the process views. In particular, the process space discovery framework can be extended in many directions, making it adaptable to various contexts. We highlight some future research directions:
Condition language. The condition language used in this article is basic and can be extended in several directions to handle various situations. Much work has been done in the record linkage community to uncover relationships among seemingly separate entities, and users might be interested in discovering non-trivial correlation properties within collections of data (e.g., documents, files, etc.). For instance, the equality between attribute values could be replaced by a similarity function. Such conditions would replace the current correlation graph with a weighted graph, and the problem of identifying connected components would become that of identifying strongly connected components, i.e., components such that the aggregate of their relationship weights is above some threshold.
Business process models and process view relationships. We have adopted state machines in this paper, assuming that service interactions exhibit behaviors represented by sequential business processes.
In many cases, however, the interactions of business processes involve concurrent activities, and therefore a more expressive process formalism (such as Petri nets) needs to be adopted. In such a case, the relationships between process views could be more complex (see [47]), and the definitions of the process map and of the relationships between process views would require a non-trivial extension.
Imperfect logs. Event logs may be incomplete or noisy; for instance, instances may have missing messages and/or missing attribute values. Such imperfections make the message correlation problem harder. We plan to investigate extensions of our approach to tackle the problem of imperfect logs.
Optimization and approximation techniques. The approach proposed in this paper for computing the set of correlated message pairs (Rψ) includes all pairs of correlated messages in Rψ. For long instances that are correlated using a
key-based approach, Rψ becomes large (if cmax is the size of the longest instance, then the number of correlated message pairs for this instance alone is on the order of (cmax)²). In some cases, however, it is possible to devise optimization techniques that avoid including all correlated message pairs when computing Gψ(L). For instance, assume that Rψ = {(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (5, 6)}. In this case, PIψ(L) = {{1, 2, 3, 4}, {5, 6}}. In fact, to compute PIψ(L), it is sufficient to have Rψ = {(1, 2), (1, 3), (1, 4), (5, 6)}: omitting the pairs (2, 3), (2, 4), and (3, 4) does not change the result, due to the transitivity of the relationships between message pairs. In other cases, it may be appropriate to use approximate approaches instead of the exact approaches proposed in this paper, for example, using a small sample of the dataset to estimate the number of instances without actually computing them. Devising and applying such techniques may reduce both the time and the space complexity of computing the graph Gψ(L).
Acknowledgments The authors would like to thank the anonymous reviewers for providing invaluable comments on earlier drafts of this paper. We would also like to thank the researchers and students at UNSW and other universities, in particular Adnene Gubtani and Seyed A. Beheshti, for assisting with the experiments on tool usage.
References
1. Agrawal, R., Ghosh, S.P., Imielinski, T., Iyer, B.R., Swami, A.N.: An interval classifier for database mining applications. In: Proceedings of VLDB'92, pp. 560–573 (1992)
2. Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: Proceedings of SIGMOD'93, pp. 207–216 (1993)
3. Alon, N., Gibbons, P.B., Matias, Y., Szegedy, M.: Tracking join and self-join sizes in limited storage. In: PODS, pp. 10–20 (1999)
4. Alonso, G., Casati, F., Kuno, H.A., Machiraju, V.: Web Services: Concepts, Architectures and Applications. Data-Centric Systems and Applications. Springer, Berlin (2004)
5. Barros, A.P., Decker, G., Dumas, M., Weber, F.: Correlation patterns in service-oriented architectures. In: Proceedings of the 10th International Conference on Fundamental Approaches to Software Engineering (FASE), vol. 4422 of LNCS, pp. 245–259 (2007)
6. Beeri, C., Eyal, A., Milo, T., Pilberg, A.: Query-based monitoring of BPEL business processes. In: Proceedings of SIGMOD'07, pp. 1122–1124 (2007)
7. Benatallah, B., Casati, F., Toumani, F.: Representing, analysing and managing web service protocols. Data Knowl. Eng. 58(3), 327–357 (2006)
8. Benatallah, B., Motahari, H., Saint-Paul, R., Casati, F.: Protocol discovery for web services. In: 13th HP OVUA (2006)
9. Bobrik, R., Reichert, M., Bauer, T.: View-based process visualization. In: Proceedings of the International Conference on Business Process Management (BPM), pp. 88–95 (2007)
10. Casati, F., Castellanos, M., Dayal, U., Salazar, N.: A generic solution for warehousing business process data. In: Proceedings of VLDB'07, pp. 1128–1137 (2007)
11. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn. The MIT Press/McGraw-Hill, Cambridge/New York (2001)
12. Dasu, T., Johnson, T., Muthukrishnan, S., Shkapenyuk, V.: Mining database structure; or, how to build a data quality browser. In: SIGMOD, pp. 240–251 (2002)
13. Demers, A.J., Gehrke, J., Panda, B., Riedewald, M., Sharma, V., White, W.M.: Cayuga: a general purpose event monitoring system. In: Proceedings of CIDR'07, pp. 412–422 (2007)
14. Dustdar, S., Gombotz, R.: Discovering web service workflows using web services interaction mining. Int. J. Bus. Process Integr. Manag. (IJBPIM) 1(4), 256–266 (2006)
15. Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S.: Duplicate record detection: a survey. IEEE TKDE 19(1), 1–16 (2007)
16. Georgakopoulos, D., Hornick, M.F., Sheth, A.P.: An overview of workflow management: from process modeling to workflow automation infrastructure. Distrib. Parallel Databases 3(2), 119–153 (1995)
17. Grigori, D., Casati, F., Castellanos, M., Dayal, U., Sayal, M., Shan, M.-C.: Business process intelligence. Comput. Ind. 53(3), 321–343 (2004)
18. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., Massachusetts (2005)
19. Hara, C.S., Davidson, S.B.: Reasoning about nested functional dependencies. In: Proceedings of the 18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'99), pp. 91–100. ACM Press, New York (1999)
20. Hipp, J., Guntzer, U., Nakhaeizadeh, G.: Algorithms for association rule mining—a general survey and comparison. SIGKDD Explor. 2(1), 58–64 (2000)
21. HP: HP OpenView Solutions. http://www.managementsoftware.hp.com (2007)
22. IBM: FileNet Enterprise Content Management Solutions. http://www.filenet.com (2007)
23. IBM: WebSphere Business Process Management software. http://www.ibm.com/software/integration (2007)
24. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs (1988)
25. Kivinen, J., Mannila, H.: Approximate inference of functional dependencies from relations. Theor. Comput. Sci. 149(1), 129–149 (1995)
26. Liu, D.-R., Shen, M.: Workflow modeling for virtual processes: an order-preserving process-view approach. Inf. Syst. 28(6), 505–532 (2003)
27. Mannila, H., Rusakov, D.: Decomposition of event sequences into independent components. In: Proceedings of the 1st SIAM International Conference on Data Mining (2001)
28. Mannila, H., Toivonen, H.: Levelwise search and borders of theories in knowledge discovery. Data Min. Knowl. Discov. 1(3), 241–258 (1997)
29. McGarry, K.: A survey of interestingness measures for knowledge discovery. Knowl. Eng. Rev. 20(1), 39–61 (2005)
30. Motahari-Nezhad, H.R., Benatallah, B., Saint-Paul, R., Casati, F., Andritsos, P.: Process spaceship: discovering and exploring process views from event logs in data spaces. Proc. VLDB Endow. 1(2), 1412–1415 (2008)
31. Motahari-Nezhad, H.R., Saint-Paul, R., Benatallah, B., Casati, F.: Deriving protocol models from imperfect service conversation logs. IEEE TKDE 20(12), 1683–1698 (2008)
32. Motahari-Nezhad, H.R., Saint-Paul, R., Benatallah, B., Casati, F., Andritsos, P.: Message correlation for conversation reconstruction in service interaction logs. Technical Report UNSW-CSE-TR-0709, The University of New South Wales, Australia (2007)
33. Motahari-Nezhad, H.R., Saint-Paul, R., Benatallah, B., Casati, F., Ponge, J., Toumani, F.: ServiceMosaic: interactive
analysis and manipulation of service conversations. In: Proceedings of ICDE'07, pp. 1497–1498 (2007)
34. Oracle: Business Activity Monitoring. http://www.oracle.com/technology/products/integration/bam/pdf/oracle-bam-datasheet.pdf (2006)
35. Pauw, W.D., Hoch, R., Huang, Y.: Discovering conversations in web services using semantic correlation analysis. In: Proceedings of the International Conference on Web Services (ICWS'07), pp. 639–646 (2007)
36. Pauw, W.D., Lei, M., Pring, E., Villard, L., Arnold, M., Morar, J.F.: Web services navigator: visualizing the execution of web services. IBM Syst. J. 44(4), 821–846 (2005)
37. Petit, J.-M., Toumani, F., Boulicaut, J.-F., Kouloumdjian, J.: Towards the reverse engineering of denormalized relational databases. In: Proceedings of the 12th International Conference on Data Engineering (ICDE'96), pp. 218–227 (1996)
38. Rahm, E., Do, H.H.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23(4), 3–13 (2000)
39. Sahar, S.: Interestingness via what is not interesting. In: Proceedings of KDD'99, pp. 332–336 (1999)
40. Sismanis, Y., Brown, P., Haas, P.J., Reinwald, B.: Gordian: efficient and scalable discovery of composite keys. In: Proceedings of VLDB'06, pp. 691–702 (2006)
41. Spiliopoulou, M., Mobasher, B., Berendt, B., Nakagawa, M.: A framework for the evaluation of session reconstruction heuristics in web-usage analysis. INFORMS J. Comput. 15(2), 171–190 (2003)
42. Steinle, M., Aberer, K., Girdzijauskas, S., Lovis, C.: Mapping moving landscapes by mining mountains of logs: novel techniques for dependency model generation. In: Proceedings of VLDB'06, pp. 1093–1102 (2006)
43. Tran, H., Zdun, U., Dustdar, S.: View-based reverse engineering approach for enhancing model interoperability and reusability in process-driven SOAs. In: Proceedings of the 10th International Conference on Software Reuse, pp. 233–244 (2008)
44. van der Aalst, W., ter Hofstede, A.H.M., Weske, M.: Business process management: a survey. In: Proceedings of the International Conference on Business Process Management (BPM), pp. 1–12 (2003)
45. van der Aalst, W., van Dongen, B.F., Herbst, J., Maruster, L., Schimm, G., Weijters, A.J.M.M.: Workflow mining: a survey of issues and approaches. Data Knowl. Eng. 47(2), 237–267 (2003)
46. van der Aalst, W., van Hee, K.: Workflow Management: Models, Methods, and Systems. MIT Press, Cambridge (2002)
47. Weidlich, M., Barros, A., Mendling, J., Weske, M.: Vertical alignment of process models—how can we get there? In: CAiSE 2009 Workshop Proceedings, 10th Workshop on Business Process Modeling, Development, and Support (BPMDS'09), pp. 71–84 (2009)
48. Zhao, X., Liu, C., Sadiq, W., Kowalkiewicz, M.: Process view derivation and composition in a dynamic collaboration environment. In: OTM Conferences (1), pp. 82–99 (2008)