Discovering Event Correlation Rules for Semi-Structured Business Processes

Szabolcs Rozsnyai, Aleksander Slominski, Geetika T. Lakshmanan
IBM T.J. Watson Research Center, 19 Skyline Drive, Hawthorne NY 10532 USA

ABSTRACT
In this paper we describe an algorithm to discover event correlation rules from arbitrary data sources. Correlation rules can be useful for determining relationships between events in order to isolate instances of a running business process for the purposes of monitoring, discovery and other applications. We have implemented our algorithm and validated our approach on events generated by a simulator that implements a real-world inspired export compliance regulations scenario consisting of 24 activities and corresponding event types. This simulated scenario involves a wide range of heterogeneous systems (e.g. Order Management, Document Management, E-Mail, and Export Violation Detection Services) as well as workflow-supported human-driven interactions (Process Management System). Experimental results demonstrate that our algorithm achieves a high level of accuracy in the detection of correlation rules. This paper confirms that our algorithm is a step towards semi-automating the task of detecting correlations. We also demonstrate how correlation rules discovered by our algorithm can be used to create aggregation nodes that allow more efficient querying, filtering and analytics. The results in this paper encourage future directions such as distributed statistics calculation, and scalability in terms of handling massive data sets.

Categories and Subject Descriptors
H.3.4 [Systems and Software]: Distributed Systems; H.3.0 [Information Storage and Retrieval]: General; E.1 [Data Structures]: Data Structures

General Terms
Algorithms, Management, Performance

Keywords
Correlation discovery, Business Process Discovery, Complex Event Processing, Data Mining, Event Analysis

1. INTRODUCTION

Systems that support today’s globally distributed, rapidly changing and agile businesses are steadily growing in size as well as complexity. They are becoming increasingly federated, loosely coupled and distributed, and at the same time they generate a huge number of events, ranging from record entries representing business activities to more technical events at various levels of granularity. Industries such as healthcare and insurance have witnessed an explosion in the growth of semi-structured business processes that has been fuelled by the advent of such systems. These business or scientific processes depart from the traditional kind of structured processes; their lifecycle is not fully driven by a formal process model. While an informal description of the process may be available, the execution of a semi-structured process is not completely controlled by a central entity (such as a workflow engine). Monitoring such semi-structured business processes is useful because it enables a variety of business applications such as process discovery, analytics, verification and process improvement. Accomplishing this is an important research challenge.

Such processes could be implemented on diverse event-driven architectures, where none of the components have to be aware of each other and the interactions are driven by events in an asynchronous fashion [10][16][17]. Creating a unified view of processes, also known in the literature as composite business applications [9], is a difficult problem. Not every event contains a unified process instance identifier for creating an end-to-end view of the underlying processes. In certain scenarios, events are also transformed or aggregated during execution steps so that identifiers that relate events to process instances or to each other become extremely hard to track [18][19]. This is a key problem that arises when tracking process instances across various system and application layers. In fast changing environments where business processes are executed across a wide range of distributed systems it is difficult to trace process instances, as the relationships of events must be explicitly known and defined. Furthermore, supposedly isolated process instances, a transport coordination process for example, can be related to other processes such as the order management and invoicing process. The attributes that bridge those distinct processes, however, can only be found in the events of the isolated process instances.

An important concept in event processing is event correlation, i.e. linking event instances based on their payload values [1]. The first step towards isolating a process instance in the scenarios we are targeting involves correlation of events generated by heterogeneous and distributed systems. This allows one to isolate and track end-to-end instances of a given semi-structured business process. The problem of correlating events has been addressed in the past for the purposes of integrating large and complex data

sources. In this area the task of matching schemas (relational database schemas for instance) for the purposes of tracking an end-to-end process instance has been identified as a very time-consuming and labor intensive process that requires tool support and automation [1][4]. Consequently a significant amount of research effort has been devoted to information retrieval, knowledge representation, schema mapping and translation as well as integration [5]. Extensive work has also been conducted in the domain of data integration and exchange, motivated by the requirements of processes such as Extract-Transform-Load (ETL) processes in data warehousing. In data warehousing, an ETL process requires the extraction of data from various sources and the transformation of the data to match a corresponding target schema. Such data exchange scenarios require extensive knowledge about the semantics of data structures in order to convert messages from a source schema to a target schema. Existing work devoted to deriving relationships between data elements for the purposes of data exchange has a strong focus on foreign-key relationships and assumes relational (i.e. normalized) data [4][5][6][7]. Finding and defining relationships (correlations) in an arbitrary and non-normalized data space has thus far received little attention, and is the focus of our work.

In this paper we address the problem of automatically deriving correlations from arbitrary sources of data. A correlation, in the context of this work, is a set of rules that define which attribute(s) form a relationship between events. This type of correlation is to a certain extent comparable to foreign-key relationships known from the relational world. An important difference, however, is that we do not assume that events are grouped together in a normalized schema, nor do we assume that we have any meta-data that describes an event’s attributes.

In this paper we present a correlation discovery algorithm that is built upon some preliminary ideas presented in our recent workshop paper [1]. We describe the design and implementation of our correlation discovery algorithm and present a comprehensive evaluation of our algorithm’s detection performance with respect to a real-world inspired order management and export compliance regulations scenario. We designed and implemented a simulator to implement this scenario. We also demonstrate the utility of our algorithm to create aggregation nodes that facilitate efficient calculation of composite level aggregate statistics. The discovered correlation rules produced by our algorithm can be used either during runtime to group related events together, such as events belonging to a process instance, or to create a graph of relationships that enables querying and traversing relationship paths.

The first part of the paper (Section 2) defines and discusses terminology that is essential for understanding the concepts and places the contribution in a larger context to highlight its importance. In Section 3 we introduce the correlation discovery algorithm and explain the major concepts with simple examples. In Section 4 we introduce and discuss evaluation results. Finally, in Section 5 we put our solution in context with related work and in Section 6 we provide an outlook on future work.

2. BACKGROUND AND TERMINOLOGY
In this section we define and discuss terminology that is essential for understanding the concepts described in this work. In addition we present correlation discovery in the context of event processing applications (Figure 2) and briefly discuss each layer

to create a better understanding of the importance and usage of this paper’s contribution in a broader context. Finally we discuss advanced correlation representations in order to highlight certain aspects of the correlation discovery.

2.1 Terminology
A correlation describes the relationship between two events and defines a collection of semantic rules to specify how certain events are related to each other. Correlations are defined by specifying correlating attributes between event types. The ability to define relationships between events is an important component in event processing applications such as event-driven rules [24]. Such applications allow the detection of business situations in order to trigger automatic responses such as early warnings to prevent damage, loss or excessive cost, and provide alerts to exploit time-critical business opportunities. Correlations are also an important aspect for event retrieval systems, pattern discovery and event mining [20].

The definition of a correlation between event types is called a correlation rule. For instance, the expression A.x = B.y represents a correlation rule between the event types A and B over their attributes x and y. Single correlation rules are typically not capable of isolating specific patterns that are of interest. Therefore, it is necessary to combine several correlation rules in order to be able to define a correlation that includes all events that share a relationship in a certain context. The context might be, for example, the instance of a process, as demonstrated in the transportation process (illustrated in an example in Figure 1), that is executed across different systems and thus produces various events. If a user has enough knowledge about the underlying systems and events he or she can easily express the correlation rules as:

OrderReceived.OrderId = ShipmentCreated.OrderId,
ShipmentCreated.ShipmentId = TransportStarted.ShipmentId,
TransportStarted.TransportId = TransportEnded.TransportId

This allows a correlation engine to isolate a desired process instance.
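To make this concrete, the following sketch (Python; the event payloads are invented for illustration and are not taken from the paper’s data) represents each correlation rule as a pair of (event type, attribute) references and transitively chains matching events into one group, which is essentially what a correlation engine does with such a rule set.

# Illustrative sketch: correlation rules as (type, attribute) pairs; payloads are hypothetical.
rules = [
    (("OrderReceived", "OrderId"), ("ShipmentCreated", "OrderId")),
    (("ShipmentCreated", "ShipmentId"), ("TransportStarted", "ShipmentId")),
    (("TransportStarted", "TransportId"), ("TransportEnded", "TransportId")),
]

events = [
    {"type": "OrderReceived",    "OrderId": "O-1"},
    {"type": "ShipmentCreated",  "OrderId": "O-1", "ShipmentId": "S-7"},
    {"type": "TransportStarted", "ShipmentId": "S-7", "TransportId": "T-3"},
    {"type": "TransportEnded",   "TransportId": "T-3"},
    {"type": "OrderReceived",    "OrderId": "O-2"},  # belongs to a different process instance
]

def correlated(a, b):
    """True if any correlation rule links events a and b via equal payload values."""
    for (ta, attr_a), (tb, attr_b) in rules:
        for x, y in ((a, b), (b, a)):
            if x["type"] == ta and y["type"] == tb:
                va, vb = x.get(attr_a), y.get(attr_b)
                if va is not None and va == vb:
                    return True
    return False

def isolate_instance(seed):
    """Transitively collect all events reachable from the seed via the correlation rules."""
    members, frontier = [seed], [seed]
    while frontier:
        current = frontier.pop()
        for ev in events:
            if ev not in members and correlated(current, ev):
                members.append(ev)
                frontier.append(ev)
    return members

print(len(isolate_instance(events[0])))  # -> 4: one end-to-end transportation instance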

Figure 1: Tracking Correlation Rules for a Transportation Scenario

In the above example we isolate specific process instances. Events, however, might share all kinds of relationships that can be expressed. A user or component (e.g. a rule engine) might not always be interested in process instances, but in certain dimensions of events, such as a correlation that groups all related events together if they have the same customer (orders placed by the same customer). Such a correlation would enable another component to continuously calculate the average order volume, for instance. Correlation rules are defined on the basis of a user’s objectives. Therefore a correlation discovery algorithm is a means for a user to group events via correlation

rules in order to satisfy his or her objectives. Previous work [20] has separated correlations into two major groups: primal and bridged correlations. A primal correlation defines direct correlation relationships between event types and their attributes. A bridged correlation extends this model by allowing the definition of correlations between several primal correlations. This type of correlation allows forming indirect relationships between events through defining bridging attributes between primal sets of correlations.
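For illustration only (the invoicing event types and attributes below are invented, not taken from the paper), a primal correlation can be written as a single type-to-type rule, while a bridged correlation joins two primal correlations through a bridging attribute:

# Hypothetical sketch: primal vs. bridged correlations.
# Primal correlations: direct relationships between two event types.
primal_order_shipment = ("OrderReceived.OrderId", "ShipmentCreated.OrderId")
primal_invoice_payment = ("InvoiceCreated.InvoiceId", "PaymentReceived.InvoiceId")

# Bridged correlation: the two primal correlations are related indirectly through a
# bridging attribute shared by one event type of each primal correlation.
bridged = {
    "primals": [primal_order_shipment, primal_invoice_payment],
    "bridge":  ("ShipmentCreated.OrderId", "InvoiceCreated.OrderId"),
}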

2.2 Conceptual Overview and Context

Figure 2: Correlation discovery and its applications

Figure 2 illustrates conceptually where a correlation discovery component would fit with respect to an end-to-end system serving different applications such as process mining, analytics, monitoring and querying. Next we describe each layer:

Data Sources. The bottom layer represents event processing source systems, producing a wide range of artifacts (events, records, logs, etc.) from different domains at various levels of granularity.

Data Integration. The data sources produce events that represent activities or resources associated with processes and can be consumed by applications such as process analytics. Such events can be in different formats (XML, PDF, JSON, CSV, etc.) and with various structures (XSD, column semantics of CSV files, etc.). Furthermore, the data sources are constantly subject to change. Changes may occur when IT systems are replaced, when data structures are improved, when errors are fixed or when new components are introduced that add additional data. Connecting systems directly with the source is therefore rarely an alternative, as every change is accompanied by large integration efforts. Therefore data integration creates an abstraction layer over those source events in order to have a stable representation which can be used by applications at higher layers. The advantage is that the abstracted layer does not change; only the data mapping and the extraction of the attributes from the source are altered.

Storage. Events extracted from various source systems can be either delegated to real-time event processing components or can be stored for further analysis following the store everything, discover later paradigm. The idea is that at the time the data is stored it is not necessarily known what a user is specifically going to look for in it. Therefore, it is important to store as much data as possible in its original and unaltered form. This is particularly true for correlation discovery. At a later point in time a user may discover the importance of a specific group of events which had been of little interest in the past. Such events can then be analyzed by a correlation discovery algorithm to detect relationships between them for further use.

Correlation Discovery. The correlation discovery algorithm takes events from the storage component and determines correlations by calculating a unique combination of statistics on attributes. The output of the correlation discovery algorithm is a set of correlation rules that express how certain events are related to each other. Those correlations can either isolate process instances (e.g. an Order Process) or certain dimensions (by Customer, by Product).

Correlation Engine. A correlation engine uses the previously discovered and defined correlation rules during runtime to either group related events together or create a graph of relationships by connecting events through their shared dimensional relationships. A correlation engine might also apply the correlation rules on a storage system containing historical events to create a graph of relationships that can then be used later for analytical purposes.
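As a rough sketch of this correlation-engine layer (not the actual implementation; the rule and event formats follow the illustration from Section 2.1), the class below links each newly ingested event to previously seen events whose payload values satisfy a discovered correlation rule, incrementally building the graph of relationships mentioned above.

from collections import defaultdict

# Sketch of a runtime correlation engine. Rules are ((type_a, attr_a), (type_b, attr_b))
# pairs; events are dictionaries with a "type" key. This in-memory version only
# illustrates the idea of linking events into a relationship graph.
class CorrelationEngine:
    def __init__(self, rules):
        self.rules = rules
        self.seen = defaultdict(list)   # (type, attribute, value) -> [event ids]
        self.edges = []                 # relationship graph as (earlier id, newer id) pairs

    def ingest(self, event_id, event):
        # link the new event to earlier events on the "other side" of each rule
        for (ta, attr_a), (tb, attr_b) in self.rules:
            for my_type, my_attr, other_type, other_attr in (
                    (ta, attr_a, tb, attr_b), (tb, attr_b, ta, attr_a)):
                if event["type"] != my_type:
                    continue
                value = event.get(my_attr)
                if value is None:
                    continue
                for other_id in self.seen.get((other_type, other_attr, value), []):
                    self.edges.append((other_id, event_id))
        # index this event's own attribute values so later events can link back to it
        for attr, value in event.items():
            if attr != "type":
                self.seen[(event["type"], attr, value)].append(event_id)

# engine = CorrelationEngine(rules)
# engine.ingest(1, {"type": "OrderReceived", "OrderId": "O-1"})
# engine.ingest(2, {"type": "ShipmentCreated", "OrderId": "O-1", "ShipmentId": "S-7"})
# engine.edges -> [(1, 2)]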


Applications. Correlated events can have several applications. Events correlated at runtime might be used in monitoring applications or event-driven rules to detect exceptional situations and raise alerts. Another application is process mining. Process mining algorithms require historical traces of process instances from which they can derive a process model. Correlation rules can be applied to execution traces before applying a mining algorithm to isolate the process instances that are of interest. Correlated process execution instance traces can then be provided as input to the mining algorithms. Correlations can lead to graphs of relationships that can be utilized to speed up queries if events are stored. It would be possible to traverse through the graph of relations by accessing the various references that are represented by correlations. Correlations are particularly useful for features that require interaction, analysis and exploration of events.

2.3 Enabling Aggregation Nodes from Discovered Correlation Rules
The correlation discovery algorithm, described in detail in Section 3, generates a set of correlation rules that reflect valid correlations between events. The complete combination of rules does not always isolate process instances or specific dimensions of relationships between events, such as grouping related events together if they have the same customer. The user must apply his or her domain knowledge and interest to group correlation rules so that a correlation engine is capable of creating a network of relationships that keeps track of correlated events for event processing purposes such as continuously calculating statistics, observing patterns and reacting to certain situations. Therefore, we introduce the concept of aggregation nodes to facilitate grouping correlation rules to represent certain aspects of an application that may be of interest to a user. Aggregation nodes also enable efficient analytics and improve the ease of use and performance when querying, browsing and filtering events.

Figure 3: Aggregation Nodes for the Transportation Scenario

Figure 3 illustrates the data structure that can be applied to organize correlated events with such aggregation nodes. For clarity the middle layer in the figure shows the stream of events in order. The events share a directed correlation:

OrderToShipment → {OrderReceived.OrderId = ShipmentCreated.OrderId, ShipmentCreated.ShipmentId = TransportStarted.ShipmentId, TransportStarted.TransportId = TransportEnded.TransportId}

The direction can be introduced by the correlation engine based either on chronological order or on another defined causal constraint. Each set of correlation rules is assigned an identifier which can be used to generate an aggregation node such as OrderToShipment. For every group of events that matches a group of correlation rules an aggregation node is created that references each event of the subset. In the transportation example shown in Figure 3 there are two OrderToShipment aggregation nodes because there are two isolated groups of process instances. Such aggregation nodes can be used as a constraint in a query when the user wants to restrict the search space to groups of correlated events that belong to OrderToShipment. Aggregation nodes also help to create a logical grouping and enable easier querying when using the data in interactive visualizations, as the related events already provide connections and do not need extra queries. Furthermore, aggregation nodes can contain attributes holding calculated statistics of the lower level, such as the CycleTime or the OrderAmount in the example shown. By leveraging this concept of representing correlations, it is also possible to create higher level aggregations that include several lower level aggregation nodes. Statistics can be aggregated to provide information over all related events. For instance, in the above example the All OrderToShipment Processes aggregation node contains the average values (Avg. CycleTime, Avg. Order Amount) of all underlying processes. Dimensional information about events can be created by grouping the corresponding correlation rules, such as in the case of the aggregation nodes By CustomerId, By Product or By Destination. If the user queries for a particular customer the system can immediately retrieve the By CustomerId aggregation node, which could hold several key statistics (Total Orders, ...). By retrieving that aggregation node, references to all related orders and thus the order processes are maintained and can be immediately accessed.
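A possible in-memory shape for such an aggregation node is sketched below (Python; the timestamp and amount field names, and the way cycle time is derived from them, are assumptions made for illustration rather than the paper’s implementation):

from dataclasses import dataclass, field

# Illustrative sketch of an aggregation node: it references the correlated events of one
# group (e.g. one OrderToShipment instance) and carries pre-computed statistics so that
# queries do not have to touch the raw events. Field names are assumptions.
@dataclass
class AggregationNode:
    name: str                                          # e.g. "OrderToShipment" or "By CustomerId"
    event_refs: list = field(default_factory=list)     # keys of the referenced events or child nodes
    stats: dict = field(default_factory=dict)          # e.g. {"CycleTime": ..., "OrderAmount": ...}

def build_order_to_shipment_node(instance_events):
    """Create one aggregation node for a group of events matched by the rule set."""
    timestamps = [e["timestamp"] for e in instance_events]
    node = AggregationNode(name="OrderToShipment")
    node.event_refs = [e["id"] for e in instance_events]
    node.stats["CycleTime"] = max(timestamps) - min(timestamps)
    node.stats["OrderAmount"] = sum(e.get("amount", 0) for e in instance_events)
    return node

def build_parent_node(child_nodes):
    """Higher-level node (e.g. 'All OrderToShipment Processes') aggregating its children."""
    parent = AggregationNode(name="All OrderToShipment Processes")
    parent.event_refs = [c.name for c in child_nodes]
    parent.stats["Avg. CycleTime"] = sum(c.stats["CycleTime"] for c in child_nodes) / len(child_nodes)
    parent.stats["Avg. Order Amount"] = sum(c.stats["OrderAmount"] for c in child_nodes) / len(child_nodes)
    return parent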

3. CORRELATION DISCOVERY ALGORITHM
Our correlation discovery algorithm consists of three stages:

a) Data Pre-Processing. The first step of the correlation discovery process is to load and integrate the data into a data store (e.g. database, cloud storage, etc.) that is then used to calculate statistics and determine correlation candidates.

b) Statistics Calculation. After the data has been loaded and integrated into the internal representation, various statistics, mainly on attribute values, are calculated and stored in a fast accessible data structure, as illustrated in Figure 5.

c) Determining Correlation Candidates. In the last step the correlation discovery algorithm determines correlation pairs with a certain confidence value based on the statistics calculated in the previous step.

In the following sections 3.1-3.3 we discuss each step in detail with respect to the transportation scenario introduced earlier.

3.1 Data Pre-Processing
The first step (Step 1 in Figure 4) is to infer a configuration setup for data integration and correlation discovery from data that may be present in sample execution traces or directly retrieved from other data sources. Configuration requires specification of:

a) the properties (i.e. attributes) that should be extracted from the raw events, and

b) the attribute extraction algorithms that should be applied to extract the events' attributes.

For the purposes of simplicity the examples in this paper focus on data sources represented in XML. Nevertheless our proposed algorithm for detecting correlation identifiers is widely applicable to heterogeneous data sources and not limited to XML. The data sources specified as input are parsed and a property definition is created for each element and its attributes. A property is also referred to as an alias that is a representation of an extracted attribute of an event. Since we assume in this paper that sources are represented in XML, for each property a corresponding XPath expression is derived from the source structure that allows an extraction algorithm to extract the property each time an event is added to the storage.
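Because the examples assume XML sources, a property alias can be bound to one or more XPath expressions and evaluated with a standard XML library. The sketch below uses Python's xml.etree.ElementTree with invented element names and a deliberately simplified alias table (the gtd:/ps: configuration format is not reproduced here):

import xml.etree.ElementTree as ET

# Sketch of alias-based attribute extraction from XML events (element names invented).
# Each alias maps a property name to one or more XPath expressions, because the same
# logical attribute may appear under different elements in different event types.
aliases = {
    "orderId":    [".//OrderId", ".//Order/Id"],
    "customerId": [".//CustomerId"],
}

def extract_attributes(xml_text):
    root = ET.fromstring(xml_text)
    event = {"type": root.tag}            # the top-level element name is used as the event type
    for prop, xpaths in aliases.items():
        for xpath in xpaths:
            node = root.find(xpath)
            if node is not None and node.text:
                event[prop] = node.text.strip()
                break                      # first matching XPath wins
    return event

sample = "<OrderReceived><OrderId>O-1</OrderId><CustomerId>C-9</CustomerId></OrderReceived>"
print(extract_attributes(sample))
# {'type': 'OrderReceived', 'orderId': 'O-1', 'customerId': 'C-9'}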

Figure 4: Data Pre-Processing

In situations where an XML element or an attribute is not unique and may exist as a child in other elements, their corresponding XPath expressions are grouped together as shown in example (a) in Figure 4. Example (a) in Figure 4 shows a property definition named gtd:caseId. The element ps:XPathAlias refers to that property and defines a set of XPath expressions. After an event has been loaded into the data store as a record, the system is able to infer a configuration for it which consists of the Property Aliases and XPath extractors. The configuration allows automatic determination of the extraction algorithms which should be applied to extract the attributes of an event. In this example, if the document is of type gtd:Order then the caseId of the document is extracted and stored explicitly as an attribute.

In the next step (Step 2 in Figure 4), after the configuration has been generated, the raw event sources (such as event traces) or a sample set of them are loaded into data storage. The loading process is aware of the (semantic) “type” (e.g. it is an Order) of the data and flags data accordingly. In the example depicted in Figure 1, the type is determined by the top-level XML element names such as CaseCreated, OrderReceived, etc. Other methods may be used to determine the type of an event, for instance by applying information known beforehand about the source or by more sophisticated methods of automatically discovering type characteristics. Regardless of the choice of method, events are separated into groups, clusters or types, as the goal is to determine the relationships between those types or clusters of events.

We use HBase, an open source, non-relational, distributed database modelled after Google's BigTable. It consists of sorted key-value pairs where a key is a unique identifier and its value

spans an arbitrary number of immutable attributes (Step 3 in Figure 4). These attributes can be grouped together in families such as Common, Alias and Graph. Their structure is comparable to relational schemas, with the major difference that the attributes are schema-less. This means that there is neither a defined set of attributes nor a data type defined for those attributes [22]. For example a CaseCreated event may contain three attributes while an E-Mail may contain four attributes of completely different data types. This kind of data structure has many advantages in distributed cloud storage systems, as tables are always sorted by their key and thus can be easily distributed horizontally over several machines. Applying MapReduce (M/R) jobs for analytical or query tasks over huge data sets can significantly boost performance. We intend to study the utilization of M/R jobs to speed up correlation discovery over large data sets in future work.

The raw event with its (semantic) type is inserted as-is into the Common family along with a unique identifier as the key. Based on the initial configuration that was created, the attributes are extracted and stored separately into the Alias family (Step 4 in Figure 4). The most important step is the indexing (Step 5 in Figure 4). Every extracted value of a raw event is stored into an inverted index. For each type and attribute a separate index table is created where the value of an attribute becomes the key and the value of the index table holds a list of references to the corresponding records where the key occurs. This enables the calculation of statistics that are needed for correlation discovery. The next step is to compute statistics on the pre-processed data.
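The inverted index can be mimicked with nested dictionaries; the sketch below is an in-memory stand-in for the per-(type, attribute) HBase index tables described above, with invented sample records:

from collections import defaultdict

# In-memory stand-in for the inverted index: for each event type and attribute, map every
# observed value to the list of record keys in which it occurs. Sample records are invented.
records = {
    "r1": {"type": "OrderReceived",   "OrderId": "O-1", "CustomerId": "C-9"},
    "r2": {"type": "OrderReceived",   "OrderId": "O-2", "CustomerId": "C-9"},
    "r3": {"type": "ShipmentCreated", "OrderId": "O-1", "ShipmentId": "S-7"},
}

# index[(event_type, attribute)][value] -> [record keys]
index = defaultdict(lambda: defaultdict(list))
for key, rec in records.items():
    for attr, value in rec.items():
        if attr != "type":
            index[(rec["type"], attr)][value].append(key)

print(dict(index[("OrderReceived", "CustomerId")]))  # {'C-9': ['r1', 'r2']}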

Figure 5: Type and Attribute Statistics

3.2 Statistics Calculation
After the raw events have been loaded and pre-processed the next step is to compute and store various statistics about the events. Figure 5 illustrates the data structure and lists the statistics that need to be calculated for each type and attribute in order to detect correlation candidates. For every event type a type-map container (Step 1 in Figure 5) is created containing all attributes that ever occurred for that type, including statistics, referred to as TypeStats (Step 2 in Figure 5), for each of those contained attributes. Each TypeStat contains the following calculated statistics (Step 3 in Figure 5):

• Attribute Cardinality: Based on the previously created inverted index, the Attribute Cardinality contains a map of each value and how often each of those values occurs.

• Card: Determines the number of different values for the attribute (cardinality).

• Cnt: Represents the total number of instances in which the attribute occurs (count). As the data structure does not work on a defined schema it is possible that the attribute does not occur in every instance.

• AvgAttributeLength: Represents the average attribute length of the current attribute. This is an indicator of the potential uniqueness of a value. A long average length may signify that an attribute is a unique identifier. Unique identifiers such as OrderId are potential attributes that occur in other types and thus form a correlation. Long attribute lengths may also be misleading, since a textual description may be very long and unique but never used for correlating events.

• InferencedType: Defines the data type of an attribute. The data type of an attribute is an important characteristic for correlation discovery for the purposes of reducing the problem space of correlation candidates. The chances that a numeric attribute correlates with an attribute containing mostly alphanumeric values are very low. We make a distinction between the numeric and alphanumeric attribute data types. This particular characteristic can, however, be extended to significantly reduce the problem space further. Timestamps, for instance, could be filtered out of correlation candidates. The determination of the data type is made with a fault tolerance of 0.9 (e.g. min. 90% of the values must be numeric), and we refer to this as the parameter Phi. We support the following distinctions: Numeric or Alphanumeric, Timestamp/DateTime, Boolean and Description text.

• NoOfNumeric: Depending on the InferencedType, this variable contains the number of values that are of numeric type.

• NoOfAlphaNum: Depending on the InferencedType, this variable contains the number of values that are of type alphanumeric.

A sketch of how these statistics can be derived from the inverted index is shown below.
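The following sketch derives the TypeStats entries for one (type, attribute) combination from such an inverted index; the simplistic numeric test and the Phi handling are assumptions that only approximate the type inference described above.

def infer_type(values, phi=0.9):
    """Simplified type inference: 'Numeric' if at least phi of the values parse as numbers."""
    numeric = sum(1 for v in values if v.replace(".", "", 1).isdigit())
    return "Numeric" if values and numeric / len(values) >= phi else "Alphanumeric"

def type_stats(value_index, phi=0.9):
    """value_index: {value: [record keys]} for one (type, attribute) pair, as built above."""
    occurrences = [value for value, refs in value_index.items() for _ in refs]
    numeric = sum(1 for v in occurrences if v.replace(".", "", 1).isdigit())
    return {
        "Card": len(value_index),                 # number of distinct values
        "Cnt": len(occurrences),                  # number of records carrying the attribute
        "AvgAttributeLength": sum(len(v) for v in occurrences) / len(occurrences) if occurrences else 0.0,
        "InferencedType": infer_type(occurrences, phi),
        "NoOfNumeric": numeric,
        "NoOfAlphaNum": len(occurrences) - numeric,
    }

# With the index from the previous sketch:
# type_stats(index[("OrderReceived", "CustomerId")])
# -> {'Card': 1, 'Cnt': 2, 'AvgAttributeLength': 3.0,
#     'InferencedType': 'Alphanumeric', 'NoOfNumeric': 0, 'NoOfAlphaNum': 2}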

3.2.1 Example
In this section we illustrate the calculation of the statistics using a simple example. In the transportation scenario introduced earlier we distinguish between four different event types: OrderReceived, ShipmentCreated, TransportStarted and TransportEnded.

Figure 6: OrderReceived Events

Figure 6 shows a table representing OrderReceived event instances as rows and their attributes as columns. In the following we explain the calculation of the statistics for the Product attribute.

Figure 7: Product Index

The attribute cardinality (named Index in Figure 5) contains a map of each value and how often each of these values occurs (Figure 7).

Figure 8: OrderReceived statistics

Based on the index we can determine the cardinality (Card), which is four, as only four different products occur in our event instances. The Cnt for the Product attribute is in this case 5, as it occurs in every event; this might not always be the case. With the index we can also determine the AvgAttributeLength for the Product. In this simple example the length of the product names does not vary and the AvgAttributeLength is 8. The type inference component also utilizes the index to determine the type (which is alphanumeric). The next step is to compute correlation candidates on the basis of the computed statistics.
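Because the raw event table of Figure 6 is not reproduced here, the following quick check uses five hypothetical product names chosen to be consistent with the statistics quoted above (four distinct values, five occurrences, each name eight characters long):

# Hypothetical product values consistent with the statistics reported in the text.
products = ["Notebook", "Notebook", "Keyboard", "Joystick", "Backpack"]

card = len(set(products))                        # 4 distinct products
cnt = len(products)                              # attribute occurs in all 5 events
avg_len = sum(len(p) for p in products) / cnt    # 8.0

print(card, cnt, avg_len)   # 4 5 8.0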

3.3 Determining Correlation Candidates
At this point data has been loaded into the storage and various statistics have been calculated for each type and attribute of events. This provides a foundation for determining the correlation candidates. The goal of the candidate matching algorithm is to utilize the statistics within certain boundaries (parameters) to present a result set containing pairs of potentially correlating attributes expressed by a confidence score. This has the advantage of allowing a user to specify approximate parameters and select desired candidates through a user interface. In a fully automated solution a system can select candidates with a very high confidence factor. The confidence score of correlation candidates is determined by the following three parameters with a default set of weights:

a) Difference Set. A difference set determines the difference between all permutations of pairs of all attribute candidates on their instance data and is assigned a weight of 60%.

b) Difference between AvgAttributeLength. The difference between the lengths of values of two correlation candidates is assigned a weight of 20%.

c) LevenshteinDistance. The Levenshtein distance between attribute names is assigned a weight of 20%.

We determine the weights for each parameter experimentally. Now we explain the calculation of each of these parameters used for the overall confidence score calculation.

Difference Set. The first step in computing the confidence score is to compute the difference set of all permutations of pairs of all attribute candidates. To reduce the search space of candidates we apply an approach similar to [3][6][7][8], where we first determine Highly Indexable Attributes for each type and then Mappable Attributes to form pair candidates. A Highly Indexable Attribute is an attribute that is potentially unique for each instance of a type. It is determined by the following equation:

IndexableAttributeSet := {i | i ∈ Attributes ∧ (Card(i) / Cnt(i)) > Alpha ∧ AvgAttributeLength(i) > Epsilon}
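Expressed in code, the selection of Highly Indexable Attributes is a filter over the per-attribute statistics of a type; the sketch below reuses the TypeStats layout from Section 3.2, and the numbers in the example dictionary are illustrative rather than taken from the paper's figures.

# Sketch: select Highly Indexable Attributes of one event type from its statistics.
def indexable_attributes(stats, alpha=0.95, epsilon=5):
    return {
        attr for attr, s in stats.items()
        if s["Cnt"] > 0
        and s["Card"] / s["Cnt"] > alpha              # nearly unique per instance
        and s["AvgAttributeLength"] > epsilon          # long enough to look like an identifier
    }

order_received_stats = {   # illustrative numbers only
    "OrderId":    {"Card": 5, "Cnt": 5, "AvgAttributeLength": 9.0},
    "CustomerId": {"Card": 3, "Cnt": 5, "AvgAttributeLength": 8.0},
    "Product":    {"Card": 4, "Cnt": 5, "AvgAttributeLength": 8.0},
}
print(indexable_attributes(order_received_stats))   # {'OrderId'}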

Alpha is a threshold parameter that determines the minimum ratio (i.e. uniqueness) of Card / Cnt and thus allows a small deviation that can be caused, for instance, by duplicates. Epsilon is an additional parameter that defines the minimum average length of an attribute.

The Mappable Attributes can be seen as a means to reduce the search space of potentially correlating attributes of a type. One approach is to set an upper threshold on how frequently a value of an attribute can occur: the assumption is that if it occurs more than x times it is unlikely to be a correlation candidate. Our approach of reducing the search space is inspired by the relational data field. Consider for example an order relation that contains one unique key. Customer complaints are stored in a separate relation containing the order key as a reference. We assume that a complaint cannot occur more than 10 times for one order. The Mappable Attribute is defined as follows:

MappableAttributeSet := {m | m ∈ Attributes ∧ Card(m) < Gamma}

Gamma is a threshold parameter that can be set experimentally and customized to the application scenario based on knowledge of the events. This parameter bears the drawback of missing correlation candidates in some cases. For example, in a situation where a Customer has many Orders with a foreign-key relationship, it does not make sense to set a restrictive value for Gamma.

Having determined all the Indexable and Mappable Attributes of all types, the next step is to find candidate pairs of attributes that potentially correlate with each other. Therefore a difference set A\B = {x | x ∈ A ∧ x ∉ B} between all permutations of attribute candidates A and B is created, where A = IndexableAttributeSet and B = MappableAttributeSet. The size of A\B must be below a certain threshold (DiffThreshold) in order to be taken into account: |A\B| < DiffThreshold. In the transportation example, the attributes flagged as indexable are those where Card / Cnt > Alpha ∧ AvgAttributeLength > Epsilon.
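The remaining steps can be sketched as follows; the Mappable filter and the difference set follow the definitions above, the Levenshtein distance uses the standard dynamic program, and, since the paper does not spell out how the three raw quantities are normalized into a single 0-100% confidence score, the normalization below is an assumption chosen so that a perfect match yields 100%.

def levenshtein(a, b):
    """Edit distance between two attribute names (classic dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def mappable_attributes(stats, gamma=1000):
    """MappableAttributeSet: attributes whose cardinality stays below Gamma."""
    return {attr for attr, s in stats.items() if s["Card"] < gamma}

def confidence(values_a, values_b, avg_len_a, avg_len_b, name_a, name_b, diff_threshold=80):
    """Score one candidate pair A.name_a = B.name_b; returns None if pruned."""
    diff = len(set(values_a) - set(values_b))          # difference set A \ B
    if diff > diff_threshold:
        return None                                    # |A\B| exceeds DiffThreshold
    diff_score = 1.0 - diff / max(len(set(values_a)), 1)
    length_score = 1.0 / (1.0 + abs(avg_len_a - avg_len_b))
    name_score = 1.0 - levenshtein(name_a, name_b) / max(len(name_a), len(name_b))
    # default weights from the text: 60% difference set, 20% length, 20% attribute name
    return 100.0 * (0.6 * diff_score + 0.2 * length_score + 0.2 * name_score)

# Identical value sets, identical average lengths and identical names yield 100.0,
# the OrderReceived.OrderId = ShipmentCreated.OrderId case discussed below.
print(confidence(["O-1", "O-2"], ["O-1", "O-2"], 3.0, 3.0, "OrderId", "OrderId"))  # 100.0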

Figure 10: ShipmentCreated Mappables

Figure 10 shows the statistics for ShipmentCreated events. In this domain it might be unlikely that a shipment has more than 10 orders; however, this assumption might cause problems in other domains or for certain relationships (one customer definitely has more than 10 orders). Therefore, we set Gamma = 10, and as Card < Gamma applies to all attributes, they are all flagged as mappable attributes. Finally, the DateTime attributes are removed from the candidate lists of both OrderReceived and ShipmentCreated, as attributes of type DateTime are not suitable for correlation pairs. The same applies to booleans and description texts. We now have a pruned list of attributes that are potential correlation identifiers for each type and create a list of all permutations of possible correlation rules:

OrderReceived.OrderId = ShipmentCreated.ShipmentId
OrderReceived.OrderId = ShipmentCreated.OrderId
OrderReceived.CustomerId = ShipmentCreated.ShipmentId
OrderReceived.CustomerId = ShipmentCreated.OrderId
OrderReceived.CustomerId = ShipmentCreated.Carrier

Commutative rules are removed from this list. In this case every attribute within a pair has the same type; if attributes are not of the same type they are also excluded from the list and the difference set is not calculated (as is the case for OrderReceived.OrderId = ShipmentCreated.Carrier). Based on this list, we first determine the DifferenceSet for all correlation rules. The remaining list contains only one correlation rule:

OrderReceived.OrderId = ShipmentCreated.OrderId

Then we determine the difference between the AvgAttributeLengths of the candidates and finally we calculate the LevenshteinDistance. The result is a table with all correlation rule candidates and their weighted scores. In this reduced case there is only one candidate, with DifferenceSet = 0, difference of AvgAttributeLengths = 0 and LevenshteinDistance = 0. The confidence score is calculated based on the weights (DifferenceSet = 60%, AvgAttributeLengths = 20%, LevenshteinDistance = 20%) and is therefore 100%, which means that OrderReceived.OrderId = ShipmentCreated.OrderId is a very significant correlation. In the next section we discuss the results of implementing and testing our correlation discovery algorithm on a detailed order management scenario that contains the transportation scenario as a component of its implementation.

4. RESULTS AND EVALUATION

Figure 11: Order Management Scenario

For evaluating the detection accuracy of the correlation algorithm we implement a semi-structured, case-oriented business process scenario relating to order management and export compliance regulations (illustrated in Figure 11). This scenario encompasses the transportation process that we introduced earlier in the paper and used to illustrate the concepts of the algorithm. The general idea of the scenario is that every order of a foreign customer has to be checked as to whether it violates certain export regulations. In the case of a clear export violation or inconsistencies, the order is flagged, automated background checks are performed to collect information about the customer and the order, and then a case is created. Before the order is finally declined or released, domain experts must perform some workflow-driven investigation involving e-mail inquiries, site visits and evidence gathering. This has to be done to ensure that decisions are made objectively and sufficient documentation is available for later justifications or audits (which is important for responsibly releasing an order). If a decision is made to release the order then the order-to-shipment process continues as normal.

Figure 12: Correlation Discovery Example Screenshot

This scenario is particularly interesting for our purposes as it involves a wide range of heterogeneous systems (Order Management, Document Management, E-Mail, Export Violation Detection Services, ...) as well as workflow-supported human-driven interactions (Process Management System). All of those systems generate a wide range of events at different granularity levels, which makes it challenging to extract a set of correlation rules that can isolate process instances. The goal of our evaluation is to determine the precision of our proposed correlation algorithm in terms of its accuracy in determining correct correlations. To achieve this, we developed a tool that simulates events representing the processes and systems in the order management and export compliance regulations scenario. In our experiments we take 24 event types, consisting of altogether 95 attributes, into account for correlation discovery. Normally, we determine the parameters based on experience and apply knowledge about the source data. We have made good empirical observations on different scenarios by applying the following parameter setup: Alpha = 0.95, Gamma = 1000, Epsilon = 5, Phi = 0.9 and DiffThreshold = 80.


In order to gain an understanding of the best parameter setup, we conduct an experiment and apply the correlation discovery algorithm with a large spectrum of parameter permutations. The following list presents the intervals of the parameters that have been tested:

• Alpha: 0.5 – 0.95 in steps of 0.05
• Epsilon: 5 – 15 in steps of 1
• Phi: 0.50 – 0.90 in steps of 0.1
• DiffThreshold: from 50 in steps of 5

We left out the Gamma parameter for the reasons described in Section 3.3 and set it to a high value. As a result of applying all permutations of the parameters we calculate a total of 1,151,349 correlation rule candidates for 4,265 correlation sets. Among these 4,265 sets we determine the threshold of parameters for the correlation set where all of the correlation rules have a confidence of 100%. In order to determine correlation candidates we remove transitive rules within a set. The resulting best configuration of parameters for this particular scenario is: Alpha = 0.95, Gamma = 10000, Epsilon = 6, Phi = 0.5, DiffThreshold = 90.
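A sketch of this parameter sweep is shown below; the ranges are the ones listed above, the upper bound of the DiffThreshold range is assumed to be 90 (it is not stated explicitly), and run_discovery is a placeholder for the actual discovery algorithm.

from itertools import product

# Sketch of the parameter sweep; run_discovery stands in for the correlation discovery run.
alphas = [round(0.5 + 0.05 * i, 2) for i in range(10)]       # 0.50 .. 0.95
epsilons = list(range(5, 16))                                  # 5 .. 15
phis = [round(0.5 + 0.1 * i, 1) for i in range(5)]             # 0.5 .. 0.9
diff_thresholds = list(range(50, 95, 5))                       # 50 .. 90 (upper bound assumed)

def run_discovery(alpha, epsilon, phi, diff_threshold):
    """Placeholder: would return the set of correlation rules found for one configuration."""
    return set()

results = {}
for alpha, eps, phi, diff in product(alphas, epsilons, phis, diff_thresholds):
    results[(alpha, eps, phi, diff)] = run_discovery(alpha, eps, phi, diff)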

Figure 13: Number of correlation rules according to the number of correlation sets.

Figure 13 shows the distribution of the number of correlation sets (x-axis) against the number of correlation rules per set (y-axis), computed on the basis of the previously introduced range of parameter permutations. The left side of the x-axis has the parameter combinations with the lowest selectivity and the right side shows the highest selectivity. Two interesting observations can be made from the graph. Point (A) shows a sharp break due to a very high Epsilon parameter: at that point Epsilon = 14, which filters out a large portion of correlation candidates, as the majority of the potential correlation candidates have a lower average attribute length (compare with Section 3.3). The second point of interest (B) shows a similarly drastic change in selectivity due to a major difference in the DiffThreshold settings; a setting of DiffThreshold > 90 has a high selectivity on the simulated scenario data.

We execute the correlation discovery algorithm with the best parameter setup computed above on simulated test data that contains 40 simulated cases. The algorithm detects a total of 464 correlation rules with one false positive and one undesired correlation rule. If the goal is to isolate a process instance then the precision of the algorithm is 99.56% (No. of RelevantCorrelationRules / (No. of RelevantCorrelationRules + FalsePositives) * 100). The execution time to calculate the correlation rule candidates for this scenario (a total of 1.3 MB of process data) is on average 320 ms on an Intel Core 2 Duo CPU (2.5 GHz) machine with 4 GB RAM.

One example of an incorrect correlation pair is the correlation over the attribute orderVolume, which occurs in several events within a process instance. Avoiding this kind of false correlation rule is difficult, as the attribute values are long enough (over 6 characters) to suggest uniqueness, while the same order amount can also occur in events of unrelated process instances because different customers ordered the same amount. The other exceptional correlation rule is the relationship between events via their customerId. This correlation is not technically false, but it conflicts with the goal of isolating process instances, since a customerId forms relationships across independent process instances (orders of a particular customer).

Our experimental results demonstrate that the algorithm achieves good precision in extracting correlation sets. The algorithm extracts a large number of correlation rules that are correct, but not all of the extracted correlation rules may be useful to isolate specific process aspects that are of interest to a user. For example, almost every event is correlated over an orderId and a caseId; however, one of them is enough to form a correlation that isolates a process instance. The large number of extracted correlation pairs can be reduced by removing transitive correlation rules (as done in the evaluation above). Another possibility is to apply graph reduction to reduce the number of correlation rules that isolate a process instance. In the particular example discussed above it is possible to reduce the number of correlation rules to 21. Figure 12 shows a screenshot of the correlation discovery user interface displaying an excerpt of the correlation rules discovered in our scenario, with graph reduction applied to reduce complexity. The graph edges in the screenshot do not imply a direction; nevertheless, for applying graph reduction we treat them as directed. The arrow indicates which nodes contain a mappable attribute. Graph reduction may not always be desirable, particularly if the rules are used to create a correlation graph between every related event.

Having implemented the correlation discovery algorithm, we leveraged discovered correlation rules to build aggregation nodes for the order management scenario. The resulting aggregation nodes demonstrate how they can be used to isolate process instances or specific dimensions of relationships between events to allow more efficient querying, filtering and analytics.

5. RELATED WORK
Some existing work addresses the problem of correlating events to create a historic view to explore and discover different aspects of business processes [20][21]. Process mining partly addresses the problem by analyzing logged execution data of process instances and generating a representation of a process model. Current work in the area of process mining and discovery, such as [11], requires clean, pre-processed, chronologically ordered and correlated process instance traces [12][13]. The correlation specification in these papers is assumed to be conducted by a human having expert knowledge about the domain, the data sources and the applications involved.

The work by De Pauw et al. [8] is very relevant to our work and influenced the design of our correlation algorithm. Like De Pauw et al. we also take the notion and determination of Indexable and Mappable Paths into account, but with the major purpose of reducing the problem space of candidate-pair permutations that need to be checked against each other for potential correlations. In our algorithm this step can be left out and instead every attribute of a type can be matched against every other attribute of the same type. As we store data for correlation discovery in a distributed data store, we can distribute statistics and matching calculations over several machines. This could allow us to significantly reduce the detection time depending on the cluster size and would not force one to make the trade-off of reducing the problem space. This is the subject of our future work. Our correlation algorithm also takes several other attribute-based statistics into account to improve the precision of the correlation candidate detection and also calculates a confidence score based on those statistics.

Rostin et al. [14] take a machine learning approach to automatically discover foreign key constraints in relational databases. They compile and validate a list of the most selective rules for their purposes, including rules such as (a) a foreign key (FK) must have a good coverage of a primary key (PK), (b) the PK and FK column names must have significant similarity, (c) the average length difference between the values of attributes should be as low as possible and (d) the value range of PKs should be only slightly outside the range of FKs. Their rules for detecting foreign key constraints share some similarities with our discovery algorithm, with the difference that their rules are specific to their application. Therefore, from our point of view, the confidence weights must be specified as parameters in order to adjust to changing data sources.

Research by Motahari Nezhad et al. [23] is also relevant to our work. Their approach primarily takes instance-based measures into account to determine the “interestingness” of correlation pairs (and groups of pairs). Similar to De Pauw et al. [8] they apply a basic ratio measure to prune correlation pairs up front. Their approach has the major advantage that instance characteristics are taken into account to significantly improve the result quality of the algorithm, particularly when the application domain is focused on process instance discovery. On the other hand it comes with a trade-off regarding performance, as it requires correlating a relatively large number of messages to form instances. Our

approach, in contrast, focuses on determining correlation pairs based on computed statistics before correlating events, as the goal is to produce correlation rules that can subsequently be applied to correlate events for further investigation.

The CORDS [6] tool makes use of statistical methods to discover correlations and soft functional dependencies between database columns to produce a dependency graph that improves the performance of query optimizers. Their approach to detecting correlation candidates is mainly based on the work of Haas and Brown [15], which generates pairing rules of tables and applies pruning rules, such as type and statistical constraints, to reduce the search space. A pairing rule in the context of their work is a relationship between two attributes, such as, for example, a join between two database tables over two attributes (orders.orderID = deliveries.orderID).

In relational databases, data and its attributes are organized in tables (i.e. relations) to minimize redundancy in order to avoid undesired side-effects; for instance, inconsistencies can arise when applying operations (insertions, deletions). This process is commonly referred to as normalization. Other modelling disciplines, such as data warehousing, apply de-normalized and redundant data structures in order to increase query performance with the trade-off of lower insertion performance. In both cases, there is detailed knowledge about the data available, which is defined in a data schema. This means that there are defined relations with defined attributes and types (e.g. integer, string, timestamp, ...). So, for instance, a relation Order has a defined set of attributes such as an orderId as integer or a deliveryTime as a timestamp. A key difference between our work and such approaches is that our approach does not assume that events are grouped together in a normalized schema, nor does it have any meta-data that describes an event's attributes. Therefore there is no information available on whether an attribute is of a certain type, and the algorithm needs to infer these characteristics based on various attribute value statistics.

6. CONCLUSION AND FUTURE WORK
In this paper we address the problem of automatically deriving correlations from arbitrary data sources. The algorithm we present for correlation discovery is similar in principle to previous work that focuses on determining foreign-key relationships known from the relational world. A key difference, however, is that our correlation discovery algorithm does not rely on the assumption that the events are grouped together in a normalized schema, and thus it can deal with redundancies and does not require any meta-data that describes the event attributes. We have implemented our correlation discovery algorithm and designed and implemented a simulator to validate the results. The simulator implements a semi-structured, case-oriented business process scenario relating to export compliance regulations. Experimental results on events generated by the simulator indicate that our correlation discovery algorithm achieves good performance in terms of the accuracy of generated correlation rules. This allows us to conclude that it is a promising tool for automatically discovering correlation rules.

The performance of our algorithm on very large data sets could be greatly improved by distributing the algorithm over multiple machines. The need for discovering correlation rules over large data sets arises for a variety of reasons. In domains where the

algorithm needs to detect correlations between events representing processes, it is not always possible to extract a small sample set of data. For instance, if a sample set of one week is sliced out for correlation discovery, certain events might be missing and the chances that the right correlation rules are detected are low. The correct sample size for detecting the right set of correlation rules depends on the domain, the nature of the event-producing systems and the processes. For example, in the case of the transportation scenario, we would expect an end-to-end process to have a cycle time of weeks; this means that a good sampling set would require more than a slice of a week. Since the data storage of our system is based on a cloud infrastructure, future work includes distributing the computation of statistics and analytics, with the goal of operating on large sample sets and delivering results in a reasonable amount of time. This would also enable the comparison of correlation rule changes over time. Slices of certain time-frames could be extracted to compute correlation rules for each of the corresponding time intervals. Comparison of correlation rules from different time-frames could be used, for instance, to gain insight into process evolution. Another avenue of future work could be the incorporation of semantic knowledge into the correlation discovery algorithm from an ontology space to bridge semantic gaps, as Moser et al. [25] have done.

In order to achieve a fully automated correlation discovery system it is necessary to have a method for grouping or clustering source events. At the data staging step, knowledge about the schema and the structure that introduces a type is required; the algorithm detects correlations between attributes of those types. In most cases this is naturally given by the source of the event or by some attribute. When a natural distinction is not possible, however, one needs to be able to create groups, clusters or types automatically, without explicitly requiring humans to define ways to differentiate between them.

7. REFERENCES
[1] S. Rozsnyai, A. Slominski, and G. T. Lakshmanan. Automated Correlation Discovery for Semi-Structured Business Processes. DMA4SP 2011.
[2] R. S. Barga and H. Caituiro-Monge. Event correlation and pattern detection in CEDR. In Proc. Int. Workshop Reactivity on the Web, 2006.
[3] G. T. Lakshmanan, P. Keyser, A. Slominski, F. Curbera, and R. Khalaf. A Business Centric End-to-End Monitoring Approach for Service Composites. IEEE SCC 2010: 409-416.
[4] A. Halevy, A. Rajaraman, and J. Ordille. Data integration: the teenage years. In Proc. VLDB, 9-16, 2006.
[5] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10(4):334-350, 2001.
[6] I. Ilyas, V. Markl, P. Haas, and P. Brown. CORDS: Automatic discovery of correlations and soft functional dependencies. 2004.
[7] A. Rostin, O. Albrecht, F. Naumann, J. Bauckmann, and U. Leser. A Machine Learning Approach to Foreign Key Discovery. In WebDB, 2009.
[8] W. De Pauw, R. Hoch, and Y. Huang. Discovering Conversations in Web Services Using Semantic Correlation Analysis. In Proc. ICWS, 639-646. IEEE, 2007.
[9] G. Hohpe and B. Woolf. Enterprise Integration Patterns. Addison-Wesley, 2004.
[10] P. Niblett and S. Graham. Events and Service-Oriented Architecture: The OASIS Web Services Notification Specifications. IBM Systems Journal, 44(4):869-887, 2005.
[11] B. F. van Dongen and W. van der Aalst. A meta model for process mining data. In Proc. of the CAiSE'05 Workshops, vol. 2, 309-320, 2005.
[12] H. Gonzalez, J. Han, and X. Li. Mining compressed commodity workflows from massive RFID data sets. In CIKM '06: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, 162-171. ACM, 2006.
[13] G. Decker and J. Mendling. Process instantiation. Data and Knowledge Engineering, 2009.
[14] A. Rostin, O. Albrecht, J. Bauckmann, F. Naumann, and U. Leser. A machine learning approach to foreign key discovery. In WebDB, 2009.
[15] P. J. Haas and P. G. Brown. BHUNT: Automatic discovery of fuzzy algebraic constraints in relational data. In Proc. 29th VLDB, 668-679. Morgan Kaufmann, 2003.
[16] G. Hohpe. Programming without a call stack - event-driven architectures. www.enterpriseintegrationpatterns.com/docs/EDA.pdf, November 2007.
[17] J.-L. Marechaux. Combining service-oriented architecture and event-driven architecture using an enterprise service bus. http://www-128.ibm.com/developerworks/webservices/library/ws-soa-eda-esb/index.html, November 2007.
[18] K. Gerke, J. Mendling, and K. Tarmyshov. Case construction for mining supply chain processes. In W. Abramowicz, editor, Proc. of the Conf. on Business Information Systems. Springer, 2009.
[19] K. Gerke, A. Claus, and J. Mendling. Process Mining of RFID-based Supply Chains. In Proc. IEEE CEC, 2009.
[20] S. Rozsnyai, R. Vecera, J. Schiefer, and A. Schatten. Event cloud - searching for correlated business events. In CEC/EEE, 409-420. IEEE Computer Society, 2007.
[21] H. Roth, J. Schiefer, H. Obweger, and S. Rozsnyai. Event Data Warehousing for Complex Event Processing. In Proc. RCIS, 2009.
[22] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. In Proc. of the 7th OSDI, November 2006.
[23] H. R. Motahari Nezhad, R. Saint-Paul, B. Benatallah, and F. Casati. Event Correlation for Process Discovery from Web Service Interaction Logs. Accepted in the VLDB Journal, August 2010.
[24] J. Schiefer, S. Rozsnyai, C. Rauscher, and G. Saurer. Event-driven rules for sensing and responding to business situations. In Proc. DEBS, 198-205. ACM, 2007.
[25] T. Moser, H. Roth, S. Rozsnyai, R. Mordinyi, and S. Biffl. Semantic Event Correlation Using Ontologies. In Proceedings of the Confederated International Conferences CoopIS, DOA, IS, and ODBASE, 2009.
