Message Correlation for Conversation Reconstruction in Service Interaction Logs

Hamid R. Motahari Nezhad, Regis Saint-Paul, Boualem Benatallah
School of Computer Science and Engineering
The University of New South Wales, Sydney, Australia
{hamidm|regiss|boualem}@cse.unsw.edu.au

Fabio Casati, Periklis Andritsos
University of Trento, Trento, Italy
{casati|periklis}@dit.unitn.it

Technical Report UNSW-CSE-TR-0709

March-May 2007

Abstract

The problem of understanding the behavior of business processes and of services is rapidly becoming a priority in medium and large companies. To this end, analysis tools as well as variations of data mining techniques have recently been applied to process and service execution logs to perform OLAP-style analysis and to discover behavioral (process and protocol) models from execution data. All these approaches are based on one key assumption: the events describing executions and stored in process and service logs include identifiers that allow associating each event with the process or service execution it belongs to (e.g., that allow correlating all events related to the processing of a certain purchase order or to the hiring of a given employee). In reality, however, such information rarely exists. In this paper, we present a framework for discovering correlations among messages in service logs. We characterize the problem of message correlation and propose novel algorithms and techniques based on well-founded principles and heuristics on the characteristics of conversations and of the message attributes that can act as identifiers for such conversations. As we will show, there is no right or wrong way to correlate messages, and such correlation is necessarily subjective. To account for this subjectivity, we propose an approach where algorithms suggest candidate correlators, provide measures that help users understand the implications of choosing a given correlator, and organize candidate correlators in such a way as to facilitate visual exploration. The approach has been implemented, and experimental results show its viability and scalability on large synthetic and real-world datasets. We believe that message correlation is a very important and challenging area of research that will witness many contributions in the near future, due to the pressing industry need for process and service execution analysis.

1 Introduction

The problem of understanding the behavior of information systems and the processes and services they support is rapidly becoming a priority in medium and large companies. This is demonstrated by the proliferation of tools for the analysis of process executions, service interactions, and service dependencies, and by recent research work in process data warehousing and process discovery. Indeed, the adoption of business intelligence techniques for business process improvement is the primary concern for medium and large companies [12]. The opportunity for such analysis comes with increased automated support, which corresponds to the ability to observe (collect events on) executions and interactions, thereby building information sources that can be used for the analysis. Typical applications consist in monitoring process and service executions, performing OLAP-style queries such as computing average process durations, or identifying the steps in the process where the most time is spent (bottlenecks). Recently, variations of data mining techniques have also been applied to process and service execution logs to achieve important results such as (i) process and protocol discovery [25, 20] and (ii) root cause analysis for performance or quality degradations in process and service executions [24, 11].

All these approaches and solutions to business process and service analysis are based on one key assumption: it is known which set of event attributes identifies the set of events that belong to the same execution (also called case) of a process or service, e.g., that belong to the processing of the same purchase order. Without knowing the attributes that correlate events, execution analysis is difficult if not impossible. In reality, the large majority of information in a company is not correlated today [10, 6], and there is no automated or semi-automated way to achieve correlation of the events of the same execution case. Without this correlation, tasks such as process and protocol analysis cannot be performed. In addition, the situations in which trivially known correlator attributes (e.g., Instance ID) are present are very limited. In process execution, instance IDs are essentially present only for those processes supported by a workflow management system (WfMS), and specifically for the part of the business process supported by the WfMS. Today, a small percentage of processes are supported by a WfMS, and even in those cases only a small portion of the entire business process is WfMS-based. Hence, process analysis today is limited to the portions of the processes supported by the WfMS. On the services side, the problem is similar: when studying conversations among services (a conversation is a sequence of message exchanges between parties for achieving a certain goal, e.g., completing a purchase order), we rarely find conversation identifiers in web service execution logs. Only when the service implementation is tightly coupled with the logging

infrastructure (e.g., when IBM WebSphere Integration Developer and IBM Process Server are used together [21]) is it possible that identifiers are included or are known to the logging infrastructure. In many cases, however, the logging infrastructure is offered as a standalone product (as in the case of HP SOA Manager and CA SOA Manager) that is not linked to individual service implementations. In these cases, the information about identifiers is not known to the logging infrastructure, but is hidden somewhere in the exchanged messages. In fact, this was one of the main obstacles that the authors faced when integrating automated process/protocol discovery solutions into HP SOA Manager, a monitoring and management infrastructure for Web services, and it was one of the motivations for this work.

This paper tackles the problem of discovering correlations among messages exchanged by services. Specifically, our goal is that of correlating and grouping messages based on the conversation they belong to. This is equivalent to deriving conversation IDs with which to label each message. In particular, we define the problem of message correlation, we present algorithms for correlation, and we show how to perform semi-automated correlation by leveraging user input as part of the correlation process.

Message correlation is challenging for several reasons. First, the definition of what constitutes a conversation is subjective. As an example, consider the interactions of a game service with its clients (players). One analyst may be interested in grouping the messages based on the session identifier into different sessions, in each of which players may play several games. Another may be interested in grouping the messages based on the game identifier, regardless of the individual players involved. A third may be interested in grouping the same messages based on player identifiers, regardless of the sessions and games the players have been involved in. Even if the problem were clearly defined, we would still have to face the challenge of how to perform correlation. Simple approaches such as only looking at temporal vicinity cannot be applied, as we may have dozens of concurrent conversations at each point in time, and as the time gap between two invocations of a service in the context of the same conversation can range from seconds to days. Looking at message content is also far from trivial, as there is no obvious choice for identifying the message attributes that can act as correlators, and as the correlation logic can be quite complex. For example, messages related to the processing of the same purchase order may all share the same order identifier, but it can also be that some messages include order identifiers, others include payment identifiers, and yet they are all related to the same order processing. In addition, not a single attribute but a set of attributes may be used to correlate messages to one another in the same conversation. Even if we knew that there is in the dataset a (set of) attribute(s) that can act as correlator, finding the appropriate one is not trivial, as there are many attributes that could be used as a basis for partitioning messages into conversations (e.g., consider the game service

Figure 1: Overview of the correlation discovery process (back-end: attribute pruning, atomic rule inference, and composite rule inference over the log tables, producing candidate correlation rules; front-end: rule summarization and visual rule exploration, with user-driven exploration and refinement operations feeding back to the back-end)

data introduced above). A final challenge consists in understanding how we can leverage possible user knowledge in performing the correlation, and how we can guide users toward the choice of a partitioning that is consistent with the user's correlation needs (which in turn depend on the kind of analysis to be performed on the correlated sets). In general, performing message correlation therefore requires exploring different combinations of a large number of message attributes and various possible correlation logics, without a clear and formally identifiable goal.

In this paper, we propose a generic framework and a set of algorithms for message correlation. Figure 1 shows an overview of our proposed solution. It consists of two components: a back-end component that automatically discovers candidate correlators, and a front-end component that provides a visual environment to explore the discovered correlations and refine them interactively. This paper, and the proposed solution, provides the following contributions:

1. We define and characterize the problem of message correlation, illustrating the different natures of the correlation problem and introducing the notion of correlation dependency, which identifies whether two messages are related and why they can be considered as being related. These dependencies are expressed by means of rules, which are functions over message attributes.

2. We identify classes (patterns) of rules that can be used to define dependencies. For example, a simple rule is to identify correlation dependencies based on whether an attribute A has the same value in two

messages, but more complex rules are possible. The identification of these classes is based both on common ways of performing correlation and on recommendations by standardization proposals for how services should perform correlation.

3. We present a method for discovering correlation rules based on a level-wise approach [18, 2]. The approach is iterative: in each iteration, it generates a set of candidate rules, then evaluates and prunes non-interesting rules. In particular, we first identify candidate correlator attributes (step attribute pruning in Figure 1). The attribute pruning step operates based on heuristics on the typical characteristics of attributes whose values can be used as discriminators for defining the membership of messages in conversations (and that therefore can be used to define correlation rules). The candidate attributes are used to derive the rules (step atomic rule inference). We expect that if a correlation rule exists, it will identify correlations that generate a proper partition of the log, in which the number of conversations is more than one but less than the number of messages; discovered atomic rules are pruned on this basis. A correlation dependency may also be expressed by a composition of rules rather than by a single atomic rule: for example, some messages within the same conversation may be correlated via an orderID, others via a paymentID (step composite rule inference in Figure 1). We introduce algorithms for atomic rule composition using conjunctive and disjunctive operators. These algorithms also follow an iterative approach, comprising candidate rule generation and pruning.

4. Due to the subjective nature of the correlation problem, the set of candidate rules is derived in a conservative manner, meaning that the algorithm tries to capture a wide range of correlation rules potentially of interest. Finally, all discovered rules are presented to the user. The final evaluation step is interactive, to cater for the subjectivity of correlation; hence, the whole approach is iterative and semi-automated, as is any knowledge discovery task [9]. To help the user explore the discovered rules, we define metrics computed from the statistics of the conversations that result from using a given rule to group the messages in the log. We present a visual environment in which the rules are organized in a lattice. We also allow the user to refine the discovered correlation rules, which leads to adding rules to, or removing them from, the set of discovered rules. In addition, we allow user knowledge to be incorporated into the correlation discovery approach as part of the heuristics that drive the discovery.


In this paper, we also substantiate these arguments by means of experiments on real and synthetic datasets, and we discuss our implementation. We believe that the proposed correlation discovery framework can have a significant impact also outside the area of services and conversations. For example, analogous concepts can be applied to the correlation of process execution data (service messages are not significantly different from process execution events, and the problem of subjective interpretations of what constitutes a business process also applies), and hence can enable business process correlation and analysis throughout heterogeneous enterprise systems.

The remainder of the paper is structured as follows: Section 2 introduces definitions and concepts and defines the correlation discovery problem. Section 3 presents our approach for the automated discovery of correlation rules. Section 4 presents the user-driven exploration and refinement of the discovered rules. In Section 5 we present our implementation and experiments, conducted over three interaction logs. We discuss related work in Section 6. Finally, we conclude and discuss future work in Section 7.

2 Message Correlation Problem

2.1 Web Service Logs

Messages are exchanged between services to fulfill a goal, e.g., to place a purchase order, receive an invoice, pay for goods, and finally arrange for shipping. Taken together, these messages form a conversation that achieves a single business transaction. At any given time, there might be several ongoing conversations, corresponding to several clients interacting simultaneously with a service. Messages exchanged during service conversations can be logged using various infrastructures [7]. For simplicity, we define in this article a generic log model where each message is logged as an event e, represented by a tuple of a relation L = {e1, e2, ..., em}, with ei ∈ A1 × A2 × ... × Ak, where k is the number of attributes and m is the number of events. An event may have several attributes, corresponding to the attributes of the message exchanged. We denote by e.Ai, 1 ≤ i ≤ k, the value of attribute Ai of event e. We further assume that each event e has at least three other attributes: the timestamp at which the event is recorded (e.τ), the sender of the message (e.s), and the receiver (e.r). In the following, we use interchangeably message—the actual document exchanged between services—and event—their representation as tuples in the log. Note that web service interactions usually involve structured (XML) messages (of different types, and hence with different attributes) organized in sections such as a header and one or more body parts. An ETL-like preprocessing step is required to extract features from the XML documents and represent them as event tuples (see Section 5). The attributes A1, ..., Ak represent

here the union of all the message attributes that belong to the different message types. Each message (event) will only have a subset of these attributes. We assume that there are some attributes that allow us to determine whether any two messages belong to the same conversation. We call such attributes correlator attributes, and we call the dependencies (relationships) defined over correlator attributes correlation rules.
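To make this log model concrete, a minimal sketch follows (ours, not the paper's implementation; the attribute names OID and IID are invented). Each event is a sparse tuple that carries only the attributes its message defines, plus the mandatory tau, s, and r:

```python
# A tiny log L: each event is a sparse tuple over the attribute set A1..Ak,
# plus the three mandatory attributes tau (timestamp), s (sender), r (receiver).
L = [
    {"tau": 1.0, "s": "buyer", "r": "shop", "msg": "m1", "OID": "o1"},
    {"tau": 2.0, "s": "shop", "r": "buyer", "msg": "m2", "OID": "o1", "IID": "i1"},
    {"tau": 3.0, "s": "buyer", "r": "shop", "msg": "m3", "IID": "i1"},
]

def value(e, a):
    """e.Ai -- None when attribute Ai is undefined (null) for this event."""
    return e.get(a)
```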

2.2 Correlation Rules

In this paper, we define a correlation of messages in a service log as a partition of log L into a collection of conversations c1, c2, ..., where each conversation ci is a sequence of events (each corresponding to a message), denoted ci = ⟨e1, e2, ...⟩. Note that, as we stressed in the introduction, correlation has a degree of subjectivity, so there is no "right" or "wrong" partitioning. The key toward a "good" partitioning lies in finding a way to group messages based on their relationships, for the sake of facilitating service execution analysis and protocol discovery. The approach we follow is based on deriving a set of correlation rules, which are functions over pairs of correlator attributes for event pairs (ex, ey). Correlation rules specify whether events belong to the same conversation. A correlation rule R can alternatively be seen as a relation over L × L, where (ex, ey) ∈ R if and only if ex and ey are correlated. In general, a correlation rule can be defined as an arbitrary function over a pair of events; the precise form of a rule depends on the specific domain for which the rule is defined. We call a rule atomic when it is defined on only one pair of attributes. For example, an atomic rule may specify an equality relationship between attributes Ai and Aj in (ex, ey), represented as ex.Ai = ey.Aj. A rule is composite when it is made of a conjunction or disjunction of atomic rules. The correlation discovery problem is the problem of inferring, from the data present in the log, how to correlate messages, that is, inferring the correlation rule definition. In this paper, we perform message correlation for the purpose of conversation discovery (reconstruction). To derive correlation rules for messages, we first explore the different forms that correlation rules may take. In the context of message correlation, we leverage the different patterns in which dependency between messages manifests itself in service logs. These patterns are defined based on an analysis of standards proposals and of service implementations. Doing so, we restrict the rules to follow a relatively small set of corresponding rule patterns. Then, we analyze the log based on heuristics that capture the subjective aspect of correlation, to derive the candidate correlation rules and to determine which patterns can be applied.
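Under this definition, rules are naturally modelled as predicates over event pairs, and composition as predicate combinators. A minimal sketch (ours; events are dictionaries as in the earlier sketch, and null values never correlate two events):

```python
def atomic(ai, aj):
    """Atomic rule R: (ex, ey) in R iff ex.Ai = ey.Aj (nulls never correlate)."""
    def rule(ex, ey):
        v = ex.get(ai)
        return v is not None and v == ey.get(aj)
    return rule

def conj(r1, r2):
    """R1 AND R2: the intersection of the two relations over L x L."""
    return lambda ex, ey: r1(ex, ey) and r2(ex, ey)

def disj(r1, r2):
    """R1 OR R2: the union of the two relations over L x L."""
    return lambda ex, ey: r1(ex, ey) or r2(ex, ey)

# Example composite rule: ex.OID = ey.OID  OR  ex.IID = ey.IID
r_chain = disj(atomic("OID", "OID"), atomic("IID", "IID"))
```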


2.3 Correlation Rule Patterns

In this section, we describe the correlation patterns commonly used for message correlation in web services.

Key-based Correlation. In many cases, conversations among services are characterized by the fact that all messages have the same value for certain attributes. For example, when services adopt web services standards such as WS-Coordination, WS-Conversation or WS-CDL, all messages may include a standard-specified attribute—typically called a (conversation or instance) identifier—that acts as the correlation element. Even when these standards are not used, there may be an attribute that characterizes the conversation. For example, messages related to a procurement conversation will likely have the same order ID, or the same pair ⟨supplierID, orderID⟩. Standards proposals such as WS-BPEL allow for specifying a given (set of) attribute(s) as the correlator; in WS-BPEL, the definition of correlation sets may be used for this purpose. Based on this observation, we define the key-based correlation pattern, where some attributes of messages, called keys, are used to uniquely identify a conversation. As with keys in relational databases, correlation keys can be simple (single attribute) or composite (multiple attributes). For example, in the log presented in Figure 2(a), messages can be correlated using the simple key on attribute ConvID. In this case, the log contains two conversations, c1 = ⟨e1, e3, e6⟩ and c2 = ⟨e2, e4, e5⟩. Figure 2(b) presents a log where correlation may use a composite key on attributes CustID and OID, resulting in conversations c1 = ⟨e1, e3⟩, c2 = ⟨e2, e5⟩ and c3 = ⟨e4, e6⟩. However, contrary to the traditional concept of a key in databases, the value of a key attribute is not unique per tuple of L but only per conversation. Moreover, we do not know a priori which events in L form the same conversation. As a consequence, it cannot be known a priori whether correlation has to be done using this composite key or using either of the simple keys CustID and OID; all three rules are valid candidate rules for correlating this log.

(a)
      msg   ConvID
 e1   m1    1
 e2   m1    2
 e3   m2    1
 e4   m2    2
 e5   m3    2
 e6   m3    1

(b)
      msg   CustID   OID
 e1   m1    c1       o1
 e2   m1    c2       o1
 e3   m2    c1       o1
 e4   m1    c2       o2
 e5   m2    c2       o1
 e6   m2    c2       o2

Figure 2: (a) ConvID as a simple key, (b) CustID and OID as a composite key

Rules corresponding to key-based correlation are of the form:

(ex, ey) ∈ R ⇔ ex.Ai = ey.Ai ∧ ex.Aj = ey.Aj ∧ ...

where Ai, Aj, ... are the attributes composing the key.

Reference-based Correlation. Messages in a conversation may be related via links (references) that connect each message with a previous one in the same conversation. For example, a reply message is often correlated with the request message (most reply messages in Web services contain a reply-to attribute that links to the request). As another example, again taken from the procurement domain, all messages related to the purchase of a good will carry either the order ID, or the invoice ID, or both. In this case there is no common key (the invoice ID appears only in the later messages of the conversation), but there is always a way to link all messages in the conversation (except the first) to a previous one. For example, there is typically a message that includes both the order ID and the invoice ID, acting as a bridge and closing the missing link in the link chain. This correlation pattern is supported by BPEL and can be defined using correlation sets. It could also be implemented using the RelatesTo attribute of the WS-Addressing standard, which allows referring to the message ID (MsgID) of a previous message. To account for this form of correlation, we introduce the reference-based pattern, in which the messages of a conversation (except for the first one) are linked with a preceding one via a reference attribute. This pattern may have various sub-patterns. For instance, each message may be linked to the preceding one in a conversation (chain). In Figure 3(a), events form the sequence ⟨e1, e3, e5⟩, where the events of the pair (e1, e3) are linked by a common value on attribute OID (Order Id) and the pair (e3, e5) by a common value on attribute IID (Invoice Id). Another sub-pattern is the one where all messages are linked to the first message. For instance, in Figure 3(b), events e3 and e5 can both be correlated with event e1 since their value on attribute Ref is equal to that of e1 on attribute OID. Note that the attribute used for reference-based correlation may change during a conversation. Furthermore, even the value used for reference correlation may be common only to pairs of messages, not to the entire conversation. Hence, key-based correlation is not a particular case of reference-based correlation. Similarly to key-based correlation, references may be defined over more than one attribute. If the key attribute(s) is known, then reconstructing the conversations in a key-based correlation consists in a standard GROUP BY operation. In reference-based correlation, this is not the case, and a transitive closure of the relation between attributes has to be computed. For the chain example of Figure 3(a), a possible correlation rule would be Rchain : ex.OID = ey.OID ∨ ex.IID = ey.IID. It would be Rtree : ex.OID = ey.Ref for the tree example of Figure 3(b).
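The two cases differ markedly in cost, as the sketch below illustrates (our code, not the tool's): key-based partitioning is a plain GROUP BY, while reference-based partitioning computes the transitive closure of the pairwise relation, here with a union-find over a naive O(n^2) pairing (a real implementation would index attribute values to avoid comparing all pairs):

```python
from collections import defaultdict

def key_based_partition(log, key_attrs):
    """Key-based correlation: a plain GROUP BY on the key attribute(s).
    Events missing a key value are left out as uncorrelated."""
    groups = defaultdict(list)
    for e in log:
        key = tuple(e.get(a) for a in key_attrs)
        if all(v is not None for v in key):
            groups[key].append(e)
    return list(groups.values())

def reference_based_partition(log, rule):
    """Reference-based correlation: transitive closure of the pairwise
    relation `rule`, computed with a union-find over event indices."""
    parent = list(range(len(log)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for i in range(len(log)):
        for j in range(i + 1, len(log)):    # naive O(n^2) pairing
            if rule(log[i], log[j]):
                parent[find(i)] = find(j)   # union the two components
    components = defaultdict(list)
    for i, e in enumerate(log):
        components[find(i)].append(e)
    return list(components.values())
```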

(a) chain
      msg   OID   IID   PID
 e1   m1    o1    -     -
 e2   m1    o2    -     -
 e3   m2    o1    i1    -
 e4   m2    o2    i2    -
 e5   m3    -     i1    p1
 e6   m3    -     i2    p2

(b) tree
      msg   OID   Ref
 e1   m1    o1    -
 e2   m1    o2    -
 e3   m2    -     o1
 e4   m2    -     o2
 e5   m2    -     o1
 e6   m2    -     o2

Figure 3: Reference-based correlation: (a) chain, (b) tree

Time constraints on correlation. In some cases, it is not sufficient to specify a correlation rule using equalities of attribute values to distinguish between conversations. For example, in the case of a key-based correlation, key values may be reused (recycled) over time and, thus, a key value uniquely identifies a conversation only within a certain time window. Intuitively, messages that are close in time have a higher probability of being logically related (being dependent). Hence, the correlation logic may have to include temporal vicinity between events (though this is rarely a rule that can be used by itself: time can aid the definition of correlation but can rarely be used as the only correlator, especially when it is possible to have many concurrent conversations between a service and the same client). Correlation rules would have to account for this situation with an additional condition on the time difference between two messages, e.g., R : |ex.τ − ey.τ| ≤ θ, where θ represents the maximal time separation between two messages of the same conversation. It is also possible to specify the maximum duration of a conversation (the difference in time between the first and the last message of the same conversation). The WS-CDL standard proposal supports the definition of time constraints on correlation rules. We do not discuss this issue further in this paper (see the sketch at the end of this section for an illustration).

Combination of the above patterns. In general, correlation rules may include a combination of atomic rules, of the same or of different patterns. For instance, messages may first be correlated in a key-based way while customers query for product catalogs or product descriptions; then, after an order is placed, subsequent messages are correlated with reference to that order. We discuss the composition of rule patterns in Section 3.4.
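As an illustration of the time-window condition above, a minimal sketch (ours), assuming events carry a numeric timestamp tau as in the log model of Section 2.1 and that rules are predicates over event pairs:

```python
def within(theta):
    """Time-vicinity condition |ex.tau - ey.tau| <= theta."""
    return lambda ex, ey: abs(ex["tau"] - ey["tau"]) <= theta

def with_time_window(rule, theta):
    """Conjoin an attribute-based rule with the time-window condition,
    e.g. R: ex.OID = ey.OID  AND  |ex.tau - ey.tau| <= theta."""
    cond = within(theta)
    return lambda ex, ey: rule(ex, ey) and cond(ex, ey)
```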

3 Automated Discovery of Correlation Rules

The automated correlation rule discovery consists of three steps: (i) candidate attribute selection, which selects candidate correlator attributes of L; (ii) atomic rule discovery, in which the candidate attributes from the previous step are used to identify atomic candidate rules; and (iii) composite rule discovery, in which techniques for composing atomic rules are presented. We adopt a level-wise approach [18] for generating, pruning and exploring the space of potentially interesting correlation rules. The proposed algorithm for correlation rule discovery is presented below (Algorithm 1):


Algorithm 1 Correlation Rule Discovery Algorithm
Require: L
Ensure: R = Ratomic ∪ R∧ ∪ R∨
1: A ← selectCandidateAttributes(L)
2: Ratomic ← computeAtomicRules(A)
3: R∧ ← computeConjunctiveRules(Ratomic)
4: Ratomic ← Ratomic ∪ R∧
5: R∨ ← computeDisjunctiveRules(Ratomic)
6: R ← Ratomic ∪ R∨

The proposed algorithm is similar in spirit to the Apriori algorithm [2], in the sense that atomic rules are discovered first (similar to itemsets of size 1) and composite rules are discovered next (similar to the computation of itemsets of larger sizes). Each step of the algorithm (attribute selection, atomic rule discovery, conjunctive rule discovery, and disjunctive rule discovery) consists of two phases: candidate generation and candidate pruning. We introduce a set of properties, taken from first-order logic, and conditions that play a role similar to that of the Apriori condition in the candidate generation phase, as well as a set of heuristics to be used during the attribute/rule pruning phase. The heuristics are defined based on general statistical characteristics of conversations in service logs, and they play a role similar to that of the minimum support threshold in the Apriori algorithm. These properties, conditions and heuristics are used to prune the search space of the problem effectively. To address the other challenge of the problem, namely the huge database size and the high dimensionality of the data, we introduce some optimization techniques to make the approach computationally efficient. The output of the algorithm is the set R, which is the union of the atomic, conjunctive and disjunctive rules. The details of the algorithm steps are discussed in the following subsections: attribute selection techniques in Section 3.2, the generation and pruning of atomic rules in Section 3.3, and finally the discovery of composite rules (conjunctive and disjunctive) in Section 3.4. Note that in Algorithm 1, key-based correlators are discovered as part of atomic or conjunctive rules, for simple and composite keys respectively. Reference-based correlation and combinations of correlation patterns are also addressed by discovering disjunctive rules. Note that time constraints are not considered in this algorithm and are the subject of future work. Before presenting the algorithm steps, in the next section we discuss how rules are used to partition a log, and we define a series of metrics used to characterize a log partition.
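The candidate-generation/pruning loop shared by these steps can be summarized by the generic level-wise skeleton below (a sketch; the function names and rule encoding are ours, not the tool's):

```python
def levelwise(seed_rules, combine, worth_trying, keep):
    """Generic level-wise (Apriori-style) loop: combine surviving rules
    into larger candidates, prune cheaply before evaluation (worth_trying)
    and on the partitioning metrics afterwards (keep). Rules must be
    hashable; the loop terminates once no new candidate survives."""
    accepted = set(seed_rules)
    level = set(seed_rules)
    while level:
        candidates = {combine(a, b)
                      for a in level for b in accepted
                      if worth_trying(a, b)}
        level = {c for c in candidates if c not in accepted and keep(c)}
        accepted |= level
    return accepted
```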


Figure 4: Rule graph of candidate rules (events e1, e2, e3, e4 as nodes; edges labeled with the atomic rules R1, R2, R3, R4 that relate the corresponding event pairs)

3.1 Partitioning the Log

In order to illustrate how correlation rules (atomic or composite) partition the log into conversations, we represent the relationships among events as a graph GR = (L, R). In this graph, nodes represent events, and there is an edge from event ex to event ey if and only if (ex, ey) ∈ R. Building edges in this way for each candidate atomic rule, we obtain a labeled multigraph called the rule graph, denoted by G. For example, assume that we have a log L containing four events, L = {e1, e2, e3, e4}, and that we have discovered four atomic rules R1, R2, R3, R4; a possible rule graph for this log is illustrated in Figure 4. Partitioning a log L with a correlation rule R consists of identifying the set of conversations CR(L) = {c1, c2, ...}, ci ⊆ L, such that:

∀ci ∈ CR(L), ex ∈ ci ⇔ ∃ey, (ex, ey) ∈ R ∨ (ey, ex) ∈ R
∀ci, cj ∈ CR(L), i ≠ j ⇔ ci ∩ cj = ∅

Thus, identifying the conversations CR can be formulated as identifying the connected components of GR. For example, partitioning the rule graph presented in Figure 4 using rule R1 produces two connected components, each of length 2, and we have CR1 = {⟨e1, e2⟩, ⟨e3, e4⟩}.

The problem of identifying the connected components of a graph is well investigated in the literature [3]. In the database context, this problem consists in computing the transitive closure of a relation. Since we cannot be sure that the components have no cycles, we chose to implement this decomposition using a breadth-first search approach. In order to optimize this processing time, as well as the counting of distinct attribute values used in Sections 3.2 and 3.3, we first index each attribute of the log separately. With this setting, our experiments have shown that partitioning the log is very fast (see Section 5). A number of metrics can be defined to characterize a partition CR in a concise way:


• AvgLen(CR), MinLen(CR) and MaxLen(CR) represent respectively the average, minimum and maximum length of the conversations in CR;

• ConvCount(CR) represents the number of conversations; it is equal to |CR|;

• UncorrelatedCount(CR) represents the number of messages that are not related to any other message according to R.

In the remainder of this section we explain how these metrics are used to reduce the search space for candidate rules; a sketch of their computation follows.
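These metrics are straightforward to compute once the connected components are known. A minimal sketch (ours), treating singleton components as uncorrelated messages:

```python
def partition_metrics(conversations):
    """AvgLen, MinLen, MaxLen, ConvCount and UncorrelatedCount for a
    partition C_R, given as a list of conversations (lists of events);
    singleton components are counted as uncorrelated messages."""
    lengths = [len(c) for c in conversations]
    return {
        "ConvCount": len(conversations),
        "MinLen": min(lengths, default=0),
        "MaxLen": max(lengths, default=0),
        "AvgLen": sum(lengths) / len(lengths) if lengths else 0.0,
        "UncorrelatedCount": sum(1 for n in lengths if n == 1),
    }
```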

3.2 Candidate Attribute Selection

All attributes of log L are initially considered as candidates. Then, we apply a number of heuristics to prune obvious non-candidates. Here, we first discuss the heuristics and then explain the attribute selection techniques.

3.2.1 Characteristics of Correlator Attributes

From the correlation patterns identified in Section 2.3, it can be seen that the attributes used in key-based or reference-based correlation are both some sort of "identifier" (code) attributes. While there can be a variety of identifier domains (e.g., string for login names, or integer for customer codes), these attributes share some common aspects, in particular with respect to identifiers that are potentially interesting for correlation purposes:

1. The domain is nominal (string or number). When a number is used, it is typically an integer. This allows us to exclude, for example, attributes containing floating point values. Similarly, free text attributes can be excluded. In our approach, we differentiate free text from string identifiers on the basis of the length of the string and the number of words, by setting thresholds (see Section 5).

2. The number of distinct values is rather large, as it needs to discriminate among many different entities, e.g., orders, invoices, or payments. When the number of values is small (e.g., identifiers of the banks that a company uses for payments), these identifiers are often used in conjunction with other (more discriminating) identifiers to form, for instance, a composite key. Hence, it is reasonable to avoid considering attributes whose domain is too small compared to the dataset size. Furthermore, in general, the number of distinct values is not fixed and depends on the log size (the higher the number of conversations, the higher the number of distinct order IDs).


3. Values are repeated in the dataset. To be useful for correlation purposes, a value has to appear in at least two messages (reference-based correlation). Values that are unique in the dataset may reflect partial conversations (i.e., only the first message has been logged), but if a majority of the values of an attribute appear only once (in this attribute or others), then it is more likely that this attribute is not an identifier useful for correlation.

3.2.2 Attribute Pruning

We instantiated the criteria for correlator identification presented in the previous section in the following ways. First, we eliminate any attribute whose type is floating point or that contains long strings. We have arbitrarily set the threshold for what constitutes a long string to 100 characters. Note that other methods could be used to differentiate more precisely between free text and potential correlators, such as analyzing the content of the text. Alternatively, a manual selection of the attributes could prove very effective and would not be too time consuming. Then, we compute the number of distinct values of each attribute Ai relative to the total number of values (distinct_ratio(Ai) = d/n, where d is the number of distinct values and n denotes the total number of non-null values for this attribute). distinct_ratio(Ai) defines a measure to decide whether an attribute is a non-candidate. In fact, based on the heuristics above, there are two types of attributes that are not interesting:

• Attributes with categorical domains: Their characteristic, compared with free text or identifiers, is to have only a few distinct values. We identify them by setting an arbitrary threshold, called lowerThreshold, that defines a minimum for distinct_ratio.

• Attributes that do not have repeated values: According to the heuristics, the values of candidate correlator attributes have to be repeated somewhere in the dataset, either in the same attribute or in another one. We can expect that most distinct values will appear at least twice, since they correlate at least two messages. In the case of values repeated in the same attribute, distinct_ratio will be well below one. On the other hand, if values repeat in other attributes, then we should observe a large number of tuples with a null (empty) value on that attribute. Indeed, if all values are non-null and they do not repeat, the attribute would correlate messages into conversations of a single message. Hence, we can set an upper threshold for distinct_ratio (called upperThreshold hereafter), combined with a condition on the ratio of null values to the number of non-null values.


We are aware that the above choices for correlator detection may be refined depending on the particular application domain. Overall, we chose a permissive approach that tags as candidate correlators even attributes that are very unlikely to be so. This implementation of the criteria yields a very high recall, at purposely poor precision, to ensure that all likely correlator attributes are included; a sketch of the resulting pruning step follows.
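A sketch of the attribute pruning step (ours; all four threshold values are illustrative stand-ins for lowerThreshold/upperThreshold and the text-length limit discussed above, and would be tuned per dataset):

```python
def candidate_attributes(log, attributes,
                         lower=0.05, upper=0.95,
                         max_text_len=100, max_null_ratio=0.5):
    """Keep only plausible correlator attributes, per the heuristics above."""
    kept = []
    for a in attributes:
        values = [e[a] for e in log if e.get(a) is not None]
        if not values:
            continue
        if any(isinstance(v, float) for v in values):
            continue                    # floating point: not an identifier
        if any(isinstance(v, str) and len(v) > max_text_len for v in values):
            continue                    # long strings: likely free text
        ratio = len(set(values)) / len(values)       # distinct_ratio = d/n
        null_ratio = (len(log) - len(values)) / len(values)
        if ratio < lower:
            continue                    # categorical: too few distinct values
        if ratio > upper and null_ratio < max_null_ratio:
            continue                    # values repeat neither here nor elsewhere
        kept.append(a)
    return kept
```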

3.3 Atomic Rule Discovery

In the previous subsection, we identified the candidate correlator attributes. In this section, we present an approach for discovering atomic candidate correlation rules. The atomic rule discovery process is based on finding the intersections of the distinct values of attribute pairs in L. We then present an approach, based on some general heuristics, to prune non-candidate atomic rules.

3.3.1 Candidate Atomic Rule Generation

As the focus of this paper is on discovering key-based and reference-based rule patterns for conversation reconstruction, as a first step we generate atomic candidate rules of the form R : ex.Ai = ey.Aj, 1 ≤ i ≤ j ≤ k, over event pairs (ex, ey) ∈ L × L. Such rules define equality relationships between pairs of attributes for each event pair (ex, ey). To identify whether a given candidate rule holds, we need to compute the support of each rule. The support is the number of pairs (ex, ey) for which the values of attributes ex.Ai and ey.Aj are equal. A high support implies that the correlation rule holds for a large subset of the dataset and that the equal values found in these two attributes are not a mere coincidence. If the support of a given rule R is nonzero, then there are values that appear in both attributes Ai and Aj in L, i.e., the intersection of the values of Ai and Aj is not empty. So, instead of actually computing the support of the rules, which would be computationally expensive, we compare the sets distinct(Ai) and distinct(Aj) of distinct values of Ai and Aj, respectively. Moreover, distinct(Ai) was already computed for all attributes during the candidate attribute identification phase (see the previous section), so, in our implementation, these sets are computed only once. An initial list of candidate rules is produced from all the pairs (Ai, Aj), 1 ≤ i ≤ j ≤ k, such that |distinct(Ai) ∩ distinct(Aj)| > 0 holds. For example, for the table in Figure 3(b), the discovered rules and their supports are: OID = OID : 2 and OID = Ref : 2.

3.3.2 Atomic Rule Pruning

We use the following criteria to decide if a rule is a non-candidate:


• Identifiers have repeating values: If attributes Ai and Aj define a reference-based correlation (R : ex.Ai = ey.Aj), then it is likely that values in Aj also appear in Ai, and vice versa. If, by contrast, Ai and Aj have only a few distinct values in common and many that differ, then we can conclude that this pair is not correlated and that the relationship between these two attributes is only incidental. It does not form a valid rule and can be removed from the list of candidate correlation rules. To reflect this idea, we define a measure of attribute-pair interestingness, denoted by Ψ(Ai, Aj):

Ψ(Ai, Aj) = |distinct(Ai) ∩ distinct(Aj)| / max(|distinct(Ai)|, |distinct(Aj)|)

Ψ(Ai, Aj) is defined over [0, 1], with 1 (resp. 0) indicating a pair of attributes that share a large number (resp. only a few) of their values. We set a threshold ψ and prune all rules for which Ψ(Ai, Aj) < ψ. Note that this measure is used only for pairs (Ai, Aj) with i ≠ j. Indeed, attributes not pruned in the attribute pruning step (Section 3.2) have already been shown to be candidates for a key-based correlation.

• Rules partition the log in a statistically meaningful way: We apply the general principle that the number of conversations in a log is more than one but less than the number of messages. The decision is based on the ConvCount metric, and rules for which the number of conversations is too high (based on an upper threshold) are pruned. Such rules allow very few messages in the log to be connected; however, they may have a very high number of shared values, and so are not pruned by the previous criterion. This criterion is particularly useful for atomic rules of the form R : ex.Ai = ey.Ai, defined on a single attribute Ai, which in practice may correlate few messages.
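Putting the generation and Ψ-based pruning steps together, a sketch of candidate atomic rule discovery (ours; the ψ threshold of 0.5 is illustrative, and the ConvCount-based pruning would additionally require partitioning the log with each surviving rule):

```python
from itertools import combinations_with_replacement

def candidate_atomic_rules(log, attributes, psi=0.5):
    """Candidate generation (Section 3.3.1) plus Psi pruning (Section 3.3.2)."""
    distinct = {a: {e[a] for e in log if e.get(a) is not None}
                for a in attributes}
    rules = []
    for ai, aj in combinations_with_replacement(attributes, 2):
        common = distinct[ai] & distinct[aj]
        if not common:
            continue                    # zero support: values never meet
        if ai != aj:                    # Psi is used only when i != j
            score = len(common) / max(len(distinct[ai]), len(distinct[aj]))
            if score < psi:
                continue                # shared values likely incidental
        rules.append((ai, aj))          # candidate rule R: ex.Ai = ey.Aj
    return rules
```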

3.4 Composite Rule Discovery

Correlation rules may not always be atomic, but may be composed of several atomic rules to form composite rules (see Section 2.3). In this paper, we examine two types of composition operators: conjunctive (∧), used for representing keys defined on more than one attribute, and disjunctive (∨), used to identify conversations where messages are not all correlated using the same rule. In order to discover all the candidate composite rules, we take as input the candidate atomic rules inferred in the previous section and proceed in the following steps:

1. We build a list of candidate conjunctive rules. This is done by generating the possible combinations of atomic rules and pruning obvious

non-candidate combinations (step 3 of Algorithm 1). After this step, conjunctive rules are treated as atomic rules (step 4 of Algorithm 1).

2. The candidate atomic rules are combined in disjunctions and organized in a lattice structure (step 5 of Algorithm 1).

3.4.1 Conjunctive Rules

As discussed in Section 2.3, the composition of more than one attribute may be needed to correlate the messages of a same conversation. For example, if purchase order numbers are uniquely assigned on a per-customer basis, it is not sufficient to consider only the purchase order number to correctly correlate messages; it is necessary to use both the purchase order number and the customer number to relate a message to the correct conversation (see Figure 2(b)). The operator ∧ defined hereafter is used to express this type of situation. The conjunctive operator defines the intersection of the relations defined by two rules. Consider two correlation rules R1 and R2. The composite rule R1∧2 = R1 ∧ R2 is defined as follows:

(ex, ey) ∈ R1∧2 ⇔ (ex, ey) ∈ R1 ∧ (ex, ey) ∈ R2 ⇔ (ex, ey) ∈ R1 ∩ R2

In this expression, R1 and R2 are both atomic rules, and each expresses that events ex and ey are equal on some attributes. We need to identify all the candidate conjunctive rules. The number of possible conjunctions of r atomic rules is large, i.e., 2^r − 1. In the following, we introduce two techniques to be used in the candidate generation phase, and two techniques for the candidate pruning phase, that allow us to identify non-interesting rules.

3.4.1.1 Identifying non-candidate rules in the candidate generation phase. Here, we introduce one condition and one property that allow us to predict whether a conjunctive rule is interesting before actually computing the relevant partitioning metrics over L. In fact, we are interested in identifying non-interesting rules at this stage.

• Attribute definition constraints: Taken individually, R1 and R2 could each represent a key- or reference-based correlation. Thus, each of the attributes used in R1 and R2 may be undefined for some events (see Figure 3). However, when they are considered together in a conjunction, a new constraint appears: the attributes of R2 have to be defined whenever the attributes of R1 are defined. More formally, a general expression of R1∧2 is of the form


(ex, ey) ∈ R1∧2 ⇔ ex.Ai1 = ey.Aj1 ∧ ex.Ai2 = ey.Aj2.

The above constraint implies that, for any message of the log, attribute Ai1 is defined if and only if Ai2 is also defined, and that Aj1 is defined if and only if Aj2 is also defined. Thus, conjunctions are only valid for rule pairs that satisfy this constraint, i.e., that are defined on the same set of messages. This condition is applied in the conjunctive candidate generation phase.

• Inclusion property: In the graph G, if the set of edges of rule R1 is included in that of rule R2, i.e., R1 ⊂ R2: ∀(ex, ey) ∈ R1 ⇒ (ex, ey) ∈ R2, then we have R1 ∧ R2 = R1. Hence, there is no need to compute R1 ∧ R2. Furthermore, if R1 = R2, i.e., ∀(ex, ey), (ex, ey) ∈ R1 ⇔ (ex, ey) ∈ R2, then R1 and R2 partition the log in the same way, so it is enough to compute the log partitioning for one of them and then extend the result to the other. This also prunes one further rule from the set of rules whose conjunctions with the other rules have to be computed. This condition is applied in the conjunctive candidate generation phase, before the actual computation of the conjunction.

3.4.1.2 Identifying non-candidate rules in the rule pruning phase. For every candidate conjunctive rule, the partitioning metrics are computed. In the candidate pruning phase, we examine the interestingness of the rule using the following criteria:

• Identifiers have repeating values: A conjunctive rule results in a new identifier (for a reference or a key) built from the attributes of each atomic rule. By a criterion similar to those in Section 3.2, we know that identifiers have to appear in more than one message to be valid. For example, the correlation rule R1∧2 defined above is valid only if most of the pairs of values defined on attributes (Ai1, Ai2) also appear, in any order, in at least one other message on attributes (Aj1, Aj2). Here, this is evaluated based on how the rule partitions the log. In other words, rules that create too many partitions (above an upper threshold on the ConvCount metric) are not useful.


• Monotonic property of partitioning metrics with respect to the conjunctive operator: An important property of the partitioning metrics is their monotonicity with respect to the conjunctive operator: applying the conjunction operator to two rules will either leave unchanged or reduce the number of neighbors of a message in the graph, i.e., the number of messages to which it is related. This makes conversations built on conjunctive rules both shorter and more numerous than conversations built on the atomic rules. We have:

MinLen(CR1∧2) ≤ min(MinLen(CR1), MinLen(CR2))
|CR1∧2| ≥ max(|CR1|, |CR2|)

The above means that ConvCount is monotonically non-decreasing, and MinLen monotonically non-increasing, under the conjunctive operator. With the exception of conversations that are only partially logged, most conversations should have a length of two or more; in other words, the number of conversations should be less than half the number of messages. Thus, if a pair of rules does not satisfy this criterion, it is unnecessary to compute their conjunction. Since ConvCount is monotonically non-decreasing under conjunction, it is also not necessary to examine conjunctions of larger size (with more rules) when the component rules themselves do not satisfy this criterion.

The set of candidate conjunctive rules is the set of possible conjunctive rules for which all the above criteria hold. Note that, for all these criteria, if the constraints do not hold for a conjunction of two rules, then they will not hold for conjunctions of a larger number of rules.
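A sketch of how the monotonicity of ConvCount prunes conjunctive candidates before and after evaluation (ours; `partition` is an assumed helper returning the conversations a rule induces, and the tuple ("AND", r1, r2) is an ad-hoc encoding of R1 ∧ R2):

```python
def conjunctive_candidates(atomic_rules, partition, n_messages):
    """Pairwise conjunctions, pruned via monotonicity of ConvCount:
    a rule that already yields more conversations than half the messages
    cannot appear in any interesting conjunction."""
    viable = [r for r in atomic_rules
              if len(partition(r)) <= n_messages / 2]
    accepted = []
    for i in range(len(viable)):
        for j in range(i + 1, len(viable)):
            cand = ("AND", viable[i], viable[j])
            if len(partition(cand)) <= n_messages / 2:
                accepted.append(cand)   # still plausible after conjunction
    return accepted
```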

3.4.2 Disjunctive Rules

Disjunctions of rules are useful for expressing situations where messages are not always correlated the same way throughout a conversation. This is the case, for example, when an invoice message references a purchase order number while a payment message references the invoice number (see Figure 3(a)). To correlate these three messages, we need to define two rules. Given two atomic or conjunctive rules R1 and R2, the disjunctive rule R1∨2 is defined as follows:

(ex, ey) ∈ R1∨2 ⇔ (ex, ey) ∈ R1 ∨ (ex, ey) ∈ R2 ⇔ (ex, ey) ∈ R1 ∪ R2

The discovery of disjunctive rules is also performed in an iterative process comprising two phases: candidate generation and candidate pruning. In

the following, we introduce criteria to decide whether it is interesting to compute metrics for a possible disjunctive candidate, and then to evaluate the interestingness of candidate rules after the partitioning metrics have been computed:

3.4.2.1 Identifying non-candidate rules in the candidate generation phase. We use the following criteria to decide whether to consider a possible combination as a candidate:

• Absorption of conjunction under disjunction: Rules that combine the disjunction and conjunction of the same atomic rule need not be computed: by the absorption law, such a rule can be simplified to one that is already considered. For example, the correlation rule R2 ∨ R2∧3 is equivalent to the correlation rule R2 (the disjunctive and conjunctive operators here correspond to the union and intersection of relations).

• Inclusion property: If relation R1 is included in R2, i.e., R1 ⊂ R2, then we have R1 ∨ R2 = R2. Hence, there is no need to compute R1 ∨ R2. Furthermore, when R1 = R2, as stated before, R1 and R2 partition the log in the same way and R1 ∨ R2 = R1 = R2, so it is enough to compute the log partitioning for one of them and then extend the result to the other.

3.4.2.2 Identifying non-candidate rules in the rule pruning phase. The following criteria are used to assess the interestingness of the candidate rules after the partitioning metrics have been computed:

• Identifiers have repeating values: A disjunctive rule results in a new identifier (for a reference) built from the attributes of each atomic rule. By a criterion similar to those in Section 3.2, we know that identifiers have to appear in more than one message to be valid. Here, this is evaluated based on how the rule partitions the log. A disjunction of rules is said to be valid if it leads to a partitioning of the log into at least two conversations; rules that create too few partitions (below a lower threshold on the ConvCount metric), i.e., that lump most messages of the log together, are not useful.

• Monotonic property of partitioning metrics with respect to the disjunctive operator: The partitioning metrics are also monotonic with respect to the disjunctive operator: the more rules there are in a disjunction, the more each message becomes connected with others and the longer the


Figure 5: Candidate correlation rule lattice (bottom layer: the atomic rules R1, R2, R3 and the conjunctive rule R2∧3; each higher layer contains their disjunctive combinations, e.g., R1 ∨ R2, R2 ∨ R3, R1 ∨ R2∧3, up to R1 ∨ R2 ∨ R3 ∨ R2∧3; the node R2 ∨ R3 is highlighted and invalid nodes are shown in dark gray, as discussed in the text)

conversations in the corresponding partition. We have:

MaxLen(CR1∨2) ≥ max(MaxLen(CR1), MaxLen(CR2))
|CR1∨2| ≤ |CR1| + |CR2|

The above implies that ConvCount is monotonically non-increasing, and MaxLen monotonically non-decreasing, under the disjunctive operator. Since ConvCount is monotonically non-increasing under disjunction, it is not necessary to examine disjunctions comprising more rules when the component rules themselves do not satisfy this criterion.

In theory, if there are r candidate atomic rules, then there are cr = 2^r − (r + 1) possible conjunctive rules. Letting dr denote the number of possible disjunctive rules, we have dr = 2^(cr + r) − (cr + r + 1), as conjunctive rules are considered atomic. However, as we will see in the experiments (Section 5), applying the introduced criteria and heuristics shrinks this search space significantly.

3.4.3 Rule Lattice

When composing both atomic and conjunctive rules together in disjunctive form, we can organize the rules in a lattice structure to facilitate user navigation. An example of a rule lattice is illustrated in Figure 5. The bottom layer of nodes corresponds to the atomic and conjunctive rules identified in Sections 3.3 and 3.4.1. Each higher layer represents a combination of the lower-level rules. Nodes shown in dark gray are nodes that are not computed; they correspond to rules combining one or more lower-level composite rules found invalid. Suppose

that rule R2 ∨ R3 (highlighted in Figure 5) is computed and that the corresponding partitioning CR2∨3 produces a single conversation comprising all the messages of the log. Extending this disjunction with additional rules will always result in the same single conversation, due to the monotonicity property (see Section 3.4.2), and thus it is unnecessary to compute it. From the above example, it can be seen that as soon as a rule is identified as invalid, its invalidity propagates to the upper levels of the lattice. In Section 4, we present guidelines on how to effectively explore the discovered rules and how to rank rules based on different metrics. In Section 5, we present our experiments and, in particular, we discuss the number of composite rules that were actually computed; this number is very low compared to the number of possible combinations.
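The propagation of invalidity up the lattice can be sketched as follows (illustrative code; `partition` is an assumed helper as in the earlier sketches, and a node is a set of base rules read as their disjunction):

```python
def explore_disjunction_lattice(base_rules, partition):
    """Level-wise exploration of the disjunction lattice. Once a node's
    partition collapses to a single conversation, it is invalid, and by
    monotonicity of ConvCount under disjunction it is never extended."""
    valid = []
    level = [frozenset([r]) for r in base_rules]
    seen = set(level)
    while level:
        next_level = []
        for node in level:
            if len(partition(node)) <= 1:
                continue                      # invalid: do not extend
            valid.append(node)
            for r in base_rules:              # extend with one more rule
                cand = node | {r}
                if len(cand) > len(node) and cand not in seen:
                    seen.add(cand)
                    next_level.append(cand)
        level = next_level
    return valid
```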

4 User Driven Exploration and Refinement of Correlation Rules

In this section, we present the front-end component of the system in two parts: correlation rule summarization and the refinement interface (see Figure 1).

4.1 Correlation Rule Summary

The discovered rules can be structured in a lattice, as discussed in Section 3.4.3, which facilitates navigation among candidate rules. To help users identify interesting correlation rules, each node is displayed with a summary that describes the properties of its corresponding correlation rule. The summary comprises the following information:

• Metrics: These correspond to the aggregate values defined in Section 3.1. We also display a histogram that shows the distribution of conversation lengths.

• Description: The rule definition is displayed in a notation similar to that used in this paper (the difference is that we only display the attribute names). When attribute names carry some semantics, these rules are interesting in themselves and help in understanding the nature of the conversations discovered. This proved particularly useful in the case of the game dataset to differentiate among the 57 candidate rules.

4.2 Ranking Discovered Rules

In general, partitions with fewer uncorrelated messages (UncorrelatedCount) and a higher average conversation length (AvgLen) are preferred. This is because partitions with a relatively small average length do


not correlate messages successfully. Hence, our tool uses a simple formula to compute a trade-off between the AvgLen and UncorrelatedCount of the different partitions and to rank the rules. In our experiments, this approach proved effective in recommending the interesting rules within the first few ranks. However, it is possible to rank rules based on other metrics.
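One plausible instantiation of such a trade-off score is sketched below; the paper does not spell out its exact formula, so the scoring function here is our own assumption, not the tool's actual ranking:

```python
def rank_rules(metrics, log_size):
    """Rank rules by a trade-off between AvgLen and UncorrelatedCount.
    `metrics` maps each rule to its partition metrics; the scoring
    formula is an illustrative choice, not the paper's."""
    def score(m):
        correlated_fraction = 1.0 - m["UncorrelatedCount"] / log_size
        return m["AvgLen"] * correlated_fraction
    return sorted(metrics, key=lambda rule: score(metrics[rule]), reverse=True)
```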

4.3 Refinement Interface

We propose two refinement operations that allow users to further expand or reduce the search space:

• Users can propose new rules to be examined. This operation is handled by adding a new atomic rule to the set of existing atomic rules. Adding this atomic rule automatically updates the lattice to generate additional nodes corresponding to the compositions of this rule with the other atomic rules.

• Users can propose to remove a rule from the lattice. This operation updates the lattice to remove the composite rules built by composition with this rule.

4.4 User Domain Knowledge

We observe that the user may have domain knowledge that helps in reducing the space of possible correlations, i.e., in not computing some nodes of the lattice, or in identifying the rules that are interesting for the user. There are various types of user knowledge, among which the most general ones include:

• The average number of conversations per day. For example, the user may be able to tell how many purchase orders are lodged per day.

• The average/minimum/maximum number of messages per conversation, if available.

• An estimate of the number of conversations in the given log. This and the two items above help in deciding on the interestingness of a discovered rule.

• Attributes that potentially are (or are not) used for correlation. This is helpful for the attribute pruning step.

• The rule pattern that is used for correlating the log.

• The user may have a log with millions of records. However, parsing all events in the system beyond a certain number of events may not lead to the discovery of further rules. For the sake of processing time, users can

suggest using only part of the log, for instance all messages within a certain period of time, e.g., 3 months, if it is assumed that this period contains enough conversations and that the longest possible conversation does not span more than 3 months.

Our framework allows the above domain knowledge to be incorporated as part of the heuristics used during discovery. In particular, if the user knows the correlation rule pattern that is used for correlating the messages in the log, this has the following implications for the discovery approach:

Key-based pattern. This knowledge helps in two ways. First, we can define an additional heuristic that helps in pruning non-candidate attributes: a key-based attribute has a non-null value in every message of the log. So, among the candidate attributes with a proper set of distinct values, we only consider those whose number of values is equal to the number of messages in the log. This reduces the candidate space significantly. As discussed in Section 2.3, a key is either simple or composite; discovered atomic rules are the candidates for simple keys. As the second implication of this knowledge, for discovering composite key attributes we only need to apply the conjunctive operator (∧) to atomic rules. This means that the search space for composite rules is significantly reduced: if there are r atomic candidate rules, then the number of possible composite (conjunctive) rules is cr = 2^r − (r + 1). Note that cr ≪ dr (see Section 3.4.2).

Reference-based pattern. Knowing that the log is correlated using a reference-based pattern helps in further pruning non-candidate attributes, based on the heuristic that key-based attributes have a number of values equal to the number of messages in the log (non-null values in all messages): we can then also prune the attributes that are candidates for key-based correlation. The method for computing composite rules, however, is the same, as it is possible to use both conjunctive and disjunctive rules. Conjunctive rules allow for discovering composite reference attributes, and disjunctive rules are required to discover composite correlation rules that allow each pair of messages in a conversation to be correlated using a different rule.

5 Implementation and Experiments

We have implemented the correlation discovery approach presented in this paper in a prototype tool, as part of a larger project called ServiceMosaic, a framework for the analysis and management of Web service interactions. The tool is implemented in Java 5.0 as a set of plug-ins for the Eclipse framework, with PostgreSQL 8.2 as the database engine. All the experiments have been performed on a notebook with a 2.3 GHz Core Duo CPU and 2 GB of memory.

Table 1: Characteristics of the datasets

Dataset        # of message types   # of events   # of attributes
Retailer       10                   2,800         15
Robostrike     32                   4,776,330     98
PurchaseNode   26                   50,806        26

5.1 Datasets

We have conducted the experiments on both synthetic and real-world datasets, described next.

5.1.1 Synthetic Dataset

We used the interaction log of a Retailer service, developed based on the scenario of WS-I (the Web Service Interoperability organization, www.ws-i.org). In this dataset, the log is collected using a real-world commercial logging system for Web services (HP SOA Manager, managementsoftware.hp.com/products/soa/). The Retailer service is implemented in Java and uses Apache Axis as its SOAP implementation engine and Apache Tomcat as its Web application server. Table 1 shows the characteristics of this dataset. The log has 2,800 tuples, each corresponding to an operation invocation. These messages form 480 conversations and are correlated using a key-based approach. HP SOA Manager records information organized in 13 attributes; examples are SequenceID, ServiceName, Operation, Timestamp, RequestSize, and ResponseSize. In addition, we extracted two further attributes from the body of the XML messages, namely RequestID and ResponseID. In fact, both RequestID and ResponseID contain the same value, and each value is a conversation identifier.

5.1.2 Real-world Datasets

We used two real-world datasets: the interaction log of a multi-player online game service called Robostrike (www.robostrike.com), and a workflow management system log corresponding to a purchase order management system, called PurchaseNode. The Robostrike service exposes 32 operations invoked by sending/receiving XML messages. Table 1 presents the statistics of this dataset. The log contains more than 4,700,000 events. In a pre-processing stage, we extracted all the attributes of all messages. To this end, the XML schema was flattened and each element was given a distinct element name. The XML schema of Robostrike is presented in Appendix C.

The PurchaseNode dataset was originally organized into two tables: one for the workflow definitions and the other for the workflow instances (execution data). The workflow definition table defines 14 workflows using different combinations of the 26 workflow tasks. For this experiment, we joined the two tables on the workflow identifier to produce a log of 50,806 records. By using this dataset we also test the applicability of the approach to workflow data.

5.1.3 Preprocessing Step

The main pre-processing step is the processing of the XML messages in service logs to extract data elements (attributes) and to prepare the relational table required as the input of the algorithm. The conversion of XML schemas to relational schemas is a separate thread of research [16], and we rely on this research for such conversions in general. In our implementation, however, we used two simplistic but systematic approaches for extracting data elements from the XML structure: (i) structure-ignorant naming, and (ii) structure-preserving naming. It should be noted that a risk of information loss exists with either approach, as XML schemas are richer than relational schemas; this is also the main challenge faced by XML-to-relational schema conversion approaches.
• Structure-ignorant naming: all the structure of the XML is ignored, i.e., the XML schema of each message is flattened and only data elements (including element attributes) are extracted and considered as attributes of table L. This approach is appropriate for applications in which the same element/attribute name is used with the same meaning throughout the whole schema. For instance, the element "customerID" refers to the same data element, i.e., the customer identifier, in any place of the schema. In this case, the corresponding name of this element in table L is also "customerID".
• Structure-preserving naming: the names of the enclosing XML constructs are considered part of the attribute name. For example, if there is a complex XML construct called "Order" with an element called "customerID", then the corresponding attribute name in L is "Order_customerID", which is treated differently from another attribute called "Invoice_customerID". This method of naming attributes is appropriate for XML schemas that do not use the same element/attribute name for the same real-world entity, or for cases in which the naming is ambiguous. For example, both the "Order" and the "Invoice" construct may have an element called "Number"; it is then desirable to distinguish between "Order_Number" and "Invoice_Number".
In our implementation, it is possible to specify how many levels of enclosing constructs are considered in the naming, as an input of the procedure. In reality, a combination of these approaches may be required, as some of the semantics involved in the naming of attributes may not be addressed properly by either one. In our experiments, for both Robostrike and Retailer we used the structure-ignorant naming method, as it was an appropriate choice given the semantics of the attribute names. In any case, our approach allows for manually editing the result of the XML conversion, as the input of the algorithm is table L, and in general we assume that preprocessing is performed correctly.
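To make the two naming schemes concrete, the following minimal sketch flattens a DOM message into (attribute, value) pairs: a prefix depth of 0 yields structure-ignorant names (e.g., "customerID"), while a depth of 1 or more yields structure-preserving names (e.g., "Order_customerID"). Class and method names are illustrative, not the actual implementation; XML attributes of elements, which we also extract, would be handled analogously.

import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import java.util.*;

public class XmlFlattener {

    /** Flattens a message into (attributeName, value) pairs.
     *  prefixLevels = 0 gives structure-ignorant naming;
     *  prefixLevels >= 1 gives structure-preserving naming. */
    public static Map<String, String> flatten(Element root, int prefixLevels) {
        Map<String, String> out = new LinkedHashMap<String, String>();
        walk(root, new LinkedList<String>(), prefixLevels, out);
        return out;
    }

    private static void walk(Element e, LinkedList<String> path, int levels,
                             Map<String, String> out) {
        path.addLast(e.getTagName());
        NodeList children = e.getChildNodes();
        boolean leaf = true;
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                leaf = false;
                walk((Element) n, path, levels, out);
            }
        }
        if (leaf) {
            // Keep at most 'levels' enclosing construct names as a prefix.
            int from = Math.max(0, path.size() - 1 - levels);
            StringBuilder name = new StringBuilder();
            for (int i = from; i < path.size(); i++) {
                if (name.length() > 0) name.append('_');
                name.append(path.get(i));
            }
            out.put(name.toString(), e.getTextContent().trim());
        }
        path.removeLast();
    }
}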

5.2 Automated Correlation Rule Discovery

We ran experiments to validate the performance of the proposed approach on large datasets. We report the results of each of the three steps of the correlation discovery process separately. The details of the results are presented in Appendix A; the contribution of each criterion to pruning the search space is also evaluated and reported there. Table 2 summarizes the outcome of the whole process on the three datasets.

Table 2: The performance of the automated correlation discovery process

Dataset        Initial # of attributes   # of attributes after pruning   # of atomic rules   # of atomic rules after pruning   # of discovered composite rules
Retailer       15                        2                               3                   3                                 4
Robostrike     98                        13                              31                  8                                 57
PurchaseNode   26                        3                               6                   4                                 3

5.2.1 Candidate Attribute Selection

In this step, we compute the distinct ratio defined in Section 3.2.2. On the basis of a LowerThreshold (resp. UpperThreshold) on the distinct ratio, we select attributes that have neither too few nor too many distinct values (see Section 3.2.2). Our purpose in this experiment is twofold: (i) to validate whether this heuristic is effective in identifying non-correlator attributes even with a pessimistic setting of the thresholds, and (ii) to measure the time needed to compute this ratio for all attributes. Table 2 shows the summary of the results of applying this attribute selection approach to the datasets. In this setting, we used upperThreshold = 0.9 and a lowerThreshold such that the number of distinct values is at least 15 (i.e., we assume the dataset contains at least 15 conversations). This setting is designed to prune only attributes that are obvious non-candidates, since the other steps of the process can further, and more precisely (i.e., without requiring threshold settings), identify attributes that are not correlators.

For PurchaseNode, out of 26 attributes, 21 attributes were pruned on lowerThreshold and one attribute on upperThreshold; only 3 attributes remained. For the Robostrike dataset, 13 attributes out of 98 were selected. Finally, from the Retailer dataset, only 2 attributes out of 15 were selected, namely RequestID and ResponseID. The computation is performed with simple SQL queries. For one million records taken from the Robostrike dataset, computing the distinct ratio for all 98 attributes took 19 minutes. Note that we performed this evaluation on a table where each attribute is indexed.
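As an illustration, here is a minimal sketch of this computation over an indexed log table (assumed to be named "log"; all identifiers are illustrative rather than the actual ServiceMosaic code):

import java.sql.*;
import java.util.*;

public class CandidateAttributeSelection {

    /** Keeps the attributes whose distinct ratio lies between the two bounds. */
    public static List<String> select(Connection con, List<String> attributes,
                                      long minDistinct, double upperThreshold)
            throws SQLException {
        List<String> candidates = new ArrayList<String>();
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM log");
        rs.next();
        long total = rs.getLong(1);
        rs.close();
        for (String attr : attributes) {
            // One query per attribute; with an index on attr this is fast.
            rs = st.executeQuery("SELECT COUNT(DISTINCT " + attr + ") FROM log");
            rs.next();
            long distinct = rs.getLong(1);
            rs.close();
            // Prune near-constant attributes (too few distinct values) and
            // near-unique ones such as timestamps (ratio above the threshold).
            if (distinct >= minDistinct
                    && (double) distinct / total <= upperThreshold) {
                candidates.add(attr);
            }
        }
        st.close();
        return candidates;
    }
}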

5.2.2 Atomic Rule Discovery

The approach presented in Section 3.3.2 for rule inference is based on identifying the number of shared values between the distinct values of different pairs of attributes. The execution time for computing the intersection of attributes is rather small. For example, for the Robostrike dataset and 1,000,000 records, the execution of this step takes slightly more than 400 seconds. Not surprisingly, as this computation is based on standard database queries, this time increases linearly with the dataset size. For the PurchaseNode dataset, out of 6 candidate atomic rules, 4 remain after pruning. In the case of the Robostrike dataset, 31 rules were discovered, out of which 8 were identified as candidate atomic correlation rules. Finally, for the Retailer dataset, all three candidate atomic rules qualified. These results are summarized in Table 2. It should be noted that for the computation of atomic rules in all these experiments, the entire logs were used (e.g., 4 million events for the Robostrike dataset).
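A minimal sketch of the shared-values query for one pair of attributes, under the same assumptions as the previous sketch (PostgreSQL-style SQL issued through JDBC; identifiers are illustrative):

import java.sql.*;

public class SharedValues {

    /** Counts the distinct values common to attributes a and b; this count
     *  drives atomic rule inference (e.g., requestid = responseid). */
    public static long count(Connection con, String a, String b)
            throws SQLException {
        String sql = "SELECT COUNT(*) FROM ("
                   + " SELECT DISTINCT " + a + " FROM log WHERE " + a + " IS NOT NULL"
                   + " INTERSECT"
                   + " SELECT DISTINCT " + b + " FROM log WHERE " + b + " IS NOT NULL"
                   + ") AS shared";
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery(sql);
        rs.next();
        long shared = rs.getLong(1);
        rs.close();
        st.close();
        return shared;
    }
}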

5.2.3 Composite Rule Discovery

Execution time. Figure 6 shows the execution time of the algorithm for computing the composite rules from the atomic rules in the three datasets, for various dataset sizes. It takes just under 10 minutes to compute the composite rules for 5,000 events of the Robostrike dataset. For the Retailer dataset (2,800 events), this step takes 5 seconds.

The impact of dataset size on discovered rules. It is interesting to note that the dataset size does not notably impact the number of composite rules discovered. For the Robostrike dataset, from the 8 atomic rules discovered in the previous step, the theoretical number of rule combinations was 2^8 (considering only disjunctions). After completion of the discovery process, this number was only 57. We then performed a similar computation with 500,000 and with 1,000,000 events of the same Robostrike dataset. This tenfold increase in size does not significantly impact the number of correlation rules discovered: the results were 58 and 59 rules respectively, among which 57 are the same as the 57 rules discovered with 5,000 events.

[Figure 6: The execution time of composite rule discovery (time in seconds vs. number of events, 1,000 to 5,000, for the Robostrike and PurchaseNode datasets)]

The quality of discovered rules. In any knowledge discovery task, interpreting the discovered rules and assessing the interestingness of the final results requires the judgment of the user [9]. In these experiments, we therefore presented the results to users knowledgeable about the datasets. In the case of the Retailer dataset, which is synthetic and follows a key-based correlation, all discovered atomic rules are keys. There are two attributes that could potentially be used as keys in this dataset, and the only three discovered rules are based on these two attributes (see Section A.1 in Appendix A). For PurchaseNode, the rule discovery process correctly identified FlowInstanceID as a key attribute. This key, however, is only one of the 10 possible correlations (see Section A.2 in Appendix A). The results of the correlation discovery are more interesting on the Robostrike dataset (see Section A.3 in Appendix A). The various rules discovered correspond to different levels of analysis of the game. For instance, it is possible to correlate as a conversation a single game (based on convid), or a long session that comprises several games (based on logn). Another correlation is possible at the user level, regardless of the session (uid and usid). Finally, the algorithm identified a surprising correlation based on chat conversations among groups of players. These correlations mix key-based and reference-based correlation: for instance, messages during a game refer to the game number, while messages regarding a single player are of key type, based on her user id (usid = cntd ∨ usid = usid).

Table 3: The order in which criteria are checked in the candidate generation and pruning phases

Phase                  Conjunctive Rules                   Disjunctive Rules
Candidate Generation   Attribute definition constraints;   Associativity;
                       Rule inclusion                      Rule inclusion
Candidate Pruning      Upper threshold (on CC#);           Lower threshold (on CC#);
                       Monotonicity w.r.t. ∧               Monotonicity w.r.t. ∨

Optimization techniques. We also used a number of techniques to make the computation more efficient. For example, to test the inclusion relationship between two rules R1 and R2, we first compare their numbers of connected components and test the inclusion of the rule with the smaller number of connected components in the other (e.g., R1 in R2). In addition, instead of checking that all the connected components of R1 are included in those of R2, we check the negation of this relationship: as soon as we find a connected component of R1 that is not included in the corresponding connected component of R2, we conclude that the inclusion relationship does not hold.

Data sampling. The running time of the algorithm depends polynomially on the size of the database. One possibility for lowering this cost is to use only a sample of the data for correlation. The challenge, however, is to decide how much data is sufficient for correlation purposes. In the case of Robostrike, as it is a game service, players usually play within the same day or, at most, if playing over midnight, across two consecutive days. Therefore, in our experiments (on the simplified Robostrike dataset used in Appendix A) we assume it is sufficient to take the traffic of two days. As discussed above, such a decision does not change the discovered rules, but it significantly impacts the running time of the algorithm.

The contribution of different criteria in pruning the search space. While searching the solution space for possible correlation rules, one of the main goals of the discovery algorithm is to prune the search space so as not to compute non-interesting rules. This is done in both the candidate generation and the candidate pruning phases. However, we prefer to predict the interestingness of each rule in the candidate generation phase, to avoid computing non-interesting rules at all, rather than pruning them afterwards based on metrics computed for them. Table 3 presents the order in which the pruning criteria are checked in each phase, and Figure 7 compares the effectiveness of the different criteria in pruning the search space (for details on the effectiveness of each criterion see Appendix A). The ranking is based on the average performance of each criterion across the three datasets. The comparison shows that for conjunctive rules the contributions rank as follows (from most to least effective): attribute definition constraints (93%), inclusion (5.5%), upper threshold (1%), and finally monotonicity (0.5%); for disjunctive rules, the ranking is: associativity (40%), inclusion (30%), lower threshold (28%), and monotonicity (2%).
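To illustrate the early-exit inclusion test described under optimization techniques above, the following minimal sketch models each rule's result as its list of connected components (conversations), i.e., sets of message identifiers; all names are illustrative. Callers first compare the component counts of the two rules to decide in which direction to test inclusion.

import java.util.*;

public class RuleInclusion {

    /** Returns true iff every component of r1 is contained in some component of r2. */
    public static boolean isIncludedIn(List<Set<Long>> r1, List<Set<Long>> r2) {
        // Index r2's components by member id to find, in O(1), the only
        // component of r2 that could possibly enclose a given component of r1.
        Map<Long, Set<Long>> componentOf = new HashMap<Long, Set<Long>>();
        for (Set<Long> c2 : r2)
            for (Long m : c2) componentOf.put(m, c2);
        for (Set<Long> c1 : r1) {
            Set<Long> c2 = componentOf.get(c1.iterator().next());
            // Early exit: the first component of r1 not fully contained in its
            // corresponding component of r2 disproves the inclusion.
            if (c2 == null || !c2.containsAll(c1)) return false;
        }
        return true;
    }
}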

[Figure 7: Comparison of the contribution of criteria in pruning the search space, for conjunctive rules (attribute definition constraints, inclusion, upper threshold, monotonicity) and disjunctive rules (associativity, inclusion, lower threshold, monotonicity)]

These results reveal that most of the pruning is performed in the candidate generation phase. They also imply two important points: (i) the algorithm is efficient in discovering correlation rules, as it prunes most non-candidate rules in the candidate generation phase; (ii) the algorithm does not depend heavily on heuristics that require setting thresholds. Thresholds are only used in places where the certainty about a rule being a non-candidate is very high.

6 Related Work

There is a large body of related work in the database area. In this section, we discuss how our work differs from this research.

Functional dependencies and composite keys. The problem of correlation dependency analysis is related to that of discovering functional dependencies. In functional dependency inference [15, 22], the problem is to identify properties of the form A → B, where A and B are sets of attributes of a relation, that hold for all or a well-defined subset of the tuples of this relation. Discovering a functional dependency is interesting as it reveals an implicit constraint over the attribute values of tuples that is not expressed in the schema. The main difference between this problem and correlation discovery is that correlation is not expressed over the tuples of a database but among tuples, at the conversation level, which is not known a priori. Messages of a log may include valid functional dependencies, but these dependencies do not reveal anything about the conversations. Indeed, each group of messages that forms a conversation exhibits a very strong dependency; however, this dependency only materializes after the conversations are known, i.e., after the correlation rule is discovered. The problem of discovering (composite) keys in relational databases [23] is related to that of message correlation: both seek to identify a collection of attributes that identify individual instances. However, two major differences

make the solutions to the former unsuitable for the latter. First, the validity of candidate composite keys can be assessed objectively since, by definition, each key value shall identify a unique instance. By contrast, there are only heuristics, but no definite criteria, to decide on the validity of a correlation of messages. Second, keys are only one of the various rule patterns (see Section 2.3) that may be used for the correlation of messages.

Association rule mining. Association rule mining techniques identify values that co-occur frequently in a data set [13]. Such rules identify, for example, the likelihood that a customer who bought product A also purchases product B, or which other products a customer is likely to buy, given that she bought product D. Although such rules have links to the notions of correlation and implication (causality), some correlation patterns cannot be identified using these approaches. Consider the case of reference-based correlation, where a pair of attributes is used to link messages in a chain. For these two attributes, which may be defined for all tuples, values never appear concurrently in the same event. Moreover, over the whole database, each value occurs only twice, once in each of the two messages it links, and hence is not frequent. Another intuition is that for an item set to be frequent, the collection of possible values has to be defined on a finite, and relatively small, domain. In the case of correlation rules, we are looking for attributes that act as identifiers, i.e., whose values repeat at most within the messages of a single conversation (unless recycled).

Classification and clustering. Conversations could be seen as classes, and the correlation problem could then be thought of as a classification problem [1]. However, classification approaches assume a fixed (and rather small) number of classes, while conversations come in an unbounded and unknown number that depends mainly on the size of the log considered. Moreover, classification approaches rely on pre-classified instances to infer the classification function. In our work, conversations to be used as training examples are not available. Note that inferring the correlation rules from a collection of conversation instances, if available, would be an interesting and complementary problem to explore. One might also argue that correlation could be formulated as a clustering problem [14]. In clustering, the relationship between the members of a cluster is their relative proximity according to some distance measure. However, the messages of a same conversation may be very different (e.g., a purchase order and a payment message), while the messages of two distinct conversations may be very similar (e.g., two purchase orders for the same products). Hence, clustering approaches, as well as other similarity-based approaches such as record linkage [8], could only be used provided a suitable and very ad-hoc distance measure has first been defined. Defining this measure is equivalent to identifying how to correlate messages, which is precisely the purpose of this paper. We have performed some experiments addressing the correlation problem as a clustering problem that support this claim; the results are presented in Appendix B.

Session reconstruction. Web usage mining has raised the problem of session reconstruction. A session represents all the activities of a user on a Web site during a single visit. There, however, the problem is not to identify how to correlate events in the log (this is typically done using the IP address) but how to decide on the boundaries of the session, i.e., after how long a period of inactivity one should consider that the same IP corresponds to a different user, or to the same user with discontinued activity. In many Web services, using other types of information, e.g., identifiers, is more common for message correlation, or at least time is used in conjunction with other information. For instance, the WS-CDL standard proposal supports the usage of time together with identifier-based correlation of messages in Web services.

Application dependency. Correlation is often cited in the context of dependency discovery, where the task is to identify whether some events may depend on others. However, correlation in that context bears a different meaning than the one intended in this paper: it refers to a temporal dependency between events where, for example, an event is a cause that triggers one or more subsequent events. Examples of approaches in this category include statistical approaches to numerical time series correlation [17] and event correlation for root cause analysis [24]. In this paper, correlation means grouping messages that belong to the same conversation. Although there might be a causal dependency between these messages, it is only incidental.

Message correlation in Web services. Correlation patterns in service workflows are studied in [5]. However, no automated support for message correlation is proposed there. The work on service workflow discovery also characterizes the problem of time-based session reconstruction for Web services [7] and examines the possibility of adapting session reconstruction approaches from Web usage mining to this problem; however, proposing an automated approach remains future work. The need for automated approaches to message correlation was first recognized in [19], where a real-world situation on how to correlate messages is presented.

7 Conclusions and Future Work

In this paper, we have presented concepts, a framework, and a set of tools for discovering correlations in service logs, in order to reconstruct conversations. We introduced the notion of correlation dependency among messages in relational logs, where individual events correspond to the tuples of a relation, and provided a framework to define, discover, and heuristically explore possible correlation rules. The framework can be extended in many directions; these extensions make it possible to adapt the correlation discovery framework to various contexts.


We highlight below some future research directions.

Rule language. A lot of work has been done in the record linkage community to uncover relationships among seemingly separate entities, and users might be interested in discovering non-trivial correlation properties within collections of logs. The rule language used in this article is one of the most basic and can be extended in several directions to handle such situations. For instance, the equality between values can be replaced by a similarity function. Such rules would replace the current graph structure with a weighted graph, and the problem of identifying disconnected components would be replaced by that of identifying weakly connected components, i.e., components such that the aggregate of their relationship weights is below some threshold.

Heuristics. Depending on the context, heuristics can be modified to either enlarge or restrict the search space. For instance, in this paper, we limited the search space by taking into account the order relationship between log messages. In the context of Web service choreographies, several logs may be involved and will first need to be integrated and then correlated. During log integration, it would be very difficult to keep a satisfactory ordering of the messages, since separate logs may have separate clocks. Richer relations can be used to replace ≺ and ≈ in order to reflect whether messages belong to the same or a different original log.

Time-based rule patterns. Discovering rule patterns based on the values of a time attribute is the subject of future work. One might need to define functions over the time attributes of events in order to create log partitions. This way, events that share the same values of an attribute but are very far apart in time (e.g., farther than a user-provided threshold or a threshold discovered from the body of messages) cannot be considered members of the same partition. Finally, one could also consider the partitioning of a log as a post-processing step, where the events of each partition have a time difference larger than the average time difference of events across all partitions of the log.

Imperfect data. Service log events may contain imperfections [20], e.g., incomplete conversations and noisy data, in that there are missing events and/or missing event attribute values. Such imperfections may make the message correlation problem harder. We are planning to investigate extensions of our approach to tackle the problem of imperfect logs.

8 Acknowledgments

The authors would like to thank Pierre Buraud for his help in the implementation of the approach presented in this paper.


References

[1] R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, and A. Swami. An interval classifier for database mining applications. In L.-Y. Yuan, editor, Proceedings of the 18th International Conference on Very Large Databases, pages 560–573, San Francisco, U.S.A., 1992. Morgan Kaufmann Publishers.
[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pages 307–328. AAAI/MIT Press, 1996.
[3] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. Data Structures and Algorithms. Addison-Wesley, 1983.
[4] P. Andritsos, R. J. Miller, and P. Tsaparas. Information-theoretic tools for mining database structure from large data sets. In SIGMOD Conference, pages 731–742, 2004.
[5] A. Barros, G. Decker, M. Dumas, and F. Weber. Correlation patterns in service-oriented architectures. In Proceedings of the 9th International Conference on Fundamental Approaches to Software Engineering (FASE). Springer Verlag, March 2007.
[6] X. Dong and A. Y. Halevy. A platform for personal information management and integration. In The Conference on Innovative Data Systems Research (CIDR), 2005.
[7] S. Dustdar and R. Gombotz. Discovering web service workflows using web services interaction mining. International Journal of Business Process Integration and Management (IJBPIM), Forthcoming.
[8] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1–16, 2007.
[9] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From data mining to knowledge discovery: an overview. In Advances in Knowledge Discovery and Data Mining, pages 1–34. AAAI/MIT Press, 1996.
[10] M. Franklin, A. Halevy, and D. Maier. From databases to dataspaces: a new abstraction for information management. SIGMOD Record, 34(4):27–33, 2005.
[11] D. Grigori, F. Casati, U. Dayal, and M.-C. Shan. Improving business process quality through exception understanding, prediction, and prevention. In VLDB, pages 159–168, 2001.
[12] Gartner Group. Gartner EXP report. www.gartner.com/press_releases/asset_143678_11.html, 2006.
[13] J. Hipp, U. Güntzer, and G. Nakhaeizadeh. Algorithms for association rule mining - a general survey and comparison. SIGKDD Explorations, 2(1):58–64, 2000.
[14] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, 1988.
[15] J. Kivinen and H. Mannila. Approximate inference of functional dependencies from relations. In ICDT '92: Selected Papers of the Fourth International Conference on Database Theory, pages 129–149, Amsterdam, The Netherlands, 1995. Elsevier Science Publishers B. V.
[16] D. Lee, M. Mani, and W. W. Chu. Schema conversion methods between XML and relational models. In Knowledge Transformation for the Semantic Web, pages 1–17. 2003.
[17] H. Mannila and D. Rusakov. Decomposition of event sequences into independent components. In Proc. of the First SIAM International Conference on Data Mining (SDM), January 2001.
[18] H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3):241–258, 1997.
[19] H. Motahari, B. Benatallah, and R. Saint-Paul. Protocol discovery from imperfect service interaction data. In Proceedings of the VLDB 2006 Ph.D. Workshop. ceur-ws.org/Vol-170, September 2006.
[20] H. Motahari, R. Saint-Paul, B. Benatallah, and F. Casati. Protocol discovery from web service interaction logs. In ICDE '07: Proceedings of the IEEE International Conference on Data Engineering. IEEE Computer Society, April 2007.
[21] W. De Pauw et al. Web services navigator: Visualizing the execution of web services. IBM Systems Journal, 44(4):821–845, 2005.
[22] J.-M. Petit, F. Toumani, J.-F. Boulicaut, and J. Kouloumdjian. Towards the reverse engineering of denormalized relational databases. In ICDE '96: Proceedings of the Twelfth International Conference on Data Engineering, pages 218–227, Washington, DC, USA, 1996. IEEE Computer Society.
[23] Y. Sismanis, P. Brown, P. J. Haas, and B. Reinwald. Gordian: Efficient and scalable discovery of composite keys. In VLDB, pages 691–702, 2006.
[24] M. Steinle, K. Aberer, S. Girdzijauskas, and C. Lovis. Mapping moving landscapes by mining mountains of logs: Novel techniques for dependency model generation. In VLDB, pages 1093–1102, 2006.
[25] W. van der Aalst et al. Workflow mining: a survey of issues and approaches. Data & Knowledge Engineering, 47(2):237–267, 2003.

A The Details of Experiments

This section provides more details about the experiments summarized in Section 5 of the paper. The presentation is organized by dataset.

A.1 Retailer

[Figure 8: Extract from Retailer's data]

A.1.1 The attributes

A.1.1.1 All attributes

The number of attributes = 15

Attribute       Distinct values   Description
sequenceid      2675              String
time_stamp      2548              Date
requestid       480               Integer
responseid      480               Integer
duration        26                Integer
requestsize     14                Integer
responsesize    14                Integer
operation       5                 String
servicename     1                 String
serviceversion  1                 URL
porttype        1                 String
binding         1                 String
senderuri       1                 URL
receiveruri     1                 URL
errortype       1                 String

A.1.1.2 Pruning

The thresholds:

Threshold type   Value   Attributes pruned
Lower            0.01    11
Upper            0.9     2

A.1.1.3 Attributes after pruning

Attribute    Distinct values   Description
requestid    480               Integer
responseid   480               Integer

A.1.2 Atomic rules

A.1.2.1 All atomic rules

Rule                    Shared values
requestid=requestid     480
requestid=responseid    480
responseid=responseid   480

A.1.2.2 Pruning

Type                                         Value   Rules pruned
Lower threshold (applied on shared values)   0.01    0
Upper threshold (applied on CC#)             0.9     0

A.1.2.3 The rules after pruning

Symbol   Rule                    CC#*   Avg    Max   Min   Median   IQM**
R1       requestid=requestid     480    5.57   14    1     5        5.26
R2       requestid=responseid    480    5.57   14    1     5        5.26
R3       responseid=responseid   480    5.57   14    1     5        5.26

*CC# means Number of Connected Components = Number of Conversations
**IQM means Interquartile Mean

A.1.2.4 Rule inclusion

Relationship: R1 ⊂ R2, R1 ⊂ R3, R2 ⊂ R3

A.1.3 Composite rules

A.1.3.1 Conjunctive

Criteria (in the order in which they are applied):

Criterion                                   Level 1
Attribute definition constraints            0
Inclusion                                   3
Upper threshold (applied on CC#)            0
Monotonicity with respect to conjunction    0

No composite rule was computed.

A.2 PurchaseNode

[Figure 9: Extract from PurchaseNode's data]

A.2.1 The attributes

A.2.1.1 All attributes

The number of attributes = 26

Attribute                  Distinct values   Description
nodeinstance_id            34307             String
starttimelongmillis        14790             Integer
endtimelongmillis          14464             Integer
flowinstance_id            7416              String
node_id                    62                String
lastrateupdatelongmillis   50                Integer
totalcount                 31                Integer
samplecount                27                Integer
avrgtime                   25                Integer
totaltime                  25                Integer
nodename                   23                String
x_pos                      22                Integer
flow_id                    14                String
y_pos                      11                Integer
nodedescription            9                 String
totalweight                6                 Integer
endtime                    5                 Date
starttime                  5                 Date
lastrateupdate             4                 Date
activecount                3                 Integer
status                     3                 String
nodetype                   3                 String
activeweight               2                 Integer
instancerate               1                 Integer
weightrate                 1                 Integer
resourcestatus             1                 String

A.2.1.2 Pruning

The thresholds:

Threshold type   Value   Attributes pruned
Lower            0.01    22
Upper            0.9     0

A.2.1.3 Attributes after pruning

Attribute             Distinct values   Description
nodeinstance_id       34307             String
starttimelongmillis   14790             Integer
endtimelongmillis     14464             Integer
flowinstance_id       7416              String

A.2.2 Atomic rules

A.2.2.1 All atomic rules

Rule                                      Shared values
nodeinstance_id=nodeinstance_id           34307
starttimelongmillis=starttimelongmillis   14790
endtimelongmillis=endtimelongmillis       14464
starttimelongmillis=endtimelongmillis     13027
flowinstance_id=flowinstance_id           7416
flowinstance_id=starttimelongmillis       0
flowinstance_id=endtimelongmillis         0
flowinstance_id=nodeinstance_id           0
starttimelongmillis=nodeinstance_id       0
endtimelongmillis=nodeinstance_id         0

A.2.2.2 Pruning

Type                                         Value   Rules pruned
Lower threshold (applied on shared values)   0.01    5
Upper threshold (applied on CC#)             0.9     0

A.2.2.3 The rules after pruning

Symbol   Rule                                      CC#    Avg       Max   Min   Median   IQM
R1       nodeinstance_id=nodeinstance_id           2500   2         2     2     2        2
R2       flowinstance_id=flowinstance_id           229    21.8341   22    6     22       21.90
R3       starttimelongmillis=starttimelongmillis   2744   1.82216   4     1     2        1.81
R4       starttimelongmillis=endtimelongmillis     2705   1.84843   8     1     1        1.14
R5       endtimelongmillis=endtimelongmillis       2735   1.82815   4     1     2        1.82

A.2.2.4 Rule inclusion

Relationship: R1 ⊂ R2

A.2.3 Composite rules

A.2.3.1 Conjunction

Operator   Theoretical number of rules   Rules computed
AND        26                            1

Conjunctive criteria (in the order in which they are applied):

Criterion                                   Level 1   Level 2   Level 3   Level 4
Attribute definition constraints            6         9         5         1
Inclusion                                   1         0         0         0
Upper threshold (applied on CC#)            2         1         0         0
Monotonicity with respect to conjunction    0         0         0         0

Conjunctive rule:

Symbol    CC#    Avg    Max   Min   Median   IQM
R3 ∧ R5   2968   1.68   2     1     2        1.86

A.2.3.2 Disjunction

Operator   Theoretical number of rules   Rules computed
OR         57                            4

Disjunctive criteria (in the order in which they are applied):

Criterion                                   Level 1   Level 2   Level 3   Level 4   Level 5
Associativity                               2         7         9         5         1
Inclusion                                   9         12        6         1         0
Lower threshold (applied on CC#)            0         0         0         0         0
Monotonicity with respect to disjunction    1         0         0         0         0

Disjunctive rules:

Symbol           CC#    Avg       Max   Min   Median   IQM
R3 ∨ R4          2108   2.37192   8     1     2        1.82
R4 ∨ R5          2111   2.36855   8     1     2        1.82
R4 ∨ (R3 ∧ R5)   2233   2.23914   8     1     2        1.71
R3 ∨ R4 ∨ R5     1986   2.51762   8     1     2        1.94

IQM 2 21.90 1.81 1.14 1.82

A.3 Robostrike

[Figure 10: Extract from Robostrike's data]

The experiments presented in this section are the result of applying the approach to a simplified version of the dataset, in which fewer attributes are present. Attributes inside nested constructs several levels deep (inside the XML messages) are not considered, as they are believed not to be directly useful for correlation purposes.

A.3.1 The attributes

A.3.1.1 All attributes

The number of attributes = 18

Attribute   Distinct values   Description
timestamp   1681440           Date
stat        526571            String
data        92235             Integer
convid      48526             String
code        46826             Integer
cntd        45840             Integer
uid         29604             Integer
usid        25268             Integer
logn        16419             String
retv        16409             Integer
iden        1631              String
gmid        1000              Integer
info        23                String
msg         10                String
name        3                 String
retc        2                 String
from        2                 String
to          1                 String

A.3.1.2 Pruning

Threshold type   Value   Attributes pruned
Lower            0.001   6
Upper            0.8     1

A.3.1.3 Attributes after pruning

Attribute   Distinct values   Description
stat        526571            String
data        92235             Integer
convid      48526             String
code        46826             Integer
cntd        45840             Integer
uid         29604             Integer
usid        25268             Integer
logn        16419             String
retv        16409             Integer

A.3.2 Atomic rules

A.3.2.1 All atomic rules

Rule            Shared values   CC#
stat=stat       526571          1449
data=data       92235           4007
convid=convid   48526           120
code=code       46826           5000
cntd=cntd       45840           1593
uid=uid         29604           120
usid=usid       25268           3940
logn=logn       16419           85
uid=usid        16416           5000
logn=retv       16409           3265
retv=retv       16409           4974
usid=cntd       11661           1108
uid=cntd        8987            5000
cntd=stat       1               N.A.
retv=data       1               N.A.
logn=data       1               N.A.

A.3.2.2 Pruning

Type                                         Value   Rules pruned
Lower threshold (applied on shared values)   0.01    33
Upper threshold (applied on CC#)             0.8     5

A.3.2.3 The rules after pruning

Symbol   Rule            CC#    Avg       Max    Min   Median   IQM
R6       convid=convid   120    41.6667   396    2     19       21.83
R7       uid=uid         120    41.6667   396    2     19       21.83
R8       logn=logn       85     58.8235   528    2     25       32.70
R9       usid=cntd       1108   4.51264   3893   1     1        1
R10      logn=retv       3265   1.53139   493    1     1        1
R11      usid=usid       3940   1.26904   123    1     1        1
R12      cntd=cntd       1593   3.13873   2757   1     1        1
R13      stat=stat       1949   2.56542   2490   1     1        1

A.3.2.4 Rule inclusion

Relationship: R6 ⊂ R7, R6 ⊂ R8, R7 ⊂ R8, R10 ∨ R11 ⊂ (R12 ∧ R13)

A.3.3 Composite rules

A.3.3.1 Conjunction

Operator   Theoretical number of rules   Rules computed
AND        247                           1

Conjunctive criteria (in the order in which they are applied):

Criterion                                   Level 1   Level 2   Level 3   Level 4   Level 5   Level 6   Level 7
Attribute definition constraints            24        55        70        56        28        8         1
Inclusion                                   3         1         0         0         0         0         0
Upper threshold (applied on CC#)            0         0         0         0         0         0         0
Monotonicity with respect to conjunction    0         0         0         0         0         0         0

Conjunctive rule:

Symbol      CC#    Avg       Max    Min   Median   IQM
R12 ∧ R13   2320   2.15517   2490   1     1        1

A.3.3.2 Disjunction

Operator   Theoretical number of rules   Rules computed
OR         502                           11

Disjunctive criteria (in the order in which they are applied):

Criterion                                   Level 1   Level 2   Level 3   Level 4   Level 5   Level 6   Level 7   Level 8
Associativity                               2         13        36        55        50        27        8         1
Inclusion                                   21        47        45        21        4         0         0         0
Lower threshold (applied on CC#)            4         21        45        50        30        9         1         0
Monotonicity with respect to disjunction    1         0         0         0         0         0         0         0

Disjunctive rules:

Symbol              CC#    Avg       Max    Min   Median   IQM
R9 ∨ R10            615    8.13008   4386   1     1        1
R9 ∨ R11            102    49.0196   3953   1     5        5.39
R10 ∨ R12           1042   4.79846   3783   1     1        1
R10 ∨ R13           1310   3.81679   3534   1     1        1
R10 ∨ (R12 ∧ R13)   1618   3.09023   3331   1     1        1
R11 ∨ R12           1591   3.14268   2759   1     1        1
R11 ∨ R13           1947   2.56805   2492   1     1        1
R11 ∨ (R12 ∧ R13)   2318   2.15703   2492   1     1        1
R9 ∨ R10 ∨ R11      43     116.279   4870   1     1        1.53
R10 ∨ R11 ∨ R12     1040   4.80769   3785   1     1        1
R10 ∨ R11 ∨ R13     1308   3.82263   3536   1     1        1

B Correlation Problem as a Clustering Problem

In this section, we present some of the results obtained when tackling the correlation problem as a clustering problem. When performing clustering in the traditional way, the goal is to group together conversation messages that are deemed very similar and to separate dissimilar ones into different clusters. This is sometimes called horizontal clustering. Driven by a similarity measure, the clustering algorithm places together those messages that reveal maximum similarity. We performed horizontal clustering on the Retailer and Robostrike data sets, using the Simple K-means algorithm of the WEKA data mining tool (see http://www.cs.waikato.ac.nz/ml/weka/), letting the algorithm automatically decide the proper number of clusters for each data set. Note that in the domain of service logs, a cluster corresponds to a conversation and, thus, these terms can be used interchangeably. The results showed two things:
• The coherence within the clusters was very high, which is a good indication that WEKA produced good quality clusters.
• There is a big discrepancy between the number of clusters produced by WEKA and the number of actual conversations in the data sets.
Looking at the second observation in more detail reveals that there is a high overlap among the values of certain attributes in messages of different conversations. This fact alone drives the clustering algorithm to produce far fewer clusters than conversations. In addition, these groups do not necessarily correspond to actual conversations; e.g., all messages with similar contents are grouped together. Hence, our initial argument that clustering reveals a type of grouping that is not necessarily useful for message correlation is confirmed.

[Figure 11: Dendrogram for the Retailer dataset (hierarchical grouping of the attributes SEQUENCEID, TIME_STAMP, OPERATION, DURATION, REQUESTSIZE, RESPONSESIZE, requestid, responseid, SERVICENAME, SERVICEVERSION, ERRORTYPE, PORTTYPE, BINDING, RECEIVERURI, SENDERURI; merge distances range from 0.0 to 0.8)]

In recent work [4], it has been shown how a categorical clustering algorithm can be used to (a) group the values of a relation such that those with higher degrees of redundancy in the tuples are placed together, and (b) group the attributes of a relation such that grouped attributes contain more naturally co-occurring values or, simply, exhibit higher degrees of correlation. This type of clustering is sometimes called vertical clustering. The technique is based on information theory and is highly applicable to attributes with categorical values, which is in accordance with our heuristic assumptions.

The final result is a so-called dendrogram that offers a hierarchical representation of the attribute groupings according to the redundancy in their values. Initially, each attribute forms its own group. As redundancy is calculated within the attributes, attribute groups start forming by placing together the attributes with the highest amount of redundancy. Groupings (or merges) that take place closer to the leaf level of the dendrogram contain very good candidates for (simple) key-based correlators. These are correlators based on the natural co-occurrence of values, e.g., the same customer appearing in the same order over a number of messages. However, this is only one type of correlation. Vertical clustering cannot reveal hints on more complex correlation patterns, e.g., reference-based correlation, in which correlator attributes may change from one set of messages to another within the same conversation. To evaluate the effectiveness of this approach in a key-based correlation scenario, we applied it to the Retailer dataset. The result is given in the dendrogram of Figure 11. In this figure, we observe that attributes whose values naturally co-occur get merged closer to the leaf nodes. For example, requestID and responseID, which are key attributes for this dataset, are merged together. However, in the case of composite key-based rules it is not clear how to derive them from this output.


C XML Schema of Robostrike dataset

[XML schema listing omitted]