Discovering Sentinel Rules for Business ... - Semantic Scholar

1 downloads 0 Views 302KB Size Report
Mar 5, 2009 - in negative blogging about a company could trigger a sentinel rule warning that revenue ... or reduce cost to cope with a reduction in revenue.
Discovering Sentinel Rules for Business Intelligence

Morten Middelfart and Torben Bach Pedersen

March 5, 2009

TR-24

A DB Technical Report

Title

Discovering Sentinel Rules for Business Intelligence c 2009 Morten Middelfart and Torben Bach Pedersen. Copyright ° All rights reserved.

Author(s)

Morten Middelfart and Torben Bach Pedersen

Publication History

March, 2009. A Technical Report.

For additional information, see the DB Tech Reports homepage: hdbtr.cs.aau.dki. Any software made available via DB Tech Reports is provided “as is” and without any express or implied warranties, including, without limitation, the implied warranty of merchantability and fitness for a particular purpose.

The DB Tech Reports icon is made from two letters in an early version of the Rune alphabet, which was used by the Vikings, among others. Runes have angular shapes and lack horizontal lines because the primary storage medium was wood, although they may also be found on jewelry, tools, and weapons. Runes were perceived as having magic, hidden powers. The first letter in the logo is “Dagaz,” the rune for day or daylight and the phonetic equivalent of “d.” Its meanings include happiness, activity, and satisfaction. The second letter is “Berkano,” which is associated with the birch tree. Its divinatory meanings include health, new beginnings, growth, plenty, and clearance. It is associated with Idun, goddess of Spring, and with fertility. It is the phonetic equivalent of “b.”

Abstract This paper proposes the concept of sentinel rules for multi-dimensional data that warns users when measure data concerning the external environment changes. For instance, a surge in negative blogging about a company could trigger a sentinel rule warning that revenue will decrease within two months, so a new course of action can be taken. Hereby, we expand the window of opportunity for organizations and facilitate successful navigation even though the world behaves chaotically. Since sentinel rules are at the schema level as opposed to the data level, and operate on data changes as opposed to absolute data values, we are able to discover strong and useful sentinel rules that would otherwise be hidden when using sequential pattern mining or correlation techniques. We present a method for sentinel rule discovery and an implementation of this method that scales linearly on large data volumes.

1

Introduction

The Computer Aided Leadership and Management (CALM) concept copes with the challenges facing managers that operate in a world of chaos due to the globalization of commerce and connectivity [17]; in this chaotic world, the ability to continuously act is far more crucial for success than the ability to long-term forecast. The idea in CALM is to take the Observation-OrientationDecision-Action (OODA) loop (originally pioneered by Top Gun fighter pilot John Boyd in the 1950s [15]), and integrate business intelligence (BI) technologies to drastically increase the speed with which a user in an organization cycles through the OODA loop. Using CALM, any organization can be described as a set of OODA loops that are continuously cycled to fulfil one or more Key Performance Indicators (KPI’s). One way to improve the speed from observation to action is to expand the ”horizon” by allowing the user to see data from the external environment, and not only for the internal performance of the organization. Another way is to give early warnings when factors change that might influence the user’s KPI’s, e.g., revenue. Placing ”sentinels” at the outskirts of the data available seeks to harness both ways of improving reaction time and thus organizational competitiveness. A sentinel rule is a relationship between two measures, A and B, in an OLAP database where we know, that a change in measure A at one point in time affects measure B within a certain warning period, with a certain confidence. If such a relationship exists, we call measure A the source measure, and measure B the target measure. Usually, the target measure is, or contributes to, a KPI. The source measure ideally represents the external environment, or is as close to the external environment as possible. Examples of source measures for an organization could be: the number of negative blog entries (external), the number of positive articles in papers (external), or the number of complaints from customers (internal, yet as close to external as possible). Examples of target measures could be: revenue or contribution margin. Imagine a company selling a product globally where we discovered the sentinel rule: ”IF negative blogs go up THEN revenue goes down within two months AND IF negative blogs go down THEN revenue goes up within two months”. Assume that Google is searched daily for negative blogs, and the number of negative blogs is stored in the company’s OLAP database. Also, the company’s BI solution can notify users based on sentinel rules. When a user receives notification that the number of negative blogs goes up, he will know that revenue will go down in two months with a certain confidence. Depending on the situation, the user might have a number of evasive actions such as: post positive blogs to sway the mood, or reduce cost to cope with a reduction in revenue. Regardless of the action, the sentinel rule has raised awareness of a problem and reduced the time from observation to action. Metaphorically, sentinels seek to do for organizations what radars do for navigation of ships. The novel contributions in this paper include the sentinel rule concept, and an algorithm that 1

discover sentinel rules on multi-dimensional data. We give a formal definition of sentinel rules, and we define the indication concept for rules and for source and target measures. In this context, we provide a contradiction elimination process that allows us to generate more general rules that are easy to interpret. We also provide a useful notation for sentinel rules. We conduct several experiments to validate that our algorithm scales linearly on large volumes of synthetic and realworld data and on databases with high complexity in terms of the number of measures. Since sentinel rules operate on data changes as opposed to absolute data values, and are at the schema level as opposed to the data level (such as association rules and sequential patterns), we can find strong rules that neither association rules nor sequential pattern mining would find. In addition, we found sentinel rules to be complementary to correlation techniques, since our solution finds ”micro-predictions” that hold for a smaller subset within a dataset; using correlation techniques alone such rules would be ”hidden in the average”. We believe that we are the first to propose the concept of sentinel rules, and to provide an algorithm and implementation for discovering them. The next section presents the formal definition, Section 3 presents an algorithm including an assessment of its complexity. Section 4 presents a scalability and a qualitative study. Section 5 presents the work related to the discovery of sentinel rules.

2

Problem Definition

Running Example: Imagine a company that sells products world-wide, and that we, in addition to the traditional financial figures such as revenue, Rev, have been monitoring the environment outside our organization and collected that information in three measures. The measure NBlgs represents the number of times an entry is written on a blog where a user is venting a negative opinion about our company or products. The measure CstPrb represents the number of times a customer contacts our company with a problem related to our products. The measure WHts represents the number of hits on our website, and this figure has been cleansed in order to represent human contact exclusively, eliminating traffic by robots etc. Table 1. Example dataset

In Table 1 we see a subset from our database, representing each quarter in year 2007 across three geographical regions. It should be noted that a subset like Table 1 can easily be extracted from a multi-dimensional database, i.e., if the desired data are the base level of the database no processing is needed, if the desired levels are higher than the base level, the data might or might not be preaggregated. However, both extraction and aggregation are typically basic built in functions of any multi-dimensional database. The three measures: NBlgs, CstPrb and WHts, representing the external environment around our company, have been presented along with the internal measure, Rev, representing our Revenue. The variable names: T, D2 , M1 ...M4 have been assigned to the dimensions and measures in order to create transparency to the formal definition in Section 2. We are interested in discovering whether we can use any of the external measures to predict a future impact on the internal Revenue measure; in other words we are looking for sentinel rules where one of the measures M1 ...M3 can give us an early warning about changes to M4 . To disT: Time 2007-Q1 2007-Q2 2007-Q3 2007-Q4 2007-Q1 2007-Q2 2007-Q3 2007-Q4 2007-Q1 2007-Q2 2007-Q3 2007-Q4

D2 : Region Asia Asia Asia Asia EU EU EU EU USA USA USA USA

M1 : NBlgs 20 21 17 15 30 25 22 19 29 35 40 39

M2 : CstPrb 50 45 33 34 41 36 46 37 60 70 72 73

M3 : WHts 1.000 1.500 2.000 2.500 3.000 3.500 4.000 4.500 5.000 5.500 6.500 7.500

M4 : Rev 10.000 9.000 11.000 13.000 20.000 25.000 28.000 35.000 50.000 55.000 45.000 40.000

2

tinguish between which measures are ”causing” the other, we call the measures M1 ...M3 source measures and the measure M4 is called the target measure. Formal Definition: Let C be a multi-dimensional data cube containing a set of dimensions: D = {D1 , D2 ...Dn } and a set of measures: M = {M1 , M2 ...Mp }. We denote the members of the dimensions in D by d1 , d2 ...dn and we denote the corresponding measure values for any combination of dimension members by m1 , m2 ...mp . A measure value is a function, Mi , that returns the value of a given measure corresponding to the dimension members it is presented with. We will now provide a series of definitions that define a source measure, A, is a sentinel for a target measure, B, i.e., a guarded watchtower from which we monitor A in order to know about changes ahead of time to B. The sentinel rule between A and B is denoted A Ã B. We assume, without loss of generality, that there is only one time dimension, T , in C, and that T = D1 , and subsequently t = d1 . A fact, f , in C is then defined as: f = (t, d2 , d3 ...dn , m1 , m2 ...mp )

(1)

Given a fact f , the measure Mi is a function Mi (t, d2 , d3 ...dn ) = mi . The ”dimension” part of f , (t, d2 , d3 ...dn ), is called a cell. The shifting of a fact f , f 0 , is a fact with the same non-time dimension values (d2 ...dn ) as f , but for time period t+o, if it exists in C, i.e., a period of o members later on the time dimension. We denote the offset, o, and define the function as: Shift(C, f, o) = f 0 = (t + o, d2 , d3 ...dn , m01 , m02 ...m0p ) if f 0 ∈ C

(2)

Since we are interested in the change in data, we introduce the measure difference function, Diff. With Diff, we find the relative changes to each of the measures during the time period specified by the offset. Diff is defined as follows: Diff (C, f, o) = (t, d2 , d3 ...dn ,

m01 − m1 m02 − m2 m0p − mp , ... ) m1 m2 mp

(3)

where f = (t, d2 , d3 ...dn , m1 , m2 ...mp ) ∧ f ∈ C ∧ f 0 = Shift(C, f, o) = (t + o, d2 , d3 ...dn , m01 , m02 ...m0p ) ∧ f 0 ∈ C Given a threshold, α, we say that x ∈ Diff (C, f, o) is an indication on a measure, Mi , if: x = (t, d2 , d3 ...dn ,

m0p − mp m01 − m1 m0 − mi m0 − mi , ..., i , ..., )∧| i |=α m1 mi mp mi

We say that an indication on Mi , x, is positive, denoted Mi N, when quently that an indication, x, is negative, denoted Mi H, when ∗, meaning that Mi ∗ can be either Mi N or Mi H. Table 2. Indications between quarters. T: D2 : M1 : M2 : Time Region NBlgs CstPrb ’07:Q1→Q2 Asia neutral M2 H ’07:Q2→Q3 Asia M1 H M2 H ’07:Q3→Q4 Asia M1 H neutral ’07:Q1→Q2 EU M1 H M2 H ’07:Q2→Q3 EU M1 H M2 N ’07:Q3→Q4 EU M1 H M2 H ’07:Q1→Q2 USA M1 N M2 N ’07:Q2→Q3 USA M1 N neutral ’07:Q3→Q4 USA neutral neutral

M3 : WHts M3 N M3 N M3 N M3 N M3 N M3 N M3 N M3 N M3 N

M4 : Rev M4 H M4 N M4 N M4 N M4 N M4 N M4 N M4 H M4 H

3

m0i −mi mi

m0i −mi mi

(4)

> 0 and conse-

< 0. We define a wildcard,

In our running example, when assessing whether a relationship exists, we are not concerned with minor fluctuations, so we define a threshold of 10%, meaning that a measure has to change at least 10% up or down in order to be of interest. Furthermore, given the dataset we have, we are interested in seeing the changes that occur over quarters as presented in Table 1. This means that we set the threshold α = 10%

and then the offset o = 1 Quarter. In Table 2, we have calculated the changes from each quarter to the next and subjected each change to an evaluation against the threshold of 10% change. We denote positive indications by N and subsequently negative by H, if a change is less than 10% in either direction it is deemed ”neutral”. Please note that since we are dealing with changes between periods, we naturally get one less row for each region. ST (C, o, w) = {(Diff (C, f, o), Diff (C, Shift(C, f, w), o))|f ∈ C}

(5)

A Source-Target Set, ST , is defined as paired indications of changes over time, where the source and target measures have been shifted with the offset, o. The target measures have additionally been shifted with a warning period, w, which is the timeframe after which we should expect a change on a target measure, after an indication on a source measure has occurred. We say that (x, x0 ) ∈ ST (C, o, w) supports the indication rule AN → BN if x is an indication of AN and x0 is an indication of BN. In this case, we also say that x supports AN and x 0 supports BN. The support of an indication rule is the number of (x, x0 ) ∈ ST (C, o, w) which supports the rule. The support of indication rules AH → BH, AN → BH and AH → BN as well as the support for indications AH and BH are defined similarly. We denote the support of an indication and an indication rule by IndSupp followed by the name of the indication or indication rule, respectively, e.g., IndSuppAN and IndSuppAN→B N . A sentinel rule is an unambiguous relationship between A and B, thus we must first eliminate contradicting indication rules, if such exist, before we have a sentinel rule. We refer to this process as the contradiction elimination process, and we use it to remove indication rules with the same premise, but a different consequent, and vice versa, e.g., if both AN → BN and AN → BH or if both AN → BN and AH → BN are supported. To eliminate such contradictions, we pair the indication rules in two sets that do not contradict each other, and we denote these sets by A → B and A → inv (B), as follows: A → B = {AN → BN, AH → BH} and A → inv (B) = {AN → BH, AH → BN}. Here inv indicates an inverted relationship between the indications on A and B, e.g. if AN then BH, and vice versa. For the purpose of being able to deduct the support of the indication rule(s) we eliminate, we define functions for returning the premise and the consequent indication, IndPrem and IndCons, from an indication rule AN → BN as follows: IndPrem(AN → B N) = AN and IndCons(AN → B N) = B N. Furthermore, we define the complement of an indication as follows: AN = AH and AH = AN. We can now define a contradicting indication rule as a function, ContraRule, for an indication rule, IndRule, as follows: ContraRule(IndRule) = IndPrem(IndRule) → IndCons(IndRule)

(6)

ElimSupp(IndRule) = IndSuppIndRule − IndSuppContraRule(IndRule)

(7)

The support after elimination, ElimSupp, of an indication rule, IndRule, where the support of the contradicting indication rule, ContraRule(IndRule), has been eliminated can be calculated as shown in Formula 7.  {IndRulei | IndRulei ∈A → B ∧ ElimSupp(IndRulei ) > 0}      if IndSuppA→B >= IndSuppA→inv (B) ,  (8) MaxRule =    {IndRulei | IndRulei ∈A → inv (B) ∧ ElimSupp(IndRulei ) > 0}    if IndSuppA→B < IndSuppA→inv (B ) . MaxRule is the set of indication rule(s), IndRulei , in the set (A → B or A → inv (B )) with the highest IndSupp and where ElimSupp(IndRulei ) > 0. With MaxRule, we have identified the 4

best indication rule(s) for a sentinel rule that represents an unambiguous relationship between A and B, i.e., the non-contradicting indication rules with the highest ElimSupp. In other words, we have eliminated the contradicting indication rules where the premise contradicts the consequent, as well as the orthogonal indication rules where different premises have the same consequent. If the MaxRule set consists of only one indication rule, we refer to the sentinel rule based on this as a uni-directional rule. We denote the support of a sentinel rule by SentSupp, followed by the name of the sentinel rule, e.g., SentSuppAÃB . For a potential sentinel rule, A à B, we define SentSupp as the sum of the support of source measure indications for the indication rule(s) contained in the sentinel rule:   if AH → B∗ 6∈ MaxRule, IndSuppAN SentSuppAÃB = IndSuppAH (9) if AN → B∗ 6∈ MaxRule,   IndSuppAN + IndSuppAH otherwise. In Formula (9) we note the difference between the support of an indication rule, IndSupp, and a sentinel rule, SentSupp. Specifically, when calculating the support of a sentinel rule, SentSuppAÃB , we only consider the support of indications on the source measure (the premise), AN and AH. With indication rules, both indications on the source and target measure needs to occur. The reason is, that the consequential support of indications on the target measure, BN or BH, is taken into consideration when calculating the confidence of the sentinel rule in Formula (10). In the case of a uni-directional rule (the two first cases) we only consider the support of indications on the source measure that have the same direction as the one indication rule in MaxRule; this is done in order not to penalize otherwise good uni-directional rules in terms of confidence. We denote confidence by Conf, and define the confidence for a sentinel rule, A à B, as follows: P ElimSupp(IndRulei ) (10) ConfAÃB = IndRulei ∈MaxRule SentSuppAÃB The minimum threshold for SentSupp is denoted β, and the minimum threshold for Conf is denoted γ. With these definitions, we say that a sentinel rule, A à B, with an offset, o, and a warning period, w, exists in C when SentSuppAÃB = β and ConfAÃB = γ. Sentinel rule notation: To express sentinel rules with easy readability, we use à to show that there is a sentinel rule between a source measure, A, and a target measure, B. In the case, where a bi-directional rule represents an inverted relationship between the source and the target measure, we add inv to the target measure. In the case where the rule is uni-directional, we add N or H to both the source and the target measure to express the direction of the sentinel rule. In our running example, we limit ourselves to investigating whether sentinel rules exist between any of the source measures M1 ...M3 and the target measure M4 . We now need to compare the changes in M1 ...M3 to changes in M4 at a later time. In this case, we choose the timeframe of 1 quarter again, meaning that warning period w = 1 Quarter. In Table 3, we show the comparison between the source measure indications and the target measure indication one quarter later. The measure M4 is basically moved one line up -or as shown in Table 3; one quarter back. This means that all source measures for Asia changing 2007: Q2→Q3 Table 3. Target and source measure comparison T: D2 : M1 : M2 : M3 : Time Region NBlgs CstPrb WHts ’07:Q1→Q2 Asia neutral M2 H M3 N ’07:Q2→Q3 Asia M1 H M2 H M3 N ’07:Q1→Q2 EU M1 H M2 H M3 N ’07:Q2→Q3 EU M1 H M2 N M3 N ’07:Q1→Q2 USA M1 N M2 N M3 N ’07:Q2→Q3 USA M1 N neutral M3 N

M40 : Rev M40 N M40 N M40 N M40 N M40 H M40 H

5

as shown in the left column are now compared on the same line, within the same row, to the change on the target measure, M4 , for Asia changing 2007: Q3→Q4 and so on. The shift of M4 shown in the row with data for the period one quarter earlier is denoted M40 . Please note that since we are looking at changes between the periods selected on the time dimension, as noted earlier, we naturally get one less row for each geographical region, when we make the comparison across 1 quarter. Based on Table 3, we count the support for each combination of indication changes, the indication rules, for each potential sentinel rule; in addition, we can count the support of the relationship overall, basically the support means counting all rows that do not have a ”neutral” change on the source measure since we define indications as being either positive or negative. For example, we see summarized in Table 4, that the indication rule M1 H → M40 N is supported 3 times in the dataset shown in Table 3; we say that the indication rule M1 H → M40 N has a support of 3, and the sentinel rule M1 Ã M4 has a support of all indication rule combinations which in this case is 5. Table 4 through 6 lists the indication rules for each potential sentinel rule with their respective support (Formula 9). As mentioned earlier, the ideal sentinel rule describes changes bi-directionally so that it can ”predict” both positive and negative changes on the target measure. However, the relationship also needs to be non-contradictory in order to be useful as a sentinel rule. To do this, we eliminate the indications that contradict each other as described in Formulae 6 and 7. In Table 5 we find the a uni-directional rule where the two contradicting indication rules have equal support, thus we disregard these indications completely (Formula 9) and therefore SentSuppM2 ÃM4 =3. In Table 6 the contradiction elimination process does not eliminate both indication rules, it reduces the two indication rules to one and decreases ElimSupp (Formula 7) in the calculation of confidence. In order to identify the best sentinel rules, we set the thresholds β = 3 and γ = 60%. Table 7 through 9 show the sentinel rules from our running example and their respective conformance to the thresholds we have set. As seen in Table 8 and 9, we end up having uni-directional sentinel rules, since the indication rules M2 N → M40 N and M2 N → M40 H, as shown in Table 5, contradict each other and have equal support. In addition, the indication rules M3 N → M40 N and M3 N → M40 H contradict each other in Table 6. Of these, M3 N → M40 N is strongest and ”wins” the elimination process (Formula 8) as seen in Table 9. Based on this example, we have now found that there are two sentinel rules that can provide our company with an early warning. If we monitor the changes to M1 , the number of negative blog entries, we will know one quarter in advance whether to expect an increase or a decrease in Indication rule and sentinel rule support Table 4. M1 Ã M4 M1 M40 IndSupp M1 N M40 N 0 M1 H M40 H 0 M1 N M40 H 2 M1 H M40 N 3 SentSuppM1 ÃM4 = 5

Table 5. M2 Ã M4 M2 M40 IndSupp M2 N M40 N 1 M2 H M40 H 0 M2 N M40 H 1 M2 H M40 N 3 SentSuppM2 ÃM4 = 3

Table 6. M3 Ã M4 M3 M40 IndSupp M3 N M40 N 4 M3 H M40 H 0 M3 N M40 H 2 M3 H M40 N 0 SentSuppM3 ÃM4 = 6

Table 7. M1 Ã M4 M1 M40 ElimSupp M1 N M40 H 2 M1 H M40 N 3 SentSuppM1 ÃM4 = 5 ConfM1 ÃM4 = 55 = 100% Conformance: ok

Table 8. M2 Ã M4 M2 M40 ElimSupp M2 H M40 N 3 SentSuppM2 ÃM4 = 3 ConfM2 ÃM4 = 35 = 100% Conformance: ok

Table 9. M3 Ã M4 M3 M40 ElimSupp M3 N M40 N 2 SentSuppM3 ÃM4 = 6 ConfM3 ÃM4 = 62 = 33% Conformance: failed

6

M4 Revenue. If we monitor the number of times a customer contacts our company with a problem related to our products, M2 , we will know one quarter ahead whether to expect an increase in Revenue. This example demonstrates the usefulness of the sentinel concept, and the idea is that we attempt to place our sentinels as close to the external environment as possible and with as high reliability as possible. Using the notation defined earlier in this section, we can express the rules found in our running example as follows: NBlgs à inv (Rev ) and CstPrbH à Rev N

3

The FindSentinels Algorithm

The following algorithm has been implemented in SQL on a Microsoft SQL Server 2005 as explained in Section 4. The actual SQL code can be found in Appendix A. We assume without loss of generality that of the p measures in the dataset, C, M1 ...Mp−1 are the source measures and Mp is the target measure. Step 1 creates a temporary table where each unique value of (time dimension value) t, is sorted in ascending order and assigned an integer, Id, growing by 1 for each t. This temporary table will allow us to select values of t for comparison with a given distance in periods, regardless of the format of the period field, t, in the database. To optimize performance, we create an index on the period table. By joining 4 copies of each of the original dataset and the period table (one for each of the periods: t, t + o, t + w, and t + w + o), we create a Source-Target set (Formula (5)) and calculate indications (Formulae (3) and (4)) for our selected p-1 source measures and one target measure. We calculate these indications for each cell (dimension combination) in the dataset, and return -1, 0, or 1 depending on whether the indication is negative, neutral or positive against the threshold α. Step 2 counts the number of positive and negative indications on the source measure, and for each of these source measure indications, it summarizes the indications on the target measure. Since the indications are expressed as -1, 0 or 1, our contradiction elimination process can be carried out using sum. Step 3 retrieves the potential rules from previous output, meaning that a source measure needs to have at least one indication with a consequential indication on the target measure, i.e., ElimSupp 0. For each of these rules, we calculate the sum of the support of source measure indications, SentSupp, the sum of absolute indications on the target measure, AbsElimSupp, as well as MaxElimSupp which is max(ElimSupp). In addition, we calculate the Direction of the relationship between source and target measure where 1 is straightforward and -1 is inverted. The nature of Direction also helps us eliminate orthogonal rules since these will always have Direction=0. This is true because an orthogonal relationship means that both positive and negative indications on the source measure leads to only one type of indication on the target measure. Finally, we calculate the number of indication rules, IndRuleCount, in the potential sentinel rule. This information is used to distinguish between bi- and uni-directional rules. Using this information, we can now identify the sentinel rules that comply with the the criteria of SentSupp >= β and Conf >= γ. In addition, we can use the values of IndRuleCount, Direction, and MaxElimSupp to describe the sentinel rule in accordance with our notation. We store the output in a table called FinalResult.

7

Algorithm FindSentinels Input: A dataset, C, an offset, o, a warning period, w, a threshold for indications, α, a minimum SentSupp threshold, β, and a minimum Conf threshold, γ. Output: Sentinel rules with their respective SentSupp and Conf. Method: Sentinel rules are discovered as follows: 1. Scan the dataset C once and retrieve unique values of t into an indexed subset. Use the subset to reference each cell (t, d2 , ... ,dn ) ∈ C with the corresponding cells for {t + o, t + w, t + w + o} ∈ C. Output a SourceTarget set (Formula (5)) for each cell, (t, d2 , ... ,dn ), where the indications (Formulae (3) and (4)) on source measures, M1 ...Mp−1 , are calculated using {t, t + o} and the indications on target measure, Mp , is calculated using {t + w, t + w + o}. 2. For each positive and negative source measure indication, Mi Ind , in the output from Step 1, count the number of source measure indications as IndSuppi and sum the target measure indications as ElimSuppi . 3. Retrieve from the output from Step 2, each source measure, Mi ∈ M1 ...Mp−1 , where ElimSupp 0. For each of these source measures, calculate: SentSupp=sum(IndSupp), AbsElimSupp=sum|ElimSupp|, MaxElimSupp=max(ElimSupp), Direction=avg(sign(Mi Ind )*sign(ElimSupp)), and IndRuleCount as the number of different indications (positive, negative). Output the rules where SentSupp >= β and Conf = AbsElimSupp >= γ, use IndRuleCount=2 to SentSupp identify bi-directional rules and Direction to describe whether the relationship is straight-forward or inverted. For uni-directional rules (IndRuleCount= 1) use the four combinations of Direction and sign(MaxElimSupp) to describe the relationship.

Upon execution of the algorithm, FindSentinels, with Table 10. The FinalResult table SentinelRule SentSupp Conf the dataset from our running example as C, we get NBlgs->inv(Rev) 5 100 the output table named FinalResult as seen in Table CstPrb dec->Rev inc 3 100 10. We note that the result is similar to that of Tables 7 & 8, and we can easily recognize the sentinel rules: NBlgs à inv (Rev ) and CstPrbH à Rev N Computational Complexity: When we examine the individual statements the algorithm, FindSentinels, we notice that the size of output for each step is at most as large as the input, thus the computational complexity will be dominated by the size of the input. The size of the input for Step 3 is much smaller than for previous statements and can thus be disregarded. The retrieve of unique t in Step 1 can be performed in O(n) using a hash-based algorithm [9], where n is the size of the dataset, C. The indexing can be done in time O(p log p) where p is the number of periods, and since p = γ. We use the values of IndRuleCount, Direction, and MaxElimSupp to describe the sentinel rule in accordance with our notation. The output is stored in FinalResult.

14

SQL Statements for FindSentinels(C, o, w, α, β, γ) (1) SELECT T , ROW NUMBER() OVER (ORDER BY T ) AS Id INTO Period FROM (SELECT DISTINCT T FROM C) AS subselect (2) CREATE INDEX PeriodIndex ON Period(T ) (3) SELECT a.T , a.D2 , ... , a.Dn , Ind ((b.M1 -a.M1 )/a.M1 ,α) AS M1 ind, Ind ((b.M2 -a.M2 )/a.M1 ,α) AS M2 ind, ... Ind ((d.Mp -c.Mp )/c.Mp ,α) AS TMind INTO Result3 FROM C a, C b, C c, C d, Period pa, Period pb, Period pc, Period pd, WHERE a.t=pa.t AND b.t=pb.t AND c.t=pc.t AND d.t=pd.t AND a.D2 =b.D2 AND a.D2 =c.D2 AND a.D2 =d.D2 AND ... a.Dn =b.Dn AND a.Dn =c.Dn AND a.Dn =d.Dn AND pb.id=pa.id+o AND pc.id=pa.id+w pd.id=pa.id+w+o Custom function Ind is implemented as follows for source and target measure: SIGN(ROUND(((b.Mi -a.Mi )/a.Mi )*100/α, 0, 1)) AS Mi ind SIGN(ROUND(((d.Mp -c.Mp )/c.Mp )*100/α, 0, 1)) AS TMind (4) SELECT sourcemeasure, Ind, COUNT(*) AS IndSupp, SUM(TMind) AS ElimSupp INTO Result4 FROM (SELECT "M1 " AS SourceMeasure, M1 ind AS Ind, TMind FROM Result3 UNION ALL SELECT "M2 " AS SourceMeasure, M2 ind AS Ind, TMind FROM Result3 ... UNION ALL SELECT "Mp−1 " AS SourceMeasure, Mp−1 ind AS Ind, TMind FROM Result3) AS subselect WHERE Ind0 GROUP BY SourceMeasure, Ind (5) SELECT SourceMeasure, SUM(IndSupp) AS SentSupp, SUM(ABS(ElimSupp)) AS ElimSupp, MAX(ElimSupp) AS MaxElimSupp, AVG(SIGN(Ind)*SIGN(ElimSupp)) AS Direction, COUNT(*) AS IndRuleCount INTO Result5 FROM Result4 WHERE ElimSupp0 GROUP BY SourceMeasure

15

(6) SELECT SentinelRule, SentSupp, Conf INTO FinalResult FROM (SELECT SourceMeasure+’->inv(target)’ AS SentinelRule, SentSupp, ElimSupp*100/SentSupp AS Conf FROM Result5 WHERE Directiontarget’ AS SentinelRule, SentSupp, ElimSupp*100/SentSupp AS Conf FROM Result5 WHERE Direction>0 and IndRuleCount=2 UNION ALL SELECT SourceMeasure+’ inc->target dec’ AS SentinelRule, SentSupp, ElimSupp*100/SentSupp AS Conf FROM Result5 WHERE Directiontarget dec’ AS SentinelRule, SentSupp, ElimSupp*100/SentSupp AS Conf FROM Result5 WHERE Direction>0 and MaxElimSupptarget inc’ AS SentinelRule, SentSupp, ElimSupp*100/SentSupp AS Conf FROM Result5 WHERE Direction>0 and MaxElimSupp>0 and IndRuleCount=1 ) as subselect WHERE SentSupp>=β and Conf>=γ

16

B

Sentinels Rules vs. Sequential Pattern Mining

Sequential pattern mining identifies patterns of items that occur with a certain frequency in a dataset of transactions grouped in sequences, i.e., a sequence could look like this: {(ab)c}, where (ab) and c are transactions. The support of patterns such as ”item a leads to item c” are found by counting the number of times a transaction containing a is followed by a transaction containing c. To exemplify the difference between our solution and sequential pattern mining; specifically the difference between working on absolute data values and relative data changes, we can simply subject our running example dataset in Table 1 to sequential pattern mining. Since all values in this dataset are unique, we will only find patterns with an absolute support of only 1 and confidence of 100%, as follows: ”NBlgs=20 leads to Rev=9000” ”CstPrb=50 leads to Rev=9000” ”WHts=1000 leads to Rev=9000” ”NBlgs=20 and CstPrb=50 leads to Rev=9000” ”NBlgs=20 and WHts=1000 leads to Rev=9000” ”NBlgs=20 and CstPrb=50 and WHts=1000 leads to Rev=9000” ... The rules above above have been generated using only the first line in the dataset in Table 1, meaning that the complete set would generate 12x6=72 rules. If we think of a 1,000 record dataset with the same properties as Table 1, such a dataset would result in 6,000 rules. In other words, since the data-level rules generated by sequential pattern mining are value specific, as opposed to relative change ”events” at the schema level, we get a large output in which we do not find any significant patterns. Thus no meaningful, high-level rules can be found by sequential pattern mining directly.

17

C

Sentinels Rules vs. Correlation Techniques

If we compare sentinel rules with correlation techniques, one could intuitively think that correlations would be able to detect exactly the same relationships as sentinel rules, at least when comparing correlations to bi-directional sentinel rules. However, when subjecting data to correlation techniques [11, 23], we do not find the same explicit relationships as we do with sentinel rules. The reason that correlation techniques gives different results compared to our sentinel rules, lies in the fact, that correlation techniques are focused on identifying relationships with the minimal Euclidean distance between pairs, this means that there is a tendency to favor ”smooth” data series with few outliers. On the other hand, the intention of sentinel rules is to give a user an early warning, and from an operational perspective we can even say that we are particularly interested in the source measure changes (perhaps unusual/”outlier”) that appear to have consequential change effect on the target measure. This difference means (from a sentinel perspective) that correlation techniques ”blind us by the average” when our goal is to get an early warning. Tables 11-13 exemplify these differences by describing the relationship between a source measure, A, and a target measure, B, with both sentinel rules and correlation. The data series in Tables 11-13 have been pre-processed so that each line represents a combination of A for time period, t, and B for time period, t + w (w = 1). The distance in time between the lines is o. We operate with the same values for α, β, and γ as in our running example, specifically: α = 10%, β = 3, and γ = 60%. For correlation, we say that it needs to account for more than 50% of the variation (correlation-coefficient2 > 0.5) between the series in order to be relevant. In fact, it does not even make sense to consider correlations that account for 50% or less, since such correlations would be less precise than simply flipping a coin to decide whether target measure, B, will go up or down. We can now read the individual indications directly, and by correlating the two data series, we get a lagged correlation with the same warning period as the sentinel rules. Table 11 shows a bi-directional sentinel rule, A Ã inv(B), that is not discovered using correlation techniques, Table 12 shows a uni-directional sentinel rule, AH Ã BN, that is not discovered using correlation techniques. In Table 13, correlation and sentinel rules both find a uni-directional rule, AH Ã BN. The data from Table 11 is plotted in Fig. 2(a), note again that the top line (B) is shifted w = 1 period(s) backwards. We note the visible changes for A and B, where A has been scaled by a factor of 6 for better visibility. The changes that leads to sentinel rules have been highlighted by an arrow from A to B. When looking at Fig. 2(a) it seems clear that there is a relationship Table 11. A Ã inv(B) A B Indication 99 1000 89 1100 AH → BN 90 1001 98 1015 113 900 AN → BH 101 1025 AH → BN 108 1100 105 1090 109 1040 90 1145 AN → BH Sentinel rule: SentSuppAÃinv (B) = 4 ConfAÃinv (B) = 100% Correlation: Coefficient = -0.4307 Accounts for = 18.55%

Table 12. AH Ã BN A B Indication 500 1000 400 1200 AH → BN 363 1095 399 1200 350 1400 AH → BN 316 1265 323 1285 355 1410 300 1600 AH → BN 329 1755 Sentinel rule: SentSuppAHÃBN = 3 ConfAHÃBN = 100% Correlation: Coefficient = -0.6919 Accounts for = 47.87%

18

Table 13. AH Ã BN A B Indication 100 1000 90 1100 AH → BN 81 1210 AH → BN 73 1331 AH → BN 66 1464 AH → BN 59 1611 AH → BN 53 1772 AH → BN 48 1949 AH → BN 43 2144 AH → BN 39 2358 AH → BN Sentinel rule: SentSuppAHÃBN = 9 ConfAHÃBN = 100% Correlation: Coefficient = -0.9688 Accounts for = 93.85%

1400

1200

A*6 B

1200

1150 1100 1050

800 B

Value

1000

600

1000 950

400

900

200

850

0

800 1

2

(a) Linear Plot

3

4

5 6 Period

7

8

9

10

80

85

(b) Scatter Plot

90

95

100 105 110 115 120 A

Figure 2: Graphs based on data from Table 11. between the changes. However, when displaying the same data in a scatter plot, Fig. 2(b), to test the correlation visually, we find that the relationship between A and B seems to be rather chaotic or at least non-linear. This is the reason that we get a very poor correlation, whereas on the other hand, we find a strong sentinel rule based on four indications. Following these examples, we confirm the differences between sentinel rules and correlation thecniques by applying correlation techniques [11, 23] to our running example in Table 1. Correlation techniques find lagged correlations between all source measures and the target measure, but the only correlation that accounts for more than 50% of the variation, is identified between WHts and Rev which is in contrast to the bi- and uni-directional sentinel rules between NBlgs and Rev, and between CstPrb and Rev, which we found with our solution. In summary, we could say that sentinel rules are a set of ”micro-predictions” that are complementary to correlation techniques. Sentinel rules are useful for discovering strong relationships between a smaller subset within a dataset, and thus they are useful for detecting warnings whenever changes (that would otherwise go unnoticed) in a relevant source measure occur. Correlation techniques, on the other hand, are useful for describing the overall trends within a dataset.

19