Discovering Representative Episodal Association Rules from Event Sequences Using Frequent Closed Episode Sets and Event Constraints∗

Sherri K. Harms†  Jamil Saquer‡  Jitender Deogun§  Tsegaye Tadesse¶

Keywords: Representative Association Rules, Episodes, Event Sequences, Constraints, Algorithms.

Abstract. Discovering association rules from time-series data is an important data mining problem [6]. The number of potential rules grows quickly as the number of items in the antecedent grows, making it difficult for an expert to analyze the rules and identify the useful ones. A class of association rules called representative association rules was introduced in [12] to overcome this problem for transactional association rules. An approach for generating representative association rules for transactions that uses only a subset of the set of frequent itemsets, called frequent closed itemsets, was presented in [24]. In this paper, we employ formal concept analysis to develop the notion of frequent closed episodes from time-series data, and we formalize the concept of representative association rules in the context of event sequences. Applying constraints to target highly significant rules further reduces the number of rules. We present two new algorithms: Gen-FCE, which finds frequent closed episode sets, and Gen-REAR, which generates representative episodal association rules from the set of frequent closed episodes that meet the constraints. These algorithms result in a significant reduction in the number of rules generated, while maintaining the minimum set of relevant association rules and retaining the ability to generate the entire set of association rules with respect to the given constraints. We show how our method can be used to discover associations in a drought risk management decision support system, using multiple climatology datasets related to an automated weather station in Mead, NE as an example. Our method finds the drought association rules for all reasonable window widths and confidence levels on the Mead, NE drought risk management datasets in less than 30 seconds. It also presents a handful of interesting drought rules to the user, weeding out over 10000% more uninteresting rules than previous methods. This method is well suited for time series problems that have groupings of events that occur close together in time but occur relatively infrequently over the entire dataset. It is also well suited for multiple time series problems that have periodic occurrences where the signature of one sequence is present in other sequences, even when the multiple time series are not globally correlated.

∗ This research was supported in part by an NSF Digital Government Grant.
† Department of CSCE, University of Nebraska, Lincoln, NE 68588-0115, USA ([email protected]), phone: (402) 472-5002, fax: (402) 472-7767
‡ CS Department, SW Missouri State, Springfield, MO 65804, USA ([email protected])
§ Department of CSCE, University of Nebraska, Lincoln, NE 68588-0115, USA ([email protected])
¶ National Drought Mitigation Center, University of Nebraska, Lincoln, NE 68588, USA ([email protected])
1 Introduction
Discovering association rules is an important data-mining problem. The problem was first defined in the context of market basket data to identify customers’ buying habits [1]. For example, it is of interest to a supermarket manager to find that 80% of the customers who buy bagels also buy cream cheese and that 5% of all customers buy both bagels and cream cheese. Here the association rule is bagels ⇒ cream-cheese, 80% is the confidence of the rule, and 5% is its support. Several studies on developing efficient algorithms for discovering association rules that satisfy user-specified constraints such as minimum support and minimum confidence have been reported recently [2, 5, 9, 25, 27]. The problem of analyzing and identifying interesting rules becomes difficult as the number of rules increases, and in most applications the number of rules discovered is usually large. Two different approaches to handle this problem have been reported: 1) identifying the association rules that are of special importance to the user, and 2) minimizing the number of association rules discovered [3, 13, 14]. Most of these approaches introduce additional measures of the interestingness of a rule and prune the rules that do not satisfy the additional measures, as a post-processing step. A set of representative association rules, on the other hand, is a minimal set of rules from which all association rules can be generated, determined during the actual processing step. Usually, the number of representative association rules is much smaller than the number of all association rules, and no additional measures are needed for determining the representative association rules. Algorithms for discovering representative association rules were first reported in [11, 12]. These algorithms use the frequent itemsets to find the representative association rules. Recently, Saquer and Deogun developed a different approach for generating representative association rules [24]. This approach uses a subset of the set of frequent itemsets, called frequent closed itemsets, thus reducing the input size and leading to faster algorithms for generating representative association rules. Formal concept analysis techniques were used to find the frequent closed itemsets [22]. Similarly, Zaki [29, 30] used frequent closed itemsets to generate non-redundant association rules in CHARM. He showed that using frequent closed itemsets results in an exponentially smaller rule set (in the length of the longest frequent itemset) than the traditional approach. Algorithms for discovering associations in sequential data [6], and for discovering episodal associations [15], use all frequent episodes to find the episodal association rules [4, 16, 17]. These approaches produce the entire set of association rules and use significance criteria such as the J-measure for rule ranking to determine the most valuable rules [6]. Often in practice only a subset of rules is of interest. For example, users may only want rules that contain specific events, such as drought events in a weather application. While constraints can be applied as a post-processing step, integrating them into the mining algorithm can dramatically reduce the execution time [21, 23, 26]. Association rules with item constraints were presented in [26]. Similarly, Feng et al. [7] use templates as constraints for inter-transaction associations. In this paper, we use closure as the basis for generating frequent sets in the context of sequential data.
We then generate sequential association rules based on representative association rule approaches, and we integrate constraints into the sequential association rule algorithms. By combining these techniques and applying them to sequential data, our method can be used to address several time series problems. This method is well suited for time series problems that have groupings of events that occur close together in time but occur relatively infrequently over the entire dataset. It is also well suited for multiple time series problems that have periodic occurrences where the signature of one sequence is present in other sequences, even when the multiple time series are not globally correlated. The analysis techniques developed in this work facilitate the evaluation of the temporal associations between episodes of events and the incorporation of this knowledge into decision support systems. We apply this technique to the drought risk management problem. The rest of this paper is organized as follows. A formulation of event sequences and frequent episode sets is presented in Section 2. In Section 3, the theoretical basis for closed sets of episodes is developed. In Section 4, frequent closed episodes are defined and the Gen-FCE algorithm used to find frequent closed episodes is presented. This algorithm incorporates episodal constraints. In Section 5, basic concepts related to association rules and representative association rules are presented. In Section 6 we present representative episodal association rules and the Gen-REAR algorithm. In Section 7 we show how our algorithm efficiently finds rules that are of interest to the drought risk management problem and provide a performance overview. We conclude and outline additional research in Section 8.
2 Events and Episodes
Our overall goal is to analyze event sequences, discover recurrent patterns of events, and then generate sequential association rules. Our approach is based on the concept of representative association rules combined with event constraints. We first present definitions and terminology related to sequential datasets, where time is a key element. Time series data in continuous domains is inherently inexact, due to the unavoidable imprecision of measuring devices, clocking strategies, and natural occurrences [8]. This forces us to work with an approximate version of the data when trying to break it into discrete segments and then cluster these segments into similar groups. Typically, the time series is normalized by using a process described in [8]. After normalizing the dataset, it is discretized by forming subsequences using a sliding window [6]. Using a sliding window of size δ, every normalized time-stamp value x_t is used to compute each of the new sequence values y_{t−δ/2} to y_{t+δ/2}. Thus, the dataset has now been divided into segments, each of size δ. The discretized version of the time series is obtained by using some clustering algorithm and a suitable similarity measure [6]. We consider each cluster identifier as a single event type, and the set of cluster labels as the class of events E. Das et al. [6] present a detailed discussion of appropriate time series clustering methods and similarity measures. This is an iterative process: different clustering methods, similarity measures, and window sizes produce different discretized versions of the same dataset.
Figure 1: Example event sequence (1-month SPI values, Clay Center, Nebraska, January–December 1998).

Example 1 Consider the event sequence of 1-month Standardized Precipitation Index (SPI) values from Clay Center, Nebraska from January to December 1998 shown in Figure 1. SPI values show rainfall deviation from normal for a given location at a given time [19, 20]. For example, July 1998 was extremely wet compared to normal July precipitation, whereas December 1998 was severely dry relative to normal December precipitation. For this application, a sliding window width of 1 month was used, and the data was clustered into 7 clusters, based on the distribution of SPI values. The 7 clusters are: A. Extremely Dry (SPI value ≤ −2.0), B. Severely Dry (SPI value ≤ −1.5), C. Moderately Dry (SPI value ≤ −0.5), D. Normal (−0.5 < SPI value < 0.5), E. Moderately Wet (SPI value ≥ 0.5), F. Severely Wet (SPI value ≥ 1.5), and G. Extremely Wet (SPI value ≥ 2.0). The resulting sequence of cluster identifiers is shown in Figure 2. We refer to the newly formed version of the time series as an event sequence. Formally, an event sequence is a triple (t_B, t_D, S) where t_B is the beginning time, t_D is the ending time, and S is a finite, time-ordered sequence of events [17, 18]. That is, S = (e_{t_B}, e_{t_B+p}, e_{t_B+2p}, ..., e_{t_B+dp} = e_{t_D}), where p is the step size between events, d is the total number of steps in the time interval from time t_B to time t_D, and D = B + dp. Each e_{t_i} is a member of a class of events E, and t_i ≤ t_{i+1} for all i. A sequence of events S includes events from a single class of events E.

Example 2 In Figure 2, the class of events E = {A, B, C, D, E, F, G}, the starting time t_B = 1 (January), the ending time t_D = 12 (December), the step size p = 1, and the number of steps d = 11.
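To make the discretization concrete, the following minimal Python sketch (ours, not the authors' code) maps SPI values to the seven cluster labels of Example 1. The 12 sample SPI values are invented so that the output reproduces the sequence in Figure 2, and since the paper states the dry and wet classes as overlapping thresholds, we interpret them here as nested ranges.

def spi_category(spi):
    """Map an SPI value to one of the seven cluster labels of Example 1.
    The class boundaries are interpreted as nested ranges (our assumption)."""
    if spi <= -2.0:
        return 'A'   # Extremely Dry
    if spi <= -1.5:
        return 'B'   # Severely Dry
    if spi <= -0.5:
        return 'C'   # Moderately Dry
    if spi < 0.5:
        return 'D'   # Normal
    if spi < 1.5:
        return 'E'   # Moderately Wet
    if spi < 2.0:
        return 'F'   # Severely Wet
    return 'G'       # Extremely Wet

# Hypothetical 12-month SPI series, chosen so that the discretized
# output reproduces the event sequence of Figure 2:
spi_1998 = [-0.7, 0.1, 0.8, -0.2, -0.6, 0.3, 2.3, 0.0, -0.9, 0.2, 0.6, -1.6]
print([spi_category(x) for x in spi_1998])
# -> ['C', 'D', 'E', 'D', 'C', 'D', 'G', 'D', 'C', 'D', 'E', 'B']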
Month:  1  2  3  4  5  6  7  8  9  10  11  12
Event:  C  D  E  D  C  D  G  D  C  D   E   B
Figure 2: Example event sequence and two windows of width 6.

A window on an event sequence S is an event subsequence W = {e_{t_j}, ..., e_{t_k}}, where t_B ≤ t_j and t_k ≤ t_D + 1 [18, 17]. The width of the window W is width(W) = t_k − t_j. The set of all windows W on S with width(W) = win is denoted as W(S, win). The width of the window is pre-specified.

Example 3 Figure 2 shows two windows of width 6 on the sequence from the previous example. The window indicated with a solid line, starting at time 3 and ending at time 9, is the subsequence E, D, C, D, G, D. The window indicated with the dashed line, starting at time 4 and ending at time 10, is the subsequence D, C, D, G, D, C.

An episode in an event sequence is a combination of events with a partially specified order [18, 17]. It occurs in a sequence if there are occurrences of the events in an order consistent with the given order, within a given time bound (window width). Formally, an episode P is a pair (V, type), where V is a collection of events. The type of an episode is parallel if no order is specified, and serial if the events of the episode have a fixed order. An episode is injective if no event type occurs more than once in the episode.

Example 4 Consider episodes (a), (b), (c), and (d) in Figure 3. Episode (a) is a serial episode: it occurs in a sequence only if there are events C and D that occur in this order within the width of one window. Episode (b) is a parallel episode: no constraints on the relative order of C and E are given, as long as they appear together in a window. Episode (c) occurs in a sequence if there are occurrences of C and E and these precede the occurrence of B; no constraints on the relative order of C and E are given. Episode (d) occurs in a sequence if there are occurrences of C and E and these precede the occurrence of another C. Episode (d) is similar to episode (c) except that episode (c) is injective, whereas episode (d) is not. Even though episode (c) is injective, in a window where events B, C, and E occur together, there may be multiple events of types B, C, and E.
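Returning to the window convention above, here is a minimal Python sketch (our illustration, not the authors' implementation) that enumerates all windows of a given width over the sequence of Figure 2, following the convention of [18] that the first window contains only the first event and the last window only the last event:

def windows(seq, win):
    """All windows of width `win` over seq (time stamps 1..len(seq)).
    Following [18], a sequence of n points yields n + win - 1 windows:
    the first contains only the first event, the last only the last."""
    n = len(seq)
    return [[seq[t - 1] for t in range(start, start + win) if 1 <= t <= n]
            for start in range(2 - win, n + 1)]

S = list('CDEDCDGDCDEB')      # the event sequence of Figure 2
W = windows(S, 4)
print(len(W))                 # -> 15, the windows w1..w15 of Table 1 below
print(W[0], W[6], W[14])      # -> ['C'] ['D', 'C', 'D', 'G'] ['B']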
[Figure 3 shows four episodes as directed graphs: (a) a serial episode in which C precedes D; (b) a parallel episode containing C and E; (c) an injective episode in which C and E precede B; (d) a non-injective episode in which C and E precede another occurrence of C.]

Figure 3: Example episodes.
3 Closed Sets of Episodes
We extend the work of Mannila et al. [18] to consider closed sets of episodes. We use formal concept analysis as the basis for developing our notion of closed sets [28]. Informally, a concept is a pair of sets: a set of objects (windows or episodes) and a set of features (events) common to all objects in the set.

Definition 1 An episodal data mining context is defined as a triple (W(S, win), E, R), where W(S, win) is the set of all windows of width win defined on the event sequence S, E is a set of episodes in the event sequence S, and R ⊆ W × E.

A data-mining context is a formal definition of a database. When the database is an event sequence, we consider the event sequence S to be the database, W(S, win) to be the records, and E to be the items within the records. For w ∈ W and e ∈ E we write (w, e) ∈ R to mean that the window w contains the episode e.
Example 5 An example of a data mining context for event sequences is shown in Table 1, where an X is placed in the w-th row and e-th column to indicate that (w, e) ∈ R. This example shows parallel episodes of length one with a window width of 4 months, generated from the event sequence in Figure 2. As we slide the window through the data, we get 15 windows w1, ..., w15. By definition, the first window contains only the first point in the sequence, and the last window contains only the last time point [18].

Table 1: Example of a Data Mining Context for Event Sequences

      A   B   C   D   E   F   G
w1            X
w2            X   X
w3            X   X   X
w4            X   X   X
w5            X   X   X
w6            X   X   X
w7            X   X           X
w8            X   X           X
w9            X   X           X
w10           X   X           X
w11           X   X   X
w12       X   X   X   X
w13       X       X   X
w14       X           X
w15       X
Definition 2 Let (W, E, R) be an episodal data mining context, X ⊆ W, and Y ⊆ E. Define the mappings α and β as follows:

β : 2^W → 2^E,  β(X) = {e ∈ E | (w, e) ∈ R for all w ∈ X},
α : 2^E → 2^W,  α(Y) = {w ∈ W | (w, e) ∈ R for all e ∈ Y}.
The mapping β(X) associates with X the set of episodes that are common to all the windows in X. Similarly, the mapping α(Y) associates with Y the set of all windows containing all the episodes in Y. Intuitively, β(X) is the maximum set of episodes shared by all windows in X, and α(Y) is the maximum set of windows possessing all the episodes in Y.¹

Example 6 Consider the event sequence presented in Figure 2 and Table 1, and assume that we are only interested in parallel episodes of length one. Let Y = {D, G}. Then α(Y) = {w7, w8, w9, w10}, and β(α(Y)) = β({w7, w8, w9, w10}) = {C, D, G} ≠ Y. Similarly, let X = {w7, w8, w9}. Then β(X) = {C, D, G}, and α(β(X)) = α({C, D, G}) = {w7, w8, w9, w10} ≠ X. It is easy to see that in general, for any set Y of episodes, β(α(Y)) ≠ Y. This leads to the following definition.

Definition 3 A set of episodes Y that satisfies the condition β(α(Y)) = Y is called a closed set of episodes [28].

Example 7 From the example above, let Y′ = {C, D, G}. Then β(α(Y′)) = Y′. Therefore, Y′ is a closed set of episodes of length one.
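The mappings of Definition 2 can be computed directly from a context. The following small Python sketch (ours; the dictionary context simply encodes Table 1) reproduces Examples 6 and 7:

# Context of Table 1: each window mapped to the set of length-1
# parallel episodes it contains.
context = {
    'w1': {'C'}, 'w2': {'C','D'}, 'w3': {'C','D','E'}, 'w4': {'C','D','E'},
    'w5': {'C','D','E'}, 'w6': {'C','D','E'}, 'w7': {'C','D','G'},
    'w8': {'C','D','G'}, 'w9': {'C','D','G'}, 'w10': {'C','D','G'},
    'w11': {'C','D','E'}, 'w12': {'B','C','D','E'}, 'w13': {'B','D','E'},
    'w14': {'B','E'}, 'w15': {'B'},
}

def alpha(Y):
    """Windows containing every episode in Y."""
    return {w for w, events in context.items() if Y <= events}

def beta(X):
    """Episodes common to every window in X."""
    sets = [context[w] for w in X]
    return set.intersection(*sets) if sets else set()

print(sorted(alpha({'D', 'G'})))        # ['w10', 'w7', 'w8', 'w9'] (lexicographic)
print(sorted(beta(alpha({'D', 'G'}))))  # ['C', 'D', 'G'], i.e. beta(alpha(Y)) != Y
print(beta(alpha({'C', 'D', 'G'})) == {'C', 'D', 'G'})   # True: a closed set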
4 Frequent Closed Episodes
The frequency of an episode is defined as the fraction of windows in which the episode occurs. Given an event sequence S and a window width win, the frequency of an episode P of a given type in S is:

fr(P, S, win) = |{w ∈ W(S, win) : P occurs in w}| / |W(S, win)|
¹ When referring to a set of episodes, all episodes in the set must have the same type (such as parallel or serial), as defined above.
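For instance, a small self-contained sketch (ours) that computes fr for the parallel episode {C, D} over the 15 windows of Table 1; the result 11/15 matches the frequency reported for CD in Example 10 below:

S = list('CDEDCDGDCDEB')   # the event sequence of Figure 2
win = 4
# windows w1..w15 of Table 1, each as the set of events it contains
W = [{S[t - 1] for t in range(i - win + 1, i + 1) if 1 <= t <= len(S)}
     for i in range(1, len(S) + win)]

def fr(P, W):
    """fr(P, S, win) for a parallel episode P: the fraction of
    windows in which every event of P occurs."""
    return sum(1 for w in W if P <= w) / len(W)

print(fr({'C', 'D'}, W))   # -> 0.733..., i.e. 11/15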
1) Generate candidate frequent closed target episodes of length 1 (CFC_{1,B});
2) k = 1;
3) while (CFC_{k,B} ≠ ∅) do
4)    Read the sequence S, one window at a time, and let FCE_{k,B} be the elements in CFC_{k,B} whose closure has not been generated and that are frequent with respect to min_fr;
5)    Generate candidate frequent closed target episodes CFC_{k+1,B} from FCE_{k,B};
6)    k++;
7) end while
8) return ∪_{i=1}^{k−1} {FCE_{i,B}.closure and FCE_{i,B}.frequency};

Figure 4: Gen-FCE algorithm.

Given a frequency threshold min_fr, P is frequent if fr(P, S, win) ≥ min_fr.

Definition 4 A frequent closed set of episodes (FCE) is a closed set of episodes that satisfies the minimum frequency threshold. The closure of an episode set X ⊆ E, denoted by closure(X), is the smallest closed episode set containing X and is equal to the intersection of all frequent closed episode sets containing X.

To generate frequent closed target episodes, we develop an algorithm called Gen-FCE, shown in Figure 4. Gen-FCE is a combination of the Close-FCI algorithm [24], the WINEPI frequent episode algorithm [18], and the Direct algorithm [26]. Gen-FCE generates FCE with respect to a given set of Boolean target constraints B, an event sequence S, a window width win, an episode type, a minimum frequency min_fr, and a window step size p. The Gen-FCE algorithm requires one database pass during each iteration. The data structures used consist of two sets: a set of candidate frequent closed target episodes, CFC, and a set of frequent closed target episodes, FCE. The set CFC_{k,B} stores the candidate frequent closed target episodes of length k (measured in terms of the number of events in the episode) that meet the target Boolean constraints B. FCE_{k,B} stores the frequent closed target episodes of length k that meet the target Boolean constraints B. Each item in CFC_{k,B} and FCE_{k,B} has three components: an episode set component, a closure component, and a frequency component. We start with CFC_{1,B} containing all episodes of length 1 that meet the target constraints. At each iteration k, we extract the episodes from CFC_{k,B} that meet the minimum frequency threshold and whose closure has not been generated, and store them in FCE_{k,B}. This is done by incorporating closure into the WINEPI algorithm from [18]. The set of candidate episodes for the next iteration, CFC_{k+1,B}, is generated from the set FCE_{k,B}. This complex step makes use of the following definitions, properties, and lemma to produce the minimal set of candidate episodes that potentially meet the frequency and target constraints and that are not in the closure of a previously generated episode. Basically, we show that this search space can be pruned by making use of how the episodes are iteratively generated, how closures are computed, and how the constraints are defined.

We first look at how episodes are generated. An episode P′ is a subepisode of P if the graph representing P′ is an induced subgraph of the graph representing P. The episode P is referred to as a superepisode of P′.

Example 8 For example, in Figure 3, episode (b) would be a subepisode of (c) if (c) occurs within a given window width, since (b) is a subgraph of (c). Also, (c) is a superepisode of (b). Episode (b) is also a subepisode of episode (d).

Proposition 4.1 When generating frequent episodes, it is only necessary to test the occurrences of episodes whose subepisodes are frequent [17]. In general, all subepisodes are at least as frequent as the superepisode.
Thus, if a subepisode does not meet the minimum frequency, then a superepisode cannot meet the minimum frequency. For example, in Figure 3, if the parallel episode {C, E} is not frequent, then episode (c) cannot be frequent. After we find a potentially frequent episode, we verify that its closure has not already been generated.

Proposition 4.2 Let X be an episode set of length k and 𝒳 = {X_1, X_2, ..., X_m} be a set of (k − 1)-subsets of X where ∪_{X_i ∈ 𝒳} X_i = X. If ∃ X_i ∈ 𝒳 such that X ⊆ closure(X_i), then closure(X) = closure(X_i) [22].
This proposition means that the episode X results in redundant computations of frequent closed episodes, because closure(X), which is equal to closure(X_i), was already generated. Therefore, X can be pruned from CFC_{k+1,B}. Lastly, we prune episodes that do not meet the target constraints. We incorporate constraints similarly to the Direct algorithm [26]. This approach is known to work well at low minimum supports and in large datasets [26]. The basic idea is to get from the user a Boolean expression of constraints, B, in disjunctive normal form (DNF). That is, B is of the form D_1 ∨ D_2 ∨ ... ∨ D_m, where each disjunct D_i is of the form α_{i1} ∧ α_{i2} ∧ ... ∧ α_{in_i}. Given a class of events E, each element α_{ij} is either e_{ij} or ≠ e_{ij} for some e_{ij} ∈ E. When generating candidate episodes of size k + 1, we use the following lemma from [26].²

Lemma 4.3 For any (k + 1)-episode X which satisfies B, there exists at least one k-subset that satisfies B, unless each D_i which is true on X has exactly k + 1 non-negated elements, where B is the Boolean expression of constraints in DNF.

Using this lemma, we start with FCE_{k,B} and generate candidates for CFC_{k+1,B} by adding single events. We then add candidates for the conjunctive elements of size k + 1 in B. Because this join is an expensive cross-product operation, when there are no constraints we use the candidate generation algorithm from [17]. This algorithm generates candidates by combining pairs of frequent episodes of size k that overlap in the first k − 1 elements to generate candidates of size k + 1.

Example 9 From the event sequence S given in Figure 2, using parallel episodes, let win = 4, min_fr = 4 (i.e., 4 of the 15 windows), and B = C ∨ (D ∧ E). Then FCE_{1,B} = {C}. When generating CFC_{2,B}, we produce episodes of length 2 from FCE_{1,B} and the set of possible events in S, F = {A, B, C, D, E, F, G}. At this point, CFC_{2,B} = {AC, BC, CD, CE, CF, CG}. But since both episodes {D} and {E} are frequent, the target constraint D ∧ E needs to be considered. Thus, episode {DE} must be added to the set of candidates. It is clear from this example that the lemma prunes the candidate sets appropriately for a given set of target constraints.

By combining the candidate generation with pruning when the target constraints are not met, when the frequency threshold is not met, or when the closure was previously generated, we have shown that our Gen-FCE algorithm meets its objectives. That is, it generates the frequent closed episodes FCE with respect to a given set of Boolean target constraints B, an event sequence S, a window width win, an episode type, a minimum frequency min_fr, and a window step size p.

Example 10 Applying Gen-FCE on the event sequence S given in Figure 2 with win = 4, min_fr = 4, no constraints, and parallel injective episodes, there are 15 windows, as shown in Table 1. The following 8 frequent closed episodes are found: FCE = {B, C, D, E, CD, DE, CDE, CDG} with frequencies 4/15, 12/15, 12/15, 8/15, 11/15, 7/15, 6/15, 4/15, respectively. On the other hand, the WINEPI algorithm on the same set of inputs also produced the episodes G, CE, CG, and DG. Each of these episodes is in the closure of one of the previously generated episodes.³

Example 11 Using the same scenario as above, with serial injective episodes, the following 9 frequent closed episode sets are found: FCE = {B, C, D, E, G, CD, DC, DE, CDE} with frequencies 4/15, 12/15, 12/15, 8/15, 4/15, 9/15, 6/15, 6/15, and 4/15, respectively.
On the other hand, the WINEPI algorithm on the same set of inputs also produced the episode CE, which is in the closure of CDE.
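As a cross-check of Example 10, here is a brute-force reference sketch (ours; the actual Gen-FCE builds candidates iteratively with the pruning described above, whereas this sketch simply enumerates all parallel injective episode sets and tests closure directly):

from itertools import combinations

S = list('CDEDCDGDCDEB')   # the event sequence of Figure 2
win, min_count = 4, 4      # window width 4, min_fr = 4/15
# windows w1..w15, each as the set of events it contains (Table 1)
W = [{S[t - 1] for t in range(i - win + 1, i + 1) if 1 <= t <= len(S)}
     for i in range(1, len(S) + win)]

def closure(X):
    """beta(alpha(X)): the intersection of all windows containing X."""
    supporting = [w for w in W if X <= w]
    return frozenset(set.intersection(*supporting)) if supporting else frozenset()

events = sorted(set(S))
fce = {}
for k in range(1, len(events) + 1):
    for X in map(frozenset, combinations(events, k)):
        support = sum(1 for w in W if X <= w)
        if support >= min_count and closure(X) == X:
            fce[X] = support

for X, s in sorted(fce.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(''.join(sorted(X)), f'{s}/15')
# -> B 4/15, C 12/15, D 12/15, E 8/15, CD 11/15, DE 7/15,
#    CDE 6/15, CDG 4/15  (the FCE of Example 10)

Note that G is correctly pruned: its closure is {C, D, G}, so it is absorbed by the closed episode CDG, exactly as described in Example 10.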
5 Representative Association Rules
We use the set of frequent closed episodes to generate the representative episodal association rules. In order to define and understand representative episodal association rules, we first review the concepts of association rules and representative association rules.

² This lemma is rewritten in terms of episodes within event sequences rather than transactions of itemsets.
³ If we redefine the windows to exclude the windows that are incomplete, our algorithm generates an even smaller set of frequent closed episode sets. This may be advantageous in large datasets where the end-point data is relatively unimportant.
The problem of discovering transactional association rules is described formally as follows [1, 2]. Let I = {i_1, i_2, ..., i_m} be a set of m literals, called items. Let D = {t_1, t_2, ..., t_n} be a database of n transactions.⁴ Each transaction t_j ∈ D is a subset of I. Any subset of items X ⊂ I with |X| = k is called a k-itemset. A transaction T ∈ D contains an itemset X ⊂ I if X ⊂ T. The support of an itemset X, denoted sup(X), is the percentage of transactions T in the database D that contain X: sup(X) = |{T ∈ D : X ⊂ T}| / |D|. An itemset is called frequent if its support is greater than or equal to a pre-specified threshold value. Thus, the support of an itemset is similar to our definition of frequency for episodes. An association rule r is a rule of the form X ⇒ Y, where both X and Y are nonempty subsets of I and X ∩ Y = ∅. X is called the antecedent of r and Y its consequent. The support and confidence of the association rule r : X ⇒ Y are denoted by sup(r) and conf(r), respectively, and defined as

sup(r) = sup(X ∪ Y)  and  conf(r) = sup(X ∪ Y) / sup(X).
Support of r : X ⇒ Y is simply a measure of its statistical significance, and confidence of r is a measure of the conditional probability that a transaction contains Y given that it contains X. The task of the association data-mining problem is to find all association rules with support and confidence greater than user-specified threshold values. We use the notation AR(s, c) to denote the set of all association rules with minimum support s and confidence c. The number of association rules is usually huge. Representative association rules (RAR) were introduced in [12] to overcome this problem and to reduce the number of rules presented to a user. The cover of a rule r : X ⇒ Y, denoted by C(r), is the set of association rules that can be generated from r. That is, C(r : X ⇒ Y) = {X ∪ U ⇒ V | U, V ⊆ Y, U ∩ V = ∅, and V ≠ ∅}. An important property of the cover operator stated in [12] is that if an association rule r has support s and confidence c, then every rule r′ ∈ C(r) has support at least s and confidence at least c. This property means that C is a well-defined inference operator for association rules. Using the cover operator, a set of representative association rules with minimum support s and minimum confidence c, RAR(s, c), is defined as follows [12]:

RAR(s, c) = {r ∈ AR(s, c) | ∄ r′ ∈ AR(s, c), r ≠ r′ and r ∈ C(r′)}.

That is, a set of representative association rules is a least set of association rules that cover all the association rules and from which all association rules can be generated. Clearly, AR(s, c) = ∪ {C(r) | r ∈ RAR(s, c)}. Once the set of representative association rules is found, the user may formulate queries about the association rules that are covered (or represented) by a certain rule of interest for given support and confidence values. Let the length of a rule X ⇒ Y be the number of items in X ∪ Y. The following are important properties of RAR [11, 12]:

Proposition 5.1 Let r : X ⇒ Y and r′ : X′ ⇒ Y′ be two different association rules. Then
1. If r is longer than r′, then r ∉ C(r′).
2. If r is shorter than r′, then r ∈ C(r′) iff X ∪ Y ⊂ X′ ∪ Y′ and X ⊇ X′.
3. If r and r′ are of the same length, then r ∈ C(r′) iff X ∪ Y = X′ ∪ Y′ and X ⊃ X′.

⁴ As defined here, each transaction either contains or does not contain an item. The quantity of each item within a transaction is not shown.
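To make the cover operator concrete, here is a minimal Python sketch (ours; rules are represented as pairs of frozensets) that enumerates C(r) for the rule C ⇒ DG:

from itertools import chain, combinations

def cover(X, Y):
    """C(X => Y) = { X∪U => V : U, V ⊆ Y, U∩V = ∅, V ≠ ∅ } [12]."""
    Y = frozenset(Y)
    subsets = list(chain.from_iterable(
        combinations(Y, k) for k in range(len(Y) + 1)))
    rules = set()
    for U in map(frozenset, subsets):
        for V in map(frozenset, subsets):
            if V and not (U & V):              # V nonempty, disjoint from U
                rules.add((frozenset(X) | U, V))
    return rules

for ant, cons in sorted(cover({'C'}, {'D', 'G'}),
                        key=lambda r: (sorted(r[0]), sorted(r[1]))):
    print(''.join(sorted(ant)), '=>', ''.join(sorted(cons)))
# -> C => D, C => DG, C => G, CD => G, CG => D
#    (five rules covered by the single rule C => DG)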
1) Let k be the size of the longest frequent closed episode in FCE;
2) while (k > 1) do
3)    Generate REAR_k using Propositions 5.1–5.3, making sure each generated rule is not covered by a previously generated rule;
4)    k--;
5) end while
6) return REAR;

Figure 5: Gen-REAR algorithm.

Proposition 5.2 Let r : X ⇒ Z\X ∈ AR(s, c) and maxSup = max({sup(Z′) | Z ⊂ Z′ ⊆ I} ∪ {0}). Then r ∈ RAR(s, c) if the following two conditions are satisfied:
i. maxSup < s or maxSup/sup(X) < c.
ii. ∄ X′, ∅ ⊂ X′ ⊂ X, such that X′ ⇒ Z\X′ ∈ AR(s, c).

The first condition guarantees that r is not in the cover of any association rule with length greater than the length of r. The second condition guarantees that r is not in the cover of any association rule that has the same length as r.

Proposition 5.3 Let ∅ ≠ X ⊂ Z ⊂ Z′ ⊆ I and sup(Z) = sup(Z′). Then there is no rule r : X ⇒ Z\X ∈ AR(s, c) such that r ∈ RAR(s, c).

Proposition 5.3 holds because r ∈ C(X ⇒ Z′\X). The above properties led to the development of the algorithms GenAllRepresentatives [12] and FastGenAllRepresentatives [11], and of the frequent-closed-itemset approach [24], for discovering representative association rules based on the Apriori algorithm [2].
6 Representative Episodal Association Rules
We view the database D as an episodal data-mining context (defined in Section 3), and adapt the idea of representative association rules to episodes. In an episodal data mining context, the event sequence S is the database, W(S, win) is the set of windows (records), and E is the set of episodes within the windows (items within the records). We use the concepts developed in the previous section to find the set of representative episodal association rules, denoted by REAR. We use the set of frequent closed episodes FCE produced by the Gen-FCE algorithm to generate the set of representative episodal association rules in the Gen-REAR algorithm shown in Figure 5. Gen-REAR is a modification of Generate-RAR [24] that uses Propositions 5.1–5.3 from Section 5 and the proposition that for an episode set X, sup(X) = sup(closure(X)) [24]. Gen-REAR generates REAR for a given set of frequent closed episodes FCE with respect to a minimum confidence c. The set of frequent closed episodes FCE is separated into sets of equal-length closed episodes, and the length k of the maximal closed episode sets is found. First, the largest rules (of size k) are generated and added to REAR. Next, rules of size (k − 1) are generated and added to REAR, and so on. Finally, representative episodal association rules of size 2 are generated and added to REAR. The generation of episodal association rules of size k is controlled as follows. Let c be the minimum confidence value, Z be a frequent closed episode set of size k, and maxSup be the maximum support of Z′, where Z ⊂ Z′, as in Proposition 5.2. If there is no closed superset of Z, maxSup will be assigned the value zero. If maxSup has the same value as the support of Z, then no representative episodal association rule can be generated from Z. Otherwise, the process of generating representative episodal association rules iteratively looks at combinations of events X within the episode Z. X ⇒ Z\X is a representative episodal association rule if Z.support/X.support ≥ c and maxSup/X.support < c.

Example 12 Running the Gen-REAR algorithm using as input the frequent closed episode sets produced in the previous example, and using a minimum confidence of 5/8, the following 4 representative episodal association rules are generated: REAR = {G → CD, E → CD, C → D, D → C}. For comparison, 12 episodal association rules would be generated if closures and representative association rules were not used. The additional episodal association rules are: {E → C, E → D, G → C, G → D, EC → D, ED → C, CG → D, DG → C}. These additional rules generate no additional information. In this example, representative episodal association rules generated 66% fewer rules, while maintaining the targeted rules of interest. Moreover, if needed, all association rules can be generated from the given set of representative association rules.

Example 13 Using the same scenario as above, with serial injective episodes, the following 6 representative episodal association rules are generated: REAR = {CE → CDE, DE → CDE, DC → DCD, C → CD, D → CD, E → DE}. The antecedent events are left in the consequent, to clarify the order of events. In this example, there is no gain from using closures and representative association rules.

Example 14 Our method also allows us to target episodes. For example, to target precipitation episodes that deviate from normal for the example given in Figures 1 and 2 and Table 1, we use the constraint C ∨ E ∨ G with parallel episodes, win = 4, and min_fr = 4. The following 4 frequent closed episode sets are found: FCE = {C, E, CE, CG} with frequencies 12/15, 8/15, 6/15, 4/15, respectively. Using this FCE and a minimum confidence of 5/8, the following 2 representative episodal association rules are generated: REAR = {G → C, E → C}.

Using our technique on multiple time series while constraining the episodes to a user-specified target set, we can find relationships that occur across the sequences. For example, drought mitigation experts are more interested in how precipitation affects soil moisture than in the relationship between one month’s precipitation and the next. Additionally, they are interested in the relationship of El Niño and La Niña events to precipitation amounts. We accomplish these tasks by individually discretizing and clustering each time series, with respect to the same time granularity. Then, in the Gen-FCE algorithm, we read from each sequence and generate the episodes in FCE from all sequences at the same time. The rules generated by Gen-REAR cover the targeted episodes from the multiple time series. Sample empirical results related to the drought risk management problem are given in the next section.
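To make the rule-generation step concrete, here is a brute-force Python sketch (ours; it checks Proposition 5.2 directly rather than following the iterative Gen-REAR loop, and computes sup(X) as the largest support among frequent closed supersets of X, which is safe for the tests below because any superset with support at least s has its closure in FCE). It reproduces Example 12 from the closed episodes of Example 10:

from itertools import chain, combinations

# Frequent closed parallel episodes of Example 10, with window counts.
fce = {frozenset('B'): 4, frozenset('C'): 12, frozenset('D'): 12,
       frozenset('E'): 8, frozenset('CD'): 11, frozenset('DE'): 7,
       frozenset('CDE'): 6, frozenset('CDG'): 4}
min_count, c = 4, 5 / 8

def sup(X):
    """sup(X) = sup(closure(X)): largest support among closed supersets."""
    return max((s for Z, s in fce.items() if X <= Z), default=0)

def subsets(Z):
    """All nonempty proper subsets of Z."""
    return map(frozenset, chain.from_iterable(
        combinations(Z, k) for k in range(1, len(Z))))

def in_AR(X, Z):
    """Is X => Z\\X in AR(s, c)?"""
    return sup(Z) >= min_count and sup(Z) / sup(X) >= c

rear = []
for Z in fce:
    if len(Z) < 2:
        continue
    # maxSup: largest support of a frequent closed proper superset of Z
    max_sup = max((s for Zp, s in fce.items() if Z < Zp), default=0)
    for X in subsets(Z):
        if not in_AR(X, Z):
            continue
        cond_i = max_sup < min_count or max_sup / sup(X) < c   # Prop. 5.2(i)
        cond_ii = not any(in_AR(Xp, Z) for Xp in subsets(X))   # Prop. 5.2(ii)
        if cond_i and cond_ii:
            rear.append((X, Z - X))

for X, Y in sorted(rear, key=lambda r: (len(r[0] | r[1]), sorted(r[0]))):
    print(''.join(sorted(X)), '=>', ''.join(sorted(Y)))
# -> C => D, D => C, E => CD, G => CD  (the four rules of Example 12)

Restricting Z to closed episode sets is justified by Proposition 5.3: a non-closed Z has a closed superset with the same support, so it generates no representative rules.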
7 Experimental Results
We are in the process of developing an advanced Geospatial Decision Support System (GDSS) to improve the quality and accessibility of drought related data for drought risk management [10]. Our objective is to integrate spatio-temporal knowledge discovery techniques into the GDSS using a combination of data mining techniques applied to geospatial time-series data by: 1) finding relationships between user-specified target episodes and other climatic events, and 2) predicting the target episodes. The REAR approach will be used to meet the first objective. In this paper we validate the effectiveness of the REAR approach to meet this objective and compare it to the WINEPI algorithm [18]. As an experiment, we use the REAR approach to find relationships between drought episodes at the automated weather station in Mead, NE, and other climatic episodes, from 1989-1999. There is a network of agricultural research stations in Nebraska with automated weather stations that can serve as long-term reference sites to search for key patterns and link to climatic events. We use data from a variety of sources:

1. Satellite vegetation data from USGS’s EROS Data Center (US National Oceanic and Atmospheric Administration (NOAA) Advanced Very High Resolution Radiometer (AVHRR) biweekly dataset, 1989-1999),
2. Standardized Precipitation Index (SPI) data from the National Drought Mitigation Center (NDMC),
3. Precipitation and soil moisture data, such as daily rainfall amount and the Soil Moisture Index (SMI) for both corn and grass, from the High Plains Regional Climate Center (HPRCC), and
4. Palmer Drought Severity Index (PDSI) data from the National Climatic Data Center (NCDC).⁵

⁵ Available from: http://www.ncdc.noaa.gov/onlineprod/drought/xmgrg3.html
The data for the satellite and climatic indices are grouped into seven categories, i.e., extremely dry, severely dry, moderately dry, near normal, moderately wet, severely wet, and extremely wet. In this preliminary study, the vegetation conditions are assessed using the Standardized Vegetation Index (SVI) based on the NOAA AVHRR satellite data. The 1-month, 3-month, 6-month, 9-month, and 12-month SPI values are grouped into the same seven categories to show the precipitation intensity relative to normal precipitation for a given location and a given month. The SMI for corn and the SMI for grass were also grouped into the same seven categories to show the soil moisture intensity relative to normal soil moisture values for a given location and a given month for corn and grass, respectively. After normalizing and discretizing each dataset using the seven categories above, we performed experiments focused on finding out whether the method discovers interesting rules from the sequences, and whether the method is robust. We experimented with several different window widths, minimal frequency values, and minimal confidence values, for both parallel and serial episodes.⁶ When using constraints, we specified droughts (the three dry categories in each data source) as our target episodes. The experiments were run on an AMD Athlon 1.3GHz PC with 256 MB main memory, under the Windows 2000 operating system. Algorithms were coded in C++. The 9 datasets were combined into one flat file.
7.1 WINEPI vs. Gen-FCE
Tables 2 and 3 present performance statistics for finding frequent closed episodes in the drought risk management dataset for Mead, NE with various frequency thresholds, using the Gen-FCE and WINEPI algorithms for injective parallel and serial episodes, respectively. The number of frequent closed episodes decreases rapidly as the frequency threshold increases, as shown in Figure 6a.

Table 2: Performance characteristics for parallel episodes with Gen-FCE and WINEPI, Mead, NE drought monitoring database.

Frequency |             Gen-FCE                        |              WINEPI
threshold | Candidates  Freq. Closed  Iter.  Time (s)  | Candidates  Frequent   Iter.  Time (s)
          |             Episodes                       |             Episodes
0.05      |    9541        4335         6       15     |   11034       5808       7        9
0.10      |    2856         889         5        3     |    2874        907       5        1
0.15      |    1152         310         4        0     |    1152        310       4        0
0.20      |     586         138         4        1     |     586        138       4        0
0.25      |     356          70         3        0     |     356         70       3        0
0.30      |     224          41         3        0     |     224         41       3        0
0.35      |     158          23         3        0     |     158         23       3        0
0.40      |     154          15         2        0     |     154         15       2        0
0.45      |      99          10         2        0     |      99         10       2        0
0.50      |      91           8         2        0     |      91          8       2        0
Table 3: Performance characteristics for serial episodes with Gen-FCE and WINEPI, Mead, NE drought monitoring database.

Frequency |             Gen-FCE                        |              WINEPI
threshold | Candidates  Freq. Closed  Iter.  Time (s)  | Candidates  Frequent   Iter.  Time (s)
          |             Episodes                       |             Episodes
0.05      |   17282        3900         6     6932     |   17284       3950       6     6932
0.10      |    4686         628         5      203     |    4687        629       5      205
0.15      |    1704         229         4       10     |    1704        229       4       10
0.20      |     807         102         4        2     |     807        102       4        1
0.25      |     567          58         3        1     |     567         58       3        1
0.30      |     347          35         3        1     |     347         35       3        0
0.35      |     245          15         2        0     |     245         15       2        0
0.40      |     245          15         2        0     |     245         15       2        0
0.45      |     135           9         2        0     |     135          9       2        0
0.50      |     119           8         2        0     |     119          8       2        0
At the lower frequency thresholds, Gen-FCE generates significantly fewer parallel frequent episodes than WINEPI. For example, using a frequency threshold of 0.05, Gen-FCE generates 4335 frequent closed parallel episodes while WINEPI generates 34% more (5808) episodes.

⁶ For all experiments, the window width is 2 months, the frequency threshold is 0.10, and the confidence threshold is 0.70, unless otherwise specified.
Figure 6: Gen-FCE and WINEPI algorithms, Mead, NE drought monitoring dataset.

Because of the nature of serial episodes combined with the sliding window, Gen-FCE and WINEPI generate a nearly equivalent number of serial episodes. For example, using a frequency threshold of 0.05, Gen-FCE generates 3900 frequent closed serial episodes while WINEPI generates 8% more (4218) episodes. When an event leaves a window and the same event reenters the window during the same time stamp, the same parallel episode as in the last time stamp would be produced, whereas a different serial episode would be produced. For serial episodes it is more likely for the closure of an episode to equal the episode itself, as compared with parallel episodes. An example of the processing time required using the Gen-FCE and WINEPI algorithms for producing parallel and serial episodes is shown in Tables 2 and 3, respectively. For this example, the processing times for these two algorithms are almost equal in each case.⁷ WINEPI is an efficient algorithm that takes advantage of the knowledge generated using the sliding window. Increasing the window size considerably increases the frequent episode generation time and the number of frequent episodes, as shown in Figure 6b. At a window width of 4 months, Gen-FCE generates 38575 frequent parallel closed episodes while WINEPI generates 44% more (55554) episodes. Although not shown here, larger window widths produce time savings for the Gen-FCE algorithm.⁸ This is because it becomes more likely that the same events are repeated together in a longer time window. The quality of candidate generation is shown in Table 4 for a parallel episode run with a frequency threshold of 0.10 and a window width of 2 months. The dataset has many frequently occurring events (73% of all the events are frequent). As the size of the episodes grows, the infrequent episodes are gradually weeded out. On dense datasets, the infrequent episodes would be eliminated quickly for both the Gen-FCE and the WINEPI algorithms. Because the climatology datasets have many frequent “normal” weather patterns, the use of constraints to represent target episodes is critical to the drought risk management problem.

⁷ Several C++ Standard Template Library algorithms were used in computing the closures. We plan to rewrite this portion of Gen-FCE to decrease its running time.
⁸ For a window width of 5 months, Gen-FCE generated the parallel episodes for this dataset in 2.6 hours, whereas WINEPI took over 6.3 hours.
Table 4: Number of candidate and frequent episodes during the first 5 iterations with Gen-FCE and WINEPI, Mead, NE drought monitoring database.

Episode | Possible  |        Gen-FCE              |        WINEPI
size    | episodes  | Candidates  Frequent  Match | Candidates  Frequent  Match
1       |       63  |      63        46      73%  |      63        46      73%
2       |     3969  |    1033       276      27%  |    1035       278      27%
3       | 2.5 × 10⁵ |    1233       410      33%  |    1241       418      34%
4       | 1.6 × 10⁷ |     494       148      30%  |     498       152      31%
5       | 9.9 × 10⁸ |      33         9      27%  |      37        13      35%
7.2 Drought Episodes
By nature, droughts occur infrequently. Since drought-monitoring experts are particularly interested in drought episodes, our system must provide these episodes quickly and without the distraction of the other, non-interesting episodes.⁹ Parallel episodes are useful to the drought risk management problem when considering events that occur together but with no order specified. Serial episodes are useful to the drought risk management problem when trying to predict future drought risk considering the current and past weather conditions. Table 5 presents performance statistics for finding frequent closed drought episodes in the drought risk management dataset for Mead, NE with various frequency thresholds using the Gen-FCE algorithm. Constraints are not part of the WINEPI algorithm, so no comparison to WINEPI is provided for drought episodes. Gen-FCE performs extremely well when finding the drought episodes. The number of frequent closed episodes decreases rapidly as the frequency threshold increases, as shown in Figure 7a. For the sample dataset at a frequency threshold of 0.10 and a window width of 2 months, Gen-FCE produces 6 frequent drought parallel episodes while WINEPI produces 2200% more (138) episodes. With the same parameters, Gen-FCE produces 6 frequent drought serial episodes while WINEPI produces 1600% more (102) episodes. An example of the processing time required using the Gen-FCE and WINEPI algorithms for producing drought parallel and serial episodes is shown in Table 5. For these examples, the processing is done in real time. Even when the window width is expanded to 12 months, the running time is still less than 30 seconds. Because we are working with a fraction of the possible number of episodes, our algorithms are extremely efficient. When finding all frequent episodes for the sample dataset using a window width of 5 months, the running time was 1 second for parallel drought episodes, compared with 2.6 hours for Gen-FCE and over 6 hours for WINEPI without constraints. This illustrates the benefits of using closures and constraints when working with the infrequently occurring drought events.

Table 5: Performance characteristics for parallel and serial drought episodes with Gen-FCE, Mead, NE drought monitoring database, window width 2 months.

Frequency |             Parallel                       |              Serial
threshold | Candidates  Freq. Closed  Iter.  Time (s)  | Candidates  Frequent   Iter.  Time (s)
          |             Episodes                       |             Episodes
0.05      |     279          86         4        0     |     525         77       3        2
0.10      |     167          29         3        0     |     335         24       2        1
0.15      |      72          11         2        0     |     153         10       2        1
0.20      |      42           6         2        0     |      93          6       2        0
0.25      |      37           5         2        0     |      83          5       2        0
0.30      |      30           3         2        0     |      69          3       2        0
0.35      |      28           2         2        0     |      65          2       2        0
0.40      |      28           2         1        0     |      63          2       1        0
0.45      |      27           0         1        0     |      63          0       1        0
As the window size increases, so do the frequent episode generation time and the number of frequent episodes, as shown in Figure 7b. When using drought constraints, the increase is at a much slower pace than when using the algorithms without constraints, as shown in Figure 6b.

⁹ Finding relationships with flood occurrences in climatic datasets is also an important climatic monitoring activity. In both cases the extreme values that occur infrequently in the datasets are of interest.
Figure 7: Drought frequent injective parallel and serial episodes found by Gen-FCE for the Mead, NE drought monitoring dataset.

For the sample dataset and a window width of 3 months, Gen-FCE produces 72 frequent drought parallel episodes while WINEPI produces 8711% more (6344) episodes, and Gen-FCE without constraints produces 7517% more (5477) episodes. With the same parameters, Gen-FCE produces 53 frequent drought serial episodes while WINEPI produces 5779% more (3116) episodes and Gen-FCE without constraints produces 5742% more (3096) episodes. The quality of candidate generation is shown in Table 6 for a parallel drought episode run. Because the dataset has many frequently occurring events (73% of all the events are frequent, as shown in Table 4), and we are only interested in a small number of infrequently occurring (drought) events (less than 40% of the total number of events), Gen-FCE efficiently weeds out the uninteresting episodes when using drought constraints. During the first few iterations, hardly any of the candidates turn out to be frequent and meet the drought constraint (9% during the second iteration). Thus, fewer iterations are necessary, and the algorithm completes in a fraction of the time it takes to compute all frequent episodes. Our next step is to find relationships between the frequent episodes.

Table 6: Number of candidate and frequent drought episodes during the first 3 iterations with Gen-FCE, Mead, NE drought monitoring database.

Episode | Possible | Gen-FCE    | Gen-FCE  | Match
size    | episodes | candidates | frequent |
1       |      27  |      27    |    17    |  63%
2       |     729  |     136    |    12    |   9%
3       |   19683  |       4    |     0    |   0%
7.3 REAR vs. WINEPI Association Rules
Figure 8 presents performance statistics for finding association rules in the drought risk management dataset for Mead, NE with various confidence thresholds and window widths, using the Gen-REAR and WINEPI AR algorithms for injective parallel and serial episodes. The number of rules decreases rapidly as the confidence threshold increases, and increases rapidly as the window width widens. In all cases, Gen-REAR produces fewer rules than the WINEPI AR algorithm. Using the Gen-REAR approach, all the rules can be generated if desired, even though the meaning of the additional ARs is captured by the smaller set of REARs. As the confidence level decreases, the Gen-REAR algorithm produces significantly fewer rules than the WINEPI algorithm, as shown in Figure 8a. For the sample dataset at a confidence threshold of 0.20, Gen-REAR produces 1342 parallel episodal rules while WINEPI AR produces 275% more (5038) rules. With the same parameters, Gen-REAR produces 745 serial episodal rules while WINEPI AR produces 207% more (2290) rules. As the window width widens, Gen-REAR overwhelmingly produces fewer rules than the WINEPI algorithm, as shown in Figure 8b. As the window widens, the WINEPI AR algorithm quickly becomes computationally infeasible to use for the drought risk management problem, especially for parallel episodes. For the sample dataset at a window width of 3 months, Gen-REAR produces 7244 parallel episodal rules while WINEPI AR produces 123% more (16159) rules. With the same parameters, Gen-REAR produces 853 serial episodal rules while WINEPI AR produces 133% more (1994) rules.
Figure 8: Association rules for the Mead, NE drought monitoring dataset.
7.4 Drought Rules
Table 7 presents performance statistics for finding drought REARs in the drought risk management dataset for Mead, NE with various confidence thresholds using the Gen-REAR algorithm. Constraints are not part of the WINEPI AR algorithm, so no comparison to WINEPI AR is provided for drought episodes.
Table 7: Number of rules and rule generation time for drought parallel and serial episodes with Gen-REAR, Mead, NE drought monitoring database.

Confidence |   Parallel Episodes     |    Serial Episodes
threshold  | Distinct   Rule gen.    | Distinct   Rule gen.
           | rules      time (s)     | rules      time (s)
0.20       |    24          0        |    14          0
0.25       |    24          0        |    12          0
0.30       |    19          0        |     9          0
0.35       |    13          0        |     7          0
0.40       |    10          0        |     6          0
0.45       |     8          0        |     5          0
0.50       |     8          0        |     4          0
0.55       |     5          0        |     2          0
0.60       |     4          0        |     2          0
0.65       |     2          0        |     2          0
0.70       |     2          0        |     2          0
0.75       |     1          0        |     1          0
Gen-REAR performs extremely well when finding drought REARs, as illustrated in Figure 9. The number of REARs decreases rapidly as the confidence threshold increases, as shown in Figure 9a. For the sample dataset at a confidence threshold of 0.20 and a window width of 2 months, Gen-REAR produces 24 drought parallel episodal rules while WINEPI AR produces 20892% more (5038) rules and Gen-REAR without constraints produces 5492% more (1342) rules. With the same parameters, Gen-REAR produces 14 drought serial episodal rules while WINEPI AR produces 16257% more (2290) rules and Gen-REAR without constraints produces 5364% more (765) rules. The number of REARs increases as the window width increases, as shown in Figure 9b. For the sample dataset at a window width of 3 months, Gen-REAR produces 30 parallel drought episodal rules while WINEPI AR produces 53763% more (16159) rules and Gen-REAR without constraints produces 24047% more (7244) rules. With the same parameters, Gen-REAR produces 8 serial drought episodal rules while WINEPI AR produces 24825% more (1994) rules and Gen-REAR without constraints produces 10563% more (853) rules. The savings are obvious. The Gen-REAR algorithm finds the drought REARs for all reasonable window widths and confidence levels on the Mead, NE drought risk management dataset in less than 30 seconds.

Figure 9: Drought REARs generated with Gen-REAR, Mead, NE drought monitoring dataset.
8 Conclusion
This paper presents Gen-REAR, a new approach for generating representative episodal association rules. We also presented Gen-FCE, a new approach used to generate the frequent closed episode sets that conform to user-specified constraints. Our approach results in a large reduction in the input size for generating representative episodal association rules for targeted episodes, while retaining the ability to generate the entire set of association rules. We also studied the gain in efficiency of generating targeted representative episodal association rules as compared to the traditional WINEPI algorithm. Although the Gen-FCE algorithm uses the Direct algorithm [26] for handling constraints, other constraint algorithms could be used. Also, the Gen-REAR approach can be combined with interestingness measures to further prune the non-interesting rules. We used a multiple time series drought risk management problem to demonstrate the method’s application to complex problems. The goal of this work was to demonstrate the new approaches rather than to solve the given application. As demonstrated by the experiments, our method efficiently finds relationships between climatic episodes and droughts by using constraints, closures, and representative episodal association rules. As the window width grows or as the confidence threshold decreases, our approach outperforms WINEPI. Clearly, the results produced by these methods need to be coupled with human interpretation of the rules and an interactive approach to allow for iterative changes in the exploration process. Other problem domains could also benefit from this approach, especially when there are groupings of events that occur close together in time but occur relatively infrequently over the entire dataset. Additional suitable problem domains are those in which the entire set of multiple time series is not correlated, but there are periodic occurrences when the signature of one sequence is present in other sequences. Currently, there is no commercial product that addresses these types of problems.
For future work, we plan to extend Gen-FCE to handle a user-defined time lag. We plan to explore the usefulness of serial episodes to predict future drought risk based on the current and past weather conditions, and to expand the methods to consider the spatial extent of the relationships. Additionally, we are incorporating these approaches into an advanced geospatial decision support system for drought risk management.
References

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD 1993 International Conference on Management of Data [SIGMOD 93], pages 207–216, Washington, D.C., 1993.
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th VLDB Conference, pages 487–499, Santiago, Chile, 1994.
[3] R. Bayardo, R. Agrawal, and D. Gunopulos. Constraint-based rule mining in large, dense databases. In Proceedings of ICDE-99, 1999.
[4] C. Bettini, X. S. Wang, and S. Jajodia. Discovering frequent event patterns with multiple granularities in time sequences. IEEE Transactions on Knowledge and Data Engineering, 10(2):222–237, March 1998.
[5] S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 255–264, 1997.
[6] G. Das, K.-I. Lin, H. Mannila, G. Ranganathan, and P. Smyth. Rule discovery from time series. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining [KDD 98], pages 16–22, New York, NY, August 1998.
[7] L. Feng, H. Lu, J. X. Yu, and J. Han. Mining inter-transaction associations with templates. In Proceedings of the 1999 International Conference on Information and Knowledge Management [CIKM 99], Kansas City, Missouri, USA, November 1999.
[8] D. Q. Goldin and P. C. Kanellakis. On similarity queries for time-series data: Constraint specification and implementation. In Proceedings of the 1995 International Conference on the Principles and Practice of Constraint Programming, pages 137–153, Marseilles, France, September 1995.
[9] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM-SIGMOD International Conference on Management of Data, Dallas, Texas, USA, May 2000.
[10] S. K. Harms, S. Goddard, S. E. Reichenbach, W. J. Waltman, and T. Tadesse. Data mining in a geospatial decision support system for drought risk management. In Proceedings of the 2001 National Conference on Digital Government Research, pages 9–16, Los Angeles, California, USA, May 2001.
[11] M. Kryszkiewicz. Fast discovery of representative association rules. In Lecture Notes in Artificial Intelligence, volume 1424, pages 214–221. Proceedings of RSCTC 98, Springer-Verlag, 1998.
[12] M. Kryszkiewicz. Representative association rules. In Lecture Notes in Artificial Intelligence, volume 1394, pages 198–209. Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining [PAKDD 98], Springer-Verlag, 1998.
[13] R. T. Ng, L. V. S. Lakshmanan, and J. Han. Exploratory mining and pruning optimization of constrained association rules. In Proceedings of the ACM SIGMOD-98, 1998.
[14] B. Liu, W. Hsu, and Y. Ma. Pruning and summarizing the discovered associations. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining [KDD 99], San Diego, CA, USA, August 15-18, 1999.
[15] H. Mannila and P. Ronkainen. Similarity of event sequences (extended abstract), 1997.
[16] H. Mannila and H. Toivonen. Discovering generalized episodes using minimal occurrences. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining [KDD 96], pages 146–151, Portland, Oregon, August 1996.
[17] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering frequent episodes in sequences. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining [KDD 95], pages 210–215, Montreal, Canada, August 1995.
[18] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Technical Report C-1997-15, Department of Computer Science, University of Helsinki, Finland, 1997.
[19] T. B. McKee, N. J. Doesken, and J. Kleist. The relationship of drought frequency and duration to time scales. In Proceedings of the 8th Conference on Applied Climatology, pages 179–184, Boston, MA, January 1993. American Meteorological Society.
[20] T. B. McKee, N. J. Doesken, and J. Kleist. Drought monitoring with multiple time scales. In Proceedings of the 9th Conference on Applied Climatology, pages 233–236, Boston, MA, January 1995. American Meteorological Society.
[21] R. Ng, L. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, Seattle, Washington, USA, June 1998.
[22] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient mining of association rules using closed itemset lattices. Information Systems, 24:25–46, 1999.
[23] J. Pei and J. Han. Can we push more constraints into frequent pattern mining? In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining [KDD 2000], pages 350–354, Boston, MA, USA, August 20-23, 2000.
[24] J. Saquer and J. S. Deogun. Using closed itemsets for discovering representative association rules. In Proceedings of the Twelfth International Symposium on Methodologies for Intelligent Systems [ISMIS 2000], Charlotte, NC, October 11-14, 2000.
[25] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995.
[26] R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining [KDD 97], pages 67–73, 1997.
[27] H. Toivonen. Sampling large databases for association rules. In Proceedings of the 22nd VLDB Conference, pages 134–145, Bombay, India, 1996.
[28] R. Wille. Restructuring lattice theory: an approach based on hierarchies of concepts. In I. Rival, editor, Ordered Sets, pages 445–470. Reidel, Dordrecht-Boston, 1982.
[29] M. Zaki. Generating non-redundant association rules. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining [KDD 2000], pages 34–43, Boston, MA, USA, August 20-23, 2000.
[30] M. Zaki and C.-J. Hsiao. CHARM: An efficient algorithm for closed association rule mining. Technical Report 99-10, Department of Computer Science, Rensselaer Polytechnic Institute, December 1999.