Knowl Inf Syst (2010) 23:73–98 DOI 10.1007/s10115-009-0207-1 REGULAR PAPER
Mining dynamic association rules with comments Bin Shen · Min Yao · Zhaohui Wu · Yunjun Gao
Received: 22 April 2008 / Revised: 22 February 2009 / Accepted: 8 March 2009 / Published online: 24 April 2009 © Springer-Verlag London Limited 2009
Abstract In this paper, we study a new problem of mining dynamic association rules with comments (DAR-C for short). A DAR-C contains not only the rule itself, but also comments that specify when to apply the rule. To formalize this problem, we first present an expression method for candidate effective time slots, and then propose several definitions concerning DAR-C. Subsequently, two algorithms, namely ITS2 and EFP-Growth2, are developed for mining DAR-C. In particular, ITS2 is an improved two-stage dynamic association rule mining algorithm, while EFP-Growth2 is based on the EFP-tree structure and is suitable for mining high-density mass data. Extensive experimental results demonstrate the efficiency and scalability of the two proposed algorithms (i.e., ITS2 and EFP-Growth2) on DAR-C mining tasks, and their practicability on a real retail dataset.

Keywords Dynamic association rule · Comment · Support vector · Confidence vector · Mining algorithm

B. Shen · M. Yao (B) · Z. Wu · Y. Gao
College of Computer Science and Technology, Zhejiang University, 310027 Hangzhou, China
e-mail: [email protected]

B. Shen
Ningbo Institute of Technology, Zhejiang University, 315100 Ningbo, China
e-mail: [email protected]; [email protected]

Y. Gao
School of Information Systems, Singapore Management University, Singapore 178902, Singapore

1 Introduction

Association rule mining, which was first proposed by Agrawal et al. [1], is one of the most important topics in the area of data mining, and has many successful applications, especially in the analysis of consumer market-basket data. The problem of mining association rules can be expressed as follows [2]. Let I = {I1, I2, . . . , Im} be a set of items, and D be a set of transactions in which each transaction T contains a set of items. An association rule is an
implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. This implication means that a data record containing the item set X is likely to contain the item set Y as well. The association rule X ⇒ Y holds in the database D with confidence c if c% of the transactions in D that contain X also contain Y. The association rule X ⇒ Y has support s if s% of the transactions in D contain X ∪ Y. Thus, an association rule with support s and confidence c can be denoted as "X ⇒ Y [s, c]". So far, many association rule mining algorithms have been proposed in the database literature [2–6]. Most of them assume that association rules are static and consistently effective, but in reality the characteristics of a database, as well as the corresponding association rules, may change over time. For example, when analyzing the transaction data of a supermarket, we may obtain an association rule "Cigarettes ⇒ Gifts [s = 2.0%, c = 80.0%]". This rule indicates that 80.0% of the customers who buy cigarettes also buy gifts, and that 2.0% of the transactions in the database support the rule. Traditional methods take it for granted that this rule is effective all year round. However, if we study the relevant data carefully, we may find that the rule is very effective during holidays (e.g., the Spring Festival), whereas at other times its support is low. Therefore, using this rule to guide daily sales on ordinary days is of little significance, but during holidays it is valuable. This example illustrates that the validity of an association rule may change over time. In order to describe this dynamic property, Liu et al. [7] proposed mining dynamic association rules, which adopt a support vector SV and a confidence vector CV to describe the dynamic characteristics of the rules. The basic idea of their method is as follows.
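For concreteness, the support and confidence defined above can be computed with a minimal sketch (not from the paper; the function names and the toy transactions are our own illustration):

```python
# Minimal sketch of support s and confidence c of a rule X => Y.

def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, X, Y):
    """P(Y | X): among transactions containing X, fraction also containing Y."""
    X, Y = set(X), set(Y)
    with_x = [t for t in transactions if X <= set(t)]
    if not with_x:
        return 0.0
    return sum(1 for t in with_x if Y <= set(t)) / len(with_x)

transactions = [
    {"cigarettes", "gifts"},
    {"cigarettes", "gifts", "beer"},
    {"beer"},
    {"cigarettes"},
]
s = support(transactions, {"cigarettes", "gifts"})       # 2/4 = 0.5
c = confidence(transactions, {"cigarettes"}, {"gifts"})  # 2/3
```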
The entire database is first divided into a series of consecutive and disjoint subsets according to time, and then a support vector and a confidence vector are generated for each rule, containing the support and the confidence of the rule in each subset, respectively. From the SVs and CVs, detailed information on the evolution of the rules over the period in which the data was collected can be obtained using histogram analysis and time series analysis. Taking one year's transaction data from the above supermarket as an example, we can divide the transaction data into 12 subsets according to months. Applying dynamic association rule mining, we may discover a rule as follows: "Cigarettes ⇒ Gifts" (s = 2.0%, c = 80.0%, SV = [5.2%, 3.8%, 0.8%, 1.2%, 2.2%, 0.6%, 0.5%, 1.0%, 0.9%, 3.0%, 0.5%, 2.3%], CV = [81.0%, 86.1%, 30.6%, 50.0%, 46.8%, 30.4%, 26.9%, 63.1%, 76.2%, 89.7%, 72.2%, 88.9%]). This means that the rule "Cigarettes ⇒ Gifts" holds with confidence 80.0% and support 2.0% in the whole dataset, while in each month of the year its supports are 5.2%, 3.8%, . . . , 2.3%, and its confidences are 81.0%, 86.1%, . . . , 88.9%, respectively. From this dynamic association rule, we can see the time-varying property of the rule. Although dynamic association rules with support and confidence vectors can reflect the dynamic property of rules to a certain extent, there are still several limitations. First, it is difficult for decision-makers to apply these rules: even though SVs and CVs are contained in dynamic association rules, they do not tell decision-makers how to act on this type of rule. Second, the division of subsets is restricted to be consecutive and disjoint, which is not flexible enough for real-life applications. In real applications, decision-makers hope that dynamic association rules come with the relevant usage comments associated with the rules.
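The partition-and-count idea above can be sketched as follows, assuming a simple (month, item-set) data layout; this is an illustration, not the authors' code:

```python
# Sketch of deriving a support vector SV and confidence vector CV for a
# rule X => Y by splitting time-stamped transactions into monthly subsets.
from collections import defaultdict

def sv_cv(stamped, X, Y):
    """stamped: list of (month, itemset). Returns (SV, CV) keyed by month."""
    X = set(X)
    XY = X | set(Y)
    buckets = defaultdict(list)
    for month, items in stamped:
        buckets[month].append(set(items))
    sv, cv = {}, {}
    for month, ts in sorted(buckets.items()):
        n_xy = sum(1 for t in ts if XY <= t)   # transactions with X and Y
        n_x = sum(1 for t in ts if X <= t)     # transactions with X
        sv[month] = n_xy / len(ts)
        cv[month] = n_xy / n_x if n_x else 0.0
    return sv, cv

data = [(1, {"cigarettes", "gifts"}), (1, {"beer"}),
        (2, {"cigarettes"}), (2, {"cigarettes", "gifts"})]
sv, cv = sv_cv(data, {"cigarettes"}, {"gifts"})
# sv == {1: 0.5, 2: 0.5}; cv == {1: 1.0, 2: 0.5}
```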
When the validity of a rule changes with
time, a more valuable dynamic association rule should provide information about when to apply it. Take the above supermarket case as an example again: since the efficiency of the rule "Cigarettes ⇒ Gifts" changes dynamically, the decision-maker would like to know when it is suitable to apply this rule. The distribution of the transaction data shows that during the Spring Festival and the National Day of China the rule performs best, followed by the time intervals 8 p.m. to 9 p.m. and 9 p.m. to 10 p.m. each day. In this case, a more valuable dynamic association rule can be written as follows:

Cigarettes ⇒ Gifts [s = 2.0%, c = 80.0%]
Usage comments:

Effective time slots          Support (%)   Confidence (%)
The Spring Festival           20.0          89.0
The National Day of China     13.0          82.0
Per day 8 p.m. to 9 p.m.      12.0          84.0
Per day 9 p.m. to 10 p.m.     11.8          82.4
From the above usage comments of "Cigarettes ⇒ Gifts", we can see that the Spring Festival is the most effective time interval for this rule, and the National Day is in second place. The time intervals 8 p.m. to 9 p.m. and 9 p.m. to 10 p.m. each day rank third and fourth, respectively. Only the most effective time intervals are listed; the ineffective intervals are omitted from the usage comments. A dynamic association rule with usage comments indicates that the rule has different validity at different times: in the whole database it has a support of 2.0% and a confidence of 80.0%, while for the Spring Festival, the National Day of China, 8 p.m. to 9 p.m. and 9 p.m. to 10 p.m. each day it is particularly good, with supports of 20.0%, 13.0%, 12.0% and 11.8%, respectively. Dynamic association rules of this kind better satisfy the actual needs of decision-makers. To mine this new class of association rules, in this paper we explore the problem of mining dynamic association rules with comments (DAR-C for short), and present the formal definition of DAR-C as well as its corresponding solution. The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 gives the expression method for candidate effective time slots and formally defines DAR-C. Section 4 presents the algorithms (ITS2 and EFP-Growth2) for mining DAR-C. Section 5 conducts the experimental evaluation and reports our findings. Finally, Sect. 6 concludes the paper.
2 Related work

The problem of mining association rules in transaction databases, introduced by Agrawal et al. [1], has been studied extensively, and some of this work takes the time factor into consideration. Agrawal et al. [8] first proposed the problem of mining sequential patterns in 1995. The main goal was to find sequential patterns like "5% of customers bought 'Foundation' and 'Ringworld' in one transaction, followed by 'Second Foundation' in the subsequent transaction". Subsequently, a series of works has addressed mining sequential patterns [9–11].
Apart from sequential patterns, various time-related rules or patterns, such as cyclic association rules [12], episode rules [13], segment-wise periodic patterns [14], follow-up correlation patterns [5], calendar-based patterns [15,16], and inter-transaction rules [17], have been proposed and studied in depth. These works extend the traditional form of association rules and can model the underlying sequential relationships between item sets well. Although this existing work takes the time factor into account, it still assumes that the data characteristics and the underlying associations hidden in the data are stable over time, and thus these rules or patterns are also static. In the data mining literature, some work deals with data collected in different time slots, including maintenance of discovered rules [3], active data mining [18], measuring changes between two datasets [19] and mining changes in association rules [20–23]. Maintenance of discovered rules [3] tries to solve the problem of how to incrementally update discovered association rules when new transaction data are added to the original transaction database, or old transaction data are removed from it. The incremental updating is triggered by additions or deletions of transactions in the transaction database. Active data mining [18] mines rules from each dataset collected in different time slots. These rules and their parameters (e.g., support and confidence) are added to a rule base. Users can then specify a history pattern in a trigger, which is fired when such a trend appears. This scheme is designed for representing and querying the history of discovered association rules' parameters. In [19], a framework for quantifying the difference between two datasets was developed.
Although these techniques can be used to track the variation of association rules over time, none of them aims at providing dynamic association rules with comments. Several studies have addressed mining changes in association rules [20–23]. Specifically, Dong et al. [21] focused on mining border descriptions of emerging patterns from dataset pairs. Emerging patterns are defined as patterns whose support increases from one dataset to the other by a sufficiently large ratio, and borders are used as concise descriptions of large collections of such emerging patterns. This technique can mine emerging patterns well, but it is limited to finding novel patterns by comparing a pair of datasets. Au and Chan [22] considered the changes in associations and proposed to mine changes in association rules, trying to predict how the association rules would change over time using a fuzzy approach. Au and Chan [23] also employed fuzzy sets and residual analysis to reveal the regularities governing how rules change in different time intervals. Nevertheless, these works are limited to a narrow aspect of prediction. The work in [24] investigated mining changes in the context of decision tree classification. Motivated by this, Liu et al. [20] tried to find rule changes that occurred in a new time period when compared with an old time period. They proposed a technique that uses the chi-square test to identify the set of fundamental changes in two given datasets collected from two time slots. Unlike these existing approaches, our approach is developed for mining dynamic association rules with comments. Liu and Rong [7] also considered the problem that the efficiency of the associations hidden in the data changes over time, and proposed the concept of mining dynamic association rules. A dynamic association rule contains not only the support and the confidence of the rule, but also a support vector and a confidence vector. Recently, Shen et al.
[25] analyzed the shortcomings of the definition of dynamic association rules in [7], and presented a new definition of dynamic association rules together with corresponding mining algorithms, i.e., the ITS algorithm and the EFP-Growth algorithm. Our work in this paper addresses the limitation of dynamic association rules, and proposes to mine a new type of association rule, called dynamic association rules with comments.
3 Problem formulation

Our DAR-C mining aims at addressing the following problem: given a set of time-stamped transactions and a set of candidate effective time slots, find all dynamic association rules with their usage comments, where the comments indicate when to apply each rule. In this section, we first present the expression method for candidate effective time slots, followed by several definitions of DAR-C based on candidate effective time slots.

3.1 The expression method of candidate effective time slots

Calendar algebra [15,16,26,27] has been studied in temporal data mining in order to provide a formal specification for constructing temporal expressions in terms of closely related granularities (e.g., year, month, day). To formulate the candidate effective time slots, the calendar schema and simple calendar-based pattern proposed by Li et al. [15,16] are adopted in this work, and the corresponding concepts and operations are introduced to help users express their desired candidate effective time slots. A calendar schema is determined by a hierarchy of calendar concepts. For example, a calendar schema can be (year, month, day). If we further give a domain for each time granularity, a calendar schema can be (year: [2002, . . . , 2006], month: [1, . . . , 12], day: [1, . . . , 31]). Based on the calendar schema, a set of simple calendar-based patterns can be defined. For instance, given the above calendar schema, a simple calendar-based pattern can be "the first day of the tenth month every year", denoted ⟨∗, 10, 1⟩. In this way, a simple calendar-based pattern in a specified calendar schema is formed by setting some of the calendar units to specific values while leaving the other units "free" (read as "every" and denoted "∗"). In the following, we formally define the calendar schema and the simple calendar-based pattern.
Definition 1 (Calendar schema) [16] A calendar schema is a relational schema (in the context of relational databases) with a valid constraint δ:

R = (Gn : Dn, Gn−1 : Dn−1, . . . , G1 : D1)    (1)

where Gi (i = 1, . . . , n) is a time granularity (e.g., year, month, week, day) and Di (i = 1, . . . , n) is a finite subset of positive integers. The valid constraint is a Boolean function δ : Dn × Dn−1 × · · · × D1 → {0, 1} specifying which combinations of the values in Dn × Dn−1 × · · · × D1 are valid. As an example, we may have a calendar schema R = (year: [2002, . . . , 2006], month: [1, . . . , 12], day: [1, . . . , 31]) with a valid constraint that evaluates whether a combination ⟨year, month, day⟩ is valid; e.g., ⟨2006, 10, 1⟩ is valid, but ⟨2006, 2, 31⟩ is not.

Definition 2 (Calendar pattern) [16] Given a calendar schema R = (Gn : Dn, Gn−1 : Dn−1, . . . , G1 : D1), a simple calendar-based pattern (or calendar pattern for short) in R is defined as ⟨dn, dn−1, . . . , d1⟩, in which each di (i = 1, . . . , n) satisfies di ⊆ Di or di = "∗" (a wild-card symbol). If di ⊆ Di, di takes values in Di for the time granularity designated by Gi. If di is the wild-card symbol "∗", di designates the entire domain of Gi. In fact, each calendar pattern represents the time intervals denoted by a set of valid tuples in Dn × Dn−1 × · · · × D1. For example, given the calendar schema R = (year: [2002, . . . , 2006], month: [1, . . . , 12], day: [1, . . . , 31]), the calendar pattern ⟨∗, 10, 1⟩ represents the time interval "the first day of the tenth month every year" in R.
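As an illustration, a calendar pattern under this definition can be checked against a timestamp in a few lines of code; the tuple encoding below (most significant granularity first, "*" for the wild card) is our own assumption, not the paper's notation:

```python
# Sketch of matching a timestamp against a simple calendar pattern
# <d_n, ..., d_1> from Definition 2: each component is either the
# wild card "*" or a set of allowed values for that granularity.

def matches(pattern, timestamp):
    """pattern: tuple like ("*", {10}, {1}); timestamp: (year, month, day)."""
    return all(d == "*" or v in d for d, v in zip(pattern, timestamp))

pattern = ("*", {10}, {1})  # <*, 10, 1>: October 1 of every year
print(matches(pattern, (2006, 10, 1)))   # True
print(matches(pattern, (2006, 2, 28)))   # False
```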
Actually, the desired time intervals may not be expressible by a single calendar pattern. In this case, we can combine calendar patterns using the union operation. The union of calendar patterns e1 ∪ e2 ∪ · · · ∪ en represents the union of the time intervals denoted by e1, e2, . . . , en−1 and en, respectively. For instance, "the Thanksgiving Day of 1997 and 1998" can be denoted as ⟨1997, 11, 27⟩ ∪ ⟨1998, 11, 30⟩, meaning the union of the time intervals ⟨1997, 11, 27⟩ and ⟨1998, 11, 30⟩. Consequently, users are able to specify user-defined candidate effective time slots conveniently. As an example, given a set of time-stamped transactions from the years 1997 and 1998, assume users want to set the candidate effective time slots "Thanksgiving Day", "the New Year", "the Christmas Day", "the Valentine Day", "every January", "every February", . . ., "every December". In this case, we can set the calendar schema R to (year: [1997, 1998], month: [1, . . . , 12], day: [1, . . . , 31]), and set the candidate effective time slots to ⟨1997, 11, 27⟩ ∪ ⟨1998, 11, 30⟩, ⟨∗, 1, 1⟩, ⟨∗, 12, {24, 25}⟩, ⟨∗, 2, 14⟩, ⟨∗, 1, ∗⟩, . . . , ⟨∗, 11, ∗⟩ and ⟨∗, 12, ∗⟩ in R.

3.2 Dynamic association rules with comments

3.2.1 Rule definition

Based on the above expression method for candidate effective time slots, we introduce the definitions of DAR-C in this subsection. Let I = {i1, i2, . . . , im} be an item-set, D be the dataset related to a task, and |D| be the cardinality of D. Each transaction T in D is a set of items, denoted T ⊆ I. If X and Y are item-sets with X ⊂ I, Y ⊂ I and X ∩ Y = ∅, then an association rule can be expressed as X ⇒ Y with support s and confidence c, where s = PD(X ∪ Y) and c = PD(Y|X). Each transaction is associated with a timestamp TID, which indicates the time at which the transaction occurred. The entire transaction dataset D is collected in the time interval t.
Given a calendar schema R and a set of calendar patterns Se = {e1, e2, . . . , en} in R, we call each ei (i = 1, . . . , n) a candidate effective time slot and Se a set of candidate effective time slots. The set of transactions covered by ei is denoted T[ei]. The set of transactions in T[ei] that contain item-set X is denoted T[X, ei], and its cardinality is |T[X, ei]|. For example, consider a calendar schema R = (year: [2005, 2006], month: [1, . . . , 12], day: [1, . . . , 31]) and a set of candidate effective time slots Se in R containing ei (i = 1, . . . , 45). The candidate effective time slots e1, . . . , e45 are ⟨2005, ∗, ∗⟩, ⟨2006, ∗, ∗⟩, ⟨∗, 1, ∗⟩, . . . , ⟨∗, 12, ∗⟩, ⟨∗, ∗, 1⟩, . . . , ⟨∗, ∗, 31⟩, indicating "year 2005", "year 2006", "every January", . . ., "every December", "the first day of every month", . . ., "the 31st day of every month", respectively. Then the set of transactions occurring in e1 = ⟨2005, ∗, ∗⟩ is denoted T[e1]. The set of transactions containing item-set X = {Cigarettes, Gifts} in T[e1] is denoted T[X, e1], and its cardinality is |T[X, e1]|. Using the above notation, we can define the supports and the confidences of rules in candidate effective time slots, as presented in Definitions 3 and 4, respectively.

Definition 3 Given association rule r : X ⇒ Y, a calendar schema R and a candidate effective time slot e in R, the support of r in e is defined as:

Sr,e = |T[X ∪ Y, e]| / |T[e]|    (2)
The rule r : X ⇒ Y has support Sr,e in e if Sr,e% of the transactions occurring in e contain X ∪ Y. For example, for the rule "r : Cigarettes ⇒ Gifts", the calendar schema R = (year: [2005, 2006], month: [1, . . . , 12], day: [1, . . . , 31]) and r's candidate effective time slot
e = ⟨2005, ∗, ∗⟩ in R, the support Sr,e of r in e can be computed as the number of transactions containing item-set {Cigarettes, Gifts} in T[e] divided by the number of transactions in T[e], which means that Sr,e% of the transactions occurring in e support item-set {Cigarettes, Gifts}.

Definition 4 Given association rule r : X ⇒ Y, a calendar schema R and a candidate effective time slot e in R, the confidence of r in e is defined as:

Cr,e = |T[X ∪ Y, e]| / |T[X, e]|    (3)
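Equations (2) and (3) can be sketched in Python as follows; the timestamp encoding, helper names and toy data are assumptions for illustration only:

```python
# Sketch of S_{r,e} and C_{r,e}: support and confidence of a rule
# X => Y restricted to the transactions covered by time slot e.
# Each transaction carries a (year, month, day) timestamp.

def matches(pattern, ts):
    return all(d == "*" or v in d for d, v in zip(pattern, ts))

def slot_support_confidence(transactions, X, Y, e):
    X = set(X)
    XY = X | set(Y)
    covered = [items for ts, items in transactions if matches(e, ts)]  # T[e]
    n_xy = sum(1 for t in covered if XY <= t)   # |T[X u Y, e]|
    n_x = sum(1 for t in covered if X <= t)     # |T[X, e]|
    s = n_xy / len(covered) if covered else 0.0
    c = n_xy / n_x if n_x else 0.0
    return s, c

db = [((2005, 3, 1), {"cigarettes", "gifts"}),
      ((2005, 3, 2), {"cigarettes"}),
      ((2006, 3, 1), {"beer"})]
e = ({2005}, "*", "*")  # <2005, *, *>
s, c = slot_support_confidence(db, {"cigarettes"}, {"gifts"}, e)
# two transactions are covered by e; s = 0.5, c = 0.5
```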
The rule r : X ⇒ Y holds in the transaction set T[e] with confidence Cr,e if Cr,e% of the transactions in T[e] that contain X also contain Y. Taking the rule "r : Cigarettes ⇒ Gifts", the calendar schema R = (year: [2005, 2006], month: [1, . . . , 12], day: [1, . . . , 31]) and the candidate effective time slot e = ⟨2005, ∗, ∗⟩ as an example again, the confidence Cr,e of r in e can be computed as the number of transactions supporting item-set {Cigarettes, Gifts} in T[e] divided by the number of transactions containing item-set {Cigarettes} in T[e]. It indicates that Cr,e% of the transactions in T[e] that contain item-set {Cigarettes} also contain {Gifts}. Based on the above definitions of support and confidence in candidate effective time slots, we can define the effective time slots of a rule.

Definition 5 Given association rule r : X ⇒ Y, a calendar schema R, a candidate effective time slot e in R, the minimal effective support threshold s′, and the minimal effective confidence threshold c′, if the support Sr,e and the confidence Cr,e of r in e satisfy the following conditions:

Sr,e ≥ s′    (4)
Cr,e ≥ c′    (5)

then r is called an effective association rule in e, and e is called an effective time slot of r. For instance, for the rule "r : Cigarettes ⇒ Gifts" and the calendar schema R = (year: [2005, 2006], month: [1, . . . , 12], day: [1, . . . , 31]), consider the candidate effective time slot e = ⟨∗, 10, {1, . . . , 7}⟩ in R and suppose s′ = 0.1 and c′ = 0.2. If the support Sr,e of r in e is 13.0% and the confidence Cr,e is 82.0%, then the two conditions Sr,e ≥ s′ and Cr,e ≥ c′ are satisfied. Therefore, "Cigarettes ⇒ Gifts" is an effective association rule in ⟨∗, 10, {1, . . . , 7}⟩, and ⟨∗, 10, {1, . . . , 7}⟩ is an effective time slot of the rule "Cigarettes ⇒ Gifts".

Definition 6 The set of effective time slots of r, Se = {e1, e2, . . . , ek}, consists of the slots ej (j = 1, . . . , k) that are effective time slots of r. According to Definition 5, we can examine one by one whether each candidate ei (i = 1, . . . , n) is an effective time slot of r. If it is, it is added to Se. Thus Se
can be obtained. Based on the above definitions, we can define the comments of a rule.

Definition 7 The comments Su of an association rule r form a set of triples u(r, e) = (e, Sr,e, Cr,e), where e is an effective time slot of r, Sr,e is the support of r in e, and Cr,e is the confidence of r in e. The elements of Su are sorted in descending order of Sr,e.
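Definitions 5–7 together amount to a filter-and-sort step: keep the slots where the rule clears both thresholds, then order them by slot support. A minimal sketch (with hypothetical slot names and threshold values) might look like:

```python
# Sketch of building the comments Su: filter candidate slots by the
# effectiveness thresholds s' and c', then sort by support, descending.

def build_comments(slot_stats, s_min, c_min):
    """slot_stats: list of (slot_name, support, confidence) triples."""
    effective = [(e, s, c) for e, s, c in slot_stats
                 if s >= s_min and c >= c_min]
    return sorted(effective, key=lambda u: u[1], reverse=True)

stats = [("National Day", 0.20, 0.89),
         ("Labor Day", 0.13, 0.82),
         ("Spring Festival", 0.12, 0.84),
         ("ordinary weekday", 0.005, 0.30)]
su = build_comments(stats, s_min=0.1, c_min=0.2)
# -> the three holidays, ordered 20% > 13% > 12%; the weekday is dropped
```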
Table 1 Su of the rule "Cigarettes ⇒ Gifts"

Effective time slots            Support (%)   Confidence (%)
The National Day of China       20.0          89.0
The Labor Day                   13.0          82.0
The Spring Festival of China    12.0          84.0
For example, consider the rule "r : Cigarettes ⇒ Gifts" and the calendar schema R = (year: [2005, 2006], month: [1, . . . , 12], day: [1, . . . , 31]). Suppose that r has three effective time slots {e1, e2, e3}, where e1 = ⟨∗, 10, {1, . . . , 7}⟩, e2 = ⟨∗, 5, {1, . . . , 3}⟩, and e3 = ⟨2005, 2, {2, . . . , 13}⟩ ∪ ⟨2006, 1, {22, . . . , 31}⟩ ∪ ⟨2006, 2, {1, 2}⟩, which indicate the National Day of China, the Labor Day, and the Spring Festival of China, respectively. Also suppose that Sr,e1, Sr,e2, Sr,e3 are 20.0%, 13.0%, 12.0%, respectively, and Cr,e1, Cr,e2, Cr,e3 are 89.0%, 82.0%, 84.0%, respectively. The comments Su of the rule "r : Cigarettes ⇒ Gifts" can then be written as the set {u(r, e1), u(r, e2), u(r, e3)}, where u(r, e1) = (e1, 20.0%, 89.0%), u(r, e2) = (e2, 13.0%, 82.0%), and u(r, e3) = (e3, 12.0%, 84.0%). For legibility, the set Su of r is listed in Table 1. Based on the above definition of rule comments, we can give the form of a complete DAR-C rule.

Definition 8 A complete DAR-C rule includes three parameters: comments Su, a support s and a confidence c. Its form is as follows:

X ⇒ Y [s, c, Su]    (6)

Here the support s equals PD(X ∪ Y), the percentage of transactions in the whole dataset D that contain X ∪ Y. The confidence c equals PD(Y|X), the probability that a transaction contains Y given that it contains X. The comments Su are obtained by Definition 7 and describe the dynamic property of the association rule. Taking the rule "r : Cigarettes ⇒ Gifts" as an example, with the calendar schema, the effective time slots and the corresponding supports and confidences of r as above, and supposing s = 2.0% and c = 80.0%, a complete DAR-C can be written as:

Cigarettes ⇒ Gifts [s = 2.0%, c = 80.0%, Su]

where Su is given in Table 1. For clarity, this DAR-C is listed as follows:

Cigarettes ⇒ Gifts [s = 2.0%, c = 80.0%]
Usage comments:

Effective time slots            Support (%)   Confidence (%)
The National Day of China       20.0          89.0
The Labor Day                   13.0          82.0
The Spring Festival of China    12.0          84.0
Thus, a DAR-C rule gives not only the implication of the rule, but also the usage comments of the rule, which provide information about how to apply this rule. Our DAR-C mining aims at addressing the following problem: Given a set D of timestamped transactions, a calendar schema R and a set Se of candidate effective time slots, find
all dynamic association rules with their usage comments X ⇒ Y [s, c, Su], where Su are the comments, s is the support of the rule and c is the confidence of the rule.

3.2.2 Characteristics of DAR-C

From the above definitions, compared with traditional association rules, DAR-C has the following characteristics:

(1) DAR-C addresses the problem that the effectiveness of the associations hidden in the data changes over time, and employs usage comments to reflect the dynamic property of rules.
(2) DAR-C gives not only the implication form of a rule, but also the corresponding usage comments, which point out the effective time slots of the rule. These effective time slots and their corresponding supports and confidences help decide when to apply the rule.
(3) Users can designate large quantities of domain-related candidate effective time slots, such as traditional holidays, each month, each season, each Sunday, each hour and so on. The disadvantage is that these candidate effective time slots must be provided in advance for the DAR-C mining process.
(4) The designation of candidate effective time slots is quite flexible. The candidate effective time slots can be basic time intervals and/or periodical time intervals, and they may contain, intersect, or be disjoint from one another.
4 Mining algorithms for DAR-C

According to the definitions of DAR-C, the problem of mining DAR-C can be decomposed into two sub-problems:

1. Find all the frequent item-sets L = {l1, . . . , ln}, their corresponding supports s, their support counts |T[lj, ei]| (j = 1, . . . , n) in each candidate effective time slot ei, and the numbers of transactions |T[ei]| in each ei. In Sects. 4.2 and 4.3, we propose the algorithms ITS2 and EFP-Growth2 to solve this sub-problem.
2. Use the frequent item-sets L, their supports s, their support counts in each ei and the numbers of transactions in each ei to generate the desired DAR-C. In Sect. 4.4, we put forward the Comment-generation() function to solve this sub-problem.

Notation In order to describe the framework and the algorithms clearly, some notation is introduced here. Assume that the whole dataset is D, and each transaction is associated with a time-stamp, called TID. A set of candidate effective time slots {e1, e2, . . . , ek} is specified in advance and provided to the mining algorithms. Subsequently, the whole set of frequent item-sets L = {l1, . . . , ln} and the support count of lj in ei, denoted |T[lj, ei]|, can be mined. Similarly, the number of transactions in ei is denoted |T[ei]|. Let the support count vector V_{lj,e} be {|T[lj, e1]|, . . . , |T[lj, ek]|} and the support set s be {s1, . . . , sn}, where sj (j = 1, . . . , n) is the support of lj. The notation T[∪_{i=1}^{k} ei] denotes the union of the sets of transactions occurring in the ei (i = 1, . . . , k). Let the obtained rule set be R = {r1, . . . , rn}, and the support count vector V_{rj,e} be {|T[rj, e1]|, . . . , |T[rj, ek]|}, where |T[rj, ei]| (i = 1, . . . , k) is the support count of rj in
[Figure: the DAR-C mining framework. In the first part, the user maps a set of candidate effective time slots onto the calendar schema, and the mining algorithms generate L, s, |T[lj, ei]| and |T[ei]|; in the second part, DAR-C generation produces the rules for the decision-maker.]

Fig. 1 Illustration of DAR-C mining framework
ei. The usage comments of rj are denoted Su(rj), and u(rj, ei) = (ei, S_{rj,ei}, C_{rj,ei}), where S_{rj,ei} and C_{rj,ei} are the support and the confidence of rj in ei, respectively. In the following, we use this notation to describe the mining framework, the ITS2 algorithm, the EFP-Growth2 algorithm and the Comment-generation() function in detail.

4.1 Mining framework

The DAR-C mining framework is illustrated in Fig. 1. The first part generates L, s, |T[lj, ei]| and |T[ei]| based on the candidate effective time slots. Through the mapping on the calendar schema, we construct the set of candidate effective time slots, and then use the DAR-C mining algorithms to generate L, s, |T[lj, ei]| and |T[ei]|. The second part generates the desired rules by the Comment-generation() function and submits them to decision-makers.

4.2 ITS2 algorithm

The ITS2 algorithm is simple and easy to understand, and consists of two phases. First, we employ an existing high-performance association rule mining algorithm [28] to generate all of the frequent item-sets L and their supports s. Next, we scan the union dataset T[∪_{i=1}^{k} ei] of the transactions occurring in the ei to get V_{lj,e} (j = 1, . . . , n) and |T[ei]| (i = 1, . . . , k). The second phase includes the following steps. First, for each transaction t in T[∪_{i=1}^{k} ei], we obtain all of the frequent item-sets contained in t, and accumulate |T[ei]| and |T[lj, ei]|. Second, we put |T[lj, e1]|, . . . , |T[lj, ek]| together to form V_{lj,e}. The outline of this algorithm is described in Fig. 2. The function High-performance-association-mining-algorithm calls a high-performance association mining algorithm, such as MAFIA [28], to find the frequent item-sets L and their supports s.

4.3 EFP-Growth2 algorithm

ITS2 is an algorithm consisting of two phases.
Besides the traditional mining procedure for discovering frequent item-sets, ITS2 needs an additional database scan to generate V_{l_j,e} and |T[e_i]|. Here we propose another DAR-C mining algorithm, EFP-Growth2, whose framework is similar to that of the FP-Growth algorithm [29]. EFP-Growth2 extends the FP-tree of the FP-Growth algorithm and constructs a compact data structure called the Extended FP-tree (or EFP-tree for short). Based on the EFP-tree data structure, EFP-Growth2 can not only generate
123
Algorithm: ITS2
Input: dataset D, the candidate effective time slots e_i (i = 1, ..., k), min_sup
Output: L, s, V_{l_j,e} (j = 1, ..., n) and |T[e_i]| (i = 1, ..., k)
Method:
1)  (L, s) = High-performance-association-mining-algorithm; // obtain the frequent item-sets and their supports
2)  For each t ∈ T[∪_{i=1}^k e_i] do {
3)     L_temp = subset(L, t); // obtain all of the frequent item-sets contained in t
4)     For i = 1, ..., k do
5)        If t ∈ T[e_i] then {
6)           |T[e_i]|++;
7)           For each frequent item-set l_j ∈ L_temp do
8)              |T[l_j, e_i]|++; // cumulate the support count of l_j in e_i
           }
    }
9)  For every frequent item-set l_j ∈ L, let |T[l_j, e_i]| (i = 1, ..., k) form V_{l_j,e};
10) Return L with their corresponding s, V_{l_j,e} (j = 1, ..., n) and |T[e_i]| (i = 1, ..., k)

Fig. 2 ITS2 algorithm
frequent item-sets effectively, but can also keep and cumulate |T[l_j, e_i]|. It inherits the virtues of FP-Growth, such as high performance in mining high-density data, and at the same time it handles the task of mining DAR-C well.

4.3.1 Extended frequent-pattern tree

Compared with the FP-tree, the compact data structure of the EFP-tree is designed based on the following observations:

Observation 1 DAR-C mining needs to find not only the frequent item-set l_j and its support s_j, but also the support count |T[l_j, e_i]| of the frequent item-set l_j in each e_i. It is therefore necessary to add an additional vector V_e containing |T[l_j, e_1]|, ..., |T[l_j, e_k]| to each node of the FP-tree.

Observation 2 |T[l_j, e_i]| is similar to the frequent item-set l_j's support s_j. The only difference is that the support s_j equals the count of transactions in the whole dataset D that contain l_j, while |T[l_j, e_i]| equals the count of transactions in e_i that contain l_j. In fact, |T[l_j, e_i]| is a special support, so we can cumulate s_j and |T[l_j, e_i]| at the same time during the whole database scan.

Thus, we give the data structure of the EFP-tree below.

Definition 9 (EFP-tree) An extended frequent-pattern tree (or EFP-tree for short) is a tree structure defined below. (1) It consists of one root labeled as "null", a set of item-prefix sub-trees as the children of the root, and a frequent-item-header table.
Header table (item-name : count : Ve):
f : 4 : [2,1,2,2]
c : 4 : [2,1,2,2]
a : 3 : [2,0,2,1]
b : 3 : [1,2,1,2]
m : 3 : [2,0,2,1]
p : 3 : [1,1,1,2]

Tree (node-links omitted):
Root
  f:4:[2,1,2,2]
    c:3:[2,0,2,1]
      a:3:[2,0,2,1]
        m:2:[1,0,1,1]
          p:2:[1,0,1,1]
        b:1:[1,0,1,0]
          m:1:[1,0,1,0]
    b:1:[0,1,0,1]
  c:1:[0,1,0,1]
    b:1:[0,1,0,1]
      p:1:[0,1,0,1]

Fig. 3 The EFP-tree corresponding to Example 1
Table 2 A transaction database as a running example

ID | Items bought           | Time stamp | (Ordered) frequent items
01 | f, a, c, d, g, i, m, p | 2005, 1, 1 | f, c, a, m, p
02 | a, b, c, f, l, m, o    | 2005, 1, 2 | f, c, a, b, m
03 | b, f, h, j, o          | 2006, 1, 1 | f, b
04 | b, c, k, s, p          | 2006, 1, 2 | c, b, p
05 | a, f, c, e, l, p, m, n | 2006, 2, 2 | f, c, a, m, p
(2) Each node in the item-prefix sub-tree consists of four fields: item-name, count, vector V_e and node-link, where item-name registers the item that this node represents, count registers the number of transactions supporting l_j (the item-set represented by the portion of the path reaching this node), vector V_e contains the elements |T[l_j, e_i]| (i = 1, ..., k), and node-link links to the next node in the EFP-tree carrying the same item-name, or null if there is none. (3) Each entry in the frequent-item-header table consists of four fields: item-name, count, vector V_e and node-link (a pointer pointing to the first node in the EFP-tree carrying the item-name).

In order to describe the above data structure clearly, we give a running example below.

Example 1 Consider a calendar schema R = (year: [2005, 2006], month: [1, ..., 12], day: [1, ..., 31]), and let the transaction database be the first three columns of Table 2, where the time stamps of the transactions are calendar patterns in R. Let the minimum support threshold be 60%. Assume there are four candidate effective time slots e1 = ⟨2005, 1, ∗⟩, e2 = ⟨2006, 1, ∗⟩, e3 = ⟨2005, ∗, ∗⟩ and e4 = ⟨2006, ∗, ∗⟩. So e1 covers the transactions "01" and "02"; e2 covers the transactions "03" and "04"; e3 covers the transactions "01" and "02"; e4 covers the transactions "03", "04" and "05". After running the EFP-tree construction in the EFP-Growth2 algorithm, the tree, together with the associated node-links, is shown in Fig. 3 (in Sect. 4.3.2 we explain the process of EFP-tree construction in detail).
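The coverage relation of Example 1 can be checked mechanically: a candidate effective time slot is just a calendar pattern with wildcards. The following Python sketch is illustrative only (the names covers and STAR are ours, not the paper's); it recomputes |T[e_i]| for the four slots and the Ve entry of item "f" that appears in the header table of Fig. 3.

```python
STAR = '*'  # wildcard in a calendar pattern such as <2005, 1, *>

def covers(pattern, ts):
    """True iff the calendar pattern covers the (year, month, day) time stamp."""
    return all(p == STAR or p == t for p, t in zip(pattern, ts))

# Transactions of Table 2: (time stamp, items bought).
db = [((2005, 1, 1), set('facdgimp')),
      ((2005, 1, 2), set('abcflmo')),
      ((2006, 1, 1), set('bfhjo')),
      ((2006, 1, 2), set('bcksp')),
      ((2006, 2, 2), set('afcelpmn'))]

# The four candidate effective time slots e1..e4 of Example 1.
slots = [(2005, 1, STAR), (2006, 1, STAR),
         (2005, STAR, STAR), (2006, STAR, STAR)]

# |T[e_i]|: how many transactions each slot covers.
T_e = [sum(covers(e, ts) for ts, _ in db) for e in slots]

# Ve of the 1-frequent item "f": |T[f, e_i]| for each slot.
Ve_f = [sum(covers(e, ts) and 'f' in items for ts, items in db) for e in slots]
```

With the data above, T_e evaluates to [2, 2, 2, 3] and Ve_f to [2, 1, 2, 2], matching Example 1 and the entry (f:4:[2,1,2,2]) of Fig. 3.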
Algorithm: EFP-Growth2
Input: dataset D, the candidate effective time slots e_i (i = 1, ..., k), min_sup
Output: L with their supports s, their V_{l_j,e} (j = 1, ..., n), and |T[e_i]| (i = 1, ..., k)
Method:
1) Tree ← EFP-Construction2(D, e_1 ~ e_k, min_sup);
2) Initialize L: create an empty frequent item-set φ, L ← {φ}, φ.sup ← the transaction number of D; for each element (denoted as φ.V_{e_i}) in φ.V_e do φ.V_{e_i} ← |T[e_i]|;
3) (L, V_{l_j,e}, s) ← EFP-Growth2(Tree, null).

Fig. 4 EFP-Growth2 algorithm
From the definition of the EFP-tree and Fig. 3, we can see the difference between the EFP-tree and the FP-tree: the EFP-tree has an additional vector field V_e in each node of the item-prefix sub-tree and in each entry of the frequent-item-header table. The purpose of this additional field is to keep and cumulate |T[l_j, e_i]|.

4.3.2 EFP-Growth2 algorithm

Adopting the framework of the FP-Growth algorithm, we put forward the algorithm EFP-Growth2 based on the EFP-tree (Fig. 4).

Analysis There are three procedures in the EFP-Growth2 algorithm: (a) the EFP-Construction2() procedure, which constructs the EFP-tree; (b) the initialization of the set of frequent patterns L; (c) the call of EFP-Growth2() to mine the EFP-tree. In the following, we give the functions EFP-Construction2() and EFP-Growth2() based on the EFP-tree.

(1) EFP-Construction2() function

Because the data structure of the EFP-tree is similar to the FP-tree's, the function EFP-Construction2() is also similar to the FP-tree construction function of the FP-Growth algorithm. Since the only difference between the EFP-tree and the FP-tree is the additional vector field V_e in each tree node and each header-table entry, we only need to focus on producing the values in V_e during the EFP-tree construction. There are two database scans in the process of EFP-tree construction. We get the vector V_e of each entry of the header table in the first database scan. In the second scan, we cumulate the support count and V_e in each tree node at the same time. The detailed description is listed in Fig. 5.

Consider Example 1 described before. The tree construction process is as follows. First, a scan of the database gets the set of 1-frequent items, their supports, and their V_e. For instance, the support count of the 1-frequent item "f" is 4, and its V_{e_1}, ..., V_{e_4} are 2, 1, 2, 2, respectively. So the entry of "f" can be denoted as (f:4:[2,1,2,2]).
Thus, a list of frequent items, (f:4:[2,1,2,2]), (c:4:[2,1,2,2]), (a:3:[2,0,2,1]), (b:3:[1,2,1,2]), (m:3:[2,0,2,1]), (p:3:[1,1,1,2]), in which the items are ordered in frequency-descending order, can be derived. For convenience of later discussion, the 1-frequent items in each transaction are also listed in frequency-descending order in the fourth column of Table 2. We also obtain |T[e_1]|, ..., |T[e_4]|, which are 2, 2, 2, 3, respectively.
Function: EFP-Construction2(D, e_1 ~ e_k, min_sup)
1) Scan the dataset D once. Obtain |T[e_i]|. Collect F, the set of 1-frequent items. Get the support and the V_e of each frequent item. Sort F in support-descending order as FList, the list of frequent items;
2) Create the root of Tree, and label it as "null". For each transaction Trans in D execute steps 3)~4);
3) Select all of the frequent items in Trans, and sort them according to the order of FList. Let the sorted frequent-item list in Trans be [p | P], where p is the first element and P is the remaining list. Then call insert_tree([p | P], T, TID), where T is the root of Tree and TID is the time stamp of Trans;
4) The function insert_tree([p | P], T, TID) is executed as follows.
   a) If T has a child N such that N.item-name = p.item-name, increase N's count by 1. Check every e_i (i = 1, ..., k); if TID ∈ e_i, increase the i-th element of the V_e of N by 1.
   b) If T does not have a child N such that N.item-name = p.item-name, create a new node N with its count initialized to 1. Check every e_i (i = 1, ..., k); if TID ∈ e_i, let the i-th element of the V_e of N be 1, else 0. Its parent link is linked to T, and its node-link is linked to the nodes with the same item-name via the node-link structure.
   After a) or b) is performed, if P is nonempty, call insert_tree(P, N, TID) recursively.

Fig. 5 The function of EFP-Construction2()
Second, create the root of the EFP-tree, which is labeled as "null". Then the EFP-tree is constructed as follows by the second database scan. For the sake of clarity, the node-links of the EFP-tree are not discussed in the following process.

1. The first transaction is in e1 and e3, so V_{e_1}, V_{e_3} and the count of each node are incremented by 1. Thus the scan of the first transaction leads to the construction of the first branch of the EFP-tree: (f:1:[1,0,1,0]), (c:1:[1,0,1,0]), (a:1:[1,0,1,0]), (m:1:[1,0,1,0]), (p:1:[1,0,1,0]).
2. For the second transaction, the item list f,c,a,b,m shares a common prefix with f,c,a,m,p. The second transaction belongs to e1 and e3, so along the prefix, V_{e_1}, V_{e_3} and the count of each node are incremented by 1. Two new nodes (b:1:[1,0,1,0]) and (m:1:[1,0,1,0]) are created; the first, (b:1:[1,0,1,0]), is linked as a child of (a:2:[2,0,2,0]), and the second, (m:1:[1,0,1,0]), is linked as the child of node (b:1:[1,0,1,0]).
3. The third transaction is in e2 and e4, and its item list f, b shares the common prefix f with the f-prefix subtree. So node (f:2:[2,0,2,0]) is changed to (f:3:[2,1,2,1]), and a new node (b:1:[0,1,0,1]) is created and linked as a child of (f:3:[2,1,2,1]).
4. The fourth transaction belongs to e2 and e4, and it does not share any item with the prefix tree. So the scan of the fourth transaction leads to a new branch of the EFP-tree: (c:1:[0,1,0,1]), (b:1:[0,1,0,1]), (p:1:[0,1,0,1]).
5. The last transaction is in e4 only, and contains the same frequent items as the first transaction. So V_{e_4} and the count of each node along that path are incremented by 1.

The above change process of the EFP-tree is shown in Fig. 6. In order to facilitate tree traversal, the node-link structure, which links the nodes carrying the same item-name, can be added to the above process and Fig. 6. Thus, after running the function EFP-Construction2(), the finally established EFP-tree is as shown in Fig. 3.
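The per-node bookkeeping of the second scan can be sketched in Python. This is a minimal illustration, not the authors' code: the class and function names are ours, the header table and node-link chains are omitted, and a transaction's slot membership is passed in as a precomputed list of booleans.

```python
class EFPNode:
    """A node of an EFP-tree: item-name, support count, and the vector Ve of
    per-slot counts, plus parent/children links (the header table and the
    node-link chains of the real structure are omitted for brevity)."""
    def __init__(self, item, k, parent=None):
        self.item = item
        self.count = 0
        self.Ve = [0] * k          # one counter per candidate effective time slot
        self.parent = parent
        self.children = {}

def insert_tree(items, node, slot_flags):
    """Insert a sorted frequent-item list below `node`.  slot_flags[i] is
    True iff the transaction's time stamp lies in e_i, so count and Ve are
    cumulated in the same pass (cf. Observation 2)."""
    if not items:
        return
    first, rest = items[0], items[1:]
    child = node.children.get(first)
    if child is None:
        child = node.children[first] = EFPNode(first, len(slot_flags), node)
    child.count += 1
    for i, inside in enumerate(slot_flags):
        if inside:
            child.Ve[i] += 1
    insert_tree(rest, child, slot_flags)
```

Inserting the first two (ordered) transactions of Table 2 this way, each with slot flags [True, False, True, False] for e1..e4, reproduces the shared prefix (f:2:[2,0,2,0]), (c:2:[2,0,2,0]), (a:2:[2,0,2,0]) of snapshot (2) in Fig. 6.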
Fig. 6 The change of the EFP-tree during the EFP-tree generation (six snapshots: (0) initialization; (1) scan the 1st transaction; (2) scan the 2nd transaction; (3) scan the 3rd transaction; (4) scan the 4th transaction; (5) scan the 5th transaction)
(2) EFP-Growth2() function

Next, we propose the function EFP-Growth2(), which is similar to the FP-growth() procedure of the FP-Growth algorithm (Fig. 7). The difference is that, while generating the frequent patterns, we should generate not only their supports but also their V_e. According to Observations 1 and 2 in Sect. 4.3.1, each element in V_e is a special support, so in the process of mining the EFP-tree we can generate the support and the elements of V_e at the same time for each frequent item-set.

We examine the process of mining the EFP-tree shown in Fig. 3 through Example 1 as well. This EFP-tree contains multiple prefix paths, and we examine the mining process by starting from the bottom entry of the header table. For node p, we generate its immediate frequent pattern (p:3:[1,1,1,2]) and two paths in the EFP-tree, which are ⟨(f:4:[2,1,2,2]), (c:3:[2,0,2,1]), (a:3:[2,0,2,1]), (m:2:[1,0,1,1]), (p:2:[1,0,1,1])⟩ and ⟨(c:1:[0,1,0,1]), (b:1:[0,1,0,1]), (p:1:[0,1,0,1])⟩. Hence, we obtain p's prefix paths ⟨(f:2:[1,0,1,1]), (c:2:[1,0,1,1]), (a:2:[1,0,1,1]), (m:2:[1,0,1,1])⟩ (or simply ⟨fcam:2:[1,0,1,1]⟩) and ⟨cb:1:[0,1,0,1]⟩. The two prefix paths of p form p's sub-pattern-base, which is called p's conditional pattern base. Then an EFP-tree on this conditional pattern base (called p's conditional EFP-tree), {c:3:[1,1,1,2]}|p, can be constructed, and the frequent pattern (cp:3:[1,1,1,2]) is derived. The mining for p then terminates.
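The step from p's full paths to its conditional pattern base re-weights every prefix node with the count and Ve of the suffix node itself. A hedged Python sketch of this step (the function name and the (item, count, Ve) triple layout are our illustrative assumptions):

```python
def conditional_pattern_base(paths):
    """Build a node's conditional pattern base from its prefix paths.
    Each path is a list of (item, count, Ve) triples ending at the node
    under examination; every prefix item inherits the count and Ve of that
    final node, as in the step from <f:4, c:3, a:3, m:2, p:2> to
    <fcam:2:[1,0,1,1]>."""
    base = []
    for path in paths:
        *prefix, (item, cnt, Ve) = path        # split off the suffix node
        if prefix:                             # empty prefixes contribute nothing
            base.append(([it for it, _, _ in prefix], cnt, Ve))
    return base
```

Applied to the two paths of node p given above, it yields exactly ⟨fcam:2:[1,0,1,1]⟩ and ⟨cb:1:[0,1,0,1]⟩.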
Function: EFP-Growth2(Tree, α)
1)  If Tree contains a single prefix path
2)  then {  // mining a single prefix-path EFP-tree
3)     Let P be the single prefix-path part of Tree;
4)     Let Q be the multi-path part with the top branching node replaced by a null root;
5)     For each combination (denoted as β) of the nodes in the path P do {
6)        Generate pattern c ← β ∪ α, c.sup ← the minimum support of the nodes in β;  // generating c and c.sup
7)        For each element (denoted as c.V_{e_i}) in c.V_e do
8)           c.V_{e_i} ← the minimum V_{e_i} of the nodes in β;  // generating c.V_{e_i}
       }
9)     Let freq_pattern_set(P) be the set of patterns so generated;
10)    L ← L ∪ freq_pattern_set(P);  // add freq_pattern_set(P) to L
    }
11) else {  // mining a multi-path EFP-tree
12)    Let Q be Tree;
13)    For each item a_i in Q do {
14)       Generate pattern β ← a_i ∪ α, β.sup ← a_i.sup, β.V_e ← a_i.V_e;
15)       Construct β's conditional pattern-base and then β's conditional EFP-tree Tree_β;
16)       If Tree_β ≠ φ then
17)          (L_β, V_{l_j,e}, s) ← EFP-Growth2(Tree_β, β);
       }
18)    Let freq_pattern_set(Q) be the set of patterns so generated;
19)    L ← L ∪ freq_pattern_set(Q);  // add freq_pattern_set(Q) to L
    }
20) For each frequent pattern p in freq_pattern_set(P) do
21)    For each frequent pattern q in freq_pattern_set(Q) do {
22)       Generate pattern fp ← p ∪ q, fp.sup ← the minimum of p.sup and q.sup;
23)       For each element (denoted as fp.V_{e_i}) in fp.V_e do
24)          fp.V_{e_i} ← the minimum of p.V_{e_i} and q.V_{e_i};
       }
25) Let freq_pattern_set(P) × freq_pattern_set(Q) be the set of patterns so generated, where × is the cross-product;
26) L ← L ∪ freq_pattern_set(P) × freq_pattern_set(Q);
27) Return L with their corresponding V_{l_j,e} and s.

Fig. 7 The function of EFP-Growth2()
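The single prefix-path case of steps 5)–8) amounts to enumerating item combinations and taking minima: the support of a combination is the minimum count on it, and its Ve is the element-wise minimum of the nodes' Ve vectors. A Python sketch under our own naming assumptions:

```python
from itertools import combinations

def mine_single_path(path, alpha=frozenset()):
    """Mine a single prefix-path conditional EFP-tree (steps 5-8 of Fig. 7):
    every non-empty combination of nodes on the path yields a pattern.

    path : list of (item, count, Ve) triples, e.g. the path of {fca:3:[2,0,2,1]}|m
    alpha: the suffix item-set the conditional tree is conditioned on
    """
    patterns = {}
    for r in range(1, len(path) + 1):
        for combo in combinations(path, r):
            items = alpha | frozenset(it for it, _, _ in combo)
            sup = min(cnt for _, cnt, _ in combo)              # minimum count
            Ve = [min(v) for v in zip(*(ve for _, _, ve in combo))]  # element-wise minimum
            patterns[items] = (sup, Ve)
    return patterns
```

Applied to the single path of {fca:3:[2,0,2,1]}|m with α = {m}, it produces the seven patterns am, cm, fm, cam, fam, fcm and fcam, each with support 3 and Ve [2,0,2,1], matching the set of frequent patterns the text derives for node m.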
For node m, its immediate frequent pattern is (m:3:[2,0,2,1]), and its two paths are ⟨(f:4:[2,1,2,2]), (c:3:[2,0,2,1]), (a:3:[2,0,2,1]), (m:2:[1,0,1,1])⟩ and ⟨(f:4:[2,1,2,2]), (c:3:[2,0,2,1]), (a:3:[2,0,2,1]), (b:1:[1,0,1,0]), (m:1:[1,0,1,0])⟩. Similar to the above discussion,
Fig. 8 {fca:3:[2,0,2,1]}|m, a conditional EFP-tree for item m. Its header table contains the items f, c and a, and the tree is the single path Root → f:3:[2,0,2,1] → c:3:[2,0,2,1] → a:3:[2,0,2,1].

Table 3 Conditional pattern-bases and conditional EFP-trees generated in the process of mining the EFP-tree shown in Fig. 3

Item | Conditional pattern base                               | Conditional EFP-tree
p    | {⟨fcam:2:[1,0,1,1]⟩, ⟨cb:1:[0,1,0,1]⟩}                 | {c:3:[1,1,1,2]}|p
m    | {⟨fca:2:[1,0,1,1]⟩, ⟨fcab:1:[1,0,1,0]⟩}                | {fca:3:[2,0,2,1]}|m
b    | {⟨fca:1:[1,0,1,0]⟩, ⟨f:1:[0,1,0,1]⟩, ⟨c:1:[0,1,0,1]⟩}  | φ
a    | {⟨fc:3:[2,0,2,1]⟩}                                     | {fc:3:[2,0,2,1]}|a
c    | {⟨f:3:[2,0,2,1]⟩}                                      | {f:3:[2,0,2,1]}|c
f    | φ                                                      | φ
m's conditional pattern base is {⟨fca:2:[1,0,1,1]⟩, ⟨fcab:1:[1,0,1,0]⟩}. Hence, m's conditional EFP-tree is {fca:3:[2,0,2,1]}|m, which has a single frequent-pattern path {fca:3:[2,0,2,1]}, as shown in Fig. 8. By calling EFP-Growth2({fca:3:[2,0,2,1]}|m), this conditional EFP-tree can be mined recursively. Because it contains only one single prefix path, it can be mined by outputting all the combinations of the items in the path. Therefore, the set of frequent patterns involving m is {(m:3:[2,0,2,1]), (am:3:[2,0,2,1]), (cm:3:[2,0,2,1]), (fm:3:[2,0,2,1]), (cam:3:[2,0,2,1]), (fam:3:[2,0,2,1]), (fcm:3:[2,0,2,1]), (fcam:3:[2,0,2,1])}.

Similarly, node b derives (b:3:[1,2,1,2]), and its conditional pattern base is {⟨fca:1:[1,0,1,0]⟩, ⟨f:1:[0,1,0,1]⟩, ⟨c:1:[0,1,0,1]⟩}. Since it generates no frequent item, the search for frequent patterns associated with b terminates. Then node a derives (a:3:[2,0,2,1]), and its conditional pattern base is {⟨fc:3:[2,0,2,1]⟩}. Hence, its conditional EFP-tree is {fc:3:[2,0,2,1]}|a, which contains one single path, and its set of frequent patterns can be derived by taking all the combinations of the items in this path: {(fa:3:[2,0,2,1]), (ca:3:[2,0,2,1]), (fca:3:[2,0,2,1])}. Node c's immediate frequent pattern is (c:4:[2,1,2,2]), and its conditional pattern base is {⟨f:3:[2,0,2,1]⟩}; therefore, the conditional EFP-tree {f:3:[2,0,2,1]}|c derives the frequent pattern (fc:3:[2,0,2,1]). Node f derives only (f:4:[2,1,2,2]) and has no conditional pattern base. Thus the whole EFP-tree is mined. The conditional pattern-bases and the conditional EFP-trees generated are summarized in Table 3.

4.4 Comment generation

After the frequent item-sets L, s, V_{l_j,e} and |T[e_i]| (i = 1, ..., k) are obtained by the ITS2 or EFP-Growth2 algorithm, the function Comment-generation() can be called to get the final DAR-C. It includes the steps given in Fig. 9.
First, the association rule generation function is called to obtain the rule set R. Then, according to the corresponding definitions, we can calculate the
Function: Comment-generation
Input: L, s, V_{l_j,e} (j = 1, ..., n), and |T[e_i]| (i = 1, ..., k), the minimal effective support threshold s′, the minimal effective confidence threshold c′, and min_conf
Output: rule set R, comments S_u(R), s, and c
1)  (R, c) = rule-generation-sub-algorithm; // call the association rule generation function
2)  For each rule r_j ∈ R do {
3)     Let V_{r_j,e} be V_{X_j∪Y_j,e}, and the s of r_j be the s of X_j∪Y_j;
4)     Create the comments S_u(r_j) of r_j, and let it be an empty set φ;
5)     For i = 1, ..., k do {
6)        S_{r_j,e_i} = |T[X_j∪Y_j, e_i]| / |T[e_i]|;
7)        C_{r_j,e_i} = |T[X_j∪Y_j, e_i]| / |T[X_j, e_i]|;
8)        If (S_{r_j,e_i} ≥ s′) and (C_{r_j,e_i} ≥ c′) then
9)           insert u(r_j, e_i) = (e_i, S_{r_j,e_i}, C_{r_j,e_i}) into S_u(r_j);
        }
10)    Sort the u(r_j, e_i) in S_u(r_j) in descending order of S_{r_j,e_i};
    }
11) Return R with their corresponding S_u(R), s and c

Fig. 9 Comment-generation() function
supports and the confidences in the candidate effective time slots for each rule r_j ∈ R. If the candidate effective time slot e_i satisfies (S_{r_j,e_i} ≥ s′) and (C_{r_j,e_i} ≥ c′), then u(r_j, e_i) = (e_i, S_{r_j,e_i}, C_{r_j,e_i}) is inserted into S_u(r_j). Thus the usage comments S_u(r_j) can be generated. Detailed descriptions are shown in Fig. 9, where X_j is the left-hand side of r_j and Y_j is the right-hand side of r_j.
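The per-slot computation of Fig. 9 reduces to two divisions and a threshold test per time slot. The following sketch assumes our own data layout (a dict V of support-count vectors keyed by item-set, as the earlier algorithms would produce, and a function name of our choosing), not the paper's implementation:

```python
def comment_generation(rule, V, T_e, slot_names, s_eff, c_eff):
    """Compute the usage comments S_u(r) of a rule r: X -> Y.

    rule        : (X, Y) as frozensets
    V           : dict mapping a frequent item-set to its vector of |T[l, e_i]|
    T_e         : list of |T[e_i]|
    slot_names  : human-readable names of the candidate effective time slots
    s_eff, c_eff: minimal effective support / confidence thresholds (s', c')
    """
    X, Y = rule
    comments = []
    for i, name in enumerate(slot_names):
        if T_e[i] == 0 or V[X][i] == 0:       # slot empty or antecedent absent
            continue
        S = V[X | Y][i] / T_e[i]              # S_{r,e_i} = |T[X∪Y, e_i]| / |T[e_i]|
        C = V[X | Y][i] / V[X][i]             # C_{r,e_i} = |T[X∪Y, e_i]| / |T[X, e_i]|
        if S >= s_eff and C >= c_eff:
            comments.append((name, S, C))
    comments.sort(key=lambda u: u[1], reverse=True)   # descending S_{r,e_i}
    return comments
```

Only the slots that pass both thresholds survive, already sorted by effective support, which is exactly the shape of the usage-comments table shown for the retail rule in Sect. 5.3.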
5 Performance evaluation

Our proposed mining algorithms for DAR-C (the ITS2 algorithm and the EFP-Growth2 algorithm) are coded in C++. Note that the first phase of the ITS2 algorithm was implemented with the MAFIA algorithm proposed in [28]. In this section, we report our experimental results on mining DAR-C from three aspects. First, we evaluate the execution performance of the algorithms. Second, we examine the influence of the relevant parameters on the performance of the algorithms. Finally, we apply DAR-C mining to a real application. In our experiments, we choose the ITS and EFP-Growth algorithms [25] for comparison. Since the ITS and EFP-Growth algorithms are designed for mining dynamic association rules, they cannot tackle the DAR-C mining task directly. In order to finish the mining procedure of DAR-C rules, we have to run them multiple times and then do some additional post-mining processing. All the experiments are conducted on a PC with an AMD Sempron 2400+ (1.67 GHz) CPU and 512 MB main memory, running Windows XP Professional Edition.
Fig. 10 Performance comparison on FoodMart 2000 (x-axis: minimal degree of support, 0.002–0.01; y-axis: execution time in seconds, for ITS, ITS2, EFP-Growth and EFP-Growth2)
5.1 Experiment 1: performance comparisons

We use FoodMart 2000, T10I4D100K and Connect4 to test the performance of these algorithms. FoodMart 2000 is a real retail dataset shipped with SQL Server 2000, and its data is very sparse. T10I4D100K is generated by the generator provided by the IBM Almaden Laboratory Data Mining Research Group [30], and lies between sparse and dense datasets. Connect4, downloaded from the UCI Machine Learning Repository, is a very dense dataset.

The subsets of each dataset are set as follows. For the FoodMart 2000 dataset, we choose the sale transactions in 1998, and regard the items purchased by the same customer at the same time as one basket. Then, to overcome the problem that the supports of the bottom-level items are too small, we generalize the bottom-level items to the upper classes according to the goods classes. Finally, we get a dataset with 41011 transactions and divide it into 730 sub-datasets D1 ~ D730, where each of the first 729 sub-datasets contains 56 transactions and the last one contains 187 transactions. The T10I4D100K dataset is likewise divided into 730 subsets D1 ~ D730, where each of the first 729 sub-datasets includes 137 transactions and the last one includes 127. We also divide the Connect4 dataset into 730 sub-datasets D1 ~ D730; each of the first 729 subsets contains 92 transactions, while the last one has 489 transactions.

The time stamps of the sub-datasets are set as follows. We add the corresponding time stamp TID for every sub-dataset, which is ⟨2005, 1, 1⟩ ~ ⟨2006, 12, 31⟩, respectively. Let the calendar schema R be (year: [2005, 2006], month: [1, ..., 12], day: [1, ..., 31]), and let the 45 candidate effective time slots e1, ..., e45 be ⟨2005, ∗, ∗⟩, ⟨2006, ∗, ∗⟩, ⟨∗, 1, ∗⟩, ..., ⟨∗, 12, ∗⟩, ⟨∗, ∗, 1⟩, ..., ⟨∗, ∗, 31⟩, which mean "year 2005", "year 2006", "every Jan. every year", ..., "every Dec. every year", "the first day every month", ..., "the 31st day every month", respectively.
Because ITS and EFP-Growth can only produce dynamic information in conjoint time slots, in order to obtain all of the dynamic information in the above candidate effective time slots, we need to divide the candidate effective time slots into three groups: group("year 2005", "year 2006"), group("every Jan. every year", ..., "every Dec. every year") and group("the first day every month", ..., "the 31st day every month"). Thus we have to repeat ITS and EFP-Growth three times to finish the task of mining DAR-C.

The experimental results are shown in Figs. 10, 11, and 12, where the ordinates are the execution time and the abscissas are the minimal support thresholds. The results show that ITS2 and EFP-Growth2 deliver satisfactory performance under normal circumstances. For
Fig. 11 Performance comparison on T10I4D100K (x-axis: minimal degree of support, 0.0005–0.003; y-axis: execution time in seconds)

Fig. 12 Performance comparison on Connect4 (x-axis: minimal degree of support, 0.995–0.98; y-axis: execution time in seconds)
the FoodMart 2000 dataset, the ranking of the algorithms is: ITS2 > ITS > EFP-Growth2 > EFP-Growth. For the T10I4D100K dataset, ITS2 has the best performance in the high-support-threshold domain, while EFP-Growth2 is the best in the low-support-threshold domain. For the Connect4 dataset, EFP-Growth2 has the best performance, while the execution time of ITS2 increases rapidly in the low-support-threshold domain.

Because the ITS2 algorithm and the ITS algorithm are of the same class, they perform well for mining sparse datasets with a high support threshold. The EFP-Growth2 and EFP-Growth algorithms are established on the highly compressed EFP-tree, so they achieve better performance on dense datasets such as Connect4. The ITS and EFP-Growth algorithms are not fit for the DAR-C mining task: to finish it, we have to repeat them multiple times and do some additional post-mining processing. So the ITS2 algorithm achieves better performance than the ITS algorithm, and the EFP-Growth2 algorithm performs better than the EFP-Growth algorithm.

5.2 Experiment 2: parameter study

Varying support thresholds can significantly affect the performance of the algorithms. From Figs. 10, 11, and 12, we can see that the execution time of these algorithms increases as the support threshold decreases.
Table 4 Transaction numbers of the subsets

Total number of trans. | Trans. number in D1 ~ D729 | Trans. number in D730
20K  | 27  | 317
40K  | 54  | 634
60K  | 82  | 222
80K  | 109 | 539
100K | 137 | 127
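The subset sizes of Table 4 can be sanity-checked with a line of arithmetic: the first 729 equal-sized subsets plus the last one must account for every transaction. A quick check (the tuple layout is ours):

```python
# Rows of Table 4: (total transactions, size of D1..D729, size of D730).
table4 = [(20000, 27, 317), (40000, 54, 634), (60000, 82, 222),
          (80000, 109, 539), (100000, 137, 127)]

# 729 equally sized front subsets plus the last subset cover the whole dataset.
for total, front, last in table4:
    assert 729 * front + last == total
```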
Fig. 13 Scalability with transaction number (x-axis: number of transactions, 20K–100K; y-axis: execution time in seconds)
In the following, we vary three other parameters: the transaction number, the number of candidate effective time slots k, and the average number of covered transactions per candidate effective time slot, ∑_{i=1}^k |T[e_i]|/k.

First, in the experiment varying the transaction number, we choose T10I4D100K as the benchmark dataset and set the support threshold to 0.001. We choose 20K, 40K, 60K, 80K and 100K transactions from the T10I4D100K dataset and divide them into 730 subsets D1 ~ D730. The transaction numbers of the subsets are listed in Table 4. The settings of the time stamps, the calendar schema and the candidate effective time slots are the same as in Experiment 1. The test results are shown in Fig. 13. We can see that the execution times of ITS2 and EFP-Growth2 grow approximately linearly with the transaction number, showing good scalability.

Next, we vary the number of candidate effective time slots k. We choose T10I4D100K as the benchmark dataset and set the support threshold to 0.001. The settings of the sub-datasets, the time stamps and the calendar schema are the same as in Experiment 1. We keep ∑_{i=1}^k |T[e_i]|/k fixed and let k be 5, 10, 15 and 20, respectively, where the candidate effective time slots are listed in Table 5. In order to finish the mining process, the ITS and EFP-Growth algorithms need to be run twice. Figure 14 shows the test results: if ∑_{i=1}^k |T[e_i]|/k is fixed, the execution times of ITS2 and EFP-Growth2 increase linearly with the number of candidate effective time slots.

Finally, we examine the variation of ∑_{i=1}^k |T[e_i]|/k. We also select T10I4D100K as the benchmark. The settings of the support threshold, the sub-datasets, the time stamps of the sub-datasets and the calendar schema are again the same as in Experiment 1. We let the number of candidate effective time slots k be 2 and keep it unchanged.
We take (|T[e_1]| + |T[e_2]|)/2 as the benchmark and label it as 1×base, where e_1 is the candidate effective time slot "every Jan. every year",
Table 5 The settings of the candidate effective time slots

k  | Candidate effective time slots | Calendar patterns
5  | "every Jan. every year", ..., "every Mar. every year", "the first day every month", "the second day every month" | ⟨∗,1,∗⟩, ..., ⟨∗,3,∗⟩, ⟨∗,∗,1⟩, ⟨∗,∗,2⟩
10 | "every Jan. every year", ..., "every Jun. every year", "the first day every month", ..., "the fourth day every month" | ⟨∗,1,∗⟩, ..., ⟨∗,6,∗⟩, ⟨∗,∗,1⟩, ..., ⟨∗,∗,4⟩
15 | "every Jan. every year", ..., "every Sep. every year", "the first day every month", ..., "the sixth day every month" | ⟨∗,1,∗⟩, ..., ⟨∗,9,∗⟩, ⟨∗,∗,1⟩, ..., ⟨∗,∗,6⟩
20 | "every Jan. every year", ..., "every Dec. every year", "the first day every month", ..., "the eighth day every month" | ⟨∗,1,∗⟩, ..., ⟨∗,12,∗⟩, ⟨∗,∗,1⟩, ..., ⟨∗,∗,8⟩

Fig. 14 Scalability with the number of candidate effective time slots (x-axis: k = 5, 10, 15, 20; y-axis: execution time in seconds)
Table 6 The settings of the candidate effective time slots

∑_{i=1}^k |T[e_i]|/k | Candidate effective time slots | Calendar patterns
1×base  | "every Jan. every year", "the first day every month" | ⟨∗,1,∗⟩, ⟨∗,∗,1⟩
3×base  | "from Jan. to Mar. every year", "from the first day to the third day every month" | ⟨∗,{1,...,3},∗⟩, ⟨∗,∗,{1,...,3}⟩
6×base  | "from Jan. to Jun. every year", "from the first day to the sixth day every month" | ⟨∗,{1,...,6},∗⟩, ⟨∗,∗,{1,...,6}⟩
9×base  | "from Jan. to Sep. every year", "from the first day to the ninth day every month" | ⟨∗,{1,...,9},∗⟩, ⟨∗,∗,{1,...,9}⟩
12×base | "from Jan. to Dec. every year", "from the first day to the twelfth day every month" | ⟨∗,{1,...,12},∗⟩, ⟨∗,∗,{1,...,12}⟩
and e_2 is "the first day every month". Then we set the value of ∑_{i=1}^k |T[e_i]|/k to 3×base, 6×base, 9×base and 12×base, respectively, for the scalability test, where the candidate effective time slots are listed in Table 6. Figure 15 shows the corresponding experimental results. We can conclude that, while keeping k unchanged, the execution times of ITS2 and EFP-Growth2 grow slowly and approximately linearly with ∑_{i=1}^k |T[e_i]|/k.

The above experiments indicate that both ITS2 and EFP-Growth2 show good performance and scalability with various datasets and parameters. They are quite fit for DAR-C mining.
Fig. 15 Scalability with ∑_{i=1}^k |T[e_i]|/k (x-axis: transaction number per period of time, 1×base to 12×base; y-axis: execution time in seconds)
5.3 Application

We apply the idea of mining dynamic association rules with comments to the FoodMart 2000 retail dataset, which ships with SQL Server 2000 and contains 269,720 records for the years 1997 and 1998. The pre-processing is as follows. Taking the records of 1997 and 1998, we treat the items purchased by the same customer at the same time as one basket. Because the supports of the bottom-level items are small, we generalize them to their goods classes. Finally, we obtain 62,568 transactions with time-stamps.

Let the calendar schema R1 be (year: [1997, 1998], month: [1, …, 12], day: [1, …, 31]), and R2 be (week: [1, …, 105], day: [1, …, 7]). According to the real situation of FoodMart, a set of candidate effective time slots can be provided: "Thanksgiving Day", "the New Year", "the Christmas Day", "the Valentine Day", "every spring every year", …, "every winter every year", "every Jan. every year", "every Feb. every year", …, "every Dec. every year", and "every Mon. every week", …, "every Sun. every week". The corresponding calendar patterns are, under R1: ⟨1997, 11, 27⟩ ∪ ⟨1998, 11, 30⟩, ⟨∗, 1, 1⟩, ⟨∗, 12, {24, 25}⟩, ⟨∗, 2, 14⟩, ⟨∗, {3, 4, 5}, ∗⟩, …, ⟨∗, {12, 1, 2}, ∗⟩, ⟨∗, 1, ∗⟩, …, ⟨∗, 12, ∗⟩, and under R2: ⟨∗, 1⟩, …, ⟨∗, 7⟩.

Let the minimal support threshold be 0.1%, the minimal confidence threshold be 0.1%, and the minimal effective support threshold s be 0.3%. Mining the dataset under these conditions yields a set of DAR-C rules. Take one of them as an example:

17 ⇒ 11 [s = 0.17%, c = 11.8%]

Usage comments:

Effective time slots      Support (%)   Confidence (%)
The New Year              3.17          66.7
The Christmas Day         0.61          22.2
The Thanksgiving Day      0.53          20.0
Every May every year      0.30          20.0
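A calendar pattern under a schema such as R1 = (year, month, day) pairs each component with either a wildcard ∗, a single value, or a set of values. A minimal sketch of how a transaction's date can be matched against such a pattern (the tuple encoding and the function name `matches` are our own illustrative choices, not the paper's code):

```python
def matches(pattern, date):
    """Check a (year, month, day) date against a calendar pattern whose
    components are '*' (wildcard), a single value, or a set of values."""
    for p, v in zip(pattern, date):
        if p == '*':
            continue                      # wildcard matches anything
        if isinstance(p, set):
            if v not in p:                # value must be in the set
                return False
        elif p != v:                      # single value must match exactly
            return False
    return True

christmas = ('*', 12, {24, 25})           # pattern <*, 12, {24, 25}>
new_year  = ('*', 1, 1)                   # pattern <*, 1, 1>

print(matches(christmas, (1997, 12, 24)))  # True
print(matches(new_year,  (1998, 1, 2)))    # False
```

The set of transactions matching a pattern is exactly T[e_i] for that candidate effective time slot, over which the slot-specific support and confidence in the table above are computed.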
The above dynamic association rule with comments shows that this rule performs best in the following effective time slots: "the New Year", "the Christmas Day", "the Thanksgiving Day" and "every May every year", where the support in "the New Year" is 18.6 times the support over the whole database, and the confidence in "the New Year" is 5.7 times the confidence over the whole database. If we apply the association rule under the guidance of these comments, we can obtain a better application effect. This indicates that giving a rule's usage comments is of real help to decision-makers.
6 Conclusions

In this paper, we propose a new type of dynamic association rule, namely the dynamic association rule with comments (DAR-C). A DAR-C gives not only the rule itself but also the corresponding usage comments, which indicate when to apply the rule. In order to describe this kind of rule, we first present the expression method for candidate effective time slots, and then define the corresponding concepts of DAR-C. Subsequently, the mining framework and two mining algorithms (i.e., ITS2 and EFP-Growth2) are proposed. In particular, ITS2 is a modified two-stage dynamic association rule mining algorithm that is easy to understand, while EFP-Growth2 adopts the compact EFP-tree structure and is suitable for mining high-density mass data. Extensive experimental results demonstrate the effectiveness and scalability of the proposed algorithms, and their practicability on a real retail dataset.

Acknowledgments We would like to express our gratitude to the anonymous reviewers for their valuable and helpful comments, which improved the technical quality and presentation of this paper. This work is supported in part by the National Science Fund for Distinguished Young Scholars under Grant No. 60525202, and in part by the National Natural Science Foundation of China under Grant Nos. 60533040 and 60525202.
References

1. Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the ACM SIGMOD conference on management of data, pp 207–216
2. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large data bases, pp 487–499
3. Cheung DW, Han J, Ng VT, Wong CY (1996) Maintenance of discovered association rules in large databases: an incremental updating technique. In: Proceedings of the 12th international conference on data engineering, pp 106–114
4. Cheng J, Ke Y, Ng W (2008) A survey on algorithms for mining frequent itemsets over data streams. Knowl Inf Syst 16(1):1–27
5. Zhang S, Huang Z, Zhang J, Zhu X (2008) Mining follow-up correlation patterns from time-related database. Knowl Inf Syst 14(1):81–100
6. Ke Y, Cheng J, Ng W (2008) An information-theoretic approach to quantitative association rule mining. Knowl Inf Syst 16(2):213–244
7. Liu J, Rong G (2005) Mining dynamic association rules in databases. In: Proceedings of the international conference on computational intelligence and security, pp 688–695
8. Agrawal R, Srikant R (1995) Mining sequential patterns. In: Proceedings of the 11th international conference on data engineering, pp 3–14
9. Han J, Pei J, Mortazavi-Asl B, Chen Q, Dayal U, Hsu M (2000) FreeSpan: frequent pattern-projected sequential pattern mining. In: Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining, pp 355–359
10. Garofalakis MN, Rastogi R, Shim K (1999) SPIRIT: sequential pattern mining with regular expression constraints. In: Proceedings of the 25th international conference on very large data bases, pp 223–234
11. Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: Proceedings of the 5th international conference on extending database technology, pp 3–17
12. Ozden B, Ramaswamy S, Silberschatz A (1998) Cyclic association rules. In: Proceedings of the 14th international conference on data engineering, pp 412–421
13. Qin M, Hwang K (2004) Frequent episode rules for Internet anomaly detection. In: Proceedings of the 3rd IEEE international symposium on network computing and applications, pp 161–168
14. Han J, Gong W, Yin Y (1998) Mining segment-wise periodic patterns in time-related databases. In: Proceedings of the 4th international conference on knowledge discovery and data mining, pp 214–218
15. Verma K, Vyas OP (2005) Efficient calendar based temporal association rule. SIGMOD Record 34(3):63–70
16. Li Y, Ning P, Sean Wang X, Jajodia S (2003) Discovering calendar-based temporal association rules. Data Knowl Eng 44(2):193–218
17. Lu H, Han J, Feng L (1998) Stock movement prediction and n-dimensional inter-transaction association rules. In: Proceedings of the 3rd ACM-SIGMOD workshop on research issues on data mining and knowledge discovery, vol 12, pp 1–7
18. Agrawal R, Psaila G (1995) Active data mining. In: Proceedings of the 1st international conference on knowledge discovery and data mining, pp 3–8
19. Ganti V, Gehrke J, Ramakrishnan R (1999) A framework for measuring changes in data characteristics. In: Proceedings of the 18th ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, pp 126–137
20. Liu B, Hsu W, Ma Y (2001) Discovering the set of fundamental rule changes. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining, pp 335–340
21. Dong G, Li J (2005) Mining border descriptions of emerging patterns from dataset pairs. Knowl Inf Syst 8(2):178–202
22. Au W-H, Chan KCC (2002) Fuzzy data mining for discovering changes in association rules over time. In: Proceedings of the 2002 IEEE international conference on fuzzy systems, vol 2, pp 890–895
23. Au W-H, Chan KCC (2005) Mining changes in association rules: a fuzzy approach. Fuzzy Sets Syst 149(1):87–104
24. Liu B, Hsu W, Han H-S, Xia Y (2000) Mining changes for real-life applications. In: Proceedings of the 2nd international conference on data warehousing and knowledge discovery, pp 337–346
25. Shen B, Yao M (2007) Research on a new kind of dynamic association rule and its mining algorithms. http://www.paper.edu.cn/paper.php?serial_number=200712-3
26. Ramaswamy S, Mahajan S, Silberschatz A (1998) On the discovery of interesting patterns in association rules. In: Proceedings of the 24th international conference on very large data bases, pp 368–379
27. Lee W-J, Jiang J-Y, Lee S-J (2008) Mining fuzzy periodic association rules. Data Knowl Eng 65(3):442–462
28. Burdick D, Calimlim M, Gehrke J (2001) MAFIA: a maximal frequent itemset algorithm for transactional databases. In: Proceedings of the 17th international conference on data engineering, pp 443–452
29. Han J, Pei J, Yin Y, Mao R (2004) Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min Knowl Discov 8(1):53–87
30. IBM Almaden Research Center (2009) Quest synthetic data generation code. http://www.almaden.ibm.com/cs/projects/iis/hdb/Projects/data_mining/datasets/syndata.html
31. Deepa Sheenoy P, Srinivasa KG, Venugopal KR, Patnaik LM (2005) Dynamic association rule mining using genetic algorithms. Intell Data Anal 9(5):439–453
Author Biographies

Bin Shen received his Ph.D. degree in computer science from Zhejiang University, China, in 2007. He is currently with the Department of Management at the Ningbo Institute of Technology, Zhejiang University, Ningbo, China. His current research interests mainly include data mining, machine learning, and information management.
Min Yao received the B.E. degree in Radio Technique from Hefei University, China, in 1982, the M.E. degree in Computer Science from Hefei University of Technology, China, in 1986, and the Ph.D. degree in Biomedical Engineering and Instrument from Zhejiang University, China, in 1995. He is currently a professor in the College of Computer Science and Technology at Zhejiang University, Hangzhou, China. His research interests include knowledge discovery, pervasive computing, and service computing.
Zhaohui Wu received the Ph.D. degree in computer science from Zhejiang University, China, in 1993. From 1991 to 1993, he was with the German Research Center for Artificial Intelligence (DFKI) as a joint Ph.D. student. Currently, he is a Professor of computer science with Zhejiang University and the Director of the Institute of Computer System and Architecture. He has authored more than 100 refereed papers. His major interests include artificial intelligence, semantic grid, and ubiquitous computing. Prof. Wu has served as the Program Committee Member for various international conferences and is on the editorial boards of several journals.
Yunjun Gao received his Ph.D. degree in computer science from Zhejiang University, China, in 2008. Since March 2008, he has been a postdoctoral research fellow at the School of Information Systems, Singapore Management University, Singapore. He is a member of the ACM and ACM SIGMOD. His current research interests mainly include spatial databases, spatio-temporal databases, mobile/pervasive computing, and geographic information systems.