Data Mining and Knowledge Discovery, 7, 153–185, 2003 c 2003 Kluwer Academic Publishers. Manufactured in The Netherlands.
Extracting Share Frequent Itemsets with Infrequent Subsets

BROCK BARBER
HOWARD J. HAMILTON∗
[email protected]
Department of Computer Science, University of Regina, 3737 Wascana Parkway, Regina, SK S4S 0A2, Canada
∗ To whom correspondence should be addressed.

Editors: Fayyad, Mannila, Ramakrishnan

Received November 15, 1996; Revised February 1, 2002
Abstract. Itemset share has been proposed as an additional measure of the importance of itemsets in association rule mining (Carter et al., 1997). We compare the share and support measures to illustrate that the share measure can provide useful information about numerical values that are typically associated with transaction items, which the support measure cannot. We define the problem of finding share frequent itemsets, and show that share frequency does not have the property of downward closure when it is defined in terms of the itemset as a whole. We present algorithms that do not rely on the property of downward closure, and thus are able to find share frequent itemsets that have infrequent subsets. The algorithms use heuristic methods to generate candidate itemsets. They supplement the information contained in the set of frequent itemsets from a previous pass with other information that is available at no additional processing cost. They count only those generated itemsets that are predicted to be frequent. The algorithms are applied to a large commercial database and their effectiveness is examined using principles of classifier evaluation from machine learning.

Keywords: frequent itemsets, share measure, share frequent itemsets, heuristic data mining, quantitative itemsets, association rules
1. Introduction
Increasingly large and complex databases are being accumulated by business, government and scientific organizations. Knowledge discovery in databases (or data mining) applies concepts and techniques from fields such as machine learning, statistics and database management to extract useful information from databases (Frawley et al., 1991). One data mining problem that has received considerable attention is the discovery of association rules from market basket data. The problem was first introduced in the context of bar code data analysis (Agrawal et al., 1993). The tremendous amount of data that is collected using bar code scanners represents a potential wealth of information, given adequate methods of transforming the data into meaningful information. The goal of bar code data analysis is to identify buying patterns through the examination of itemsets, groups of items purchased together in transactions. From any itemset, an association rule can be derived which, given the purchase of a subset of the items in an itemset, predicts the probability of the purchase of the remaining items. Such information gives insight into questions such as how to market
these products more effectively, how to group them in store layout or product packages, or which items to offer on sale to boost the sale of other items. Due to the wide applicability of bar code data analysis and the obvious benefit that can be achieved from such analysis, the discovery of association rules from retail sales data has been the focus of a large body of research (e.g., Park et al., 1995; Agrawal and Shafer, 1996; Megiddo and Srikant, 1998; Hidber, 1999). Principles developed through this research have been applied in other domains, such as telecommunications and health data (Ali et al., 1997), census data (Brin et al., 1997b; Bayardo et al., 1999), spatial data (Koperski and Han, 1995), text data (Silverstein et al., 1998) and distributed databases (Cheung et al., 1996). Although the problem and our methods are general, we choose to present the problem in terms of the retail sales domain, because it provides extremely accessible intuitions for explaining problems, concepts and solutions. A retail organization may offer thousands of products and services. The number of possible combinations of these products and services is potentially huge. In the general case, the examination of all possible combinations is impractical and methods are required to focus effort on those itemsets that are considered important to an organization. The most commonly used measure of the importance of an itemset is its support, the percentage of all transactions that contain the itemset (e.g., Agrawal et al., 1993; Mannila et al., 1994; Han and Fu, 1995; Hipp et al., 1998; Hidber, 1999). Itemsets that meet a minimum support threshold are referred to as frequent itemsets. The rationale behind the use of support is that a retail organization is only interested in those itemsets that occur frequently. However, the support of an itemset tells only the number of transactions in which the itemset was purchased.
The exact number of items purchased is not analyzed and the precise impact of the purchase of an itemset cannot be measured in terms of stock, cost or profit. This shortcoming of the support measure prompted the development of a measure called itemset share, the fraction of some numerical value, such as total quantity of items sold or total profit, that is contributed by the items when they occur in an itemset (Carter et al., 1997). A key property of support is that it is downward closed with respect to the lattice of all possible itemsets. That is to say, if an itemset is frequent by support, then all of its subsets must also be frequent by support. This closure property has permitted the development of efficient algorithms that traverse only a portion of the itemset lattice, yet find all possible frequent itemsets. Share frequency can also be defined so that it has the downward closure property by requiring that each item in a frequent itemset be frequent when it occurs in the itemset, as was done by Carter et al. (1997). However, when examining numerical data such as quantity sold, cost or profit, the share of an itemset can increase as it is extended by adding items. If the frequency requirement were based on the total share of the itemset, then itemsets that satisfy the minimum share criteria might have subsets that do not. As a result, there is a class of frequent itemsets that cannot be discovered using the downward closed share frequency definition. In this paper, we describe algorithms to discover share frequent itemsets, including those that have infrequent subsets. Since we cannot rely on the closure property to restrict the number of itemsets that must be examined, heuristics are used to predict whether itemsets should be counted. The performance of the algorithms is evaluated using concepts developed for classification systems in machine learning.
In Section 2, a brief review of support and share is provided, as well as an example illustrating the utility of the share measure. In Section 3, we discuss the closure property and a definition of share frequency that includes itemsets with infrequent subsets. Section 4 describes the algorithms examined in this study. In Section 5, we describe the methods used to evaluate the performance of the algorithms. A description of the software implementation is provided in Section 6. In Section 7, our experimental results are presented. Finally, in Section 8, a summary of the work is provided and we discuss areas of future research.

2. Review of support and share measures
The problem of discovering association rules from transaction data can be decomposed into two subtasks (Agrawal et al., 1993). The first subtask is to find all itemsets that meet a minimum frequency requirement. The second subtask is the generation of association rules from the frequent itemsets. The second step is relatively easy compared to the first (Zaki et al., 1997). The focus of this paper is the first task, the extraction of frequent itemsets from transaction data.

2.1. The support measure
We now summarize itemset methodology formally as follows (Agrawal et al., 1996). Let I = {I1, I2, ..., Im} be a set of literals, called items. Let D = {T1, T2, ..., Tn} be a set of n transactions, where for each transaction T ∈ D, T ⊆ I. A set of items X ⊆ I is called an itemset. A transaction T contains an itemset X if X ⊆ T. Each itemset X is associated with a set of transactions TX = {T ∈ D | T ⊇ X}, which is the set of transactions which contain the itemset X. The support s of itemset X equals |TX|/|D|.

We illustrate support using the small sample transaction database shown in Table 1. The TID column gives the transaction identifier values. Beneath each item name are values indicating the quantity of the item sold. We recognize, of course, that support is defined over the binary domain {0, 1}, with a value of 1 indicating the presence of an item in a transaction, and a value of 0 indicating the absence of an item. We use the explicit quantities to illustrate limitations of the support measure. In our calculation of support, any non-zero quantity in the table is treated as a 1. Table 2 shows the support for each possible itemset.

Table 1. Example transaction database.

TID    Item A   Item B   Item C   Item D
T1       1        0        1       14
T2       0        0        6        0
T3       1        0        2        4
T4       0        0        4        0
T5       0        0        3        1
T6       0        0        1       13
T7       0        0        8        0
T8       4        0        0        7
T9       0        1        1       10
T10      0        0        0       18

Table 2. Itemset support.

Itemset      s
A          0.30
B          0.10
C          0.80
D          0.70
AB         0.00
AC         0.20
AD         0.30
BC         0.10
BD         0.10
CD         0.50
ABC        0.00
ABD        0.00
ACD        0.20
BCD        0.10
ABCD       0.00

2.2. Limitations of the support measure
As previously indicated, support has been used as a fundamental measure for determining the importance of an itemset. The frequency of occurrence of an itemset certainly provides useful information. In addition, support is measured relative to a stable foundation, the total number of transactions being examined. However, transaction data often contain richer information than whether or not the itemset exists in the transaction, such as quantity sold, unit cost, unit profit, or other numerical attributes. By analyzing this information, we can get a more insightful picture of the relative importance of itemsets. Consider the information provided by quantity sold. Many products are purchased in multiples, such as frozen concentrated juice or carbonated beverages. Since support does not consider this, frequency information derived from support may be misleading. In the sample database, the 1-itemset C has a higher support than the 1-itemset D because it occurs in one more transaction than D. However, the total quantity of item D sold is 67, while that of item C is 26, so in fact D is sold more frequently. Similarly, itemsets BC and BD have equal support, since each occurs in a single transaction, but the quantity sold of items B and D in itemset BD is higher than the quantity sold of items B and C in itemset BC. Again, support provides a misleading picture of frequency in terms of the quantity of items sold.
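The divergence between frequency of occurrence and quantity sold can be checked directly against Table 1. The following sketch is illustrative Python (the dictionary `db` is simply a hand-coded copy of Table 1, with zero-quantity items omitted); it treats any non-zero quantity as presence when computing support, and separately totals the quantities:

```python
# Table 1 as quantity-sold values; items with quantity 0 are omitted.
db = {
    "T1": {"A": 1, "C": 1, "D": 14},
    "T2": {"C": 6},
    "T3": {"A": 1, "C": 2, "D": 4},
    "T4": {"C": 4},
    "T5": {"C": 3, "D": 1},
    "T6": {"C": 1, "D": 13},
    "T7": {"C": 8},
    "T8": {"A": 4, "D": 7},
    "T9": {"B": 1, "C": 1, "D": 10},
    "T10": {"D": 18},
}

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(1 for t in db.values() if set(itemset) <= set(t)) / len(db)

def quantity(item):
    """Total quantity of the item sold over all transactions."""
    return sum(t.get(item, 0) for t in db.values())

print(support("C"), support("D"))      # 0.8 0.7: C looks more frequent ...
print(quantity("C"), quantity("D"))    # 26 67: ... but D sells far more units
print(support("BC") == support("BD"))  # True, despite different quantities
```

The example reproduces the observations above: C outranks D on support while D dominates on quantity sold, and BC and BD are indistinguishable by support.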
The original support measure also does not allow for accurate financial calculations or comparisons. For target marketing, measures should take into account both the frequency of an item contributing to a predictive rule and the value of the items in the prediction (Masand and Piatetsky-Shapiro, 1996). The support measure allows for neither of these, so measures based on specific numbers of items, such as percentage of gross sales, costs or net profit, cannot be calculated, and business payoff cannot be maximized. Again examine itemsets BC and BD. Assume that each item is sold for $1.00. Using support, there is no reason to consider one itemset more important than the other, even though BD generates $11.00 of revenue and BC generates only $2.00. Support fails as a measure of relative importance whenever the number of items plays a significant role in determining the value of the relevant sales. Various approaches have been suggested for extending support to quantitative measures (Srikant and Agrawal, 1996; Buchter and Wirth, 1998), for weighting each item using a constant value across all transactions (Cai et al., 1998; Pei et al., 2001), and for weighting each transaction using a single constant value without regard to the items in the transaction (Lu et al., 2001). A simpler and more flexible approach is to use itemset share (Carter et al., 1997) as a measure, because it allows each item in each transaction to be assigned a separate weight. The recent proposal of value added association rules (Lin et al., 2002) is consistent with itemset share.

2.3. The share measure
The share measure has been proposed (Carter et al., 1997) as an alternative measure of the importance of itemsets. In informal terms, share is the percentage of a numerical total that is contributed by the items in an itemset. In this section, we provide a formal description of the share measure. We start with a definition of the source of the numerical information, a measure attribute.

Definition 1. A measure attribute (MA) is a numerical attribute associated with each item in each transaction. A measure attribute can have an integer type, such as quantity sold, or a real type, such as profit margin, unit cost, or total revenue.

Definition 2. The transaction measure value, denoted as tmv(Ip, Tq), is the value of a measure attribute associated with an item Ip in a transaction Tq. The quantity sold values in Table 1 are the transaction measure values of the items in each transaction. For example, tmv(D, T1) = 14.

Definition 3. The global measure value of an item Ip, denoted as MV(Ip), is the sum of the transaction measure values of Ip in every transaction in which Ip appears, where

    MV(Ip) = Σ_{Tq ∈ T_Ip} tmv(Ip, Tq)    (1)
Using the sample data, MV(A) = tmv(A, T1) + tmv(A, T2) + tmv(A, T3) + tmv(A, T4) + tmv(A, T5) + tmv(A, T6) + tmv(A, T7) + tmv(A, T8) + tmv(A, T9) + tmv(A, T10) = 1 + 0 + 1 + 0 + 0 + 0 + 0 + 4 + 0 + 0 = 6. Similarly, MV(B) = 1, MV(C) = 26 and MV(D) = 67.

Definition 4. The total measure value (MV) is the sum of the global measure values for all items in I in every transaction in D, where

    MV = Σ_{p=1}^{m} MV(Ip)    (2)

The total measure value provides a stable baseline, similar to the total number of transactions used in the support measure. In the sample database, MV = MV(A) + MV(B) + MV(C) + MV(D) = 6 + 1 + 26 + 67 = 100.

Definition 5. A k-itemset is an itemset X = {x1, x2, ..., xk}, X ⊆ I, 1 ≤ k ≤ m, of k distinct items. Each itemset X has an associated set of transactions TX = {Tq ∈ T | Tq ⊇ X}, which is the set of transactions that contain the itemset X.

Definition 6. The local measure value of an item xi in an itemset X, denoted as lmv(xi, X), is the sum of the transaction measure values of the item xi in all transactions containing X, where

    lmv(xi, X) = Σ_{Tq ∈ TX} tmv(xi, Tq)    (3)
The local measure value for an item xi will always be less than or equal to the global measure value for the item xi, since the global measure value represents the sum of transaction measure values of item xi in every transaction in which item xi individually occurs, whether or not the complete itemset occurs in each of these transactions. A single item will have a separate local measure value for each itemset in which the item appears. Thus, the local measure value of some item Ip in the itemset X will be different from the local measure value of Ip in the itemset Z, if Z is not equal to X.

Definition 7. The local measure value of an itemset X, denoted as lmv(X), is the sum of the local measure values of each item in X in all transactions containing X, where

    lmv(X) = Σ_{i=1}^{k} lmv(xi, X)    (4)

Definition 8. The item share of an item xi in itemset X, denoted as SH(xi, X), is the ratio of the local measure value of xi in X to the total measure value, where

    SH(xi, X) = lmv(xi, X) / MV    (5)
Definition 9. The itemset share of itemset X, denoted as SH(X), is the ratio of the local measure value of X to the total measure value, where

    SH(X) = lmv(X) / MV    (6)
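Definitions 6 through 9 can be restated compactly in code. The sketch below is illustrative Python, not part of the original work; the dictionary `db` hand-codes Table 1 with quantity sold as the measure attribute, and the functions follow Eqs. (2), (4) and (6) directly:

```python
# Table 1 with quantity sold as the measure attribute (zeros omitted).
db = {
    "T1": {"A": 1, "C": 1, "D": 14}, "T2": {"C": 6},
    "T3": {"A": 1, "C": 2, "D": 4}, "T4": {"C": 4},
    "T5": {"C": 3, "D": 1}, "T6": {"C": 1, "D": 13},
    "T7": {"C": 8}, "T8": {"A": 4, "D": 7},
    "T9": {"B": 1, "C": 1, "D": 10}, "T10": {"D": 18},
}

MV = sum(sum(t.values()) for t in db.values())  # total measure value, Eq. (2)

def lmv(itemset):
    """Local measure value, Eq. (4): the items' measure values summed
    over transactions that contain the whole itemset."""
    return sum(sum(t[i] for i in itemset)
               for t in db.values() if set(itemset) <= set(t))

def SH(itemset):
    """Itemset share, Eq. (6)."""
    return lmv(itemset) / MV

print(MV)                      # 100
print(lmv("ACD"), SH("ACD"))   # 23 0.23, the worked ACD example
print(SH("A"))                 # 0.06: far smaller than its superset's share
```

Note that SH("ACD") exceeds SH("A") even though A ⊆ ACD, a point that becomes central in Section 3.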
Based on the sample transaction database provided in Table 1, values corresponding to the measures described in Definitions 6, 7, 8 and 9 are provided in Table 3. The left-hand column lists all possible itemsets. The two columns under each item label show the local measure value and item share of the item in each of the itemsets in the left-hand column. For example, lmv(A, ACD) = 2 and, recalling that MV = 100, SH(A, ACD) = lmv(A, ACD)/MV = 2/100 = 0.02. The two columns under the label Itemset X are the local measure value and itemset share of the itemsets in the left-hand column. For itemset ACD, lmv(ACD) = lmv(A, ACD) + lmv(C, ACD) + lmv(D, ACD) = 2 + 3 + 18 = 23 and SH(ACD) = lmv(ACD)/MV = 23/100 = 0.23. A dash in a table cell indicates that the itemset does not contain the item.

2.4. A comparative example of share and support
We now present a brief example that illustrates the utility of the share measure. We extracted itemset information from a large telecommunication product database, provided to us by a commercial partner. The database and the program used to extract this information are described more fully in Sections 6 and 7, respectively. The data correspond to customer purchases for a particular segment of customers and include only items from one major product group. The measure value attribute in this example is total revenue. The information in Table 4 corresponds to extracted 3-itemsets. In total, 1300 3-itemsets were found in the data. Each row in the tables corresponds to an itemset. The top 20 itemsets ranked by support are shown in Table 4(a). Their corresponding rank by share of total revenue is also indicated. Table 4(b) contains the top 20 itemsets ranked by share of revenue, with their corresponding rank by support provided. Clearly, the measure used to examine the itemsets affects the view of the data that is presented. Only two 3-itemsets ranked in the top 20 by support appear in the top 20 of the 3-itemsets ranked by share of total revenue. The 3-itemset that has the greatest contribution to the total revenue is ranked 252nd by support. Share ranking based on the fraction of a real-valued total is also better able to differentiate between itemsets than support. For example, three itemsets are tied for the second highest rank by support, but their share values are different. Since support ranking differs so greatly from share ranking, it cannot serve as a heuristic guide when determining share ranking.

Table 3. Sample database summary.

           Item A         Item B         Item C         Item D        Itemset X
Itemset   lmv   SH       lmv   SH       lmv   SH       lmv   SH       lmv   SH
A          6   0.06       –     –        –     –        –     –        6   0.06
B          –     –        1   0.01       –     –        –     –        1   0.01
C          –     –        –     –       26   0.26       –     –       26   0.26
D          –     –        –     –        –     –       67   0.67      67   0.67
AB         0   0.00       0   0.00       –     –        –     –        0   0.00
AC         2   0.02       –     –        3   0.03       –     –        5   0.05
AD         6   0.06       –     –        –     –       25   0.25      31   0.31
BC         –     –        1   0.01       1   0.01       –     –        2   0.02
BD         –     –        1   0.01       –     –       10   0.10      11   0.11
CD         –     –        –     –        8   0.08      42   0.42      50   0.50
ABC        0   0.00       0   0.00       0   0.00       –     –        0   0.00
ABD        0   0.00       0   0.00       –     –        0   0.00      0   0.00
ACD        2   0.02       –     –        3   0.03      18   0.18      23   0.23
BCD        –     –        1   0.01       1   0.01      10   0.10      12   0.12
ABCD       0   0.00       0   0.00       0   0.00       0   0.00      0   0.00

Table 4. Alternate ranking of itemsets by share and support. (a) Top 20 itemsets ranked by support; (b) top 20 itemsets ranked by share.

(a)
Support rank   % Support   Share rank   % Share of revenue
      1          5.47         252            0.1
      2          1.55          48            0.78
      2          1.55          46            0.8
      2          1.55          39            0.91
      5          1.51          42            0.84
      6          1.31          57            0.72
      7          1.17          52            0.76
      8          1.07          50            0.77
      9          0.92          64            0.67
      9          0.92          70            0.64
      9          0.92          89            0.52
     12          0.87           8            2.36
     12          0.87          97            0.5
     14          0.83          71            0.64
     15          0.78          26            1.3
     15          0.78         101            0.49
     17          0.73          12            1.97
     17          0.73         104            0.48
     17          0.73          78            0.59
     20          0.68          66            0.66

(b)
Share rank   % Share of revenue   Support rank   % Support
     1              3.69                6           2.43
     2              3.1                39           0.39
     3              2.9                67           0.29
     4              2.6               104           0.19
     5              2.5               391           0.05
     6              2.43                1           3.69
     7              2.41              392           0.05
     8              2.36               13           0.87
     9              2.35              393           0.05
    10              2.35              394           0.05
    11              2.15               52           0.34
    12              1.97               18           0.73
    13              1.82              395           0.05
    14              1.76              396           0.05
    15              1.76              397           0.05
    16              1.75               25           0.58
    17              1.67              398           0.05
    18              1.67              399           0.05
    19              1.61              400           0.05
    20              1.57              242           0.1
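The same divergence between the two rankings can be demonstrated on the small database of Table 1. The fragment below is an illustrative Python sketch (not the authors' extraction program); it ranks every itemset over the items A–D by support and by share, using quantity sold as the measure attribute:

```python
from itertools import combinations

# Table 1 with quantity sold as the measure attribute (zeros omitted).
db = {
    "T1": {"A": 1, "C": 1, "D": 14}, "T2": {"C": 6},
    "T3": {"A": 1, "C": 2, "D": 4}, "T4": {"C": 4},
    "T5": {"C": 3, "D": 1}, "T6": {"C": 1, "D": 13},
    "T7": {"C": 8}, "T8": {"A": 4, "D": 7},
    "T9": {"B": 1, "C": 1, "D": 10}, "T10": {"D": 18},
}
MV = sum(sum(t.values()) for t in db.values())

def support(X):
    return sum(1 for t in db.values() if X <= set(t)) / len(db)

def share(X):
    return sum(sum(t[i] for i in X)
               for t in db.values() if X <= set(t)) / MV

itemsets = [frozenset(c) for k in range(1, 5) for c in combinations("ABCD", k)]
by_support = sorted(itemsets, key=support, reverse=True)
by_share = sorted(itemsets, key=share, reverse=True)
print("".join(sorted(by_support[0])))  # C: support 0.80
print("".join(sorted(by_share[0])))    # D: share 0.67
```

Even on four items the two orderings lead with different itemsets, mirroring the disagreement between Tables 4(a) and 4(b).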
This example shows that the effectiveness of each measure depends on the context of the current discovery task. From the point of view of a data analyst, support and share can be considered as tools, and the ability to select the right tool for the job is always beneficial. Itemset share can be incorporated into many of the algorithms developed for the support measure (Hilderman et al., 1998) and we have fully implemented both measures.

3. Frequent itemsets
The minimum support threshold specifies the minimum percentage of transactions in which an itemset must be contained to be considered frequent (Agrawal et al., 1993). As the itemset size grows, the support of the itemset never increases and usually decreases. Thus, all subsets of a frequent itemset are also guaranteed to be frequent (Agrawal and Srikant, 1994; Mannila et al., 1994). This property is referred to as downward closure (Brin et al., 1997a; Zaki et al., 1997). Formally, the closure property has been defined as follows (Silverstein et al., 1998).

Definition 10. A property P is downward closed with respect to the lattice of all itemsets if, for each itemset with the property P, all of its subsets also have the property P. A property P is upward closed with respect to the lattice of all itemsets if, for each itemset with the property P, all of its supersets also have the property P.

The property of downward closure has permitted the development of efficient algorithms based on the support measure. For example, in the Apriori algorithm (Agrawal and Srikant, 1994), only itemsets that have been determined to be frequent by support are used to generate the candidate itemsets for the next pass. Infrequent itemsets are never used for candidate itemset generation because the downward closure property guarantees that no superset of an infrequent itemset can be frequent. As a result, the algorithm traverses only a small portion of the itemset lattice, but it is guaranteed to find all itemsets that are frequent by support. Share frequency differs from support frequency in that it is not based on the frequency of the itemset as a whole. The original definition of share frequency is as follows (Carter et al., 1997).

Definition 11. An itemset X is downward closed share frequent (DC-frequent) if ∀xi ∈ X, SH(xi, X) ≥ minshare, a user-defined minimum share value.

Although Carter et al. (1997) refer to this term simply as "frequent", we add the prefix "DC-" to differentiate it from a more general type of frequency that is not downward closed.

Theorem 1. DC-frequency is downward closed with respect to the lattice of all itemsets.

Proof: To show that DC-frequency is downward closed, we must show that if X is DC-frequent, then for all Xj ⊆ X, Xj must be DC-frequent. Suppose X is DC-frequent. By definition, for all xi ∈ X, SH(xi, X) ≥ minshare. Since Xj ⊆ X, lmv(xi, Xj) ≥ lmv(xi, X) and SH(xi, Xj) = lmv(xi, Xj)/MV ≥ SH(xi, X) = lmv(xi, X)/MV. Since for all xi ∈ X,
SH(xi, X) ≥ minshare, then for all xi ∈ Xj, SH(xi, Xj) ≥ minshare. Therefore, by definition, Xj is DC-frequent.

When using the share measure, the share of an itemset may increase or decrease as the size of the itemset increases. The addition of an item xi to a k-itemset X to create a new (k + 1)-itemset Y adds a restriction to the measure value of the items in X. In other words, the measure values associated with the items in X now contribute to the local measure value of Y only when they occur with the new item xi. Their contribution towards the local measure value of Y must be less than or equal to their contribution to the local measure value of X. However, the local measure value of xi is added to the local measure value of Y, which may be less than, equal to, or greater than the local measure value of X. Thus, if share frequency is measured against the share of the itemset, then it is possible to have an itemset with share above the minimum share, whose component itemsets have share below the minimum share. For existing, efficient algorithms to work with share, the property of downward closure of DC-frequency is essential. However, an entire class of frequent itemsets will be missed, namely those with infrequent subsets.

Table 5 provides information about the local measure value of the items in the itemsets that were ranked in the top 20 by share of total revenue. The itemset associated with each row in Table 5 corresponds to the itemset in the same row of Table 4(b). If we assume a minshare equal to 0.01 and MV equal to 16000, the minimum local measure value of a frequent itemset is 160. The local measure value of each of the top 20 itemsets exceeds this minimum. However, it happens that each of the itemsets contains one or more items for which the local measure value does not meet the minimum requirement. The infrequent items are marked with an asterisk in Table 5.

Table 5. Local measure values (lmv) of 3-itemsets ranked in the top 20 by total revenue. Item values below the minimum of 160 are marked with an asterisk.

Share rank    Item 1     Item 2     Item 3    Itemset
    1           0.60*    680.00     191.00     871.60
    2           5.70*      9.90*    477.35     492.95
    3          14.00*      8.40*    439.40     461.80
    4          12.00*      3.60*    398.15     413.75
    5          25.80*    252.00     120.00*    397.80
    6         196.00      96.00*     95.40*    387.40
    7          11.25*    252.00     120.00*    383.25
    8          66.00*     27.90*    281.25     375.15
    9           2.60*    252.00     120.00*    374.60
   10           1.95*    252.00     120.00*    373.95
   11          16.80*     19.20*    306.78     342.78
   12          29.10*     25.20*    258.75     313.05
   13          25.80*     11.25*    252.00     289.05
   14          25.80*      2.60*    252.00     280.40
   15           1.95*     25.80*    252.00     279.75
   16          46.00*     19.50*    213.75     279.25
   17          11.25*      2.60*    252.00     265.85
   18           1.95*     11.25*    252.00     265.20
   19           1.95*      2.60*    252.00     256.55
   20           0.60*      1.20*    247.50     249.30

To find frequent itemsets with infrequent subsets, we employ the following definition of share frequency.

Definition 12. An itemset X is share frequent, or SH-frequent, if SH(X) ≥ minshare, a user-defined minimum share value.

By defining frequency in terms of the share of the itemset as a whole, we lose the property of downward closure.

Theorem 2. Share frequency is not downward closed with respect to the lattice of all itemsets.

Proof: Here proof by counterexample is sufficient. Consider itemset ACD in Table 3. Assume a minshare equal to 0.20. SH(ACD) is 0.23 and thus ACD is a SH-frequent itemset. However, SH(A) is 0.06 and thus the 1-itemset A is not SH-frequent. Since A is a subset of ACD, share frequency based on the share of the itemset as a whole is not downward closed.

Existing algorithms for extracting frequent itemsets cannot find this class of share frequent itemsets, since they are designed to ignore them. However, since any itemset that is DC-frequent is share frequent, these algorithms provide a good starting point for the development of algorithms to find SH-frequent itemsets with infrequent subsets.
4. Description of algorithms
In this section, we describe five algorithms for extracting SH-frequent itemsets, including those with infrequent subsets. These algorithms can be classified as bottom-up and breadth-first. The first pass through the data collects information about all 1-itemsets that occur in the data set being examined. Summary information about the first pass is compiled, including MV, the total measure value associated with all 1-itemsets, and TCT, the total number of transactions. A set of candidate 2-itemsets is generated using the 1-itemsets, and information about these candidate 2-itemsets is collected in a second pass. This process of building candidate itemsets using the itemsets examined in the previous pass continues until no candidate itemsets are generated. In general, after the kth pass, information is available about the measure value and transaction count of each counted k-itemset.
We define Ck to be the set of candidate itemsets for the kth pass. For k > 1, the algorithms can be described at their coarsest level in the following manner:

1. while |Ck| > 0 do
2.   Process transactions to collect candidate itemset information
3.   k := k + 1
4.   Ck := GenerateCandidateItemsets(Ck−1)
5. end while
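The coarse driver loop above can be sketched in executable form. The fragment below is an illustrative Python skeleton, not the authors' implementation: all pruning is omitted (so it corresponds to the unpruned case), candidate generation uses the combination-generation method described next, and the "pass over the data" is a simple count over a toy database holding the items of Table 1:

```python
# Transactions of Table 1 reduced to item sets (presence only).
db = {
    "T1": {"A", "C", "D"}, "T2": {"C"}, "T3": {"A", "C", "D"},
    "T4": {"C"}, "T5": {"C", "D"}, "T6": {"C", "D"}, "T7": {"C"},
    "T8": {"A", "D"}, "T9": {"B", "C", "D"}, "T10": {"D"},
}

def count_pass(Ck):
    """One pass over the data: transaction count for each candidate."""
    return {X: sum(1 for t in db.values() if X <= t) for X in Ck}

def generate_candidates(prev):
    """Unpruned combination generation: join (k-1)-itemsets that
    differ only in their lexicographically last item."""
    tuples = sorted(tuple(sorted(X)) for X in prev)
    return {frozenset(a + (b[-1],))
            for a in tuples for b in tuples
            if a[:-1] == b[:-1] and a[-1] < b[-1]}

Ck = {frozenset({i}) for t in db.values() for i in t}  # pass 1: 1-itemsets
counts = {}
while Ck:                          # step 1 of the coarse driver
    counts.update(count_pass(Ck))  # step 2: collect candidate information
    Ck = generate_candidates(Ck)   # step 4: build the next level
print(len(counts))                 # 15: every itemset over A, B, C, D
print(counts[frozenset("ACD")])    # 2 transactions contain ACD
```

The loop terminates when combination generation yields no candidates, here after the single 4-itemset ABCD has been counted.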
The algorithms we describe differ in how potential candidate k-itemsets are generated from the (k − 1)-itemsets in Ck−1 and in the type of pruning that is carried out. These processes are encapsulated in the procedure GenerateCandidateItemsets. A discussion of this procedure is sufficient to describe the differences between algorithms. We use two of the earliest methods described for generating candidate itemsets (Agrawal and Srikant, 1994; Mannila et al., 1994), which we refer to as combination generation and item add-back generation (Barber and Hamilton, 2000). Combination generation is defined as the generation of k-itemsets through the process of combining itemsets in Ck−1 that differ only in their last item. Item add-back generation is defined as the generation of k-itemsets by adding, to each itemset in Ck−1, any item found in the first pass that is not contained in the itemset. In the absence of pruning, either method generates the same set of candidate itemsets. We represent the process of generating the next potential candidate itemset Ipc with an iterator procedure called GenerateNextItemset. The first call to the procedure returns the first generated itemset and repeated calls cycle through all possible generated itemsets. When no more itemsets can be generated, the procedure returns false. We investigate two types of pruning within GenerateCandidateItemsets. Pre-generation pruning prunes itemsets from Ck−1 using information obtained during pass k − 1. Typically, this type of pruning is done before any k-itemsets are generated. Pre-generation pruning is done in a procedure called PreGenPrune. In generation pruning, potential candidate k-itemsets are first generated and then pruned as required before they are added to Ck. Generation pruning is carried out in the procedure PruneGeneratedItemset. A general form of the procedure GenerateCandidateItemsets is written as:

GenerateCandidateItemsets(Ck−1)
1. PreGenPrune(Ck−1)
2. Ck := {}
3. while Ipc := GenerateNextItemset() do
4.   if PruneGeneratedItemset(Ipc) = false then
5.     Add Ipc to Ck
6.   end if
7. end while

4.1. A starting point for an algorithm space
One method of extracting all frequently occurring itemsets from a transaction database is an exhaustive counting of all possible combinations of the items that were found in the first pass over the data. Of course, such a search is impractical because of the tremendous number of itemsets that may have to be counted. If a transaction database has 2000 items, then there would be C(2000, 2) = 1,999,000 possible 2-itemsets, C(2000, 3) = 1,331,334,000 possible 3-itemsets, and so on. However, we use an exhaustive algorithm as a starting point and specialize it to create the algorithm space considered in our study. In the exhaustive algorithm, no pruning is done. Step 1 and Step 4 are not included in the procedure GenerateCandidateItemsets. The procedure simply generates all possible k-itemsets and adds them to Ck. If m is the number of items found in the first pass, the exhaustive algorithm generates and collects information about 2^m itemsets. All SH-frequent itemsets will be found.

4.2. Zero pruning (ZP) algorithm
Zero pruning is defined as the removal of any itemset Ii ∈ Ck−1 for which TC_Ii = 0, before candidate itemset generation is performed. Zero pruning the candidate set Ck−1 prevents the generation of k-itemsets from (k − 1)-itemsets that were not found in the data. For the ZP algorithm, the procedure PreGenPrune(Ck−1) is written as:

PreGenPrune(Ck−1)
1. foreach Ii ∈ Ck−1
2.   if TC_Ii = 0 then
3.     Remove Ii from Ck−1
4.   end if
5. end foreach
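As a sketch (illustrative Python; `TC` is assumed to be a dictionary of transaction counts collected during the previous pass), zero pruning is a single filter:

```python
def zero_prune(Ck_prev, TC):
    """Zero pruning: drop itemsets with a zero transaction count, so
    nothing is ever generated from itemsets absent from the data."""
    return {X for X in Ck_prev if TC[X] != 0}

# Transaction counts for the 2-itemsets of Table 1: only AB never occurs.
TC = {frozenset("AB"): 0, frozenset("AC"): 2, frozenset("AD"): 3,
      frozenset("BC"): 1, frozenset("BD"): 1, frozenset("CD"): 5}
print(len(zero_prune(set(TC), TC)))  # 5: AB is pruned
```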
For the ZP algorithm, step 4 is omitted from GenerateCandidateItemsets. The number of itemsets that will be generated and counted cannot exceed 2^m. All SH-frequent itemsets will be found.

4.3. Zero subset pruning (ZSP) algorithm
Even with zero pruning, it is still possible for a generated itemset to have a subset that was found to have a zero transaction count. Consider three items A, B, and E found in the data during pass 1. The 2-itemsets AB, AE, and BE are generated and added to C2. After pass 2, C2 is zero pruned. Assume TC_AB ≠ 0 and TC_AE ≠ 0, but TC_BE = 0. Combination generation will combine AB and AE to form the 3-itemset ABE. Since TC_BE = 0 and BE is a subset of ABE, ABE cannot exist in the data. Subset pruning prevents the inclusion in Ck of any generated itemset with a (k − 1) subset not found in Ck−1. We use ITi to represent the ith item of a potential candidate itemset Ipc and Is to represent a (k − 1) subset of Ipc. The procedure PruneGeneratedItemset is written as:

PruneGeneratedItemset(Ipc)
1. foreach ITi ∈ Ipc
2.   foreach ITj ∈ Ipc where i ≠ j
3.     add ITj to Is
4.   end foreach
5.   if Is ∉ Ck−1 then
6.     return true
7.   end if
8. end foreach
9. return false
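The subset check above can be sketched directly in Python (a hypothetical rendering, names ours; a return value of `True` means the generated itemset is pruned):

```python
from itertools import combinations

def prune_generated(itemset, prev_candidates):
    """Subset pruning: return True (prune) if any (k-1)-subset of the
    generated k-itemset is missing from the zero-pruned set C_{k-1}."""
    k = len(itemset)
    for subset in combinations(itemset, k - 1):
        if subset not in prev_candidates:
            return True    # a (k-1)-subset was not found in the data: prune
    return False           # every (k-1)-subset is present: keep
```

With the example above, ABE is pruned when BE has been zero pruned out of C2, and kept when BE is present.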
The number of itemsets that will be generated and examined cannot exceed the number of itemsets generated by the ZP algorithm. All SH-frequent itemsets will be found.

4.4. Share infrequency pruning (SIP)
Thus far, pruning has focused on eliminating itemsets that cannot be present in the data set. We now discuss pruning of itemsets that exist in the data set, but which have been deemed to be uninteresting according to a minimum share criterion specified by the user. Share infrequency pruning involves the removal of any itemset Ii in the candidate set Ck−1 whose actual share SH(Ii) is less than the share threshold (TSH) before candidate set Ck is generated. The pre-generation pruning procedure becomes:

PreGenPrune(Ck−1)
1. foreach Ii ∈ Ck−1
2.   if TC_Ii = 0 or SH(Ii) < TSH then
3.     remove Ii from Ck−1
4.   end if
5. end foreach
Subset pruning is implemented in the procedure PruneGeneratedItemset. If any subset of a generated itemset has a zero transaction count or was not a candidate itemset in the previous pass, the generated itemset will be pruned. The SIP algorithm behaves like the Apriori algorithm (Agrawal and Srikant, 1994), pruning infrequent itemsets from Ck−1, constructing candidate k-itemsets based on the (k − 1)-itemsets that were found to be frequent in the previous pass, and pruning any generated itemset with an infrequent subset. However, unlike the Apriori algorithm, we cannot guarantee that the SIP algorithm will find all SH-frequent itemsets, because the downward closure property does not apply.

4.5. Combine all counted algorithm (CAC)
After the kth pass through the data set, we have information about all k-itemsets in candidate set Ck. In the SIP algorithm, information about infrequent itemsets is discarded. However, some of the infrequent itemsets may be subsets of a larger SH-frequent itemset. In the CAC algorithm, we give the infrequent itemsets a second chance to contribute to a larger SH-frequent itemset. We utilize all information that is collected in the (k − 1) pass to generate
k-itemsets to add to Ck by delaying the infrequency pruning of the itemsets in Ck−1 until after the itemsets in Ck have been generated. Without further modification, the algorithm would display the same behavior as the ZSP algorithm. To prevent this, we use heuristic methods to calculate the predicted share of Ipc, denoted PSH(Ipc), and prune any generated itemset whose predicted share is less than the share threshold. A heuristic method for estimating the share of a generated itemset should provide a reasonable balance between accuracy and the amount of work that is required to apply the heuristic to the generated itemsets and then count the k-itemsets in the next pass over the data. In addition, the method should not require additional passes over the database to acquire information needed for prediction. The subset pruning in PruneGeneratedItemset already examines each of the (k − 1) subsets of Ipc to ensure that they exist in Ck−1. For each subset Is, there is a corresponding item IT that is a member of Ipc but not a member of Is. We use information about Is and IT to calculate the predicted share, since no additional work is required to determine their values. If Is exists in Ck−1, then its share and transaction count information were determined in the previous pass. While we know the item value of IT, we do not know its share and transaction count unless IT was determined to be SH-frequent after the first pass. To make this information available, we store first-pass information about all 1-itemsets, both SH-frequent and infrequent. Given an IT value, we can retrieve I_IT, the 1-itemset containing IT. For k greater than 1, information about infrequent itemsets is still discarded after candidate set Ck has been constructed. We calculate P(I), the probability that any single transaction contains an itemset I, using the equation

P(I) = TC_I / TC_T    (7)
Assuming a uniform distribution of the actual share over the transactions in which I occurs, the share value in each of these transactions is SH(I)/TC_I. PSH(I), the predicted share of I in any single transaction, is defined by the equation

PSH(I) = P(I) × SH(I)/TC_I + (1 − P(I)) × 0    (8)

Substituting Eq. (7) into Eq. (8) yields

PSH(I) = SH(I)/TC_T    (9)
To calculate the predicted share when an itemset, item pair occurs together, we use one of the equations

PSH = SH(IT) + (PSH(Is) × TC_IT)    (10)
PSH = SH(Is) + (PSH(IT) × TC_Is)    (11)
PSH = [(SH(IT) + PSH(Is) × TC_IT) + (SH(Is) + PSH(IT) × TC_Is)] / 2    (12)
Equation (10) is used if TC_IT is less than TC_Is, Eq. (11) is used if TC_IT is greater than TC_Is, and Eq. (12) is used if TC_IT is equal to TC_Is. These equations are encapsulated in the function GetPredictedShare. This function could easily be replaced by a different heuristic function. After the predicted value for each Is, IT pair is calculated, the average over all such pairs is compared to TSH. The PruneGeneratedItemset procedure becomes:

PruneGeneratedItemset(Ipc)
1. PSH(Ipc) := 0
2. SubsetCount := 0
3. foreach ITi ∈ Ipc
4.   foreach ITj ∈ Ipc where i ≠ j
5.     add ITj to Is
6.   end foreach
7.   if Is ∉ Ck−1 then
8.     return true
9.   end if
10.  PSH(Ipc) := PSH(Ipc) + GetPredictedShare(ITi, Is)
11.  SubsetCount := SubsetCount + 1
12. end foreach
13. if PSH(Ipc)/SubsetCount < TSH then
14.   return true
15. else
16.   return false
17. end if
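Following the worked example in Section 4.8, where the per-transaction share of an itemset X is taken as SH(X)/TC_X, Eqs. (10)-(12) can be sketched as follows (a hypothetical rendering; the function and argument names are ours):

```python
def get_predicted_share(sh_it, tc_it, sh_is, tc_is):
    """Predicted share of the k-itemset formed from subset Is and item IT,
    per Eqs. (10)-(12); sh_* and tc_* are share and transaction count."""
    eq10 = sh_it + (sh_is / tc_is) * tc_it   # Eq. (10): TC_IT < TC_Is
    eq11 = sh_is + (sh_it / tc_it) * tc_is   # Eq. (11): TC_IT > TC_Is
    if tc_it < tc_is:
        return eq10
    if tc_it > tc_is:
        return eq11
    return (eq10 + eq11) / 2                 # Eq. (12): equal counts
```

With the sample values used later, items A (SH = 6, TC = 3) and D (SH = 67, TC = 7), this reproduces the predicted share of about 34.7 for AD.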
4.6. Item add-back algorithm (IAB)
The IAB algorithm takes the approach that no item with a non-zero value is ever discarded forever. Zero and infrequency pruning is done prior to the generation of candidate itemsets. New itemsets are generated using item add-back generation. In the kth pass, each single item found in the first pass is added to each SH-frequent itemset from the (k − 1) pass. To prevent the counting of too many itemsets, we again use predictive pruning. The PSH value of an Is, IT pair is calculated using Eqs. (10), (11) or (12), as for the CAC algorithm. However, subset pruning is not used, because it would tend to counteract the effect that we are trying to achieve with the item add-back. The predicted share value is the average PSH of the (k − 1) subset, IT pairs that are available. Information about all counted 1-itemsets must be stored. The PruneGeneratedItemset procedure becomes:

PruneGeneratedItemset(Ipc)
1. PSH(Ipc) := 0
2. SubsetCount := 0
3. foreach ITi ∈ Ipc
4.   foreach ITj ∈ Ipc where i ≠ j
5.     add ITj to Is
6.   end foreach
7.   if Is ∈ Ck−1 then
8.     PSH(Ipc) := PSH(Ipc) + GetPredictedShare(ITi, Is)
9.     SubsetCount := SubsetCount + 1
10.  end if
11. end foreach
12. if PSH(Ipc)/SubsetCount < TSH then
13.   return true
14. else
15.   return false
16. end if

4.7. The algorithm space
Figure 1 shows the algorithm space that was considered in this study. The root algorithm is the exhaustive algorithm. The type of pruning added to create the other algorithms is shown on the edges between nodes.

4.8. The algorithms at work: An example
We use the sample database presented in Table 1 to provide an example of how the algorithms work. Figure 2 gives the itemset lattice for the sample data set.

[Figure 1. Algorithm space. Nodes: Exhaustive (root), ZP, ZSP, SIP, CAC, and IAB; each edge is labeled with the pruning condition added (TC = 0, TC_Is = 0, SH < TSH, PSH < TSH).]

[Figure 2. Itemset lattice. Each node shows the itemset's total measure value and transaction count, separated by a forward slash: A 6/3, B 1/1, C 26/8, D 67/7; AB 0/0, AC 5/2, AD 31/3, BC 2/1, BD 11/1, CD 50/5; ABC 0/0, ABD 0/0, ACD 23/2, BCD 12/1; ABCD 0/0.]

Each node in the lattice
is labeled with the itemset name. Below the itemset name are the total measure value of the itemset in all transactions and the number of transactions in which the itemset appears, separated by a forward slash. For our example, we will assume a share threshold of 0.20. The total measure value MV is equal to 100, so any itemset with a measure value greater than or equal to 20 is SH-frequent. SH-frequent itemsets are shaded in figure 2. The first pass through the data collects information about all itemsets that are present in the data set and the 1-itemsets C and D are identified as SH-frequent itemsets. With the ZP algorithm, all itemsets are generated and counted, thus all SH-frequent itemsets are found. With the ZSP algorithm, the 2-itemset AB is pre-generation pruned since TC_AB = 0. The supersets of AB, itemsets ABC, ABD and ABCD, are not generated or counted. All SH-frequent itemsets are found with slightly less work. With the SIP algorithm, infrequent items A and B are pre-generation pruned. Itemset CD is generated from the SH-frequent items in the candidate set from the first pass. After the second pass, we find that CD is SH-frequent. No 3-itemsets can be generated from a single 2-itemset and the algorithm terminates. The SIP algorithm misses SH-frequent itemsets AD and ACD, because item A was infrequency pruned after the first pass and therefore cannot exist in any larger itemsets. In the CAC algorithm, all counted 1-itemsets are used to generate 2-itemsets. As a result, all possible 2-itemsets that can be constructed from the m items that were found in the data are generated, and in fact this will always be the case. However, before any 2-itemset is added to Ck, we prune based on predicted share value. Consider itemset AD. SH(A) = 6, TC_A = 3 and SH(D) = 67, TC_D = 7. Since TC_A < TC_D, the predicted share PSH(AD) = 6 + 67(3/7) = 34.7. PSH(AD) exceeds the threshold value of 20 and the 2-itemset AD is added to Ck.
For itemset AB, SH(A) = 6, TC_A = 3 and SH(B) = 1, TC_B = 1. Since TC_B < TC_A, the predicted share PSH(AB) = 1 + 6(1/3) = 3.0. Since
PSH(AB) is less than the threshold value of 20, itemset AB is not added to Ck. Only AD and CD meet the condition of PSH > TSH, so they are the only itemsets counted in pass 2. The algorithm terminates after pass 2 because AD and CD cannot be used to generate a 3-itemset that will have all subsets existing in Ck−1. The CAC algorithm counted one more 2-itemset than the SIP algorithm and that was a SH-frequent itemset. However, the CAC algorithm still missed the SH-frequent 3-itemset ACD because one of its subsets, 2-itemset AC, was not counted in the second pass. The IAB algorithm generates all possible 2-itemsets with the exception of AB. AB is not generated since both A and B are infrequent. As for the CAC algorithm, we predict that itemsets AD and CD may be SH-frequent and count them on the second pass. After the second pass, the 3-itemsets ABD, ACD and BCD are generated by adding single items to the SH-frequent itemsets AD and CD. To determine PSH(ACD), we check all available 2-itemset, item pairs. Only 2-itemsets AD and CD exist in Ck−1, so we check C plus AD, as well as A plus CD. TC_C = 8 and SH(C) = 26, TC_AD = 3 and SH(AD) = 31. Since TC_AD < TC_C, the predicted share is calculated as PSH(ACD) = 31 + 26(3/8) = 40.8. TC_A = 3 and SH(A) = 6, TC_CD = 5 and SH(CD) = 50. Since TC_A < TC_CD, the predicted share PSH(ACD) = 6 + 50(3/5) = 36.0. The average PSH(ACD) = (40.8 + 36.0)/2 = 38.4. Since PSH(ACD) is greater than TSH, ACD is added to Ck. The predicted shares of ABD and BCD do not meet the minimum share requirement. The 4-itemset ABCD is generated after the third pass, but PSH(ABCD) is less than TSH, so the algorithm terminates. The IAB algorithm counts one 3-itemset that was not counted by either the SIP or CAC algorithms, and it was a SH-frequent itemset. Table 6 summarizes the data of the algorithms for the example data set.
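The pass-2 predictions in this example can be reproduced numerically. The sketch below is ours, not the paper's code: it applies the Eq. (10)/(11) rule to every 2-itemset buildable from the sample lattice values and keeps those whose predicted share meets the threshold of 20.

```python
from itertools import combinations

# (share, transaction count) of the 1-itemsets in the sample data set
items = {'A': (6, 3), 'B': (1, 1), 'C': (26, 8), 'D': (67, 7)}
T_SH = 20   # share threshold: 0.20 of the total measure value 100

def psh_pair(x, y):
    """Predicted share of pair {x, y}: Eq. (10)/(11), scaling the partner's
    share by the transaction-count ratio of the rarer item."""
    (sh_x, tc_x), (sh_y, tc_y) = items[x], items[y]
    if tc_x > tc_y:   # make x the item occurring in fewer transactions
        (sh_x, tc_x), (sh_y, tc_y) = (sh_y, tc_y), (sh_x, tc_x)
    return sh_x + sh_y * tc_x / tc_y

# 2-itemsets predicted to be SH-frequent, hence counted in pass 2
counted = [x + y for x, y in combinations('ABCD', 2)
           if psh_pair(x, y) >= T_SH]
```

Only AD and CD survive the predictive pruning, matching the walkthrough above.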
A "1" in a column labeled Gen indicates an itemset that was generated, and a "1" in a column labeled Cnt indicates an itemset that was counted. The rows containing SH-frequent itemsets are shaded.

Table 6. Example task summary (PSH = predicted share, ASH = actual share, NP = no prediction made).

Itemset | ZP Gen/Cnt | ZSP Gen/Cnt | SIP Gen/Cnt | CAC Gen/Cnt/PSH/ASH | IAB Gen/Cnt/PSH/ASH
AB      | 1/1        | 1/1         | 0/0         | 1/0/3.0/0           | 0/0/NP/0
AC      | 1/1        | 1/1         | 0/0         | 1/0/15.8/5          | 1/0/15.8/5
AD      | 1/1        | 1/1         | 0/0         | 1/1/34.7/31         | 1/1/34.7/31
BC      | 1/1        | 1/1         | 0/0         | 1/0/4.3/2           | 1/0/4.3/2
BD      | 1/1        | 1/1         | 0/0         | 1/0/10.6/11         | 1/0/10.6/11
CD      | 1/1        | 1/1         | 1/1         | 1/1/89.8/50         | 1/1/89.8/50
ABC     | 1/1        | 0/0         | 0/0         | 0/0/NP/0            | 0/0/NP/0
ABD     | 1/1        | 0/0         | 0/0         | 0/0/NP/0            | 1/0/11.3/0
ACD     | 1/1        | 1/1         | 0/0         | 0/0/NP/23           | 1/1/38.4/23
BCD     | 1/1        | 1/1         | 0/0         | 0/0/NP/12           | 1/0/11.0/12
ABCD    | 1/1        | 0/0         | 0/0         | 0/0/NP/0            | 1/0/12.5/0
Sum     | 11/11      | 8/8         | 1/1         | 6/2                 | 9/3

This example illustrates that a trade-off may exist between the amount of work that an algorithm does and its effectiveness. The ZP and ZSP algorithms found all SH-frequent itemsets, but counted all or most of the itemsets in the itemset lattice. The SIP algorithm performed very little work but missed two of the SH-frequent itemsets. In the following section, we discuss the criteria by which we evaluate the effectiveness of the algorithms.

5. Evaluation criteria
An algorithm for extracting frequent itemsets from a data set can be thought of as a method for classifying itemsets into two classes, frequent itemsets (positive instances) and infrequent itemsets (negative instances). Each itemset in a lattice is assigned to one of the classes, either directly or indirectly. Some itemsets are actually examined and some characteristic of the itemset, such as measured frequency, is used to classify it directly. This classification can then cascade through the lattice to classify supersets of the itemset indirectly. Our method of evaluating the performance of the algorithms borrows concepts from studies of classifier performance in machine learning. A confusion matrix (Kohavi and Provost, 1998) contains information about the actual and predicted classifications done by a classification system. Performance of such systems is commonly evaluated using the data in the matrix. Table 7 shows the confusion matrix for a two-class classifier. The entries in the confusion matrix have the following meaning in the context of our study: a is the number of correct predictions that an itemset is infrequent, b is the number of incorrect predictions that an itemset will be frequent, c is the number of incorrect predictions that an itemset will be infrequent, and d is the number of correct predictions that an itemset will be frequent. Several standard terms have been defined for the two-class matrix. The accuracy (AC) is the proportion of the total number of predictions that were correct. It is determined using the equation:

AC = (a + d) / (a + b + c + d)    (13)
The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified, as calculated using the equation:

TP = d / (c + d)    (14)

Table 7. Confusion matrix.

                    Predicted
Actual      Negative    Positive
Negative    a           b
Positive    c           d
The false positive rate (FP) is the proportion of negative cases that were incorrectly classified as positive, as calculated using the equation:

FP = b / (a + b)    (15)
The true negative rate (TN) is defined as the proportion of negative cases that were classified correctly, as calculated using the equation:

TN = a / (a + b)    (16)
The false negative rate (FN) is the proportion of positive cases that were incorrectly classified as negative, as calculated using the equation:

FN = c / (c + d)    (17)
Finally, precision (P) is the proportion of the predicted positive cases that were correct, as calculated using the equation:

P = d / (b + d)    (18)
Typically, the class distribution in an itemset lattice is unbalanced, with most of the itemsets being infrequent or not occurring at all. The accuracy determined using Eq. (13) might not be an adequate performance measure when the number of negative cases is much greater than the number of positive cases (Kubat et al., 1998). Suppose there are 1000 cases, 995 of which are negative cases and 5 of which are positive cases. If the system classifies them all as negative, the accuracy would be 99.5%, even though the classifier missed all positive cases. Other performance measures account for this by including TP in a product: for example, geometric mean (g-mean) (Kubat et al., 1998), as defined in Eqs. (19) and (20), and F-measure (Lewis and Gale, 1994), as defined in Eq. (21).

g-mean1 = √(TP × P)    (19)
g-mean2 = √(TP × TN)    (20)
F = ((β² + 1) × P × TP) / (β² × P + TP)    (21)
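Eqs. (13)-(21) translate directly into code. In the sketch below (ours, not from the paper), a ratio whose denominator is zero is taken to be 0, a convention the paper does not state:

```python
import math

def metrics(a, b, c, d, beta=1.0):
    """Confusion-matrix measures: a, b count actual negatives (correctly
    and incorrectly classified), c, d actual positives (missed, found)."""
    div = lambda n, m: n / m if m else 0.0   # zero-denominator convention
    tp = div(d, c + d)                       # recall / true positive rate
    fp = div(b, a + b)                       # false positive rate
    tn = div(a, a + b)                       # true negative rate
    p = div(d, b + d)                        # precision
    return {
        'AC': div(a + d, a + b + c + d),     # Eq. (13)
        'TP': tp, 'FP': fp, 'TN': tn,
        'FN': div(c, c + d),                 # Eq. (17)
        'P': p,
        'g-mean1': math.sqrt(tp * p),        # Eq. (19)
        'g-mean2': math.sqrt(tp * tn),       # Eq. (20)
        'F': div((beta ** 2 + 1) * p * tp, beta ** 2 * p + tp),  # Eq. (21)
    }
```

Applied to the 1000-case example above (a = 995, b = 0, c = 5, d = 0), accuracy is 0.995 while TP and both g-means are 0.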
In Eq. (21), β has a value from 0 to infinity and is used to control the weight assigned to TP and P. Any classifier evaluated using Eqs. (19), (20) or (21) will have a measure value of 0 if all positive cases are classified incorrectly. Another way to examine the performance of classifiers is to use an ROC graph (Swets, 1988). An ROC graph is a plot with the false positive rate on the X axis and the true positive rate on the Y axis. The point (0,1) is the perfect classifier: it classifies all positive cases and negative cases correctly. The point (0,0) represents a classifier that predicts all cases to
be negative, while the point (1,1) corresponds to a classifier that predicts every case to be positive. Point (1,0) is the classifier that is incorrect for all classifications. In many cases, a classifier has a parameter that can be adjusted to increase TP at the cost of an increased FP, or decrease FP at the cost of a decrease in TP. Each parameter setting provides a (TP, FP) pair and a series of such pairs can be used to plot an ROC curve. A non-parametric classifier is represented by a single ROC point, corresponding to its (TP, FP) pair. Our algorithms are non-parametric and thus produce a single ROC point for a particular data set. We also examine parametric forms of the CAC and IAB algorithms in Barber and Hamilton (2001). An ROC graph has several desirable characteristics. An ROC curve or point is independent of class distribution or error costs (Provost et al., 1998). An ROC graph encapsulates all information contained in the confusion matrix, since FN is the complement of TP and TN is the complement of FP (Swets, 1988). ROC curves provide a visual tool for examining the tradeoff between the ability of a classifier to correctly identify positive cases and the number of negative cases that are incorrectly classified. It has been suggested that the area beneath an ROC curve can be used as a measure of accuracy in many applications (Swets, 1988). Provost and Fawcett (1997) argue that using classification accuracy to compare classifiers is not adequate unless cost and class distributions are completely unknown and a single classifier must be chosen to handle any situation. They propose a method of evaluating classifiers using an ROC graph and imprecise cost and class distribution information. Frequent itemset extraction algorithms must have sufficient generality that they can be applied to any transaction database, and so we choose not to include assumptions regarding cost and class distributions in our analysis.
A final way of comparing ROC points is by using an equation that equates accuracy with the Euclidean distance from the perfect classifier, point (0,1) on the graph. We include a weight factor that allows us to define relative misclassification costs, if such information is available. We define ACd as a distance-based performance measure for an ROC point and calculate it using the equation:

ACd = 1 − √(W × (1 − TP)² + (1 − W) × FP²)    (22)
where W is a factor, ranging from 0 to 1, that is used to assign relative importance to false positives and false negatives. ACd ranges from 1 for the perfect classifier to 0 for a classifier that classifies all cases incorrectly, point (1,0) on the ROC graph. ACd differs from g-mean1, g-mean2 and F-measure in that it is equal to zero only if all cases are classified incorrectly. In other words, a classifier evaluated using ACd gets some credit for correct classification of negative cases, regardless of its accuracy in correctly identifying positive cases. In our application, counting an infrequent itemset is an important cost of incorrect negative case classification, regardless of the specific data against which the algorithm is applied. Consider two algorithms A and B that perform adequately against most data sets. However, assume both A and B misclassify all positive cases in a particular data set, and that A classifies 10 times as many infrequent itemsets as potentially frequent as B does. Algorithm B is the better algorithm in this case because there has been less wasted effort counting infrequent itemsets.
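Equation (22) is easily checked in code (a sketch, names ours). With W = 0.5 it reproduces the ACd value of 0.2929 reported in Table 9 for the ZP algorithm, which has TP = FP = 1:

```python
import math

def ac_d(tp, fp, w=0.5):
    """Distance-based accuracy, Eq. (22): 1 minus the weighted Euclidean
    distance from the perfect classifier at ROC point (FP, TP) = (0, 1)."""
    return 1 - math.sqrt(w * (1 - tp) ** 2 + (1 - w) * fp ** 2)
```

The perfect classifier scores 1, the all-wrong classifier at (1,0) scores 0, and any classifier in between gets partial credit.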
[Figure 3. Specification of task restriction by concept.]

6. Software implementation
Experimental results were obtained using a software program called CItemset. CItemset has been developed to extract characterized itemsets (Hilderman et al., 1998) from transaction databases. CItemset employs a Windows 95/NT GUI for task specification and display of results. Its design facilitates a focused and interactive search for frequent itemsets in transaction data. Task specification is accomplished using a series of dialog boxes. Groups of customers or items can be easily included in or excluded from a task using concept hierarchies, which are tree structures defined by domain experts that group database values for an attribute into more general concepts. Figure 3 provides an example of the dialog box used to specify task restrictions by concept. The leaves of the tree are concepts corresponding to values that exist in the database for the Province column. The remainder of the tree contains user-defined concepts that become more general as they approach the root of the tree. If, for example, the user wishes to examine transaction information only for customers in western Canada, the concept "Western" can be selected and CItemset takes care of the mapping to actual database values. More than one hierarchy can be defined for a column, allowing the user to easily switch his view of the data. After each pass over the data, frequent itemsets are displayed in a grid. The user can sort on various columns, delete itemsets that are not relevant to the current task, or change threshold values prior to initiating the next pass over the data. Figure 4 shows an example of the grid for SH-frequent 2-itemsets. The underlying code is written in C++, with all data structures implemented using Standard Template Library (STL) containers. An itemset object contains a set of pointers to item objects. An itemset list object contains a set of pointers to itemset objects.
The insertion and find operations for STL set containers are O(log N), where N is the number of items in the set.

7. Experimental results
[Figure 4. Frequent itemset grid.]

All experiments were carried out using a 450 MHz Pentium II PC with 384 MB of RAM. Our test data were drawn from an eight million record customer database provided to us by
a commercial partner. Transaction information is contained in a three million record table representing the purchase of 2200 unique items by over 500,000 customers. Each record consists of an account number, a code identifying an equipment purchase, an equipment rental or a service purchased by a customer, a value indicating the quantity purchased, the unit cost and the total cost. Six discovery tasks were performed on different subsets of the data. The database data was restricted by both customer segment and item groups. This enabled us to obtain baseline information using the ZSP algorithm, which is guaranteed to find all SH-frequent itemsets but requires exponential time. By knowing the true answer, we were able to evaluate the results of our heuristic methods with confidence. A share threshold of 0.02 was used, since this produced tasks that were small enough to run the ZSP algorithm on but not too small to be interesting. The segment and item group restrictions were varied so that each task examined a different subset of the transaction data. Information about the tasks is contained in Table 8.

Table 8. Task metrics.

Task identifier | Transaction count | Record count
T1              |   599             |  1844
T2              |  2058             |  4373
T3              |  2374             |  7051
T4              |  5092             | 24725
T5              | 14257             | 43700
T6              | 30901             | 81044

[Figure 5. ROC graphs for tasks. (a) T1; (b) T2; (c) T3; (d) T4; (e) T5; (f) T6. Each panel plots true positives (TP) against false positives (FP) for the ZP, ZSP, SIP, IAB and CAC algorithms.]

Figure 5 shows the ROC graphs for the six tasks. The ROC points on the graphs indicate that the CAC and IAB algorithms performed well. With a single exception, the ROC points for the algorithms plot in the upper third of the graph. In Task 3, both algorithms found all SH-frequent itemsets. The ROC points for the CAC, IAB and SIP algorithms always plot
very close to the left side of the graph, indicating that they count very few of the possible infrequent itemsets. In every task, the ROC points for the CAC and IAB algorithms are closer to the upper left-hand corner of the graph than those for the SIP, ZSP and ZP algorithms. Recall that this corner corresponds to the perfect predictor. The ZSP algorithm confirmed theoretical expectations by finding 100% of the SH-frequent itemsets at the cost of counting
many itemsets, more than half for all tasks run. In Task 4 and Task 5, it counted more than 90% of the possible infrequent itemsets. The ZP algorithm found all SH-frequent itemsets by counting all possible itemsets. So that we may examine the individual tasks in greater detail, we provide the values of the confusion matrix terms for the six tasks in Table 9.

Table 9. Confusion matrix terms.

Task   | Alg. | TP     | FP     | P      | AC     | g-mean1 | F      | g-mean2 | ACd
T1 (3) | ZP   | 1.0000 | 1.0000 | 0.0009 | 0.0009 | 0.0302  | 0.0018 | 0.0000  | 0.2929
       | ZSP  | 1.0000 | 0.6117 | 0.0015 | 0.3889 | 0.0386  | 0.0030 | 0.6232  | 0.5675
       | SIP  | 0.5263 | 0.0013 | 0.2778 | 0.9983 | 0.3824  | 0.3636 | 0.7250  | 0.6651
       | CAC  | 0.6757 | 0.0040 | 0.1323 | 0.9857 | 0.2990  | 0.2212 | 0.8203  | 0.7705
       | IAB  | 0.6757 | 0.0128 | 0.0459 | 0.9869 | 0.1761  | 0.0859 | 0.8167  | 0.7707
T2 (5) | ZP   | 1.0000 | 1.0000 | 0.0016 | 0.0016 | 0.0402  | 0.0032 | 0.0000  | 0.2929
       | ZSP  | 1.0000 | 0.5203 | 0.0031 | 0.4806 | 0.0556  | 0.0062 | 0.6926  | 0.6321
       | SIP  | 0.6410 | 0.0013 | 0.4386 | 0.9981 | 0.5302  | 0.5208 | 0.8001  | 0.7462
       | CAC  | 0.7949 | 0.0036 | 0.2605 | 0.9960 | 0.4550  | 0.3924 | 0.8899  | 0.8549
       | IAB  | 0.7949 | 0.0145 | 0.0814 | 0.9852 | 0.2543  | 0.1476 | 0.8851  | 0.8546
T3 (4) | ZP   | 1.0000 | 1.0000 | 0.0111 | 0.0111 | 0.1052  | 0.0219 | 0.0000  | 0.2929
       | ZSP  | 1.0000 | 0.6919 | 0.0159 | 0.3158 | 0.1262  | 0.0314 | 0.5551  | 0.5108
       | SIP  | 0.8214 | 0.0042 | 0.6866 | 0.9939 | 0.7510  | 0.7480 | 0.9044  | 0.8737
       | CAC  | 1.0000 | 0.0216 | 0.3415 | 0.9786 | 0.5843  | 0.5091 | 0.9891  | 0.9847
       | IAB  | 1.0000 | 0.0688 | 0.1400 | 0.9320 | 0.3742  | 0.2456 | 0.9650  | 0.9514
T4 (7) | ZP   | 1.0000 | 1.0000 | 0.0004 | 0.0004 | 0.0210  | 0.0009 | 0.0000  | 0.2929
       | ZSP  | 1.0000 | 0.9419 | 0.0005 | 0.0585 | 0.0216  | 0.0009 | 0.2410  | 0.3339
       | SIP  | 0.1905 | 0.0001 | 0.4000 | 0.9995 | 0.2760  | 0.2581 | 0.4364  | 0.4276
       | CAC  | 0.3175 | 0.0003 | 0.3252 | 0.9994 | 0.3213  | 0.3213 | 0.5634  | 0.5174
       | IAB  | 0.8889 | 0.0107 | 0.0352 | 0.9892 | 0.1770  | 0.0678 | 0.9377  | 0.9211
T5 (6) | ZP   | 1.0000 | 1.0000 | 0.0009 | 0.0009 | 0.0301  | 0.0018 | 0.0000  | 0.2929
       | ZSP  | 1.0000 | 0.9486 | 0.0009 | 0.0522 | 0.0307  | 0.0019 | 0.2267  | 0.3292
       | SIP  | 0.6914 | 0.0004 | 0.6022 | 0.9993 | 0.6452  | 0.6437 | 0.8313  | 0.7818
       | CAC  | 0.8395 | 0.0014 | 0.3560 | 0.9985 | 0.5467  | 0.5000 | 0.9156  | 0.8865
       | IAB  | 0.8765 | 0.0058 | 0.1210 | 0.9941 | 0.3256  | 0.2126 | 0.9335  | 0.9126
T6 (8) | ZP   | 1.0000 | 1.0000 | 0.0004 | 0.0004 | 0.0202  | 0.0008 | 0.0000  | 0.2929
       | ZSP  | 1.0000 | 0.5021 | 0.0008 | 0.4981 | 0.0286  | 0.0016 | 0.7056  | 0.6450
       | SIP  | 0.8000 | 0.0003 | 0.5106 | 0.9996 | 0.6391  | 0.6234 | 0.8943  | 0.8586
       | CAC  | 0.9333 | 0.0004 | 0.2667 | 0.9989 | 0.4989  | 0.4148 | 0.9644  | 0.9529
       | IAB  | 0.9333 | 0.0036 | 0.2285 | 0.9964 | 0.4618  | 0.3671 | 0.9659  | 0.9528
Best   |      | ZSP, ZP | SIP   | SIP    | SIP    | SIP     | SIP    | IAB, CAC | IAB, CAC
Sig    |      | α = 0.01 | α = 0.05 | α = 0.01 | α = 0.05 | α = 0.01 | α = 0.01 | α = 0.05 | α = 0.05

The second last row in the table
indicates which algorithm performed the best for the column term, and the last row indicates the significance level associated with a paired t-test analysis. The TP and FP values in Table 9 correspond to the ROC points plotted in figure 5. The CAC and IAB algorithms found more SH-frequent itemsets than the SIP algorithm in every task and score better according to the g-mean2 and ACd measures. The SIP algorithm scores better when the precision-based measures g-mean1 and F-measure are considered, since a larger proportion of the itemsets that it predicted would be SH-frequent were in fact SH-frequent. The greater precision of the SIP algorithm indicates that a solid foundation for potentially SH-frequent itemsets can be created from itemsets that we already know are SH-frequent. The SIP algorithm uses only this information, constructing candidate itemsets by combining SH-frequent itemsets from the previous pass and then ensuring that each (k − 1) subset is a SH-frequent itemset. However, the SIP algorithm never finds any SH-frequent itemset that has infrequent subsets. The CAC and IAB algorithms found every itemset found by the SIP algorithm and, because they also consider itemsets with less solid foundations, they are able to discover SH-frequent itemsets with infrequent subsets. In general, the ability of the IAB and CAC algorithms to find SH-frequent itemsets was very similar. They found the same SH-frequent itemsets in four of the six tasks and differed by a single 3-itemset in the fifth. An obvious exception occurred in T4, where the IAB algorithm found 72 more of the 126 SH-frequent itemsets than the CAC algorithm. An examination of the data showed that both the IAB and CAC algorithms missed the same three SH-frequent 2-itemsets, having predicted that they would not be SH-frequent. The difference in the number of SH-frequent itemsets that are discovered in subsequent passes can be attributed to a fundamental difference between the IAB and CAC algorithms.
Once the CAC algorithm has predicted that an itemset will not be SH-frequent and has discarded it, no superset of the itemset can ever be counted. The three incorrectly classified 2-itemsets, for example, were subsets of 25 larger SH-frequent itemsets. The IAB algorithm, on the other hand, by giving every item a chance to contribute in the construction of the set of candidate itemsets for each pass, was able to recover from its errors and discover 21 of these SH-frequent itemsets. The remaining 47 SH-frequent itemsets that were missed by the CAC algorithm contained subsets that were not counted because they were correctly predicted to be infrequent. In general, predicted share values were found to be higher than actual share values. Figure 6(a) shows the results for a typical case. Figure 6(b) is a close-up of the lower right hand corner of the plot. The darker horizontal and vertical lines that separate the plots into four quadrants correspond to SH = 0.02 and PSH = 0.02. Recall that 0.02 is the share threshold that was used for the task. Each of the four quadrants represents one entry in the confusion matrix, as indicated by the labels in figure 6(b). The dashed line corresponds to PSH = SH, the line of perfect prediction. A majority of the points on the plots fall below the dashed line of perfect prediction, indicating that the predicted values were higher than the actual values. Given the uncertainty in the prediction process, we felt that predicting consistently high share values was preferable to a better predictor that resulted in more missed SH-frequent itemsets, provided that the false positive rate remained fairly low. The greater ability of the CAC and IAB algorithms to find SH-frequent itemsets compared to SIP is achieved at the cost of counting more infrequent itemsets, although the proportion of infrequent itemsets counted by all of them is
180
BARBER AND HAMILTON
Figure 6. SH versus PSH, CAC 2-itemsets (task T1): (a) SH versus PSH, CAC 2-itemsets; (b) lower right hand corner of (a). Both panels plot actual share (SH) against predicted share (PSH); the labels a–d in panel (b) mark the four confusion matrix quadrants.
typically small. The SIP algorithm has the smallest false positive rate in all tasks. However, in four of the six tasks the SIP algorithm did not carry out enough passes over the data, resulting in missed SH-frequent itemsets. In all tasks, the IAB algorithm counted more itemsets that were found to be infrequent than the CAC algorithm did. The difference in the false positive rate is particularly high for T3. In this task, the CAC algorithm carried out two fewer passes over the data and missed 10% of the SH-frequent itemsets as a result. In two of the six tasks, the IAB algorithm did an additional pass over the data when no SH-frequent itemsets remained to be found. The primary reason for the increased false positive rate of the IAB algorithm is that it does not perform subset pruning. As a result, it tends to count itemsets whose subsets have been pruned.

A series of experiments was performed to show that the primary effect of counting more infrequent itemsets is increased run time. Table 10 shows the total runtimes measured for
Table 10.  Total runtimes for algorithms.

Task identifier   Metric        ZSP      IAB     CAC     SIP
T1                Runtime       684       75      69      46
                  Counted     24826      545     189      72
T2                Runtime      2063      732     621     431
                  Counted     12606      381     119      57
T3                Runtime       833      410     416     272
                  Counted      3516      396     164      67
T4                Runtime      7544      403     198     202
                  Counted    268439     3178     123      60
T5                Runtime      7176      789     551     448
                  Counted     85671      587     191      93
T6                Runtime     17377     1387    1021    1004
                  Counted     36797      288     105      47
181
EXTRACTING SHARE FREQUENT ITEMSETS
the algorithms in seconds and the total number of itemsets counted. Runtimes were obtained by recording the elapsed time on a dedicated machine. The total runtime includes the time to prepare the database, generate the candidate sets, process the transactions and extract the frequent itemsets from the set of counted itemsets. The database preparation time is, of course, unaffected by the algorithm being used. However, the number of passes over the data varied with the algorithm, so the inclusion of database preparation time in the total runtime presents a more accurate picture of the overall performance of the various algorithms. Generally, the total runtime increases with the number of itemsets counted by an algorithm.

Our experiments showed that the time to carry out a pass is dominated by the time to process the transactions. Table 11 provides a breakdown of the measured runtime, excluding database setup, for each pass of T5. These results are representative of the results obtained for all tasks. A dash in the Pass 1 column indicates that no candidate set is generated for the first pass; a dash in any other column indicates that the corresponding algorithm did not perform the pass. The values in Table 11 show that the time to process the transaction database increases with the number of itemsets counted. However, other factors also contribute. The time to process transactions depends on the number of transactions, the size of a candidate itemset, the number of candidate itemsets and the number of items in the transactions. In our implementation, we search for each candidate itemset in every transaction, provided that the transaction has enough items, rather than constructing itemsets from the items in the transaction and then searching the candidate set for the resulting itemsets. If T represents
Table 11.  Breakdown of runtime for task T5.

                               Pass 1    Pass 2    Pass 3    Pass 4    Pass 5    Pass 6
Generate candidates
  ZSP                          –         0.19      1.10      5.408     24.26     92.83
  IAB                          –         0.16      0.88      0.51      0.15      2.60
  CAC                          –         0.40      0.08      0.11      0.03      –
  SIP                          –         0.01      0.01      0.00      –         –
Process transactions
  ZSP                          34.89     756.87    991.38    1441.14   1809.82   1626.69
  IAB                          34.79     75.60     97.01     75.45     44.02     31.05
  CAC                          34.26     79.04     80.65     43.67     26.95     –
  SIP                          34.10     71.00     42.57     27.88     –         –
Extract SH-frequent itemsets
  ZSP                          0.00      0.00      0.02      0.04      0.13
  IAB                          0.00      0.01      0.01      0.01      0.00      0.00
  CAC                          0.00      0.01      0.01      0.01      0.00      –
  SIP                          0.00      0.00      0.00      0.00      –         –
Number of itemsets counted
  ZSP                          135       861       2301      7244      21636     53629
  IAB                          135       46        139       213       148       41
  CAC                          135       46        87        53        5         –
  SIP                          135       45        34        14        –         –
Figure 7. Elapsed time versus total pass search count for task T5. The plot shows elapsed time (sec) against the total pass search count for each pass carried out by the ZSP, SIP, IAB and CAC algorithms.
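The candidate-matching strategy described above, searching for each candidate itemset in every transaction that has enough items, can be sketched in Python as follows. This is a minimal sketch with our own (hypothetical) function and data-structure names; the paper's algorithms additionally accumulate the measure values needed to compute share, not just occurrence counts.

```python
def count_candidates(transactions, candidates):
    """Count each candidate itemset by searching for its items in every
    sufficiently large transaction, rather than enumerating the subsets
    of each transaction and looking them up in the candidate set."""
    counts = {c: 0 for c in candidates}
    for transaction in transactions:
        t = set(transaction)
        for itemset in candidates:
            # Skip transactions too small to contain the candidate.
            if len(t) < len(itemset):
                continue
            # The candidate matches only if every one of its items occurs.
            if all(item in t for item in itemset):
                counts[itemset] += 1
    return counts

# Toy data: {a,b} occurs in two transactions, {a,c} in one.
counts = count_candidates(
    transactions=[["a", "b", "c"], ["a", "b"], ["b", "c"]],
    candidates=[frozenset({"a", "b"}), frozenset({"a", "c"})],
)
```

Each candidate-item membership test here corresponds to one of the item searches whose total, per pass, is the quantity plotted in Figure 7.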
the set of items in a transaction, then the procedure to process the transaction can be written as follows.

1.  foreach Ii ∈ Ck
2.    CandidateFound := true
3.    foreach ITj ∈ Ii
4.      if ITj ∉ T then
5.        CandidateFound := false
6.        break
7.      end if
8.    end foreach
9.    if CandidateFound = true then
10.     update Ii
11.   end if
12. end foreach

Thus, for each transaction, we must search for each item of each candidate itemset in the set of items for that transaction. The total number of searches into the sets of transaction items encapsulates the four factors previously mentioned. Figure 7 is a plot of the elapsed time to process transactions versus the total number of item searches for each pass in T5. The plot indicates a linear relationship between the searches for items in the transactions and the elapsed time to process all transactions for a pass. Any factor that increases the number of item searches will directly increase the time to process the transactions, so algorithms that count more itemsets will take longer to complete.

8.  Conclusions and future research
We showed that the share measure can provide useful information about numerical values typically associated with transaction items, which the support measure cannot. We defined the
problem of finding share frequent itemsets, showing that share frequency does not have the property of downward closure when it is defined in terms of the itemset as a whole. We presented algorithms that do not rely on the property of downward closure and thus are able to find share frequent (SH-frequent) itemsets that have infrequent subsets. Using heuristic methods, we generate candidate itemsets by supplementing the information contained in the set of SH-frequent itemsets from a previous pass with other information that is available at no additional processing cost. These algorithms count only those generated itemsets predicted to be SH-frequent.

We demonstrated the effectiveness of the algorithms by applying them to a commercial transaction database. The performance of the algorithms was examined using machine learning principles of classifier evaluation; this technique for evaluating itemset algorithms is a novel contribution of this research. Although the algorithms did not always find all SH-frequent itemsets, they found a large proportion of the SH-frequent itemsets while counting only a small proportion of all possible itemsets. Overall, the CAC and IAB algorithms performed best in our experiments.

Future research will include the application of the algorithms to other databases, both real and synthetically generated. We will investigate other methods of share value prediction and examine how the use of different predictors affects the effectiveness of the algorithms. Methods of analyzing relationships between the measure values of items when they occur in the same itemset are under development. A threshold factor can be incorporated into the CAC and IAB algorithms to increase the number of frequent itemsets found at the cost of an increase in the number of infrequent itemsets examined (Barber and Hamilton, 2001). Further investigation of the resulting versions of the CAC and IAB algorithms is warranted.
Acknowledgments

We thank the anonymous reviewers for their comments, in particular the fruitful suggestion that we develop better algorithms than our original algorithm (SIP). This research was supported by the Natural Sciences and Engineering Research Council of Canada via a research grant, a Strategic Project Grant, and a collaborative research and development grant.

References

Agrawal, R., Imielinski, T., and Swami, A. 1993. Mining association rules between sets of items in large databases. In Proc. ACM SIGMOD Int. Conf. on the Management of Data, Washington, D.C., pp. 207–216.
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, A.I. 1996. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), Menlo Park, California, pp. 307–328.
Agrawal, R. and Shafer, J.C. 1996. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6):962–969.
Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. In Proc. Twentieth Int. Conf. on Very Large Databases, Santiago, Chile, pp. 487–499.
Ali, K., Manganaris, S., and Srikant, R. 1997. Partial classification using association rules. In Proc. Third Int. Conf. on Knowledge Discovery in Databases and Data Mining, Newport Beach, California, pp. 115–118.
Barber, B. and Hamilton, H.J. 2000. Algorithms for mining share frequent itemsets containing infrequent subsets. In Proc. Fourth European Conf. on Principles of Knowledge Discovery in Databases, Lyon, France, pp. 316–324.
Barber, B. and Hamilton, H.J. 2001. Parametric algorithms for mining share frequent itemsets. Journal of Intelligent Information Systems, 16(3):277–293.
Bayardo, R.J., Agrawal, R., and Gunopulos, D. 1999. Constraint based rule mining in large dense databases. In Proc. 15th Int. Conf. on Data Engineering, Sydney, Australia, pp. 188–197.
Brin, S., Motwani, R., and Silverstein, C. 1997a. Beyond market baskets: Generalizing association rules to correlations. In Proc. ACM SIGMOD Int. Conf. on the Management of Data, New York, pp. 265–276.
Brin, S., Motwani, R., Ullman, J.D., and Tsur, S. 1997b. Dynamic itemset counting and implication rules for market basket data. In Proc. ACM SIGMOD Int. Conf. on the Management of Data, New York, pp. 255–264.
Buchter, O. and Wirth, R. 1998. Discovery of association rules over ordinal data: A new and faster algorithm and its application to market basket data. In Proc. Second Pacific-Asia Conf. on Knowledge Discovery and Data Mining, Melbourne, Australia, pp. 36–47.
Cai, C.H., Fu, A., Cheng, C.H., and Kwong, W.W. 1998. Mining association rules with weighted items. In Proc. of IEEE Int. Database Engineering and Applications Symposium, Cardiff, United Kingdom, pp. 68–77.
Carter, C.L., Hamilton, H.J., and Cercone, N. 1997. Share based measures for itemsets. In Proc. First European Conf. on the Principles of Data Mining and Knowledge Discovery, Trondheim, Norway, pp. 14–24.
Cheung, D.W., Ng, V.T., Fu, A.W., and Fu, Y. 1996. Efficient mining of association rules in distributed databases. IEEE Transactions on Knowledge and Data Engineering, 8(6):911–922.
Frawley, W.J., Piatetsky-Shapiro, G., and Matheus, C.J. 1991. Knowledge discovery in databases: An overview. In Knowledge Discovery in Databases, G. Piatetsky-Shapiro and W.J. Frawley (Eds.), Menlo Park: AAAI/MIT Press, pp. 1–27.
Han, J. and Fu, Y. 1995. Discovery of multiple-level association rules from large databases. In Proc. Int. Conf. on Very Large Databases, Zurich, Switzerland, pp. 420–431.
Hidber, C. 1999. Online association rule mining. In Proc. ACM SIGMOD Int. Conf. on Management of Data, Philadelphia, Pennsylvania, pp. 145–156.
Hilderman, R.J., Carter, C., Hamilton, H.J., and Cercone, N. 1998. Mining association rules from market basket data using share measures and characterized itemsets. Int. J. of Artificial Intelligence Tools, 7(2):189–220.
Hipp, J., Myka, A., Wirth, R., and Güntzer, U. 1998. A new algorithm for faster mining of generalized association rules. In Proc. Second European Symposium on Principles of Data Mining and Knowledge Discovery, Nantes, France, pp. 74–82.
Kohavi, R. and Provost, F. 1998. Glossary of terms. Machine Learning, 30(2):271–274.
Koperski, K. and Han, J. 1995. Discovery of spatial association rules in geographic information databases. In Proc. Fourth Int. Symposium on Large Spatial Databases, Portland, Maine, pp. 47–66.
Kubat, M., Holte, R.C., and Matwin, S. 1998. Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30(1):195–215.
Lewis, D.D. and Gale, W.A. 1994. A sequential algorithm for training text classifiers. In Proc. 17th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, Dublin, Ireland, pp. 3–12.
Lin, T.Y., Yao, Y.Y., and Louie, E. 2002. Value added association rules. In Proc. Sixth Pacific-Asia Conf. on Knowledge Discovery and Data Mining, Taipei, Taiwan, pp. 328–333.
Lu, S., Hu, H., and Li, F. 2001. Mining weighted association rules. Intelligent Data Analysis, 5(3):211–225.
Mannila, H., Toivonen, H., and Verkamo, A.I. 1994. Efficient algorithms for discovering association rules. In Proc. 1994 AAAI Workshop on Knowledge Discovery in Databases, Seattle, Washington, pp. 144–155.
Masand, B. and Piatetsky-Shapiro, G. 1996. A comparison of approaches for maximizing business payoff of predictive models. In Proc. Second Int. Conf. on Knowledge Discovery and Data Mining, Portland, Oregon, pp. 195–201.
Megiddo, N. and Srikant, R. 1998. Discovering predictive association rules. In Proc. Fourth Int. Conf. on Knowledge Discovery and Data Mining, New York, pp. 274–278.
Park, J.S., Chen, M., and Yu, P. 1995. An effective hash-based algorithm for mining association rules. In Proc. ACM SIGMOD Int. Conf. on the Management of Data, San Jose, California, pp. 175–186.
Pei, J., Han, J., and Lakshmanan, L.V.S. 2001. Mining frequent itemsets with convertible constraints. In Proc. 2001 Int. Conf. on Data Engineering, Heidelberg, Germany, pp. 433–442.
Provost, F. and Fawcett, T. 1997. Analysis and visualization of classifier performance: Comparison under imprecise class and cost distribution. In Proc. Third Int. Conf. on Knowledge Discovery and Data Mining, Newport Beach, California, pp. 43–48.
Provost, F., Fawcett, T., and Kohavi, R. 1998. Building the case against accuracy estimation for comparing induction algorithms. In Proc. Fifteenth Int. Conf. on Machine Learning, Madison, Wisconsin, pp. 445–453.
Silverstein, C., Brin, S., and Motwani, R. 1998. Beyond market baskets: Generalizing association rules to dependence rules. Data Mining and Knowledge Discovery, 2(1):39–68.
Srikant, R. and Agrawal, R. 1996. Mining quantitative association rules in large relational tables. In Proc. ACM SIGMOD Conf. on the Management of Data, Montreal, Canada, pp. 1–12.
Swets, J.A. 1988. Measuring the accuracy of diagnostic systems. Science, 240:1285–1293.
Zaki, M.J., Parthasarathy, S., Ogihara, M., and Li, W. 1997. New algorithms for fast discovery of association rules. In Proc. Third Int. Conf. on Knowledge Discovery and Data Mining, Newport Beach, California, pp. 283–286.