A Novel Method for Mining Class Association Rules with Itemset Constraints
Dang Nguyen1,2, Bay Vo1,2, and Bac Le3
1 Division of Data Science and 2 Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Viet Nam
[email protected], [email protected]
3 Computer Science Department, University of Science, VNU - Ho Chi Minh, Viet Nam
[email protected]
Abstract. Mining class association rules with itemset constraints is very popular in mining medical datasets. For example, when classifying which populations are at high risk for HIV infection, epidemiologists often concentrate on rules which include demographic information such as sex, age, and marital status in the rule antecedents. However, the two existing methods, post-processing and pre-processing, require much time and effort. In this paper, we propose a lattice-based approach for efficiently mining class association rules with itemset constraints. We first build a lattice structure to store all frequent itemsets. We then use paternity relations among nodes to discover rules satisfying the constraint without re-building the lattice. The experimental results show that our proposed method outperforms other methods in mining time.
Keywords: Associative classification, Class association rule, Data mining, Useful rules, Lattice.
1 Introduction
Mining class association rules (CARs) with itemset constraints is one of the variations of rule mining, which also includes mining association rules [1], mining CARs [2], and mining sequential rules [3]. There are two basic methods to discover CARs with itemset constraints: post-processing and pre-processing. The post-processing method first mines the complete set of CARs by using an algorithm such as CBA [4], ECR-CARM [5], CAR-Miner [6], or PMCAR [2] and then, in the post-processing step, filters out the ones which do not satisfy the itemset constraint. One example of this strategy is CAR-Miner-Post [7]. This kind of strategy is very inefficient because all CARs have to be generated and often a large number of candidates must be tested in the last step. The other strategy, pre-processing, first filters out records which do not contain the constrained itemset in the pre-processing step, mines all CARs, and then obtains the constrained CARs in the post-processing step. In [7], the authors proposed SC-CAR-Miner, an algorithm that integrates the itemset constraint into the actual mining process to generate only the CARs which satisfy the constraint. Since this strategy can use the properties of the constraint much more effectively, its execution time is much lower than those of the post- and pre-processing strategies.
D. Hwang et al. (Eds.): ICCCI 2014, LNAI 8733, pp. 494–503, 2014. © Springer International Publishing Switzerland 2014
In practice, the itemset constraint is often changed by end-users. Consequently, the available methods must be executed again whenever the constraint is changed, so the response cannot be returned to end-users immediately. In this context, the present study proposes an efficient method for mining CARs with itemset constraints that take the form of the presence of specific items in the rule antecedents and are frequently changed by end-users. The main contributions of this paper are as follows. First, we develop a lattice structure to store all frequent itemsets (Section 4.1). Second, we provide a theorem for quickly checking the paternity relation between two nodes in the lattice (Section 4.2). Finally, we propose an efficient algorithm for mining CARs with the itemset constraint based on the lattice (Section 4.2).
2 Some Concepts and Notations
Let $D$ be a dataset with $n$ attributes $\{A_1, A_2, \ldots, A_n\}$ and $|D|$ records (objects), and let $C = \{c_1, c_2, \ldots, c_k\}$ be a list of class labels. A specific value of an attribute $A_i$ and class $C$ are denoted by lower-case letters $a_i$ and $c_j$, respectively. An item is described as an attribute and a specific value for that attribute, denoted by $(A_i, a_{im})$.
An itemset is a set of items and a Constraint_Itemset is a specific itemset considered by end-users. A class association rule $r$ has the form $itemset \rightarrow c_j$, where $c_j \in C$ is a class label. A rule $r$ satisfies the itemset constraint if its antecedent (itemset) contains the Constraint_Itemset. The actual occurrence ActOcc(r) of rule $r$ in $D$ is the number of objects in $D$ which match $r$'s antecedent. The support of rule $r$, denoted by Sup(r), is the number of objects in $D$ which match $r$'s antecedent and are labeled with $r$'s class. The confidence of rule $r$, denoted by Conf(r), is defined as:
$$\mathrm{Conf}(r) = \frac{\mathrm{Sup}(r)}{\mathrm{ActOcc}(r)}$$
Table 1. Example of a dataset
OID   A    B    C    Class
1     a1   b1   c1   1
2     a1   b2   c1   2
3     a2   b2   c1   2
4     a3   b3   c1   1
5     a3   b1   c2   2
6     a3   b3   c1   1
7     a1   b3   c2   1
8     a2   b2   c2   2
A sample dataset is shown in Table 1, where OID is an object identifier. It contains eight objects, three attributes, and two classes (1 and 2). For example, consider rule $r: \{(A, a1)\} \rightarrow 1$. We have ActOcc(r) = 3 and Sup(r) = 2 since there are three objects with A = a1, of which two have class 1. In addition, we have
$$\mathrm{Conf}(r) = \frac{\mathrm{Sup}(r)}{\mathrm{ActOcc}(r)} = \frac{2}{3}.$$
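The three measures can be checked mechanically on Table 1. The following is a minimal sketch, not the authors' implementation; the names `TABLE_1`, `COLUMN`, `act_occ`, `sup`, and `conf` are illustrative assumptions, and a rule antecedent is represented as a list of (attribute, value) pairs.

```python
# Table 1 as (OID, A, B, C, Class) rows, copied from the paper.
TABLE_1 = [
    (1, "a1", "b1", "c1", 1), (2, "a1", "b2", "c1", 2),
    (3, "a2", "b2", "c1", 2), (4, "a3", "b3", "c1", 1),
    (5, "a3", "b1", "c2", 2), (6, "a3", "b3", "c1", 1),
    (7, "a1", "b3", "c2", 1), (8, "a2", "b2", "c2", 2),
]
COLUMN = {"A": 1, "B": 2, "C": 3}  # hypothetical helper: attribute -> column


def matches(row, antecedent):
    """True if the object contains every (attribute, value) item."""
    return all(row[COLUMN[attr]] == value for attr, value in antecedent)


def act_occ(antecedent):
    """Number of objects matching the antecedent."""
    return sum(1 for row in TABLE_1 if matches(row, antecedent))


def sup(antecedent, cls):
    """Number of objects matching the antecedent and labeled with cls."""
    return sum(1 for row in TABLE_1 if matches(row, antecedent) and row[4] == cls)


def conf(antecedent, cls):
    """Confidence = Sup / ActOcc."""
    return sup(antecedent, cls) / act_occ(antecedent)


rule_antecedent = [("A", "a1")]
print(act_occ(rule_antecedent), sup(rule_antecedent, 1), conf(rule_antecedent, 1))
```

For the rule {(A, a1)} → 1 this gives ActOcc = 3, Sup = 2, and Conf = 2/3, matching the values stated above.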
3 Related Work
In this section, we review some algorithms for mining frequent itemsets with itemset constraints. Since the introduction of mining frequent itemsets with itemset constraints [8], various strategies have been proposed. They can be classified into three main categories: post-processing, pre-processing, and constrained itemset filtering. Post-processing methods first mine frequent itemsets and then check them against the itemset constraint. Two examples are Apriori+ [9] and FP-Growth+ [10]. Pre-processing methods first restrict the source dataset to records which contain the constrained itemsets and then find frequent itemsets in the filtered dataset. Examples include MCFPTree [10] and Pre-CAP [11]. Constrained itemset filtering methods integrate itemset constraints into the actual mining process to generate only frequent itemsets which satisfy the constraint. CAP [9] and MFS_DoubleCons [12] belong to this strategy. However, these three approaches cannot be applied directly to mining CARs with itemset constraints because they produce constrained frequent itemsets only, not constrained rules. The problem of mining class association rules with itemset constraints studied in this paper therefore differs from the problem of mining frequent itemsets with itemset constraints and requires a different mining strategy.
4 Mining CARs with Itemset Constraints
4.1 Lattice Structure
In this section, we describe the lattice structure used to store all frequent itemsets in the dataset.
Definition 1: Let X be a k-itemset. The children itemsets generated from X based on the equivalence class concept are:
$$childrenEC(X) = \{XA \mid XA \text{ is a } (k+1)\text{-itemset},\ A \notin X \text{ and } A \neq \emptyset\}$$
Definition 2: Let X be a k-itemset. The children itemsets generated from X based on the lattice concept are:
$$childrenL(X) = \{BX \mid BX \text{ is a } (k+1)\text{-itemset},\ BX \notin childrenEC(X) \text{ and } X \subset BX\}$$
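To make the two kinds of children concrete, the sketch below splits the frequent (k+1)-supersets of an itemset into childrenEC and childrenL. It rests on an assumption that is not stated in the definitions themselves: items are kept in the fixed attribute order A < B < C, and XA is read as "X extended at the end", so a superset belongs to childrenEC(X) exactly when X is its prefix. The list `FREQUENT` is copied from the nodes shown in Fig. 3; the function name `children` is illustrative.

```python
# Frequent itemsets from Fig. 3, each written as a tuple of items in the
# attribute order A < B < C (the ordering is an assumption of this sketch).
FREQUENT = [
    ("a1",), ("a2",), ("a3",), ("b2",), ("b3",), ("c1",), ("c2",),
    ("a2", "b2"), ("a3", "b3"), ("a3", "c1"), ("b2", "c1"), ("b3", "c1"),
    ("a3", "b3", "c1"),
]


def children(x):
    """Split the frequent (k+1)-supersets of x into (childrenEC, childrenL)."""
    k = len(x)
    supersets = [s for s in FREQUENT if len(s) == k + 1 and set(x) < set(s)]
    # Definition 1: x is extended by a later item, so x is a prefix.
    children_ec = [s for s in supersets if s[:k] == x]
    # Definition 2: every other (k+1)-superset of x.
    children_l = [s for s in supersets if s[:k] != x]
    return children_ec, children_l


for itemset in [("a3",), ("b3",), ("b3", "c1")]:
    print(itemset, children(itemset))
```

Under this reading, the node for b3 has b3c1 as its only equivalence-class child and a3b3 as its only lattice child, which agrees with the traversal example in Section 4.2.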
Definition 3: Each node in the lattice is a tuple of the form
⟨id, itemset, (Obidset_1, ..., Obidset_k), pos, total, traverse, childrenEC, childrenL⟩
where:
1. id: a positive integer storing the identity of the node
2. (Obidset_1, ..., Obidset_k): a list of Obidsets in which each Obidset_i is the set of identifiers of the objects that contain the itemset and belong to class c_i (k is the number of classes)
3. pos: a positive integer storing the position of the class with the maximum cardinality of Obidset_i, i.e.,
$$pos = \arg\max_{i \in [1,k]} \{|Obidset_i|\}$$
4. total: a positive integer which stores the sum of the cardinalities of all Obidset_i, i.e.,
$$total = \sum_{i=1}^{k} |Obidset_i|$$
5. traverse: a flag which indicates whether or not the node has already generated a rule
6. childrenEC: a list of children nodes generated from itemset based on the equivalence class definition
7. childrenL: a list of children nodes generated from itemset based on the lattice definition
In the lattice, the itemset is converted into the form att × values for ease of programming, where:
(a) att: a positive integer which represents a list of attributes
(b) values: a list of values, each of which belongs to one attribute in att
Example 1: In Table 1, itemset X = {(A, a2)} is contained in two objects, 3 and 8, both of which belong to class 2. Thus, the node which contains itemset X has the form 1 × a2(∅, 38), in which att = 1, values = a2, Obidset_1 = ∅ (i.e. no objects contain both itemset X and class 1), Obidset_2 = {3, 8} (or Obidset_2 = 38 for short) (i.e. two objects, 3 and 8, contain both itemset X and class 2), pos = 2 (a line under position 2 of list Obidset_i), and total = 2. pos is 2 because the cardinality of the Obidset for class 2 is the maximum (2 versus 0).
Compared to the MECR-tree [6] and the SCR-tree [7], the lattice structure has a significant advantage: we do not need to re-build the lattice when the itemset constraint is changed. Based on the generated lattice, we use paternity relations among nodes to discover class association rules satisfying the itemset constraint. This noticeably improves the response time.
4.2 Proposed Algorithm
In this section, we first introduce some theorems for quickly determining the support of an infrequent itemset and the paternity relation between two nodes. Then, we present an efficient algorithm called L-CAR-Miner for mining CARs with the itemset constraint based on the lattice structure.
Theorem 1 [7]: Given two nodes X and Y, if X.att = Y.att and X.values ≠ Y.values, then X.Obidset_i ∩ Y.Obidset_i = ∅ (∀i ∈ [1, k]).
Theorem 1 implies that if two nodes X and Y have the same attributes but different values, it is not necessary to combine them as node XY because Sup(XY) = 0.
Theorem 2 [13]: Let XA and XB be two nodes of k-itemsets. If ∀XB ∈ X.childrenEC and XA is generated before XB in the lattice, then ¬∃Y ∈ XB.childrenEC ∪ XB.childrenL such that Y ∈ XA.childrenL.
Theorem 2 implies that finding the children nodes which belong to XA.childrenL is easily performed with two loops and one if statement as follows: (i) let Y ∈ X.childrenL, (ii) let YZ ∈ Y.childrenEC, and (iii) if XA ⊂ YZ, then YZ ∈ XA.childrenL. This theorem allows the proposed algorithm to be better than the one mentioned in [14] because it eliminates a large number of candidates.
Theorem 3: Let Y ∈ X.childrenL, YZ ∈ Y.childrenEC, and XA ∈ X.childrenEC. If A = Z, then XA ⊂ YZ.
Proof: By Definition 2, Y has the form BX where BX is a (k+1)-itemset. It follows that YZ has the form BXZ. Thus, if A = Z, then XA ⊂ BXA = YZ. ∎
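Theorems 1 and 3 can be illustrated on the data of Table 1. The sketch below is only a check on this one dataset, not a proof, and the helper names (`obidsets`, `theorem1_holds`) are assumptions introduced for illustration. It first verifies that two items on the same attribute with different values never share an object in any class, and then reproduces one instance of Theorem 3 with X = b3, A = c1, Y = a3b3, and Z = c1.

```python
from itertools import combinations

# Table 1 as (OID, A, B, C, Class) rows, copied from the paper.
TABLE_1 = [
    (1, "a1", "b1", "c1", 1), (2, "a1", "b2", "c1", 2),
    (3, "a2", "b2", "c1", 2), (4, "a3", "b3", "c1", 1),
    (5, "a3", "b1", "c2", 2), (6, "a3", "b3", "c1", 1),
    (7, "a1", "b3", "c2", 1), (8, "a2", "b2", "c2", 2),
]
COLUMN = {"A": 1, "B": 2, "C": 3}
CLASSES = (1, 2)


def obidsets(attr, value):
    """(Obidset_1, Obidset_2) of the 1-itemset {(attr, value)}."""
    col = COLUMN[attr]
    return tuple(
        {row[0] for row in TABLE_1 if row[col] == value and row[4] == cls}
        for cls in CLASSES
    )


def theorem1_holds():
    """Same attribute, different values => disjoint Obidsets in every class."""
    for attr, col in COLUMN.items():
        values = sorted({row[col] for row in TABLE_1})
        for v, w in combinations(values, 2):
            x, y = obidsets(attr, v), obidsets(attr, w)
            if any(x[i] & y[i] for i in range(len(CLASSES))):
                return False
    return True


# One instance of Theorem 3, with itemsets written as Python sets.
X, A = {"b3"}, "c1"          # XA = b3c1, a child of X in childrenEC
Y, Z = {"a3", "b3"}, "c1"    # Y = a3b3 is in X.childrenL, YZ = a3b3c1
XA, YZ = X | {A}, Y | {Z}

print(theorem1_holds(), A == Z, XA < YZ)
```

The last line compares the cheap test A = Z with the full proper-subset test XA ⊂ YZ; in this instance both hold, as Theorem 3 predicts.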
Using Theorem 3 allows us to quickly check condition (iii), XA ⊂ YZ, in Theorem 2. Instead of checking XA against all subsets of YZ, we need to check only the condition A = Z. Based on the proposed lattice structure and the three theorems, the algorithm L-CAR-Miner is developed for efficiently mining CARs with the itemset constraint. The proposed algorithm is briefly described as follows. First, the lattice structure is built to store frequent itemsets (shown in Figure 1). Second, when the itemset constraint is changed, the sub-lattice which begins at the itemset constraint is traversed to generate all CARs satisfying the constraint (shown in Figure 2).
The algorithm first finds the root node of the lattice (Lr), which contains frequent 1-itemsets at the first level (Line 1). Procedure BUILD-LATTICE is then recursively called with the parameters Lr and minSup to build the lattice structure (Lines 2-19). The function of procedure UPDATE-LATTICE(X, XA) is to update the paternity relations of a node with its children nodes (Lines 20-23), while the function of procedure TRAVERSE-LATTICE is to find all rules which satisfy the itemset constraint (Lines 24-30). Procedure GENERATE-RULE generates a rule from a node (Lines 31-33).
We apply the proposed algorithm to the example dataset in Table 1 with minSup = 25% and minConf = 50% to illustrate its basic ideas. First, L-CAR-Miner finds all frequent 1-itemsets. The result after this first step is Lr = {1×a1(17, 2), 1×a2(∅, 38), 1×a3(46, 5), 2×b2(∅, 238), 2×b3(467, ∅), 4×c1(146, 23), 4×c2(7, 58)}. The algorithm then calls procedure BUILD-LATTICE with two parameters, Lr and minSup, to build the lattice structure. When a node in the lattice is generated, its paternity relations with its children nodes are updated through procedure UPDATE-LATTICE. The lattice structure constructed by the proposed algorithm is shown in Figure 3. Note that the solid and dashed lines represent childrenEC and childrenL, respectively.
Input: Dataset D and minimum support minSup
Output: Lattice L containing all frequent itemsets in D
Procedure:
1. Let Lr be the root of the lattice. Lr includes a set of nodes where each node contains a frequent 1-itemset and has the form:
   ⟨id, itemset, (Obidset_1, ..., Obidset_k), pos, total, false, {}, {}⟩

BUILD-LATTICE(Lr, minSup)
2.  for all lx ∈ Lr.children do
3.    Pi = ∅;
4.    for all ly ∈ Lr.children, with y > x do
5.      if ly.att ≠ lx.att then // using Theorem 1
6.        assign an incremental integer to O.id;
7.        O.att = lx.att ∪ ly.att; // using bitwise operation
8.        O.values = lx.values ∪ ly.values;
9.        O.Obidset_i = lx.Obidset_i ∩ ly.Obidset_i; // ∀i ∈ [1, k]
10.       O.pos = argmax_{i ∈ [1, k]} {|O.Obidset_i|};
11.       O.total = Σ_{i=1}^{k} |O.Obidset_i|;
12.       O.traverse = false;
13.       if |O.Obidset_{O.pos}| ≥ minSup then // O satisfies minSup
14.         add O.id to lx.childrenEC;
15.         add O.id to ly.childrenL;
16.         UPDATE-LATTICE(lx, O);
17.         Pi = Pi ∪ O;
18.         add O to lattice L;
19.   BUILD-LATTICE(Pi, minSup);

UPDATE-LATTICE(X, XA)
20. for each child node Y in X.childrenL do // using Theorem 2
21.   for each child node YZ in Y.childrenEC do
22.     if A = Z then // using Theorem 3
23.       add YZ.id to XA.childrenL;

Fig. 1. Building the lattice structure
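The procedure in Fig. 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' C# implementation: the class `Node`, the helpers `frequent_1_itemsets`, `update_lattice`, and `build_lattice`, and the use of object references instead of node ids are all choices made for this sketch. Attributes are encoded as bits (A = 1, B = 2, C = 4), minSup = 25% is expressed as an absolute count of 2 objects, and the item appended last is taken to be A (respectively Z) in the test A = Z.

```python
from dataclasses import dataclass, field

# Table 1 rows: (OID, A, B, C, Class).
TABLE_1 = [
    (1, "a1", "b1", "c1", 1), (2, "a1", "b2", "c1", 2),
    (3, "a2", "b2", "c1", 2), (4, "a3", "b3", "c1", 1),
    (5, "a3", "b1", "c2", 2), (6, "a3", "b3", "c1", 1),
    (7, "a1", "b3", "c2", 1), (8, "a2", "b2", "c2", 2),
]
CLASSES = (1, 2)
ATTRIBUTE_BITS = ((1, 1), (2, 2), (4, 3))  # (bit code, column): A=1, B=2, C=4


@dataclass
class Node:
    att: int                      # bitmask of the attributes in the itemset
    values: tuple                 # one value per attribute in att
    obidsets: tuple               # one frozenset of OIDs per class
    traverse: bool = False
    children_ec: list = field(default_factory=list)
    children_l: list = field(default_factory=list)

    @property
    def support(self):
        # |Obidset_pos|: size of the largest per-class Obidset
        return max(len(s) for s in self.obidsets)

    @property
    def name(self):
        return f"{self.att}x{''.join(self.values)}"


def frequent_1_itemsets(min_count):
    """Line 1 of Fig. 1: the frequent 1-itemsets that form the root level."""
    roots = []
    for bit, col in ATTRIBUTE_BITS:
        for value in sorted({row[col] for row in TABLE_1}):
            obidsets = tuple(
                frozenset(r[0] for r in TABLE_1 if r[col] == value and r[4] == c)
                for c in CLASSES
            )
            node = Node(bit, (value,), obidsets)
            if node.support >= min_count:
                roots.append(node)
    return roots


def update_lattice(x, xa):
    a = xa.values[-1]                       # the item A that extends X to XA
    for y in x.children_l:                  # Theorem 2: only these candidates
        for yz in y.children_ec:
            if yz.values[-1] == a:          # Theorem 3: A = Z implies XA ⊂ YZ
                xa.children_l.append(yz)


def build_lattice(level, min_count, lattice):
    for i, lx in enumerate(level):
        p_i = []
        for ly in level[i + 1:]:
            if ly.att == lx.att:            # Theorem 1: intersection is empty
                continue
            o = Node(
                lx.att | ly.att,
                lx.values + ly.values[-1:],
                tuple(p & q for p, q in zip(lx.obidsets, ly.obidsets)),
            )
            if o.support >= min_count:
                lx.children_ec.append(o)
                ly.children_l.append(o)
                update_lattice(lx, o)
                p_i.append(o)
                lattice.append(o)
        build_lattice(p_i, min_count, lattice)
    return lattice


roots = frequent_1_itemsets(min_count=2)    # minSup = 25% of 8 objects
lattice = build_lattice(roots, 2, list(roots))
print([n.name for n in lattice])
```

On Table 1 the sketch produces the seven frequent 1-itemsets followed by the six larger nodes listed in Fig. 3, and node 6×b3c1 receives 7×a3b3c1 in its childrenL through the A = Z test.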
Input: Lattice L, minimum confidence minConf, and the itemset constraint Constraint_Itemset
Output: All CARs satisfying minSup, minConf, and Constraint_Itemset
Procedure:
TRAVERSE-LATTICE(l, minConf)
24. if l.traverse = false then
25.   GENERATE-RULE(l, minConf);
26.   l.traverse = true;
27. for each child node X in l.childrenEC do
28.   TRAVERSE-LATTICE(X, minConf);
29. for each child node Y in l.childrenL do
30.   TRAVERSE-LATTICE(Y, minConf);

GENERATE-RULE(l, minConf)
31. conf = |l.Obidset_{l.pos}| / l.total;
32. if conf ≥ minConf then
33.   CARs = CARs ∪ {l.itemset → c_pos (|l.Obidset_{l.pos}|, conf)};

Fig. 2. Traversing the sub-lattice to generate constrained CARs
[Figure: lattice nodes by level; the solid (childrenEC) and dashed (childrenL) edges are not reproduced]
Root: {}
Level 1: 1×a1(17, 2), 1×a2(∅, 38), 1×a3(46, 5), 2×b2(∅, 238), 2×b3(467, ∅), 4×c1(146, 23), 4×c2(7, 58)
Level 2: 3×a2b2(∅, 38), 3×a3b3(46, ∅), 5×a3c1(46, ∅), 6×b2c1(∅, 23), 6×b3c1(46, ∅)
Level 3: 7×a3b3c1(46, ∅)
Fig. 3. Lattice structure generated by L-CAR-Miner for the dataset in Table 1 with minSup = 25% and minConf = 50%
To generate all CARs satisfying the itemset constraint and minConf, the algorithm uses procedure TRAVERSE-LATTICE to traverse the sub-lattice which begins at the constraint. For example, consider Constraint_Itemset = (B, b3). The algorithm first looks up node 2×b3(467, ∅). Because this node has not generated a rule yet, the algorithm uses procedure GENERATE-RULE to generate rule 2×b3 → 1 (i.e. “if B = b3 then Class = 1”). Then the algorithm considers node 6×b3c1(46, ∅) and generates rule 6×b3c1 → 1 (i.e. “if B = b3 and C = c1 then Class = 1”). Similarly, rules 7×a3b3c1 → 1 and 3×a3b3 → 1 are generated from node 7×a3b3c1(46, ∅) and node 3×a3b3(46, ∅), respectively. Note that the algorithm visits node 7×a3b3c1(46, ∅) again after generating a rule from node 3×a3b3(46, ∅), this time through the childrenEC of that node. However, no rule is generated on this visit because one was already generated through the childrenL of node 6×b3c1(46, ∅).
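The walk-through above can be replayed with a short sketch of TRAVERSE-LATTICE and GENERATE-RULE. It is an illustration under assumptions rather than the authors' code: the sub-lattice rooted at 2×b3 is wired by hand from the walk-through and Fig. 3, the `Node` class keeps only the fields the traversal needs, and each rule is recorded as a hypothetical tuple (itemset, class position, support, confidence).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    obidsets: tuple               # one set of OIDs per class (class 1, class 2)
    traverse: bool = False
    children_ec: list = field(default_factory=list)
    children_l: list = field(default_factory=list)


def generate_rule(node, min_conf, cars):
    sizes = [len(s) for s in node.obidsets]
    pos = sizes.index(max(sizes)) + 1          # 1-based class position
    conf = sizes[pos - 1] / sum(sizes)         # |Obidset_pos| / total
    if conf >= min_conf:
        cars.append((node.name, pos, sizes[pos - 1], conf))


def traverse_lattice(node, min_conf, cars):
    if not node.traverse:                      # each node yields a rule once
        generate_rule(node, min_conf, cars)
        node.traverse = True
    for child in node.children_ec:
        traverse_lattice(child, min_conf, cars)
    for child in node.children_l:
        traverse_lattice(child, min_conf, cars)
    return cars


# Sub-lattice that begins at the constraint (B, b3), wired by hand.
a3b3c1 = Node("a3b3c1", ({4, 6}, set()))
b3c1 = Node("b3c1", ({4, 6}, set()), children_l=[a3b3c1])
a3b3 = Node("a3b3", ({4, 6}, set()), children_ec=[a3b3c1])
b3 = Node("b3", ({4, 6, 7}, set()), children_ec=[b3c1], children_l=[a3b3])

cars = traverse_lattice(b3, min_conf=0.5, cars=[])
for rule in cars:
    print(rule)
```

The rules come out in the order b3, b3c1, a3b3c1, a3b3, each predicting class 1, and the second visit to a3b3c1 through a3b3 produces no duplicate because its traverse flag is already set.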
5 Experiments
All experiments were conducted on a computer with an Intel Core i7-2637M CPU at 1.7 GHz and 4 GB of RAM, running Windows 7 Enterprise (64-bit) SP1. The experimental datasets were obtained from the UCI Machine Learning Repository¹. The algorithms were coded in C# using Microsoft Visual Studio .NET Premium 2013 with .NET Framework 4.5.50938. The characteristics of the experimental datasets are described in Table 2. The table shows the number of attributes (including the class attribute), the number of class labels, the number of distinctive values (i.e. the total number of distinct values in all attributes), and the number of objects (or records) in each dataset. Note that minConf = 50% was used for all experiments.
Table 2. Characteristics of the experimental datasets
Dataset     #attributes   #classes   #distinctive values   #objects
German      21            2          1,077                 1,000
Lymph       19            4          63                    148
Chess       37            2          76                    3,196
Connect-4   43            3          130                   67,557
To show the efficiency of L-CAR-Miner, we compared its execution time with those of Pre-CAR-Miner (a pre-processing method) and SC-CAR-Miner [7]. Since SC-CAR-Miner always outperforms CAR-Miner-Post [7], we do not include CAR-Miner-Post in the comparison. The results are shown in Figures 4 and 5. Note that Pre, SC, and L are the runtimes of Pre-CAR-Miner, SC-CAR-Miner, and L-CAR-Miner, respectively. To initialize the Constraint_Itemset, we define the selectivity of a constraint as the ratio of the number of items selected to be the constraint to the total number of items. Thus, a constraint with 0% selectivity selects no items, while a constraint with 100% selectivity selects all the items (distinctive values) in the dataset. Note that the runtime of L-CAR-Miner includes the execution times of both phases, lattice building and mining.
The results show that Pre-CAR-Miner performed worst on all experimental datasets. It is slower than SC-CAR-Miner and L-CAR-Miner because it must generate all CARs and check a huge number of candidates. L-CAR-Miner is always the fastest of all tested algorithms. For example, consider the Chess dataset with minSup = 50% and selectivity = 90%. The runtime of L-CAR-Miner was 3.807 s, while those of Pre-CAR-Miner and SC-CAR-Miner were 40.357 s and 17.468 s, respectively.
¹ http://mlearn.ics.uci.edu
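To put the Chess example in relative terms, the reported runtimes can be turned into speedup factors. The snippet below is plain arithmetic on the three numbers quoted above; the variable names are illustrative.

```python
# Runtimes in seconds reported for Chess (minSup = 50%, selectivity = 90%).
runtime_l = 3.807      # L-CAR-Miner
runtime_pre = 40.357   # Pre-CAR-Miner
runtime_sc = 17.468    # SC-CAR-Miner

# Speedup of L-CAR-Miner over each baseline.
speedup_over_pre = runtime_pre / runtime_l
speedup_over_sc = runtime_sc / runtime_l

print(f"vs Pre-CAR-Miner: {speedup_over_pre:.1f}x")
print(f"vs SC-CAR-Miner:  {speedup_over_sc:.1f}x")
```

For this single configuration, L-CAR-Miner is roughly 10.6 times faster than Pre-CAR-Miner and roughly 4.6 times faster than SC-CAR-Miner.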
[Figure: Runtime (s) versus selectivity (%) from 10 to 90 for Pre, SC, and L; panel (a) German (minSup = 4%), panel (b) Lymph (minSup = 4%)]
Fig. 4. Runtimes of Pre-CAR-Miner, SC-CAR-Miner, and L-CAR-Miner for the German (a) and Lymph (b) datasets with the constraint selectivity
[Figure: Runtime (s) versus selectivity (%) from 10 to 90 for Pre, SC, and L; panel (a) Chess (minSup = 50%), panel (b) Connect-4 (minSup = 95%)]
Fig. 5. Runtimes of Pre-CAR-Miner, SC-CAR-Miner, and L-CAR-Miner for the Chess (a) and Connect-4 (b) datasets with the constraint selectivity
6 Conclusions
This paper proposes an efficient approach for mining CARs with itemset constraints. Unlike post-processing and pre-processing methods, the proposed approach does not need to re-build its data structure when the constraint is changed. The framework of the proposed algorithm is based on the lattice structure and three theorems for quickly pruning infrequent itemsets and determining the paternity relation between two nodes. To validate the efficiency of the proposed algorithm, a series of experiments was conducted on four popular datasets, namely German, Lymph, Chess, and Connect-4. The experimental results show that the proposed method is faster than the existing methods it was compared with.
However, when the minimum support value is very low, the cost of storing the lattice is very high, and the lattice can exhaust the available memory. We will study solutions for reducing memory consumption in the future.
Acknowledgment. This work was supported by Vietnam’s National Foundation for Science and Technology Development (NAFOSTED) under Grant Number 102.01-2012.17.
References
1. Vo, B., Le, B.: Mining traditional association rules using frequent itemsets lattice. In: The 39th International Conference on Computers & Industrial Engineering (CIE 2009), pp. 1401–1406. IEEE (2009)
2. Nguyen, D., Vo, B., Le, B.: Efficient Strategies for Parallel Mining Class Association Rules. Expert Systems with Applications 41, 4716–4729 (2014)
3. Van, T.-T., Vo, B., Le, B.: IMSR_PreTree: an improved algorithm for mining sequential rules based on the prefix-tree. Vietnam Journal of Computer Science 1, 97–105 (2014)
4. Liu, B., Hsu, W., Ma, Y.: Integrating classification and association rule mining. In: The 4th International Conference on Knowledge Discovery and Data Mining (KDD 1998), pp. 80–86 (1998)
5. Vo, B., Le, B.: A novel classification algorithm based on association rules mining. In: Richards, D., Kang, B.-H. (eds.) PKAW 2008. LNCS, vol. 5465, pp. 61–75. Springer, Heidelberg (2009)
6. Nguyen, L.T., Vo, B., Hong, T.-P., Thanh, H.C.: CAR-Miner: An efficient algorithm for mining class-association rules. Expert Systems with Applications 40, 2305–2311 (2013)
7. Nguyen, D., Vo, B.: Mining class-association rules with constraints. In: Van Huynh, N., Denoeux, T., Tran, D.H., Le, A.C., Pham, B.S. (eds.) KSE 2013, Part II. AISC, vol. 245, pp. 323–334. Springer, Heidelberg (2014)
8. Srikant, R., Vu, Q., Agrawal, R.: Mining association rules with item constraints. In: The 3rd International Conference on Knowledge Discovery and Data Mining (KDD 1997), pp. 67–73 (1997)
9. Ng, R.T., Lakshmanan, L.V.S., Han, J., Pang, A.: Exploratory mining and pruning optimizations of constrained associations rules. In: ACM SIGMOD International Conference on Management of Data, pp. 13–24. ACM (1998)
10. Lin, W.-Y., Huang, K.-W., Wu, C.-A.: MCFPTree: An FP-tree-based algorithm for multiconstraint patterns discovery. International Journal of Business Intelligence and Data Mining 5, 231–246 (2010)
11. Nguyen, D., Truong, T., Vo, B.: Mining Frequent Patterns containing HIV-Positive in HIV Voluntary Counseling and Testing data. ICIC Express Letters 8, 541–546 (2014)
12. Duong, H., Truong, T., Vo, B.: An efficient method for mining frequent itemsets with double constraints. Engineering Applications of Artificial Intelligence 27, 148–154 (2014)
13. Vo, B., Le, T., Hong, T.-P., Le, B.: An effective approach for maintenance of pre-large-based frequent-itemset lattice in incremental mining. Applied Intelligence (in press, 2014)
14. Nguyen, L.T., Vo, B., Hong, T.-P., Thanh, H.C.: Classification based on association rules: A lattice-based approach. Expert Systems with Applications 37, 11357–11366 (2012)