Mining Association Rules with Ontological Information

Ming-Cheng Tseng
Institute of Information Engineering, I-Shou University, 840, Taiwan
[email protected]

Wen-Yang Lin
Dept. of Comp. Sci. & Info. Eng., National University of Kaohsiung, 811, Taiwan
[email protected]

Abstract

The problem of mining association rules incorporating domain knowledge has been studied in recent years. Previous work addressed two types of knowledge separately: classification and composition. In this paper, we revisit this problem from a more unified viewpoint: we consider the problem of mining association rules with ontological information that captures not only classification but also composition relationships. Two effective algorithms are proposed, together with an empirical evaluation.

1. Introduction

It is well known that the data mining process is knowledge intensive. Many applications have shown that, with the aid of domain knowledge, one can discover more meaningful patterns and enrich the semantics of the discovered rules. The most popular way to organize domain knowledge is to employ an ontology [3][8], an explicit specification of a conceptualization that helps us define and share knowledge. One of the most important tasks in data mining is discovering association rules from a database. An association rule is an expression of the form X ⇒ Y, where X and Y are sets of items. Such information is very useful for business decision making. In the past few years, research has investigated the problem of mining association rules with classification or composition information [4][5][7], showing the benefit of incorporating domain knowledge and proposing effective algorithms. In this paper, we revisit this problem from a more unified viewpoint: we consider the problem of mining association rules with ontological information that captures not only classification but also composition relationships, and we propose two effective algorithms with an empirical evaluation.

Rong Jeng
Dept. of Information Management, I-Shou University, 840, Taiwan
[email protected]

The remainder of the paper is organized as follows. Section 2 reviews related work. Section 3 formalizes the problem of mining association rules with an ontology. Section 4 describes the proposed methods for finding frequent itemsets, with a simple example for illustration. Section 5 presents the experimental results. Finally, Section 6 states our conclusions.

2. Related work

Mining association rules in the presence of taxonomy (classification hierarchy) information was first addressed in [7]. The problem, named mining generalized association rules, aims to find associations among items at any level of the taxonomy. Another closely related work, but with a different purpose, was conducted in [4], which emphasized "drill-down" discovery of association rules. In [5], Jea et al. considered the problem of discovering multiple-level association rules with a composition (has-a) hierarchy and proposed a method whose approach is similar to [7]. In [2], Domingues and Rezende proposed an algorithm called GART, which uses taxonomies in the knowledge post-processing step to generalize and prune uninteresting rules, helping the user analyze the generated association rules.

3. Problem statement

Let I = {i1, i2, …, im} be a set of items, and DB = {t1, t2, …, tn} be a set of transactions, where each transaction ti = ⟨tid, A⟩ has a unique identifier tid and a set of items A (A ⊆ I). To study the mining of association rules with ontological information from DB, we assume that the ontology of items, T, is available and is denoted as a directed acyclic graph on I ∪ E, where E = {e1, e2, …, ep} represents the set of extended items derived from I, including generalized items J = {j1, j2, …, jq} and composite items K = {k1, k2, …, kr}, i.e., E = J ∪ K. There are two different types of edges in T: taxonomic edges (denoting is-a relationships) and meronymic edges (denoting has-a relationships). We call an item j a generalization of item i if there is a path composed of taxonomic edges from i to j, and conversely, we call i a specialization of j. On the other hand, we call item k a component of item i if there is a path composed of meronymic edges from i to k, and we call i an aggregation of k.

Definition 1. Given a transaction t = ⟨tid, A⟩, we say an itemset B is in t if every item in B is in A or is an extended item of some item in A. An itemset B has support s, denoted s = sup(B), in the transaction set DB if s% of the transactions in DB contain B.

Definition 2. Given a set of transactions DB and an ontology T, an association rule is an implication of the form A ⇒ B, where A, B ⊆ I ∪ E, A ∩ B = ∅, and no item in B is an extended item of any item in A, and vice versa. The support of this rule, sup(A ⇒ B), is equal to the support of A ∪ B. The confidence of the rule, conf(A ⇒ B), is the ratio of sup(A ∪ B) to sup(A).

Definition 3. The problem of mining association rules is, given a set of transactions DB and an ontology T, to find all association rules with support and confidence equal to or greater than a user-specified minimum support ms and minimum confidence mc, respectively.

For example, consider the ontology in Figure 1. For the primitive purchased item "Sony VAIO", "PC" and "Desktop" are its generalizations, while "Segate 60GB" and "RAM 512MB" are its components. It is likely to discover the following association rules:

PC ⇒ HP DeskJet, or
Product with IBM 60GB ⇒ HP DeskJet.
The first rule asserts that people who purchase "PC" products tend to purchase "HP DeskJet", while the second one implies that people who purchase a product containing "IBM 60GB" tend to purchase "HP DeskJet".

Figure 1. Example of ontology (is-a and has-a edges linking products such as PC, Printer, Desktop PC, Notebook, Sony VAIO, Gateway GE, IBM TP, HP DeskJet, and Epson EPL with components such as Hard Disk, Memory, IBM 60GB, Segate 60GB, RAM 512MB, RAM 256MB, Ink Cartridge, Photo Conductor, and Toner Cartridge)

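As an illustration of Definitions 1 and 2, the support and confidence computations can be sketched as follows. This is a minimal sketch, not the authors' implementation; the small ontology mapping and the toy transactions are hypothetical, borrowing item names from Figure 1.

```python
# Minimal sketch of Definitions 1-2: an itemset B is "in" a transaction t
# if every item of B is purchased in t or is an extended item (generalization
# or component) of some purchased item. The mapping below is illustrative.
EXTENDED = {                      # item -> its generalizations and components
    "Sony VAIO": {"Desktop", "PC", "Segate 60GB", "RAM 512MB"},
    "HP DeskJet": {"Printer", "Ink Cartridge"},
}

def extend(items):
    """Form the extended transaction t* by adding all extended items."""
    ext = set(items)
    for a in items:
        ext |= EXTENDED.get(a, set())
    return ext

def sup(itemset, db):
    """Support: fraction of transactions whose extension contains the itemset."""
    return sum(itemset <= extend(t) for t in db) / len(db)

def conf(antecedent, consequent, db):
    """Confidence of A => B: sup(A ∪ B) / sup(A)."""
    return sup(antecedent | consequent, db) / sup(antecedent, db)

db = [{"Sony VAIO", "HP DeskJet"}, {"Sony VAIO"}, {"HP DeskJet"}, {"Epson EPL"}]
print(sup({"PC"}, db))                # → 0.5 ("Sony VAIO" extends to "PC")
print(conf({"PC"}, {"Printer"}, db))  # → 0.5
```

Here the rule PC ⇒ Printer holds with 50% support and 50% confidence even though neither "PC" nor "Printer" occurs literally in any transaction, which is exactly the effect of mining over extended transactions.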
4. The proposed methods

4.1 Algorithm description

As shown in [1], the task of association rule mining can be decomposed into two phases: frequent itemset generation and rule construction. Since the second phase is straightforward and less expensive, we concentrate only on the first phase, finding all frequent itemsets. We propose two algorithms, called AROC and AROS, which are derived from the Cumulate and Stratify algorithms, respectively, presented in [7]; ARO stands for Association Rules with Ontological information.

The main problem arising from incorporating ontological information into association rule mining is how to effectively compute the occurrences of an itemset A in the transaction database DB, which involves checking each item a ∈ A. Intuitively, we can simplify the task by first adding the generalized and composite items, i.e., the extended items, of all items in a transaction t into t to form the extended transaction t*. We denote the resultant extended database as ED. One thing deserves special care: a purchased item might also be a component of another purchased item. After adding the extended items into the transactions, to differentiate whether an item was directly purchased by the customer or is an indirectly purchased component, we attach an asterisk '*' to an item that denotes an indirectly purchased component.

The first algorithm, AROC, follows the Cumulate algorithm in [7] with some extensions derived from the following lemmas and enhancements:

Lemma 1 [7]. The support of an itemset A that contains both an item a and its generalization â is the same as the support of the itemset A − {â}.

Lemma 2. The support of an itemset A that contains both an item a and its component ǎ is the same as the support of the itemset A − {ǎ}.

Observation 1 (Itemset pruning). Any candidate itemset that contains both an item and its generalization or its component can be pruned.
Observation 2 (Extension filtering). The extensions (generalizations or components) of an original item need to be added into a transaction only if the item appears in at least one candidate itemset currently being counted.

Figure 2 gives an overview of the proposed AROC algorithm.

Input: (1) DB: the original database; (2) T: the item ontology; (3) ms: the minimum support setting.
Output: L: the set of frequent itemsets for DB with T w.r.t. ms.
Steps:
1.  repeat
2.    if k = 1 then generate C1 from the item ontology T;
3.    else Ck = apriori-gen(Lk−1);
4.    Delete any candidate in Ck that possesses a classification or composition relationship between its items; /* Observation 1 */
5.    for each transaction t ∈ DB do
6.      for each item a ∈ t do
7.        Add all generalizations and components of a in T into t and remove any duplicates; /* Observation 2 */
8.      for each itemset A ∈ Ck do
9.        if A in t then count(A)++;
10.   Lk = {A | A ∈ Ck and sup(A) ≥ ms};
11. until Lk = ∅
12. L = ∪k Lk;

Figure 2. Algorithm AROC

The other algorithm, AROS, is an extension of the Stratify algorithm in [7]. The most distinctive feature of AROS is its course of occurrence counting for candidate itemsets. Adopting the concept of stratification, we divide the set of candidate k-itemsets Ck, according to the classification and composition relationships, into two disjoint subsets, the maximal candidate set MCk and the residual candidate set RCk, as defined below:

Definition 5. Consider the set Sk of candidates in Ck induced by the schema ⟨a1, a2, …, ak−1, *⟩, where '*' means "don't care". A candidate k-itemset A = ⟨a1, a2, …, ak−1, ak⟩ is a maximal candidate if no candidate in Sk is a generalization or a component of A. That is, MCk = {A | A ∈ Ck, and there is no à = ⟨ã1, ã2, …, ãk−1, ãk⟩ ∈ Ck with ai = ãi for 1 ≤ i ≤ k−1 and ãk a generalization or a component of ak}. The residual candidate set is RCk = Ck − MCk.

Lemma 3. Consider three k-itemsets (a1, a2, …, ak), (a1, a2, …, âk) and (a1, a2, …, ǎk), where âk is a generalized item of ak and ǎk is a component item of ak. If (a1, a2, …, âk) or (a1, a2, …, ǎk) is not frequent, then neither is (a1, a2, …, ak).

Based on the above concepts, rather than counting the occurrences of all candidates in Ck in the same pass as in AROC, we first count the candidates in MCk and then proceed to RCk, hopefully waiving the need to count some candidates in RCk. The AROS algorithm is shown in Figure 3.

Input: (1) DB: the original database; (2) T: the item ontology; (3) ms: the minimum support setting.
Output: L: the set of frequent itemsets for DB with T w.r.t. ms.
Steps:
1.  repeat
2.    if k = 1 then generate C1 from the item ontology T;
3.    else Ck = apriori-gen(Lk−1);
4.    Delete any candidate in Ck that possesses a classification or composition relationship between its items;
5.    MCk = MCk-gen(Ck, T); /* use Ck and T to find the maximal candidates */
6.    Scan ED to compute count(A) for each itemset A ∈ MCk;
7.    MLk = {A | A ∈ MCk and sup(A) ≥ ms};
8.    RCk = RCk-gen(Ck, MCk, MLk); /* use Ck, MCk, MLk to find RCk */
9.    if RCk ≠ ∅ then
10.     Scan ED to compute count(A) for each itemset A ∈ RCk;
11.     RLk = {A | A ∈ RCk and sup(A) ≥ ms};
12.   end if
13.   Lk = MLk ∪ RLk;
14. until Lk = ∅
15. L = ∪k Lk;

Figure 3. Algorithm AROS
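To make the counting pass of AROC concrete, the following sketch mirrors steps 4-10 of Figure 2: prune candidates by Observation 1, extend each transaction with only the extensions that matter (Observation 2), and count. The helper names (`related`, `prune`, `count_pass`) and the tiny ontology encoding are our own illustration, not the authors' implementation.

```python
from itertools import combinations

# Hypothetical ontology: item -> set of its generalizations/components.
EXT = {"b": {"a"}, "c": {"a"}}        # 'a' is an extension of 'b' and of 'c'

def related(x, y):
    """True if x and y stand in a classification/composition relationship."""
    return y in EXT.get(x, set()) or x in EXT.get(y, set())

def prune(candidates):
    """Observation 1: drop candidates containing an item and its extension."""
    return [c for c in candidates
            if not any(related(x, y) for x, y in combinations(c, 2))]

def count_pass(candidates, db, ms):
    """One AROC pass: extend each transaction, then count each candidate."""
    # Observation 2: only add extensions that occur in some candidate.
    needed = set().union(*candidates) if candidates else set()
    counts = {c: 0 for c in candidates}
    for t in db:
        t_star = set(t)
        for a in t:
            t_star |= EXT.get(a, set()) & needed   # extension filtering
        for c in candidates:
            if c <= t_star:
                counts[c] += 1
    return [c for c in candidates if counts[c] / len(db) >= ms]

db = [{"b", "d"}, {"c", "d"}, {"b"}]
c2 = [frozenset(p) for p in combinations(sorted({"a", "b", "c", "d"}), 2)]
c2 = prune(c2)                  # removes {a,b} and {a,c}
print(count_pass(c2, db, 0.5))  # the frequent pairs at ms = 50%
```

On this toy data only {a, d} survives: "a" is never purchased directly, but the extension step makes it appear in every transaction that contains "b" or "c", which is what Lemmas 1-2 exploit.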

4.2. Example

We use Figure 4 to illustrate our algorithms. Table 1 shows the extended database. The running summary of AROS is shown in Table 2; the running summary of AROC is the same as Table 2, but without the columns MC2, ML2, RC2 and RL2.

Figure 4. An illustrated example (an item ontology over the items A-L, the setting ms = 40%, and the transaction table DB below)

Transaction table (DB)
TID   Primitive Items
1     I, C, F*
2     H, I, C
3     I, B, D*
4     B
5     I, B, D*
6     H, D*

Table 1. Extended database (ED)
TID   Primitive Items   Extended Items
1     I, C, F*          G, K, L, A, E, F
2     H, I, C           G, J, K, L, A, E, F
3     I, B, D*          G, K, L, A, D
4     B                 A, D
5     I, B, D*          G, K, L, A, D
6     H, D*             G, J, K

Table 2. Running summary of AROS

C1: A, B, C, D*, D, E, E*, F, F*, G, H, I, J, J*, K, K*, L, L*
L1: A, B, D*, D, G, I, K, L
C2: {A,D*}, {A,G}, {A,I}, {A,K}, {A,L}, {B,D*}, {B,G}, {B,I}, {B,K}, {B,L}, {D*,D}, {D*,G}, {D*,I}, {D*,K}, {D*,L}, {D,G}, {D,I}, {D,K}, {D,L}, {K,L}
MC2: {A,D*}, {A,G}, {A,K}, {A,L}, {B,D*}, {B,G}, {B,K}, {B,L}, {D*,D}, {D*,G}, {D*,K}, {D*,L}, {D,G}, {D,K}, {D,L}, {K,L}
ML2: {A,G}, {A,K}, {A,L}, {D*,G}, {D*,K}, {K,L}
RC2: {A,I}
RL2: {A,I}
L2: {A,G}, {A,I}, {A,K}, {A,L}, {D*,G}, {D*,K}, {K,L}
C3 & L3: {A,K,L}
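The first columns of Table 2 can be checked mechanically: counting each item over the extended transactions of Table 1 with ms = 40% (an item must occur in at least 3 of the 6 transactions) reproduces L1. A small sketch of that check, using the data of Table 1 (our own illustration, not part of the algorithms):

```python
from collections import Counter

# Extended transactions from Table 1 (primitive plus extended items).
ED = [
    {"I", "C", "F*", "G", "K", "L", "A", "E", "F"},
    {"H", "I", "C", "G", "J", "K", "L", "A", "E", "F"},
    {"I", "B", "D*", "G", "K", "L", "A", "D"},
    {"B", "A", "D"},
    {"I", "B", "D*", "G", "K", "L", "A", "D"},
    {"H", "D*", "G", "J", "K"},
]
ms = 0.40

counts = Counter(item for t in ED for item in t)
L1 = {item for item, c in counts.items() if c / len(ED) >= ms}
print(sorted(L1))   # → ['A', 'B', 'D', 'D*', 'G', 'I', 'K', 'L']
```

The output matches the L1 column of Table 2.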

5. Experimental results

In this section, we evaluate the performance of the proposed algorithms, AROC and AROS. Synthetic datasets generated by the IBM data generator [1] were used in the experiments, after adding generalized and component items. The parameter settings were: 362 items, an average transaction size of 20 items, 30 groups, 4 levels, and a fanout of 5. We also adopted two different support-counting strategies in the implementation of each algorithm: one with horizontal counting [1], denoted AROC(H) and AROS(H); the other with vertical intersection counting [6], denoted AROC(V) and AROS(V). We first compared the execution times of AROC and AROS for different settings of ms with |DB| = 200,000. The results are shown in Figure 5. At low ms, AROC(V) outperformed AROS(V), with the gap increasing as ms decreased; the reason is that AROS(V) spent too much time finding maximal candidate itemsets. However, at low ms, AROS(H) performed slightly better than AROC(H), because database scanning dominates the cost of horizontal support counting.

Figure 5. Execution times for various ms (run time in sec., log scale, vs. ms% from 1 to 3.5, for AROC(V), AROS(V), AROC(H), and AROS(H))

We then evaluated the algorithms under varying transaction sizes at ms = 1.5%. The results are shown in Figure 6. AROC(V) performed 750% faster than AROS(V) at 20,000 transactions, with the gap decreasing as the number of transactions increased, since the advantage of pruning candidates by maximal candidate itemsets in AROS(V) is gradually overwhelmed by the growing number of transactions.

Figure 6. Execution times for various transactions (run time in sec., log scale, vs. number of transactions ×10,000, from 4 to 20)

6. Conclusion

We have investigated the problem of mining association rules with ontological information. We presented two algorithms, AROC and AROS, for discovering frequent itemsets. Experimental results showed that these two algorithms have good linear scale-up characteristics. In future work, we will study the maintenance aspect of this problem.

Acknowledgement

This work is partially supported by the National Science Council of Taiwan under grant No. NSC 95-2221-E-390-024.

References

[1] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," Proc. of 20th Int. Conf. on Very Large Data Bases, 1994, pp. 487-499.
[2] M.A. Domingues and S.O. Rezende, "Using taxonomies to facilitate the analysis of the association rules," Proc. of 2nd Int. Workshop on Knowledge Discovery and Ontologies, 2005, pp. 59-66.
[3] T.R. Gruber, "A translation approach to portable ontology specifications," Knowledge Acquisition, Vol. 5, 1993, pp. 199-220.
[4] J. Han and Y. Fu, "Discovery of multiple-level association rules from large databases," Proc. of 21st Int. Conf. on Very Large Data Bases, 1995, pp. 420-431.
[5] K.F. Jea, T.P. Chiu and M.Y. Chang, "Mining multiple-level association rules in has-a hierarchy," Proc. of the Joint Conf. on AI, Fuzzy System, and Grey System, 2003.
[6] A. Savasere, E. Omiecinski, and S. Navathe, "An efficient algorithm for mining association rules in large databases," Proc. of 21st Int. Conf. on Very Large Data Bases, 1995, pp. 432-444.
[7] R. Srikant and R. Agrawal, "Mining generalized association rules," Future Generation Computer Systems, Vol. 13, Issues 2-3, 1997, pp. 161-180.
[8] V.C. Storey, "Understanding semantic relationships," Very Large Data Bases Journal, Vol. 2, No. 4, 1993, pp. 455-488.