
A binary based approach for generating association rules

Med El Hadi Benelhadj1, Khedija Arour2, Mahmoud Boufaida1 and Yahya Slimani3
1 LIRE Laboratory, Computer Science Department, Mentouri University, Constantine, Algeria
2 National Institute of Applied Science and Technology, Tunis, Tunisia
3 Computer Science Department, Faculty of Sciences, Tunis, Tunisia

Abstract - Advanced database application areas, such as computer-aided design, office automation, digital libraries, data mining, hypertext and multimedia systems, need to handle complex data structures with set-valued attributes. These information systems contain implicit data that must be extracted and exploited using data mining techniques. To exploit the data from these systems, the choice of appropriate storage structures is essential. In this paper, we propose a new compact structure to represent a transaction database, called a signature tree, which speeds up the scanning of the signature file. The construction of this tree requires only a single access to the transaction database. The tree is then used to compute the maximum support, extract frequent itemsets and generate association rules.

Keywords: Data Mining, Frequent itemset, Signature file, Signature tree.

1 Introduction

Knowledge Discovery in Databases (KDD) involves the extraction of implicit, previously unknown and potentially useful information stored in large databases. The amounts of data collected keep growing, and their analysis becomes more tedious. Data mining is an essential step in a KDD process. Efficiently searching large databases to extract knowledge that contributes to a decision is vital for any expert. Several methods and techniques are used in the KDD process to extract knowledge from large databases. Mining association rules, which tends to find interesting association or correlation relationships among large amounts of data, is one of these techniques. An association rule R is defined as an implication of the form R: S → T such that S ⊂ I, T ⊂ I and S ∩ T = Ø, I being a set of items. The generation of association rules involves two steps:



1. The extraction of frequent itemsets (with support ≥ Minsup),
2. The generation of association rules (with confidence ≥ Minconf).

Using this technique, we can generate, from a set of transactions, the frequent itemsets (itemsets with support above a minimum fixed by the user) and then the association rules from these itemsets. The first step is the most expensive with high demands for computation and data access [1] [3].
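As an illustration of the two measures involved, the following minimal sketch counts the support of an itemset and the confidence of a rule S → T on the six example transactions used later in Table II. The helper names `support` and `confidence` are hypothetical and are not taken from the paper; confidence is computed as Support(S ∪ T) / Support(S), the usual definition.

```python
from fractions import Fraction

# Example transactions (Tid -> set of items), as listed in Table II.
TRANSACTIONS = {
    1: {1, 2, 4, 6, 8, 10},
    2: {1, 2, 6, 10},
    3: {3, 4, 10},
    4: {1, 2, 4, 5, 8},
    5: {1, 3, 4, 6, 10},
    6: {2, 3, 4},
}


def support(itemset, db=TRANSACTIONS):
    """Number of transactions that contain every item of the itemset."""
    items = set(itemset)
    return sum(1 for t in db.values() if items <= t)


def confidence(s, t, db=TRANSACTIONS):
    """Confidence of the rule S -> T, i.e. Support(S u T) / Support(S)."""
    s, t = set(s), set(t)
    if s & t:
        raise ValueError("S and T must be disjoint")
    base = support(s, db)
    if base == 0:
        raise ValueError("S does not occur in the database")
    return Fraction(support(s | t, db), base)


print(support({1, 2, 6}))       # 2 (transactions 1 and 2)
print(confidence({1, 2}, {6}))  # 2/3
```

Exact fractions are used only to keep the printed confidence readable; a float would serve equally well.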

For this reason, we focus in this paper on frequent itemset counting. Our proposal is to adopt a binary approach for generating frequent itemsets. The transaction database is represented using a new compact data structure based on a signature tree. The use of signatures provides low storage cost and fast binary operations.

Several algorithms that generate association rules, such as Apriori [1], are based on two sub-steps, "Generate" and "Verify". However, this phase is the most expensive because of the multiple accesses to the transaction database. Other algorithms have tried to improve Apriori. The Partition algorithm [10] divides the transaction database into partitions, which increases the number of locally frequent itemsets that are globally rare, thus wasting time on redundant computation [13]. The Dynamic Itemset Counting algorithm [4] is a generalization of Apriori. FP-Growth [9] extracts the frequent itemsets without generating candidate itemsets. It is based on the FP-Tree structure, which must be completely rebuilt after each update. DFPMT-A [12] mines frequent itemsets based on the Apriori algorithm and uses a dynamic approach such as Longest Common Subsequence.

To compute the support of a collection of itemsets, it is necessary to access the transaction database. As the transaction database is generally large, one way to avoid repetitive and costly accesses is to represent it using compact structures, such as BitMap [8], FP-Tree [9], Patricia tree [11] and Transposed Form [12]. Standard data structures do not scale with the data size and do not perform well on large databases, so we adopt a binary and compact structure to improve performance and reduce the search space.

In this paper, we propose an approach using a binary tree structure to represent the transaction database. Each transaction is represented by a binary signature. The set of signatures is a signature file, which is represented as a signature tree. In the process of generating frequent itemsets, a signature SI is associated with each itemset I and is constructed in the same way as the transaction signatures. Each transaction signature is associated with the identifier (Tid) of the transaction that generates it. This process constructs the signature transaction tree, a compact structure, and finds all the frequent itemsets, based on a maximum support, using only one access to the transaction database.

The remainder of the paper is organized as follows: Section 2 gives an overview of the concept of signature tree. Section 3 presents our proposed structure, called STT (Signature Transaction Tree), and the tree construction process. Section 4 gives some basic concepts. In Section 5, we discuss the search process of a signature in STT. Section 6 is devoted to the process of generating frequent itemsets based on STT. In Section 7, we analyze the theoretical complexity of our proposal. In Section 8, we report the experimental results. Finally, Section 9 concludes the paper.

2 Specification of the Signature tree

A signature file can be considered as a set of bit strings, which are called signatures. Several approaches have been proposed to represent the signature file: sequential signature file, bit-slice file and several other variants. The signature file method involves a high processing cost. This problem is resolved by partitioning the signature file or by introducing an auxiliary data structure. Recently, Chen proposed an approach to represent the signature file as a signature tree [5].

Definition 1 [6]: A signature is a binary vector of length m obtained by applying one (or several) hash function(s).

Table I shows an example of signature extraction for a block ("full text scan"). The signatures of the words of the block are combined by superimposing (OR-ing) the word signatures.

TABLE I. Signature Generation

Word              Signature
Full              0000 0000 0000 0010 0000
Text              0000 0001 0000 0000 0000
Scan              0000 1000 0000 0000 0000
Block signature   0000 1001 0000 0010 0000

Definition 2 [5]: A signature tree Ts represents a set of signatures S = {S1, ..., Sn} where Si ≠ Sj for all i ≠ j and |Sk| = m for 1 ≤ k ≤ n. Ts is a binary tree such that:

• For each internal node of Ts, the left edge leaving it is always labeled with "0" and the right edge is always labeled with "1".
• Ts has n leaves labeled 1, 2, …, n, used as pointers to n different signatures S1, ..., Sn in S. Let nf be a leaf node; p(nf) denotes the pointer to the corresponding signature.
• Each internal node v is associated with a positive number, denoted position(v), which tells which bit will be checked.

3 Structure of STT

Improving the performance of algorithms for discovering association rules requires optimizing the extraction phase of frequent itemsets. To reach this objective, we propose the STT structure, which represents the transaction signatures. Each transaction is represented by a signature of size m. STT has the advantage of being both compact (binary representation) and dynamic (it takes updates into account). A signature tree contains two types of nodes: internal nodes and leaf nodes. For each internal node of STT, the left child corresponds to the value "0" and the right one to the value "1". Each leaf node contains three pieces of information: a signature, the number of transactions generating this signature and the transaction identifiers (Tids). The number of leaf nodes in STT is equal to the number of distinct signatures.

3.1. Example

The construction of a signature tree requires two phases:

• The application of a hash function H(item) = Integer (for example, we use the modulo function) to obtain the signature of each item in a transaction; the composition of these signatures gives the transaction signature. All transactions and their signatures are shown in Table II. Note that transactions T3 and T6 generate the same signature (a collision). It is represented only once in STT (S3 in our example).
• Each transaction signature is inserted in the STT. Each leaf of this tree contains the signature, the Tids and the number of transactions generating this signature.

TABLE II. Transactions signatures and corresponding STT

Tid   Transaction          Signature
1     1, 2, 4, 6, 8, 10    11101010
2     1, 2, 6, 10          01100010
3     3, 4, 10             00111000
4     1, 2, 4, 5, 8        11101100
5     1, 3, 4, 6, 10       01111010
6     2, 3, 4              00111000

[Figure: the corresponding STT. The root checks bit 1. Its left child checks bit 2, with the leaf {S3, 2, Tid3, Tid6} on the left and, on the right, a node checking bit 4 with leaves {S2, 1, Tid2} (left) and {S5, 1, Tid5} (right). The right child of the root checks bit 6, with leaves {S1, 1, Tid1} (left) and {S4, 1, Tid4} (right).]
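The signature generation used in the example can be sketched as follows. The text only says that a modulo function is used; the modulus 8 (the signature length) and the convention that bit 0 is the leftmost bit are assumptions, chosen because they reproduce the signatures listed in Table II. The name `gen_sig` mirrors the Gen_Sig function mentioned in the paper, but the body is only an illustration.

```python
M = 8  # assumed signature length, matching the 8-bit signatures of Table II


def gen_sig(items, m=M):
    """Superimpose (OR) the one-bit signatures of all items.

    Assumption: item i sets the bit at index (i mod m), counted from the left.
    """
    bits = ["0"] * m
    for item in items:
        bits[item % m] = "1"
    return "".join(bits)


TRANSACTIONS = {
    1: [1, 2, 4, 6, 8, 10],
    2: [1, 2, 6, 10],
    3: [3, 4, 10],
    4: [1, 2, 4, 5, 8],
    5: [1, 3, 4, 6, 10],
    6: [2, 3, 4],
}

for tid, items in TRANSACTIONS.items():
    print(tid, gen_sig(items))
```

Under these assumptions T3 and T6 both map to 00111000, which is the collision mentioned in the example.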

3.2. Construction process of STT

At the beginning, the tree contains a single node: a leaf containing the first transaction signature, its Tid and the number "1". Then, we take a new transaction signature, a composition of item signatures, and insert it into the STT. Let s be the signature we wish to insert. We traverse the tree from the root. Let v be the node encountered and assume that v is an internal node with position(v) = i. Then s[i] is checked: if s[i] = 0, we go left; otherwise, we go right. If v is a leaf node, we compare s with the signature s' stored in v. If s = s', we only add the Tid to the leaf v and increment its number of transactions nt. Otherwise, s is a new signature. We assume that the first k bits of s agree with s', but s differs from s' in the (k+1)th position. We construct a new node u with position(u) = k+1 and replace v with u: the position of v in the tree is now occupied by u, and v becomes one of u's children. If s[k+1] = 1, we make v the left child and s the right child of u. If s[k+1] = 0, we make v the right child and s the left child of u.

1) Steps to generate STT

The following steps (a) to (f) show how to create the STT. Step (a) builds a root node r such that r is a leaf node containing the signature S1, the number "1" and the identifier of the first transaction T1. Steps (b) to (e) each insert a new signature Si as a leaf node of STT, using the value of the signature bit at the position stored in each internal node. Step (f) inserts an existing signature. We proceed as in steps (b) to (e); on reaching the corresponding leaf node, we observe that S6 = S3, so we increment the number and add the identifier of transaction T6.

(a) Insert T1 (S1): the tree is the single leaf {S1, 1, Tid1}.

(b) Insert T2 (S2): S1[1] ≠ S2[1] ⇒ create internal node v with position(v) = 1 and leaf node {S2, 1, Tid2}.

(c) Insert T3 (S3): S3[1] = 0, S2[2] ≠ S3[2] ⇒ create internal node v with position(v) = 2 and leaf node {S3, 1, Tid3}.

(d) Insert T4 (S4): S4[1] = 1, the first different bit between S4 and S1 is the 6th bit, S4[6] = 1 and S1[6] = 0 ⇒ create internal node v with position(v) = 6 and leaf node {S4, 1, Tid4}.

(e) Insert T5 (S5): S5[1] = 0, S5[2] = 1. The first different bit between S2 and S5 is the 4th bit, S2[4] = 0 and S5[4] = 1 ⇒ create internal node v with position(v) = 4 and leaf node {S5, 1, Tid5}.

(f) Insert T6 (S6): S6[1] = 0, S6[2] = 0, S6 = S3 ⇒ increment the number and add Tid6 in the leaf node, which becomes {S3, 2, Tid3, Tid6}.

Figure 1. Steps of STT Construction.

2) Algorithm to construct STT

The algorithm Cons_STT to construct the tree is presented below. It is composed of two parts:

• Signature Generation: A hash function "Gen_Sig(Ti)" provides the signature of the transaction Ti.
• Insertion: Call the Insert(Si) procedure, which inserts the signature Si in STT.

TABLE III. STT Construction Algorithm

Algorithm Cons_STT /* STT construction */
/* Input: Set of Transactions */
/* Output: STT */
/* v is the current internal node; it contains 3 fields: bit position to check, left child pointer and right child pointer */
/* f is the current leaf node, containing 3 fields: the signature S, the Tids and the number nt */
Begin
  S1 = Gen_Sig (T1)
  Build a root node r such that r is a leaf node. /* It contains the first transaction signature S1 and the corresponding Tid */
  For i = 2 to n Do
    Si = Gen_Sig (Ti)
    Call Insert (Si)
  EndDo
End

The following procedure inserts a given transaction signature into STT:

TABLE IV. Insert algorithm in STT

Procedure Insert (s, STT)
Begin
  Stack ← root
  While Stack not empty Do
    v ← pop (Stack)
    If v is an internal node Then
      j ← position (v)
      If s[j] = 1 Then push (Stack, right_child(v))
      Else push (Stack, left_child(v))
      Endif
    Else /* v is a leaf node containing s' and nt */
      If s = s' Then /* Old signature */
        Add Tid to v
        nt ← nt + 1
      Else /* New signature */
        Assume that the first k bits of s agree with s' and that s differs from s' in the (k+1)th position.
        Generate a new internal node u with position(u) = k+1 and replace v with u.
        Generate a new leaf node v' ← {s, Tid, 1}
        If s[k+1] = 1 Then v' will be the right child of u and v the left child
        Else v' will be the left child of u and v the right child
        Endif
      Endif
    Endif
  EndDo
End

The insertion procedure presents two possible cases:

• s = s'. In this case, we add the Tid to the leaf node and increment the number of transactions nt.
• s ≠ s'. A new internal node u is created, containing the position of the first bit that differs between s and s'. We also create a new leaf node v' containing {s, Tid, 1}. If s[k+1] = 1, we make v' the right child and v the left child of u. If s[k+1] = 0, we make v' the left child and v the right child of u.
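The insertion procedure can be sketched in Python as follows. This is a minimal illustration, not the authors' C++ implementation: the class names `Leaf` and `Node` are hypothetical, a simple loop replaces the explicit stack of Table IV, and the six signatures are those of Table II. Bit positions are 1-based, as in the paper.

```python
class Leaf:
    """Leaf node: a signature and the Tids that generate it (nt = len(tids))."""
    def __init__(self, sig, tid):
        self.sig = sig
        self.tids = [tid]


class Node:
    """Internal node: the 1-based bit position to check and two children."""
    def __init__(self, position, left, right):
        self.position = position
        self.left = left
        self.right = right


def insert(root, sig, tid):
    """Insert a signature and return the (possibly new) root of the tree."""
    if root is None:
        return Leaf(sig, tid)
    parent, side, v = None, None, root
    while isinstance(v, Node):  # follow the bits of sig down to a leaf
        parent = v
        side = "right" if sig[v.position - 1] == "1" else "left"
        v = getattr(v, side)
    if v.sig == sig:  # old signature: record the Tid only
        v.tids.append(tid)
        return root
    # New signature: k is the 0-based index of the first differing bit.
    k = next(i for i in range(len(sig)) if sig[i] != v.sig[i])
    new_leaf = Leaf(sig, tid)
    if sig[k] == "1":
        u = Node(k + 1, v, new_leaf)  # new leaf on the right
    else:
        u = Node(k + 1, new_leaf, v)  # new leaf on the left
    if parent is None:
        return u
    setattr(parent, side, u)  # u takes the place of v
    return root


SIGNATURES = ["11101010", "01100010", "00111000",
              "11101100", "01111010", "00111000"]  # S1..S6 from Table II

root = None
for tid, sig in enumerate(SIGNATURES, start=1):
    root = insert(root, sig, tid)

print(root.position, root.left.position, root.right.position)  # 1 2 6
print(root.left.left.sig, root.left.left.tids)  # 00111000 [3, 6]
```

Replaying the six insertions yields the same shape as steps (a) to (f): bit 1 at the root, bits 2 and 4 on the left branch, and bit 6 on the right branch.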

4 Basic Concepts

Definition 3: An item Ii is any object, attribute or literal belonging to a finite set of distinct elements D = {I1, I2, ..., In}.
Definition 4: An itemset I is a subset of D. A k-itemset is an itemset of cardinality k.
Definition 5: A transaction Ti is an itemset which is associated with an identifier: the Transaction Identifier (Tid).
Definition 6: The support of an itemset I, denoted Support(I), is the number of transactions containing I.
Definition 7: The minimum support Minsup is a threshold fixed by the user.
Definition 8: An itemset I is said to be frequent if Support(I) ≥ Minsup.
Definition 9: The maximum support Maxsup of an itemset I is equal to the sum of the sizes of the selected Tid sets. If SI is the signature of the itemset I and L1, L2, …, Lp are the sets of Tids selected during the search process of SI, the maximum support is:
Maxsup(I) = |L1| + |L2| + … + |Lp|
where |Li| is the number of Tids in leaf i.
Definition 10: An itemset I is said to be Mfrequent if Maxsup(I) ≥ Minsup.

5 Search Process in STT

Now, we discuss how to search for a signature SI of an itemset I in STT. During the traversal of STT, the inexact matching is done as follows:

1. Let v be the node encountered and position(v) be the position to be checked.
2. If SI[position(v)] = 1, we move to the right child of v.
3. If SI[position(v)] = 0, both the right and left child of v are explored.

In fact, this process corresponds to the signature matching criterion: for a bit position i in SI, if it is set to "1", the corresponding bit position in s must be set to "1"; if it is set to "0", the corresponding bit position in s can be "1" or "0". This reflects that only the signatures s such that s ∧ SI = SI are selected.

Example 1. The itemset I is composed of the items 1, 2 and 6. If we apply the hash function, we obtain the signature SI = 01100010. The procedure Search(SI) selects all the signatures that contain SI (in our example, S1, S2 and S5). Figure 2 shows in bold the paths followed in the STT when searching SI.

[Figure: the STT of Table II, with the paths followed during the search shown in bold.]

Figure 2. Result of search SI = 01100010
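The matching criterion and Definition 9 can be checked directly on the five leaves of the example STT, without traversing the tree. In the sketch below the itemset signature is taken as 01100010, the value obtained by hashing items 1, 2 and 6 modulo 8; this hash is an assumption inferred from the signatures of Table II, and the helper names `contains` and `maxsup` are hypothetical.

```python
# Leaves of the example STT: name -> (signature, Tids), taken from Table II.
LEAVES = {
    "S1": ("11101010", [1]),
    "S2": ("01100010", [2]),
    "S3": ("00111000", [3, 6]),
    "S4": ("11101100", [4]),
    "S5": ("01111010", [5]),
}


def contains(s, si):
    """True if every bit set in si is also set in s, i.e. (s AND si) == si."""
    si_bits = int(si, 2)
    return (int(s, 2) & si_bits) == si_bits


def maxsup(si, leaves=LEAVES):
    """Return the selected leaves and the sum of the sizes of their Tid sets."""
    selected = [name for name, (s, _) in leaves.items() if contains(s, si)]
    total = sum(len(leaves[name][1]) for name in selected)
    return selected, total


selected, total = maxsup("01100010")
print(selected, total)  # ['S1', 'S2', 'S5'] 3
```

Maxsup is an upper bound on the support: S5 is selected although transaction T5 does not contain item 2, because item 10 sets the same bit under the assumed hash. The exact support of {1, 2, 6} in Table II is 2.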

5.1. Algorithm to search a signature in STT

This algorithm searches for a signature and computes the maximum support of an itemset using the STT structure.

TABLE V. Search Algorithm in STT

Algorithm Search
/* Input: an itemset I */
/* Output: Maxsup (I) */
Begin
  SI = Gen_Sig (I)
  ST ← Ø
  push (Stack, root)
  While Stack not empty Do
    v ← pop (Stack)
    If v is an internal node Then
      i ← position (v) /* bit position to check */
      If SI[i] = 1 Then push (Stack, right_child(v))
      Else
        push (Stack, left_child(v))
        push (Stack, right_child(v))
      Endif
    Else /* v is a leaf node */
      Compare SI with the signature S
      If S contains SI Then ST ← ST ∪ {Tids} Endif
    Endif
  EndDo
  Use ST to compute Maxsup(I)
  Return Maxsup(I)
End

6 Frequent Itemsets Generation

The generation of frequent itemsets computes, for each candidate itemset I, the maximum support Maxsup(I) and compares it to a minimum support Minsup defined by the user. An itemset I is said to be frequent if Maxsup(I) is greater than Minsup.

TABLE VI. Generation Frequent Itemset Algorithm

Algorithm Generation_FI
/* Input: Set of itemsets I = {I1, I2, ..., In} */
/* Output: Set of frequent itemsets FI */
Begin
  /* Initially, FI is empty */
  For i = 1 to n Do
    Search (Ii, Maxsup(Ii))
    If Maxsup(Ii) > Minsup Then FI = FI ∪ {Ii} /* Union */
    Endif
  EndDo
  Return FI
End

Example 2. If we consider the transaction database of Table II, the itemset I = {1, 2, 6} and Minsup = 2, the search selects 3 signatures, S1, S2 and S5. Then Maxsup(I) = 3, which is greater than Minsup. We conclude that the itemset I is frequent.

7 Complexity Study

The algorithm Gen_STT to build the signature tree has a complexity of O(n*m), where:

• n: number of transaction signatures
• m: size of a signature

On the other hand, the procedure for inserting each signature in STT requires one tree traversal for the first signature, 2 for the second, and so on. The number of paths traversed is:
1 + 2 + ... + n = n(n + 1)/2 = (n^2 + n)/2
The complexity of the insertion algorithm is therefore about O(n^2). The search procedure of a signature in the STT has been studied by Chen [5] and is of order O(n/2^l), where:

• n is the number of signatures
• l is the number of bits set to "1" in SI.

The Generation_FI algorithm for generating frequent itemsets contains a loop that is run n times (n being the number of candidate itemsets). The cost of handling the candidate itemsets is thus n times that of the search procedure, i.e. of the order of O(n (n/2^l)) = O(n^2/2^l). Finally, the complexity of generating frequent itemsets is:
O(nm) + O(n^2) + O(n^2/2^l) ~ O(n^2).
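The Search procedure of Section 5.1 and the Generation_FI loop of Section 6 can be sketched together on the example database. This is an illustration under stated assumptions rather than the authors' implementation: the STT of Table II is hard-coded as nested tuples, the hash is taken as item mod 8 with bit 0 leftmost (inferred from Table II), and the candidate itemsets are hypothetical. As in Table VI, an itemset is kept when Maxsup is strictly greater than Minsup.

```python
M = 8  # assumed signature length


def gen_sig(items, m=M):
    """Assumed hash: item i sets the bit at index (i mod m) from the left."""
    bits = ["0"] * m
    for item in items:
        bits[item % m] = "1"
    return "".join(bits)


# STT of Table II. Internal node: (position, left, right), position 1-based.
# Leaf: (signature, Tids).
STT = (1,
       (2,
        ("00111000", [3, 6]),
        (4, ("01100010", [2]), ("01111010", [5]))),
       (6, ("11101010", [1]), ("11101100", [4])))


def is_leaf(v):
    return isinstance(v[0], str)


def search(itemset, root=STT):
    """Return Maxsup(itemset): the number of Tids in all matching leaves."""
    si = gen_sig(itemset)
    si_bits = int(si, 2)
    selected_tids = []
    stack = [root]
    while stack:
        v = stack.pop()
        if is_leaf(v):
            sig, tids = v
            if (int(sig, 2) & si_bits) == si_bits:  # sig contains si
                selected_tids.extend(tids)
        else:
            position, left, right = v
            if si[position - 1] == "1":
                stack.append(right)  # bit required: right subtree only
            else:
                stack.append(left)   # bit not required: explore both
                stack.append(right)
    return len(selected_tids)


def generation_fi(candidates, minsup):
    """Keep the candidates whose maximum support exceeds minsup."""
    return [c for c in candidates if search(c) > minsup]


print(search((1, 2, 6)))                                 # 3
print(generation_fi([(1, 2, 6), (3, 4), (5, 8)], 2))     # [(1, 2, 6), (3, 4)]
```

For the candidate (5, 8), both required bits send the traversal to the right child at each internal node, so a single leaf is visited. This pruning is what the O(n/2^l) search bound describes.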

8 Experimental Study

We have implemented our proposal in C++, on an Intel Core 2 Duo at 1.80 GHz with 2 GB of RAM. In a first experiment, we performed the same test on two transaction databases, one dense and one sparse, varying the minimum support to measure its influence on the total number of frequent itemsets extracted, and compared our results with those obtained by Apriori [2]. Figure 3 and Figure 4 show the results obtained with the Mushroom and T10I4D100K transaction databases, respectively. The number of frequent itemsets extracted by our approach, based on Maxsup, is about equal to the number of frequent itemsets obtained by Apriori.

Figure 3. Experimentation with a dense database (Mushroom)

Figure 4. Experimentation with a sparse database (T10I4D100K)

In the second experiment, we consider several transaction databases and compute the time required for different values of Minsup.

Figure 5. Time vs Minsup for Mushroom

Figure 5 shows the time vs Minsup for Mushroom obtained by our approach and by Apriori [4], respectively. Our approach gives a better result for a support less than 20% and the same result for a support greater than 20%.

Figure 6. Time vs Minsup for Retail

Figure 6 gives the time vs Minsup for Retail obtained by both approaches. Our result is also better for all supports.

Figure 7. Time vs Minsup for Accident

In Figure 7, we have the same time for supports between 50% and 90% and a better time for supports less than 50%.

Figure 8. Time vs Minsup for Kosarak

Figure 8 shows that our result is better only for small supports (less than 1.5%).

Figure 9. Time vs Minsup for T10I4D100K

Figure 9 shows a linear time for both approaches. Our approach provides better results than Apriori.

Figure 10. Time vs Minsup for T40I10D100K

In Figure 10, for T40I10D100K, the result is approximately the same for our approach and Apriori.

9 Conclusion

In this paper, we proposed a new data structure to represent database transactions. The main characteristic of this structure is the use of a binary signature associated with each transaction. These binary signatures are then organised in a tree, in which each edge is labeled with "0" or "1", and each internal node is associated with a number indicating which bit of a signature to check. Thus, searching for a signature uses only the binary signature tree and needs only one access to the transaction database.

The complexity of our proposal is linear. To show the efficiency of our approach, we conducted a series of experiments comparing our proposal with the Apriori algorithm, using different transaction databases and different supports. The results of these experiments show that the signature tree algorithm significantly outperforms Apriori.

10 References

[1] Agrawal R., Srikant R., "Fast Algorithms for Mining Association Rules", In Proceedings of the 20th VLDB Conference, pp. 487-499, September 12-15, Santiago, Chile, 1994.
[2] Bodon F., "A Trie-based APRIORI Implementation for Mining Frequent Item sequences", OSDM05, Proc. of the 1st Int. Workshop on Open Source Data Mining, pp. 56-65, August 21, Chicago, Illinois, USA, 2005.
[3] Bodon F., "A Fast APRIORI Implementation", Proceedings of FIMI'03, 19th Workshop on Frequent Itemset Mining Implementations, in conjunction with the 3rd IEEE International Conference on Data Mining, pp. 16-25, November 19, Melbourne, Florida, USA, 2003.
[4] Brin S., Motwani R., Ullman J.D., "Dynamic itemset counting and implication rules for market basket data", In Proceedings of the ACM SIGMOD, pp. 255-264, May 11-15, Tucson, Arizona, USA, 1997.
[5] Chen Yangjun, Chen Yibin, "On the Signature Tree Construction and Analysis", IEEE Transactions on Knowledge and Data Engineering, vol. 18, Issue 9, pp. 1207-1224, September 2006.
[6] Faloutsos C., "Signature Files: Design and Performance Comparison of Some Signature Extraction Methods", ACM SIGMOD Record, Volume 14, Issue 4, pp. 63-82, May 1985.
[7] FIMI repository. http://fimi.cs.helsinki.fi/data.
[8] Gardarin G., Pucheral Ph. and Wu F., "Bitmap based algorithms for mining association rules", In Proceedings of the 14th Int. Conf. Bases de Données Avancées, pp. 157-175, October 26-30, Hammamet, Tunisia, 1998.
[9] Han J., Pei J. and Yin Y., "Mining frequent patterns without candidate generation", In Proceedings of the 2000 ACM SIGMOD Int. Conf. on Management of Data, pp. 1-12, May 14-19, Dallas, 2000.
[10] Savasere A., Omiecinski E., Navathe S., "An efficient algorithm for mining association rules in large databases", In Proceedings of the 21st VLDB Conference, pp. 432-444, September 11-15, Zurich, Switzerland, 1995.
[11] Zandolin D. and Pietracaprina A., "Mining Frequent Itemsets using Patricia Tries", In Proceedings of the Workshop on Frequent Itemset Mining Implementations, FIMI03, vol. 90 of CEUR Workshop Proceedings, Melbourne, Florida, USA, 2003.
[12] Joshi S. and Jain R.C., "A Dynamic Approach for Frequent Pattern Mining Using Transposition of Database", In Proceedings of the International Conference on Communication Software and Networks (ICCSN'10), pp. 498-501, Feb 26-28, Singapore, 2010.
[13] Zaki M.J., Parthasarathy S., Ogihara M. and Li W., "New Algorithms for Fast Discovery of Association Rules", In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD-97), pp. 283-286, August 14-17, Newport Beach, California, USA, 1997.