An Algorithm for Mining Frequent Closed Itemsets ... - ScienceDirect.com

7 downloads 186 Views 351KB Size Report
Caiyan Dai and Ling Chen / Physics Procedia 24 (2012) 1722 – 1728. 1723. Author name / Physics Procedia 00 (2011) 000–000. This paper presents an ...
Available online at www.sciencedirect.com

Physics Procedia

Physics Procedia 00 (2011) Physics Procedia 24000–000 (2012) 1722 – 1728

www.elsevier.com/locate/procedia

2012 International Conference on Applied Physics and Industrial Engineering

An Algorithm for Mining Frequent Closed Itemsets in Data Stream Caiyan Dai 1, Ling Chen12 1

Department of Computer Science, Yangzhou University , Yangzhou, 225009, China 2 State Key Lab of Novel Software Technology, Nanjing University, Nanjing 210093, China

Abstract Mining frequent itemsets from data streams by the model of sliding window has been extensively studied. This paper presents an algorithm AFPCFI-DS for mining the frequent itemsets from data streams. The algorithm detects the frequent items using a FP-tree in each sliding window. In processing each new window the algorithm first changes the head table and then modifies the FP-tree according to the changed items in the head table. The algorithm also adopts local updating strategy to avoid the time-consuming operations of searching in the whole tree to add or delete transactions. Our experimental results show that the algorithm is more efficient and has lower time and memory complexity than the algorithms Moment and FPCFI-DS. © 2011 by Elsevier B.V. Selection and/or peer-review under responsibility of ICAPIE Organization © 2011Published Published by Elsevier Ltd. Selection and/or peer-review under responsibility of [nameCommittee. organizer] Open access under CC BY-NC-ND license.

Keywords: stream data, closed frequent data itemsets, sliding window

1. Introduction In recent years, researchers have paid more attention to mining data streams. Mining frequent itemsets from stream data is an important problem with ample applications in stream data analysis. Examples include stock tickers, bandwidth statistics for billing purposes, network traffic measurements, web-server click streams, data feeds from sensor networks, transaction analysis in stocks and telecom call records, etc. Unlike traditional datasets, stream data flow in and out of a computer system continuously and with varying update rates. They are temporally ordered, fast changing, massive and potentially infinite. For the stream data applications, the volume of data is usually too huge to be stored or to be scanned for more than once. Furthermore, since the data points can only be sequentially accessed in data streams, random data access is not practicable.

1875-3892 © 2011 Published by Elsevier B.V. Selection and/or peer-review under responsibility of ICAPIE Organization Committee.

Open access under CC BY-NC-ND license. doi:10.1016/j.phpro.2012.02.254

Caiyan Dai and Ling Chen / Physics Procedia 24 (2012) 1722 – 1728 Author name / Physics Procedia 00 (2011) 000–000

This paper presents an algorithm AFPCFI-DS for mining the frequent itemsets from data streams. The algorithm detects the frequent items using a FP-tree in each sliding window. The experimental results show that the algorithm is more efficient and has lower time and memory complexity than the algorithms Moment and FPCFI-DS. 2. Related Work In 1999, Pasquierti 0 first proposed the concept of closed frequent data itemset to reduce the storage space and processing time. Since then mining closed frequent data itemsets in streams has attracted increasing amount of interest. MOMENT by Chi 0 is a typical algorithm mining the frequent closed itemsets in data stream, which can decrease the size of the data structure.. Based on a FP-tree data structure, J. Han et al proposed FP-growth algorithm 0. In solving many application problems, it is desirable to discount the effect of old data. One way to handle such problems are is using sliding window models. In mining the data stream, there are two typical models of sliding window 0: milestone window model and attenuation window model. Lossy Counting 0 is a typical mining algorithm for data stream based on the milestone window. Using the attenuation window, Chang presented the algorithm Decest algorithm 0 in 2003. Gannell 0 proposed FP-stream by adopting the traditional FP-tree to the data stream. . FP-stream detects the frequent itemsets in the different period by using the different time granularity. Teng proposed FTP-DS algorithm 0 which uses the statistical regression technique in the sliding window. 2.1 The FP-growth Algorithm FP-growth (frequent-pattern growth) algorithm is based on the following partition strategy. Firstly it compresses the database into a frequent pattern tree. Then the compressed database is divided into a group of conditional database, each of which is associated with a frequent itemset or a pattern portion and then the algorithm mines every conditional database. The algorithm needs not to generate all the candidate frequent itemsets. It uses the most infrequent item as the suffix to reduce the search space and the overhead significantly. 2.2 The Moment Algorithm The algorithm of Moment uses an data structure called Closed Enumeration Tree (CET) which consists of four types of nods as follows. Infrequent nodes, Unpromising nodes, Intermediate nodes, Closed nodes. Example 2: Let minsup=50 %, in the GET tree show in Fig.1, d: 1 is an infrequent node, ac: 2、b:3 are unpromising nodes, a: 3 is an intermediate node, and abc: 2、ab:3、c:3 are closed nodes.

Figure 1. An Example of GET

1723

1724

Caiyan Dai and Ling Chen / Physics Procedia 24 (2012) 1722 – 1728 Author name / Physics Procedia 00 (2011) 000–000

In MOMENT, huge amount of memory space and processing time are required for building the GET. To overcome the drawbacks of the algorithms FP-growth and Moment, we present an algorithm AFPCFIDS, which is suitable for the data stream and also efficient in terms of time/space complexity. 3. THE AFPCFI-CS ALGORITHM In this section we introduce the algorithm AFPCFI-CS for mining the frequent closed itemsets in data stream. The framework of algorithm AFPCFI-DS is as follows. Procedure :AFPCFI-DS (D) Input: Data Stream D Output: the updated GCT Begin 1 BuildFirstTree (); Repeat 2 Receive a new transaction t from the stream; 3 AddTrans (t); 4 DeleteTrans (n); /* n is the oldest transaction in the window*/ 5 Until end condition End 3.1 Constructing the FP-tree and GCT for the First Window In line 1 of the algorithm FP-tree and GCT for the first window are constructed by a subroutine BuildFirstTree(). When a new transaction in the stream enters the first window, the head table should be modified accordingly, and the new transaction will be inserted into the FP-tree according to the head table. The algorithm BuildFirstTree().is described as follows: Procedure: BuildFirstTree() Input: length: length of the window; Output: T: the FP-tree, GCT: the Closed Enumeration Tree Begin 1 T=Φ; size=0; 2 repeat 3 receive a new transaction t from the stream; 4 update the head table; 5 insert t into T; 6 size=size+1; 7 call Arrange (T); 8 until size=length; 9 call AFPCFI-DS (T); End Figure 2 illustrates the process of constructing the first FP-tree of the database in Example 2.

Caiyan Dai and Ling Chen / Physics Procedia 24 (2012) 1722 – 1728 Author name / Physics Procedia 00 (2011) 000–000

Figure 2. Insert transaction

bc , ab , acd , acd

The support of a and c in the new window is greater than that of b. Therefore, location of their nodes in the FP-tree should be rearranged as shown in Figure 3.

Figure 3. Rearrange the order of the nodes in FP-tree

Line7 in algorithm BuildFirstTree() uses a procedure Arrange (T) to rearrange the nodes in the FP-tree when adding or deleting transactions. Procedure Arrange (T) is shown as follows. Procedure: Arrange (T,n) Input: T: the FP-tree, m, n (m's child node) Output: updated FP-tree T; Begin 1 for each n's child n' do 2 if sup (n) n'->....from m; 7 insert n'->n->...into m; 8 if m has another child node n' then 9 merge () End When the algorithm inserts or deletes transactions, the support of the associated items should be changed. At this moment, their nodes in FP-tree should be modified accordingly. Let node n’ be a child of node n. If the current head table indicates sup (n’) ≧sup (n) , we should change the position of n and n’ to make sure that all the nodes in the routes of the FP-tree is arranged in the descent order of their support..

3.2 Adding a Transaction After the transaction in the first window being processed, the main task of the algorithm is to maintain the FP-tree and GCT when new transactions coming into the current window. Line 3 of the algorithm AFPCFI-CS (D) adds the new transaction t using the procedure AddTrans(t). The procedure AddTrans (T) on line 3 of the algorithm AFPCFI-DS (D) is dicribed as follows: Procedure:AddTrans(T) Input: T ,GCT Output: the updated T, GCT

1725

1726

Caiyan Dai and Ling Chen / Physics Procedia 24 (2012) 1722 – 1728 Author name / Physics Procedia 00 (2011) 000–000

1 updates the head table according to the new transaction t; 2 if exists super itemset of the new transaction in CFI then 3 insert the new transaction into CFI 4 its sup=sup (super itemset) +1 5 if exists the same itemset to the new transaction in CFI then 6 its sup=sup+1 7 if exists son itemset of the new transaction in CFI then sup (son itemset) = sup (super itemset) +1 8 9 find all the paths associated with the changed items in the head table (except the condition in line7 and line8) and form a new FP-tree T'; 10 insert t into T' 11 update support of the associated nodes; 12 call Arrange (T'); 13 call AFPCFI-DS (T') Figure 4 shows the new FP-tree after inserting new transaction abd into the current window of Example 2.

Figure 4. The FP-tree after inserting

3.3 Deleting a Transactions Once a new transaction is added into the window, there must be an old one being deleted from the window, and the FP-tree and GCT-tree should be modified accordingly. Since the two trees are considered to be independent when deleting the transactions we can modify them separately. Line 4 of the algorithm AFPCFI-CS (D) deletes the old transaction which uses the procedure DeleteTrans (n) described as follows. The change of GCT-tree: Procedure: DeleteTrans (n) Input: n, the current node in GCT Output: updating the children nodes of node n 1 updates the head table; 2 find all the paths associated with the changed items in the head table and form a new FP-tree T'; 3 update support of the associated nodes; 4 call Arrange (T') 5 for each child node n' of n do 6 if n' is not relevant to t then 7 continue; 8 update support of n'; 9 set Y=n'.base U {n'}; 10 if sup(Y)>=min_sup then 11 if leftcheck(Y) =false then 12 DeleteTrans (n');

1727

Caiyan Dai and Ling Chen / Physics Procedia 24 (2012) 1722 – 1728 Author name / Physics Procedia 00 (2011) 000–000

13 else 14 remove n' and n's descendants from GCT; 15 remove some invalid intermediate node; 16 else 17 remove n' and n's descendants from GCT; 18 remove some invalid intermediate node; 19 if n was closed and a child n' of s.t. sup (n') =sup (n) then 20 mark n an intermediate node; 21 remove n from the hash table; 22 else if n was closed then 23 update n's entry in the hash table; For example, after deleting the transaction bc and then d ' s support is more than the b' s and c' s , so it is needed to switch the location of responding items in the FP-tree. Secondly, find the responding item in DCT-tree and change the node’s type. Lastly, determine which node is intermediate node; delete the responding child node and unpromising intermediate nodes. 4. Experimental Results

In order to evaluate the performance of this algorithm, we compare it with Moment and FPCFI-DS 0 under the condition of the decreasing support. The experiment environment of this essay is: 2.8GHz CPU and 1G RAM PC; the operation system is Windows XP; the program developed in Visual studio C++6.0. The testing datasets of this article are Mushroom 0 and T10I4D100K 0. The size of the sliding window is 5K and 80K when dealing with dense dataset and sparse dataset respectively. We do the experiment of the usage of memory, the time of the building of the first window and the average time of the moving of the sliding window respectively. FPCFI-DS Moment AFPCFI-DS

200 100

FPCFI-DS Moment AFPCFI-DS

20

Loading Time(S)

Memory(MB)

300

0

15 10 5

0.3 Sliding Time(S)

400

0.1

0

0

1

FPCFI-DS Moment AFPCFI-DS

0.2

0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 Minimum Support(%)

1

0.9

0.8 0.7 0.6

0.5 0.4 0.3

1

0.2 0.1

0.9

0.8

Minimum Support(%)

0.7 0.6 0.5 0.4 Minimum Support(%)

0.3

0.2

0.1

Memory(MB)

800 FPCFI-DS Moment AFPCFI-DS

600 400 200 0

Loading Time(S)

400

FPCFI-DS Moment AFPCFI-DS

300 200 100

0.9 0.8

0.7 0.6 0.5 0.4 Minimum Support(%)

0.3 0.2

0.1

FPCFI-DS Moment AFPCFI-DS

0.2

0.1

0

0

1

Sliding Time(S)

0.3

1000

1

0.9 0.8

0.7 0.6 0.5 0.4 Minimum Support(%)

0.3 0.2

0.1

1

0.9

0.8

0.7

0.6

0.5

0.4

0.3

0.2

0.1

Minimum Spport(%)

Figure 5. Memory usage,Time for first window and Time for slinding

Figure5 show the usage of memory, the time of the building of the first window and the average time of the moving of the sliding window of the three algorithms. When dealing with dense and sparse dataset, Moment needs the most memory and AFPCFI-DS needs the least. AFPCFI-DS is obviously quicker than Moment, but a little slower than FPCFI-DS. when the minsup is relatively big; the moving speed of the three different algorithms isn’t obvious. But along with the decreasing of the minsup, AFPCFI-DS is the most rapid.

1728

5. Conclusions

Caiyan Dai and Ling Chen / Physics Procedia 24 (2012) 1722 – 1728 Author name / Physics Procedia 00 (2011) 000–000

This paper presents an improvement algorithm, called AFPCFI-DS, which not only can deal with data streams ,but also solve the problem of using too much of the search space in Moment. The related experiment indicates that the algorithm is effective and performs much more quickly than the previous algorithm Moment and FPCFI-DS. Although compared with FPCFI-DS, the time of building the first window is some longer, but the speed of constructing the GCT is much more quickly. When compared with Moment, AFPCFI-DS performs also much better. References [1]Pasquier N ,Bastide Y, Taouil R,Lakhal L. Discovering Frequent Closed Itemsets for Association Rules[P],Springer, Volume 1540/ 1999: 398-416 [2]Chi Y, Wang H, Yu P , Muntz R. MOMENT: Maintaining closed frequent Itemsets over a stream sliding window [A] .Proceedings of the 2004 IEEE International Conference on Data Mining[C] . Brighton, UK: IEEE Computer Society Press, 2004. 59 - 66. [3]Han, J.Pei, and Y.Yin, Mining frequent patterns without candidate generation, Proc. ACM SIGMOD 2000, pp.1(R)C12 [4]Zhu Y, Shasha D. StatStream: statistical monitoring of thousands of data streams in real time [A]. Proceedings of the 20th International Conference on Very Large Data Bases[C]. Hong Kong, China: Morgan Kaufmann, 2002.358 - 369. [5]Manku G, Motwani R. Approximate frequency counts over data streams [A]. Proceedings of the 28th International Conference on Very Large Data Bases [C]. Hong Kong, China: Morgan Kaufmann, 2002. 346 - 357. [6]Chang J, Lee W. Finding recent frequent Itemsets adaptively over online data streams [A]. Proceedings of the 9th ACMSIGKDD International Conference on Knowledge Discovery and Data Mining [C]. Washington, USA: ACM Press, 2003, 487 - 492. [7]Gannell C, Han J, Robertson E, Liu C. Mining frequent Itemsets over arbitrary time intervals in data streams [R]. Bloomington: Indiana University, 2003. [8]Teng W2G, Chen M2S, Yu P S. A regression based temporal pattern mining scheme for data streams [A]. Proceedings of the 29th International Conference on Very Large Data Bases[C]. Berlin, Germany: Morgan Kaufmann, 2003. 607 - 617. [9]Agrawal R, Srikant R. Fast algorithms for mining association rules in large databases [A] . Proceedings of the 20th International Conference on Very Large Data Bases [C]. San Francis2 co: Morgan Kaufmann, 1994. 487 - 499. [10]Fujiang Ao,Jing Du,Yuejin Yan,Baohong Liu, KediHuang An Efficient Algorithm for Mining Closed Frequent Itemsets in Data Streams IEEE 8th International Conference on Computer and Information Technology Workshops 2008 37-42 [11]Dataset available at http://fimi.cs.helsinki.fi/.

Suggest Documents