Variable Support Mining of Frequent Itemsets over Data Streams Using Synopsis Vectors

Ming-Yen Lin1, Sue-Chen Hsueh2, and Sheng-Kun Hwang1

1 Department of Information Engineering and Computer Science, Feng-Chia University, Taiwan [email protected], m9305966@webmail.fcu.edu.tw
2 Department of Information Management, Chaoyang University of Technology, Taiwan [email protected]

Abstract. Mining frequent itemsets over data streams has emerged as a research topic in recent years. Previous approaches generally use a fixed support threshold to discover the patterns in the stream. In practice, however, the threshold changes to meet the needs of the users and the characteristics of the incoming data. In a non-streaming environment, changing the threshold implies re-mining all the transactions. The "look-once" nature of streaming data, however, means that discarded transactions are no longer available, so re-mining the stream is impossible. We therefore propose a method for variable support mining of frequent itemsets over a data stream. A synopsis vector is constructed to maintain statistics of past transactions and is invoked only when necessary. The experimental results show that our approach is efficient and scalable for variable support mining in data streams.

W.K. Ng, M. Kitsuregawa, and J. Li (Eds.): PAKDD 2006, LNAI 3918, pp. 724-728, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction

Many data-intensive applications now continuously generate unbounded sequences of data items at a high rate in real time. These transient data streams cannot be modeled as persistent relations, so traditional database management systems are inadequate for modeling this new class of data [2]. The unbounded nature of a data stream makes it impossible to hold the entire stream in memory, and retrieving past data incurs a high call-back cost even if that data can be stored on external media. An algorithm designed for stream processing is therefore generally restricted to scanning the data items only once. Consequently, stream mining algorithms can present only approximate rather than exact results, because some data items are inevitably discarded.

The discovery of frequent items and frequent itemsets has been studied extensively in the data mining community, with many algorithms proposed and implemented [1, 5, 9]. The ‘one-pass’ constraint, however, prevents the direct application of these algorithms to data streams. The mining of frequent items and itemsets in a data stream has been addressed recently. The algorithm in [10] uses Buffer-Trie-SetGen to mine frequent itemsets in a transactional data stream. The FP-stream algorithm [4] incrementally maintains tilted-time windows for frequent itemsets at multiple time granularities. The DSM-FI algorithm [7] uses an FP-tree-like [5] forest and estimated supports for the mining. In addition, the Moment algorithm [3] employs a ‘closed enumeration tree’ for fast discovery of closed frequent itemsets in a data stream.

Note that the above approaches for mining frequent itemsets over data streams accept only one minimum support, which cannot be changed during the mining. In reality, the minimum support is not a fixed value for the entire stream of transactions. The user may specify a threshold in the beginning, adjust it after evaluating the discovered result, or change it after some time, once large volumes of transactions have been received. The minimum support threshold should therefore be variable to suit the needs of the user. In contrast to frequent itemset mining with a fixed support, mining with respect to a changeable support is referred to as variable support mining. Although online association rule mining and interactive mining [8] may have changeable support thresholds, both are inapplicable to stream data because they require a scan of the entire set of transactions.

In this paper, we formulate the problem of variable support mining in a data stream and propose the VSMDS (Variable Support Mining of Data Streams) algorithm for efficient variable support mining of frequent itemsets in a stream of transactions. The VSMDS algorithm uses a compact structure (called PFI-tree) to maintain the set of potential frequent itemsets and update their support counts. A summary structure, called synopsis vector, is designed to approximate past transactions with a flexible distance threshold. The experiments conducted show that VSMDS is highly efficient and linearly scalable.

2 Problem Statement

Let Ψ = {α_1, α_2, …, α_r} be a set of literals, called items. A data stream DS = {t_1, t_2, …, t_c, …} is an infinite sequence of incoming transactions, where each transaction t_i is an itemset associated with a unique transaction identifier. Let t_c be the latest incoming transaction, called the current transaction. The current length of the data stream is the number of transactions seen so far. A transaction t_i contains an itemset e if e ⊆ t_i. The support of an itemset e, denoted by sup(e), is the number of transactions in DS containing e divided by the current length of DS.

The user specifies a minimum support threshold ms ∈ (0,1] at the beginning of the data stream. At any point in time, as transactions arrive, the user may change the minimum support threshold, so that the thresholds form a series of minimum supports. Let ms_c, called the current minimum support, be the minimum support in effect when t_c is seen. An itemset e is a frequent itemset if sup(e) ≥ ms_c. The objective is to discover all the frequent itemsets in the data stream with respect to the current minimum support. Since the specified minimum support is not a fixed value, such mining is called variable support mining over the data stream. In contrast, previous mining with only one unchangeable minimum support is called fixed support mining. The goal is to use the up-to-date minimum support ms_c and to consider all the transactions, including the discarded ones, in the discovery of frequent itemsets.
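The definitions above can be illustrated with a minimal Python sketch. It is not the VSMDS algorithm: it keeps every transaction in memory, which a real stream miner cannot do, and it enumerates candidates by brute force. The function names `support` and `frequent_itemsets` and the toy stream are assumptions made for illustration only.

```python
from itertools import combinations

def support(itemset, transactions):
    """sup(e): fraction of the transactions seen so far that contain e."""
    e = frozenset(itemset)
    return sum(1 for t in transactions if e <= t) / len(transactions)

def frequent_itemsets(transactions, ms_c):
    """Return {itemset: support} for every itemset with sup(e) >= ms_c.

    Brute-force enumeration over a fully retained stream (illustration only).
    """
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, k):
            s = support(cand, transactions)
            if s >= ms_c:
                result[frozenset(cand)] = s
                found = True
        if not found:
            # No frequent k-itemset exists, so no larger one can be frequent.
            break
    return result

# Hypothetical toy stream of five transactions.
stream = [frozenset(t) for t in (
    ["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"])]

high = frequent_itemsets(stream, 0.5)  # current minimum support 50%
low = frequent_itemsets(stream, 0.3)   # user lowers the threshold to 30%
```

Lowering the threshold from 0.5 to 0.3 makes {a, b, c} (support 2/5) frequent. The sketch can answer the second query only because it still holds all five transactions, which is exactly what a stream setting rules out.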

726

M.-Y. Lin, S.-C. Hsueh, and S.-K. Hwang

3 VSMDS: Variable Support Mining for Data Streams

We process the stream bucket by bucket, grouping |B| (called the bucket size) incoming transactions into a bucket. A potential frequent itemset tree (called PFI-tree) is designed to maintain the set of potential frequent itemsets. To provide the user with an up-to-date result reflecting a newly specified minimum support, the proposed algorithm compresses the discarded transactions into a summary structure called the synopsis vector (abbreviated as SYV). Consequently, we may use the SYV to update the PFI-tree with respect to the current minimum support. We use an idea similar to Proximus [6] for compressing the transactions but carry out a structure update for more accurate results.

The series of minimum supports specified by the user is collectively referred to as the support sequence (ms_1, ms_2, …, ms_λ), where ms_i is the minimum support in effect when bucket B_i is seen. In the following, PFI_i is the PFI-tree and SYV_i is the SYV on seeing bucket B_i. Additionally, msPFI denotes the minimum support threshold used in the PFI-tree.

Fig. 1 depicts the overall concept of the proposed VSMDS algorithm. On seeing a new bucket B_i, VSMDS updates PFI_{i-1} and compresses B_i with SYV_{i-1} into SYV_i. PFI_i is used to output the desired patterns to the user. SYV_{i-1} is used to build PFI_i only when PFI_i cannot provide the up-to-date results, that is, when ms_i < msPFI. During the process, PFI_{i-1} keeps all the itemsets having supports of at least msPFI, considering buckets up to B_{i-1}. If ms_i ≥ msPFI, the user is querying frequent itemsets that have higher supports. These itemsets can be located in PFI_{i-1}, and VSMDS replies to the user without the participation of the SYV. If ms_i < msPFI, those itemsets having supports greater than or equal to ms_i but smaller than msPFI, and thus excluded from PFI_{i-1}, become frequent. Hence, VSMDS uses SYV_{i-1} to build PFI_{i-1} for the mining of these itemsets at this moment. VSMDS utilizes the lexicographic property of consecutive item-comparisons [9] in the PFI-tree for fast mining and updating of potential frequent itemsets.

The SYV is a list of (delegate, cardinality) pairs. The delegate represents a group of approximated itemsets, and the cardinality indicates the number of occurrences of the delegate. A delegate dg is said to approximate an itemset e if the distance between them (e.g., the number of items in which dg and e differ) is no more than a certain user-defined distance threshold.
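To make the (delegate, cardinality) representation concrete, the following Python sketch folds a bucket of transactions into a synopsis vector. It is a simplified illustration under stated assumptions, not the paper's procedure: the distance is taken to be the size of the symmetric difference, each transaction is absorbed by the first delegate within the distance threshold dh, and a delegate is never revised after creation (the paper mentions a structure update that is not modeled here). The names `distance`, `compress`, and `estimated_count` are hypothetical.

```python
def distance(dg, e):
    """Number of items in which dg and e differ (symmetric difference size)."""
    return len(dg ^ e)

def compress(bucket, syv, dh):
    """Fold a bucket of transactions into syv, a list of [delegate, cardinality].

    Assumption: greedy first-fit. A transaction within distance dh of an
    existing delegate only increments that delegate's cardinality; otherwise
    the transaction itself becomes a new delegate.
    """
    for t in bucket:
        t = frozenset(t)
        for pair in syv:
            if distance(pair[0], t) <= dh:
                pair[1] += 1
                break
        else:
            syv.append([t, 1])
    return syv

def estimated_count(itemset, syv):
    """Approximate occurrence count of an itemset from the synopsis alone."""
    e = frozenset(itemset)
    return sum(card for dg, card in syv if e <= dg)

# Hypothetical bucket of four transactions and a distance threshold of 2.
bucket = [{"a", "b", "c"}, {"a", "b", "d"}, {"x", "y"}, {"a", "b", "c"}]
syv = compress(bucket, [], dh=2)
```

Here {a, b, d} is absorbed by the delegate {a, b, c}, so the synopsis holds two pairs instead of four transactions, and item d is no longer recoverable from it. That loss is the price of compression and is why results derived from the SYV are approximate.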

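[Figure 1 is not reproduced. Panel (a): the data stream is grouped into buckets B_1, B_2, B_3, … of |B| transactions each (t_1, t_2, …, t_|B|; t_|B|+1, t_|B|+2, …, t_2*|B|; …), with specified minimum supports ms_1, ms_2, ms_3, …. Panel (b): on seeing bucket B_i, PFI_{i-1} is updated to become PFI_i, B_i is compressed with SYV_{i-1} into SYV_i, and SYV_{i-1} is used to build the PFI-tree when ms_i < msPFI. Panel (c): given ms_i, each frequent itemset e and its sup(e) are retrieved from PFI_i. Legend: PFI_i: potential frequent itemsets on seeing bucket B_i; SYV_i: synopsis vector on seeing B_i.]

Fig. 1. Overall concept of the VSMDS algorithm: (a) bucketed transactions (b) update and compress operations on seeing a bucket B_i (c) retrieving the frequent itemsets from PFI_i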
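The decision of when the synopsis vector is consulted can be sketched in a few lines of Python. This is a hedged illustration of the control flow only: the PFI-tree is replaced by a plain dictionary from itemset to support, the rebuild step is passed in as a callable, and lowering msPFI to the new threshold after a rebuild is an assumption of this sketch rather than something the text states. `MinerState`, `answer_query`, `toy_rebuild`, and the toy support table are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class MinerState:
    ms_pfi: float                            # threshold the stand-in PFI was built with
    pfi: dict = field(default_factory=dict)  # itemset -> support (stand-in for the PFI-tree)
    rebuilds: int = 0                        # how often the synopsis vector was consulted

def answer_query(state, ms_i, rebuild_from_syv):
    """Return the itemsets with support >= ms_i.

    The synopsis vector is consulted only when ms_i < msPFI; otherwise the
    answer is read directly from the existing structure.
    """
    if ms_i < state.ms_pfi:
        state.pfi = rebuild_from_syv(ms_i)
        state.ms_pfi = ms_i  # assumption: msPFI is lowered after a rebuild
        state.rebuilds += 1
    return {e: s for e, s in state.pfi.items() if s >= ms_i}

# Hypothetical supports (in percent) standing in for what a rebuild would find.
TOY_SUPPORTS = {frozenset("a"): 2.0, frozenset("b"): 1.3, frozenset("c"): 1.15}

def toy_rebuild(ms):
    return {e: s for e, s in TOY_SUPPORTS.items() if s >= ms}

# The ten-threshold support sequence (in percent) reported with Fig. 2.
sequence = [1.5, 1.7, 2, 1.4, 1.2, 1.3, 1.8, 1.9, 1.1, 1.6]
state = MinerState(ms_pfi=sequence[0], pfi=toy_rebuild(sequence[0]))
answers = [answer_query(state, ms, toy_rebuild) for ms in sequence]
```

In this toy run the synopsis is consulted three times, at 1.4%, 1.2%, and 1.1%, because only those thresholds fall below the msPFI in effect at that moment. Every other query is answered from the existing structure, which matches the "invoked only when necessary" behavior described above. The count is a property of this sketch's policy, not a result reported in the paper.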

[Fig. 2 plots are not reproduced. Dataset T10.I5.D1000k, dh = 10, ms random in 1.1%~2%. Both panels plot execution time per bucket (sec.) against the number of incoming transactions (100k to 1000k); panel (b) separates compress time and update time. Support sequence = (1.5, 1.7, 2, 1.4, 1.2, 1.3, 1.8, 1.9, 1.1, 1.6)%. Total running time for 10 buckets = 7.437 seconds.]

Fig. 2. (a) Mining the data stream with a support sequence of random thresholds (b) the breakdown of the processing time

[Fig. 3 plots are not reproduced. Fixed ms = 0.7%; datasets T10.I5.D1000k and T15.I5.D1000k. Panel (a) plots total execution time (sec.) and panel (b) plots memory usage (MB), both against bucket size (10k, 20k, 40k, 50k, 100k).]

Fig. 3. (a) Effect on various bucket size (b) working memory size

[Fig. 4 plots are not reproduced. Plot titles: ms = 0.7%, dh = 15. Panel (a) plots compressed ratio (%) against distance threshold (10 to 15) for T10.I5.D1000k; panel (b) plots total execution time (sec.) against the number of incoming transactions (1000k to 10000k) for T10.I4.]

Fig. 4. (a) varying distance threshold (b) scalability evaluation: 1000k to 10000k

4 Experimental Results

We have conducted extensive experiments to evaluate the algorithm. The experiments were performed on an AMD Sempron 2400+ PC with 1GB of memory, running Windows XP, using datasets generated from [1]. Owing to space limitations, we only report the results on dataset T10I5D1000k. The distance threshold is 10 and |B|=10.

Fig. 2(a) shows the performance of the VSMDS algorithm with respect to a support sequence of random values ranging from 1.1% to 2%; the breakdown of the execution time is shown in Fig. 2(b). The performance with respect to various bucket sizes is shown in Fig. 3(a), and the working memory sizes for the experiment are depicted in Fig. 3(b). Let the compression ratio be the size of the synopsis vector divided by that of the original transactions. Fig. 4(a) confirms that a distance threshold of 15 compresses more than 50% of the transactions in size. Fig. 4(b) indicates that the VSMDS algorithm scales up linearly with respect to the dataset size (from 1000k to 10000k).
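The compression ratio defined above can be computed as in the short Python sketch below. The text does not say how "size" is measured, so the sketch assumes one unit per stored item plus one unit per cardinality field; both that size measure and the name `compression_ratio` are assumptions for illustration.

```python
def compression_ratio(syv, transactions):
    """Size of the synopsis vector divided by the size of the original transactions.

    Assumed size measure: one unit per stored item, plus one unit for each
    delegate's cardinality field.
    """
    syv_size = sum(len(delegate) + 1 for delegate, _ in syv)
    raw_size = sum(len(t) for t in transactions)
    return syv_size / raw_size

# Hypothetical data: four transactions summarized by two delegates.
transactions = [{"a", "b", "c"}, {"a", "b", "d"}, {"x", "y"}, {"a", "b", "c"}]
syv = [(frozenset({"a", "b", "c"}), 3), (frozenset({"x", "y"}), 1)]
ratio = compression_ratio(syv, transactions)  # 7 / 11
```

Under this measure a smaller ratio means a more compact synopsis. A larger distance threshold lets each delegate absorb more transactions, which lowers the ratio at the cost of a coarser approximation.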

5 Conclusion

In this paper, we propose the VSMDS algorithm for mining frequent itemsets over a data stream with a changeable support threshold. VSMDS utilizes the PFI-tree and the synopsis vector for the mining. The experiments confirm that VSMDS efficiently mines frequent patterns with respect to variable supports and scales linearly.

References

1. Agrawal, R. and Srikant, R.: Fast Algorithms for Mining Association Rules. In Proc. of the 20th International Conference on Very Large Databases (VLDB’94), pages 487-499, 1994.
2. Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom, J.: Models and Issues in Data Stream Systems. In Proc. of the 2002 ACM Symposium on Principles of Database Systems (PODS 2002), ACM Press, 2002.
3. Chi, Y. and Wang, H.: Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window. In Proc. of the Fourth IEEE International Conference on Data Mining (ICDM'04), pages 59-66, Brighton, United Kingdom, 01-04 November 2004.
4. Giannella, C., Han, J., Pei, J., Yan, X., and Yu, P. S.: Mining Frequent Patterns in Data Streams at Multiple Time Granularities. In Proc. of the NSF Workshop on Next Generation Data Mining, 2002.
5. Han, J., Pei, J., and Yin, Y.: Mining Frequent Patterns without Candidate Generation. In Proc. of the 2000 ACM SIGMOD International Conference on Management of Data, Vol. 9, Issue 2, pages 1-12, 2000.
6. Koyuturk, M., Grama, A., and Ramakrishnan, N.: Compression, Clustering and Pattern Discovery in Very High Dimensional Discrete-Attribute Datasets. IEEE Transactions on Knowledge and Data Engineering, Vol. 17, no. 5, pages 447-461, 2005.
7. Li, H. F., Lee, S. Y., and Shan, M. K.: An Efficient Algorithm for Mining Frequent Itemsets over the Entire History of Data Streams. In Proc. of the First International Workshop on Knowledge Discovery in Data Streams, pages 20-24, Pisa, Italy, September 2004.
8. Lin, M. Y. and Lee, S. Y.: Interactive Sequence Discovery by Incremental Mining. Information Sciences: An International Journal, Vol. 165, Issue 3-4, pages 187-205, 2004.
9. Lin, M. Y. and Lee, S. Y.: A Fast Lexicographic Algorithm for Association Rule Mining in Web Applications. In Proc. of the ICDCS Workshop on Knowledge Discovery and Data Mining in the World-Wide Web, pages F7-F14, Taipei, Taiwan, R.O.C., 2000.
10. Manku, G. S. and Motwani, R.: Approximate Frequency Counts over Data Streams. In Proc. of the 28th VLDB Conference, pages 346-357, Hong Kong, China, August 2002.