Frequent Patterns Mining over Data Stream Using an Efficient Tree Structure
M. Deypir1, M. H. Sadreddini2
1,2 Department of Computer Science and Engineering, Shiraz University, Shiraz, Fars, Iran
Abstract - Mining frequent patterns over data streams is an interesting problem due to its wide application area. In this study, a novel method for sliding window frequent pattern mining over data streams is proposed. The method uses a compressed, memory-efficient tree data structure to store and maintain the transactions of the sliding window. It dynamically restructures and compresses the tree to control memory usage. Moreover, when a user issues a mining request, the mining task is performed efficiently on the same data structure: the mining process reuses the tree to extract frequent patterns and does not build additional trees. Experimental evaluations on real datasets show that the proposed method outperforms recently proposed sliding window based algorithms.
Keywords: Data Stream Mining; Frequent Patterns; Sliding Window; Patricia Tree
1 Introduction
Apriori [1] is a well-known algorithm for solving the frequent itemset mining problem in static databases. An itemset (a set of items) is frequent in a database if the number of its occurrences in the database is at least a user-given threshold. Mining frequent patterns in data streams is more challenging than in static databases because of the dynamic nature of data streams and the unbounded amount of incoming data. A commonly used approach to handle and model data streams for frequent itemset mining is the sliding window model [2, 3, 4, 5]. In this paper, we propose a new data structure named PatDS (Patricia tree for Data Stream mining) for frequent pattern mining, which adapts a special type of prefix tree to data stream processing. Moreover, an adapted version of the FP-Growth [6] algorithm is proposed to extract frequent patterns from the most recent transactions stored in the PatDS when the user requests them. Our proposed method provides better results in terms of memory usage and run time than recently proposed algorithms. The remainder of the paper is organized as follows. The next section reviews recently proposed algorithms for frequent pattern mining over data streams. The problem statement is given in Section 3. In Section 4 the new method is introduced. The experimental results are presented in Section 5. Finally, Section 6 concludes the paper.
2 Related works
Han et al. introduced the FP-Growth (Frequent Pattern Growth) method [6] for mining frequent itemsets in static databases. FP-Growth is a depth-first, tree based search algorithm that mines patterns without candidate itemset generation. In this method, a data structure termed the FP-Tree is used for storing the frequency counts of the items belonging to the input database. In [7] a new tree based algorithm for frequent itemset mining called PatriciaMine was introduced. It uses a Patricia Tree [8] instead of the FP-Tree to store the frequent items of transactions. This data structure needs less memory than the FP-Tree because it can store more than one item in a node of the tree. Additionally, applying the pattern growth method on this tree leads to better run time because the tree is smaller.
The first algorithm for frequent pattern mining over data streams was proposed by Manku et al. [9], where the authors start with frequent item mining and then extend their idea to frequent itemset mining. Lin et al. [4] proposed a method for mining frequent patterns over a time-sensitive sliding window. In their method the window is divided into a number of batches for which itemset mining is performed separately. The estWin [2] algorithm finds recent frequent patterns adaptively over transactional data streams using the sliding window model. This algorithm requires a significant support threshold in addition to the minimum support threshold to adaptively maintain the approximate frequent patterns.
There are a number of algorithms in the literature in which the FP-Tree structure is adapted to store sliding window information. For the mining phase these methods use the same FP-Growth algorithm to extract frequent patterns. One of these methods is the DSTree [3]. In this algorithm, the transactions within a window are divided into a number of batches (or panes) and the information about every batch is maintained in a prefix tree. Nodes of the prefix tree represent items of the transactions, which are sorted in a canonical order. Transactions are inserted into and removed from the tree batch by batch. Frequent itemsets are mined using the FP-Growth method when the user requests them. Another recently proposed prefix tree based algorithm is the CPS-Tree [5], which is more efficient than the DSTree. This method is similar to the DSTree algorithm but reconstructs the prefix tree to reduce its memory requirements as the incoming data stream changes. In the CPS-Tree, nodes are sorted in descending order of their support. To control main memory usage, the tree is dynamically reconstructed to maintain the support descending order of nodes.
In this study, a new method to extract the set of all frequent itemsets over a sliding window is proposed. Our algorithm adapts the Patricia Tree for frequent itemset mining over a sliding window on transactional data streams.
3 Problem statement
Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items. Let S be a stream of transactions in sequential order, where each transaction is a subset of I. For an itemset X ⊆ I, a transaction T in S is said to contain X if X ⊆ T. The support of X is defined as the percentage of transactions in S that contain X. For a given support threshold s, X is frequent if the support of X is greater than or equal to s%, i.e., if at least s% of the transactions in S contain X. A transaction-sensitive sliding window over the data stream S contains the |W| most recent transactions in the stream, where |W| is the size of the window. The window slides forward by inserting a new transaction and deleting the oldest transaction from the window. For efficiency, the unit of insertion and deletion can be a pane (or batch) of transactions instead of a single transaction. The current transactional window over the data stream S is defined as $TW_{n-w+1} = \{T_{n-w+1}, T_{n-w+2}, \ldots, T_n\}$, where w = |W|, n-w+1 is the window's identifier and $T_i$ is the i-th transaction in S. In other words, the window contains the w most recent transactions of the data stream. An itemset X is said to be frequent in TW if Sup(X) ≥ s, where Sup(X) is the support of X in TW and s is the minimum support threshold. Thus, given a transactional sliding window TW and a minimum support threshold specified by the user, the problem is to find all the frequent itemsets that exist in the most recent TW.
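The definitions above can be illustrated with a small brute-force reference, which is a sketch for checking results on tiny windows and not the mining method proposed in this paper. The function names `support` and `frequent_itemsets` and the six-transaction stream are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def support(itemset, window):
    """Percentage of transactions in `window` that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in window if itemset <= t)
    return 100.0 * hits / len(window)

def frequent_itemsets(window, min_sup):
    """Brute-force reference: every itemset X with support(X) >= min_sup (in percent)."""
    items = sorted(set().union(*window))
    result = {}
    for k in range(1, len(items) + 1):
        found = False
        for cand in combinations(items, k):
            s = support(cand, window)
            if s >= min_sup:
                result[cand] = s
                found = True
        if not found:
            # Support is anti-monotone: with no frequent k-itemset,
            # no larger itemset can be frequent either.
            break
    return result

# Hypothetical stream T1..T6 (not the data of Figure 1) and window size w = 4.
stream = [frozenset(t) for t in ("abc", "ab", "bd", "abd", "ac", "bc")]
w = 4
current_window = stream[-w:]  # {T_{n-w+1}, ..., T_n} with n = 6

print(frequent_itemsets(current_window, min_sup=50))
```

With a 50% threshold on this window, the four single items and the pair (b, d) are reported, since each appears in at least two of the four transactions.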
Figure 1. A transactional data stream and sliding window model
Example 1. Figure 1 shows a data stream of transactions, where the left column shows the transaction id (Tid) of each incoming transaction and the right column shows the items within each transaction. Two consecutive transactional sliding windows W1 and W2 are defined over the transactions received so far. The pane size is 2 transactions and a window contains 2 panes. Given a minimum support threshold, the aim is to mine all frequent itemsets that exist within the most recent window after the user makes a mining request.
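The pane-wise sliding of Example 1 can be sketched as follows. This is a minimal illustration that keeps raw transactions in a queue of panes rather than in a tree; the class name `PaneWindow` and the transactions are hypothetical, and only the pane size (2) and the number of panes per window (2) are taken from the example.

```python
from collections import deque

class PaneWindow:
    """Sliding window that inserts and evicts whole panes (batches) of transactions."""

    def __init__(self, pane_size, panes_per_window):
        self.pane_size = pane_size
        # A bounded deque drops the oldest pane automatically when a new one arrives.
        self.panes = deque(maxlen=panes_per_window)
        self._buffer = []

    def add(self, transaction):
        """Buffer one transaction; return True when the window slid by one pane."""
        self._buffer.append(frozenset(transaction))
        if len(self._buffer) < self.pane_size:
            return False
        self.panes.append(self._buffer)
        self._buffer = []
        return True

    def transactions(self):
        """All transactions currently inside the window, oldest first."""
        return [t for pane in self.panes for t in pane]

win = PaneWindow(pane_size=2, panes_per_window=2)
for tid, items in enumerate(["abc", "ab", "bd", "abd", "ac", "bc"], start=1):
    if win.add(items):
        print(f"after T{tid}: window holds {len(win.transactions())} transactions")
```

After the sixth transaction the first pane (T1, T2) has been evicted and the window holds T3 to T6, which corresponds to the step from W1 to W2.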
4 The proposed method
As mentioned previously, the CPS-Tree is more memory efficient than the DSTree. This efficiency is due to its prefix tree, which is maintained in support descending order. Additionally, the CPS-Tree has better runtime due to its effective pane extraction mechanism and to applying FP-Growth to a support descending ordered prefix tree. However, this method suffers from a number of shortcomings which affect its runtime and memory usage during data stream processing and the mining phase. Since in a data stream of transactions we have to store all items (frequent or infrequent), the number of nodes in the CPS-Tree becomes prohibitively large. Moreover, in some paths of the CPS-Tree there exist a number of items with the same support value. Therefore, long paths may be formed which consume a significant amount of memory. Additionally, in the CPS-Tree the FP-Growth algorithm is used to find frequent patterns after a user submits a request. We believe that this type of mining is not suitable for data streams since it generates a large number of conditional trees during its processing. These conditional trees require memory in addition to the original tree. Moreover, constructing these trees is a time consuming task. To overcome these shortcomings we have utilized the Patricia tree [7] for sliding window frequent pattern mining over a data stream. Based on this data structure, we have developed a new data structure named PatDS for storing and updating sliding window transactions. Additionally, an efficient in-place mining method is used to extract frequent patterns from the PatDS. A Patricia tree can be considered a special type of prefix tree. In a Patricia tree the items in all branches are sorted according to a predefined order, e.g., lexicographical or support descending. In this structure, each sequence of items that appear consecutively in the same path with the same count resides in a single node.
Therefore, any node in a Patricia tree can store a list of items if they have the same support value and appear consecutively in the same branch of the tree. However, the Patricia tree allocates a separate node for an item if its adjacent nodes have different support values or if it has no adjacent node. The same support items in a path of a prefix tree are an ordered sequence of items having the same support value in that path. Such items can reside in a single node with a single support value without any loss of information. It is important to note that same support items must appear in the same path. A set of items could have the same support value without co-occurring in the same set of transactions, and thus would not form same support items in any specific path. A prefix tree which exploits same support items to compress its structure is called a Patricia tree. By using a smaller number of nodes, a Patricia tree exhibits lower space requirements, especially for sparse sets of transactions [7].
In this study, we have adapted the Patricia tree for space efficient storage of sliding window information and proposed a new data structure named PatDS. In our method, the structure of the PatDS is dynamically reconstructed using a previously proposed restructuring algorithm. Similar to the CPS-Tree, the PatDS is used to store sliding window transactions. However, the PatDS has a smaller memory requirement and thus better mining time. In data stream mining we have to maintain both frequent and infrequent items, since an infrequent item may become frequent in the near future, at which point its support value is required. Therefore, the number of items in the tree data structure becomes large. The larger the number of items, the higher the probability of same support items in every path of the tree. Therefore, the memory footprint of the PatDS can be far smaller than that of the CPS-Tree. We need to maintain the PatDS in support descending order and manage the same support items in all branches dynamically as the content of the window changes. A restructuring method is applied to the PatDS to obtain a support descending ordered tree. In [5], a restructuring method named BSM (Branch Sorting Method) is proposed. This method uses merge sort to sort every path of the prefix tree. In the BSM, unsorted paths are removed, sorted, and then reinserted into the tree structure. We use the BSM to restructure the PatDS so that the items in every path are in support descending order. After tree restructuring, a compression process is started to identify the same support items in each path and merge them into a single node.
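The compression step described above can be sketched on an ordinary prefix tree. The following is a minimal illustration, assuming a plain in-memory tree without the bookkeeping a real PatDS would need for pane removal; `Node`, `build_prefix_tree`, `compress`, and `count_nodes` are hypothetical names, and the four transactions are invented for the example.

```python
class Node:
    def __init__(self, items, count=0):
        self.items = list(items)  # one item in a plain prefix tree, several after merging
        self.count = count        # number of transactions passing through this node
        self.children = {}        # first item of a child -> child node

def build_prefix_tree(transactions, order):
    """Insert every transaction with its items sorted by `order` (item -> rank)."""
    root = Node([])
    for t in transactions:
        node = root
        for item in sorted(t, key=order.__getitem__):
            node = node.children.setdefault(item, Node([item]))
            node.count += 1
    return root

def compress(node):
    """Merge every chain of single-child nodes with equal counts into one node."""
    for child in list(node.children.values()):
        while len(child.children) == 1:
            (only,) = child.children.values()
            if only.count != child.count:
                break  # different support: the items must stay in separate nodes
            child.items.extend(only.items)
            child.children = only.children
        compress(child)
    return node

def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(c) for c in node.children.values())

# Hypothetical transactions; lexicographic item order.
transactions = ["abcd", "abcd", "abe", "fg"]
order = {item: rank for rank, item in enumerate("abcdefg")}
tree = build_prefix_tree(transactions, order)
before = count_nodes(tree)
compress(tree)
after = count_nodes(tree)
print(before, after)  # prints: 7 4
```

In this example the items a and b share count 3, c and d share count 2, and f and g share count 1, so three pairs of nodes collapse and the tree shrinks from 7 to 4 nodes. A merged node keeps a single count for all of its items, which is why no information is lost.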
Figure 2. PatDS before and after restructuring: a) PatDS before restructuring; b) PatDS after restructuring and compression
Example 2: Again consider the data stream shown in Figure 1. Suppose that the window size is set to 3 panes and the pane size to 2 transactions. Therefore, the sliding window contains all transactions. Suppose that processing starts with lexicographical order and that tree restructuring is performed after the first window. Figure 2.a shows the PatDS structure when a lexicographical order of items is used, and Figure 2.b shows the corresponding tree after restructuring and compression.
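The effect of moving from lexicographical to support descending order, as in Example 2, can be sketched as follows. This is a simplified stand-in for the BSM and not the method of [5]: it rebuilds every path and relies on Python's built-in sort instead of removing only the unsorted paths and merge-sorting them. The dictionary-based tree, the helper names, and the four transactions are assumptions made for illustration, since the data of Figure 1 is not reproduced in the text.

```python
from collections import Counter

def insert(root, path, count=1):
    """Add `count` occurrences of the item sequence `path` to the tree."""
    node = root
    for item in path:
        node = node["children"].setdefault(item, {"count": 0, "children": {}})
        node["count"] += count

def paths(node, prefix=()):
    """Yield (path, n) for every path at which exactly n transactions end."""
    for item, child in node["children"].items():
        p = prefix + (item,)
        ending = child["count"] - sum(c["count"] for c in child["children"].values())
        if ending > 0:
            yield p, ending
        yield from paths(child, p)

def restructure(root):
    """Rebuild the tree with every path sorted by descending item support."""
    all_paths = list(paths(root))
    sup = Counter()
    for p, n in all_paths:
        for item in p:
            sup[item] += n
    new_root = {"count": 0, "children": {}}
    for p, n in all_paths:
        # Ties are broken lexicographically so that the order is deterministic.
        insert(new_root, sorted(p, key=lambda i: (-sup[i], i)), n)
    return new_root

def size(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + size(c) for c in node["children"].values())

# Hypothetical window, first stored in lexicographical order.
lex = {"count": 0, "children": {}}
for t in ["ad", "bd", "cd", "d"]:
    insert(lex, sorted(t))
desc = restructure(lex)
print(size(lex), size(desc))  # prints: 7 4
```

Here item d occurs in all four transactions, so placing it first lets the four paths share one prefix node and the node count drops from 7 to 4. Compression of same support items would then be applied to the restructured tree.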
Therefore, the PatDS needs a smaller number of nodes than the CPS-Tree both before and after restructuring and compression, because same support items reside in a single node. However, restructuring and compressing the PatDS are time consuming processes and must be performed only when required. We invoke tree restructuring and compression when a significant change is observed in the order of items and in the distinct support values in the header table. Therefore, we test two criteria after every slide. A change in the order of items reflects the required change in the order of items in the paths of the PatDS. Moreover, the number of distinct values appearing consecutively in the header table, relative to the number of items, is a good approximation of the number of same support items in different paths. As the number of distinct values in the header table decreases, the number of same support items in different branches increases.
We use an adaptation of the FP-Growth algorithm for the mining phase. FP-Growth is a bottom-up approach since it extracts conditional pattern bases starting from the bottom of the header table. This method constructs a large number of conditional FP-Trees during the mining process in addition to the original FP-Tree, and these conditional FP-Trees require additional memory. On the other hand, Top-Down FP-Growth [10] for static databases processes the nodes at the upper levels first to extract conditional pattern bases. This method reuses the paths in the original FP-Tree to form conditional FP-Trees. Therefore, since it is an in-place approach, it does not spend memory on conditional trees and requires less memory during mining. The only additional memory is that of the header tables of the conditional trees, i.e., the conditional header tables. Since main memory is limited in data stream processing, we utilize this method for mining frequent patterns from the PatDS.
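The two criteria tested after every slide can be sketched as below. The text does not state how the criteria are measured or which thresholds are used, so the measures `order_change` and `distinct_ratio`, the thresholds `max_disorder` and `min_ratio_drop`, and the choice to trigger when either criterion fires are all assumptions made for illustration.

```python
def order_change(tree_order, supports):
    """Fraction of adjacent header entries that violate support descending order."""
    pairs = list(zip(tree_order, tree_order[1:]))
    if not pairs:
        return 0.0
    bad = sum(1 for a, b in pairs if supports[a] < supports[b])
    return bad / len(pairs)

def distinct_ratio(tree_order, supports):
    """Runs of equal support among consecutive header entries, per item."""
    if not tree_order:
        return 1.0
    runs = 1 + sum(1 for a, b in zip(tree_order, tree_order[1:])
                   if supports[a] != supports[b])
    return runs / len(tree_order)

def should_rebuild(tree_order, supports, last_ratio,
                   max_disorder=0.2, min_ratio_drop=0.1):
    """Hypothetical trigger: restructure and compress if either criterion fires."""
    disorder = order_change(tree_order, supports)
    ratio_drop = last_ratio - distinct_ratio(tree_order, supports)
    return disorder > max_disorder or ratio_drop > min_ratio_drop

# Hypothetical header table: the tree still uses the order a, b, c, d,
# but after sliding item d has become more frequent than item c.
supports = {"a": 5, "b": 5, "c": 2, "d": 4}
print(should_rebuild(["a", "b", "c", "d"], supports, last_ratio=0.75))
print(should_rebuild(["a", "b", "d", "c"], supports, last_ratio=0.75))
```

In the first call one of the three adjacent pairs is out of order, which exceeds the assumed disorder threshold, so a rebuild is requested. In the second call the order is already support descending and the ratio of distinct values is unchanged, so the costly restructuring is skipped.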
5 Experimental evaluations
We empirically evaluate the performance of the PatDS method. For a fair comparison, two similar algorithms that operate in the same model as the PatDS are selected. We have implemented the PatDS and the two recently proposed algorithms DSTree and CPS-Tree. All programs were written in C++ and executed in Windows XP on a 2.66 GHz CPU with 1 GB of memory. To show the applicability of our proposed method, two real-life datasets, BMS-POS and Kosarak, are used in the experiments. The first experiment compares the run time of the PatDS, CPS-Tree and DSTree. In this experiment, the average runtime over all active windows is computed for the three algorithms on both datasets. After every window, mining is performed on the current window. For BMS-POS, every window contains three panes and the pane size is 50K transactions. For the Kosarak dataset, the pane size and window size are set to 50K transactions and 4 panes respectively. Figure 3 shows the result of this experiment for different minimum support thresholds. In this figure, for both charts, the horizontal axis shows the minimum support values and the vertical axis shows the run time in seconds. As shown in Figure 3, the PatDS is more efficient than the DSTree and CPS-Tree algorithms on both datasets. As the minimum support decreases, the efficiency of the PatDS becomes more apparent and the performance gap becomes more significant. Figure 3 shows that the PatDS is faster than the other algorithms even for low support thresholds, where the number of generated frequent itemsets is high. The efficiency of the PatDS is due to its efficient storage of sliding window information using a smaller number of nodes and to in-place mining of frequent patterns, which does not need any additional tree creation time.
Figure 3. Runtime comparison of DSTree, CPS-Tree and PatDS: a) BMS-POS; b) Kosarak
The second experiment concerns memory usage. In this experiment, the memory usage of all algorithms is measured with respect to different window sizes on the datasets. Again, for both datasets the pane size is set to 50K. The window size for each dataset is varied by using different numbers of panes. The results are shown in Figure 4. In this figure, memory usage (in megabytes) is plotted against each window size for all the algorithms. As shown in Figure 4, the memory usage of the PatDS is significantly lower than that of the other methods. This is true for both datasets and all window sizes. As can be inferred from Figure 4, the memory saving with respect to the other algorithms remains almost constant across window sizes in both datasets. This is due to the more compact data structure of the PatDS. The PatDS has a far smaller number of nodes and thus has better memory usage. Therefore, the PatDS outperforms the CPS-Tree and DSTree in terms of memory usage.
Figure 4. Memory usage comparison of DSTree, CPS-Tree and PatDS: a) BMS-POS; b) Kosarak
6 Conclusions
In this paper, a new method named PatDS is proposed for frequent pattern mining in data streams. This method uses a new data structure which is an adaptation of the Patricia tree for data stream mining. The PatDS mines the complete set of frequent patterns in the most recent sliding window faster than recently proposed algorithms. Moreover, due to the use of a compact data structure and efficient maintenance of this structure, the proposed algorithm requires less memory than the other algorithms. Frequent patterns are extracted using an in-place mining algorithm which does not need additional tree structures during mining.
7 References
[1] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules”, Proc. VLDB Int. Conf. Very Large Databases, 1994, pp. 487–499.
[2] J. H. Chang and W. S. Lee, “estWin: Online data stream mining of recent frequent itemsets by sliding window method”, Journal of Information Science, vol. 31(2), 2005, pp. 76–90.
[3] C. K. S. Leung and Q. I. Khan, “DSTree: a tree structure for the mining of frequent sets from data streams”, Proc. ICDM, 2006, pp. 928–932.
[4] C. H. Lin, D. Y. Chiu, Y. H. Wu, and A. L. P. Chen, “Mining frequent itemsets from data streams with a time-sensitive sliding window”, Proc. SIAM Int. Conf. Data Mining, 2005.
[5] S. K. Tanbeer, C. F. Ahmed, B. S. Jeong, and Y. K. Lee, “Sliding window-based frequent pattern mining over data streams”, Information Sciences, vol. 179(22), 2009, pp. 3843–3865.
[6] J. Han, J. Pei, Y. Yin, and R. Mao, “Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach”, Data Mining and Knowledge Discovery, vol. 8(1), 2004, pp. 53–87.
[7] A. Pietracaprina and D. Zandolin, “Mining Frequent Itemsets Using Patricia Tries”, Proc. IEEE ICDM Workshop Frequent Itemset Mining Implementations, CEUR Workshop Proc., vol. 80, Nov. 2003.
[8] D. Knuth, “Sorting and Searching”, Reading, Mass.: Addison Wesley, 1973.
[9] G. S. Manku and R. Motwani, “Approximate frequency counts over data streams”, Proc. VLDB Int. Conf. Very Large Databases, 2002, pp. 346–357.
[10] K. Wang, L. Tang, J. Han, and J. Liu, “Top Down FP-Growth for Association Rule Mining”, Proc. 6th Pacific-Asia Conf. on Advances in Knowledge Discovery and Data Mining, Lecture Notes in Computer Science, vol. 2336, 2002, pp. 334–340.