Online Mining of Maximal Frequent Itemsequences from Data Streams

Guojun Mao1,2, Xindong Wu2, Chunnian Liu1, Xingquan Zhu2, Gong Chen2, Yue Sun1, and Xu Liu1
1 School of Computer Science, Beijing University of Technology, Beijing 100022, P.R. China
2 Department of Computer Science, University of Vermont, Burlington VT 05405, U.S.A.
E-mail: [email protected]; [email protected]

Abstract

Mining data streams often requires real-time extraction of interesting patterns from dynamic and continuously growing data. This requirement poses the challenge of discovering and outputting currently useful patterns instantly, a task commonly referred to as online streaming data mining. In this paper, we present INSTANT, a novel algorithm that explores maximal frequent itemsequences from streaming data in an online fashion. We first provide useful operators on the lattice of itemsequential sets, and then apply them to the design of INSTANT. In comparison with popular methods such as closed-itemset-based mining algorithms, INSTANT rests on solid theoretical foundations that ensure its in-memory data structures are more compact than closed itemsequences. Experimental results show that our method achieves better results than previous related methods in terms of both time and space efficiency.
1. Introduction

Discovering frequent itemsets from transaction data streams is a typical problem that has received intensive study [1, 4, 7, 13]. Recent research efforts on mining frequent itemsets from large volumes of streaming data have centered on the development of in-memory data structures and the design of algorithms with effective time efficiency and space utilization [4, 11, 8].

Dong et al [11] argued that one of the keys to mining data streams is online mining of changes. Furthermore, online mining methods for data streams should output current patterns to users in a timely manner whenever such changes result in new patterns. This is because in many streaming-data-oriented applications, such as stock analysis and market prediction, users need to review current pattern changes as they happen. Thus, an online mining method is expected to respond to streaming data in real time, and the results generated by the mining algorithm should be instantly displayed to users. A fully online algorithm must be able to maintain the intermediate information obtained from scanning the data stream and, upon a user's request, quickly display all available results.

In comparison with traditional static data mining efforts, online mining of data streams can impose higher system resource requirements for maintaining historical information. However, these requirements must be controlled so that frequent patterns can be maintained efficiently in dynamic streaming environments. Therefore, attractive algorithms for mining large-volume data streams should remain relatively stable, scaling up slowly with increasing volumes of streaming data and varying user-specified parameters.

Due to the large volume and unpredictable speed of data streams, a shortage of system resources can occur at any time. Thus, online mining algorithms must provide good strategies to cope with system overload. Realizing load shedding while minimizing the degradation in accuracy is a challenging task in mining data streams.

This paper aims at online mining of frequent itemsequences from data streams. We present an efficient algorithm called INSTANT (maxImal frequeNt So-far iTemsequence mAiNTainer), which is based on a new mining theory provided in this paper. The paper also discusses the performance of the proposed algorithm from both theoretical and experimental perspectives.
1.1 Related Work

The problem of mining frequent itemsets in databases was first addressed by Agrawal et al [2], who introduced the Apriori property for frequent itemset mining: all nonempty sub-itemsets of a frequent itemset must be frequent. During the last decade, many efforts have been made in mining frequent itemsets, and two approaches have received intensive attention: closed itemsets [19] and FP-Tree patterns [14]. Pasquier et al [19] first addressed the problem of mining closed itemsets, and refined the Apriori principle into the property that all nonempty closed sub-itemsets of a frequent closed itemset must be frequent. Since [19], many excellent algorithms based on closed itemset mining have been proposed [20, 23]. Han et al [14] proposed the FP-Tree method, the first effort to mine frequent itemsets without candidate generation and with only two scans over the database. The compact in-memory data structure of the FP-Tree has since been widely adopted [15, 17].

Recently, discovering frequent itemsets has been successfully extended to data stream mining, which is more challenging than mining transaction databases. Manku et al [18] gave an algorithm called LOSSY COUNTING for mining all frequent itemsets over the entire history of the streaming data. This algorithm is based on the Apriori property, but it is a one-pass algorithm over data streams. Chi et al [9] proposed an algorithm called MOMENT that may be the first to mine closed patterns from data streams. It uses an in-memory data structure called CET to maintain the closed itemsets obtained by scanning streaming data. There are also some algorithms based on tree structures for discovering frequent itemsets from data streams. Giannella et al [12] presented a data structure called FP-Stream for maintaining information about frequent itemsets in data streams. By scanning the generated FP-Stream, the frequent patterns during an arbitrary time interval can be obtained. Another typical algorithm based on tree structures is DSM-FI, proposed by Li et al [17], which extends a prefix-tree-based compact pattern representation. In order to output frequent itemsets to users, DSM-FI executes a top-down frequent itemset discovery scheme over the maintained in-memory data structures.

Like our work in this paper, these methods all try to exploit well-designed in-memory data structures to find frequent itemsets from streaming data. Unlike our work, however, they employ a two-phase implementation: they first scan the data stream to produce interim in-memory data structures, and only then generate frequent patterns for users from those structures. We believe that the output of frequent patterns should itself be a dynamic streaming process: once an object becomes frequent at any time, it should be instantly output.

Theoretically, a data stream can grow continuously and infinitely over time, so selecting a current handling window is also a key problem in mining streaming data. Zhu et al [24] gave three windowing models for mining data streams: landmark windows, sliding windows, and damped windows. In a landmark window, algorithms cope with the data from a specific time point, called the landmark, to the present. Without additional data updating techniques, this model cannot handle continuous high-volume data streams well. The sliding window is a popular model in mining data streams. Updating data over a sliding window of fixed size is simple and trivial.
An old transaction is removed whenever a new transaction enters the sliding window. MOMENT [9] uses the sliding window technique to maintain the current CET. Another typical algorithm using sliding windows is FTP-DS, proposed by Teng [22]. In the damped window model, the weights of all transactions in a data stream are functions of their arrival time: the later a transaction arrives, the higher its weight. Chang et al [8] developed a weight function that decreases with age and designed an algorithm, estDec, for mining frequent itemsets in streaming data. FP-Stream [12] introduced an aging weight function and can mine frequent itemsets at multiple time granularities via a novel tilted-time windowing technique. Different window models have their own advantages and disadvantages. We believe, however, that damped windows provide a more flexible way to update data, as they can implement load shedding under different strategies, such as periodical, background, threshold-driven, and integrated shedding plans.

The continuous, high-speed arrival of data in streams can cause system overload. A good load shedding scheme is necessary for data-stream mining to decide when and how to discard aged data in memory. There are two basic strategies to handle system congestion: (a) prevention - the system actively estimates its workload based on the current input rate of the data stream and, before congestion occurs, performs load shedding to discard some data tuples in advance; and (b) post-treatment - when the system performance degrades seriously or the system stops working, a load shedding mechanism is invoked. In general, accurate prevention of stream jams suffers from costly computing expenses, whereas a pure post-treatment loses up-to-date responses to continuous data streams. Therefore, how to select the data to discard from memory is crucial in mining data streams. Similar to the aging weight function used in FP-Stream, Chang et al [7] developed an algorithm for maintaining frequent itemsets in streaming data, assuming each transaction has a weight related to its age. Recently, several references have discussed this issue and provided effective methods [5, 6, 10]. Babcock et al [5, 6] assumed that a set of Quality-of-Service (QoS) specifications is available, and designed a load shedding scheme according to the QoS specifications to decide when and how to discard data. Chi et al [10] proposed a load shedding scheme for classifying multiple data streams, and introduced a new metric called Quality of Decision (QoD) to measure the load status. In addition, some simple shedding mechanisms, like the aging weight function in FP-Stream, can achieve good efficiency for real-time or online systems.

Several references have discussed the problem of online mining of patterns from data streams [1, 12, 16, 21]. Asai et al [1] gave an online algorithm called StreamT that aims at mining patterns from streams of semi-structured data such as XML data. Keogh et al [16] and Palpanas [21] considered the problem of online mining of streaming time series and gave algorithms for solving it. Our data source format in this paper is very different from those of the above algorithms.
1.2 Our Contributions

Our focus in this paper is on dynamic information maintenance over continuous streaming data, and on instant output of current frequent itemsequences. Our contributions can be summarized as follows.

(1) We assume that there is a lexicographical order among all items in a data stream, so that all items in a transaction can be represented as an itemsequence and all transactions in a data stream can be modeled as an itemsequential set. Based on the algebraic lattice of itemsequential sets, we provide some new mining operators. These theoretical results are then applied to mining data streams.

(2) We present a new online algorithm, INSTANT, which has provable space and time efficiency.

(3) The in-memory data structures used in INSTANT incur less space overhead than closed itemsequences. Most importantly, INSTANT can directly display current frequent itemsequences as they are generated, without re-scanning any in-memory data structures to output frequent patterns. INSTANT therefore has a distinctly online mining character.

The rest of the paper is organized as follows. In Section 2, we present our problem statement and some theoretical results on the algebraic lattice of itemsequential sets. Section 3 describes our algorithm INSTANT and analyzes its theoretical performance properties. Experimental studies are provided in Section 4. Section 5 concludes the paper.
2. Operators on Itemsequential Sets

Before presenting our mining algorithm in Section 3, we introduce the relevant concepts and notation, along with a theoretical analysis, in this section.
2.1 Problem Statement

A popular formulation of the problem of mining transaction databases is via the term itemset. From this viewpoint, a transaction database is a series of tuples, each of which includes an itemset, and discovering frequent itemsets in the transaction database is considered a key phase in pattern mining. In this paper, we consider the term itemsequence rather than itemset. In short, an itemsequence is an ordered list of items.

Definition 1 (Itemsequence). An itemsequence is an ordered list of items, where the order of items is given by a specific criterion. Let α = a1a2…am and β = b1b2…bn be two itemsequences. We say that β contains α, denoted by α ⊆s β, if there exist integers 1 ≤ k1 < k2 < … < km ≤ n such that ai = bki (i = 1, 2, …, m). In this situation, we also call α a sub-itemsequence of β, or β a super-itemsequence of α.

Example 1. Consider the capital letters of the English alphabet as the items of interest, ordered alphabetically. Then ABC is an itemsequence, but ACB is not. Assuming β = ABCDEF, we have ABC ⊆s β, but ABG ⊈s β.

After introducing the term itemsequence, we can now formalize the problem tackled in this paper as follows. Let I = {i1, i2, …, im} be an item alphabet, called items, and DS = {t1, t2, …, tn, …} be a data stream in which every element represents a transaction. A transaction ti is modeled as an itemsequence on I (i = 1, 2, …), and is associated with a unique transaction identifier, TID, which increases over time. Given an arbitrary itemsequence, its support is the ratio of the number of transactions containing (⊆s) this itemsequence to the number of all transactions in DS.
Due to the potentially infinite nature of a data stream, it is not feasible to obtain the full support information of an itemsequence in DS. However, by analyzing what has happened in DS so far, we can obtain its current patterns.

Definition 2 (Support). Given an item alphabet I = {i1, i2, …, im} and a data stream DS = {t1, t2, …, tn, …}, the current support of an itemsequence t, denoted by Csup(t), is the ratio of the number of transactions that contain t as a sub-itemsequence to the number of all transactions that have occurred so far in DS. The global support, denoted by Gsup(t), is the ratio of the number of transactions that contain t as a sub-itemsequence to the number of all transactions in DS. Note that we sometimes use the term support count instead of support; the support count of an itemsequence simply means the number of times that this itemsequence occurs in a specific period.

Definition 3 (Maximal itemsequence). Given a set of itemsequences S, an itemsequence is a maximal itemsequence in S if it is not a sub-itemsequence of any other itemsequence in S.

Definition 4 (Maximal frequent itemsequence). Given I, DS, and a minimum support Msup, an itemsequence t is called a frequent itemsequence if Gsup(t) ≥ Msup, and a maximal frequent itemsequence if, in addition, it is not a sub-itemsequence of any other frequent itemsequence in DS. Likewise, an itemsequence t is called a frequent so-far itemsequence if Csup(t) ≥ Msup, and a maximal frequent so-far itemsequence if it is not a sub-itemsequence of any other frequent itemsequence discovered so far.

Our research objective in this paper is to develop a fast single-pass algorithm that finds maximal frequent so-far itemsequences and instantly outputs them as they are discovered.
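To make Definitions 1 and 2 concrete, the following is a minimal Python sketch of the containment test and the current support. The string encoding of itemsequences and the names contains() and csup() are our own illustrative choices, not part of the original paper.

```python
# A minimal sketch of Definitions 1-2. An itemsequence is encoded as a
# string of single-character items in alphabetical order.

def contains(alpha: str, beta: str) -> bool:
    """True iff beta contains alpha (alpha ⊆s beta): the items of
    alpha appear in beta in the same order."""
    it = iter(beta)
    return all(a in it for a in alpha)  # each 'a in it' consumes beta up to a

def csup(t: str, seen: list) -> float:
    """Current support of t over the transactions seen so far."""
    return sum(contains(t, tr) for tr in seen) / len(seen) if seen else 0.0

# Example 1 revisited:
assert contains("ABC", "ABCDEF") and not contains("ABG", "ABCDEF")
```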
2.2 Itemsequential Set Theory

In Section 2.1, we gave the definition of an itemsequence; in this subsection we extend it to the term itemsequential set. With this extension, we can create useful operators for discovering frequent itemsequences.

Definition 5 (Itemsequential set). An itemsequential set is a set of itemsequences on I. Let t be an itemsequence, and s1 and s2 be two itemsequential sets. Then:
(1) t sub-belongs to s1, denoted by t ∈sub s1, if there exists an itemsequence s in s1 such that t ⊆s s.
(2) t is an element of the sub-intersection set of s1 and s2, denoted by s1 ∩sub s2, if both t ∈sub s1 and t ∈sub s2.
(3) t is an element of the sub-union set of s1 and s2, denoted by s1 ∪sub s2, if either t ∈sub s1 or t ∈sub s2.
Example 2. Assume s1 = {AB, CD} and s2 = {ABC, AD}. Using the ordinary set operators, we have AB ∈ s1, AB ∉ s2, s1 ∩ s2 = Φ, and s1 ∪ s2 = {AB, CD, ABC, AD}. According to Definition 5, however, we can say AB ∈sub s1, AB ∈sub s2, s1 ∩sub s2 = {A, B, C, D, AB}, and s1 ∪sub s2 = {A, B, C, D, AB, CD, AC, BC, AD, ABC}.

Definition 6 (Maximal sub-operators). Let t be an itemsequence, and s1 and s2 be two itemsequential sets. Then:
(1) t is an element of the maximal sub-intersection set of s1 and s2, denoted by s1 ∩ms s2, if t is an element of s1 ∩sub s2 and is not contained by any other element of s1 ∩sub s2.
(2) t is an element of the maximal sub-union set of s1 and s2, denoted by s1 ∪ms s2, if t is an element of s1 ∪sub s2 and is not contained by any other element of s1 ∪sub s2.

Example 3. Assuming s1 = {AB, CD} and s2 = {ABC, AD}, then s1 ∩ms s2 = {AB, C, D} and s1 ∪ms s2 = {ABC, CD, AD}.

Property 1 (Idempotent law). s1 ∩ms s1 = s1; s1 ∪ms s1 = s1.
Property 2 (Commutative law). s1 ∩ms s2 = s2 ∩ms s1; s1 ∪ms s2 = s2 ∪ms s1.
Property 3 (Associative law). (s1 ∩ms s2) ∩ms s3 = s1 ∩ms (s2 ∩ms s3); (s1 ∪ms s2) ∪ms s3 = s1 ∪ms (s2 ∪ms s3).
Property 4 (Absorption law). s1 ∩ms (s1 ∪ms s2) = s1; s1 ∪ms (s1 ∩ms s2) = s1.
Property 5 (Distributive law). s1 ∩ms (s2 ∪ms s3) = (s1 ∩ms s2) ∪ms (s1 ∩ms s3); s1 ∪ms (s2 ∩ms s3) = (s1 ∪ms s2) ∩ms (s1 ∪ms s3).

These properties can be easily derived from Definitions 5 and 6.
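For illustration, here is a small Python sketch of the sub- and maximal sub-operators of Definitions 5 and 6, under the same string encoding as before. The enumeration of all sub-itemsequences is a naive choice meant only to make the operators executable; it is not the data structure used by INSTANT.

```python
# A naive sketch of Definitions 5-6: enumerate all sub-itemsequences,
# which is exponential and intended only to illustrate the operators.
from itertools import combinations

def contains(alpha, beta):
    it = iter(beta)
    return all(a in it for a in alpha)

def sub_belongs(t, s):                       # t ∈sub s
    return any(contains(t, x) for x in s)

def all_subseqs(s):
    """All nonempty sub-itemsequences of the elements of s."""
    return {"".join(c) for x in s
            for r in range(1, len(x) + 1)
            for c in combinations(x, r)}

def maximal(s):
    """Drop every itemsequence contained in another element of s."""
    return {t for t in s if not any(t != u and contains(t, u) for u in s)}

def ms_intersection(s1, s2):                 # s1 ∩ms s2
    both = {t for t in all_subseqs(s1 | s2)
            if sub_belongs(t, s1) and sub_belongs(t, s2)}
    return maximal(both)

def ms_union(s1, s2):                        # s1 ∪ms s2
    return maximal(all_subseqs(s1 | s2))

# Example 3 revisited:
s1, s2 = {"AB", "CD"}, {"ABC", "AD"}
assert ms_intersection(s1, s2) == {"AB", "C", "D"}
assert ms_union(s1, s2) == {"ABC", "CD", "AD"}
```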
3. Algorithm and Analysis

In this section, we first present our algorithm design and then provide a theoretical analysis of its performance properties.
3.1 Algorithm Design

In comparison with other types of data, streaming data is more difficult to deal with in pattern mining. On the one hand, a data stream grows dynamically, so its patterns should be formed incrementally. From this point of view, algorithms for mining data streams should have an online character, meaning that any so-far patterns can be provided to users as soon as they are found. On the other hand, a data stream is a collection of high-volume data, so efficient usage of main memory becomes the bottleneck in mining data streams. To break through this bottleneck, it is necessary to design a compact in-memory data structure. Therefore, redundant information should be stored in memory as little as possible, if at all, and active pruning measures must be taken. Also, since a data stream is theoretically infinite, it is necessary to shed aged or less important data from memory in time.

Based on the theoretical analysis in the previous section, we design the algorithm in a rather succinct way. Figure 1 gives its pseudocode. There are two main in-memory data structures used in our algorithm INSTANT: (1) K, an itemsequential set, stores the maximal frequent so-far itemsequences found by a given time; and (2) U, an array of itemsequential sets, where U[i] stores the maximal itemsequences that are infrequent with a support count of i at a given time.

Algorithm INSTANT
INPUT: (1) a continuous data stream DS; (2) a minimum support count δ; (3) the memory space available to the user ϕ.
OUTPUT: maximal frequent so-far itemsequences.
Main:
  Initialize(K, δ); Initialize(U, δ);
  REPEAT
    α = get an itemsequence from DS;
    IF (α ∉sub K)
      Fre_maker(K, α, U[δ-1]);
      Sup_maintainer(U, α, δ);
      IF (memory usage ≥ ϕ)
        Shedder(U, ϕ);
      ENDIF
    ENDIF
  UNTIL endof(DS)
Figure 1 Description of Algorithm INSTANT

When an itemsequence α arrives in memory from DS, INSTANT first tests whether α ∈sub K. If α ∈sub K, no action needs to be taken, because α or its super-itemsequences have already been stored and output as frequent patterns. If α ∉sub K, the following three procedures are called.

(1) Fre_maker(K, α, U[δ-1]): when α appears, it is possible that α or some of α's sub-itemsequences become frequent. Figure 2 presents this procedure. By executing S0 = {α} ∩ms U[δ-1], the procedure obtains the new frequent itemsequential set S0 related to α, and updates K with K = K ∪ms S0. Also, by calling Output(S0), the procedure displays frequent so-far itemsequences right after they are generated, which gives the algorithm its distinctly online mining character.

(2) Sup_maintainer(U, α, δ): when a new itemsequence α arrives from DS, new (sub-)itemsequences may be inserted into U, and the supports of existing elements of U may have to be updated. This procedure therefore maintains the changes that α brings about to U[1], U[2], …, U[δ-1] in a hierarchical way. As Figure 3 shows, Sup_maintainer() can be succinctly described using the sub-operators on itemsequential sets introduced in Section 2.
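The following is a minimal Python sketch of the main loop of Figure 1, assuming the ms-operator sketches from Section 2.2 (contains, sub_belongs, ms_intersection, ms_union). Here sup_maintainer and shedder are stubs (a sketch of sup_maintainer, following equation (1), is given after the proof of Lemma 1), and memory_usage is a hypothetical callback, since the paper does not fix a concrete memory-accounting scheme.

```python
# A minimal sketch of Figure 1, reusing the operator sketches from
# Section 2.2. sup_maintainer and shedder are stubs; memory_usage is
# a hypothetical callback supplied by the caller.

def instant(stream, delta, phi, memory_usage):
    K = set()                            # maximal frequent so-far itemsequences
    U = [set() for _ in range(delta)]    # U[i] (1 <= i <= delta-1): count i
    for alpha in stream:
        if sub_belongs(alpha, K):        # already covered by an output pattern
            continue
        # Fre_maker (Figure 2): parts of alpha that were one occurrence
        # short of delta become frequent now; output them instantly.
        S0 = ms_intersection({alpha}, U[delta - 1])
        if S0:
            K = ms_union(K, S0)          # K = K ∪ms S0
            print("new maximal frequent so-far:", sorted(S0))
        sup_maintainer(U, alpha, delta)  # Figure 3; sketched after Lemma 1
        if memory_usage() >= phi:
            shedder(U, phi)              # load shedding on overload
    return K

def sup_maintainer(U, alpha, delta):
    """Stub: update U[1..delta-1] per Figure 3; see the later sketch."""
    ...

def shedder(U, phi):
    """Stub: discard aged or low-support entries, e.g. from U[1] upward."""
    ...
```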
Procedure Fre_maker(K, α, U[δ-1])
  S0 = {α} ∩ms U[δ-1];
  IF (S0 ≠ Φ)
    K = K ∪ms S0;
    Output(S0);
  ENDIF;
Figure 2 Description of Fre_maker()

Procedure Sup_maintainer(U, α, δ)
  S1 = {α};
  FOR (i = 1; i < δ; i++)
    S2 = U[i] ∩ms S1;
    U[i] = (U[i] ∪ms S1) − S2;
    S1 = S2;
  ENDFOR;
Figure 3 Description of Sup_maintainer()

3.2 Theoretical Analysis

Lemma 1. For a data stream (α1, α2, …, αn, …), after processing an itemsequence αn (n = 1, 2, …), every itemsequence in U[i] has a support count of exactly i (i = 1, 2, …, δ-1).

Proof. We prove the lemma by induction on n. In the following, U_i^n denotes the content of U[i] after αn has been processed.

(a) For n = 1, after α1 is processed we have U_1^1 = {α1} and U_i^1 = Φ for i > 1, so Lemma 1 holds.

(b) For n = k > 1, given the induction hypothesis that Lemma 1 holds for all n < k, we prove that Lemma 1 also holds for n = k. According to Algorithm INSTANT, we have:

U_1^k = (U_1^{k-1} ∪ms {α_k}) − (U_1^{k-1} ∩ms {α_k}),

U_2^k = (U_2^{k-1} ∪ms (U_1^{k-1} ∩ms {α_k})) − (U_2^{k-1} ∩ms (U_1^{k-1} ∩ms {α_k})),
……,

U_i^k = (U_i^{k-1} ∪ms ((∩ms_{j=1}^{i-1} U_j^{k-1}) ∩ms {α_k})) − ((∩ms_{j=1}^{i} U_j^{k-1}) ∩ms {α_k}),   ……(1)

……,

U_{δ-1}^k = (U_{δ-1}^{k-1} ∪ms ((∩ms_{j=1}^{δ-2} U_j^{k-1}) ∩ms {α_k})) − ((∩ms_{j=1}^{δ-1} U_j^{k-1}) ∩ms {α_k}).
By (1) and the induction hypothesis, all itemsequences in U_i^1, U_i^2, …, U_i^{k-1} have a support count of i, so we can obtain:

∀t ∈ U_i^{k-1}, Csup(t) = i,   ……(2)

∀t ∈ (∩ms_{j=1}^{i-1} U_j^{k-1}), Csup(t) ≥ i−1,   ……(3)

∀t ∈ ((∩ms_{j=1}^{i-1} U_j^{k-1}) ∩ms {α_k}), Csup(t) ≥ i,   ……(4)

∀t ∈ ((∩ms_{j=1}^{i} U_j^{k-1}) ∩ms {α_k}), Csup(t) ≥ i+1.   ……(5)
By (2) and (4), we have
∀t ∈ (U_i^{k-1} ∪ms ((∩ms_{j=1}^{i-1} U_j^{k-1}) ∩ms {α_k})), Csup(t) ≥ i.   ……(6)

Applying (6) and (5) to (1), we get the following result: for every t ∈ U_i^k = (U_i^{k-1} ∪ms ((∩ms_{j=1}^{i-1} U_j^{k-1}) ∩ms {α_k})) − ((∩ms_{j=1}^{i} U_j^{k-1}) ∩ms {α_k}), Csup(t) satisfies Csup(t) ≥ i but not Csup(t) ≥ i+1. That is, all elements of U_i^k have a support count of exactly i. Thus Lemma 1 holds for n = k.

From (a) and (b), we conclude by induction that for a data stream (α1, α2, …, αn, …), after processing an itemsequence αn (n = 1, 2, …), INSTANT keeps every itemsequence in U[i] with an exact support count of i (i = 1, 2, …, δ-1). □

Lemma 2. At any time, Algorithm INSTANT satisfies: if i
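To make the update in equation (1) concrete, here is a small Python sketch of Sup_maintainer as reconstructed in Figure 3, reusing the ms-operator sketches from Section 2.2. Since the original procedure body is only partially legible in our source, this follows equation (1) directly and should be read as an illustration rather than the authors' exact code.

```python
# A sketch of Sup_maintainer per equation (1): the carry entering level i
# is (∩ms over j <= i-1 of U_j) ∩ms {alpha}; it is merged into U[i], and
# the part also supported by U[i] is promoted toward level i+1.

def sup_maintainer(U, alpha, delta):
    carry = {alpha}                               # empty intersection: {alpha}
    for i in range(1, delta):
        promoted = ms_intersection(U[i], carry)   # support count reaches i+1
        U[i] = ms_union(U[i], carry) - promoted   # equation (1)
        carry = promoted                          # feeds the next level up
        if not carry:                             # nothing left to promote
            break
```

The loop can break early once the carry empties, so each arriving itemsequence touches only those levels of U that actually share sub-itemsequences with it.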