Received August 12, 2014, accepted November 11, 2014, date of publication November 24, 2014, date of current version December 3, 2014. Digital Object Identifier 10.1109/ACCESS.2014.2373433
Efficiently Maintaining the Fast Updated Sequential Pattern Trees With Sequence Deletion JERRY CHUN-WEI LIN1 , (Member, IEEE), WENSHENG GAN1 , AND TZUNG-PEI HONG2,3 (Member, IEEE)
1 Innovative
Information Industry Research Center, School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China 2 Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung 81148, Taiwan 3 Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung 80424, Taiwan
Corresponding author: J. C.-W. Lin (
[email protected]) This work was supported in part by the Shenzhen Peacock Project, China, under Grant KQC201109020055A, in part by the Natural Scientific Research Innovation Foundation in Harbin Institute of Technology under Grant HIT.NSRIF.2014100, and in part by the Shenzhen Strategic Emerging Industries Program under Grant ZDSY20120613125016389.
ABSTRACT Among the discovered knowledge, sequential-pattern mining is used to discover the frequent subsequences from a sequence database. Most research handles the static database in batch mode to discover the desired sequential patterns. In the past, the fast updated (FUP) and Fast UPdated 2 (FUP2) concepts were adopted to, respectively, maintain and update the discovered sequential patterns with sequence insertion and sequence deletion based on the designed FUP sequential pattern (FUSP)-tree structure. Based on the FUP or FUP2 concepts, original customer sequences are required to be rescanned if it is necessary to maintain and update the unpromising (small) sequences from the original database. In the past, pre-large concept was designed to keep the prelarge itemsets as the buffer to avoid the database rescan each time whether transaction insertion or deletion in the dynamic databases. In this paper, the prelarge concept is adopted to handle the discovered sequential patterns with sequence deletion. An FUSP tree is first built to keep only the frequent 1-sequences from the original database. The prelarge 1-sequences are also kept in a set for later maintenance approach. When some sequences are deleted from the original database, the proposed algorithm is then performed to divide the kept frequent 1-sequences and prelarge 1-sequences from the original database and the mined 1-sequences from the deleted customer sequences into three parts with nine cases. Each case is then processed by the designed algorithm to maintain and update the built FUSP tree. When the number of deleted customer sequences is smaller than the safety bound of the prelarge concept, the original customer sequences are unnecessary to be rescanned, but the sequential patterns can still be actually maintained and updated. Experiments are conducted to show the performance of the proposed algorithm in terms of execution time and the number of tree nodes. INDEX TERMS Sequence deletion, pre-large concept, dynamic databases, FUSP-tree structure, data mining. I. INTRODUCTION
With the rapid growth of the Internet and data, mining useful and meaningful knowledge or information is a critical issue in recent years. Depending on specific applications [2], [7], [22], [26], the discovered knowledge can be classified as association-rule mining [3], [11], [13], classification [17], clustering [6], and sequential pattern mining [5], [24], [29], among others [20], [21]. Sequential-pattern mining (SPM) concerns the ordered sequence data such as DNA sequences, usage of Web log, Web-click streams, or the logs of network flow. It can also be used to predict the purchased 1374
behaviors of the customers in basket analysis. For example, a customer buys several socks in one transaction at shopping mall, he or she will buy some shoes in a later transaction. Agrawal et al. first designed AprioriAll algorithm [5] to level-wisely generate-and-test the candidates for deriving the sequential patterns (large sequences), which is extended from Apriori algorithm of association-rule mining [3]. Many algorithms have been proposed to discover the sequential patterns from the static database in batch mode [16], [24], [25], [27]–[29]. When customer sequences are changed whether sequence insertion or sequence deletion,
2169-3536 2014 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
VOLUME 2, 2014
J. C-W. Lin et al.: Efficiently Maintaining the FUSP Trees
the discovered sequential patterns may become invalid or some sequential patterns may arise. An intuitive way to handle the dynamic databases whether sequence insertion or sequence deletion is to re-scan the original database and re-mine the desired sequential patterns, which is not suitable in real-world applications. In the past, the Fast Updated (FUP) and FUP2 concepts [8], [9] were respectively proposed to maintain and update the frequent itemsets of association-rule mining with transaction insertion and transaction deletion. Lin et al. then extended the FUP concept and designed an incremental FASTUP algorithm [23] to maintain and update the discovered sequential patterns. The FASTUP algorithm divides the discovered sequential patterns from the original customer sequences and the all sequences in the inserted sequences into four cases. Each case is then performed by the designed procedure to maintain and update the discovered sequential patterns. Lin et al. designed a fast updated sequential pattern (FUSP)-tree structure and algorithms to respectively maintain and update the FUSP tree with sequence insertion and sequence deletion in dynamic databases [18], [19]. A FUSP tree is first built to keep the large sequences for later maintenance. Based on the FUSP tree with an index Header_Table, it is easier to maintain and update the built FUSP tree when sequences are inserted or deleted from the original customer sequences. Based on the FASTUP or FUSP-tree algorithm, original database is required to be rescanned if the small sequences are necessary to be maintained in the updated database. In the past, Hong et al. extended the pre-large concept [12] of association-rule mining to respectively maintain and update the discovered sequential patterns with sequence insertion [15] and sequence deletion [10]. Numerous candidates are still, however, required to be generated in a levelwise way based on the AprioriAll-like mechanism [5]. In this paper, a maintenance algorithm for sequence deletion based on the pre-large concept is proposed to efficiently maintain and update both the discovered large 1-sequences in the built FUSP and the pre-large 1-sequences for later mining process. The FUSP tree must build in advance before sequence deletion. The pre-large 1-sequences are also kept to speed up the updating process of the FUSP tree. When the sequences are deleted from the original database, the kept large and pre-large 1-sequences in the original database and all 1-sequences in the deleted sequences are then divided into three parts with nine cases. The 1-sequences of each case can thus be maintained and updated by the designed procedure. The original database is unnecessary to be rescanned until the cumulative number of deleted sequences achieves the designed safety bound based on the pre-large concept [10]. From the experimental results, the proposed algorithm has better performance than the state-of-art SPM algorithms in batch mode or the maintenance algorithm for sequence deletion based on the FUP2 concept. The remainder of this paper is organized as follows. Related work is reviewed in Section 2. The proposed algorithm is described in Section 3. An illustrative example is given in Section 4. VOLUME 2, 2014
Experimental results are given in Section 5. Conclusion is given in Section 6. II. REVIEW OF RELATED WORK
In this section, related works of sequential-pattern mining (SPM), FUSP-tree structure and pre-large concept are reviewed as follows. A. SEQUENTIAL-PATTERN MINING
In the past, Agrawal et al. proposed the AprioriAll algorithm [5] to generate-and-test candidates for mining the sequential patterns from a static database. Pei et al. designed the PrefixSpan algorithm to efficiently mine the sequential patterns based on the projection mechanism [27]. A sequence database is recursively projected into several smaller sets of projected database to speed up the computations for mining sequential patterns. Zaki et al. proposed a SPADE algorithm to fast mine the sequential patterns [30]. The SPADE algorithm utilizes the combinational properties based on the efficient lattice search techniques with join operations. Based on SPADE algorithm, the sequential patterns can be derived with three database scans. Many algorithms have been proposed to mine the sequential patterns, but most of them are performed to handle the static database. When the sequences are changed whether sequence insertion [8], [13] or deletion [9], [14] in the original database, the discovered sequential patterns may become invalid or new sequential patterns may arise. An intuitive way to update the sequential patterns is to re-process the updated database in batch mode, which is inefficient in real-world applications. To handle the dynamic database with sequence insertion, Lin et al. proposed a FASTUP algorithm [23] to incrementally maintain and update the discovered sequential patterns with sequence insertion. The original database is still, however, required to be rescanned if the discovered sequential pattern is large in the added sequences but small in the original database based on the FASTUP concept. Hong et al. then extended the pre-large concept [12] of association-rule mining to respectively maintain and update the discovered sequential patterns with sequence insertion [15] and sequence deletion [10] in a level-wise way, which requires more computations of multiple database rescans. Lin first designed a fast updated sequential pattern (FUSP)-tree and developed the algorithms for efficiently handling an incremental database with sequence insertion [18]. The FUSP tree is built in advance before the sequences are inserted into the original database. Two parts with four cases are then divided based on the FASTUP concept to maintain and update the FUSP tree for later mining process. Lin et al. also proposed the maintenance algorithm for sequence deletion [19]. The original database is, however, required to be rescanned if it is necessary to maintain a sequence which is small in the original database but large in the inserted sequences with sequence insertion or a sequence is small both in the original database and in the deleted sequences with sequence deletion. 1375
J. C-W. Lin et al.: Efficiently Maintaining the FUSP Trees
B. A FUSP-TREE STRUCTURE
The FUSP tree [18] is built to store only large 1-sequences in the original database. Each node in the FUSP tree consists of any large 1-sequences (item); its occurrence frequencies (count); the bi-directional links to connect its parent node (parent-link); a link to connect its offspring child nodes (child-links); and a link to connect the same 1-sequences in the FUSP tree (hyper-link). An index Header_Table is also built to trace the tree nodes in the FUSP tree for later mining process. It consists of large 1-sequences (item); its total occurrence frequencies (count); and a link of each 1-sequence (htable-link) in Header_Table. Based on the FUSP tree, the sequential patterns can be completely discovered without candidate generations in a level-wise way. The computations of multiple database rescans can thus be greatly reduced compared to the level-wisely approaches. An example in Table 1 is used to build the FUSP tree. TABLE 1. Original database.
Assume that the upper support threshold is set at 50%, and the lower support threshold is set at 30%. From the given customer sequences in Table 1, the large 1-sequences (A) and (B) are used to construct the FUSP tree and an index Header_Table. The pre-large 1-sequences are also kept in a set of Pre_Seqs = {C:4, G:4, E:3}. The built FUSP tree [18] is shown in Figure 1.
the corresponding nodes of the processed 1-sequence in the FUSP tree. After that, the FUSP tree is completely built. For the built FUSP tree in Figure 1, the link between nodes are marked as s if the relationship between two nodes is sequence relation; otherwise, it is marked as i which indicates the itemset relationship between two nodes. The Header_Table is used as an index table to find the corresponding nodes of the processed 1-sequence in the FUSP tree. C. PRE-LARGE CONCEPT
A pre-large sequence [15] is not truly large, but has highly probability to be large when the database is updated. The upper (Su ) and lower support (Sl ) thresholds are respectively set to discover large and pre-large sequences. When the support ratio of a sequence is larger than or equal to Su (the same as minimum support threshold in traditional data mining approach), it is concerned as the large sequence; otherwise, if the support ratio lies between the Su and Sl , it is concerned as the pre-large sequence. A sequence with a support ratio below the lower threshold is concerned as a small sequence. The pre-large sequences can be used as the buffer to reduce the movement of sequences directly from large to small and vice-versa in the maintenance process. When some sequences are deleted from the original database, two situations are then arisen with sequence deletion as: 1) All sequences of an old customer are completely deleted. This situation will also modify the number of customers in the original database. 2) Partial sequences from an original database are deleted. This situation will not change the number of customers in the original database. Considering the customer sequences in the original database and in the deleted parts, nine cases of the pre-large concept are arisen as:
FIGURE 1. Initial constructed FUSP tree with its Header_Table.
FIGURE 2. Nine cases arising from the original database and the deleted sequences.
First, the customer sequences in the original database is then refined to keep only the large 1-sequences. The itemsets of a sequence are then sorted in descending order by their occurrence frequencies. The FUSP tree is constructed tuple from tuple, from the first refined customer sequence to the last one. After refined sequences are processed, the Header_Table is also built and used as an index table to find
From Figure 2, cases 2, 3, 4, 7 and 8 will not affect the final sequential patterns when the database is updated. Some discovered sequential patterns may be removed in case 1, and some new sequential patterns may be arisen in cases 5, 6, and 9 in the updated database. When the large and pre-large sequences are already discovered, the cases 1, 5, and 6 can be easily handled. For the maintenance
1376
VOLUME 2, 2014
J. C-W. Lin et al.: Efficiently Maintaining the FUSP Trees
process in real-world applications, the ratio of the deleted sequences is normally very small compared to the original customer sequences. Based on the pre-large sequences, the sequence cannot possible be large for the updated database as long as the number of deleted sequences q is smaller than the safety bound as [10]: q≤f =
(Su − Sl ) × |D| , Su
where q is the number of deleted customers, |D| is the number of customer sequences from the original database, Su is the upper threshold, and Sl is the lower threshold. A summary of the nine cases and their merged results are shown in Table 2. TABLE 2. Summary of nine cases for sequence deletion.
Algorithm 1 Pre-Large FUSP-TREE-DEL Algorithm Input: D, Htable, Pre_Seqs, FUSP_tree, Su , Sl , c, and T. Output: Updated FUSP_tree and Pre_Seqs for updated database (U = D − T). (Su − Sl ) × |D| 1: calculate f = . Su 2: count number of q from T. 3: find all S T (s) of 1-sequences from T. 4: for each s ∈ S T (s) do 5: S U (s) := S D (s) − S T (s). 6: if s∈ Htable then 7: call Procedure_SubCases. 8: else if s∈ Pre_Seqs then 9: call Procedure_SubCases. 10: else 11: if S T (s) < Sl × |T | ∧ s ∈ / Htable nor s ∈ / Pre_Seqs then 12: put s in the set of Rescan_Seqs. 13: end if 14: end if 15: end for 16: //check original database is required to be rescanned 17: if c + q ≤ f or Rescan_Seqs := null then 18: c := c + q. 19: else 20: call Procedure_RescanCases; 21: D:= D − c − q; 22: c := 0. 23: end if
III. PROPOSED MAINTENANCE PRE-LARGE FUSP-TREE-DEL ALGORITHM
Before the maintenance approach for sequence deletion, a fast updated sequential pattern (FUSP)-tree [18] must built in advance to keep the large 1-sequences into the tree structure. A set is also created to keep the pre-large 1-sequences for speeding up the maintenance approach. When sequences are deleted from the original database, D T U q c Su Sl s SD (s) ST (s) SU (s) Pre_Seqs Reduced_Seqs
Rescan_Seqs
Branch_Seqs
VOLUME 2, 2014
Original customer sequences; A set of deleted customer sequences; Updated customer sequences, i.e., D âĂŞ T; Number of deleted customers from T; Number of last deleted customers; Upper support threshold for large 1-sequences; Lower support threshold for pre-large 1-sequences, 1-sequence; Occurrence frequencies of s in D; Occurrence frequencies of s in T; Occurrence frequencies of s in U; A set of pre-large 1-sequences; A set of 1-sequences for which the deleted sequences have to be reprocessed for updating the built FUSP tree; A set of 1-sequences for which the original database has to be rescanned to get the number of occurrence. A set of 1-sequences for which the original database has to be reprocessed for updating the built FUSP tree.
the proposed maintenance algorithm is then performed to divide the discovered information into three parts with nine cases. Each case is then processed by the designed procedure to maintain and update the built FUSP tree and pre-large 1-sequences. Based on the designed Pre-large FUSP-TREE-DEL algorithm (described in Algorithm 1), the database is unnecessary to be rescanned until the cumulative number of deleted sequences achieves the safety bound. Notations and the proposed algorithm are described below. The sub-procedure Procedure_SubCases (Described in Algorithm 2) is used to respectively maintain and update Htable, Pre_Seqs, and FUSP_tree for Cases 1 to 6. The 1sequences that satisfy the designed condition are then put in the set Reduced_Seqs for maintaining the built FUSP_tree when sequences are deleted from the original database. Details are given below. The sub-procedure Procedure_RescanCases (Described in Algorithm 3) is used to maintain and update the 1-sequences in case 9. The 1-sequences that satisfy the designed condition are then put in the set of Branch_Seqs. Details are given below. After the proposed maintenance algorithm, the final updated pre-large FUSP tree is constructed. The new customer sequences can be integrated into the original database. Based on the built pre-large FUSP tree, the desired large sequences for the updated database can be determined. 1377
J. C-W. Lin et al.: Efficiently Maintaining the FUSP Trees
Algorithm 2 Procedure_SubCases 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: 32: 33: 34: 35: 36: 37:
if
S U (s)/(|D − c − q|)
≥ Su then if s∈Htable then put s in the set of Reduced_Seqs. else if s∈Pre_Seqs then move s from Pre_Seqs to Htable; put s in the set of Branch_Seqs. else ignore. end if SD (s) := SU (s) in Htable. else if Su >SU (s)/(|D − c − q|) ≥Sl then if s∈Htable then remove s from Htable to Pre_Seqs. end if SD (s) := SU (s) in Pre_Seqs. else if s∈Htable then connect s.Parent.node to s.Child.nodes. remove s from FUSP_tree and Htable. s∈Pre_Seqs remove s from Pre_Seqs. end if end if //remove 1-sequences from T if Reduced_Seqs != null then for each t∈T do if 1-sequence J∈Reduced_Seqs then if J∈corr.branch in FUSP_tree then J.count−−. if J.count := 0 then connect J.Parent.node to J.Child.nodes; remove s from FUSP_tree and Htable. end if end if end if end for end if
IV. AN ILLUSTRATED EXAMPLE
An example is given to illustrate the proposed algorithm to maintain and update the built FUSP tree for later mining process. An original database was shown in Table 1, which consists of 10 customer sequences with nine purchased items. An index table is created to speed up the maintenance procedure for updating the FUSP tree and the rescanning procedure when it is necessary. The corresponding customer IDs (CIDs) of each 1-sequence are kept and shown in Table 3. For the given example, the Su and Sl are respectively set at 50% and 30%. Note that the Su and Sl can be defined by user preference. The built FUSP tree was shown in Figure 1. Suppose four sequences are deleted from the original database shown in Table 4. The proposed maintenance algorithm is then processed by the steps as follows. 1378
Algorithm 3 Procedure_RescanCases 1: for each s∈Rescan_Seqs do 2: rescan D to find SD (s); 3: SU (s) := SD (s) − ST (s). 4: if SU (s)/(|D − c − q|)≥Su then 5: put s with SU (s) in the end of Htable; 6: put s in the set of Branch_Seqs. 7: else if Su >SU (s)/(|D − c − q|) ≥Sl then 8: put s with SU (s) in the set of Pre_Seqs. 9: else 10: ignore. 11: end if 12: for each t∈D and J∈Branch_Seqs do 13: if J∈corr.branch / in FUSP_tree then 14: insert J at the end of its corr.branch.pos; 15: set J.count := 1. 16: else 17: J.count++. 18: end if 19: end for 20: end for TABLE 3. An index table.
TABLE 4. Four deleted sequences.
The variable c is initially set at 0 since no sequences were deleted last time. The formula of pre-large concept [10] to determine whether the original database is required to be rescanned is calculated as: (Su − Sl ) × |D| (0.5 − 0.3) × 10 f = = =4 Su 0.5 From the given example, when the cumulative number of deleted customers achieves 4, the original database is required to be rescanned. Otherwise, the proposed algorithm is then processed to maintain and update the FUSP tree and the pre-large 1-sequences in the set of Pre_Seqs for later mining process. From Table 4, it can be found that the number of deleted customers is 2 since the customers 4 and 10 are entirely deleted from the original database but the sequence (B) is partial deleted from customers 5 and 6. VOLUME 2, 2014
J. C-W. Lin et al.: Efficiently Maintaining the FUSP Trees
TABLE 5. All 1-sequences with their counts in the deleted sequences.
The variable q is thus set at 2. From Table 4, the 1-sequence with their count is calculated and shown in Table 5. The 1-sequences in Table 5 are then divided into three parts with nine cases according to whether they are large (appearing in Header_Table), pre-large (appearing in the set of Pre_Seqs) or small in the original database. The divided results are then shown in Table 6.
1-sequence (E) was 3 in the Pre_Seqs, and its count in the deleted sequences is 1. The count of 1-sequence (E) is updated as (3 − 1)(= 2). The support ratio of (E) is smaller than Sl as 2/(10 − 0 − 2)(= 0.25) < 0.3. The 1-sequence (E) becomes small after the database is updated. (E) is then removed from the set of Pre_Seqs. After that, the Branch_Seqs = {C}. The count of 1-sequence (G) was 4 in the Pre_Seqs, and its count in the deleted sequences is 1. The count of 1-sequence (G) is updated as (4 − 1)(= 3). The support ratio of (G) lies between Su and Sl as 0.5 > 3/(10 − 0 − 2)(= 0.375) > 0.3. The 1-sequence (G) remains pre-large 1-sequence after database is updated. The count of (G) in Pre_Seqs is updated as 3. After that, the set of Pre_Seqs = {G}. The FUSP tree is updated according to the deleted sequences with 1-sequences existing in the Reduced_Seqs. The corresponding branches for the deleted sequences with any 1-sequences in the set of Reduced_Seqs are show in Table 7. TABLE 7. All 1-sequences with their counts in the deleted sequences.
TABLE 6. All 1-sequences with their counts in the deleted sequences.
The 1-sequences in cases 1, 2, and 3 are firstly processed to maintain and update the built FUSP tree and the set of Pre_Seqs. In this example, the 1-sequences (A) and (B) satisfy the condition; the count of (A) in Header_Table is 6, and its count in the deleted sequences is 1. The 1-sequence (A) is updated as (6 − 1)(= 5). The support ratio of (A) is still larger than Su since 5/(10 − 0 − 2) (= 0.625) > 0.5. The 1-sequence (A) is still large after the database is updated. The count of (A) in the Header_Table is updated as 5. Beside, an item (A) is also put into the set of Reduced_Seqs. The count of 1-sequence (B) in the Header_Table is 5, its count in the deleted sequences is 3. The count of 1-sequence (B) is updated as (5 − 3)(= 2). The support ratio of (B) is smaller than Sl since 2/(10 − 0 − 2) (= 0.25) < 0.3. The 1-sequence (B) becomes small after database is updated. The FUSP tree is then updated by removing the corresponding nodes of (B). The pre-large 1-sequences in the deleted sequences are processed. In this example, the 1-sequences (C), (E), and (G) satisfy the condition and are processed. The count of 1-sequence (C) was 4 in the Pre_Seqs, and its count in the deleted sequences is 0. The count of 1-sequence (C) is updated as (4 − 0)(= 4). The support ratio of (C) is larger than Su since 4/(10 − 0 − 2)(= 0.5) ≤ 0.5. The 1-sequence (C) becomes large after database is updated. (C) is then removed from the set of Pre_Seqs and put it into the se of Branch_Seqs. Also, put (C) into the end of Header_Table. The count of VOLUME 2, 2014
The first corresponding branch in Table 7 is then processed to decrease the count of (A) by 1. Since the other corresponding branches in Table 7 are null, nothing has to be done for them. After that, the updated FUSP tree is shown in Figure 3.
FIGURE 3. Updated FUSP tree after STEP 7.
Since the 1-sequences (D) and (F) are neither large nor pre-large in the original customer sequences (not appearing in the Header_Table nor in the set of Pre_Seqs), and small in the deleted sequences, they are thus put in the set of Rescan_Seqs, which is used when rescanning the original database is required. After that, Rescan_Seqs = {D, F}. Since (c + q)(= 0 + 2) ≤ f (= 4), original database is not required to be rescanned. The 1-sequences in Branch_Seqs are then processed to find the corresponding branches from the original database. In this example, there is only (C) in the set of Branch_Seqs; the corresponding branches of (C) are then discovered and shown in Table 8. In this example, only four corresponding branches of (C) are discovered, which can be seen from Table 8. For (CID = 2), since (A) is already built in the FUSP tree, the node (C) is then created and directly link to the node (A) as its offspring node. The relationship between node (A) and (C) is set as s (sequence relation). Its count is set as 1. For (CID = 5), 1379
J. C-W. Lin et al.: Efficiently Maintaining the FUSP Trees
TABLE 8. Original database.
only one node (C) is found for this corresponding branch, the node (C) is created and directly link to the root of FUSP tree as its offspring node. Its count is set as 1. The (CID = 6) and (CID = 7) are processed in the same way as (CID = 2). After that, the finally updated FUSP tree is shown in Figure 4. The variable c is then updated as c (= c + q) (= 0 + 2) (= 2). The updated value of c will be used for processing next deleted procedure. Based on the FUSP tree in Figure 4, the large sequences can thus be discovered by the FP-growth-like approach [11].
FIGURE 4. Finally updated FUSP tree.
V. PERFORMANCE EVALUATION
In this section, experiments were conducted to evaluate the performance of the proposed algorithm compared to the SPADE [30], the PrefixSpan [27], the PREAPRIORIALL-DEL [10], the FUSP-TREE-BATCH [19] and the FUSP-TREE-DEL algorithm [19]. Experiments are implemented in the Java language and performed on a computer with an Intel Corei5 2.6 GHz processor and 4 GB of RAM running the 32-bit Microsoft Windows 7 operating system. The SPADE, PrefixSpan and the FUSP-TREE-BATCH are executed in batch mode, which indicates that when sequences are deleted from the original database, the updated database is required to be rescanned to mine the updated sequential patterns. For the maintenance mechanism of the FUSP-TREE-DEL and the proposed algorithms, only the deleted sequences are required to be maintained to update the built FUSP trees for mining the updated sequential patterns. In the conducted experiments, the batch-mode algorithms and the maintenance approaches are executed five times to precisely evaluate the performance. Thus, the customer sequences are updated five times to perform the batch-mode algorithms for discovering the updated sequential patterns. In the experiments, a deletion ratio (DR) was set to evaluate the performance of the proposed algorithm. The sequences 1380
for deletion were selected bottom-up from the original customer sequences. The number of sequences for deletion was set at (|D| × DR), and the partial deleted ratio as 10% of the number of deleted sequences. Thus, only the number of 10% ×(|D| × DR) sequences are partial deleted from the deleted sequences. For example, assume a database size is 1000, DR is set at 5%, the bottom-up (1000 ×5%) (= 50) sequences are selected for sequences deletion; the (10 × 50) (= 5) sequences within 50 deleted sequences are partly deleted and the other (50 − 5) (= 45) sequences are completely deleted from the 50 deleted sequences. The Su is set as the upper support threshold, which is the same as the minimum support threshold in traditional data mining. The Sl is set as the lower support threshold, which is generally smaller than Su . Fewer sequential patterns are generated when the Su is set higher while more redundant or non-interested sequential patterns are generated when the Su is set lower. It is thus reasonable to set different Su and Sl for different databases. A. DESCRIPTION OF DATABASES
Both three real-life databases such as BMSWebView-1 [31], retail [1], and kosarak [1] and one synthetic database T10I4D100K [4] were used to evaluate the performance of the proposed algorithm. For the BMSWebView-1 database, it contains several months of click-stream sequences from an e-commerce website. For the retail database, it contains about 5 months of receipts. For the kosarak database, it contains click-stream sequences from the logs of an online news portal. For the synthetic databases T10I4D100K, it was generated by IBM Quest Synthetic Data Generation. The parameters and the characteristics of four databases are respectively shown in Tables 9 and 10. TABLE 9. Parameters of used databases.
TABLE 10. Characteristics of used databases.
B. PERFORMANCE OF RUNTIMES
In this section, experiments were first made to compare the runtime between six algorithms in three databases under various minimum support thresholds. The results are shown in Figure 5. For the BMSWebview-1 database in Figure 5(a), the DR is set at 1%, Su value is respectively set from 0.3% to 0.5%, increments 0.05% each time, and Sl value is respectively set at Su value minus 0.01%. For the retail database in Figure 5(b), the DR is set at 5%, Su value is VOLUME 2, 2014
J. C-W. Lin et al.: Efficiently Maintaining the FUSP Trees
FIGURE 5. Experiments of runtime under various minimum support thresholds in three databases. (a) BMSWebView-1 (DR:1%). (b) Retail (DR:5%). (c) T10I4D100K (DR:1%).
FIGURE 6. Experiments of runtime under various deletion ratios in three databases. (a) BMSWebView-1 (Su :0.4%, Sl :0.3%). (b) Retail (Su :1%, Sl :0.9%). (c) T10I4D100K (Su :2%, Sl :1.95%).
respectively set from 0.06% to 0.14%, increments 0.02% each time, and Sl value is respectively set at Su value minus 0.1%. For the T10I4D100K database in Figure 5(c), the DR is set at 1%, Su value is respectively set from 0.12% to 0.20%, increments 0.02% each time, and Sl value is respectively set at Su value minus 0.05%. From Figures 5, it can be observed that the proposed algorithm has best performance compared to the other five algorithms in all three databases since the pre-large 1-sequences are kept as the safety bound to avoid the computations of multiple database rescans especially when the size of the original database is very large. Also, it can be observed that SPADE algorithm outperforms the PrefixSpan algorithm in BMSWebview-1 and T10I4D100K databases but has worse performance than the PrefixSpan in retail database. Since the SPADE and PrefixSpan are executed in batch mode based on generate-and-test approach, their results are worse than the tree-based FUSP-TREE-DEL algorithm. Although the PRE-APRIORIALL-DEL algorithm adopts the pre-large concept to handle the deleted customer sequences, it still uses the generate-and-test approach to maintain and update the discovered sequential patterns. From the observed results in Figure 5, the PRE-APRIORIALL-DEL algorithm requires more computations than the other algorithms in three databases. Since the FUSP-TREE-DEL algorithm and the proposed algorithm adopted the same tree structures for maintenance, it has nearly the same performance compared to proposed algorithm. The FUSP-TREE-DEL algorithm still requires, however, to rescan the original database while some sequences are very small both in the original database and in the deleted sequences. Experiments were conducted to compare the runtime between six algorithms in three databases under various deletion ratios. The results are shown in Figure 6. For the BMSWebview-1 database in Figure 6(a), the Su and Sl are respectively set at 0.4% and 0.3%, and DR is set from 0.5% to 2.5%, increments 0.5% each time. For the kosarak database in Figure 6(b), the Su and Sl are respectively set at 1.0% and 0.9%, and DR is set from 1% to 5%, increments 1% each time. For the retail database in Figure 6(c), the Su and Sl are respectively set at 2.0% and 1.95%, and DR is set from 0.5% to 2.5%, increments 0.5% each time. When the deletion ratio is increased, the runtime of the FUSP-TREE-DEL and the proposed algorithm is also increased but the batch-mode algorithms of SPADE, PrefixSpan, and the FUSP-TREE-BATCH remains steady. When the number of deleted sequences is increased, the
FUSP-TREE-DEL algorithm and the proposed algorithm require more computations in the maintenance procedure. The SPADE, PrefixSpan and FUSP-TREE-BATCH algorithms are, however, processed in batch mode to re-process the updated database each time while the sequences are deleted in the original database. The PRE-APRIORIALL-DEL algorithm adopts the generateand-test approach to find the sequential patterns, which requires more computations compared to the other algorithms. Two tree-based FUSP-TREE-DEL algorithm and the proposed algorithm have better results than the other approaches. The proposed algorithm still has better performance than the FUSP-TREE-DEL algorithm since the proposed algorithm is unnecessary to rescan the original database when the sequence is small both in the original database and in the deleted sequences but the FISP-TREE-DEL has to maintain the built FUSP tree with multiple database rescans. Based on the proposed algorithm, the original database is required to be rescanned until the number of cumulative deleted sequences achieves the safety bound.
VOLUME 2, 2014
C. PERFORMANCE OF TREE NODES
In this section, the number of tree nodes are thus compared to show the space complexity of the proposed algorithm compared to the FUSP-TREE-BATCH and the FUSP-TREE-DEL algorithms in three databases. The results under various minimum support thresholds and various deletion rations are respectively shown in Figures 7 and 8. From Figures 7 and 8, it can be observed that the proposed algorithm has nearly the same number of tree nodes compared to the FUSP-TREE-BATCH and FUSP-TREE-DEL algorithms. The reason is that the pre-large 1-sequences are kept in the set for avoiding the multiple database rescans. Only large 1-sequences are kept in the FUSP-TREE-BATCH,
FIGURE 7. Experiments of tree nodes under various minimum support thresholds in three databases. (a) BMSWebView-1 (DR:1%). (b) Retail (DR:5%). (c) T10I4D100K (DR:1%) . 1381
J. C-W. Lin et al.: Efficiently Maintaining the FUSP Trees
FIGURE 8. Experiments of tree nodes under various deletion ratios in three databases. (a) BMSWebView-1 (Su :0.4%, Sl :0.3%). (b) Retail (Su :1%, Sl :0.9%). (c) T10I4D100K (Su :2%, Sl :1.95%).
the FUSP-TREE-DEL and the proposed algorithm. Since the FUSP-TREE-DEL and the proposed algorithm used the sorted order in the Header_Table to sort the 1-sequences for tree construction, more tree nodes are thus generated compared to the FUSP-TREE-BATCH algorithm which can be observed from Figures 7 and 8. D. NUMBER OF PRE-LARGE 1-SEQUENCES
The number of pre-large 1-sequences of the proposed algorithm is evaluated under various support thresholds and various deletion ratios are then compared. The results are then show in Tables 11 and 12. TABLE 11. Number of pre-large 1-sequences under various minimum support thresholds.
FIGURE 9. Scalability performance in kosarak database. (a) Kosarak (DR:3%, Su :1%, Sl :0.95%). (b) Kosarak (DR:3%, Su :1%, Sl :0.95%).
From Figure 9(a), it can be observed that the runtime of all algorithms are linearly increased along with the number of database size. The proposed algorithm has still the best performance compared to other algorithms. Besides, the batchmode SPADE, PrefixSpan and the FUSP-TREE-BATCH algorithms are sharply increased but the FUSP-TREE-DEL and the proposed algorithm are steadily increased. From Figure 9(b), it can also be observed that the proposed algorithm has generated nearly the same number of tree nodes compared to the FUSP-TREE-BATCH and the FUSP-TREEDEL algorithms. From the conducted experiments, it can be observed that the proposed algorithm achieves the trade-off between time and space complexities. VI. CONCLUSION
TABLE 12. Number of pre-large 1-sequences under various deletion ratios.
From Tables 11 and 12, it can be observed that fewer pre-large 1-sequences are kept compared to the number of tree nodes in the FUSP-TREE structures. From the above results, it can be concluded that the proposed algorithm achieves good performance with slightly additional memory based on pre-large concept. E. PERFORMANCE OF SCALABILITY
In this section, the scalability of the six algorithms is evaluated. Experiments were conducted on real-life kosarak database. The sequences used for deletion were selected up-to-down from the original database. The DR is set at 3%, Su is set at 1% and Sl is set at 0.95%. The database sizes were respectively set 10k, 50k, 100k, 150k and 200k. The results of runtime and the number of tree nodes are shown in Figure 9. 1382
Most approaches are executed to process the static database in batch mode. Original database is required to be rescanned when the sequences are consequentially deleted from it. In this paper, the pre-large concept is adopted to reduce the multiple database rescans by keeping fewer pre-large 1-sequences. A FUSP-tree structure is also applied in the proposed algorithm to avoid the generate-and-test approach for generating the numerous candidates. When the sequences are deleted from the database, the proposed algorithm partitions the 1-sequences into three parts with nine cases according to whether they are large, pre-large or small in the original database and in the deleted sequences. Each part is then processed by the designed algorithm to easier update the tree structures. The Header_Table, and the built FUSP tree are correspondingly updated whenever necessary. Based on the proposed algorithm, the original database is unnecessary to be rescanned until the number of cumulative number of deleted sequences achieves the safety bound. From the experimental results, it can also be observed that the tree complexity of the proposed algorithm may not be optimal because the sorted order of 1-sequnces in the Header_Table may not be accurately kept. The proposed algorithm generates, however, nearly the same number of tree nodes compared to the batch-mode FUSP-TREE-BATCH and the FUSP-TREE-DEL algorithms. From the observed results, it can also be concluded that the proposed algorithm achieves the trade-off between time and space complexities. VOLUME 2, 2014
J. C-W. Lin et al.: Efficiently Maintaining the FUSP Trees
REFERENCES [1] (2012). Frequent Itemset Mining Dataset Repository. [Online]. Available: http://fimi.ua.ac.be/data/ [2] R. Agrawal, T. Imielinski, and A. Swami, ‘‘Database mining: A performance perspective,’’ IEEE Trans. Knowl. Data Eng., vol. 5, no. 6, pp. 914–925, Dec. 1993. [3] R. Agrawal and R. Srikant, ‘‘Fast algorithms for mining association rules in large databases,’’ in Proc. Int. Conf. Very Large Data Bases, 1994, pp. 487–499. [4] R. Agrawal and R. Srikant. (1994). Quest Synthetic Data Generator. [Online]. Available: http://www.Almaden.ibm.com/cs/quest/syndata.html [5] R. Agrawal and R. Srikant, ‘‘Mining sequential patterns,’’ in Proc. Int. Conf. Data Eng., 1995, pp. 3–14. [6] P. Berkhin, ‘‘A survey of clustering data mining techniques,’’ in Grouping Multidimensional Data. Berlin, Germany: Springer-Verlag, 2006, pp. 25–71. [7] M.-S. Chen, J. Han, and P. S. Yu, ‘‘Data mining: An overview from a database perspective,’’ IEEE Trans. Knowl. Data Eng., vol. 8, no. 6, pp. 866–883, Dec. 1996. [8] D. W. Cheung, J. Han, V. T. Ng, and C. Y. Wong, ‘‘Maintenance of discovered association rules in large databases: An incremental updating technique,’’ in Proc. 25th Int. Conf. Data Eng., Mar. 1996, pp. 106–114. [9] D. W.-L. Cheung, S. D. Lee, and B. Kao, ‘‘A general incremental technique for maintaining discovered association rules,’’ in Proc. Int. Conf. Database Syst. Adv. Appl., Apr. 1997, pp. 185–194. [10] C.-Y. Wang, T.-P. Hong, and S.-S. Tseng, ‘‘Maintenance of sequential patterns for record deletion,’’ in Proc. IEEE Int. Conf. Data Mining, Nov. 2001, pp. 536–541. [11] J. Han, J. Pei, Y. Yin, and R. Mao, ‘‘Mining frequent patterns without candidate generation: A frequent-pattern tree approach,’’ Data Mining Knowl. Discovery, vol. 8, no. 1, pp. 53–87, 2004. [12] T.-P. Hong, C.-Y. Wang, and Y.-H. Tao, ‘‘A new incremental data mining algorithm using pre-large itemsets,’’ Intell. Data Anal., vol. 5, no. 2, pp. 111–129, Apr. 2001. [13] T.-P. Hong, C.-W. Lin, and Y.-L. Wu, ‘‘Incrementally fast updated frequent pattern trees,’’ Expert Syst. Appl., vol. 34, no. 4, pp. 2424–2435, May 2008. [14] T.-P. Hong, C.-W. Lin, and Y.-L. Wu, ‘‘Maintenance of fast updated frequent pattern trees for record deletion,’’ Comput. Statist. Data Anal., vol. 53, no. 7, pp. 2485–2499, May 2009. [15] T.-P. Hong, C.-Y. Wang, and S.-S. Tseng, ‘‘An incremental mining algorithm for maintaining sequential patterns using pre-large sequences,’’ Expert Syst. Appl., vol. 38, no. 6, pp. 7051–7058, Jun. 2011. [16] C. Kim, J.-H. Lim, R. T. Ng, and K. Shim, ‘‘SQUIRE: Sequential pattern mining with quantities,’’ J. Syst. Softw., vol. 80, no. 10, pp. 1726–1745, Oct. 2007. [17] S. B. Kotsiantis, ‘‘Supervised machine learning: A review of classification techniques,’’ in Proc. Conf. Emerg. Artif. Intell. Appl. Comput. Eng., Real Word AI Syst. Appl. eHealth, HCI, Inf. Retr. Pervasive Technol., 2007, pp. 3–24. [18] C.-W. Lin, T.-P. Hong, W.-H. Lu, and W.-Y. Lin, ‘‘An incremental FUSP-tree maintenance algorithm,’’ in Proc. 8th Int. Conf. Intell. Syst. Design Appl., Nov. 2008, pp. 445–449. [19] C.-W. Lin, T.-P. Hong, and W.-H. Lu, ‘‘An efficient FUSP-tree update algorithm for deleted data in customer sequences,’’ in Proc. Int. Conf. Innovative Comput., Inf. Control, Dec. 2009, pp. 1491–1494. [20] C.-W. Lin, T.-P. Hong, and W.-H. Lu, ‘‘An effective tree structure for mining high utility itemsets,’’ Expert Syst. Appl., vol. 38, no. 6, pp. 7419–7424, Jun. 2011. [21] C.-W. Lin and T.-P. Hong, ‘‘A new mining approach for uncertain databases using CUFP trees,’’ Expert Syst. Appl., vol. 39, no. 4, pp. 4084–4093, Mar. 2012. [22] C.-W. Lin and T.-P. Hong, ‘‘A survey of fuzzy web mining,’’ Wiley Interdiscipl. Rev., Data Mining Knowl. Discovery, vol. 3, no. 3, pp. 190–199, 2013. [23] M.-Y. Lin and S.-Y. Lee, ‘‘Incremental update on sequential patterns in large databases,’’ in Proc. IEEE Int. Conf. Tools Artif. Intell., Nov. 1998, pp. 24–31. [24] C. H. Mooney and J. F. Roddick, ‘‘Sequential pattern mining—Approaches and algorithms,’’ ACM Comput. Surveys, vol. 45, no. 2, pp. 1–39, Feb. 2013. [25] F. Nakagaito, T. Ozaki, and T. Ohkawa, ‘‘Discovery of quantitative sequential patterns from event sequences,’’ in Proc. IEEE Int. Conf. Data Mining Workshops, Dec. 2009, pp. 31–36. VOLUME 2, 2014
[26] B. Nath, D. K. Bhattacharyya, and A. Ghosh, ‘‘Incremental association rule mining: A survey,’’ Wiley Interdiscipl. Rev., Data Mining Knowl. Discovery, vol. 3, no. 3, pp. 157–169, May/Jun. 2013. [27] J. Pei et al., ‘‘Mining sequential patterns by pattern-growth: The PrefixSpan approach,’’ IEEE Trans. Knowl. Data Eng., vol. 16, no. 11, pp. 1424–1440, Nov. 2004. [28] J.-M. Ren and J. R. Jang, ‘‘Discovering time-constrained sequential patterns for music genre classification,’’ IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 4, pp. 1134–1144, May 2012. [29] R. Srikant and R. Agrawal, ‘‘Mining sequential patterns: Generalizations and performance improvements,’’ in Proc. Int. Conf. Extending Database Technol., Adv. Database Technol., 1996, pp. 3–17. [30] M. J. Zaki, ‘‘SPADE: An efficient algorithm for mining frequent sequences,’’ Mach. Learn., vol. 42, nos. 1–2, pp. 31–60, Jan. 2001. [31] Z. Zheng, R. Kohavi, and L. Mason, ‘‘Real world performance of association rule algorithms,’’ in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2001, pp. 401–406.
JERRY CHUN-WEI LIN received the Ph.D. degree from the Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, in 2010. He is currently an Assistant Professor with the School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China. He has authored over 100 research papers in refereed journals and international conferences. He received the High-End Professionals Award in 2013 under the Peacock Plan of the Shenzhen Government (B-level). His interests include data mining, soft computing, privacy preserving data mining and security, social network, and cloud computing.
WENSHENG GAN is currently pursuing the master’s degree with the School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China. His research interests include data mining and cloud computing.
TZUNG-PEI HONG received the B.S. degree in chemical engineering from National Taiwan University, Taipei, Taiwan, in 1985, and the Ph.D. degree in computer science and information engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1992. From 1987 to 1994, he was with the Laboratory of Knowledge Engineering, National Chiao Tung University, where he was involved in applying techniques of parallel processing to artificial intelligence. He was an Associate Professor with the Department of Computer Science, Chung Hua University, Hsinchu, from 1992 to 1994, and the Department of Information Management, I-Shou University, Kaohsiung, Taiwan, from 1994 to 1999. He was a Professor with I-Shou University from 1999 to 2001. He was in charge of the whole computerization and library planning with the National University of Kaohsiung, Kaohsiung, from 1997 to 2000, where he also served as the first Director of the Library and Computer Center from 2000 to 2001, the Dean of Academic Affairs from 2003 to 2006, the Administrative Vice President from 2007 to 2008, and the Academic Vice President in 2010. He is currently a Distinguished Professor with the Department of Computer Science and Information Engineering and the Department of Electrical Engineering. His current research interests include knowledge engineering, soft computing, and granular computing.
1383