BIDE-Based Parallel Mining of Frequent Closed Sequences with MapReduce

Dongjin Yu¹, Wei Wu², Suhang Zheng¹, and Zhixiang Zhu¹

¹ School of Computer, Hangzhou Dianzi University, Hangzhou, China
[email protected]
² Zhejiang Provincial Key Laboratory of Network Technology and Information Security, Hangzhou, China
[email protected]
Abstract. Parallel processing is essential for mining frequent closed sequences from massive volumes of data in a timely manner. At the same time, MapReduce is an ideal software framework for supporting distributed computing over large data sets on clusters of computers. In this paper, we develop a parallel implementation of the BIDE algorithm on MapReduce, called BIDE-MR. It iteratively assigns the tasks of closure checking and pruning to different nodes in a cluster. After one round of map-combine-partition-reduce, the closed frequent sequences of the round-specific length and the candidates for the next round of computation are generated. Since the candidates and their pseudo projected databases are independent of each other, BIDE-MR achieves high speed-ups. We implement BIDE-MR on an Apache Hadoop cluster and use it to mine the vehicles that frequently appear together from massive records collected at different monitoring sites. The results show that BIDE-MR attains good parallelization.

Keywords: frequent closed sequences, parallel algorithms, BIDE, MapReduce.
1 Introduction
Sequential pattern mining tries to find relationships between occurrences of sequential events, or in other words, to find whether there exist any frequently occurring patterns related to time or to other sequences. Since many business transactions, telecommunication records and weather data are time-sequence data, the discovery of sequential patterns is an essential data mining task with broad applications, such as targeted marketing, customer retention and weather forecasting. Among the several variations of sequential patterns, the closed sequential pattern is the most useful one, since it retains all the information of the complete pattern set yet is often much more compact. Some well-known algorithms, such as BIDE [1-2], CloSpan [3] and CMP-Miner [4], have been proposed for mining closed sequential patterns. BIDE adopts a closure checking scheme, called BI-Directional Extension, which mines closed sequential patterns without candidate maintenance. CloSpan follows a candidate maintenance-and-test paradigm to prune the search space and check whether a newly found candidate sequential pattern is likely to be closed. CMP-Miner mines closed patterns in a time-series database where each record contains multiple time-series sequences.
With advances in data collection and storage technologies, large data sources have become ubiquitous. Today, organizations routinely collect terabytes of data on a daily basis with the intent of gleaning non-trivial insights into their business processes. To benefit from these advances, it is imperative that sequence mining techniques scale to such proportions. Such scaling can be achieved through the design of new and faster algorithms and/or through the employment of parallelism. However, achieving such scaling is not straightforward, and only a handful of research efforts in the data mining community have attempted to address these scales.

Fortunately, the past few years have witnessed the emergence of several platforms for the implementation and deployment of large-scale analytics. MapReduce, which has been popularized by Google, is a scalable and fault-tolerant data processing model that enables a massive volume of data to be processed in parallel on many low-end computing nodes [5]. The MapReduce model consists of two primitive functions: Map and Reduce. During the Map step, the master node takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes. Each worker node processes its smaller problem and passes the answer back to the master node. During the subsequent Reduce step, the master node collects the answers to all the sub-problems and combines them in some way to form the output. Users define the Map() and Reduce() functions however they want, while the MapReduce framework takes care of distributing the work and collecting the results.

This paper presents a parallel implementation of BIDE on the MapReduce framework, called BIDE-MR. It iteratively distributes the tasks of closure checking and pruning to different nodes in a cluster. After one round of map-combine-partition-reduce, the closed frequent sequences of the round-specific length and the candidates to be checked in the next round of computation are generated. Since the candidates and their pseudo projected databases are independent of each other, BIDE-MR achieves high speed-ups. To the best of our knowledge, previous work on parallel closed sequential pattern mining has mainly focused on multi-core computer architectures [6-7] or MPI [8]. There is no parallel algorithm that targets the MapReduce framework.

The rest of the paper is organized as follows. In Section 2, we discuss the related works. The problem is defined in Section 3, and the traditional serial BIDE algorithm is given in Section 4. In Section 5, we describe the parallel implementation of BIDE on MapReduce in detail. The results from a real case are then presented in Section 6. Finally, the last section concludes the paper.
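To make the two primitives concrete, the following minimal sketch (ours, not part of the paper; the class names are illustrative and the standard Hadoop Java API is assumed) counts item occurrences: the mapper emits a partial count of 1 for every item in its input split, and the reducer sums the partial counts of each item.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: one input split per worker; emits (item, 1) for every item it sees.
public class ItemCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text item = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            item.set(token);
            context.write(item, ONE);          // partial count on this worker
        }
    }
}

// Reducer: receives all partial counts of one item and sums them.
class ItemCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text item, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(item, new IntWritable(sum));   // global count of the item
    }
}

BIDE-MR follows the same division of labor, except that the keys are candidate sequences rather than single items and further work (support filtering, pruning and closure checking) is attached to the reduce side, as described in Section 5.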
2 Related Works
Sequential pattern mining, since its introduction in [9], has become an essential data mining task with broad applications, including market and customer analysis, web log analysis and pattern discovery in protein sequences. Several efficient sequential pattern mining algorithms have been proposed in the literature, such as CloSpan [3], BIDE [2] and SeqStream [10]. Many studies present convincing arguments that one should not mine all frequent patterns but only the closed ones, because the latter leads not only to a more compact yet complete result set but also to better efficiency [8-9, 11-12]. In a dynamic sequence database environment, sequences (or items) are often added to and deleted from the database, and thus mining closed frequent itemsets over streaming data becomes more difficult.
For such online mining of closed frequent itemsets, one of the most important issues is how to maintain the set of closed frequent itemsets. In [13], Chang et al. presented a compact structure, CSTree, to keep closed sequential patterns, while in [14], Li, Ho and Lee proposed a one-pass algorithm, NewMoment, to maintain the set of closed frequent itemsets in data streams with a transaction-sensitive sliding window. Episode mining, which discovers events that often occur in the vicinity of each other, is another well-studied field of sequential pattern discovery. Such approaches include mining closed episodes with simultaneous events [15] and with minimal and non-overlapping occurrences [16].

Parallel frequent pattern discovery algorithms exploit parallel and distributed computing resources to relieve the sequential bottlenecks of serial frequent pattern mining algorithms. Although there have been numerous studies on sequential-pattern mining, the study of parallel sequential-pattern mining is still limited and is largely confined to mining the complete set of sequential patterns, as in pSPADE [17]. Many partition-based approaches in distributed databases are mainly employed for preserving individual confidentiality rather than for efficient mining [18-20]. Guralnik and Karypis presented in [21] several parallel sequential-pattern mining approaches for a distributed-memory system that mine the complete set of sequential patterns via the tree-projection-based sequential algorithm. However, to the best of our knowledge, there are only a few parallel algorithms that target closed sequential-pattern mining. In [8], Cong, Han and Padua developed an algorithm, called Par-CSP, to conduct parallel mining of closed sequential patterns on a distributed memory system. Par-CSP partitions the work among the processors by exploiting the divide-and-conquer property so that the overhead of inter-processor communication is minimized. Moreover, it applies dynamic scheduling and selective sampling to avoid processor idling and load imbalance. In [22], Luo and Chung proposed a parallel algorithm, named PMSPX, which mines closed frequent sequences by using multiple samples to exclude infrequent candidates. In PMSPX, asynchronous local closed frequent sequence mining on each processing node, followed by a synchronous global mining phase, minimizes the synchronization and communication among the processing nodes.
3 Problem Definition
A sequence is an ordered list of events, denoted as $\langle e_1, e_2, \ldots, e_n \rangle$, or simply $e_1 e_2 \ldots e_n$, where $e_j$ is an event, or an item, i.e., $e_j \in I = \{i_1, i_2, \ldots, i_m\}$, for $1 \le j \le n$. The number of events, or instances of items, in a sequence is called the length of the sequence, and a sequence of length $l$ is also called an $l$-sequence. A sequence $S_a = a_1 a_2 \ldots a_n$ is contained in another sequence $S_b = b_1 b_2 \ldots b_m$, or in other words, $S_a$ is a sub-sequence of $S_b$ and $S_b$ a super-sequence of $S_a$, if there exist integers $1 \le j_1 < j_2 < \cdots < j_n \le m$ such that $a_1 = b_{j_1}, a_2 = b_{j_2}, \ldots, a_n = b_{j_n}$. An input sequence database $SDB$ is a set of tuples $(sid, S)$, where $sid$ is a sequence identifier and $S$ an input sequence. The support of a sequence $S_a$ in a sequence database $SDB$ is the number of tuples of $SDB$ that contain $S_a$, denoted as $sup(S_a)$.
Given a support threshold $min\_sup$, a sequence $S_a$ is a frequent sequence in $SDB$ if $sup(S_a) \ge min\_sup$. If a sequence $S_a$ is frequent and there exists no super-sequence of $S_a$ with the same or a bigger support, $S_a$ is called a frequent closed sequence. Given an input sequence $S$ that contains a prefix sequence $e_1 e_2 \ldots e_i$, the remaining part of $S$ after the first instance of the prefix $e_1 e_2 \ldots e_i$ is called the projected sequence of the prefix $e_1 e_2 \ldots e_i$ in $S$. Given an input sequence database $SDB$, the complete set of projected sequences of a prefix $e_1 e_2 \ldots e_i$ in $SDB$ is called the projected database of that prefix in $SDB$. Instead of physically constructing the projected database, a pseudo projected database only keeps a set of pointers, one per projected sequence, pointing at the starting position of the corresponding projected sequence. The problem can then be defined as follows: given an input sequence database $SDB$ and a support threshold $min\_sup$, find in $SDB$ all frequent closed sequences whose support is equal to or bigger than $min\_sup$.
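To make these definitions concrete, the following small Java sketch (ours, not part of the paper; sequences are simplified to lists of integer item identifiers) implements sub-sequence containment, support counting, and the pointer that a pseudo projected database would keep for a prefix.

import java.util.List;

// A minimal sketch of the definitions above: each event is an int item,
// a sequence is a list of items, and a database is a list of sequences.
final class SequenceOps {

    // True if 'sub' is contained in 'seq', i.e. there exist indices
    // j1 < j2 < ... < jn of 'seq' matching the items of 'sub' in order.
    static boolean contains(List<Integer> seq, List<Integer> sub) {
        int j = 0;
        for (int item : seq) {
            if (j < sub.size() && item == sub.get(j)) j++;
        }
        return j == sub.size();
    }

    // Support of 'sub' in the database: the number of input sequences
    // that contain it.
    static int support(List<List<Integer>> sdb, List<Integer> sub) {
        int count = 0;
        for (List<Integer> seq : sdb) {
            if (contains(seq, sub)) count++;
        }
        return count;
    }

    // The pointer a pseudo projected database keeps for prefix 'sub' in
    // 'seq': the start position of the remainder after the first
    // occurrence of the prefix, or -1 if the prefix does not occur.
    static int projectionStart(List<Integer> seq, List<Integer> sub) {
        int j = 0;
        for (int i = 0; i < seq.size(); i++) {
            if (j < sub.size() && seq.get(i).intValue() == sub.get(j)) j++;
            if (j == sub.size()) return i + 1;
        }
        return -1;
    }
}

For example, with $SDB = \{\langle 1\,2\,3\rangle, \langle 2\,1\,3\,2\rangle, \langle 1\,3\rangle\}$, support(sdb, ⟨1 3⟩) returns 3, so ⟨1 3⟩ is frequent for any $min\_sup \le 3$.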
4 BIDE: BI-directional Extension Based Frequent Closed Sequence Mining
According to the definition of a frequent closed sequence, if an $n$-sequence $S = e_1 e_2 \ldots e_n$ is non-closed, there must exist at least one event $e'$ which can be used to extend $S$ to a new sequence $S'$ with the same support. The sequence $S$ can be extended in two ways:

1) $S' = e_1 e_2 \ldots e_n e'$ and $sup(S') = sup(S)$, where $e'$ is a forward-extension event (or item) and $S'$ a forward-extension sequence of $S$.

2) $S' = e' e_1 e_2 \ldots e_n$, or $S' = e_1 \ldots e_i e' e_{i+1} \ldots e_n$ for some $1 \le i < n$, and $sup(S') = sup(S)$, where $e'$ is a backward-extension event (or item) and $S'$ a backward-extension sequence of $S$.

The BIDE algorithm is illustrated as follows; the details can be found in [2].

1) Scan the sequence database once to find the frequent 1-sequences and treat each frequent 1-sequence as a prefix, while all prefixes form the prefix set $P$.
2) If there exists no prefix in the prefix set $P$, terminate the procedure.
3) For each prefix $p$ in the prefix set $P$, in depth-first order, build its pseudo projected database and compute its backward-extension items if $p$ cannot be pruned; otherwise delete $p$ from the prefix set $P$ and go to 2).
4) Scan the pseudo projected database of prefix $p$ to find its locally frequent items, i.e., the forward-extension items.
5) If there is neither a backward-extension item nor a forward-extension item, output $p$ as a frequent closed sequence.
6) Grow the prefix set $P$ by appending $p$ with its locally frequent items, if any exist, and go to 2).
BIDE adopts a strict depth-first search order and can output the frequent closed patterns in an online fashion. It avoids the curse of the candidate maintenance-and-test paradigm and does not need to maintain the set of historic closed patterns. In addition, it prunes the search space more deeply and checks pattern closure more efficiently than some other closed pattern mining algorithms such as CloSpan.
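As an illustration of the closure test alone, the sketch below is a deliberately brute-force version of the bi-directional extension check (it is ours and does not use BIDE's optimized pruning of the search space): a sequence $S$ is non-closed exactly when inserting some single event at some position of $S$, at the end for a forward extension or anywhere else for a backward extension, yields a super-sequence with the same support. It reuses the SequenceOps helper from the Section 3 sketch.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Brute-force bi-directional closure test: S is non-closed iff inserting
// some event e' at any position of S yields a super-sequence with the
// same support.
final class ClosureCheck {

    static boolean isClosed(List<List<Integer>> sdb, List<Integer> s) {
        int sup = SequenceOps.support(sdb, s);
        // Candidate extension events: every item appearing in the database.
        Set<Integer> items = new HashSet<>();
        for (List<Integer> seq : sdb) items.addAll(seq);

        for (int pos = 0; pos <= s.size(); pos++) {   // pos == s.size(): forward extension
            for (int e : items) {                     // pos <  s.size(): backward extension
                List<Integer> ext = new ArrayList<>(s);
                ext.add(pos, e);
                if (SequenceOps.support(sdb, ext) == sup) {
                    return false;                     // an extension keeps the support
                }
            }
        }
        return true;
    }
}

Checking single-event insertions is sufficient because support is anti-monotone: if any super-sequence of $S$ has the same support as $S$, then some one-event extension of $S$ does as well.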
5 The BIDE-MR Algorithm
To make sequential pattern mining practical for large datasets, the mining process must be efficient, scalable, and have a short response time. Since the projected databases of the frequent prefix sequences are independent of each other in BIDE, a parallel implementation is convenient and can lead to a great improvement in performance. The following presents a BIDE-based algorithm, called BIDE-MR, which conducts parallel mining of closed sequential patterns on Apache Hadoop.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model [23]. As the free and open-source implementation of MapReduce, Hadoop MapReduce allows for distributed processing of large data sets on compute clusters. It exploits a master/slave architecture, implemented by the Apache Hadoop Distributed File System, or HDFS. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes in HDFS, usually one per node in the cluster, which manage the storage attached to the nodes that they run on.

While constructing and traversing the sequence tree, BIDE-MR distributes the tasks associated with each node on the same level of the tree, such as backward extension, forward extension and pruning, to one DataNode in HDFS. If extension events exist, the newly extended prefix sequences, as the children of the current node, are then assigned to another DataNode in HDFS for further extension and pruning. In this way, BIDE-MR identifies the closed sequences of a certain length in just one loop, i.e., one job in Hadoop MapReduce terms.

BIDE-MR exploits Hadoop MapReduce as the software framework for distributed processing of large data sets on compute clusters. The mappers identify the candidate $k$-sequences and calculate their local counts from the data split on the local DataNode. The candidates are then partitioned and reduced on different DataNodes in parallel. Those whose support is equal to or bigger than $min\_sup$ are distributed over the DataNodes again and checked, via bi-directional extension, to determine whether they can be pruned or are closed. In this way, the closed frequent $k$-sequences are finally obtained. The loop is repeated to obtain the closed frequent $(k+1)$-sequences. The following shows BIDE-MR in detail.

BIDE-MR(SDB, L, min_sup)
Input
  SDB: sequence database
  L: the longest length of the closed frequent sequences
  min_sup: minimum support threshold
Output
  CFS: set of frequent closed sequences found
 1: split SDB into n blocks, each assigned to one DataNode;
 2: execute in parallel on each DataNode:
      FS_1 = frequent 1-sequences(SDB, min_sup);
 3: CFS = ∅;
 4: k = 1;
 5: split FS_k into m blocks, each assigned to one DataNode;
 6: execute in parallel on each DataNode:
 7:   CFS_k, FS_{k+1} = PAR_BIDE(FS_k, k, min_sup);
 8: CFS = CFS ∪ CFS_k;
 9: k = k + 1;
10: if k ≤ L, go to 5;
11: output the closed frequent sequences CFS

PAR_BIDE(FS, k, min_sup)
Input
  FS: candidate frequent k-sequences
  k: length of the frequent sequences
  min_sup: minimum support threshold
Output
  CFS: closed frequent k-sequences
  bout: extended candidate frequent (k+1)-sequences
12: for (each sequence s in FS) do {
13:   PPD = pseudo projected database of s;
14:   if (!prunable(s, PPD)) {
15:     if (!backward-extensible(s, PPD)) {
16:       if (!forward-extensible(s, PPD)) {
17:         CFS = CFS ∪ {s};
18:       }
19:       else bout = bout ∪ {s appended with its extensible items};
20:     }
21:   }
22: }
23: return bout
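For illustration, the following is a speculative Hadoop skeleton of one map-combine-partition-reduce round of this scheme, not the authors' implementation; in particular, passing the round's candidates through the job configuration (key "bide.candidates") is only an illustrative choice. The mapper counts, on its local split, how many input sequences contain each candidate k-sequence; a combiner that merely sums partial counts may run on the map side; the reducer sums the counts and keeps only the candidates reaching min_sup, which would then be subject to the pruning and closure checks of PAR_BIDE.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper for one round: each input line is one sequence of the local split.
// The candidate k-sequences of this round are read from the job
// configuration: ';'-separated candidates, items separated by spaces.
public class CandidateCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final List<String[]> candidates = new ArrayList<>();

    @Override
    protected void setup(Context context) {
        String packed = context.getConfiguration().get("bide.candidates", "");
        for (String c : packed.split(";")) {
            if (!c.trim().isEmpty()) candidates.add(c.trim().split("\\s+"));
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] seq = line.toString().trim().split("\\s+");
        for (String[] cand : candidates) {
            if (containsInOrder(seq, cand)) {
                context.write(new Text(String.join(" ", cand)), ONE);  // local count
            }
        }
    }

    private static boolean containsInOrder(String[] seq, String[] sub) {
        int j = 0;
        for (String item : seq) if (j < sub.length && item.equals(sub[j])) j++;
        return j == sub.length;
    }
}

// Reducer: sums the partial counts of each candidate and keeps only those
// whose global support reaches min_sup (read from the configuration).
class SupportFilterReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text cand, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int minSup = context.getConfiguration().getInt("bide.min.sup", 2);
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        if (sum >= minSup) context.write(cand, new IntWritable(sum));
    }
}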
Figure 1 shows the running process of BIDE-MR.
6 Performance Evaluation

6.1 Test Environment and Dataset
The experiment was performed on a cluster of 4 computers, each with an Intel E7500 2.93 GHz CPU and 2 GB of memory, running Ubuntu 10.04 and Apache Hadoop 0.20.2. Among these 4 computers, one runs both the NameNode and DataNode software, while the other three run only an instance of the DataNode. Because synthetic datasets have characteristics far different from real-world ones, we used only real datasets in our experiments. We chose the vehicle passing-through records collected at different monitoring sites and ran BIDE-MR on these data to mine the Vehicles Frequently Appearing Together, or VFATs, which are sometimes regarded as valuable hints when solving criminal cases. The test datasets contained 2.5 million records collected at 183 monitoring sites over three months.
Fig. 1. Running process of BIDE-MR algorithm
6.2 Experimental Results
Table 1 gives the runtime of BIDE-MR in seconds on different sizes of test data. The results show that BIDE-MR runs faster with more participating DataNodes. In other words, BIDE-MR is quite scalable, especially with larger amounts of test data. Figure 2 presents the speed-ups of BIDE-MR with different numbers of DataNodes and sizes of test data.
Table 1. Result of experiments
Run time (in seconds):

Number of records   Single data node   2 data nodes   3 data nodes   4 data nodes
100,000                          129             82             71             65
500,000                          584            356            291            270
1,000,000                        735            438            342            305
1,500,000                       1077            623            481            416
2,000,000                       1398            773            598            516
2,500,000                       1928           1036            741            632
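Assuming the usual definition of speed-up as the single-node runtime divided by the p-node runtime, the values behind Figure 2 can be read off Table 1; for instance, for the 2,500,000-record dataset they are roughly 1.86, 2.60 and 3.05 on 2, 3 and 4 DataNodes. A trivial sketch of this computation (numbers copied from the last row of Table 1):

// Speed-up of BIDE-MR: runtime on a single DataNode divided by the runtime
// on p DataNodes, using the 2,500,000-record row of Table 1.
public class SpeedupFromTable1 {
    public static void main(String[] args) {
        double single = 1928;                     // 1 DataNode, seconds
        double[] parallel = {1036, 741, 632};     // 2, 3 and 4 DataNodes
        for (int p = 0; p < parallel.length; p++) {
            System.out.printf("%d DataNodes: speed-up = %.2f%n",
                              p + 2, single / parallel[p]);
        }
    }
}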
Fig. 2. Speed-ups with respect to the numbers of data nodes and sizes of test data
7 Conclusions
In this paper, we propose a parallel closed sequential pattern mining algorithm, BIDE-MR. It takes full advantage of the MapReduce paradigm on an Apache Hadoop cluster. Our experimental results on real data show that BIDE-MR attains good parallelization efficiency. To the best of our knowledge, it is the first MapReduce-based solution to the closed pattern mining problem. In the future, we will conduct more extensive experiments at larger scales to demonstrate its scalability on Apache Hadoop clusters consisting of thousands of machines.

Acknowledgments. The work is supported by the Natural Science Foundation of Zhejiang (No. LY12F02003) and the open project of the Zhejiang Provincial Key Laboratory of Network Technology and Information Security. The authors would like to thank the anonymous reviewers who gave valuable suggestions to improve the quality of the paper.
References

1. Wang, J., Han, J., Li, C.: Frequent Closed Sequence Mining without Candidate Maintenance. IEEE Transactions on Knowledge and Data Engineering 19(8), 1042–1056 (2007)
2. Wang, J., Han, J.: BIDE: Efficient mining of frequent closed sequences. In: 20th International Conference on Data Engineering, pp. 79–90. IEEE Computer Society (2004)
3. Yan, X., Han, J., Afshar, R.: CloSpan: Mining Closed Sequential Patterns in Large Databases. In: SDM 2003, San Francisco, CA, pp. 166–177 (2003)
4. Lee, A.J.T., Wu, H.-W., Lee, T.-Y., Liu, Y.-H., Chen, K.-T.: Mining closed patterns in multi-sequence time-series databases. Data and Knowledge Engineering 68(10), 1071–1090 (2009)
5. Dean, J., Ghemawat, S.: MapReduce: Simplified data processing on large clusters. Communications of the ACM 51(1), 107–113 (2008)
6. Lucchese, C., Orlando, S., Perego, R.: Parallel Mining of Frequent Closed Patterns: Harnessing Modern Computer Architectures. In: 7th IEEE International Conference on Data Mining, pp. 242–251 (2007)
7. Negrevergne, B., Termier, A., Méhaut, J.-F., Uno, T.: Discovering Closed Frequent Itemsets on Multicore: Parallelizing Computations and Optimizing Memory Accesses. In: 2010 International Conference on High Performance Computing and Simulation, pp. 521–528 (2010)
8. Cong, S., Han, J., Padua, D.: Parallel Mining of Closed Sequential Patterns. In: 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 562–567 (2005)
9. Agrawal, R., Srikant, R.: Mining sequential patterns. In: 11th IEEE International Conference on Data Engineering, pp. 3–14 (1995)
10. Chang, L., Wang, T., Yang, D., Luan, H.: SeqStream: Mining closed sequential patterns over stream sliding windows. In: 8th IEEE International Conference on Data Mining, pp. 83–92 (2008)
11. Lin, M.Y.: Mining closed sequential patterns with time constraints. Journal of Information Science and Engineering 24(1), 33–46 (2008)
12. Ding, B., Lo, D., Han, J., Khoo, S.-C.: Efficient mining of closed repetitive gapped subsequences from a sequence database. In: 25th IEEE International Conference on Data Engineering, pp. 1024–1035 (2009)
13. Chang, L., Wang, T., Yang, D., Luan, H., Tang, S.: Efficient algorithms for incremental maintenance of closed sequential patterns in large databases. Data and Knowledge Engineering 68(1), 68–106 (2009)
14. Li, H.-F., Ho, C.-C., Lee, S.-Y.: Incremental updates of closed frequent itemsets over continuous data streams. Expert Systems with Applications 36(2, pt. 1), 2451–2458 (2009)
15. Tatti, N., Cule, B.: Mining closed episodes with simultaneous events. In: 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1172–1180 (2011)
16. Zhu, H., Wang, P., He, X., Li, Y., Wang, W., Shi, B.: Efficient episode mining with minimal and non-overlapping occurrences. In: 10th IEEE International Conference on Data Mining, pp. 1211–1216 (2010)
17. Zaki, M.J.: Parallel sequence mining on shared-memory machines. Journal of Parallel and Distributed Computing 61(3), 401–426 (2001)
18. Rozenberg, B., Gudes, E.: Association rules mining in vertically partitioned databases. Data and Knowledge Engineering 59(1), 378–396 (2006)
19. Kapoor, V., Poncelet, P., Trousset, F., et al.: Privacy preserving sequential pattern mining in distributed databases. In: 15th ACM Conference on Information and Knowledge Management, CIKM 2006, pp. 758–767 (2006)
20. Nguyen, S.N., Orlowska, M.E.: A partition-based approach for sequential patterns in large sequence databases. Knowledge-Based Systems 21(2), 110–122 (2007)
21. Guralnik, V., Karypis, G.: Parallel tree-projection-based sequence mining algorithms. Parallel Computing 30(4), 443–472 (2004)
22. Luo, C., Chung, S.M.: Parallel mining of maximal sequential patterns using multiple samples. Journal of Supercomputing 59(2), 852–881 (2012)
23. The Apache Software Foundation: Apache Hadoop, http://hadoop.apache.org