Journal of Computational Information Systems 7:5 (2011) 1420-1427 Available at http://www.Jofcis.com
A Real Environment Oriented Parallel Duplicates Removal Approach for Large Scale Chinese WebPages Hongzhi GUO1,2,†, Qingcai CHEN1,2, Cong XIN1, Xiaolong WANG1,2,3, Ye BI1 1
Dept. of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, China 2 Key Laboratory of Network Oriented Intelligent Computation, Shenzhen 518055, China 3 Intelligent Technology & Natural Language Processing Lab, Harbin Institute of Technology, Harbin 150001, China
Abstract Aiming at the problems that existing Chinese webpage duplicate removal approaches are restricted by the page scale and the algorithm efficiency, we propose an efficient distributed parallel duplicates elimination approach based on Linux cluster. This paper not only solves the problems of memory limitations and large computation caused by the huge data scale, but also researches into the time-series problems in the distributed parallel computing and gives an effective solution. Experimental results on 10 million webpage dataset show that the proposed approach can deal with duplicates from massive web pages well and truly. Keywords: Duplicates Elimination; Distributed Parallel Computing; Linux Cluster; Time-series; Speedup
1. Introduction Due to the frequent reprinting among different websites, duplicate WebPages are quite normal in the returned results from search engines [1]. Duplicates and near-duplicates make users have to check the same information for many times. Lots of time and energy are wasted. Meanwhile a lot of network bandwidth and storage of search engines are taken up and wasted. More importantly, the really needed information might be buried in these duplicate or near-duplicate results. Duplicate WebPages not only reduce the searching efficiency, but also cause a great impact on the precision and recall of search engines. Recently different kinds of methods have been proposed in duplicates detection and elimination [1-5]. However existing methods can only ensure a high precision in a small data scale, they cannot provide an effective solution for the complexity caused by the huge data scale in search engines. To solve these problems we present a distributed parallel duplicates elimination approach based on Linux cluster. 2. Related Work Duplicates elimination originates from text copy detection [6]. Due to the information used, duplicates elimination methods can be divided into URL-based [4, 7], hyperlink-based [8] and content-based [9], three types. Among them URL-based and hyperlink-based methods are simple, intuitive and easy to be implemented, but due to the information used they all don't deal well with the reprinting web pages, which are the most in the real environment. Content-based methods tell the duplicates according to the contents and structures of the web pages. In recent years the content-based methods have been the mainstream in this field. The methods can usually be divided into three steps: feature extraction, similarity computation †
Corresponding author. Email addresses:
[email protected] (Hongzhi GUO),
[email protected] (Qingcai CHEN),
[email protected] (Xiaolong WANG)
1553-9105/ Copyright © 2011 Binary Information Press May, 2011
1421
H. Guo et al. /Journal of Computational Information Systems 7:5 (2011) 1420-1427
and duplicates elimination. Some features are firstly extracted from a web page as the representative, and then the similarities between the pages are calculated, after that the duplicates are detected and dropped. Among these three steps, feature extraction is a key link. Content-based methods can be divided into two types according to the ways of feature extraction [6], syntax-based [10, 11] and semantic-based [12]. The main difference between them is that, syntax-based methods just look the contents as strings without any consideration of the semantic information in these documents, and select different syntax marks as the rules for feature extraction; semantic-based methods generally need to analyze the text contents and use a variety of statistical natural language processing techniques to extract the features. Considering the differences between Chinese documents and English documents, a special word segmentation step is always needed. The results of word segmentation will inevitably influence the final the duplicates elimination results. Therefore most existing duplicates elimination approaches for Chinese WebPages adopt the syntactic-based methods for the consideration of the efficiency and effectiveness. In [13] a feature code based duplicates removal algorithm is developed. Word strings with a length of L/2 are selected from both sides of the periods, these strings constitute a feature code, and then the feature codes are indexed by B-Trees for detecting duplicates. A similar duplicates elimination algorithm based on feature strings is presented in [14] by Pingbo Wu et al.. It firstly uses high-frequency-items to separate web pages into “sentences”, and then the characters in font and end of the “sentences” are extracted as the feature strings, finally clustering techniques are applied to reduce the comparisons in duplicates detection. Besides, there is still another kind of methods, which takes the keywords extracted in the indexing as the feature items and uses them to eliminate the duplicates. The approach used in Peking Tianwang search engine [4], which uses the abstracts and keywords for duplicates elimination, serves as a representative. In duplicates elimination, all of the above representative methods acquire satisfactory results in a laboratory data scale. However with the increasing of the number of the web pages, the problems like the memory limitations, huge computation etc, severely limit the capability of the duplicates elimination algorithms. Facing hundreds of millions of web pages, or TB level as well as PB level page scale in a real environment, there is nothing the existing single-machine based duplicates elimination algorithms can do. Aiming at the problems above, we propose a parallel Chinese WebPages duplicates elimination approach, and solves the problems such as data inconsistencies, parallel time-series etc in distributed parallel computing. The duplicates removal efficiency is greatly improved with the duplicates removal accuracy being ensured meanwhile. This provides a solution for the duplicates elimination of Chinese WebPages in a real environment. 3. Parallel Duplicates Elimination Based on Linux Cluster In previous studies, we have proposed a length-variable feature code based fuzzy duplicates elimination approach for Chinese WebPages [16]. The basic flow of the duplicates elimination approach is that: first, referring to our length-variable feature code extraction algorithm a short feature code is extracted from a purified web page as its representative; and then generalized suffix trees are built using the feature codes, the duplicates are detected and discarded using these built suffix tress. Considering that there are so many web pages, an extremely large memory space is absolutely necessary during the implementation of the generalized suffix tree, and wider comparisons are required for our previous definition of one-way repeatability. Distributed parallel computing is considered to be a good solution for these problems in the implementation of the length-variable feature code based fuzzy duplicates elimination approach. 3.1. Problems Facing Overall, the paragraphs below list the main problems facing in a real environment oriented duplicates elimination approach for massive web pages: I, Huge data scale General commercial search engines usually have to deal with hundreds of GBs or TB-level purified web pages every day; II, Huge memory space required A larger memory space is usually required while a higher speed is obtained using the
H. Guo et al. /Journal of Computational Information Systems 7:5 (2011) 1420-1427
1422
generalized suffix tree based duplicates elimination. III, Wide comparison ranges Due to our previous definition of one-way repeatability any page has to be compared with the feature codes with a length of no less than its, the comparison ranges are extremely wide; IV, Incapability of common inner sorting algorithms In our previous algorithm, longer feature codes are disposed before the shorter ones, therefore the feature codes to be detected should be firstly sorted, however common inner sorting algorithms cannot deal with so huge amount of data. How to arrange the duplicates eliminating procedure, solve the above problems and finish the duplicates elimination tasks with a high efficiency are usually the difficulties in a real environment oriented duplicates elimination method. 3.2. Framework of Distributed Parallel Duplicates Elimination In figure 1, the framework of our feature code based parallel duplicates elimination approach and its data flow are detailed. Data files
3 steps in parallel duplicates elimination
Wangjingqi.new3.ripes.0000 …… Wangjingqi.new3.ripes.0008
Purified web pages
Extract the feature codes in parallel 4.code 5.code …… 51.code
The file group of the feature codes
Remove the duplicated web pages with the same feature codes Maincode Maincode.index Mainsim
Feature code files and its index, duplicates pair file
FValid FSim
The ID files of the legitimate pages, the files of the repeatability
Remove the duplicates with partly matching feature codes
Fig.1 Framework of the Parallel Approach for Removing Duplicates
The parallel duplicates elimination system mainly contains three steps: 1. Extract the feature codes in parallel This part of work is to read the purified web pages and divide them into N blocks based on a specified block size, and then dynamically schedule M working nodes to extract the feature codes, complete the N tasks and store them into the file system ordered by the lengths of the feature codes. 2. Remove the duplicated web pages with the same feature codes After the first step a file group of ordered feature codes is obtained. In this step we only need to remove the same pages within each file, and no merging is required. After that the retained feature codes are written into a feature code file maincode in descending order of their lengths. 3. Remove the duplicates with partly matching feature codes This step also contains two phrases: first, local duplicates elimination, the duplicates within each block are removed in parallel; second, global merging, it can be verified that there is no duplicate within each block after the previous phrase, however there is no guarantee that no duplicate exists between the blocks, and a merging operation is introduced to eliminate the duplicates between the blocks. 3.3. A Parallel Resolution for the Suffix-Tree Based Duplicates Elimination Firstly the feature code files are divided into several blocks (N blocks) by the master node with a specified block size. Since the following suffix tree based duplicates elimination is carried out within each block, the block size cannot exceed the maximum memory space of a single machine. Local duplicates elimination is executed within the N blocks, for each block we build the suffix trees using their feature codes, and then the duplicates are dropped after the comparisons between the feature codes, the legitimate records are written into a temporary file. After that we have N temporary files and it can be guaranteed that there is no duplicate within these N files, but there may still be duplicates between them. The duplicates between the
1423
H. Guo et al. /Journal of Computational Information Systems 7:5 (2011) 1420-1427
blocks are disposed in the second global merging step, and it ensures the uniqueness of each record in the data set. Figure 2 shows the schematic of this part of the work. Compared with local duplicates elimination, the global merging step is more complex. As mentioned earlier, each feature code has to be compared with the ones of a no less length, and the duplicate relationships cannot be determined by only a part of them. These make that traditional merging algorithms cannot be used. Based on an overall consideration of various factors, we adopt the following merging scheme: Suppose that there are N temporary result sets (from result set 1 to result set N) obtained after the local duplicates elimination, i. Suffix tress are built using the feature codes in result set 1 (host data), and then the records in result sets 2, 3, ..., N (guest data) are located in to carry out the comparisons, during the comparisons the duplicates are detected and discarded subsequently; ii. Suffix tress are built using the feature codes in result set 2 (host data), and then the records in result sets 3, ..., N (guest data) are located in to carry out the comparisons, during the comparisons the duplicates are detected and discarded subsequently; ... To this analogizes, the N result sets are completed. block informatio n
block informatio n
node 1
temporary file 1
node 2
temporary file 2
node… 1
temporary file 1
node 2
temporary file 2
2
2
Mainc ode
1
1
1
… … …
… … …
… … …
… … …
node M
temporary file M
node M
temporary file M
N
N
Step 2 Global merging
… result 2
result 3 ……
…
result N
use result set 1 as the host data, detect the duplicates in result sets 2-N 2
result 1
… … …
Step 1 local duplicate elimination
… result 1 (a)
Writing results FValid FSim
… result 3 ……
…
result N
(b) use result set 2 as the host data, detect the duplicates in result sets 3-N … … ...
N Following disposition
… result 2
result 1
result 2
result 3 ……
…
result N
(c) result set N
Fig. 2 Sketch Map of Parallel Duplicates Removal
Fig. 3 Sketch Map of Global Merging
The core of the global merging is that each record is compared with the longer records in other blocks. The records in result sets 1 to N are originally retained in descending order of the lengths of the feature codes, and we build suffix trees (host data) in process 1, …, N using the N result sets respectively. Then the records from larger result sets (guest data) are located in orderly to carry out the comparisons between them and the guest data, the duplicates are detected and dropped. Figure 3 shows a sketch map of the global merging. The differences between the global merging and the local duplicates elimination are that, even if there is no duplicate, the record won't be inserted into the suffix tree, and when a suffix tree is once established, its size won't change any more, no more memory burden is required. 3.4. Time-series Analysis in Distributed Parallel Computation In our experiments we find there is a very important time-series problem to be considered in the above scheme (scheme A). The global merging cannot be completely implemented in parallel since the processes condition each other. For the sake of discussion, we assumes that a host data requires two units of time (dark shadings in the figures), and the detection of a guest data needs one unit of time (light shadings in the figures), and there are totally four result sets to be disposed. The corresponding time-series diagram is shown in figure 4(a). In figure 4(a), process 2 begins after process 1 has built its suffix tree and disposed the first guest data block (result set 2), and then it adopts a new result set 2 (the duplicates have been removed) as the host data to deal with others. To this analogizes, the global merging is finished. These procedures cost a total of nine
H. Guo et al. /Journal of Computational Information Systems 7:5 (2011) 1420-1427
1424
units of time. Although this time-series arrangement satisfies the system's requirements, it also brings some difficulties in practice. Since we won't start a new process k until process k-1 has disposed the first guest data block (data block k), extra communication between the master node and the slave node happens. Furthermore there is still a question how to transfer the data to the newly started process. Firstly, if they are sent directly, the communication increases sharply and the efficiency drops significantly; secondly if there are not so many working nodes currently, the task has to be disposed in the next round of node allocation and we have to write the data into a temporary file. As a result, inconsistent disposition, more complex time-series logic, and more cumbersome controls between the master node and the slave nodes all come. These are not worth the candle.
(a) Scheme A
(b) Scheme B Fig. 4 Time-series Diagram of Scheme A and B
After considering various factors and the merits of the above scheme, we adopt another scheme (scheme B) in this paper: carry out the duplicates elimination in parallel and obtain independent duplicates pairs, then use the master node to execute a following sequential processing. The corresponding time-series diagram is shown in figure 4(b). During the merging each process goes on completely in parallel without any waiting, and accomplishes their tasks alone. The running time is reduced to five units of time. Obviously this scheme will lead to some redundant operations, which cannot be avoided to obtain a maximum parallelism. To get the final results a follow-up processing is executed by the master node, it sequentially reads the duplicates pairs detected by each process and removes "nonexistent" duplicates pairs, which should have been removed in the disposition of the former result sets. After that a final DocId collection of the preserved web pages is constructed from the newly updated duplicates pair set. Actually the records with the same feature codes have been removed in the second step, and the records that should be removed in this step are not too much as long as the repeatability threshold isn't too small. This means that the redundant operations mentioned above won't be too much. Overall the efficiency losses caused by the mentioned redundant operations are far less than the benefits of this scheme. The parallelism is greatly raised in Scheme B, and less communication also lightens the burden on the master node. The controlling and time-series logic in the program become simper. Eventually merging independently with a following sequential processing is eventually adopted in the implementation of the global merging. 4. Experimental Results and Discussions The speedup is one of the main criteria in parallel computing. The speedup of a parallel computing method means how many times does the execution speed of the parallel algorithm performs over that of a serial algorithm. Its definition is given in formula (1), T (0) S= (1) T ( n) Where T(n) is the running time of the parallel algorithm using n processors, and T(0) denotes the running time of a best serial algorithm on the same problem. 4.1. Correctness Verification of Parallel Duplicates Elimination In our previous duplicates elimination algorithm the duplicates are detected and dropped subsequently and
1425
H. Guo et al. /Journal of Computational Information Systems 7:5 (2011) 1420-1427
the final results relate to the disposing sequence of the records. To avoid the influence of the sequence on the experimental results we write three programs, which separately correspond to the three steps (feature code extraction in parallel, removing duplicated web pages with the same feature codes and removing the duplicates with partly matching feature codes using the suffix tree) in the parallel duplicates elimination approach, and compare them with the corresponding single-machine programs in the correctness verification of our parallel duplicates elimination approach. News_xc.ripes with 25,476 records is adopted as the data set. Experiment I feature code extraction The parallel program divides the data into 3 portions and uses eight computers as working nodes to extract the feature codes, and then the feature codes are written into a feature code file group ordered by their lengths. The corresponding single-machine program does the same work without any node allocation. Finally the running time of the single-machine program is 37 seconds, and that of the parallel program is 13 seconds. After comparing their results, we find they all contain 25,476 records, but these feature codes are in a different order. This is because that the processing time of the data in each node is different, and the sequence of writing results differs from that in the data block allocation. Experimental results show that this part only get a speedup of 2.85. This is because that although we have eight working nodes, only three of them are used for the small data scale, and the speedup can only reach 3 in the best case. That the speedup is less than 3 is because additional task assignment, task scheduling and communications between the master and the slave nodes cost a certain time. Moreover our parallel file system (pvfs2 1.40) only supports reading concurrently, but doesn't support writing concurrently, therefore the efficiency of our parallel program is partly reduced. In experiments II and III, the time spent is both very short in the single-machine program and the parallel one for the small data scale (huge data scale cannot be disposed by the single-machine program.), the time spent on the task assignments takes a larger proportion. The comparisons are not reasonable. Therefore we only focus on the correctness verification of the parallel method in this section, and the running time of the parallel methods is detailed in Sect. 4.3. Experiment II removing duplicated web pages with the same feature codes We still use eight working nodes in the parallel program. After comparing the result files (feature code files without any same feature code and been sorted in order) of the single-machine program with those of the parallel one, we find they have the same number of records (23,980 after duplicates removal), but some of the records are different. This is because that our duplicate elimination method discards the duplicates once they are detected, and in this step the results are related to the sequence of the coming feature codes, regretfully the sequences in the two files have been different in step I. In order to verify the correctness of the parallel method, the results in step I from the single-machine program are adopted as the input of our parallel program. The final results show that the outputs are the same with those of the single-machine program. Experiment III removing the duplicates with partly matching feature codes using the suffix tree This part is divided into two steps: local duplicates elimination within a data block and global merging between data blocks. The corresponding single-machine program deals with the feature codes using the suffix tree directly, and there is no task assignment and global merging. We still use eight working nodes in the parallel method, the partitioned task block is 0.5MB and we totally get three task blocks. In the single-machine program the same data are inputted. After the duplicates have been eliminated by the parallel program and the single-machine one, both the numbers of their final legitimate web pages are 23,617, and they have the same contents too. In brief, the experimental results agree with our parallel duplicates elimination approach. 4.2. Efficiency of Parallel Duplicates Elimination As the corresponding single-machine duplicates elimination program in our paper cannot deal with huge data, if we just compare our parallel duplicates elimination approach with the single-machine program
H. Guo et al. /Journal of Computational Information Systems 7:5 (2011) 1420-1427
1426
using a small amount of data, the efficiency of our parallel is not so significant. In our experiments different numbers of working nodes are implemented, and the efficiency of the parallel method is verified through the variations of their running time. A purified web page file, which has a size of 232.56GB and 26,203,124 web pages, is chosen from the Haitianyuan search engine in our lab as the data set. In the first step of extracting the feature codes, the block size is set to 1.5GB, and there are totally 155 sub-tasks. In the second step of removing the pages with the same feature codes, there is no task assignment. But in the experiment we find that the file 20.code is too large, and we divide it into 20 small files by hashing. Therefore we get 67 sub-tasks in this step. In the third step of removing the duplicates with partly matching feature codes using the generalized suffix tree, the block size is set to 2MB, and there are totally 426 sub-tasks. After removing the link type web page and the ones with few contents, the parallel duplicates elimination approach is implemented, and there are 14,726,030 valid web pages left. The table of the running time and the corresponding curve are shown in table 1 and figure 8 respectively. Table 1 Running Time of the Parallel Duplicates Elimination Algorithm Nums of Processes
4
6
8
10
12
14
16
Running time (hours)
16.24
10.34
7.73
6.56
5.87
5.4
5.23
Fig. 5 Curve of the Running Time of the Parallel Duplicates Elimination Algorithm along with the Numbers of Nodes
As shown in table 1 and figure 5, with the increasing of the numbers of the working nodes, the running time of our parallel duplicates elimination reduces a lot. Also from the variation of the curve, both the running time and the speedup vanishingly converge. Therefore if more working nodes are added, there will be few improvements on the efficiency. This is because that in the first two steps of the parallel duplicates elimination approach, exclusive writing exists and with the increasing of working nodes more resources are wasted. Furthermore each working node has to communicate with the master node, and the communication time increase too, thus the enhancing of the efficiency of the overall approach becomes smaller and smaller. 5. Conclusion In order to solve the problems that existing duplicates elimination algorithms cannot deal well with the huge data in a real environment, this paper proposes a distributed parallel method based on Linux cluster. For the three steps in duplicates elimination, we adopt the master-slave mode, the master node is responsible for the tasks allocated to each slave node, sorting and merging the results from the slave nodes. Specifically aiming at the cross duplicate elimination among the slave nodes, this paper focuses on the problems such as the time-series and the communications, and gives an efficient and accurate solution. The experiments not only prove the efficiency of the distributed parallel duplicates elimination approach, but also validate its accuracy. Experimental results show that the distributed parallel duplicates elimination
1427
H. Guo et al. /Journal of Computational Information Systems 7:5 (2011) 1420-1427
approach proposed can deal well with the huge data in a real environment, and it also has a high efficiency while the duplicates elimination accuracy is ensured. Our distributed parallel computing scheme is of generality. It can not only be implemented in the duplicates elimination of huge data scale, but also be used to solve similar problems arisen by the huge data scale and memory limitations in other applications, such as parallel crawling, parallel indexing, and so on. Acknowledgement This work is supported by Natural Science Foundation of China (No. 60703015 and No. 60973076). The authors are grateful for the anonymous reviewers who made constructive comments. Reference [1] [2]
[3]
[4] [5]
[6] [7]
[8] [9]
[10] [11]
[12] [13]
[14]
[15]
J. P. Kumar, P. Govindarajulu. Duplicate and Near Duplicate Documents Detection: A Review. European Journal of Scientific Research, 32(4), pp. 514-527, 2009. Cao Yu-Juan, Niu Zhen-Dong, WangWei-Qiang and Zhao Kun. The study on Detecting Near-Duplicate WebPages. In Proceedings of the 8th IEEE International Conference on Computer and Information Technology (CIT 2008). Sydney, Australia, IEEE, pp. 95-100, 8-11 July, 2008. Gurmeet Singh Manku, Arvind Jain and Anish Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of the 16th international conference on World Wide Web. Banff, Alberta, Canada, ACM, pp. 141-150, 2007. Wang Jian-Yong, Xie Zheng-Mao, Lei Ming and Li Xiao-Ming. Research and Evaluation of Near replicas of Web Pages Detection Algorithms. Acta Electronica Sinica, 28(11), pp. 130-132, 2000. Rajiv Yerra, Yiu-Kai Ng. Detecting similar HTML documents using a fuzzy set information retrieval approach. In Proceedings of 2005 IEEE International Conference on Granular Computing, Beijing, China, pp. 693-699, 25-27 July, 2005. Bao Jun-Peng, Shen Jun-Yi, Liu Xiao-Dong and Song Qin-Bao. A Survey on Natural Language Text Copy Detection. Journal of Software, 14(10), 1753-1760, 2003. Deng, F., Rafiei, D.. Approximately detecting duplicates for streaming data using stable bloom filters. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data, ACM, pp. 25-36, Chicago, IL, USA: 2006. Jeffrey Dean, Monika R. Henzinger. Method for identifying near duplicate pages in a hyperlinked database. U. S. Patent 6138113, USA, Oct 24 2000. K. Monostori, A. Zaslavsky and H. Schmidt. MatchDetectReveal: Finding overlapping and similar digital documents. In Proceedings of the Information Resources Management Association International Conference (IRMA2000), Anchorage Hilton Hotel, Anchorage, Alaska, USA, pp. 955-957, 21-24 May, 2000. S. Brin, J. Davis, H. Garcia-Molina. Copy Detection Mechanisms for Digital Documents. In Proceedings of the ACM SIGMOD Annual Conference, New York, USA, ACM, pp. 398-409, 1995. K. Monostori, A. Zaslavsky and H. Schmidt. Parallel and distributed overlap detection on the Web. In Proceedings of the Workshop on Applied Parallel Computing (PARA2000), Bergen, Norway, pp. 206-214, 18-21 June 2000. A. Si, H.V. Leong and R.W.H. Lau. CHECK: A document plagiarism detection system. In Proceedings of the ACM Symposium for Applied Computing, San Jose, California, United States, ACM, pp. 70-77, 1997. Zhang Gang, Liu Ting, Zheng Shi-Fu, Che Wan-Xiang and Li Sheng. Fast Deletion Algorithm for Large Scale Duplicated Web Pages. The twentieth anniversary Proceedings of the Chinese Information Processing Society of China (sequel), pp. 18-25, 2001. WU Ping-bo, CHEN Qun-xiu and MA Liang. The Study on Large Scale Duplicated Web Pages of Chinese Fast Deletion Algorithm Based on String of Feature Code. Journal of Chinese Information Processing, 17(2), pp. 29-36, 2003. Cong Xin. Method of Parallel Removing Duplicates in Large Scale Chinese Web Pages Based on Feature Code. Master thesis in Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, 2008.