Detecting Near-Duplicate Documents Using Sentence Level Features

Jinbo Feng 1 and Shengli Wu 1,2

1 School of Computer Science, Jiangsu University, Zhenjiang, China 212013
2 School of Computing & Mathematics, Ulster University, Newtownabbey, UK BT37 0QB
Abstract. In Web search engines, digital libraries and other types of online information services, duplicates and near-duplicates may cause severe problems if left unaddressed. Typical problems include more storage space than necessary, longer indexing time, and redundant results presented to users. In this paper, we propose a method of detecting near-duplicate documents. Two sentence-level features, the number of terms and the terms at particular positions, are used in the method. A suffix tree is used to match sentence blocks very efficiently. Experiments are carried out to compare our method with two other representative methods and show that our method is effective and efficient. It has potential to be used in practice.

Keywords: Text documents, Information search, Near-duplicate detection, Suffix tree, Sentence-level features.

1 INTRODUCTION

As the World Wide Web becomes increasingly popular, numerous copies of the same documents can be generated and spread very easily. A large amount of redundant web documents may cause severe problems for search engines: more space is needed to store the web documents; more time is needed for indexing; and the search results are less useful to users because of the large amount of redundant information. Sometimes it also causes problems related to copyright or intellectual property rights. Thus, research on detecting near-duplicate documents has gained attention in IR, the WWW, digital libraries, and other related areas [1, 2, 3, 4].

Near-duplicates are documents that differ only slightly in content [4], for example one document being a modification of the other via insertion, deletion or replacement of some terms. A near-duplicate detection algorithm compares two documents mainly based on syntactic similarity. For web documents, some prior treatment is necessary since they are usually very noisy: we need to extract the main content of a web document by removing HTML tags, navigation links, advertisements, and so on. The similarity of two documents can then be calculated by comparing how many words or sentences are the same in one way or another. If the similarity between the two documents is greater than a given threshold, then we may regard them as near-duplicates.

Although quite a few methods for detecting near-duplicate documents have been proposed, it is still a challenge to find one that is both effective and efficient.

One very efficient method was proposed by Wang and Chang in [4]. They take the number of terms in a sentence as the surrogate of that sentence, so each sentence is represented as a number. Two sentences are assumed to be identical if they have the same number of terms. In order to calculate the similarity of two documents, they use a fixed-size sliding window to decide how many sentences in the documents are to be compared. Obviously one problem of this method is its effectiveness, since the number of terms in a sentence is not accurate enough to distinguish a sentence from others. Another problem is that a fixed-size window cannot find all possible matches unless the window is large enough to cover the whole document; the effectiveness of the method is therefore further affected. However, using the number of terms in a sentence can be a good starting point, and errors in detection can be reduced by a few measures implemented at a relatively low cost.

In this piece of work, we also use the same feature (the number of terms in a sentence) for matching sequences of sentences. Specifically, we apply the following measures to make the detection effective and efficient:

- Instead of using fixed-size sliding windows, we use a suffix tree to compare two documents. With a suffix tree we are able to find all possible pairs of identical sentences, which improves the effectiveness of the detection process. The detection process is also very efficient;
- To further mitigate the relatively low accuracy caused by using the number of terms in a sentence as the only factor, we add a simple validation process after sentence sequence matching, i.e., comparing a few terms at particular positions in all the sentences involved.

Our experiments show that these measures work and that the proposed method is both efficient and effective. Thus we believe the method has good potential to be used in practice.

The remainder of this paper is structured as follows: in Section 2, we describe some related work. Section 3 details the near-duplicate document detection method, including all its components. Section 4 presents the experimental results. Conclusions are provided in Section 5.

2 RELATED WORK

Near-duplicate document detection can be used in different situations, such as in Web search engines for duplicate removal [10, 11] or in digital libraries for document versioning and plagiarism detection [9, 12]. Two of the earliest pieces of work are from Garcia-Molina and his colleagues [5, 12]. Another early piece of work was done by Broder [1]. Broder used shingles to represent documents. A shingle is basically an n-gram of terms, and a document is represented as a series of n-grams. The more shingles two documents share, the more similar they are. However, it is a tedious task to compare all possible pairs of shingles between two documents, especially long ones. In Broder's work, not all shingles but only a subset of them are considered, and hashing is used to speed up the comparison process.

SCAM [5] takes the bag-of-words approach. It uses words as the unit of chunking to represent documents and stores chunks in an inverted index structure [14]. In [5], the Relative Frequency Model (RFM) is proposed for detecting overlap between two documents. RFM is different from the traditional Vector Space Model (VSM) [13].

Chowdhury et al. proposed a method, I-Match, that uses statistics of the whole collection [2]. It ignores very infrequent terms and very common terms according to their IDF values in the collection.

SpotSigs [19] is one form of n-gram. Instead of using a fixed number of n terms, SpotSigs divides a sentence into short chains of adjacent content terms by using stopwords as separators. According to [19], the performance of SpotSigs is affected significantly by the stopwords used.

Lin et al. [3] proposed a supervised learning algorithm to detect near-duplicates using sentence-level features. A support vector machine (SVM) is adopted to learn a discriminant function from a training pattern set to calculate the degree of similarity between any pair of documents.

Zhang et al. [6] presented an efficient algorithm for detecting partial duplicates. The approach combines a sequence matching algorithm with MapReduce [7], a framework for large-scale distributed computing.

Wang and Chang [4, 8] proposed a method that takes the number of terms in a sentence as the surrogate of that sentence, and then uses a sliding window to compare two documents by a certain number of sentences inside the window. The size of the sliding window and the number of sentences the window moves forward each time affect the performance, so they empirically investigated many different combinations.

Similar to Wang and Chang's work, we use the same feature as they do. Apart from that, we make contributions in several different ways, as mentioned at the end of Section 1. These measures ensure that our method works efficiently and effectively at the same time.

3 PROPOSED APPROACH

Depending on the format of the documents, some pre-processing is usually necessary before a near-duplicate document detection method can be used. In the following, we assume that all the documents have been treated properly and are ready for use as pure textual documents. The pseudo-code of our near-duplicate document detection method SL+ST (Sentence Length + Suffix Tree [15, 16]), applied to two documents, is shown in Figure 1. Extending it to a collection of documents is straightforward: we simply compare all different pairs.

Algorithm 1. The near-duplicate document detection algorithm (SL+ST)
Input: a pair of documents (di, dj), a given threshold τ;
Output: a Boolean variable Y indicating whether di and dj are near-duplicates;
 1: for each document d in (di, dj) do
 2:   SL(d) ← divide d into a list of sentences;
 3:   for every sentence si in SL(d) do
 4:     remove stopwords & apply stemming;
 5:     store the first two terms as a feature;
 6:     count the number of terms;
 7:   end for
 8:   generate a list L(d) indicating the number of terms of all the sentences;
 9: end for
10: T ← build_suffix_tree(L(di), L(dj));
11: featureList ← traverse_suffix_tree(T);
12: newFeatureList ← null;
13: for all features fi in featureList do
14:   if validate(fi) = true then
15:     newFeatureList.add(fi);
16:   else if split(fi).length ≥ 2 then
17:     newFeatureList.add(split(fi));
18:   end if
19: end for
20: similarity ← calculate(newFeatureList)   // see Eq. 1 in Section 3.4
21: if (similarity ≥ τ) then {Y = true; return;}
22: else {Y = false; return;}
23: end if

Fig. 1. Pseudo-code of SL+ST

In Figure 1, there are four major steps (lines 1-9, 10-12, 13-19, and 20-23). In the following, we discuss them one by one.

3.1 Generating surrogates of two documents

In Algorithm 1, the first step is to represent each document as a string, in which each character indicates the number of terms in a given sentence. This task can be divided into three sub-tasks:

- first, divide the whole document into a list of sentences;
- second, remove stopwords and stem every term in every sentence of the document;
- finally, count the number of terms in each sentence, map each number to a character and put the characters together to form a string.

All three sub-tasks can be done in one scan of the document. For the first sub-task, we need to set a group of predefined delimiters.

Punctuation marks such as the period, question mark and exclamation mark are good candidates for this. In the second sub-task, stopwords are removed and stemming is applied to all remaining terms, so every sentence becomes shorter than its original form and is more efficient to process at later stages. In the third sub-task, we count how many terms are in each sentence and then transform this information into a string in which each character corresponds to a unique number (or, equivalently, each number is regarded as a character). After this, every document is represented as a string whose length equals the number of sentences in the whole document.
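To make this step concrete, here is a minimal Python sketch; it is not the authors' implementation, and all function and variable names are illustrative. The stopword list and the crude suffix-stripping stemmer are placeholder stand-ins for whatever stopword list and stemmer a real system would use, and sentences are split on a small fixed set of punctuation delimiters.

    import re

    # Simplified stand-ins: a real system would use a full stopword list and a
    # proper stemmer (e.g. Porter's); these are placeholders for illustration.
    STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "or", "the", "to"}

    def crude_stem(term):
        # Very rough suffix stripping, only to keep the sketch self-contained.
        for suffix in ("ing", "ed", "es", "s"):
            if term.endswith(suffix) and len(term) > len(suffix) + 2:
                return term[:-len(suffix)]
        return term

    def surrogate(document):
        """Return (surrogate string, sentence features) for one document.

        Each character of the surrogate string encodes the number of terms of
        one sentence (after stopword removal and stemming); the feature list
        keeps the first two terms of every sentence for the validation step.
        """
        sentences = [s for s in re.split(r"[.!?]+", document.lower()) if s.strip()]
        chars, features = [], []
        for sentence in sentences:
            terms = [crude_stem(t) for t in re.findall(r"[a-z0-9]+", sentence)
                     if t not in STOPWORDS]
            # Offset 48 ('0') keeps '#' (35) and '*' (42) outside the encoded
            # range, so they remain free as end-of-string markers in step 2.
            chars.append(chr(48 + len(terms)))
            features.append(tuple(terms[:2]))
        return "".join(chars), features

Choosing the character offset so that '#' and '*' stay outside the encoded range keeps them available as the end-of-string markers introduced in the next step.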

Fig. 2. The suffix tree of the string "839767i#7839678i*"

3.2 Finding common sentence blocks in two documents

The second step is, for the two strings corresponding to the two documents, to compare them and see how many of the sentences are the same. This can be done using a suffix tree. A suffix tree is a trie structure built over all the suffixes of a string [15, 16]. For our purpose, we concatenate the two strings into one long string. In order to remember where the initial two strings are in the merged string, we add an extra character '#' at the end of the first string and '*' at the end of the second string. For example, suppose that we have two strings "839767i" and "7839678i" (note that a string may comprise digits, letters, and so on). These two strings are concatenated to form one long string "839767i#7839678i*". Its suffix tree is shown in Figure 2; every leaf node indicates the position of a given suffix.

When the suffix tree is built, we may traverse it and obtain all the common substrings. Note that the number of terms in a sentence is a very weak feature and is not enough to identify a sentence. Therefore, if we match all common substrings, we may obtain many false matches: some sentences are matched simply because they contain the same number of terms, although they are not the same. In order to reduce such false matches, we may increase the number of sentences needed for a match. Therefore, we define a sequence of sentences in a document as a sentence block and use it as the basic unit of matching.

It needs some consideration to decide how many sentences a sentence block should contain: neither a very large value nor a very small value is a good option. According to our observations and experiments, 2 appears to be a balanced choice, although other options (3 or more) are possible. In the above example, we obtain a set of 3 common substrings: "39", "67" and "839".

Another problem is that, among all the common substrings identified, some are substrings of others. In the above example, "39" is a substring of "839", and obviously it is not good to take both "39" and "839" into consideration at the same time. Thus we only consider the longest common substrings whose length is above the given threshold. In our implementation, we represent each common substring identified as a triple ⟨n1, n2, n3⟩, where n1 is the end position of its first occurrence, n2 the end position of its second occurrence, and n3 the length of the substring. "39", "67" and "839" are each represented by such a triple, and we can see that "39" is a substring of "839" quite easily by comparing their two triples. Once the suffix tree is built, finding all longest common sentence blocks can be done by traversing the suffix tree once.
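The paper obtains the common blocks in linear time by building and traversing a suffix tree over the concatenated string; a full Ukkonen-style construction is too long to reproduce here, so the sketch below computes the same longest common blocks with a simple quadratic dynamic program instead. It is meant only to make the output of this step concrete; the function name, the min_len parameter and the choice to report end positions within each input string (rather than within the concatenation) are illustrative assumptions.

    def common_blocks(s1, s2, min_len=2):
        """Find common sentence blocks of s1 and s2 (a sketch, not the paper's code).

        Returns (n1, n2, length) triples, where n1/n2 are 1-based end positions
        of the match in s1/s2.  The paper derives the same matches in linear
        time by traversing a suffix tree built over s1 + '#' + s2 + '*'.
        """
        # suffix_len[i][j] = length of the longest common suffix of s1[:i], s2[:j]
        suffix_len = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
        for i in range(1, len(s1) + 1):
            for j in range(1, len(s2) + 1):
                if s1[i - 1] == s2[j - 1]:
                    suffix_len[i][j] = suffix_len[i - 1][j - 1] + 1
        matches = []
        for i in range(1, len(s1) + 1):
            for j in range(1, len(s2) + 1):
                length = suffix_len[i][j]
                # Keep only matches that cannot be extended to the right.
                extendable = (i < len(s1) and j < len(s2) and s1[i] == s2[j])
                if length >= min_len and not extendable:
                    matches.append((i, j, length))

        # Drop a match that lies entirely inside another reported match,
        # e.g. "39" inside "839" in the example above.
        def inside(a, b):
            (i1, j1, l1), (i2, j2, l2) = a, b
            return (a != b and i2 - l2 <= i1 - l1 and i1 <= i2
                    and j2 - l2 <= j1 - l1 and j1 <= j2)

        return [m for m in matches if not any(inside(m, other) for other in matches)]

On the example strings above, common_blocks("839767i", "7839678i") returns [(3, 4, 3), (6, 6, 2)], i.e. the blocks "839" and "67", which is the outcome described once "39" has been discarded.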

3.3 Validation

This step validates the findings of the previous step (all longest common substrings, or all longest common sentence blocks). For every substring obtained in step 2, we check whether the corresponding sentences are really the same: for every pair of sentences involved, we compare their first two terms. If one or both terms fail to match for at least one sentence pair, then we either discard the whole block or take the smaller block(s) that can pass the validation step, depending on which option is appropriate. For example, if we have two sentence blocks of size 5 and the 3rd pair of sentences is different, then we split the block into two blocks of 2 sentences each. As another example, if we have two sentence blocks of size 4 and the 2nd and 3rd pairs are different, then we discard the whole block.

After this step, we obtain all the longest common sentence blocks in which each pair of sentences has the same number of terms and the same two beginning terms. Such sentence blocks are regarded as identical. This step helps us reduce the number of errors in which two blocks of sentences are regarded as identical when they are not.
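The following is a hedged sketch of this validation step, assuming the per-sentence features produced by the surrogate() sketch in Section 3.1 (the first two terms of each sentence) and the (n1, n2, length) triples from the previous step; the names are illustrative, and the splitting behaviour follows the two examples given above.

    def validate_block(block, feat1, feat2, min_len=2):
        """Validate one matched block against the first-two-terms features.

        block is an (n1, n2, length) triple with 1-based end positions; feat1
        and feat2 are the per-sentence feature lists of the two documents.
        Returns the list of sub-blocks (possibly empty) that survive
        validation; sub-blocks shorter than min_len sentences are dropped.
        """
        n1, n2, length = block
        start1, start2 = n1 - length, n2 - length    # 0-based start indices
        ok = [feat1[start1 + k] == feat2[start2 + k] for k in range(length)]
        surviving, run = [], 0
        for k, good in enumerate(ok + [False]):      # sentinel ends the last run
            if good:
                run += 1
            else:
                if run >= min_len:
                    surviving.append((n1 - length + k, n2 - length + k, run))
                run = 0
        return surviving

For a block of 5 sentence pairs whose 3rd pair disagrees, this returns two blocks of 2 sentences each; for a block of 4 pairs whose 2nd and 3rd pairs disagree, it returns an empty list, i.e. the block is discarded, as in the examples above.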

3.4 Similarity calculation

Now we can calculate the degree of similarity between a pair of documents [17]. For documents d1 and d2, their similarity is measured by

    similarity(d1, d2) = |d1 ∩ d2| / |d1 ∪ d2|        (1)

Here |d| denotes the number of sentences in d. We need to set a threshold τ: if similarity(d1, d2) is no less than τ, we claim that d1 and d2 are near-duplicates; otherwise, we claim they are not [18].
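Putting the pieces together, a minimal end-to-end sketch under the same assumptions as the previous snippets might look as follows. Here |d1 ∩ d2| in Equation 1 is read as the number of sentences covered by validated common blocks and |d1 ∪ d2| as |d1| + |d2| minus that count, which is one natural interpretation rather than a detail stated in the paper; the names tau and near_duplicate are illustrative.

    def near_duplicate(doc1, doc2, tau=0.7):
        """End-to-end sketch of SL+ST using the helper functions sketched above."""
        s1, feat1 = surrogate(doc1)
        s2, feat2 = surrogate(doc2)
        blocks = common_blocks(s1, s2)
        validated = [b for block in blocks
                     for b in validate_block(block, feat1, feat2)]
        # Overlaps between validated blocks are ignored in this sketch.
        matched = sum(length for _, _, length in validated)
        union = len(s1) + len(s2) - matched
        similarity = matched / union if union else 1.0
        return similarity >= tau, similarity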

3.5 Complexity analysis

Suppose that document d1 has |d1| sentences and t1 terms, and d2 has |d2| sentences and t2 terms. Step 1 can be done in one scan of d1 and d2, so the time needed is O(t1 + t2). In Step 2, we first construct the suffix tree for d1 and d2. The suffix tree has at most |d1| + |d2| + 2 leaf nodes, and the time needed to build it is O(|d1| + |d2|). We then traverse the suffix tree once to match common substrings, which also takes O(|d1| + |d2|) time. Both Step 3 and Step 4 can be done in O(|d1| + |d2|) time. Therefore, the time complexity of all the steps together is O(|d1| + |d2| + t1 + t2), which shows that the method is very efficient.

4 EXPERIMENTS

In this section, we evaluate our method empirically. All the experiments are carried out on a desktop computer with an Intel Core i7 quad-core CPU (3.4 GHz) and 32 GB of RAM. The data sets we use are two English document collections, AP90-S and Twitter. AP90-S is a subset of AP90 and Twitter is a subset of ClueWeb12-B; AP90 and ClueWeb12-B are two data sets that have been used in TREC. The AP90 document set includes copyrighted stories from the AP Newswire in 1990. In the 1990s, NIST disseminated five discs, and AP90 is on the third disc. Those five discs were used in the ad hoc tasks from TREC 1 to TREC 8. AP90-S is generated by the following method: first, we manually select 6 documents {AP900629-0221, AP901120-0196, AP900326-0205, AP900524-0185, AP900706-0266, AP900806-0154} from AP90. Then we treat these 6 documents as queries to retrieve documents from AP90 using the Terrier retrieval system. The top 150 documents from each resultant list are put together, with duplicates (those having the same ID number) removed. Thus we obtain AP90-S, which contains 891 documents and 56,432 sentences. ClueWeb12-B was disseminated in 2013 and was used for the web task. The Twitter documents are chosen from the 6 sub-folds of the ClueWeb12-B dataset. The information on the two data sets is summarized in Table 1.

Table 1. Summary of data sets used in our experiments

Collection | Source      | Number of documents | Size
AP90-S     | Disk 3-AP   | 891                 | 5.2 MB
Twitter    | ClueWeb12-B | 2,716,306           | 116 GB

4.1 Effectiveness

Apart from our method SL+ST, we also test two other representative methods, 3-Shingles [1] and SpotSigs [19], for comparison. The harmonic mean of precision and recall, F1 [15], is used to evaluate all the methods involved. F1 is defined as

    F1 = 2 × precision × recall / (precision + recall)        (2)
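As a quick check on Equation 2 using figures reported later in Table 2: for SpotSigs at τ = 0.70, precision 0.94 and recall 0.93 give F1 = 2 × 0.94 × 0.93 / (0.94 + 0.93) ≈ 0.93, as listed in the table.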

Recall that τ, applied to the similarity of Equation 1, is the threshold we use to determine whether two documents are near-duplicates or not (Section 3.4). After running the methods to obtain documents regarded as near-duplicates, human judgment is used to decide whether those documents really are near-duplicates. It is also required to find the near-duplicates in the collection that have not been identified by the detection program. Figure 3 shows the experimental results with varying τ values.

Fig. 3. Performance comparison of the three methods on AP90-S with different thresholds

From Figure 3, we can see that the performance of all three methods varies with different τ values. Both SL+ST and SpotSigs achieve their best when τ = 0.70, and 3-Shingles achieves its best when τ = 0.60. All three methods do better when τ is close to neither 0 nor 1. This phenomenon is understandable, because F1 is a measure that combines precision and recall: when τ is very close to 1, all the methods are good on precision but bad on recall; when τ is very close to 0, all the methods are good on recall but bad on precision. Referring to Equation 2, F1 obtains its highest value when precision and recall are equally good at the same time. Overall, SL+ST performs better than the other two. SpotSigs is close to SL+ST when τ is no less than 0.3, but worse when τ is smaller than 0.3.

Let us have a closer look at the point at which each method obtains its best F1 value. Table 2 lists the precision and recall values of all three methods at that particular point.

Table 2. The best possible performance of each method

Method     | τ    | Precision | Recall | F1
SL+ST      | 0.70 | 0.97      | 0.94   | 0.96
3-Shingles | 0.60 | 0.82      | 0.88   | 0.85
SpotSigs   | 0.70 | 0.94      | 0.93   | 0.93

4.2 Efficiency

In this experiment, we compare the time needed by each of the three methods to do the task. Following [19], pruning is conducted before applying each of the methods. The Twitter document collection is used in this experiment. Figure 4 shows the time needed by each of the three methods, SL+ST, 3-Shingles and SpotSigs, for different collection sizes. The collections of different sizes are generated by choosing documents at certain intervals from the Twitter document collection.

Fig. 4. Running time of SL+ST, SpotSigs and 3-Shingles with different collection sizes

In Figure 4, the horizontal axis shows the size of the document collection used, while the vertical axis indicates the time (in minutes) needed to detect all near-duplicates in the given collection. For a collection of 100,000 documents, SL+ST takes 1.2 minutes, while 3-Shingles takes 3.3 minutes and SpotSigs 1.6 minutes. When the document collection grows to 1,000,000 documents, the times needed for SL+ST, 3-Shingles and SpotSigs are 45 minutes, 168 minutes and 69 minutes, respectively. SL+ST is always faster than the other two methods, and the gap becomes even slightly larger when more documents are involved.

5 Conclusions

In this paper we have presented SL+ST, our method for detecting near-duplicates. Experiments with two groups of documents show that the method is effective and efficient; in our experiments, it performs better than the two representative methods SpotSigs and 3-Shingles. Therefore, the proposed method has good potential to be used in practice. One further advantage of our method is that it is able to display the locations of identical content in the two documents compared, although this part has not been presented in this paper due to space limitations. This may be useful in certain applications.

6 References

1. Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic Clustering of the Web. Computer Networks 29(8-13): 1157-1166 (1997)
2. Chowdhury, A., Frieder, O., Grossman, D., McCabe, M.C.: Collection Statistics for Fast Duplicate Document Detection. ACM Transactions on Information Systems, 20(2): 171-191 (2002)
3. Lin, Y.S., Liao, T.Y., Lee, S.J.: Detecting Near-Duplicate Documents Using Sentence-Level Features and Supervised Learning. Expert Systems with Applications, 40: 1467-1476 (2013)
4. Wang, J.H., Chang, H.C.: Exploiting Sentence-Level Features for Near-Duplicate Document Detection. Proceedings of the 5th Asia Information Retrieval Symposium, pp. 205-217 (2009)
5. Shivakumar, N., Garcia-Molina, H.: SCAM: A Copy Detection Mechanism for Digital Documents. Proceedings of the International Conference on Theory and Practice of Digital Libraries (1995)
6. Zhang, Q., Zhang, Y., Yu, H.M., Huang, X.J.: Efficient Partial-Duplicate Detection Based on Sequence Matching. Proceedings of ACM SIGIR, pp. 675-682 (2010)
7. Dean, J., Ghemawat, S.: MapReduce: Simplified Data Processing on Large Clusters. Proceedings of the 6th Symposium on Operating System Design and Implementation (2004)
8. Chang, H.C., Wang, J.H., Chiu, C.Y.: Finding Event-Relevant Content from the Web Using a Near-Duplicate Detection Approach. Proceedings of the 2007 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 291-294 (2007)
9. Hoad, T., Zobel, J.: Methods for Identifying Versioned and Plagiarized Documents. Journal of the American Society for Information Science and Technology, 203-215 (2003)
10. Manku, G.S., Jain, A., Sarma, A.D.: Detecting Near-Duplicates for Web Crawling. Proceedings of the 16th International Conference on World Wide Web, pp. 141-150 (2007)
11. Schleimer, S., Wilkerson, D., Aiken, A.: Winnowing: Local Algorithms for Document Fingerprinting. Proceedings of ACM SIGMOD, pp. 76-85 (2003)
12. Brin, S., Davis, J., Garcia-Molina, H.: Copy Detection Mechanisms for Digital Documents. Proceedings of ACM SIGMOD, pp. 388-409 (1995)
13. Salton, G.: The State of Retrieval System Evaluation. Information Processing & Management, 28(4): 441-448 (1992)
14. Frakes, W.B., Baeza-Yates, R.A.: Information Retrieval: Data Structures & Algorithms. Prentice Hall (1992)
15. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval: The Concepts and Technology Behind Search. ACM Press (2011)
16. Ukkonen, E.: On-Line Construction of Suffix Trees. Algorithmica, 14(3): 249-260 (1995)
17. Huang, L., Wang, L., Li, X.: Achieving Both High Precision and High Recall in Near-Duplicate Detection. Proceedings of ACM CIKM, pp. 63-72 (2008)
18. Yerra, R., Ng, Y.K.: A Sentence-Based Copy Detection Approach for Web Documents. Proceedings of the 2005 International Conference on Fuzzy Systems and Knowledge Discovery, Part 1 (2005)
19. Theobald, M., Siddharth, J., Paepcke, A.: SpotSigs: Robust and Efficient Near Duplicate Detection in Large Web Collections. Proceedings of ACM SIGIR, pp. 563-570 (2008)