A Copy Detection Method Based on SCAM and PPCHECKER

NGUYEN Lương-Hien
Department of Information System, SOICT, HUST
1 Đại Cồ Việt, Hà Nội, Việt Nam
[email protected]

NGUYEN Thi-Oanh
Department of Information System, SOICT, HUST
1 Đại Cồ Việt, Hà Nội, Việt Nam
[email protected]

ABSTRACT
With the widespread use of the Internet and the availability of a huge amount of digital documents online, plagiarism is increasing. This is a serious problem not only in the publishing of scientific documents but also in education. Copying is a frequent form of plagiarism: documents can be copied completely or in part. Many document copy detection (DCD) methods have been proposed; however, few of them can detect partial copies with high efficiency and in reasonable time. In this paper, we propose a scheme for detecting copies, including partial copies. The proposed method is based on the SCAM and PPCHECKER methods and benefits from the advantages of both. Experimental results with high precision demonstrate the effectiveness of the proposed method.

CCS Concepts
• Information systems → Retrieval models and ranking
• Information systems → Near-duplicate and plagiarism detection

Keywords
Copy detection; partial copy detection; plagiarism; SCAM; PPCHECKER

1. INTRODUCTION
Nowadays, most documents exist in digital format and are frequently shared through email, websites, etc., so their content can easily be accessed. Although documents are sometimes protected by copy prevention schemes, which make sharing information more difficult, such schemes cannot prevent the entire or partial copying of a document: users can capture the content with dedicated software or simply by retyping it. Therefore, content-based copy detection remains an important issue. It helps us to avoid and prevent potential or accidental plagiarism. Due to the importance of copy detection, many software tools have been developed for plagiarism detection, such as EVE2 [13], Plagiarism-Finder [14], WCopyFind [1], Turnitin [15] and SafeAssign [16]. They can be web-based or standalone systems [3]. However, most of them are commercial; WCopyFind is one of the good free tools.


Since the 1990s, many copy detection methods have been introduced, such as COPS [2], WCOPYFIND [1], MDR [7], SCAM [9, 10], PPCHECKER [8], CHECK [11] and SNITCH [12].

COPS [2] compares the sentences of the query document with those of the original documents. Each sentence is hashed, and the hash value is used to detect copied sentences; if two documents share more than a threshold number of sentences, a violation is flagged. This method detects exactly the wholly copied sentences, but it cannot detect partially copied sentences. Furthermore, it is case sensitive. The authors of WCOPYFIND [1] use phrases of at least six words as the checking unit. The plagiarism rate is calculated as the ratio between the number of words in matched phrases and the total number of words in the documents. This method works very well for exact copies, but not when some words are changed or deleted. A suffix tree is used in MDR [7] for matching sentences. This method can detect either partially or completely duplicated sentences; however, building a suffix tree for a document is very expensive. SCAM [9] detects copies by comparing the word occurrence frequencies of the query document and the other documents. Speed is one of its main advantages. It can find partially copied sentences and works correctly if the copied parts are significant. However, if only a few parts are copied from other documents, SCAM gives results with lower precision; that is, SCAM reports more false positives than COPS [9]. PPCHECKER [8] uses a local similarity as the unit for checking the possibility of copy. This measure is computed at sentence level by comparing a sentence of the query document with all sentences of another document; it is based not only on the set of common words but also on the set of synonyms shared by the two sentences. PPCHECKER can detect with high precision both wholly copied sentences and partially copied sentences affected by some modifications. However, it is time-consuming and its performance is affected by stop words.

Recently, in the PAN 2013 and PAN 2014 competitions, different methods for copy detection have been developed. Their main principle is to find seeds (small similar fragments) between two documents, then to form larger similar text fragments and to filter the final results [6]. For seeding, authors use techniques such as bags of words, context n-grams, context skip n-grams and named-entity n-grams [4]. The winner at PAN 2014 obtained results with high precision, on both exact copies and copies with obfuscation. Sentences are usually used as the seed unit, and these methods appear to be time-consuming.

With the aim of detecting fraud as early as possible, we wanted a copy detection method that works effectively in terms of both precision and processing time. The method should report correct results for global copies as well as local copies. We therefore propose in this paper a method based on SCAM and PPCHECKER. The objective is to rapidly identify paragraphs that have a high possibility of plagiarism, and then to check these suspicious paragraphs carefully.

The paper is organized as follows. Section 2 presents the SCAM and PPCHECKER methods. Our proposed approach is described in Section 3. Section 4 shows our experimental results and discussions. Section 5 draws conclusions and outlines future work.

2. RELATED WORKS

2.1 SCAM
SCAM (Stanford Copy Analysis Mechanism) [9] determines the degree of plagiarism of a document based on a set of common words. Let R be the query document and S the original document. The authors of SCAM first define the closeness set c(R, S), which contains the words w_i that have a similar number of occurrences in the two documents. A word w_i belongs to c(R, S) if it satisfies the following condition:

\varepsilon - \left( \frac{F_i(R)}{F_i(S)} + \frac{F_i(S)}{F_i(R)} \right) > 0

where ε ∈ (2, +∞) is a constant parameter, set to 2.5 as the best value in practice, and F_i(R), F_i(S) are respectively the numbers of occurrences of w_i in R and S. Next, the subset measure of document R with respect to document S, denoted subset(R, S), is defined as follows:

subset(R, S) = \frac{\sum_{w_i \in c(R,S)} \alpha_i^2 \, F_i(R) \, F_i(S)}{\sum_{i=1}^{N} \alpha_i^2 \, F_i(R)^2}

where α_i is the weight of w_i, usually set to 1. Then sim(R, S), the similarity measure between the two documents R and S, is defined as follows:

sim(R, S) = \max\{subset(R, S), \; subset(S, R)\}

If sim(R, S) > 1, it is set to 1. This method has an advantage in processing time because determining the closeness set is very fast. However, with a fixed value of ε, the chance of matching unrelated documents (false positives) increases with document length, because the relative positions of the common words are not taken into consideration. The false positives can be controlled by modifying the value of ε: a low value of ε decreases the false positives but also decreases the ability to detect minor overlaps.

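To make these formulas concrete, the following is a minimal sketch in Python (an illustration under our reading of the formulas, not the original SCAM implementation) of the closeness set, the subset measure and the resulting similarity, with all word weights α_i set to 1:

```python
from collections import Counter

def scam_similarity(r_words, s_words, epsilon=2.5):
    """SCAM-style similarity between two documents given as lists of words."""
    fr, fs = Counter(r_words), Counter(s_words)

    # Closeness set c(R, S): common words whose occurrence counts are similar,
    # i.e. epsilon - (F_i(R)/F_i(S) + F_i(S)/F_i(R)) > 0.
    closeness = {w for w in fr.keys() & fs.keys()
                 if epsilon - (fr[w] / fs[w] + fs[w] / fr[w]) > 0}

    def subset(fa, fb):
        # subset(A, B): overlap over the closeness set, normalized by the size of A.
        num = sum(fa[w] * fb[w] for w in closeness)
        den = sum(f * f for f in fa.values())
        return num / den if den else 0.0

    # sim(R, S) = max(subset(R, S), subset(S, R)), capped at 1.
    return min(1.0, max(subset(fr, fs), subset(fs, fr)))
```

For example, scam_similarity("a b b c".split(), "a b b d".split()) returns about 0.83, reflecting the shared words a and b.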
2.2 PPChecker
Basically, the PPChecker (Plagiarism Pattern Checker) algorithm [8] compares a sentence of a query document R with a sentence of an original document S. If R has n sentences and S has m sentences, the algorithm compares n × m sentence pairs, and the plagiarism degree between R and S is computed from the similarity of each pair.

Let S_q be a sentence of the query document R and S_o a sentence of the original document S, sim(S_o, S_q) their similarity value, Comm(S_o, S_q) the set of words common to S_o and S_q, Diff(S_o, S_q) the set of words existing in S_o but not in S_q, and Syn(w) the set of synonyms of a word w:

S_o = \{w_1, w_2, \ldots, w_k, \ldots, w_n\}, \quad S_q = \{w_1, w_2, \ldots, w_l, \ldots, w_m\}
Comm(S_o, S_q) = S_o \cap S_q
Diff(S_o, S_q) = S_o - S_q
SynWord(S_o, S_q) = \{w_i \mid w_i \in Diff(S_q, S_o) \text{ and } Syn(w_i) \cap S_o \neq \emptyset\}
WordOverlap(S_o, S_q) = |Comm(S_o, S_q)| + \alpha \times |SynWord(S_o, S_q)|
SizeOverlap(S_o, S_q) = |Diff(S_o, S_q)| + |Diff(S_q, S_o)|

where α is a weight value, usually set to 1. The similarity value between S_o and S_q is then:

sim(S_o, S_q) = \frac{|S_o|}{e^{\frac{|S_o|}{WordOverlap(S_o, S_q)} - 1} + SizeOverlap(S_o, S_q)}

Therefore, the similarity value between the query document R and the original document S can be calculated as follows:

sim(S, R) = \sum_{i=1}^{x} sim(S_s, S_{R_i})

where sim(S_s, S_{R_i}) is the largest similarity value between the sentence R_i of R (denoted S_{R_i}) and a sentence of the original document S (denoted S_s), and x is the number of pairs (S_s, S_{R_i}) such that |Comm(S_s, S_{R_i})| > |S_s|/2.

In the case of exact copies or synonym substitutions, PPCHECKER gives better results than the other tested systems [8]. However, one of its disadvantages is the processing time, since all n × m pairs of sentences must be checked.
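A minimal sketch of these measures follows (again an illustration, not the original PPChecker code). Sentences are represented as sets of words, and the synonym lookup `syn` is a placeholder for a thesaurus-based function; only the best-matching pairs that satisfy the condition on Comm contribute to the document-level sum, following the definition of x above.

```python
import math

def sentence_similarity(s_o, s_q, syn=lambda w: set(), alpha=1.0):
    """PPChecker-style similarity between an original sentence s_o and a
    query sentence s_q, both given as sets of words."""
    comm = s_o & s_q                          # Comm(S_o, S_q)
    diff_oq, diff_qo = s_o - s_q, s_q - s_o   # Diff(S_o, S_q), Diff(S_q, S_o)
    syn_word = {w for w in diff_qo if syn(w) & s_o}      # SynWord(S_o, S_q)
    word_overlap = len(comm) + alpha * len(syn_word)     # WordOverlap(S_o, S_q)
    size_overlap = len(diff_oq) + len(diff_qo)           # SizeOverlap(S_o, S_q)
    if word_overlap == 0:
        return 0.0
    return len(s_o) / (math.exp(len(s_o) / word_overlap - 1) + size_overlap)

def document_similarity(orig_sents, query_sents, syn=lambda w: set()):
    """Sum, over query sentences, of the best matching sentence similarity."""
    total = 0.0
    for s_r in query_sents:
        best, best_o = 0.0, None
        for s_o in orig_sents:
            v = sentence_similarity(s_o, s_r, syn)
            if v > best:
                best, best_o = v, s_o
        # Count the pair only if more than half of the original sentence is shared.
        if best_o is not None and len(best_o & s_r) > len(best_o) / 2:
            total += best
    return total
```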

3. PROPOSED METHOD
We found that PPCHECKER works effectively to detect both partially and completely copied sentences. This is very useful, since we need not only to check whether a document is identical to another, but also to see whether some parts of a document are copied from other documents. The question, however, is how to improve the processing time, since all pairs of sentences are taken into consideration when comparing documents. Our solution is to quickly identify the paragraphs that have a high possibility of being copied; the copy degrees of these suspicious paragraphs are then recalculated carefully and finally aggregated to define the copy degree at document level. This is a natural solution: in practice, people often copy paragraphs rather than isolated sentences. Moreover, plagiarism at paragraph level is more serious than at sentence level, because a paragraph generally expresses one or more complete ideas. We therefore use the SCAM algorithm to find similar paragraphs, and then the PPChecker algorithm to determine the degree of plagiarism at paragraph level and at document level (see Figure 1).

Let R be the query document and S the original document. Our strategy is as follows.

Step 1: Splitting the documents into paragraphs.

R = \{p_{R_1}, p_{R_2}, \ldots, p_{R_k}, \ldots, p_{R_n}\}
S = \{p_{S_1}, p_{S_2}, \ldots, p_{S_l}, \ldots, p_{S_m}\}

Step 2: Applying SCAM on the n × m pairs of paragraphs. The similarity degree sim(p_{R_i}, p_{S_j}) of each pair is calculated as described in Section 2.1. For each paragraph p_{R_i} of R, we find the most similar paragraph of S, matched(p_{R_i}), and keep only the pairs whose similarity degree is greater than a threshold. After this step, only k (k ≤ n) pairs are retained for checking plagiarism:

G = \{P_{RS_1}, P_{RS_2}, \ldots, P_{RS_i}, \ldots, P_{RS_k}\}
P_{RS_i} = \{p_{R_i}, matched(p_{R_i})\}
matched(p_{R_i}) = p_{S_l} \text{ such that } sim(p_{R_i}, p_{S_l}) = \max_j sim(p_{R_i}, p_{S_j}) \text{ and } sim(p_{R_i}, p_{S_l}) > threshold

Step 3: Computing the similarity degree at paragraph level. The PPCHECKER algorithm is applied to each paragraph pair of G; here, a paragraph plays the role of a document as described in Section 2.2. The similarity value sim_i of the pair P_{RS_i} is computed as in Section 2.2 with one small change: two sentences are matched (copied) if

|Comm(S_S, S_{R_i})| > \frac{|S_S| + |S_{R_i}|}{4}

Let n_i be the number of copied sentences for P_{RS_i}.

Step 4: Computing the similarity degree and the copy rate at document level. We define the similarity between the two documents R and S, and the copy rate of R with respect to S, as follows:

sim(R, S) = \sum_{i=1}^{k} sim_i
rate(R, S) = \frac{\sum_{i=1}^{k} n_i}{|R|}

The sim value reflects the quantity of copied parts, while the rate value is the ratio between the number of copied sentences and the total number of sentences in the query document; in other words, the rate indicates what percentage of the sentences of the query document R is copied from the original document S. In practice, natural languages contain stop words that are not useful for copy detection, so we remove stop words during document processing, which significantly enhances the effectiveness of copy detection.

We have described above how to compute the similarity and the copy rate between two documents. When checking multiple documents (one query document against several original documents, or several queries against several originals), the sim value is used for ranking. In this case, we observe a significant reduction of the processing time; specific results are presented in the next section.
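The following end-to-end sketch illustrates Steps 1 to 4, reusing the scam_similarity and document_similarity helpers sketched in Section 2. It assumes documents are already split into paragraphs of word-set sentences, and, as a simplification, applies the modified matching condition of Step 3 only to the copied-sentence count:

```python
def detect_copy(query_paras, orig_paras, syn=lambda w: set(),
                epsilon=2.5, threshold=0.5):
    """Paragraph-level copy detection: SCAM filtering, then PPChecker scoring.
    Each paragraph is a list of sentences; each sentence is a set of words.
    Returns (sim, rate) at document level."""
    sim_total, copied, total_sents = 0.0, 0, 0

    for p_r in query_paras:
        total_sents += len(p_r)
        r_words = [w for sent in p_r for w in sent]

        # Step 2: find the most similar original paragraph with SCAM and
        # keep the pair only if its similarity exceeds the threshold.
        best_score, best_para = 0.0, None
        for p_s in orig_paras:
            s_words = [w for sent in p_s for w in sent]
            score = scam_similarity(r_words, s_words, epsilon)
            if score > best_score:
                best_score, best_para = score, p_s
        if best_para is None or best_score <= threshold:
            continue

        # Step 3: PPChecker at paragraph level, plus the copied-sentence count n_i.
        sim_total += document_similarity(best_para, p_r, syn)
        for s_r in p_r:
            if any(len(s_o & s_r) > (len(s_o) + len(s_r)) / 4 for s_o in best_para):
                copied += 1

    # Step 4: aggregate at document level.
    rate = copied / total_sents if total_sents else 0.0
    return sim_total, rate
```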

4. EXPERIMENTAL RESULTS

4.1 Data Sets
For our experiments, we use both a synthetic document set and a real document set, both in Vietnamese.

4.1.1 Synthetic Documents
We generated the synthetic documents as follows. We use one student report as the original document; it is an assignment report from the “Multimedia Database” course, denoted document A. Five documents having the same length as the original were generated by copying 100%, 70%, 50%, 30% and 10% of its content (counted by words); the rest of each document was randomly selected from another document, called B (see the sketch after this list). These five documents form the set of query documents compared with the original document A. We consider two data sets corresponding to two cases:
- B is on a different topic from the original document. We denote by D1 the set of these five documents. In this case, document A contains 16105 words, and the proportions of copied sentences are respectively 100%, 76%, 62%, 39% and 11% for the documents of D1.
- B is on the same topic as the original document. The set of documents generated in this case is denoted D2. Here the original document contains 3000 words, and the proportions of copied sentences are respectively 100%, 62%, 50%, 28% and 14% for the documents of D2.
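As an illustration of how such query documents can be assembled, here is a hedged sketch; it works at sentence level, whereas the actual copied portion above is counted by words:

```python
import random

def make_synthetic(doc_a_sents, doc_b_sents, copy_ratio, seed=0):
    """Build a query document of the same length as A: a copy_ratio fraction
    is taken from A, the remainder is filled with sentences drawn from B."""
    rng = random.Random(seed)
    n_total = len(doc_a_sents)
    n_copied = round(copy_ratio * n_total)
    filler = [rng.choice(doc_b_sents) for _ in range(n_total - n_copied)]
    return doc_a_sents[:n_copied] + filler
```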

4.1.2 Real Documents
We use 29 student reports from the “Multimedia Database” course of our university, covering 7 different subjects. A subject may be treated by several groups, and each group submits one report. The report name encodes the topic number (first two digits) and the group number (last two digits) (see Figure 2). Among these documents, some reports have a high degree of similarity (reports on the same topic) without being copied from each other, while some others are partially copied from other reports. Each report is compared with all the others to find copied documents, so we performed 29 × 29 = 841 tests.

Figure 2. Names of the real documents.

4.1.3 Preprocessing Data

Figure 1. Schema of the proposed approach.

In the preprocessing step, we proceed as follows:
- We split each document into paragraphs and then into sentences.
- We use vnTokenizer [5] to segment each sentence into words.
- We remove all special tokens (tokens not starting with a letter or a digit) and stop words.
- All characters are converted to lowercase.
In our experiments, we set the three parameters as follows:
- ε is set to 2.5, the best value in practice;
- α is set to 1, i.e., the same weight for every word;
- the threshold is set to 0.5.
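A minimal preprocessing sketch follows, under simple assumptions (paragraphs separated by blank lines, sentences ending with punctuation); `segment` stands for a word-segmentation function, such as a wrapper around vnTokenizer, and the stop-word list is supplied by the caller:

```python
import re

def preprocess(text, segment, stop_words):
    """Turn a document into paragraphs of sentences of lowercased content words."""
    keeps = re.compile(r"^\w")                # keep tokens starting with a letter or digit
    paragraphs = []
    for para in re.split(r"\n\s*\n", text):   # blank lines separate paragraphs
        para = para.strip()
        if not para:
            continue
        sentences = []
        for sent in re.split(r"(?<=[.!?])\s+", para):
            words = {w.lower() for w in segment(sent)
                     if keeps.match(w) and w.lower() not in stop_words}
            if words:
                sentences.append(words)
        if sentences:
            paragraphs.append(sentences)
    return paragraphs
```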

4.2 Experiments on Synthetic Documents
We conduct experiments on the two sets D1 and D2 with three algorithms: the SCAM algorithm, the PPCHECKER algorithm, and our proposed approach. Each document of D1 or D2 is compared with the original document; Table 1 and Table 2 report the obtained copy rates. Since SCAM relies on the words shared by two documents to compute their copy degree, while PPChecker and our approach use the proportion of copied sentences, Tables 1 and 2 give two types of information about the query documents: % copied words and % copied sentences.

Table 1. Copy rates on synthetic documents: set D1

% copied words | % copied sentences | SCAM | PPChecker | Proposed approach
100% | 100% | 100% | 100% | 100%
70% | 76% | 90% | 76% | 76%
50% | 62% | 55% | 62% | 62%
30% | 39% | 14% | 39% | 39%
10% | 11% | 5% | 12% | 11%

Table 2. Copy rates on synthetic documents: set D2

% copied words | % copied sentences | SCAM | PPChecker | Proposed approach
100% | 100% | 100% | 100% | 100%
70% | 62% | 89% | 63% | 63%
50% | 50% | 89% | 51% | 49%
30% | 28% | 37% | 30% | 28%
10% | 14% | 25% | 17% | 16%

In the case of a 100% copy, all methods work perfectly. In the other cases, our approach and PPCHECKER give better results than SCAM. On data set D1, the copied and non-copied parts of each query document are really different, so a good result should be very close to the ground truth; our method and PPChecker clearly give the best results, whereas SCAM does not provide correct results, especially for minor copies (see Table 1). On data set D2, the copied and non-copied parts belong to the same specialized topic, so the non-copied part shares many common or synonymous words with the original document. This is why SCAM reports a high degree of similarity between the query documents and the original document although the copied proportion is not so large (see Table 2), which causes false positives. The copy rates obtained by PPChecker and by our method are very close to the ground truth. Moreover, our method detects minor copies (10%, 30%) more correctly than SCAM and PPChecker.

4.3 Experiments on Real Documents
We conduct experiments on the 841 pairs of documents with our approach. The results fall into three groups (see Tables 3, 4 and 5):
- Group 1: a document is checked against itself; the copy degree is, as expected, very high (100%, see Table 3).
- Group 2: the two compared documents are not on the same topic; the copy degrees are very low, which is also logical. Some examples are shown in Table 4.
- Group 3: the documents are on the same topic (first five rows of Table 5). Table 5 shows the ten documents with the highest possibility of being copied by the query document. We note that all documents on the same topic as the query document are at the top. Some of them have a very high copy rate (10-29 and 10-26, for example), which means that the degree of plagiarism is significant. For the other documents on the same topic, the copy degree is not as high, but it is still larger than for documents on other topics, which is naturally true.

Table 3. Group 1 of experiments on real documents

Query document | Original document | Copy rate
02-11 | 02-11 | 100%
02-23 | 02-23 | 100%
05-05 | 05-05 | 100%
05-10 | 05-10 | 100%

Table 4. Group 2 of experiments on real documents

Query document | Original document | Copy rate
10-32 | 06-07 | 3%
10-32 | 07-18 | 3%
09-17 | 08-14 | 3%
09-17 | 08-33 | 3%
10-08 | 08-33 | 1%

Table 5. Group 3 of experiments on real documents

Query document | Original document | Copy rate
10-08 | 10-08 | 100%
10-08 | 10-29 | 59%
10-08 | 10-26 | 26%
10-08 | 10-35 | 21%
10-08 | 10-32 | 17%
10-08 | 09-17 | 2%
10-08 | 08-27 | 2%
10-08 | 06-24 | 1%
10-08 | 08-30 | 1%
10-08 | 08-33 | 1%

In order to evaluate the performance of our approach in terms of time, we measure the execution times of SCAM, PPChecker and our approach for checking the copy degree of all 29 × 29 pairs of documents (not including the data preprocessing time). The experiments are run on a personal computer under Windows 8 64-bit, with 4 GB of RAM and a Core i3 M370 2.4 GHz processor. The results are shown in Table 6: our approach is slower than SCAM but much faster than PPChecker.

Table 6. PPChecker vs. our approach in terms of time

Algorithm | Execution time (s)
Our approach | 12,569
PPCHECKER | 205,417
SCAM | 0.201

5. CONCLUSIONS
Plagiarism is a serious problem, especially in education, where it is a frequent mistake of students. It can range from minor copy-and-paste to more serious cases such as major copying or even the duplication of an entire document. In order to prevent plagiarism as early as possible, we propose an effective method for copy detection. It can detect not only major copies but also minor copies in long documents. Since document collections can be huge, we also pay attention to processing time. The key point is that we process documents at paragraph level before sentence level. Our experiments, carried out in comparison with SCAM and PPChecker, demonstrate the efficiency of our approach in terms of both accuracy and time: the proposed method produces fewer false positives than SCAM, is faster than PPChecker, and correctly detects minor copies in a document.

For testing, we implemented a simple system. In terms of processing time, this system is not optimal because we used text files to store the preprocessed documents; moreover, it currently works only on offline documents. Improving it is necessary to obtain a useful and complete system.

The proposed method processes documents at both paragraph level and sentence level, and the copy degree of a document is accumulated from the copy degrees at paragraph level. However, the spatial relationship between paragraphs is not taken into consideration; therefore, we cannot distinguish between copying consecutive paragraphs (which is more serious) and copying non-consecutive ones. We are considering a hierarchical method that treats a document at different levels, which could further improve the proposed method in terms of time and accuracy.

6. ACKNOWLEDGMENTS
Our thanks go to LE Hong-Phuong for his software vnTokenizer [5], which we used to segment Vietnamese texts into units.

7. REFERENCES
[1] Bloomfield, L. 2014. The Plagiarism Resource Site. http://plagiarism.bloomfieldmedia.com (last updated 2014).
[2] Brin, S., Davis, J., and Garcia-Molina, H. 1995. Copy Detection Mechanisms for Digital Documents. In Proceedings of the ACM SIGMOD Annual Conference, San Jose, CA (May 1995).
[3] Bin-Habtoor, A. S., and Zaher, M. A. 2012. A Survey on Plagiarism Detection Systems. International Journal of Computer Theory and Engineering, Vol. 4, No. 2, April 2012.
[4] Forner, P., Müller, H., Paredes, R., Rosso, P., and Stein, B. (eds). 2013. Information Access Evaluation Meets Multilinguality, Multimodality, and Visualization. 4th International Conference of the CLEF Initiative (CLEF 2013), September 2013.
[5] Le-Hong, P., Nguyen, T. M. H., Roussanaly, A., and Ho, T. V. 2008. A hybrid approach to word segmentation of Vietnamese texts. In Proceedings of the 2nd International Conference on Language and Automata Theory and Applications, pp. 240-249.
[6] Sanchez-Perez, M., Sidorov, G., and Gelbukh, A. 2014. The Winning Approach to Text Alignment for Text Reuse Detection at PAN 2014. In: Cappellato, L., Ferro, N., Halvey, M., and Kraaij, W. (eds.), Notebook for PAN at CLEF 2014, CLEF 2014 Working Notes, Sheffield, UK, September 15-18, 2014. CEUR Workshop Proceedings, Vol. 1180, ISSN 1613-0073, CEUR-WS.org, pp. 1004-1011.
[7] Monostori, K., Zaslavsky, A., and Schmidt, H. 2000. Document Overlap Detection System for Distributed Digital Libraries. In Proceedings of the Fifth ACM Conference on Digital Libraries, pp. 226-227.
[8] Kang, N., Gelbukh, A., and Han, S. Y. 2006. PPChecker: Plagiarism Pattern Checker in Document Copy Detection. In Proceedings of the 9th International Conference on Text, Speech and Dialogue, pp. 661-667.
[9] Shivakumar, N., and Garcia-Molina, H. 1995. SCAM: A Copy Detection Mechanism for Digital Documents. In Proceedings of the International Conference on Theory and Practice of Digital Libraries (DL 1995).
[10] Shivakumar, N., and Garcia-Molina, H. 1996. Building a Scalable and Accurate Copy Detection Mechanism. In Proceedings of the 1st ACM International Conference on Digital Libraries (DL '96), pp. 160-168.
[11] Si, A., Leong, H., and Lau, R. 1997. CHECK: A Document Plagiarism Detection System. In Proceedings of the ACM Symposium on Applied Computing, pp. 70-77 (Feb. 1997).
[12] Niezgoda, S., and Way, T. P. 2006. SNITCH: A Software Tool for Detecting Cut and Paste Plagiarism. In Proceedings of the 37th SIGCSE Technical Symposium on Computer Science Education (SIGCSE '06), ACM, New York, NY, USA, pp. 51-55. DOI: http://dx.doi.org/10.1145/1121341.1121359
[13] EVE2: http://www.canexus.com/
[14] Plagiarism-Finder: http://www.m4-software.com/enindex.htm (last updated 2004).
[15] Turnitin: http://turnitin.com/
[16] SafeAssign: http://www.safeassign.com/
