Available online at www.sciencedirect.com

ScienceDirect Procedia Computer Science 103 (2017) 421 – 425

XIIth International Symposium «Intelligent Systems», INTELS’16, 5-7 October 2016, Moscow, Russia

Detecting near-duplicates in Russian documents through using fingerprint algorithm Simhash
N. Rezaeian, G.M. Novikova∗
RUDN University, 6 Miklukho-Maklaya str., Moscow 117198, Russia

Abstract
Plagiarism is one of the major problems of the age of communication. For many languages, such as English, the issue is taken seriously and many powerful tools have been developed to prevent it. This article aims to detect plagiarism in Russian texts using a fingerprint algorithm. Fingerprint algorithms detect plagiarism quickly because they reduce each document to a compact set of features and compare only these features between the original documents and the suspect documents. To increase the power and accuracy of plagiarism detection, general words must be removed and words stemmed, together with pre-processing steps such as word separation, number replacement, and normalization. In this article, four Simhash algorithms have been used. The algorithms were tested on 800 articles on scientific topics and gave satisfactory results.
© 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the scientific committee of the XIIth International Symposium “Intelligent Systems”.
Keywords: plagiarism; fingerprint algorithm; Simhash.

1. Introduction
As data on the network has become easier to access, plagiarism has become a serious problem. It has reduced the confidence of authors and publishers in the internet. Plagiarism in texts falls into two types: source-code plagiarism and free-text plagiarism. Because programming languages have restricted syntax and a fixed set of keywords, source-code plagiarism is easier to investigate than free-text plagiarism. Text plagiarism takes several forms, which Maurer categorizes as follows [1]:
• Copy and paste, or word-for-word plagiarism, in which the copied content comes from one or several sources. The copied content may be changed slightly.

∗ Corresponding author. E-mail address: [email protected]

1877-0509 © 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the scientific committee of the XIIth International Symposium “Intelligent Systems” doi:10.1016/j.procs.2017.01.006


N. Rezaeian and G.M. Novikova / Procedia Computer Science 103 (2017) 421 – 425

• Changing the grammar, using synonyms, reordering lines of the original text, or expressing sentences differently
• Using an exact phrase without putting it in quotation marks
• Adding false sources or omitting sources
• Translating a text without citing the original

In line with this categorization, plagiarism detection tools are divided into three fundamental types [1,2]:
• Recognizing a writer's style and detecting any inconsistent change in that style
• Comparing several documents and finding their common and similar parts; this is the most widely used method
• Taking a document as input and finding copies of it on web pages

The following figure shows this categorization. This article focuses on the second group. The proposed method compares one document with a set of documents on the basis of syntax. Semantic methods, by contrast, can handle changes to sentences, the use of synonyms in place of the original words, and rephrasing in general. They are therefore smarter at recognizing plagiarism, but they require a lexical database of the language, such as WordNet for English. In the following, we explain the main stages required by fingerprint algorithms.

2. Preprocessing
Preprocessing applies a number of operations to the text that improve the results of similarity detection algorithms. These operations increase accuracy and reduce processing time.

2.1. Tokenization
This stage must be able to recognize sentences in the input text using the sentence-delimiting characters of the Russian language [3]. To build such a tool, all symbols, characters, and especially syntactic rules that end a sentence must first be identified. Since the sentence is the basic unit in many language-processing tasks, the accuracy of this stage is very important.

2.2. Number replacement
To replace and eliminate numbers, a method known as token-making [4] is employed. This method is well suited to recognizing similarities, especially in computer programs. The token-making algorithm replaces the elements of a program with uniform tokens. For example, every identifier is replaced by the token <ID> and every numerical value by <value>. A statement of the form a = b + 4 is thus replaced by the string <ID> = <ID> + <value>. Therefore, renaming a variable does not change the tokenized form.

2.3. Stop-word removal
In this stage, general words (stop words) are removed [3], which makes the investigation faster. General words are words whose contribution to the meaning of a sentence is trivial. Some assume that general words ("O","C","CO","B",...) are the same as high-frequency words, but the high-frequency words do not include all general words.
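The three preprocessing steps above can be sketched in a few lines of Python. This is a minimal illustration and not the system used in the paper: the function names, the regular expressions, and the tiny stop-word list are all assumptions made for the example, and the sentence splitter ignores abbreviations such as «т.е.», which a real tokenizer would have to handle.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption); a real system
# would use a full list of Russian function words.
STOP_WORDS = {"о", "с", "со", "в", "и", "на"}

# Whitespace that follows sentence-final punctuation marks a sentence break.
SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")
# Integers and decimals written with "." or "," as the separator.
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")
# A token is either the <value> placeholder or a run of word characters.
WORD = re.compile(r"<value>|\w+")


def split_sentences(text):
    """Split text into sentences at whitespace after ., !, ? or an ellipsis."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def preprocess(sentence):
    """Lower-case, replace numbers by <value>, tokenize, drop stop words."""
    sentence = NUMBER.sub("<value>", sentence.lower())
    tokens = WORD.findall(sentence)
    return [t for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    text = "Цена выросла на 15 процентов. В 2016 году она упала!"
    for sentence in split_sentences(text):
        print(preprocess(sentence))
```

Replacing every number by the same <value> token means that two sentences differing only in a figure yield the same token sequence, which is the effect described for token-making in Section 2.2.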


2.4. Stemming
Stemming [2,3] finds the root of a word by removing its prefixes and suffixes, so that words with the same root are reduced to an identical form. Its most common purpose is to normalize nouns and verbs written in different grammatical forms, which can increase the effectiveness of the system.

3. Fingerprint

3.1. The important points in the fingerprint system
In designing any fingerprint system [5,6,7], four points must be considered.
Fingerprint generation: the function that produces the fingerprints is very important. It must produce different values for two different inputs and similar values for two similar inputs.
Fingerprint seed (granularity): the amount of input given to the generating function. The appropriate seed depends on the kind of similarity being investigated. For example, if the goal is to find similar sentences or paragraphs, an appropriate seed is a sentence or a paragraph.
Fingerprint clarity (resolution): the number of fingerprints that represent a document. Depending on the storage space, this number can be fixed or variable. For the best accuracy, all generated fingerprints should be used, but in most cases a subset is chosen in order to speed up the investigation.
Fingerprint selection: the strategy that determines which fingerprints are chosen. This strategy depends on the clarity: if the clarity is fixed (for example, n), the strategy must select n fingerprints. Four types of strategy are used for this purpose: full fingerprinting, positional strategies, frequency-based strategies, and structure-based strategies. Full fingerprinting, the simplest strategy, selects every fingerprint of the seed length, without any conditions.

3.2. Fingerprint generation
Generating the fingerprints of a document [2,5,7,8,9] and comparing them with those of another document proceeds as follows (see Figure 1):
• Divide each document into a set of contiguous chunks of tokens
• Apply a hash function to each chunk to create its fingerprint
• Select some of the fingerprints according to the selection strategy
• Compare the fingerprints of the two documents: if the number of fingerprints they share reaches the threshold, the system considers the documents to be identical

Fig. 1. Plagiarism detection with document source comparison
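The four steps of fingerprint generation and comparison can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the function names are hypothetical, MD5 from Python's hashlib stands in for the hash function, and the step parameter is a simple positional selection strategy (keep every step-th fingerprint) chosen only for the example. With step=1 the sketch performs full fingerprinting.

```python
import hashlib


def chunks(tokens, size):
    """Step 1: overlapping runs of `size` consecutive tokens."""
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def fingerprint(chunk):
    """Step 2: hash one chunk to an integer (MD5 is an assumption here)."""
    digest = hashlib.md5(" ".join(chunk).encode("utf-8")).hexdigest()
    return int(digest, 16)


def document_fingerprints(tokens, size=3, step=1):
    """Step 3: keep every `step`-th fingerprint; step=1 keeps all of them."""
    hashed = [fingerprint(c) for c in chunks(tokens, size)]
    return set(hashed[::step])


def is_copy(suspect, source, threshold):
    """Step 4: flag the pair when they share at least `threshold` fingerprints."""
    return len(suspect & source) >= threshold


if __name__ == "__main__":
    source = document_fingerprints("a b c d e".split())
    suspect = document_fingerprints("a b c d x".split())
    print(len(source & suspect), is_copy(suspect, source, threshold=2))
```

Raising step shrinks the stored fingerprint set and speeds up comparison, at the cost of missing overlaps that fall between the selected positions. This is the trade-off between accuracy and speed described in Section 3.1.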


4. Detecting plagiarism with the SimHash method
The stages are as follows.

4.1. Creating n-grams and applying the hash function
An n-gram [3,10] is a sequence of n units, where the units are characters or words. Consecutive n-grams overlap by n − 1 units. For example, with n = 3 and words as units, the n-grams of a sentence are all of its runs of three consecutive words. As in all fingerprint algorithms, a hash function must then be applied to each n-gram. To increase accuracy, the MD5 function was used. MD5 maps each n-gram to a 128-bit value, written as 32 hexadecimal digits.

4.2. Simhash method
Simhash [11], a hashing technique introduced by Charikar in 2002, maps several fingerprints to a single fingerprint. Its significant advantage over an ordinary hash function is that Simhash produces similar outputs for two similar texts, whereas an ordinary hash produces different outputs even for two very similar texts. After extracting the features and applying the hash function to them, we obtain a set of 32-digit hexadecimal numbers, which we convert to binary. This gives a set of 128-bit numbers. Next, we consider a 128-dimensional vector V with initial value zero. For each feature:

$$h(w) = \operatorname{sign}(v^{T} w) = \begin{cases} +1 & \text{if } v^{T} w \ge 0 \\ -1 & \text{otherwise.} \end{cases}$$

In the end, the whole set of features is reduced to a single 128-dimensional vector, from which the SimHash value is determined: if V[i] > 0, then Simhash[i] is 1; otherwise it is 0. To measure the similarity between two texts, the Hamming distance between their Simhash values is calculated. The smaller the distance, the more similar the texts.

5. Measuring effectiveness
To evaluate the effectiveness of the algorithms [12], the following criteria are calculated:

$$\text{Precision} = \frac{tp}{tp + fp} \quad \text{and} \quad \text{Recall} = \frac{tp}{tp + fn}$$

• True positive (TP): documents that are copied and are recognized as copies
• False positive (FP): documents that are not copied but are recognized as copies
• False negative (FN): documents that are copied but are recognized as originals

A more comprehensive criterion is the F-measure, the harmonic mean of precision and recall, calculated as follows:

$$\text{F-measure} = 2 \cdot \frac{\text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}}$$
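The SimHash procedure of Section 4 and the effectiveness criteria of Section 5 can be sketched together in Python. The sketch rests on several assumptions that go beyond the text: all function names are hypothetical; the per-bit update (add 1 to V[i] when bit i of a feature's hash is 1, subtract 1 otherwise) is the standard Simhash construction and is assumed to be what the paper applies; and converting the Hamming distance to a similarity as 1 − distance/128 is one plausible choice, not one stated in the paper.

```python
import hashlib

BITS = 128  # an MD5 digest is 128 bits long


def word_ngrams(tokens, n=3):
    """Overlapping word n-grams; consecutive n-grams share n - 1 words."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def md5_int(feature):
    """MD5 digest of a feature, read as a 128-bit integer."""
    return int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16)


def simhash(features):
    """Fold the hashes of all features into a single 128-bit fingerprint."""
    v = [0] * BITS  # the vector V, initially zero
    for feature in features:
        h = md5_int(feature)
        for i in range(BITS):
            v[i] += 1 if (h >> i) & 1 else -1
    result = 0
    for i in range(BITS):
        if v[i] > 0:  # Simhash[i] is 1 when V[i] > 0, otherwise 0
            result |= 1 << i
    return result


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a, b):
    """Share of matching bits (assumed mapping from distance to similarity)."""
    return 1 - hamming(a, b) / BITS


def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure; assumes no denominator is zero."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure


if __name__ == "__main__":
    a = simhash(word_ngrams("the quick brown fox jumps over the lazy dog".split()))
    b = simhash(word_ngrams("the quick brown fox leaps over the lazy dog".split()))
    print(hamming(a, b), similarity(a, b))
    # Hypothetical counts for illustration only, not results from the paper.
    print(precision_recall_f(tp=3, fp=1, fn=1))
```

Because each document is reduced to one 128-bit value, comparing two documents costs a single XOR and bit count, regardless of document length.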

6. Experiments
All stages of plagiarism detection were carried out on 800 reference articles on topics including art, history and physiology. To assess accuracy, 150 suspect articles were given to the system, of which 75 were copied. The copies were produced manually by changing sentences of the articles, adding some material, and removing some parts. To decide whether an article had been copied, a threshold of 50% was used: if the similarity between two articles exceeded 50%, the system regarded them as similar, and otherwise as dissimilar. Because it searches a more limited space, the SimHash method is faster than similar methods such as Winnowing and RareChunk. The more documents the database contains, the more noticeable the difference in speed becomes.

References
1. Maurer, H., Kappe, F., and Zaka, B. Plagiarism - a survey. Journal of Universal Computer Sciences, 12(8):1050.
2. Manber, U. Finding similar files in a large file system. In Winter USENIX Technical Conference, San Francisco, CA, pages 1–10, 1994.
3. Rezaeian, N. and Novikova, G.M. Morphological and syntactic analysis of Persian text with conditional random fields. International research journal, 2016.
4. Ceska, Z. and Fox, C. The Influence of Text Preprocessing on Plagiarism Detection.
5. Hoad, T.C. and Zobel, J. Methods for identifying versioned and plagiarised documents. Journal of the American Society for Information Science and Technology, 54(3):203–215, 2003.
6. Monostori, K., Finkel, R.A., Zaslavsky, A., Hodasz, G., and Pataki, M. Comparison of overlap detection techniques. In International Conference on Computational Science, 2002.
7. Liu, Y., Zhang, H., Chen, T., and Teng, W. Extending web search for online plagiarism detection. IEEE, 2007.
8. Heintze, N. Scalable document fingerprinting (extended abstract). In Proc. USENIX Workshop on Electronic Commerce, 1996.
9. Shivakumar, N. and Garcia-Molina, H. Finding near-replicas of documents on the web. In Proc. Workshop on Web Databases, 1998.
10. Yew-Huey, L., Dantzig, P., Sachs, M., Corey, J., Hinnebusch, M., Darnashek, M., and Cohen, J. Visualizing document classification: A search aid for the digital library. IBM T.J. Watson Research Center.
11. Singh Manku, G., Jain, A., and Das Sarma, A. Detecting near-duplicates for web crawling. Data mining, 2007.
12. Witten, I.H., Moffat, A., and Bell, T.C. Managing Gigabytes: Compressing and indexing documents and images. Morgan Kaufmann, 2nd edition, 1999.
