Automatic Plagiarism Detection System for Specialized Corpora

Filip Cristian Buruiana, Adrian Scoica, Traian Rebedea, Razvan Rughinis
Faculty of Automatic Control and Computers
University Politehnica of Bucharest
Bucharest, Romania
{filip.buruiana, adrian.scoica}@gmail.com, {traian.rebedea, razvan.rughinis}@cs.pub.ro

Abstract—Plagiarism in academic writing is considered one of the worst breaches of professional conduct in Western society, and its automatic detection has become an important use case for Natural Language Processing research. We created a new system, AuthentiCop, aimed at detecting instances of plagiarism in Computer Science academic writing. This paper focuses on the design and implementation of such a system. We analyze the solutions proposed at the PAN 2011 conference and present a novel approach based on ppjoin.

Keywords—Plagiarism Detection, Information Retrieval, Natural Language Processing, Document Similarity
I. INTRODUCTION
Plagiarism is defined as the unauthorized appropriation of the language or thoughts of another author and the representation of that author's work as one's own, without according proper credit to the original author. To date, automatic plagiarism detection in academic writing remains an open problem in Natural Language Processing, due to the difficulty of dealing with the ambiguity of natural language and the challenge of distinguishing true cases of plagiarism from mere coincidental similarity of wording. Given that plagiarism in academia is considered one of the worst forms of professional misconduct, and that the amount of published material in any given field grows constantly, far exceeding what a person could read in a lifetime, automated aids for flagging possible cases of plagiarism are of considerable practical importance.

This paper discusses a possible approach to building a system that automatically flags possible cases of plagiarism in Bachelor theses submitted by students of the University Politehnica of Bucharest. We describe the overall architecture and functioning of the system and present its main stages in detail. Finally, we carry out a brief performance analysis on the corpus of the 2011 PAN plagiarism detection competition and outline future development goals.
II. MOTIVATION
The need for an automatic plagiarism detection system at the University Politehnica of Bucharest stems from our University's commitment to excellence in the supervision and evaluation of graduation theses. The system described in this paper would provide professors and students with an online platform that helps them make sure that submitted papers are plagiarism-free. The PAN workshop (http://pan.webis.de/) is an annual workshop dedicated to Natural Language Processing topics such as plagiarism and authorship identification, Wikipedia vandalism, and social software misuse. The workshop competition provides an invaluable dataset and benchmark [1] for the algorithms and implementations which attempt to solve the plagiarism detection problem.
III. TASK OVERVIEW AND RELATED WORK
In designing our plagiarism detection system, it was important to set clear performance measurement goals and a rigorous methodology for evaluating the system. The performance criteria set by PAN revolve around four metrics: macro- and micro-averaged recall and precision, granularity, and the overall plagiarism detection score. Micro-averaged precision is intuitively defined as the ratio between the total length of correctly identified plagiarism detections and the total length of all plagiarism detections. Similarly, micro-averaged recall is defined as the ratio between the total length of correctly identified plagiarism detections and the total length of all plagiarism cases. The macro-averaged precision and recall have similar formulae, but do not account for the lengths of individual plagiarized cases. Granularity is a measure that reflects whether a plagiarism case is detected as a whole or in multiple pieces. Lastly, the previous metrics can be combined to give the plagiarism detection score (plagdet score), a metric allowing for an absolute ranking of the various solutions. The formulae for all metrics are given in [1].
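For reference, the micro-averaged metrics and the combined score can be written as follows (a transcription from memory of the definitions in [1], where $S$ is the set of plagiarism cases, $R$ the set of detections, and $s \sqcap r$ the character-level overlap between case $s$ and detection $r$; [1] remains the authoritative source):

\[
\mathrm{prec}_{micro}(S,R) = \frac{\left|\bigcup_{(s,r)\in S\times R}(s\sqcap r)\right|}{\left|\bigcup_{r\in R} r\right|},\qquad
\mathrm{rec}_{micro}(S,R) = \frac{\left|\bigcup_{(s,r)\in S\times R}(s\sqcap r)\right|}{\left|\bigcup_{s\in S} s\right|}
\]
\[
\mathrm{gran}(S,R) = \frac{1}{|S_R|}\sum_{s\in S_R}|R_s|,\qquad
\mathrm{plagdet}(S,R) = \frac{F_1}{\log_2\!\left(1+\mathrm{gran}(S,R)\right)}
\]

where $S_R$ denotes the cases detected by at least one detection, $R_s$ the detections overlapping case $s$, and $F_1$ the harmonic mean of precision and recall.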
A. Related Work
Turnitin [2] is a leading real-world example of an academic plagiarism detector, used by teachers and students to avoid plagiarism and to ensure academic integrity. It is a well-known web service for plagiarism detection in academic writing, with a database of over 220 million archived student papers.
The ANTIPLAG system has been under development since 2008, and since 2010 it has been run as an official project by the Slovak Centre of Scientific and Technical Information, under the auspices of the Ministry of Education of the Slovak Republic [3]. This system ranked first in the PAN 2011 competition, obtaining the best score for all four main parameters. In developing our plagiarism detection system, we have evaluated the best-ranking solutions submitted to the 2011 PAN edition and have built on the approaches published on this occasion, mainly focusing on Fastdocode [4] and Encoplot [5] and their applications.
IV. SYSTEM ARCHITECTURE
Developing a plagiarism detection system raises both technical and algorithmic challenges. On one hand, such a system will most likely have modules written in several programming languages that need to be interconnected; in our case, we use C++, Java and Python. We also need to integrate multiple frameworks, each performing a specific task (e.g., converting documents to raw text or running semantic analysis). On the other hand, the system needs to use a multitude of algorithms, from topic detection to local alignment. There are also size issues: a corpus can contain thousands of documents, each containing tens of thousands of words, so the data structures need to be carefully designed in order to be efficient.

Our general architecture is presented in Fig. 1. The evaluated document is uploaded using a web interface. Afterwards it is transferred to a converter, which obtains the raw text starting from a PDF, doc, docx, or HTML document. Because of the multitude of supported formats, we use Apache Tika (http://tika.apache.org/), a project of the Apache Software Foundation which extracts metadata and structured text content from various documents using existing parser libraries. At this step we also perform all the necessary preprocessing, i.e. stemming and stop-word removal.

The uploaded document is compared both against a corpus and against the Web. The corpus contains all Computer Science articles from the English Wikipedia and a collection of theses from previous years. Because there are thousands of documents to compare against, a candidate selection phase is introduced, which returns the documents that the current text is most likely to plagiarize from. To compare the document against the Web, we simulate human behavior: we first determine a set of related topics, and then fetch the first documents returned by search engines when queried for these topics. The retrieved documents may be in multiple formats (HTML, PDF, doc, etc.), so the converter is used again on the results in order to obtain the raw text. We query both google.com (for web pages) and scholar.google.com (for academic papers and journals). Since Google does not provide a free API, the AuthentiCop system creates its own HTTP requests and retrieves the links automatically from the DOM, using XPath.
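As an illustration of the preprocessing step only (not the production pipeline, whose format conversion is handled by Apache Tika), a minimal Python sketch of stop-word removal and stemming using NLTK might look as follows; the tokenization rule is our assumption:

import re
from nltk.corpus import stopwords      # requires: nltk.download("stopwords")
from nltk.stem import PorterStemmer

def preprocess(text):
    # Lowercase and keep alphanumeric runs as tokens (assumed tokenization).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    sw = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    # Drop stop words, then stem what remains.
    return [stemmer.stem(t) for t in tokens if t not in sw]

print(preprocess("The heaps are stored as binary trees."))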
Fig. 1. General architecture of AuthentiCop
The candidate documents from both the corpus and the Web are passed to a detailed analysis phase that aims to find the exact plagiarized passages. There is also a post-processing phase, where the results of the detailed analysis are checked in even more detail, using semantic analysis and local alignment techniques. The results are displayed in a Web interface: the evaluated document and the source documents, if any, are displayed side by side, and the plagiarized passages are marked with distinct colors.
V. CANDIDATE SELECTION
We cannot compare the document under evaluation in detail with every document in the corpus, because the collection can contain thousands of texts. The candidate selection phase retrieves only the documents that the current text is most likely to plagiarize from.

A. Encoplot
Encoplot [5] is a plagiarism detection system which was derived from the better-known Dotplot technique [6], used in bioinformatics for inspecting similarity between the structures of molecule chains. Variants of Dotplot have previously been proposed for computing self-similarity between text documents, and also for source code [7]. Both algorithms work by building and analysing a two-dimensional similarity chart whose dimensions equal the respective dimensions of the compared structures (i.e., string lengths). The chart dot at coordinates (i, j) is assigned the colour black iff the atoms at position i in the first string and position j in the second string match; otherwise, the dot is assigned the colour white. The resulting plot thus reveals patterns of similarity between the two strings of atoms, which show up as line segments on the chart.
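A minimal sketch of the underlying chart construction in Python (our illustration, not code from [6]); the full Cartesian product makes its quadratic cost explicit:

def dotplot(a, b):
    """Return the set of black dots (i, j) for atom sequences a and b.

    Every matching pair is recorded, so the construction inspects
    all len(a) * len(b) coordinate pairs -- the O(N*M) cost discussed below.
    """
    return {(i, j)
            for i, x in enumerate(a)
            for j, y in enumerate(b)
            if x == y}

# Example: matches between two short token sequences.
print(sorted(dotplot("the cat sat".split(), "the cat ran".split())))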
1) Differences from Dotplot
The main problem with Dotplot lies in its intractability. The algorithm assembles a data structure with a size on the order of O(N*M), where N and M are the lengths of the atom strings of the two documents. Further processing of this data structure in order to detect patterns takes a prohibitive amount of computational effort. Thus, Encoplot introduces a simplification by allowing each atom in either string to be matched exactly once. This is accomplished in the algorithm by sorting and then merging the N-grams of atoms in the two documents, rather than performing a full Cartesian product. Because of the way matching pairs are selected, the set of matching pairs produced by Encoplot is guaranteed to be a strict subset of the matching pairs produced by Dotplot, while still preserving many of the geometrical features indicative of plagiarism.

2) Main Algorithm
The algorithm starts off with a pre-processing stage that extracts lexical atoms from the documents to be compared. This presented us with a choice of defining what actually makes an atom for the purpose of running Encoplot in the candidate selection phase. The standard calibration of the algorithm's thresholds was first done by defining an atom as a sequence of consecutive alphanumeric characters in the document and by performing no normalization on the resulting atoms.

The next step is putting together and sorting the list of all existing N-grams in each document. We experimented with N-gram sizes of 3, 4 and 5 atoms and found that 4-grams yield the best results in the absence of normalization and stemming, while trigrams give more fine-grained results when the atom lists are purged of stop words, normalized, and stemmed (given that the rest of the thresholds in the algorithm are adjusted accordingly).

The third step consists in merging the N-gram lists of the two documents. For many document pairs, the merging phase will give far fewer matches than the lengths of the two N-gram lists combined. In these circumstances, performing a sub-linear merge improves the final complexity of the algorithm.

The fourth and final step of the algorithm consists in the application of a clustering heuristic on the projection of the matching set onto its source-document component. The clustering process aims at discovering the sub-sequences of N-grams from the source document that have matches in the suspected document with a large enough combined length and density to justify a suspicion of plagiarism. Projection is done on the source document rather than on the suspicious one because of the intuitive expectation that a plagiarised document will often contain obfuscated text with scrambled word order and gaps in the matches large enough to throw off detection due to low match density along any given segment.
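A sketch of the sort-and-merge matching step, written from the description in [5] (one plausible reading, not the reference implementation); each N-gram occurrence is consumed at most once:

def encoplot_matches(a_atoms, b_atoms, n=4):
    """Match positions of equal n-grams across two atom sequences.

    Sorting both n-gram lists and merging them consumes each
    occurrence at most once, so the output is a subset of the
    full Dotplot match set.
    """
    A = sorted((tuple(a_atoms[i:i + n]), i) for i in range(len(a_atoms) - n + 1))
    B = sorted((tuple(b_atoms[j:j + n]), j) for j in range(len(b_atoms) - n + 1))
    i = j = 0
    pairs = []
    while i < len(A) and j < len(B):
        if A[i][0] == B[j][0]:
            pairs.append((A[i][1], B[j][1]))  # pair up and consume both
            i += 1
            j += 1
        elif A[i][0] < B[j][0]:
            i += 1
        else:
            j += 1
    return pairs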
3) Clustering Heuristics
Special consideration must be given to the clustering heuristic when discussing any implementation of the Encoplot algorithm. The clustering method recommended by the authors involves greedily extending segments, starting from a seed match, either to the right or to the left, for as long as the density of the matches remains above a desired threshold. While the greedy extension is relatively straightforward, the selection of seeds for the segments presents us with a choice. One alternative is a Monte Carlo optimisation loop, which has been applied by [5] with good results in the combined stages of candidate selection and detailed analysis. For the purpose of candidate selection alone, though, we have found that iterating through all possible seed segments has a relatively low impact on performance while leading to more reliable results.

B. FASTDOCODE
The approach used in [4] to select the candidates is based on n-grams: if two documents have at least two word 4-gram coincidences close enough to lie in the same paragraph, the documents are passed to the next phase; otherwise the pair is discarded. Because of its prohibitive size, the entire corpus cannot be held in memory, and a direct approach would be to read the files from disk every time they need to be evaluated. This would result in severe performance penalties, since every file would need to be accessed O(N) times, where N is the number of documents in the corpus. To avoid this, a "caching" scheme is employed: the documents are split into chunks, each containing a fixed number of documents, and at any time only one chunk is present in memory. Every chunk needs to be reloaded only a limited number of times, so every file is now accessed far fewer times from disk, which is a major improvement over the initial solution. Using our own implementation of this method resulted in about one third of the pairs being correctly identified, the other two thirds being missed. It is also worth mentioning that a large number of the identified pairs are false positives.
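A sketch of the FastDocode-style filter as we re-implemented it (our illustration; the max_gap value standing in for "close enough to be in the same paragraph" is an assumption):

def is_candidate_pair(susp_words, src_words, n=4, max_gap=150):
    """Flag a document pair as a plagiarism candidate.

    The pair passes when at least two word 4-grams of the suspicious
    document also occur in the source and are close enough to lie in
    the same paragraph; max_gap (in words) is our assumption.
    """
    src_grams = {tuple(src_words[j:j + n]) for j in range(len(src_words) - n + 1)}
    hits = [i for i in range(len(susp_words) - n + 1)
            if tuple(susp_words[i:i + n]) in src_grams]
    return any(b - a <= max_gap for a, b in zip(hits, hits[1:]))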
C. PPJoin
PPJoin [8] is an algorithm that can identify near-duplicate records efficiently, i.e. it can return all pairs (i, j) such that the similarity between the i-th and the j-th record is greater than a given threshold. The similarity can be computed using overlap, Jaccard or cosine similarity. We show in the next section that passages can be compared using cosine similarity with tf-idf weighting, with satisfactory results regarding precision and recall. Thus, our idea is to use ppjoin as a candidate selection method. The input cannot be whole documents, because we need to identify plagiarized passages inside them, so we split each document into passages whose lengths are powers of 2. The passage lengths vary from 32 to 2048 words in our implementation, in order to cover a range of possible lengths of the plagiarized passages. ppjoin is open-source; however, some parts needed to be re-implemented: the vectors in the Vector Space Model need to be weighted according to the tf-idf scheme (we observed empirically that failing to do so results in poor performance), and the inverted index needs to be changed, because it is implemented in-memory only, while the total size of all passages will exceed 10 GB. We use an existing implementation of an inverted index based on perfect hashing. To the best of our knowledge, ppjoin has mostly been used in copy detection systems (i.e. to determine similar pages on the Web) and has not been tried extensively in the plagiarism detection field.
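A sketch of the passage extraction that feeds ppjoin (our illustration; whether passages of a given size overlap is an implementation detail we assume away here by emitting non-overlapping windows):

def split_into_passages(words, sizes=(32, 64, 128, 256, 512, 1024, 2048)):
    """Split a document into passages whose lengths are powers of 2.

    For each size we emit non-overlapping windows (an assumption);
    every passage is identified by its (offset, size) pair.
    """
    passages = []
    for size in sizes:
        for off in range(0, len(words) - size + 1, size):
            passages.append((off, size, words[off:off + size]))
    return passages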
VI. DETAILED ANALYSIS
For each pair of documents (doc_i, doc_j), with i ≠ j, resulting from the candidate selection phase, we need to determine all passages from doc_i that are plagiarized from passages of doc_j.
For this purpose, we experimented with cosine similarity under the tf-idf weighting scheme on the PAN 2011 corpus. Under this similarity, each pair of passages is assigned a score between 0 and 1, with 0 meaning that the passages have no term in common and 1 meaning that the passages are identical under the bag-of-words model. Before computing the cosine similarity, we performed preprocessing, removing stop words and applying stemming to all remaining words. We then tested whether there is a separation between plagiarized and non-plagiarized passages under these circumstances, i.e. whether there is a threshold θ for which most of the pairs with similarity larger than θ are plagiarized and most of the pairs with lower similarity are not.
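A minimal sketch of this score in Python using scikit-learn (our illustration; in the real system the idf statistics come from the whole corpus, not from the two passages alone, which is why a corpus argument is assumed here):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def passage_similarity(passage_a, passage_b, corpus):
    # Fit idf on the full corpus so rare terms are weighted properly.
    vec = TfidfVectorizer(stop_words="english")
    vec.fit(corpus)
    X = vec.transform([passage_a, passage_b])
    # Score in [0, 1]: 0 = no shared terms, 1 = identical bags of words.
    return cosine_similarity(X[0], X[1])[0, 0]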
Fig. 3. Distribution of scores for highly obfuscated passages
Fig. 2 shows the distribution of scores for a sample of about 1000 lightly obfuscated plagiarized passages. The peak of the distribution is at θ around 0.8.
Fig. 4. Distribution of scores for the most similar passages in 1000 random document pairs
Fig. 2. Distribution of scores for lightly obfuscated passages
Fig. 3 shows the distribution of scores for a similar sample of about 1000 plagiarized passages, but this time the plagiarism cases are highly obfuscated. The peak decreases to 0.3, which is natural, because obfuscation involves, among other techniques, replacing words with synonyms and paraphrasing. In contrast, the distribution of scores for the best-matching passages from 1000 random document pairs, displayed in Fig. 4, is concentrated much lower: in this case, the peak is below 0.1. With this empirical observation, we conclude that the splitting method is effective. For this reason, in the following we focus on finding pairs of passages that are highly similar under cosine similarity with tf-idf weighting.
The brute-force approach is prohibitive, as it means generating all possible pairs of subsequences and then testing their scores linearly, giving a total complexity of O(N^5). This can be refined to O(N^4 log N), because when adding a new element the score does not need to be computed from scratch: a binary search can be performed instead, in order to check whether the newest element inserted into the second passage is found in the first one. Another approach is presented in [4]: the idea is to start from every possible pair of equal terms, and then extend to the left and to the right as long as the percentage of similar words does not fall below a threshold value; word bigrams and trigrams are used. This method is faster than the brute-force approach, but it still has a prohibitive running time.

Our idea is to avoid trying every possible starting point. For this, we split the suspicious document into chunks, each chunk having a size that is a power of 2, ranging from 32 to 2048. The offset of a chunk of size X is divisible by X, and the chunks may overlap. For each of these chunks we check whether there is a similar chunk of the same size in the source document (i.e., one with cosine similarity above a threshold). If there is, we add the chunk to a candidate set. After all the chunks have been considered, the intervals from the candidate set are concatenated, to obtain compact intervals. These represent the kernels, namely passages from the suspicious document that have a similar correspondent in the source document. For each kernel, we can determine in O(N log N) the passage in the source document that most closely matches it. Having a pair of passages, we can then further refine the boundaries using the idea from [8], trying to either extend or shrink the intervals, using a hill-climbing approach and maintaining the best solution along the way. We perform extensions and shrinks until the global optimum no longer improves for a given number of steps. According to [9], our method so far can be classified as paired (two documents are processed together to compute the metrics) and superficial (the metrics are computed without any knowledge of linguistic rules or document structure).
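A sketch of the kernel formation step in Python (our illustration; is_similar stands for the thresholded cosine check described above and is assumed, not shown):

def find_kernels(susp_words, src_words, is_similar,
                 sizes=(32, 64, 128, 256, 512, 1024, 2048)):
    """Collect candidate intervals and concatenate them into kernels.

    A chunk of size X starts at an offset divisible by X; chunks that
    have a similar same-size chunk in the source become candidate
    intervals, and overlapping or adjacent intervals are merged.
    """
    candidates = []
    for size in sizes:
        for off in range(0, len(susp_words) - size + 1, size):
            chunk = susp_words[off:off + size]
            if is_similar(chunk, src_words, size):
                candidates.append((off, off + size))
    # Merge overlapping or touching intervals into compact kernels.
    kernels = []
    for start, end in sorted(candidates):
        if kernels and start <= kernels[-1][1]:
            kernels[-1][1] = max(kernels[-1][1], end)
        else:
            kernels.append([start, end])
    return [tuple(k) for k in kernels]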
VII. RESULTS AND IMPROVEMENTS
With the approach described in the previous section, with no parameter tuning and using the gold standard of candidate pairs, we obtained the results displayed in Table I. This would have ranked our algorithm between the 2nd and the 3rd place in the PAN 2011 competition.

TABLE I. OVERVIEW OF THE RESULTS OBTAINED BY AUTHENTICOP

Plagiarism parameter    Score
Plagdet score           0.3965
Recall                  0.3377
Precision               0.7609
Granularity             1.2653
Even though we would have obtained the second-best recall in the competition, we only detect about a third of the total plagiarism cases, which is still low. The recall can be improved by lowering the threshold limits, which has a two-fold effect:
• highly obfuscated passages are detected, and thus the recall increases;
• passages that are not plagiarized are also reported (false positives), and the overall precision decreases.
We would like to increase the recall without affecting precision. For this, we need to prune out the false positives. We intend to achieve this by further checking each pair of passages using both LSA (Latent Semantic Analysis) and the Smith-Waterman algorithm.
A. Using Latent Semantic Analysis
Two passages can speak about the same subject even if their vocabularies are not exactly the same, for instance through paraphrasing and synonymy. However, even though the vocabulary differs, the terms used are correlated with each other, meaning that they are likely to be found in the same contexts. For example, "heap data structure" is correlated with "binary trees", because they refer to related concepts and are both likely to be found in computer science texts dealing with hierarchical data structures. Replacing terms with their most common synonyms, using a dictionary such as WordNet, is not effective, because terms need to be replaced depending on the context in which they appear. Thus, in order to identify whether two passages are related to each other (i.e. they speak about the same subject), we use Latent Semantic Analysis (LSA) [10], also referred to as Latent Semantic Indexing (LSI) in the context of Information Retrieval.

LSA performs a Singular Value Decomposition (SVD) on the term-document matrix M, which means that M is written as a product of the form U∑V*, where U and V* are unitary matrices and ∑ is a diagonal matrix with the singular values on the diagonal. We can reduce the data to a lower-dimensional space by keeping only the largest k values on the diagonal of ∑, together with their corresponding singular vectors from U and V, and multiplying these new matrices. Performing this semantic analysis makes our detection system Structural rather than Superficial, considering the classification proposed in [9].

It is known that LSA is affected by polysemy (words having different meanings in different contexts). For example, the word tree has a different meaning in computer science texts than in biology documents. This is why the correlation of tree and trunk might be greater than the correlation between tree and graph, which is not desirable. This can be avoided by choosing a relevant semantic space, i.e. a corpus constructed from texts relating only to the area of interest. In our case, the software is aimed at detecting plagiarism in Computer Science documents, so we perform LSA only on Computer Science texts taken from Wikipedia. To perform LSA, we use Gensim (http://radimrehurek.com/gensim/), an open-source framework written in Python that already contains algorithms for semantic analysis such as LSA. Initially, we employ a preprocessing step for the whole corpus. Then, each subsequent document that needs to be analyzed is brought into the lower-dimensional space by performing a dimensionality reduction.
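A minimal sketch of this pipeline with Gensim (our illustration; the corpus_texts and passage_tokens variables, the number of topics, and the tf-idf step are assumptions on our side):

from gensim import corpora, models, similarities

# corpus_texts: preprocessed Computer Science articles, each a list of
# tokens (assumed to come from the preprocessing step described earlier).
dictionary = corpora.Dictionary(corpus_texts)
bow_corpus = [dictionary.doc2bow(text) for text in corpus_texts]

tfidf = models.TfidfModel(bow_corpus)
lsi = models.LsiModel(tfidf[bow_corpus], id2word=dictionary, num_topics=300)

# Index the corpus in the reduced semantic space.
index = similarities.MatrixSimilarity(lsi[tfidf[bow_corpus]])

# Project a new passage into the same space and compare it to the corpus.
query = lsi[tfidf[dictionary.doc2bow(passage_tokens)]]
semantic_scores = index[query]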
B. Smith-Waterman Algorithm
All the algorithms presented so far fall under the bag-of-words model, and therefore all positional information is lost. It is possible for related texts to use the same vocabulary without plagiarism: for long passages, if the words are highly scrambled, it is improbable that the passages are plagiarized. Smith-Waterman is a dynamic programming algorithm similar to the longest common subsequence, the difference being that operations like deletions or insertions can have a cost higher than one. The algorithm is widely used for finding good near-matches, or so-called local alignments, within biological sequences [11]. The integration of the algorithm into our system and its proper set-up are still in progress.
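A compact sketch of the algorithm over word sequences (a standard formulation; the scoring parameters are our assumptions, since the final calibration is still in progress):

def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Score of the best local alignment between token sequences a and b.

    H[i][j] holds the best alignment score ending at a[i-1], b[j-1];
    clamping at 0 lets an alignment restart anywhere (locality).
    """
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,  # match or substitution
                          H[i - 1][j] + gap,    # deletion in a
                          H[i][j - 1] + gap)    # insertion in a
            best = max(best, H[i][j])
    return best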
VIII. CONCLUSIONS
We have identified a number of directions for further development. The algorithms implemented for the candidate selection and detailed analysis phases need to be further benchmarked for various combinations of threshold parameters. Furthermore, we would like the application to support more document formats and to be accessible through a user-friendly web interface. Last but not least, the semantic analysis components and stemming support need to be improved. Despite the challenges posed by the task, initial results indicate that satisfactory detection results can be obtained on specialized corpora.

REFERENCES
[1] M. Potthast, et al., "An evaluation framework for plagiarism detection," Proceedings of the 23rd International Conference on Computational Linguistics: Posters, Association for Computational Linguistics, 2010, pp. 997-1005.
[2] "Turnitin," http://www.turnitin.com.
[3] "ANTIPLAG," http://www.svop.sk/en/antiplag.aspx.
[4] G. Oberreuter, et al., "FastDocode: Finding Approximated Segments of N-Grams for Document Copy Detection - Lab Report for PAN at CLEF 2010," CLEF (Notebook Papers/LABs/Workshops), M. Braschler, et al., eds., 2010.
[5] C. Grozea and M. Popescu, "Who's the thief? Automatic detection of the direction of plagiarism," Proceedings of the 11th International Conference on Computational Linguistics and Intelligent Text Processing, Springer-Verlag, 2010, pp. 700-710.
[6] J.W. Tukey, Exploratory Data Analysis, Addison-Wesley Publishing Company, 1977.
[7] K. Church and J. Helfman, "Dotplot: a Program for Exploring Self-Similarity in Millions of Lines of Text and Code," Proceedings of the 24th Symposium on the Interface, Computing Science and Statistics, vol. 24, pp. 58-67, March 1992.
[8] C. Xiao, et al., "Efficient similarity joins for near duplicate detection," Proceedings of the 17th International Conference on World Wide Web, ACM, 2008, pp. 131-140.
[9] T. Lancaster and F. Culwin, "Classification of Plagiarism Detection Engines," E-journal ITALICS, vol. 4, no. 2, 2005.
[10] T.K. Landauer and S.T. Dumais, "A solution to Plato's problem: the Latent Semantic Analysis theory of acquisition, induction and representation of knowledge," Psychological Review, vol. 104, no. 2, 1997, pp. 211-240.
[11] R. Irving, Plagiarism and Collusion Detection using the Smith-Waterman Algorithm, University of Glasgow, 2004.