Efficient and Effective Plagiarism Detection for Large Code Repositories

Steven Burrows

Supervisors: Seyed M. M. Tahaghoghi and Justin Zobel

Honours Thesis
School of Computer Science and Information Technology
RMIT University
Melbourne, Australia
October 2004

Abstract

The copying of programming assignments is a widespread problem in academic institutions. Manual plagiarism detection is time-consuming, and current popular plagiarism detection systems are not scalable to large code repositories. While there are text-based plagiarism detection systems capable of handling millions of student papers, comparable systems for code-based plagiarism detection are in their infancy. In this thesis, we propose and evaluate new techniques for code plagiarism detection. Using small and large collections of programs, we show that our approach is highly scalable while maintaining levels of effectiveness similar to those of JPlag.


Declaration

I declare that this work is entirely my own, except where due acknowledgement has been made, and that it has not been previously submitted for any other academic award.

Steven Burrows 18 October 2004


Contents

1 Introduction
2 Background
  2.1 Search Engines
  2.2 Local Alignment
3 Related Work
  3.1 Text-based Systems
  3.2 Attribute-oriented Code-based Systems
  3.3 Structure-oriented Code-based Systems
4 Scalable Plagiarism Detection
  4.1 Tokenisation
  4.2 Index Construction
  4.3 Querying
  4.4 Local Alignment
5 Evaluation
  5.1 Data Set
  5.2 Ground Truth
  5.3 Test Environment
  5.4 Effectiveness Measurement
6 Results
  6.1 Index Results
  6.2 Local Alignment Results
  6.3 Scalability
7 Conclusion
  7.1 Findings
  7.2 Future Work
A Appendix
  A.1 Program 0221.c
  A.2 Program 0261.c
  A.3 Program 0240.c
  A.4 Program 022522.c

List of Figures

2.1 Simple Inverted Index
2.2 Local Alignment Matrix
2.3 Optimal Local Alignment
2.4 Multiple Local Alignment
4.1 Sample Program #1
4.2 Program #1 N-grams
4.3 Sample Program #2
4.4 Program #2 N-grams
4.5 N-gram Index
4.6 Zettair Query Ranking
5.1 JPlag Results
6.1 Index Ranking Algorithm Results
6.2 Local Alignment Results
6.3 Scalability Results

List of Tables

4.1 Partial SIM Token Set
4.2 Program #1 Token Representation
5.1 Sample Data Statistics
5.2 Large Collection Statistics
6.1 Running Times


1 Introduction

The RMIT Plagiarism Policy 2003 [RMIT University, 2002] defines plagiarism in part as "the presentation of the work, idea or creation of another person as though it is your own." Code plagiarism is the reuse of program structure and programming language syntax either from another student or from online resources. Plagiarism is a problem that universities across the world need to deal with. A joint study of 287 Monash University and Swinburne University students [Sheard et al., 2002] revealed that between 69.3% and 85.4% admitted to cheating. With such a high level of cheating, it is imperative that we have adequate mechanisms to detect plagiarism.

Detecting plagiarism has become increasingly difficult due to larger class sizes and the ability of students to share work with each other and across the Internet. For example, the School Sucks website1 claims to have 50,000 term papers, book reports, dissertations and college essays available online. Class sizes of 300 students or more are not uncommon in university environments. In Computer Science, this problem is especially serious due to the abundance of web sites offering easily accessible program code relevant to assessable tasks.

For large class sizes, manual plagiarism detection is a slow and laborious process. Many courses have multiple staff assessing student assignments, making manual detection impractical; it is rare for a single person to look at all submissions. Many departments have used automated plagiarism detection systems. Unfortunately, these systems are restricted by upper limits on the volume of programs that they can check; system memory is one such limit. Hence, it is generally only possible to check for plagiarism among submissions for a single assessment task. It is also very important for a system to be as accurate as possible: university staff lack the time and resources to examine large numbers of false matches.

An efficient, effective, and scalable solution is needed to address this problem. Such a solution would allow staff to search for plagiarism against large volumes of previous assignments; we refer to this as historical plagiarism detection. This is of value because the nature of the course material in many core courses makes it difficult to vary assignment specifications. We could also search for plagiarism across courses that have content in common; for example, across two courses that teach similar material but are separated only by the cohort of students enrolled. We could further search for plagiarism against very large collections of programs obtained from the Web. To demonstrate the large volume of code available, we conducted a simple experiment using the popular Google search engine;2 a query for the C library function printf returned approximately two million results,3 suggesting that there are around two million C programs on the web, as the great majority of programs include a printf call.

The motivation for this work is to find a solution that addresses plagiarism detection for the purposes described above: historical, multi-course and Internet plagiarism detection. Most previous approaches to plagiarism detection perform the task in an exhaustive pairwise fashion, such as the approach used in JPlag [Prechelt et al., 2002]; this is not scalable to large code repositories. Our approach is to construct an index of student programs and query this index once for each program. We refine the results generated by this process using the local alignment approximate string matching technique. Initial experiments conducted on a repository of 296 programs allowed us to identify the most efficient and effective indexing, querying and local alignment techniques. Further experiments on a much larger repository of 61,540 programs revealed that the efficiency and effectiveness of our approach scales to large code repositories.

1 http://www.schoolsucks.com
2 http://www.google.com
3 Experiment conducted July 21, 2004.


Lexicon    Inverted Lists
apple      1: (25,3)
banana     1: (26,2)
grape      1: (22,5)
orange     3: (31,1) (33,3) (15,2)
pear       2: (24,6) (26,1)

Figure 2.1: A simple inverted index.

In Section 2, we discuss background knowledge in search engines and local alignment. In Section 3, we review existing approaches to plagiarism detection and note their strengths and limitations. In Section 4, we describe our approach. In Section 5, we describe our experimental data and evaluation techniques. We present results in Section 6. In Section 7, we summarise our findings and outline potential future work.

2 Background The plagiarism detection techniques described in this thesis use search algorithms and local alignment to identify likely matches as explained in Section 4. In this section, we introduce search engines. We describe what they are, their purpose, and their major components. We also introduce local alignment and describe how this is utilised in genomic information retrieval.

2.1 Search Engines

Arasu et al. [2001] describe a web search engine as a tool that allows users to submit a query as a list of keywords, and receive a list of possibly relevant web pages that contain most or all of the keywords. One of the first search engines was the World Wide Web Worm [Brin and Page, 1998], which indexed approximately 110,000 documents in 1994. As of July 2004, Google claims to maintain a collection of over four billion documents.

The purpose of a search engine is to index huge volumes of data in a compressed form for efficient retrieval. To facilitate efficient retrieval, search engines use an inverted index. Figure 2.1 shows that an inverted index comprises two main components: a lexicon and a series of inverted lists. The lexicon stores all distinct words in the collection; in our example, the names of fruits. The inverted list for each lexicon term begins with the term frequency, which tells us how many documents in the collection contain the term, followed by a sequence of integer pairs that tell us how often the term occurs in each relevant document: the document identifier and the within-document frequency.

To understand the postings lists, consider two examples. The term "banana" has a term frequency of 1, which tells us that it occurs in one document; its postings list indicates that it appears twice in document 26. Similarly, the term "pear" appears in two documents: six times in document 24 and once in document 26.
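The lexicon and postings lists of Figure 2.1 can be sketched as a simple in-memory map. This is an illustrative toy of our own (the dictionary layout and `lookup` helper are not the structure used by any particular search engine):

```python
# Toy inverted index mirroring Figure 2.1: each lexicon term maps to a
# postings list of (document identifier, within-document frequency)
# pairs.  Real engines store these lists compressed on disk.
index = {
    "apple":  [(25, 3)],
    "banana": [(26, 2)],
    "grape":  [(22, 5)],
    "orange": [(31, 1), (33, 3), (15, 2)],
    "pear":   [(24, 6), (26, 1)],
}

def lookup(term):
    """Return (number of documents containing the term, postings list)."""
    postings = index.get(term, [])
    return len(postings), postings

# "pear" appears in two documents: six times in 24 and once in 26.
df, postings = lookup("pear")
```

Looking up a term touches only its own postings list, which is why retrieval avoids an exhaustive scan of the collection.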


This index demonstrates why search engines are highly scalable for large collections of data. Each term need only be stored once in the lexicon, and the inverted lists comprise integers that are highly compressible [Witten et al., 1999]. Keeping this information in main memory allows us to locate the documents we are looking for efficiently, rather than searching through all documents exhaustively. An effective compression technique is the use of d-gaps: the differences between consecutive document identifiers are stored rather than the absolute values. These differences are likely to be much smaller numbers in a large collection, and can therefore be stored in fewer bits.

The query engine is the component that the user most closely interacts with. The user requests information by providing a query string that the query engine parses; the query engine then retrieves the matching inverted lists for the terms the user provided. The ranking engine returns matching results to the user ranked by decreasing estimated relevance.

Relevance is estimated using a similarity measure, which must consider several factors. Documents that contain more instances of a query term, that is, a higher term frequency, are likely to be more relevant. However, lengthy documents are likely to contain the same term many times simply because of their length, so we penalise lengthy documents to ensure that shorter pages are also considered; this is known as inverse document length. It is also useful to weight terms by their rarity: terms that appear rarely (such as "aardvark") are more likely to locate our information need than commonly occurring terms (such as "the"). This is known as inverse document frequency.

The Cosine measure is a popular similarity measure that takes into account term frequency, inverse document length and inverse document frequency as discussed above. The full formula is provided below [Witten et al., 1999].

$$\mathrm{Cosine}(Q, D_d) = \frac{1}{W_D W_Q} \sum_{t \in Q \cap D_d} w_{q,t} \cdot w_{d,t}$$

$$W_D = \sqrt{\sum_{t=1}^{n} w_{d,t}^2} \qquad W_Q = \sqrt{\sum_{t=1}^{n} w_{q,t}^2}$$

$$w_{d,t} = 1 + \ln f_{d,t} \qquad w_{q,t} = \ln\left(1 + \frac{N}{f_t}\right)$$

where:
Q = query; D_d = a document
W_Q = query weight; W_D = document weight
w_{q,t} = query-term weight; w_{d,t} = document-term weight
f_{d,t} = within-document frequency; f_t = collection frequency
N = number of documents in the collection; t = a term
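As a sketch, the Cosine measure can be computed from plain term-frequency dictionaries. This is our own illustrative helper, not the implementation of any particular search engine; it assumes every query term appears in the `df` map:

```python
import math

def cosine(query_tf, doc_tf, N, df):
    """Cosine similarity per the formula above.
    query_tf, doc_tf: term -> frequency maps for the query and document;
    N: number of documents in the collection; df: term -> f_t."""
    wq = {t: math.log(1 + N / df[t]) for t in query_tf}       # w_{q,t}
    wd = {t: 1 + math.log(f) for t, f in doc_tf.items()}      # w_{d,t}
    WQ = math.sqrt(sum(w * w for w in wq.values()))
    WD = math.sqrt(sum(w * w for w in wd.values()))
    shared = set(query_tf) & set(doc_tf)
    return sum(wq[t] * wd[t] for t in shared) / (WD * WQ)
```

A document with the same term distribution as the query scores higher than one sharing only a single term, which is the behaviour the measure is designed to produce.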

The Okapi BM25 similarity measure [Robertson and Walker, 1999] evaluates the probability that a document is relevant to a query. This similarity measure also takes into account term frequency, inverse document length and inverse document frequency. The full formula is provided below [Robertson and Walker, 1999].


$$\mathrm{Okapi}(Q, D_d) = \sum_{t \in Q} w_t \cdot \frac{(k_1 + 1) f_{d,t}}{K + f_{d,t}} \cdot \frac{(k_3 + 1) f_{q,t}}{k_3 + f_{q,t}}$$

$$w_t = \log_e\left(\frac{N - f_t + 0.5}{f_t + 0.5}\right) \qquad K = k_1 \left((1 - b) + \frac{b \cdot W_d}{W_D}\right)$$

where:
Q = query; D_d = a document
W_d = document length; W_D = average document length
k_1 = 1.2; b = 0.75; k_3 = 1000
f_{d,t} = within-document frequency; f_{q,t} = query-term frequency
f_t = collection frequency; N = number of documents in the collection; t = a term

PlagiRank was developed by Chawla [2003] specifically for the purpose of plagiarism detection. According to Chawla, PlagiRank "is based on the perception that similar programs should contain similar number of occurrences of symbols (tokens)." This is a sensible goal because the Cosine and Okapi BM25 similarity measures reward documents that contain a high number of query terms, rather than a similar number of term occurrences between the query and the document. We present Chawla's PlagiRank similarity measure here:

$$\mathrm{PlagiRank}(Q, D_d) = \frac{1}{W_D W_Q} \sum_{t \in Q \cap D_d} \left(\log_e\left(\frac{f_{q,t}}{f_{d,t}}\right) + 1\right) \cdot f_{q,t} \qquad \text{where } f_{q,t} \le f_{d,t}$$

$$\mathrm{PlagiRank}(Q, D_d) = \frac{1}{W_D W_Q} \sum_{t \in Q \cap D_d} \left(\log_e\left(\frac{f_{d,t}}{f_{q,t}}\right) + 1\right) \cdot f_{q,t} \qquad \text{where } f_{q,t} > f_{d,t}$$
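PlagiRank can be sketched the same way; this is our reading of the two cases above (not Chawla's code), with the weights W_D and W_Q taken as given. A term contributes most when its query and document frequencies are equal, since the log of the frequency ratio is then zero:

```python
import math

def plagirank(query_tf, doc_tf, WD, WQ):
    """PlagiRank per the formulas above (illustrative sketch).
    Terms whose frequencies are similar in query and document
    contribute more to the score."""
    score = 0.0
    for t in set(query_tf) & set(doc_tf):
        fqt, fdt = query_tf[t], doc_tf[t]
        # Both cases reduce to taking the ratio of the smaller
        # frequency to the larger one.
        ratio = fqt / fdt if fqt <= fdt else fdt / fqt
        score += (math.log(ratio) + 1) * fqt
    return score / (WD * WQ)
```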

Arasu et al. [2001] present a detailed discussion of all components of a search engine, including the crawler, crawler control module, page repository and indexer components that we have not discussed in this thesis.

2.2 Local Alignment

Local alignment [Smith and Waterman, 1981] is an approximate string matching technique that identifies the best possible alignment between regions of a pair of sequences. Local alignment similarity scoring is used to determine high-scoring regions, ignoring the fact that the sequences as a whole may have large differences [Williams and Zobel, 2002].

Today, local alignment is used in bioinformatics to find organisms of similar homology, that is, organisms with a similar evolutionary ancestry. To achieve this, the DNA (deoxyribonucleic acid) sequences belonging to various organisms are compared [Williams and Zobel, 2002]. One tool that can be used for this is BLAST (Basic Local Alignment Search Tool) [Altschul et al., 1990]. The National Center for Biotechnology Information4 (NCBI) hosts a version of BLAST that supports genomic information retrieval through a web interface connected to

4 http://www.ncbi.nlm.nih.gov/


        A   C   G
    0   0   0   0
A   0   1   0   0
C   0   0   2   1
T   0   0   1   1

(Annotations in the original figure: a match between 'C' and 'C', a mismatch between 'G' and 'T', and an indel between 'G' and 'C'.)

Figure 2.2: A local alignment matrix for the genomic sequences "ACG" and "ACT". The arrows indicate a match, a mismatch, and an indel.

public genomic databases. In this thesis, we investigate the application of local alignment to plagiarism detection.

Consider the local alignment of the two very short genomic sequences "ACG" and "ACT" presented in Figure 2.2. The local alignment of two sequences s and t, of lengths l1 and l2, can be calculated using a matrix of size (l1 + 1) × (l2 + 1) [Smith and Waterman, 1981]. In our example, the strings of length three produce a matrix with sixteen cells. Note that the first row and the first column of the matrix are initialised with zeros.

When computing the score for each cell, we consider the maximum possible score resulting from a match, mismatch, insertion, or deletion between each possible pair of characters in the two sequences. We award a score increase for any pair of characters in the local alignment matrix that match. In our example, the character 'C' from "ACG" and the character 'C' from "ACT" match. This merits a one-point gain over the value in the cell immediately to the upper-left; since that score is one, we obtain a score of two for this match. Similarly, we penalise any pair of characters that mismatch. We have a mismatch between the character 'G' from "ACG" and the character 'T' from "ACT", and so the score is reduced by one.

We can also optimise the alignment of two sequences by inserting or deleting characters. These attract a penalty, as they also represent changes to the sequences; we refer to insertions and deletions collectively as indels. Scoring for an indel is slightly more complicated. We have an indel between the character 'G' from "ACG" and the character 'C' from "ACT". This effectively means that we have either deleted the character 'T' from the string "ACT" or inserted a blank before it.
Indel scores reduce the score of the cell immediately to the left or above by one, depending on which sequence the indel penalty is applied to. In our example, we have reduced the score from the left. In sequences with a large number of mismatches and indels, it is possible for a negative score to be generated. Given that we are only interested in the similarity of regions in local alignment, we enforce a minimum score of zero.

The formula below summarises the local alignment scoring metrics just discussed. The score increase in the first line represents a match; the score decreases in lines 2–4 represent the penalties for mismatches and indels; the last line ensures that no score becomes negative. The max operator ensures that we always choose the most favourable alignment between characters in the matrix; we do not want interesting regions of similarity to be lost due to highly negative preceding scoring regions. This is important both for genomic information retrieval and in our application.


A C G
. .
A C T

Figure 2.3: An optimal alignment "AC" as found in the genomic sequences "ACG" and "ACT".

$$[i, j] = \max \begin{cases} [i-1, j-1] + 1 & \text{if } i, j > 0 \text{ and } s_i = t_j, \\ [i-1, j-1] - 1 & \text{if } i, j > 0 \text{ and } s_i \ne t_j, \\ [i, j-1] - 1 & \text{if } i, j > 0, \\ [i-1, j] - 1 & \text{if } i, j > 0, \\ 0 & \text{for all } i, j \end{cases}$$
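The recurrence can be implemented directly. As a sketch, the function below fills the matrix with match = +1 and mismatch/indel = −1, floored at zero, and reproduces the matrix of Figure 2.2:

```python
def local_alignment_matrix(s, t):
    """Fill a Smith-Waterman local alignment matrix per the recurrence
    above: match +1, mismatch -1, indel -1, scores floored at zero."""
    rows, cols = len(s) + 1, len(t) + 1
    m = [[0] * cols for _ in range(rows)]  # first row/column stay zero
    for i in range(1, rows):
        for j in range(1, cols):
            diag = m[i-1][j-1] + (1 if s[i-1] == t[j-1] else -1)
            m[i][j] = max(diag,          # match or mismatch
                          m[i][j-1] - 1, # indel (gap in s)
                          m[i-1][j] - 1, # indel (gap in t)
                          0)             # never go negative
    return m

# Rows correspond to "ACT" and columns to "ACG", as in Figure 2.2;
# the highest cell score is 2, for the alignment "AC".
m = local_alignment_matrix("ACT", "ACG")
```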

The goal of local alignment is to find the optimal alignment. To do this, we traverse the local alignment matrix to locate the highest score in any cell; in our example in Figure 2.2, the highest score is two. We then follow the traceback path until we reach a zero. The traceback path leads through the highest-scoring preceding cell to the left, above, or to the upper-left of the current cell. In our example, this traceback path represents the alignment "AC", as shown in Figure 2.3.

It is not uncommon to have more than one optimal alignment, or several regions of high similarity; we refer to this as multiple local alignment. Morgenstern et al. [1998] consider multiple local alignment in their genomic information retrieval system Dialign. They argue that local alignment can be successfully applied if there is only one region of high similarity, but not if there are several regions of high similarity. Plagiarised program code is likely to contain multiple segments of similarity, as a result of disguising the program as a whole. Identifying multiple local alignments is much more difficult than identifying a single optimal alignment. Morgenstern et al. propose that this problem can be overcome by ignoring indel penalties and identifying local regions of similarity along the diagonals of the local alignment matrix. This can be achieved by summing the optimal local alignment score on each diagonal to produce a similarity score; a minimum match length must be imposed to avoid noise from trivial matches. We present an example of multiple local alignment in Figure 2.4.
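The diagonal-summing idea can be sketched as follows. This is our illustrative reading of the approach, not Dialign's actual implementation; the scoring parameters (match +1, mismatch −1, minimum match length 3) follow the Figure 2.4 example:

```python
def diagonal_alignment_score(s, t, min_match=3):
    """Multiple local alignment in the diagonal-summing style described
    above: ignore indels, score each diagonal independently with
    match +1 / mismatch -1 (floored at zero), and sum the best score
    of every diagonal that reaches the minimum match length."""
    total = 0
    for d in range(-(len(s) - 1), len(t)):  # every diagonal j - i = d
        score = best = 0
        i = max(0, -d)                      # starting row on diagonal
        j = i + d                           # starting column
        while i < len(s) and j < len(t):
            score = max(score + (1 if s[i] == t[j] else -1), 0)
            best = max(best, score)
            i += 1
            j += 1
        if best >= min_match:
            total += best
    return total
```

For the sequences of Figure 2.4, the main diagonal contributes 4 (the run "ACTG") and one off-diagonal contributes 3 (the run "CTG"), for a total score of 7.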

3 Related Work

Existing plagiarism detection methods fall into two main categories: text-based and code-based. We explain why text-based approaches are not suitable for code plagiarism detection, and we discuss the two main classifications of code-based methods: attribute-oriented and structure-oriented. We also discuss the SIM, JPlag, and PlagiIndex code plagiarism detection systems in detail.

3.1 Text-based Systems

Hoad and Zobel [2002] discuss two families of methods for text-based plagiarism: ranking and fingerprinting. The ranking family of techniques is based on information retrieval concepts as discussed


        A   C   T   G   C   T   G
    0   0   0   0   0   0   0   0
A   0   1   0   0   0   0   0   0
C   0   0   2   0   0   1   0   0
T   0   0   0   3   0   0   2   0
G   0   0   0   0   4   0   0   3
A   0   1   0   0   0   3   0   0
C   0   0   2   0   0   1   2   0

Figure 2.4: Multiple local alignment between the genomic sequences "ACTGAC" and "ACTGCTG". In this example, we use match = 1 and mismatch = −1; indels are ignored. We also employ a minimum match length of 3 to remove trivial matches. The alignment of these two sequences generates a score of 7.

in Section 2.1. An index of the collection is built, then a similarity measure is used to evaluate the similarity of queries and documents. Hoad and Zobel take ranking methods a step further by proposing identity measures. They explain that similarity measures such as the Cosine measure are best suited to ad hoc querying; the identity measure is instead based on the idea that plagiarised documents are likely to contain a similar number of occurrences of terms.

The fingerprinting family of methods represents documents in a compact form. Each fingerprint is a collection of integers representing key aspects of a document. The fingerprints are indexed, and fingerprints of queries are generated to query the index. One of the largest text-based plagiarism detection systems available is Turnitin.5 As of August 2004, Turnitin claims that its database exceeds 4.5 billion documents.

The techniques used for text plagiarism detection are not directly applicable to code plagiarism. Our experiments with an online text-based plagiarism detection tool showed that text-based systems ignore coding syntax in favour of comparing normal English words. We uploaded a collection of C programming assignments and observed the results. Only small parts of these programs were found to be similar, such as function names, comments and output statements. All coding syntax was ignored; this is a severe limitation, because coding syntax is the dominant part of any program. Text-based systems are not discussed further in this thesis for this reason.
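As an illustration of the fingerprinting idea, the sketch below hashes word 3-grams and keeps a fixed-size subset of the hash values. This is a generic scheme of our own devising, not Hoad and Zobel's exact method:

```python
import hashlib

def fingerprint(text, n=3, k=8):
    """Hash every word n-gram of the text to a 32-bit integer and keep
    the k smallest hashes as a compact document fingerprint."""
    words = text.lower().split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    hashes = {int(hashlib.md5(g.encode()).hexdigest()[:8], 16)
              for g in grams}
    return sorted(hashes)[:k]

def resemblance(fp_a, fp_b):
    """Fraction of shared fingerprint integers (Jaccard similarity)."""
    a, b = set(fp_a), set(fp_b)
    return len(a & b) / len(a | b) if a | b else 0.0
```

Because only a handful of integers survive per document, fingerprints of millions of documents can be indexed and compared cheaply, at the cost of some detection precision.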

3.2 Attribute-oriented Code-based Systems

Attribute-oriented plagiarism detection systems measure properties of assignment submissions. Donaldson et al. [1981] describe a system with four attributes:

• The number of unique operators.
• The number of unique operands.
• The total number of occurrences of operators.
• The total number of occurrences of operands.

The similarity of two programs is approximated by the differences between these attributes. Unfortunately, attribute-oriented systems are highly limited. Verco and Wise [1996] report that attribute-oriented systems perform best at detecting plagiarism where very close copies are involved. A common method of plagiarising is to add redundant code statements to a program; since attribute-oriented systems take all lines of code into account, the resulting attribute scores then differ greatly. Copied chunks embedded in otherwise different programs are also hard to detect.

5 http://www.turnitin.com
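A toy attribute profile in the spirit of these four counts can be sketched as follows. The operator list and whitespace tokenisation are simplifications for illustration, not a real C lexer:

```python
# A small, incomplete operator set for illustration only.
OPERATORS = {"+", "-", "*", "/", "=", "==", "<", ">", "&&", "||"}

def attribute_profile(tokens):
    """Return the four attributes of Donaldson et al.: (unique
    operators, unique operands, total operators, total operands)."""
    ops = [t for t in tokens if t in OPERATORS]
    operands = [t for t in tokens if t not in OPERATORS]
    return (len(set(ops)), len(set(operands)), len(ops), len(operands))

tokens = "x = a + b * a".split()
profile = attribute_profile(tokens)  # (3, 3, 3, 4)
```

Comparing two programs then reduces to comparing two 4-tuples, which shows both the appeal (constant-size profiles) and the weakness (all positional information is discarded) of the attribute-oriented approach.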

3.3 Structure-oriented Code-based Systems

Structure-oriented systems [Bowyer and Hall, 1999; Gitchell and Tran, 1999; Prechelt et al., 2002] create lossy representations of programs and compare these to detect plagiarism. These systems deliberately ignore easy-to-modify elements such as comments, additional white space and variable names. Because they perform local comparisons, structure-oriented systems are less susceptible to the addition of redundant statements that can mislead attribute-oriented systems. To avoid detection, a student would need to modify all parts of the program significantly, reducing the longest matching program segments below the minimum match length threshold of the plagiarism detection system in use. It can be argued that this amount of work is similar to the amount needed to complete the assignment legitimately; students who know that structural plagiarism detection is employed are therefore less likely to plagiarise.

Three of the most popular structure-oriented plagiarism detection systems are JPlag [Prechelt et al., 2002], MOSS [Bowyer and Hall, 1999] and SIM [Gitchell and Tran, 1999]. While these systems are effective at detecting plagiarism, they are limited by the volume of code that they can process at any one time. JPlag is limited because its underlying comparison algorithm is applied exhaustively to all program pairs, and the JPlag administrators enforce an upload limit of 9.5 Megabytes [Chawla, 2003]. MOSS is limited because all processing is done in main memory.6 Our experiments demonstrated that this limitation exists in SIM as well.

SIM

SIM [Gitchell and Tran, 1999] was developed by Dick Grune of the Vrije Universiteit in the Netherlands. This plagiarism detection system is no longer actively supported; however, it can be downloaded, compiled and run locally, as the source code is publicly available.7 SIM supports programs written in Java, C, Pascal, Modula-2, Lisp, Miranda and plain text. The system applies string alignment techniques developed to detect similarity in DNA strings, as discussed in Section 2.2. However, this approach is applied exhaustively to all program pairs and is not scalable to large code repositories, as alignment is a computationally expensive process.

6 Claim based on personal communication with MOSS author A. Aiken.
7 http://www.cs.vu.nl/~dick/sim.html


JPlag

JPlag [Prechelt et al., 2002] is a publicly accessible service8 that compares programs pair by pair to produce an HTML report. JPlag uses the greedy string tiling algorithm to find a set of maximal-length common substrings in two programs, after conversion into a token representation. The algorithm considers only matches above a threshold length, to remove trivial matches. This is achieved by marking characters that have been used in a match, so that further matches are not recorded if they are subsets of existing matches; the longest substrings are marked first so that smaller substrings do not mark parts of the longer substrings. A similarity score between the two strings can then be computed as the fraction of characters that were marked.

For example, comparing the strings "ABCDEFGHIJ" (string A) and "ABDEFLEFGH" (string B) with a minimum match length of three returns two matches of sufficient length: "DEF" and "EFGH" in string B. After marking these seven characters in string B, it becomes "AB***L****". We can compute the similarity of these two strings as 70%, given that seven of the ten characters in string B were marked.

Prechelt et al. [2002] note that the worst-case complexity of this algorithm is O((|A| + |B|)^3). In the best case, where the two strings have no characters in common, the complexity is reduced to O((|A| + |B|)^2). The authors also report that they have included elements of the Karp-Rabin pattern matching algorithm [Karp and Rabin, 1987] to further improve efficiency. The greedy string tiling algorithm is applied in an exhaustive fashion to all program pairs in JPlag [Prechelt et al., 2002]; for an example collection of 300 programs, there are 44,850 unique program pairs. This approach is not scalable.

PlagiIndex

PlagiIndex [Chawla, 2003] improved the efficiency of plagiarism detection for large code repositories by using an index of all programs being searched for plagiarism.
Chawla used the PlagiRank similarity measure discussed in Section 2.1 to compute the similarity between program pairs. As discussed in Section 2.1, indexes are highly scalable to large collection sizes, and Chawla was able to demonstrate this in his approach. Chawla attempted to implement local alignment; however, he reported that his implementation "did not prove to be effective".

Chawla demonstrates the high scalability of PlagiIndex, but his experiments on its effectiveness are limited. One experiment included 2,813 queries, but he did not report the significance of these queries, nor what they were queried against. Another experiment benchmarked against JPlag with 1,043 programs, but again the significance of this collection was not reported. A final experiment was performed against a "live" collection of assignments submitted during the course of his work. He found one new case of plagiarism, but he does not report the total number of programs in this collection, so the significance of this result is unclear.
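Returning to the JPlag worked example, the marking scheme can be sketched as follows. This simplified version finds maximal common substrings, marks them in string B longest first, and reports the fraction of B that is marked; it illustrates the worked example only, and is not JPlag's full greedy string tiling algorithm (which marks both strings and uses Karp-Rabin hashing):

```python
def tile_similarity(a, b, min_match=3):
    """Mark maximal common substrings of length >= min_match in string
    b, longest first, and return the fraction of b's characters that
    end up marked (as in the 70% worked example)."""
    matches = []
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while (i + k < len(a) and j + k < len(b)
                   and a[i + k] == b[j + k]):
                k += 1
            if k >= min_match:
                matches.append((k, j))
    marked = [False] * len(b)
    for k, j in sorted(matches, reverse=True):  # longest tiles first
        if not any(marked[j:j + k]):            # keep tiles disjoint in b
            for x in range(j, j + k):
                marked[x] = True
    return sum(marked) / len(b)

# "EFGH" (4 chars) and "DEF" (3 chars) are marked in string B,
# giving 7 of 10 characters, i.e. 70% similarity.
score = tile_similarity("ABCDEFGHIJ", "ABDEFLEFGH")
```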

4 Scalable Plagiarism Detection

Our approach to plagiarism detection is to find matches with an index, then use local alignment. The approach is as follows. First, we tokenise all assignment submissions into a format suitable for indexing. Second, we construct an index of all assignment submissions. Third, we query assignment

8 http://www.jplag.de


1. #include
2. int main(void) {
3.   int var;
4.   for (var=0; var
