A Theory of Uncheatable Program Plagiarism Detection and Its Practical Implementation

Xin Chen, Ming Li, Brian Mckinnon, Amit Seker*
University of California, Santa Barbara

May 5, 2002

* Bioinformatics Lab, Computer Science Department, University of California, Santa Barbara, CA 99106, USA. Contact author email: [email protected]. Work partially supported by NSF ITR grant 0085801 and an NSF REU grant.

Abstract

This paper introduces a metric to measure the degree to which two computer programs are similar, for the purpose of plagiarism detection. The similarity metric is based on Kolmogorov complexity [8] and measures the amount of shared information between two programs. The measure is universal and hence, in theory, not cheatable. Although the metric is not computable, we have designed and implemented a system, SID (Software Integrity Diagnosis system), that approximates it. Experimental results are given to demonstrate the robustness of SID. The SID server is online at http://dna.cs.ucsb.edu/SID/. This research naturally generalizes to other domains.

1 Introduction

A plagiarized program is defined by Parker and Hamblen [11] as a program that has been produced from another program with a small number of text edit operations and without any detailed understanding of the program. Plagiarism is a prevailing problem in university courses with programming assignments, and detecting it is a tedious and challenging task for university instructors. A good software tool helps instructors safeguard the quality of education. More generally, the methodology developed here has other applications, such as detecting internet plagiarism.

Many program plagiarism detection systems have been developed previously [1, 6, 12, 15]. Based on which characteristic properties they employ in their comparisons, these systems can be roughly grouped into two categories: attribute-counting systems and structure-metric systems. A simple attribute-counting system [10] only counts the number of unique operators, unique operands, and their occurrences, and then constructs a profile for each program. A structure-metric system instead extracts and compares representations of the program structures; it therefore gives an improved measure of similarity and is a more effective practical technique for detecting program plagiarism [13]. Widely used systems such as Plague [12], MOSS [1], SIM [6] and the YAP family [15] are all structure-metric systems. Such a system usually consists of two phases: the first phase uses a lexical analyzer to convert source code into token sequences; the second phase compares those token sequences.

A basic problem underlying the second phase of a structure-metric system is how to measure the similarity of a pair of token sequences. An inappropriate metric lets some plagiarisms go unnoticed, and an openly published non-universal [8] metric can always be cheated. For example, the MOSS designers [1] avoided stating their similarity measuring strategy openly on the MOSS website, fearing that cheaters would immediately learn how to beat the system. Wise [14] presents three properties an algorithm measuring program similarity must satisfy: (a) each token in either string is counted at most one time; (b) transposed code segments should have a minimal effect on the resulting similarity score;


and (c) the score must degrade gracefully in the presence of random insertions or deletions of tokens. In fact these three criteria are far from sufficient. Many other manipulations should also have a minimal effect: duplicated blocks, almost-duplicated blocks, insertion of large irrelevant blocks, and so on, all of which defeat Wise's items (b) and (c). It is not that Wise was not wise enough to enumerate all the cases; in theory, the cases are simply not enumerable.

We take a different approach and step back. We look at an information-based metric that measures the amount of information shared between two sequences, for any type of sequence: DNA sequences, English documents, or, for the purpose of this paper, programs. Such a measure is based on Kolmogorov complexity [8], and it is universal. The universality guarantees that if there is similarity between two sequences under any computable similarity metric, our measure will detect it. Although this measure is not computable, in this paper we design and implement an efficient system, SID, that approximately computes this metric score (thus SID may also be read as Shared Information Distance). These are detailed in Section 3 and Section 4.1, respectively. In Section 2 we first survey related work, and in Section 4 we introduce our plagiarism detection system SID. Experimental results and discussion are then given in the last two sections.

2 Related work

This section surveys several plagiarism detection systems.

2.1 Attribute counting

The earliest attribute-counting system [10] used Halstead's software science metrics to measure the level of similarity between program pairs. Four simple program statistics are collected:

η1 = number of unique operators, η2 = number of unique operands, N1 = number of operator occurrences, N2 = number of operand occurrences.

Two metrics are then calculated from these counts:

V = (N1 + N2) log2(η1 + η2)    and    E = η1 N2 (N1 + N2) log2(η1 + η2) / (2 η2).

Whale [13] has demonstrated that systems based on such attribute counters are incapable of detecting sufficiently similar programs.
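As a rough illustration (not taken from [10]), the sketch below computes these four counts and the derived V and E values from operator and operand token lists; how tokens are classified into operators and operands is language-specific and omitted here.

import math

def halstead_metrics(operators, operands):
    # operators/operands: flat lists of tokens already classified by a lexer;
    # the classification rules are language-specific and not shown.
    eta1, eta2 = len(set(operators)), len(set(operands))  # unique counts
    N1, N2 = len(operators), len(operands)                # total occurrences
    V = (N1 + N2) * math.log2(eta1 + eta2)                # program volume
    E = eta1 * N2 * (N1 + N2) * math.log2(eta1 + eta2) / (2 * eta2)  # effort
    return V, E

# Two programs with similar counts produce similar (V, E) profiles even
# when their structure differs, which is the weakness noted above.
print(halstead_metrics(['=', '+', '*', '=', '+'], ['a', 'b', 'c', 'a', '2', 'b']))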

2.2 MOSS

Details of the algorithm used by the MOSS system [1] are not published, in order to ensure that it cannot be circumvented, but it is believed to be a tokenization procedure followed by fast substring matching. Experimental results show that pairs of files are ranked by the size of their matched tokens. This metric assumes that the more tokens match, the more likely the pair is plagiarized, but it may be problematic when a pair of large files is compared against a pair of small ones.


2.3 YAP

A similarity score used in the YAP system [15] is a value from 0 to 100, called percent-match, representing the range from "no match" to "complete match". It is obtained by the formulas

Match = (same - diff)/minfile - (maxfile - minfile)/maxfile

PercentMatch = max(0, Match) × 100

where maxfile and minfile are the lengths of the larger and smaller of the two files, respectively, "same" is the number of tokens common to the two files, and "diff" is the number of single-line differences within blocks of matching tokens. The algorithm used to compare two token sequences is essentially the Greedy String Tiling algorithm with Karp-Rabin matching. Its aim is to find a maximal set of common contiguous substrings, each as long as possible, such that no substring covers a token already used by another substring.
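A minimal sketch of the percent-match computation as reconstructed from the description above; the variable names are ours, and the token-matching step that produces "same" and "diff" is assumed to have run already.

def percent_match(same, diff, len_a, len_b):
    # same: tokens common to the two files; diff: single-line differences
    # inside matching blocks; len_a/len_b: the two file lengths in tokens.
    max_file, min_file = max(len_a, len_b), min(len_a, len_b)
    match = (same - diff) / min_file - (max_file - min_file) / max_file
    return max(0.0, match) * 100

# A pair sharing most of its tokens scores close to 100, while unrelated
# files of very different size score near 0.
print(percent_match(same=450, diff=20, len_a=500, len_b=520))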

2.4 SIM

The SIM plagiarism detection system [6] compares token sequences using a dynamic programming string alignment technique. The technique first assigns a score to each pair of characters in an alignment; for example, a match scores 1, a mismatch scores -1, and a gap scores -2. The score of an alignment is computed from these pairwise scores, and the score between two sequences is then defined to be the maximum over all alignments, which is easily calculated via dynamic programming. With this definition, a similarity measure between two sequences s and t is defined as follows:

sim(s, t) = 2 · score(s, t) / (score(s, s) + score(t, t)),

which is a normalized value between 0 and 1. It is known that this metric can be badly misled because the dynamic programming technique cannot handle transposed code segments.
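The following sketch illustrates the dynamic-programming alignment idea with the example scoring scheme above (match 1, mismatch -1, gap -2) and the normalized similarity; it is a generic global-alignment routine, not SIM's actual implementation.

def align_score(s, t, match=1, mismatch=-1, gap=-2):
    # dp[i][j] = best score aligning s[:i] with t[:j]
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # match or mismatch
                           dp[i - 1][j] + gap,       # gap in t
                           dp[i][j - 1] + gap)       # gap in s
    return dp[n][m]

def sim_similarity(s, t):
    # normalized similarity: 2*score(s,t) / (score(s,s) + score(t,t))
    return 2 * align_score(s, t) / (align_score(s, s) + align_score(t, t))

print(sim_similarity("abcdefg", "abcxefg"))  # high for near-identical sequences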

3 Shared information

An information-based sequence distance for measuring the similarity between sequence pairs was first proposed in [7, 4], and has been successfully applied to the construction of whole-genome phylogenies [7], chain letter evolutionary history [3], and language classification [2, 5] (the application to language classification was first investigated by the authors of [2], using a similar but asymmetric measure). Its definition is based on Kolmogorov complexity, or algorithmic entropy [8]. The Kolmogorov complexity K(s) of a string s measures the absolute information content of s; in other words, K(s) is the length of the shortest program that prints s on empty input. Given another sequence t, the conditional Kolmogorov complexity K(s|t) measures the amount of information in s given t for free; in other words, K(s|t) is the length of the shortest program that prints s on input t. We refer the reader to [8] for formal definitions and other applications of Kolmogorov complexity. An information-based sequence distance was then defined as

d(s, t) = 1 - (K(s) - K(s|t)) / K(st)

where the numerator K(s) - K(s|t) is the amount of information that t knows about s. A deep theorem in Kolmogorov complexity states that K(s) - K(s|t) ≈ K(t) - K(t|s); that is, the amount of information t knows about s is the same as the amount of information s knows about t. This quantity is also called the mutual information between s and t. The denominator K(st) is the total amount of information in the concatenated string st.


The normalization scales the distance to the range 0 to 1. While the mutual information is not a distance measure and does not satisfy the triangle inequality, the distance d(s, t) is a well-defined distance measure: in [7] we have shown that d(s, t) satisfies the standard distance function conditions. Furthermore, in [7] (and [5]) we have proved a "universality" property for this distance, in the sense that for any other computable distance function D (satisfying some very general conditions) and for all strings s, t,

d(s, t) ≤ 2 D(s, t) + O(1).

In the context of program plagiarism detection, this can be interpreted as follows: if two programs are "similar" under any computable measure, then they are similar under our measure d. Thus, at least in theory, a software plagiarism system built on this measure is not cheatable. For example, all three items stated by Wise [14] in Section 1, such as transposed code segments, and those not stated by Wise, such as segment duplication, are special cases handled by our new metric.

4 SID: Software Integrity Diagnosis system

Using the sequence distance defined in the previous section, a similarity metric for two program token sequences s and t can be simply given by

R(s, t) = (K(s) - K(s|t)) / K(st).

When R(s, t) = 0, the two strings are completely independent; when R(s, t) = 1, the two strings are identical. Values in between represent partial matches. Unfortunately, the Kolmogorov complexity of an object is not computable [8]: it requires an ultimate compression of a string. Therefore, we resort to a compression algorithm to heuristically approximate Kolmogorov complexity. That is, we compute R(s, t) as follows:

R(s, t) ≈ (Comp(s) - Comp(s|t)) / Comp(st)

where Comp(·) and Comp(·|·) denote the size of the compressed string and of the conditionally compressed string, respectively, produced by a compression program. The better the compression, the closer we are to the true Kolmogorov complexity, and hence the more accurately we determine the similarity between two strings. Note that the Karp-Rabin Greedy String Tiling method used in YAP3 can now be considered a very special case of R(s, t); likewise, the three problems [14] that plague most other metrics are special cases handled by R(s, t).
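For intuition, the sketch below approximates R(s, t) with an off-the-shelf compressor. Here zlib stands in for the paper's TokenCompress, and the conditional size Comp(s|t) is approximated by Comp(ts) - Comp(t); both choices are our assumptions for illustration, not SID's actual method.

import zlib

def comp(data: bytes) -> int:
    # compressed size of `data`; zlib is only a rough stand-in for TokenCompress
    return len(zlib.compress(data, 9))

def similarity_R(s: bytes, t: bytes) -> float:
    # R(s,t) = (K(s) - K(s|t)) / K(st), with K approximated by compressed sizes.
    # Comp(s|t) is approximated as Comp(t+s) - Comp(t), i.e. the extra cost of
    # compressing s once t has been seen (an assumption, not the paper's method).
    comp_s = comp(s)
    comp_s_given_t = comp(t + s) - comp(t)
    comp_st = comp(s + t)
    return (comp_s - comp_s_given_t) / comp_st

a = b"int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }" * 4
b_ = b"int gcd(int x, int y) { return y ? gcd(y, x % y) : x; }" * 4
print(similarity_R(a, b_))                      # near-identical programs: high score
print(similarity_R(a, bytes(range(256)) * 2))   # unrelated data: low score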

4.1 TokenCompress algorithm

In order to achieve a reasonable approximation to the R(s, t) function, we designed the TokenCompress algorithm. The problem of optimally compressing a string has been extensively studied, and it is well known that optimal compression is not computable [8]. In practice, the Lempel-Ziv (LZ) data compression algorithm [16] is widely used in commercial file-compression software such as GZIP and compress. TokenCompress also follows the Lempel-Ziv compression scheme: it first finds the longest (approximately) duplicated substring, and then encodes it as a pointer to its previous occurrence instead of outputting the substring itself. However, our implementation of the LZ compression scheme differs from existing software in several respects, owing to the special nature of the token sequences we compress.

First, a typical LZ-type algorithm may miss many long duplicated substrings, because only a part of the encoded substrings is added to the look-up dictionary (or a sliding window of a few kilobytes) built during the encoding process. This keeps encoding time and memory usage low, at the cost of slightly worse compression on common English text. For program token sequences, however, such an algorithm gives inferior compression and may deteriorate the sensitivity of the similarity comparison, especially when two identical program sections are separated by a large block of code. Therefore, TokenCompress considers all previously seen substrings when searching for an optimal match. Second, LZ-type algorithms cannot handle approximate matches. Our compression program searches for approximately duplicated substrings and encodes the mismatches whenever doing so benefits the compression ratio; several duplicated sections may thus be combined into one larger section with a few mismatches. The algorithm TokenCompress is briefly described below. Similarly matched code segment pairs (i.e., repeat pairs) are recorded in a file and later presented on a result-viewing webpage. Several simple experiments also show that TokenCompress significantly outperforms the commercial compression programs GZIP and compress on token sequences; see Table 1.

                 size   Compress   Gzip   TokenCompress
token seq. 1      651        303    161             133
token seq. 2     1583        680    517             475
token seq. 3     4850       1888   1241            1202

Table 1: Compression results of three token sequences.

Algorithm TokenCompress
Input: a token sequence s
Output: Comp(s) and the matched repeat pairs

    i = 0; B = an empty buffer;
    while (i < |s|)
        p = FindRepeatPair(i);
        if (p.compressProfit > 0)
            EncodeRepeatPair(p, compFile);
            i = i + p.length;
            OutputRepeatPair(p, repFile);
        else
            AppendCharToBuffer(s_i, B);
            i = i + 1;
    EncodeLiteralZone(B, compFile);
    return Comp(s) = compFile.fileSize and the repeat pairs stored in repFile.
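A simplified, exact-match-only sketch of the greedy scheme above: it searches the whole already-seen prefix (no sliding window) for the longest earlier repeat and charges a fixed cost per pointer or literal. The real TokenCompress additionally merges nearby repeats with a few mismatches, and its cost model differs.

def token_compress_size(tokens, min_len=3):
    # Greedy LZ-style size estimate: at each position, search all earlier
    # positions for the longest exact repeat and encode it as a (pos, len)
    # pointer; otherwise emit the token as a literal. Costs are arbitrary
    # units chosen for this sketch only.
    POINTER_COST, LITERAL_COST = 2, 1
    i, size, repeats = 0, 0, []
    while i < len(tokens):
        best_len, best_pos = 0, -1
        for start in range(i):                       # all previously seen positions
            l = 0
            while (i + l < len(tokens) and start + l < i
                   and tokens[start + l] == tokens[i + l]):
                l += 1
            if l > best_len:
                best_len, best_pos = l, start
        if best_len >= min_len:                      # profitable repeat pair
            repeats.append((best_pos, i, best_len))
            size += POINTER_COST
            i += best_len
        else:                                        # literal token
            size += LITERAL_COST
            i += 1
    return size, repeats

seq = list("ABCDEFG") * 3 + list("XYZ")
print(token_compress_size(seq))   # two repeat pairs found, small size estimate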

4.2 System integration

Like many other systems, such as YAP and SIM, SID works in two phases. In the first phase, source programs are parsed into token sequences by a lexical analyzer. What distinguishes SID from other systems is the algorithm used in the second phase, where the TokenCompress algorithm computes the shared information measure R(s, t) between each program pair within the corpus submitted by the students. Finally, all program pairs are ranked by their similarity scores: the higher the score, the greater the likelihood of plagiarism. Instructors can then compare program source codes side by side using the SID GUI; the parts that SID considers similar are colored red.

Figure 1: Block diagram of SID's design. Source files 1 through n are parsed into token sequences, which the compressor turns into a similarity matrix.

A block diagram of the design of SID is shown in Figure 1. The SID system is implemented in Java. Instructors can submit all programs at once by browsing to and selecting the folder containing the files on their local computer through an applet window in a web browser. Although the current version of SID only accepts Java programs as input, it can compare programs in any other language once a parser for that language is incorporated into the SID system. Unlike MOSS, which avoids publicizing its similarity measure on its website, the shared information measure is stated precisely on the SID website.
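The second phase can be summarized by the following sketch, which scores every pair in a corpus with some similarity function (for instance an approximation of R(s, t)) and ranks the pairs from most to least suspicious; the function and dictionary names here are ours, not SID's.

from itertools import combinations

def rank_pairs(token_seqs, similarity):
    # token_seqs: {program_name: token_sequence}; similarity: a function of two
    # sequences, e.g. an R(s,t) approximation. Returns pairs sorted by score,
    # highest (most suspicious) first.
    scores = []
    for (name_a, s), (name_b, t) in combinations(token_seqs.items(), 2):
        scores.append((similarity(s, t), name_a, name_b))
    return sorted(scores, reverse=True)

# Hypothetical usage with the zlib-based similarity_R sketch from earlier:
# corpus = {"alice.java": tokens_a, "bob.java": tokens_b, ...}
# for score, a, b in rank_pairs(corpus, similarity_R)[:5]:
#     print(f"{a} vs {b}: {score:.3f}")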

5 Experiments

Gitchell and Tran [6] gave a simple but well-designed sample that can be used to test the reliability of similarity metrics. In this sample, the copier changed all variable and function names, removed comments, inverted the order of adjacent statements whenever possible, and permuted the order of the functions from one program to another. We submitted this program pair to SID; the result is presented in Table 2.

Similarity score   Original copy   Modified copy
Original copy      0               0.4031
Modified copy      0.3969          0

Table 2: Similarity measure scores of a program pair. In theory, the two off-diagonal scores should be approximately equal, since the metric is symmetric.

As expected from the theory, the two similarity scores output by SID are approximately equal, i.e. R(s, t) ≈ R(t, s) ≈ 0.4. To interpret such a similarity score more intuitively, consider a program pair s and t, and assume that a fraction p of the program code is copied exactly between s and t. Then

R(s, t) = p / (2 - p).
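This relation follows from a back-of-the-envelope argument, sketched below under the idealized assumptions that the copied part is incompressible and that both programs carry roughly the same amount of information K (an informal aid to intuition, not a formal claim of the analysis above):

% Idealized derivation of R(s,t) = p/(2-p) for two equally sized programs
% sharing a verbatim fraction p (assumptions stated in the text above).
\begin{align*}
K(s) - K(s \mid t) &\approx pK, &
K(st) &\approx K + (1-p)K = (2-p)K,\\
R(s,t) &\approx \frac{pK}{(2-p)K} = \frac{p}{2-p},
& p &= \frac{2\,R(s,t)}{1 + R(s,t)}.
\end{align*}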

Thus, a score of R(s, t) = 0.4 in the above example implies that p = 57.14% of the program code is identical when s and t are of the same size, which warrants further examination of this program pair by the instructor. Notice that, due to compression overhead, an R(s, t) value such as 0.4 should be considered rather high: when this scoring method was applied to biological sequences, genomes from sister species often had values around 0.4 [7].

Another program pair found in our experiments further demonstrates the advantage of SID. One plagiarized source file from a student was submitted with parts of its code segments duplicated; the simplest such case is a string s compared to ss. The similarity between them is usually underestimated by most normalized similarity metrics. For example, JPlag [9] gives a score of only 66.8%, whereas SID gives a score of 92.5% (the remaining gap is encoding overhead).

Figure 2: A similar code segment detected by SID.

The approximate matching used in the TokenCompress algorithm allows larger similar code segments to be shown on the result-viewing page of SID. This feature guards against trivial modifications, such as the random insertion of irrelevant statements like int x = 1 using an unused variable x, a trick that defeats MOSS. Note that it is also useless to insert many x = 1 statements in a row, since this does not increase the amount of information as far as the compression algorithm is concerned. A practical example is given in Figure 2, where the highlighted similar code segments are detected by neither JPlag [9] nor MOSS [1].

We then used a large collection of homework assignments submitted by 71 undergraduate students as input to SID to test the proposed similarity metric. The five most similar program pairs are shown in Table 3 with their similarity scores and percentage values p. Further examination of the source files confirmed these findings. Note that since cheng and joe00 both copied from max, the pair (cheng, joe00) also has a high similarity score (ranked 12th, with R = 0.1987 and p = 32.98%).

6 Discussion

We have implemented SID together with a web service to detect program plagiarism online. The system is based on a new similarity metric that measures the normalized amount of shared information between two programs. The measure is universal, and hence theoretically not cheatable. In practice the measure is not computable, so SID uses a compression program to heuristically approximate Kolmogorov complexity. In its practical implementation, SID not only satisfies the three properties that an algorithm measuring sequence similarity in this domain must hold [14], but also finds many other, more subtle similarities. The robustness of SID was demonstrated by the experiments.

It is important to point out that the methodology advocated in this paper has high potential to be extended to other areas. It has already been successfully applied to genome-genome comparison [7], English document comparison [3], and language classification [5, 2].

assignment 1    assignment 2    R(s,t)   R(t,s)   p
coreanpsycho    lastplayer98    0.2598   0.2500   41.24%
charishoo       H9aznlite       0.2408   0.2434   38.81%
cheng           max             0.2116   0.1968   34.93%
joe00           max             0.2062   0.2017   34.19%
layzie310       phoongod        0.2023   0.1982   33.65%

Table 3: Similarity measure scores of the five most similar program pairs.

References

[1] A. Aiken. Measure of software similarity. URL http://www.cs.berkeley.edu/aiken/moss.html.
[2] D. Benedetto, E. Caglioti, and V. Loreto. Language trees and zipping. Physical Review Letters, 88:4 (2002).
[3] C. Bennett, M. Li and B. Ma. Linking chain letters. Accepted by Scientific American, 2000.
[4] X. Chen, S. Kwong and M. Li. A compression algorithm for DNA sequences and its applications in genome comparison. In Proc. of the 10th Workshop on Genome Informatics, pp. 52-61, 1999.
[5] X. Chen, M. Li, X. Li, B. Ma, P. Vitanyi. Similarity distance and phylogeny. Manuscript, 2002.
[6] D. Gitchell and N. Tran. A utility for detecting similarity in computer programs. Proceedings of the 30th SIGCSE Technical Symposium, New Orleans, USA, pp. 266-270, 1998.
[7] M. Li, J. Badger, X. Chen, S. Kwong, P. Kearney and H. Zhang. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics, 17:2 (2001), 149-154.
[8] M. Li and P. Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications. 2nd Ed., Springer, New York, 1997.
[9] G. Malpohl. JPlag: detecting software plagiarism. URL http://www.ipd.uka.de:2222/index.html.
[10] K. Ottenstein. An algorithmic approach to the detection and prevention of plagiarism. SIGCSE Bulletin, 8:4 (1977), 30-41.
[11] A. Parker and J. Hamblen. Computer algorithms for plagiarism detection. IEEE Transactions on Education, 32:2 (1989).
[12] G. Whale. Plague: plagiarism detection using program structure. Dept. of Computer Science Technical Report 8805, University of NSW, Kensington, Australia, 1988.
[13] G. Whale. Identification of program similarity in large populations. The Computer Journal, 33:2 (1990).
[14] M. Wise. Running Karp-Rabin matching and greedy string tiling. Department of Computer Science Technical Report, Sydney University, 1994.

[15] M. Wise. YAP3: improved detection of similarities in computer program and other texts. Proceedings of the 27th SIGCSE Technical Symposium, Philadelphia, USA, pp. 130-134, 1996.
[16] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory, IT-23, 337-343, 1977.

