A Coding Style-based Plagiarism Detection S. Arabyarmohamady, H. Moradi, M. Asadpour Advanced Robotics and Intelligent Systems Laboratory & Control and Intelligent Processing Center School of Electrical and Computer Engineering University of Tehran Abstract—In this paper a plagiarism detection framework based on coding style is proposed, and the typical style-based approach is improved to better detect plagiarism in programming code. Detection is performed in two phases: in the first phase the main features representing a coding style are extracted; in the second phase the extracted features are used by three modules to detect plagiarized codes and to determine the givers and takers of the codes. The extracted features of each code developer are kept in a history log, i.e. a user profile representing his/her coding style, and are used to detect changes in that style. The user profile allows the system to check whether a code was truly developed by the claimed developer or was written by another person with a different style. Furthermore, the user profile allows the giver and taker to be determined when two codes are similar, by comparing the codes' styles with the programmers' styles. Also, if a code is copied from the internet or developed by a third party, the person claiming ownership of the code is normally less proficient in coding than the actual author; this difference in style levels is detected by the style level checker module of the proposed framework. The proposed framework has been implemented and tested, and the results are comparable to those of Moss in detecting plagiarized codes. Keywords—plagiarism detection; author identification; software forensics; source code
I. INTRODUCTION
With the widespread use of portable media and the access the internet provides, the software community faces a growing challenge of copied and plagiarized code. For instance, students' programming assignments are designed to build their coding skills and to measure their level of proficiency. With the ease of sharing and copying code, however, these goals cannot be achieved as plagiarism increases. The problem is especially noticeable in online education systems, in which students lack any direct contact with teachers: since proctored exams are not used, it is harder to determine whether or not a student has actually reached a certain level of proficiency in coding. An early survey of 380 bachelor students by Haines, Diekhoff, Labeff and Clark in 1986 showed that although half of the students cheated, only 1.3% of them were caught [1]. Recent surveys show that plagiarism among students has increased over the past two decades due to the ease of accessing and sharing answers to homework and assignments.
It is interesting to mention that in another study, by Ashworth, Bannister and Thorne in 1997, the main reason reported for plagiarism among students was lack of time for doing homework [2]. Furthermore, plagiarism occurs mostly in homework and exercises, where instructors exert little direct control. Identification of copied assignments and homework is therefore a vital need in educational systems, and it has ignited a line of research on detecting plagiarism, both to deter students from submitting unoriginal work and to punish those who do. Obviously, as assignments grow larger and the number of students increases, manual assignment comparison is not a viable solution. Consequently, there have been attempts to develop algorithms and systems that check assignments automatically and detect similar ones. For coding assignments, the chance of similarity between codes is higher than for natural language texts, since the syntax is more restricted. Furthermore, it is fairly easy to copy a code and change its overall look, which makes the similarity harder to detect through manual inspection or basic automatic algorithms. MOSS [3] and JPlag [4] are two systems developed to detect plagiarism in coding assignments. Although these systems are fairly successful in detecting similarity between the codes submitted to them, they are limited in two areas: a) code obfuscation and b) third-party development. Code obfuscation tools were developed to help prevent software piracy by applying semantics-preserving transformations to programs [5]; if students use such a tool to change the appearance of their code, they can defeat these detection systems. Third-party code development occurs when a piece of code is copied from the internet or written specifically by a third party.
In the first case, keeping a huge database of the code available online may help determine plagiarism. The second case, i.e. hiring a person (paid or unpaid) to code on behalf of the programmer, is undetectable by comparison-based systems, since there is no original code to compare against. In this paper, a style-based framework for plagiarism detection is introduced in which the coding style of each student is extracted and compared with those of other students. Similarity between styles, and change of a student's style over several assignments, raise the possibility of plagiarism. The proposed approach is capable of detecting the person who shares his/her document. It can also distinguish between the coding styles of
978-1-4673-4923-9/12/$31.00 ©2012 IEEE November 6-8, 2012, Amman, Jordan 2012 International Conference on Interactive Mobile and Computer Aided Learning (IMCL) Page 180
professional programmers and beginners. This is done by creating two style classes, i.e. a professional class and a beginner class, and categorizing a given code into one of them. If this categorization differs from the history of the programmer, the given code is flagged as possibly illegitimate. From now on, the person who shares his/her document is called the giver and the person who receives or copies it is called the taker; the term document is used for a source code throughout this paper. The paper is organized as follows. In section 2 the related work in the area of plagiarism detection is reviewed. In section 3 the proposed approach is explained, and the results are presented in section 4. Finally, in section 5, conclusions and ideas for extending this method are presented. II.
RELATED WORK
The research toward detecting plagiarized documents, which can be code, essays or any other assignment, has gone in two directions: a) detecting the similarity between two documents [6], [3], and b) stealthily signing each document with a unique signature, i.e. a watermark [7]-[10], that can be detected later if the document is used by another person. In the latter, a hidden watermark is saved in the document which includes unique characteristics of the owner, such as a student's ID in the case of a school assignment, and characteristics of the document, such as the assignment number and date. Whenever a person copies the document, the hidden watermark is copied along with it. The advantage of this approach is that it can detect both the giver and the taker of a document. The main issue is to design the watermark such that changing the visible areas of the document does not affect it, so that it can still be detected later. Furthermore, the user should not be able to detect the watermark, so that he/she cannot manipulate it to deceive the system and avoid plagiarism detection. Finally, the original document should be electronically signed so that the watermark is placed in the document before the user starts changing it. In the first approach, i.e. similarity detection, two methods are used: attribute counting and word/token sequence analysis. Both try to determine a fingerprint for each document and match documents with the same or close fingerprints. In the attribute counting method, the numbers of certain attributes are counted; the closer the counts, the higher the chance that the documents are similar and copied. In the sequence analysis method, a structure metric is defined and computed for each document such that the closer the metrics, the higher the chance of having copied documents.
In attribute counting, attributes such as the number of lines, the number of loops, and the number of functions are used. Since the counts of these attributes can be close even if the codes are
written by two different programmers, the approach may falsely mark programs as copied. In systems based on sequence analysis, the documents' global structures are compared by deriving n-grams of words or tokens; an n-gram is a stretch of text n words long, and the information in the n-grams characterizes the text. This method has a lower probability of falsely marking two documents as similar, since it captures the owner's style by determining the structure of the document. However, deriving and storing n-grams and token streams is time consuming and costly. Examples of such systems are Moss and JPlag. Moss (Measure Of Software Similarity) uses a string matching algorithm to divide source programs into n-grams, hashes them, and then selects a subset of these hashes as fingerprints. JPlag is a token-based system that is freely available on the internet; it outputs the similarity between each pair of programs. The major limitation of JPlag is that it requires parsing the dataset: if a program fails to parse, it is omitted from the dataset. Another method of similarity detection through sequence analysis is the Static Execution Tree (S.E.T.) [11]. An S.E.T. is a representation, built by parsing the source program, that displays the interconnection between the main program body and all procedures. After building a tree, it is compared to the trees built for other programs, and similar trees are detected. The main disadvantage of this approach is that it applies to program-based documents only. Another line of study concerning the plagiarism problem is authorship analysis, i.e. determining the owner of a code [12]-[14]. In the author identification approach, a profile is created for each author based on all the programs which he/she has written. The profile can be created from the n-grams or the attribute counts of the written programs.
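The n-gram fingerprinting idea behind Moss can be sketched as follows. This is a simplified illustration only: whitespace tokenization and keeping every k-th hash stand in for the real lexer and the winnowing selection that Moss proper uses.

```python
import hashlib

def token_ngrams(tokens, n=5):
    """Slide a window of n tokens across the token stream."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def fingerprint(source, n=5, keep_every=4):
    """Hash every n-gram and keep a sparse, sorted subset as the
    document fingerprint. Simplified stand-in for Moss's winnowing."""
    tokens = source.split()  # a real tool would use a proper lexer
    hashes = sorted(
        int(hashlib.md5(" ".join(g).encode()).hexdigest(), 16) % 10**8
        for g in token_ngrams(tokens, n)
    )
    return set(hashes[::keep_every])

def similarity(fp_a, fp_b):
    """Jaccard overlap between two fingerprints."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)
```

Two copies of a program that differ only in cosmetic whitespace share most of their n-grams, so their fingerprints overlap heavily and the similarity score approaches 1.0.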
In the identification phase, the profile of the given document is compared to the set of available profiles, and the author of the closest profile is selected as the author of the document. In this research, the attribute counting method is improved by including more attributes to better detect plagiarized documents. Furthermore, a giver-taker module is proposed to determine the giver and taker of a document. Finally, a style level checker module is developed to determine the style level of a code and compare it with the programmer's style level to judge the legitimacy of the code. III.
THE PROPOSED FRAMEWORK
The plagiarism detection algorithm proposed in this paper uses pattern recognition methods based on attribute counting for programs. Furthermore, a profile of each developer is built for future use. The developer profile is used to determine the consistency of a developer's coding style, which can be used to predict the possibility of plagiarism. Moreover, it can be used to determine the giver and taker of copied codes by matching the style of the giver with his/her profile.
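The profile-based consistency check can be sketched as follows. The running list of vectors, the mean profile, and the threshold value are illustrative simplifications, not the paper's exact procedure.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

class DeveloperProfile:
    """Running record of one developer's style vectors.
    The threshold here is an illustrative placeholder."""

    def __init__(self, threshold=1.0):
        self.vectors = []
        self.threshold = threshold

    def mean(self):
        """Component-wise mean of the stored vectors."""
        return [sum(col) / len(self.vectors) for col in zip(*self.vectors)]

    def submit(self, vec):
        """Return True if vec is consistent with the stored style.
        The first submission is always accepted; later ones are
        compared to the mean profile."""
        if not self.vectors:
            self.vectors.append(vec)
            return True
        consistent = euclidean(vec, self.mean()) <= self.threshold
        if consistent:
            self.vectors.append(vec)
        return consistent
```

An inconsistent submission (one far from the developer's historical mean) is exactly the "change in style" signal the framework treats as a plagiarism hint.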
Figure 1. General framework of the proposed method
To determine plagiarized codes, each code is compared to all other codes submitted to the system and also to its owner's style, if one is available. In the comparison with others, when the similarity rate exceeds a given threshold, the code is marked as a potential candidate for plagiarism. The main point in the self comparison is that every programmer has a specific style, which is normally followed in all codes he/she develops. Therefore, given the style found from previous codes, the authenticity of every new code can be assessed: a change in style hints at the possibility of plagiarism. Fig. 1 shows the general framework of the proposed plagiarism detection system. The process is based on five modules. In the feature extraction module, the features are detected and normalized to create the feature vector for each document. In the similarity clustering module, the feature vectors of submitted codes are compared, and similar codes are sent to the next module to determine the giver and taker. The feature vector, i.e. the output of the feature extraction module, is also compared to the profile of the code's developer; if it is similar, it is added to the current profile. If there is no profile, i.e. it is the first time a developer submits a code to the system, a profile is created for him/her and used in the future. The giver/taker module uses the output of the similarity clustering module and the developer profiles to determine the giver and taker of similar codes. In the case that the developer profiles are very similar to each other, there is no way to determine the giver and taker through their styles. The style checker module is designed to determine whether a code was taken from the web or written by a third party (paid or favored). This module is mainly useful when the style of a code does not match any other code submitted for checking, which happens when a developer has not copied from another developer in the group but has copied from the web or asked another person to write the code. In such a case, the style of the submitted code is usually more proficient than the style of the developer who submitted it. Consequently, the submitted code is passed through the style level checker, which compares it with a trained dataset; the profile of the user is also passed through the style checker to determine his/her level of proficiency in coding. If the results of the two checks differ, there is a possibility of plagiarism. The details of each module are discussed in the following.
A. Feature Extraction Module
In this module, a feature vector is derived for each document under investigation. The derived feature vector is considered the fingerprint of the document and is used throughout the investigation. In general, a feature vector should represent the general structure of a document as well as its easily overlooked characteristics. For instance, in a C/C++ program, the general structure consists of features such as the number of loops, the number of functions, and the number of
includes. Furthermore, a set of overlooked features, such as the number of spaces after each line, the average length of variable names, and the spacing between a variable and the assignment operator, are among those that normally remain unchanged during the plagiarism process.
Thus, the result of the feature extraction module is a feature vector for each program file. This vector is normalized so that programs of different sizes remain comparable.
The idea behind employing the overlooked features comes from typical behavior during the copying process. For example, if a student copies a programming assignment, he/she possibly applies the following changes to evade plagiarism detection:
• Changes variables' names
• Relocates code blocks
• Transforms loop structures into equivalent forms
• Changes or deletes comments
• Replaces conditional and control structures with similar commands
As can be seen, the major changes involve the general structure, and the copier overlooks the details that can distinguish one code from a similar one. Consequently, if two programs are written by two different people, the codes should differ in these details; conversely, if a program is an altered version of another, the details remain the same or similar. A few of these overlooked features are:
• Writing each command on its own line or several commands on one line, Fig. 2 (a), (b)
• Type of indentation at the beginning of lines and inside loops, Fig. 2 (c), (d)
• Spacing at the end of lines, Fig. 2 (e), (f)
• Placing line comments or block comments before functions or on each line, Fig. 2 (g), (h)
• Placing characters such as '+', '=', ')' with one space before and after, or without spaces, Fig. 2 (i), (j)
• Placing the block delimiters (brackets) on a line of their own or at the end of a command, Fig. 2 (k), (l)
• Use of special marks such as '_' in variable names
• Number of empty lines
The above details are normally overlooked by students who plagiarize. Consequently, if most of them are identical or similar between two codes, the likelihood of plagiarism is high. That is why the feature vector is a combination of general and detailed features of programming style. A few of the feature vector elements covering these cases are the average length of lines, the average length of variable names, and the number of loops. It must be mentioned that, in producing these vectors, no differentiation is made between synonymous keywords, because the first action of a plagiarist is often to transform conditional and loop structures into equivalent forms.
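A sketch of a feature extractor computing a few of the general and overlooked features listed above; the exact feature set and normalization are illustrative choices rather than the paper's full vector.

```python
import re

def style_features(source):
    """Compute a small style feature vector for a source string.
    The feature set and normalization are illustrative only."""
    lines = source.splitlines()
    n = max(len(lines), 1)
    loops = len(re.findall(r"\b(?:for|while)\b", source))
    # crude identifier scan: also matches keywords, fine for a sketch
    idents = re.findall(r"\b[a-zA-Z_]\w*\b", source)
    eq = source.count("=")
    return {
        "avg_line_len": sum(len(l) for l in lines) / n,
        "avg_ident_len": sum(map(len, idents)) / len(idents) if idents else 0.0,
        "loops_per_line": loops / n,  # counts normalized by program size
        "blank_line_ratio": sum(1 for l in lines if not l.strip()) / n,
        "spaced_eq_ratio": source.count(" = ") / eq if eq else 0.0,
        "tab_indent_ratio": sum(1 for l in lines if l.startswith("\t")) / n,
    }
```

Ratios rather than raw counts make programs of different sizes comparable, mirroring the normalization step described above.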
B. Similarity Detection Module
In this module, a hierarchical clustering algorithm is used to group the feature vectors derived by the feature extraction module. The method is bottom-up, or agglomerative: each feature vector starts in its own cluster, and pairs of clusters are successively merged until all vectors lie in a single cluster and no further merge is possible. At the end, clusters whose programs have a similarity rate over a given threshold are marked as candidates for plagiarism. Since clustering is performed hierarchically, it is also possible to detect groups of cheaters. The clustering is performed by constructing a dissimilarity matrix using the Euclidean distance as the distance metric, and single-linkage clustering is used as the linkage criterion for merging. In other words, if a, b are two feature vectors and A, B are two clusters, then the inter-cluster distance is

D(A, B) = min { d(a, b) : a ∈ A, b ∈ B },

where d(a, b) is the Euclidean distance between the feature vectors a and b.
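The agglomerative single-linkage procedure described above can be sketched as follows; the stopping threshold is illustrative, and a production system would likely use an optimized library implementation.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def single_linkage_clusters(vectors, threshold):
    """Agglomerative single-linkage clustering: start with one cluster
    per vector, repeatedly merge the closest pair of clusters, and stop
    once the smallest inter-cluster distance exceeds the threshold.
    Clusters of size > 1 are the plagiarism candidates."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        # single-linkage distance: minimum pairwise member distance
        d, i, j = min(
            (euclidean(vectors[a], vectors[b]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
            for a in clusters[i] for b in clusters[j]
        )
        if d > threshold:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```

Because merges happen in order of increasing distance, the merge history forms exactly the dendrogram mentioned below, and cutting it at the threshold yields the candidate groups.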
The result of this module can be shown as a dendrogram.
C. Developer Profile Module
In this module, the feature vector of the present program is compared with the older vectors in the developer's profile. If the similarity between them is high, the vector is added to the developer's profile. If the similarity rate is lower than a given threshold, it is possible that the new code was not written by this programmer, so this developer may be the taker of the code. If the similarity to the previous vectors is high but the similarity detection module places this person in a plagiarism cluster, it may be concluded that this person is a giver.
D. Style Level Checking Module
As mentioned before, Moss and JPlag compare codes to each other. Consequently, if a given code is copied from the internet or provided by a third party, these systems may fail to mark it as plagiarized. Although it is possible to keep a large database of the codes available on the web, similar to the way Turnitin works, this would require a huge database and the detection process would take a long time. To resolve this issue, the proposed method has been trained to classify codes into advanced and beginner classes, i.e. whether a code was written by an advanced programmer or by a naïve
Figure 2. Sample codes showing the effect of the proposed style features: (a) and (b) command placement; (c) and (d) indentation; (e) and (f) spacing at the end of lines; (g) and (h) line comments vs. block comments; (i) and (j) placing characters such as '+', '=', ')' with or without surrounding spaces; (k) and (l) placement of '{' and '}'.
programmer. The training data has been collected from professional programmers developing large programs, mainly from sourceforge.net, since the contributors to sourceforge.net are mainly professional programmers. The naïve-programmer samples were collected from first-year freshman students in the Electrical and Computer Engineering field. IV.
RESULTS
The system was tested on a group of 120 freshman students. The projects for the "introduction to programming" course were analyzed using the Moss system and the proposed framework. The results showed that for cases with more than 30% similarity, the system's results are comparable to Moss. The proposed system also found resemblance between codes with less than 30% similarity, which is normally ignored as possible random resemblance. These cases can be manually evaluated by the instructor or rejected using a given threshold.
Figure 3. Moss system results on the 8 given programs; N1 to N8 are the given programs and the vertical axis shows the percentage of similarity.
Considering the fast performance of the proposed method compared to Moss (one minute for every 120 students), the system can be a reliable replacement for online and fast plagiarism detection. Fig. 3 shows the result of running Moss on 8 source codes from different programmers. Moss reports the similarity between each pair of programs in a table, shown here as a chart. For example, N4 is about 60% similar to N8 but N8 is 85% similar to N4; this is because of the difference in program size. Fig. 4 shows the result of running the proposed method on the same dataset; in this case the chart is a dendrogram that shows the dependency between programs in a hierarchical clustering format. As can be seen, in both cases two clusters were distinguished, one including source codes 1, 2 and 3 and the other including source codes 4, 5, 6, 7 and 8. Using Moss, the user must visually determine these clusters, while in the proposed framework the dendrogram shows the clusters automatically. V.
CONCLUSION AND FUTURE WORK
In this paper a new framework for plagiarism detection in coding documents, based on attribute counting, is presented. The advantages of the proposed framework are: a) it is fast and can work on large volumes of data, since it creates a feature vector for each document and eliminates the need to process the whole document every time; b) it provides a method to detect the giver and taker of a code in case of plagiarism; and c) it is capable of detecting plagiarized documents in which the code is copied from the internet or provided by a third party. Future work will focus on assigning different weights to the features, since one feature can be more important than another. Furthermore, the current version of the feature extraction creates the feature vector for the whole document rather than for parts of it; consequently, it may fail to detect plagiarism when only part of a document is copied. The feature vectors therefore need to be created for blocks of code rather than for the whole code. This would further
Figure 4. Proposed system results on the same 8 programs, shown as a dendrogram.
help with detecting the areas of the code that have been copied. Finally, the developer profile module will be improved to merge feature vectors efficiently. It must be mentioned that some development tools automatically apply formatting rules to code, which decreases the importance of some elements of the feature vector and should be handled.
REFERENCES
[1] V.J. Haines, G.M. Diekhoff, E.E. Labeff and R.E. Clark, "College cheating: Immaturity, lack of commitment, and the neutralizing attitude," Research in Higher Education, 25(4):342-354, 1986.
[2] P. Ashworth, P. Bannister, and P. Thorne, "Guilty in whose eyes? University students' perceptions of cheating and plagiarism in academic work and assessment," Studies in Higher Education, 22(2):187-203, 1997.
[3] S. Schleimer, D. Wilkerson and A. Aiken, "Winnowing: Local Algorithms for Document Fingerprinting," SIGMOD 2003, San Diego, CA, USA, June 9-12, 2003.
[4] L. Prechelt, G. Malpohl and M. Philippsen, "Finding plagiarisms among a set of programs with JPlag," J. Univ. Comput. Sci., 8, 1016-1038, 2002.
[5] C. Collberg, G. Myles and M. Stepp, "Cheating Cheating Detectors," Technical Report TR04-05, 2004.
[6] C. Arwin and S.M.M. Tahaghoghi, "Plagiarism Detection across Programming Languages," 29th Australasian Computer Science Conference, Vol. 48, 2006.
[7] J. Brassil, S. Low, N. Maxemchuk and L. O'Gorman, "Electronic Marking and Identification Techniques to Discourage Document Copying," Proceedings IEEE INFOCOM, vol. 3, pp. 1278-1287, 1994.
[8] C. Daly and J. Horgan, "Patterns of plagiarism," Proceedings of the Thirty-Sixth SIGCSE Technical Symposium on Computer Science Education, pp. 383-387, 2005.
[9] C. Daly and J.M. Horgan, "Automatic Plagiarism Detection," Proceedings of the IASTED International Conference on Applied Informatics, pp. 255-259, 2001.
[10] Simon, "Electronic watermarks to help authenticate soft-copy exams," ACE '05: Proceedings of the 7th Australasian Conference on Computing Education, Volume 42, 2005.
[11] H.T. Jankowitz, "Detecting plagiarism in student Pascal programs," Computer Journal, Vol. 31, No. 1, pp. 1-8, 1988.
[12] G. Frantzeskou, S. MacDonell, E. Stamatatos, and S. Gritzalis, "Examining the Significance of High-level Programming Features in Source Code Author Classification," The Journal of Systems and Software, 81(3):447-460, Elsevier, 2008.
[13] G. Frantzeskou, E. Stamatatos, S. Gritzalis, C.E. Chaski, and B.S. Howald, "Identifying Authorship by Byte-Level N-Grams: The Source Code Author Profile (SCAP) Method," Int. Journal of Digital Evidence, 6(1), 2007.
[14] J. Hope, "The Authorship of Shakespeare's Plays," Cambridge University Press, Cambridge, 1994.
[15] A. Jadalla and A. Elnagar, "PDE4Java: Plagiarism Detection Engine for Java source code: a clustering approach," IJBIDM, 3(2):121-135, 2008.
[16] S. Mann and Z. Frew, "Similarity and originality in code: plagiarism and normal variation in student assignments," Proceedings of the 8th Australasian Conference on Computing Education, Volume 52, 2006.
[17] L. Moussiades and A. Vakali, "PDetect: A Clustering Approach for Detecting Plagiarism in Source Code Datasets," The Computer Journal, Vol. 48, No. 6, 2005.
[18] R.C. Lange and S. Mancoridis, "Using code metric histograms and genetic algorithms to perform author identification for software forensics," Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, July 7-11, 2007.
[19] J. Sheard, A. Carbone and M. Dick, "Determination of Factors which Impact on IT Students' Propensity to Cheat," Proc. Fifth Australasian Computing Education Conference, pp. 119-126, ACM Press, 2002.
[20] M. Shevertalov, J. Kothari, E. Stehle, and S. Mancoridis, "On the Use of Discretized Source Code Metrics for Author Identification," Proceedings of the 1st International Symposium on Search Based Software Engineering (SBSE'09), Windsor, UK, May 2009.