evaluation of token based tools on the basis of clone ...

3 downloads 149 Views 210KB Size Report
to investigate the code clones in source files. ... A code clone is one of a code portions in source files that is .... with Java, JSP, C, C++, Fortan and PHP code.
ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 3, May 2012

EVALUATION OF TOKEN BASED TOOLS ON THE BASIS OF CLONE METRICS Rupinder Kaur , Harpreet Kaur , Prabhjot Kaur 

Abstract— The area of clone detection has considerably evolved over the last decade, leading to approaches with better results. Since code clones are believed to reduce the maintainability of software, several code clone detection techniques and tools have been proposed. Token-by token matching technique is more expensive than line-by-line matching in terms of computing complexity since a single line is usually composed of several tokens. In this paper token based tools have been evaluated that is, CCfinder and PMD copy/paste detector on the basis of three types of metrics that is, clone set metrics , file metrics , line metrics to investigate the code clones in source files. The results of the analysis processes show that each tool has its own strengths and weaknesses and no single tool is able to identify all clones within the code. Index Terms— Plagiarism, CCFinder , PMD copy/paste detector, clone metrics.

I. INTRODUCTION

A. Code Clone A code clone is one of a code portions in source files that is identical or similar to each another. A software system has several clone subsystems created by duplication with slight modification. However, since the size and the degree of similarities among code segments vary, code cloning is a fairly subjective concept. It depends on the context or human judgement whether it is a code clone or not [1]. Code clones are one of the factors that make software maintenance difficult as when a fault is found in one subsystem, the engineer has to carefully modify all other subsystems. It was believed that there would be many clones in the system, but the documentation did not provide enough information regarding the clones. It was considered that these clones heavily reduce maintainability of the system; thus, an effective clone detection tools have been expected.

B. How Do Clones Occur Code clones do not occur in software systems by themselves. There are several factors that might force or influence the developers and/or maintenance engineers in making cloned code in the system. Software clones appear for a variety of following reasons: 1. Code reuse by copying pre-existing idioms. 2. Limitations of the programming language. 3. Deferred generalization: Code is copied and modified and only then the differences are factored out when they are all known and understood. 4. Templating: Copied text is used as a template and then customized in the pasted context. 5. Cloning by Accidents: Coincidentally implementing the same logic by different developers C. Code Clone Types Textual Similarity: Based on the textual similarity we distinguish the following types of clones 1. Exact clone (type 1) copied code fragment is same as the original. However, there might be a some variations in whitespaces ,comments or layouts 2. Renamed clone (type 2) is a copy where only parameters (identifiers or literals) have been changed as shown in figure 1. 3. Gapped clone (type 3) is a copy with further modifications i.e. by deleting or adding code fragments in both fragments , we can transform them into a contiguous clone. CCFinder extracts the following two types of code clone corresponds to a code fragment C. Functional Similarity: If the functionalities of the two code fragments are identical or similar i.e., they have similar pre and post conditions, we call them semantic clones. Type 4: Two or more code fragments that perform the same computation but implemented through different syntactic variants.

Manuscript received May 2012. Rupinder Kaur, University College Of Engineering, Punjabi University Patiala., (e-mail: [email protected]). Patiala, India, 9463536647 Harpreet Kaur, University College Of Engineering, Punjabi University Patiala.9814008879, (e-mail: [email protected]),Patiala, India Prabhjot Kaur, University College Of Engineering, Punjabi University Patiala., 9463536639., (e-mail: [email protected]). Patiala, India

145 All Rights Reserved © 2012 IJARCSEE

ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 3, May 2012

pioneers AST-based clone techniques is that of Baxter et al.'s CloneDR. [3] PDG based technique: PDG [4] contains the control flow and data flow information of a program and hence carries semantic information. Once a set of PDGs are obtained from a subject program, isomorphic subgraph matching algorithm is applied for finding similar subgraphs which are returned as clones. The disadvantage of these techniques is that the analysis is quite expensive.

If (x>y) { y++; x=1; }

Copied and pasted

II CODE CLONE DETECTION TOOL: CCFinder Renamed

If(a>b) { b++; a=0; }

If(x>y) { y++; x=1; }

Exact code clone Renamed code clone Fig 1: Exact and renamed code clone D. Clone Detection Techniques and Tools Many clone detection approaches have been proposed in the literature. Based on the level of analysis applied to the source code, the techniques can roughly be classified into four main categories: textual, lexical, syntactic, and semantic[3]. Text based technique: In this approach, the target source program is considered as sequence of lines/strings. Once two or more code fragments are found to be similar in their maximum possible extent (e.g., w.r.t maximum no. of lines) are returned as clone pair or clone class by the detection technique. The advantage is that the technique can be applied to any kind of programming language. The disadvantage is that minor textual modifications, for instance, changes in identifiers or layout, may mislead the analysis[10]. Token based technique: A lexical analysis turns a program into a stream of tokens (indivisible units of meaning). Clone detection then turns into the problem of finding similar token subsequences. Because of the space and time complexity linear to the program length, these approaches scale very well. Also, lexical analysis is relatively simple, so that the technique can quickly be adjusted to other languages. One of the leading state of the art token-based techniques is CCFinder [2] of Kamiya et al. Tree-based technique: In the tree-based approach a program is pared to a parse tree or an abstract syntax tree (AST) with a parser of the language of interest. Similar subtrees are then searched in the tree with some tree matching techniques and the corresponding source code of the similar subtrees are returned as clones pairs or clone classes. The variable name and literal values of source code are discarded during tree representation. The disadvantage is the necessity to develop a parser and the overhead for parsing. One of the

CCFinder (code clone finder) detects code clones from program source codes and outputs the locations of the clone pairs in the source codes. CCFinder has no GUI but it only generates character based output. The detection process consists of four steps shown in figure 2: Step1 Lexical analysis: Each line of source files is divided into tokens corresponding to a lexical rule of the programming language. The tokens of all source files are concatenated into a single token sequence, so that finding clones in multiple files is performed in the same way as single file analysis[2]. Step2 Transformation: The token sequence is transformed, i.e., tokens are added, removed, or changed based on the transformation rules that aims at regularization of identifiers and identification of structures. Then, each identifier related to types, variables, and constants is replaced with a special token. This replacement makes code portions with different variable named clone pair. Step3 Match Detection: From all the sub-strings on the transformed token sequence, equivalent pairs are detected as clone pairs. Step4 Formatting: Each location of clone pair is converted into line numbers on the original source files. The CCFinder has following characterstics: (1) CCFinder should have industrial-size strength, and be applicable to million-line size system within affordable computation time. (2) The language dependent part of CCFinder should be limited to small parts, and the tool has to be easily adaptable to many other languages as C, C++, Java, and COBOL. (3) CCFinder should detect clones of practical interest clones. Not only syntactically same portions, but also similar portions which are considered to be actual clones have to be effectively extracted[8]. (4) Scatter plot of code showing the code clones. (5) It should give the graphical representation as metrics graph. (6) CCFinder gives the source code view of files. III PLAGIARISM DETECTION TOOL: PMD’s Copy/Paste Detector (CPD) The PMD open source tool provides a Copy/Paste Detector (CPD) tool for finding duplicate code. Plagiarism is intentionally or unintentionally reproducing (copying, rewording, paraphrasing, adapting, etc) work that was produced by another person(s) without proper

Source file 146

ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 3, May 2012

Lexical analysis

Token sequence Transformation

Mapping from transformed sequence into original

Transformed token sequence

Match detection

Clones transformed sequence

Formatting

Clone pairs/clone classes Fig 2: CCFinder clone detection process acknowledgement in an attempt to gain academic benefit.. Work that can be plagiarised includes: words (language), ideas, findings, writings, graphic representations, computer programs, diagrams, graphs, illustrations, creative work, information, lectures, printed material, electronic material, or any other original work created by someone else. It works with Java, JSP, C, C++, Fortan and PHP code. It also provides guidance on how to add other programming languages to the tool. Because it is a duplicate code detector, this tool scans the files themselves for duplicate code, hence it returns similar code found within the same file. However, it is also successful in returning similar code across different files and can be used as a tool for detecting similarity in source-code files. PMD:CPDDesigner generates a report for PMD's Copy/Paste Detector (CPD) tool. It can also generate a cpd results file in any of these formats: xml, csv or txt. PMD is a static rule set based Java source code analyzer that identifies potential problems like: 1. Possible bugs - Empty try/catch/finally/switch blocks. 2. Dead code - Unused local variables, parameters and Figure 3: Scatter plot of clones in CCFinder tool The view in figure 2 shows where and how the code clones private methods. are distributed in the whole source files. 3. Empty if/while statements. 4. Overcomplicated expressions - Unnecessary if Gray vertical and horizontal lines are boundaries of the source files, that is, locations of the end-of-flies. Squares statements, for loops that could be while loops. 5. Suboptimal code -Wasteful String/String Buffer drawn by white lines (on the main diagonal line) are directories. Black lines (and dots) that located parallel to the usage. 6. Classes with high Cyclomatic Complexity main diagonal line are code clones The corresponding orthogonal projections to the horizontal axis and vertical axis measurements. 7. Duplicate code - Copied/pasted code can mean are the locations of similar code fragments. The longest clone (941 tokens, 191 lines) was found in a copied/pasted bugs, and decreases maintainability. fileC:\Users\as\Desktop\java_project_game\java_project_g ame\ars\MainMenu.java (marked as blue square in Fig. 2). It IV RESULTS The aim of this paper has to identify which of the evaluated takes about three minutes for execution on the PC. The files tools are best suited to support the process of software are sorted in alphabetical order of the file paths, so that files in the same directory are also located nearby on the axis. A maintenance and clone detection. clone pair is shown as a diagonal line segment. In Fig.3, each line segment looks like a dot since each clone pair is small in comparison to the scale of the axis. Most line segments are 147 All Rights Reserved © 2012 IJARCSEE

ISSN: 2277 – 9043 International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 3, May 2012

located near the main diagonal line and this means that most of the clones occur within a file or among source files at the near directories. Clone metrics have been used to investigate the code clones and the source files.

clone code portion is widely or frequently used in the system? The followings are the 3 clone metrics on which comparison has to made : Clone Set Metrics

CCFinder Tool

System started from Languag e(s) Supported

2002

Copy/pas te detector Tool 2003

C,C++,JAVA,CO BOL, EMACS LISP ,Plain text Clone detection

JAVA,JS P,C,C+ ,FORTA N,PHP code Plagiaris Domain m detection Silverlight, JDK 1.4 Require Python, JDK 1.5 or later ment(s) Command line Standalon Submissi e application on method Programming Directory Source language and with or data directory without subdirectorie s Scatter or dot plot, Text Result Metrics of code, report, XML output Source code file, CVS file File metrics, Token Metrics cloneset metrics, line matches produced based . ,lines matched Table 1: Feature comparison of source code tools This table compares the features offered by various tools that aid with the detection of source-code plagiarism detection. In total the Graph Tool case study identified initial clones. Not all clones were found by each tool, in fact no single tool identified all clones. The initial clone numbers identified by each tool are shown within Table 2 below. CCFinder PMD copy/paste detector Identified 6 4 clones by each tool Table 2: Total clone numbers identified by each tool From Table 2 , it can be seen that CCFinder identified the largest number of clones, a total of 6, whereas PMD Copy/paste detector identified only 4 on giving the minimum clone length of 50.

LEN (length) [2]: Represent the maximum length of element(code fragment) in a given clone class. The length can be measured by the number of tokens, or the size measures such as LOC (the number of lines, including null and comment lines), or SLOC (the number of lines, except null or comment lines) POP(population) [4] :Represent the number of element(code fragment) for a given clone class. If this value is high, it means that the similar code fragment appears in many places. RAD(Radius) [5] :Represent the range of the source code fragments of a code clone in the directory hierarchy, when the source code is supposed to be stored in the hierarchical directory. When all the code fragments of a clone class are located in one source file, the RAD value of the clone class is equivalent to 0. When the code fragments of the clone class are located in multiple source files stored in one directory, the RAD is 1. If those sources files are stored in different directories, then RAD is the maximum depth of those sources measured from their common parent director. RNR (ratio of non-repeated tokens) [6]:Ratio (percentage) of tokens that are not included in repeated part of a code fragment of the code clone. We assume that the clone set S includes n code clones, c1, c2 . . . , cn, LOS whole(fi) represents the Length Of the whole token Sequence of code clone ci, and LOS repeated(fi) represents the ci Length Of repeated token Sequence of code clone , then,

Higher RNR(S) values mean that each code clone in a clone set S consists of more non-repeated token sequences. In most cases, repeated code sequences are involved in language-dependent code clones (e.g., code clones that involve consecutive if (or if-else) blocks, case entries of switch statements, consecutive variable declarations). NIF: Count of source files that include one or more code fragments of the code clone. By definition, NIF

Suggest Documents