Efficient Token Based Clone Detection with Flexible Tokenization Hamid Abdul Basit
Simon J. Puglisi
William F. Smyth
National University of Singapore, Department of Computer Science, 3 Science Drive 2, Singapore 117543, (+65) 6516 1184
Curtin University of Technology, Department of Computing
McMaster University, Department of Computing and Software
[email protected]
[email protected]
[email protected]
Andrew Turpin
Stan Jarzabek
RMIT University, School of Computer Science and Information Technology
National University of Singapore Department of Computer Science 3 Science Drive 2, Singapore 117543 (+65) 6516 2863
[email protected]
[email protected]
ABSTRACT Code clones are similar code fragments that occur at multiple locations in a software system. Detection of code clones provides useful information for maintenance, reengineering, program understanding and reuse. Several techniques have been proposed to detect code clones, differing in the code representation used for clone analysis, which ranges from plain text to parse trees and program dependence graphs. Clone detection based on lexical tokens involves minimal code transformation and gives good results, but is computationally expensive because of the large number of tokens that must be compared. We explored string algorithms to find suitable data structures and algorithms for efficient token-based clone detection and implemented them in our tool Repeated Tokens Finder (RTF). Instead of a suffix tree for string matching, we use the more memory-efficient suffix array. RTF incorporates a suffix-array-based linear-time algorithm to detect string matches. It also provides a simple and customizable tokenization mechanism. Analysis and experiments show that our clone detection is simple, flexible, precise, and scalable, and performs better than previous well-known tools.
Categories and Subject Descriptors D.2.7 [Software Engineering] Maintenance - Restructuring, reverse engineering, and reengineering I.5.3 [Pattern recognition] Clustering
General Terms Algorithms, Measurement, Design, Performance, Experimentation, Languages, Verification.
Keywords Clone detection, software maintenance, token-based clone detection, suffix-arrays
1. INTRODUCTION Code clones, or simply clones, are code fragments of considerable length and significant similarity. Cloning is a common phenomenon found in almost all kinds of software systems. Several studies suggest that as much as 20-30% of large software systems consists of cloned code [2][25]. The presence of clones may lead to maintenance-related problems by increasing the risk of update anomalies. Detection of clones provides several benefits in terms of maintenance, program understanding, reengineering and reuse [21]. Several tools and techniques have been proposed for the detection of clones [2][16][5][10][15][18][19][22][20]. The differentiating factors between these approaches are the code representation, the clone matching techniques, and the granularity of the detected clones. Token-based code representation provides a suitable abstraction for clone detection: it is easily adapted to different languages, while retaining awareness and control of the underlying language tokens. Comparative studies [6] involving different clone detection techniques have shown that token-based clone detection tools perform well in terms of precision and recall of the detected clones. However, manipulating all tokens in large software systems is computationally very expensive. Efficient data structures and matching algorithms can help mitigate this problem and make the technique scalable even to very large systems of multi-million lines of code.
Previous token-based techniques [2][16] make use of a well-known data structure, the suffix tree, to detect similarities in the token string. Recent research has shown that an alternative data structure, the suffix array [24], can provide the same efficiency for string pattern matching with much reduced space requirements [1]. Another limitation of the above-mentioned token-based techniques is the separation of parameter and non-parameter tokens, which can potentially cause false negatives in clone detection; this is further explained in Section 3. Based on these considerations, we propose an efficient token-based clone detection technique and tool called Repeated Tokens Finder (RTF). After transforming the source program into a string of tokens using a flexible tokenization mechanism, RTF computes the clones in the string with a suffix-array-based algorithm. Some heuristic-based pruning is then applied to remove probable false positives. The novel contribution of our approach lies in applying a simple and flexible tokenization technique, and in the selection of efficient data structures and algorithms for token string manipulation. We incorporated RTF as a front end of the tool Clone Miner, which applies data mining techniques to infer design-level similarity patterns from clone patterns found by RTF [4]. The rest of the paper is organized as follows: Sections 2 and 3 provide background and related work. Section 4 discusses the need for flexible tokenization, while Section 5 describes our tokenization mechanism. This is followed by a discussion of efficient algorithms for clone detection in Section 6. Section 7 provides the performance evaluation of our clone detection technique, and Section 8 concludes the paper.
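To make the suffix-array idea concrete, the following is a minimal sketch of how repeats in a token string can be found from a suffix array and its LCP (longest-common-prefix) array. The naive sort-based construction shown here is only for illustration (RTF uses a linear-time construction); all names and the sample token string are illustrative, not taken from RTF.

```python
def suffix_array(s):
    """Indices of all suffixes of s in lexicographic order.
    Naive O(n^2 log n) construction for illustration only."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    """lcp[i] = length of the longest common prefix of the
    suffixes starting at sa[i-1] and sa[i]."""
    lcp = [0] * len(sa)
    for i in range(1, len(sa)):
        a, b = sa[i - 1], sa[i]
        k = 0
        while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
            k += 1
        lcp[i] = k
    return lcp

# A repeated token substring shows up as a large LCP value between
# lexicographically adjacent suffixes.
toks = ["if", "(", "x", ")", "y", "=", "0", ";", "if", "(", "x", ")", "z"]
sa = suffix_array(toks)
lcp = lcp_array(toks, sa)
i = max(range(len(lcp)), key=lambda j: lcp[j])
print(toks[sa[i]:sa[i] + lcp[i]])   # -> ['if', '(', 'x', ')']
```

The suffix array stores only n integers plus the token string itself, which is the space advantage over a suffix tree that the text refers to.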
2. CLONING PROBLEM Clones arise for a variety of reasons. Programmers often need to reuse components that have not been designed for reuse. In these situations, programmers usually follow the low-cost copy-paste-modify approach instead of redesigning the system, and so create clones. Programmers may also clone code to speed up development and maintenance, especially when a new requirement is not fully understood and similar functionality is already implemented in the system. Poor design, ad hoc maintenance, code generation tools, and an imposed coding style are other ways clones are introduced. While there are good reasons for creating certain clones, most of them, independently of the reasons why they occur, are counterproductive for future maintenance, as they increase the risk of update anomalies (inconsistencies in updating clone instances). When a cloned fragment is to be changed, a programmer must find and update all of its instances consistently. The situation is further complicated if an affected fragment must be changed in slightly different ways, depending on the context. With excessive cloning, evolution and further development become prohibitively expensive. Clones may also form implicit links between components that share some functionality. All this contributes towards “software aging” [28]. It is beneficial to have knowledge about the location of clones in a software system. This helps in maintenance, change impact analysis, refactoring, finding cross-cutting concerns, and a variety of other activities.
3. RELATED WORK A number of different approaches have been proposed for code clone detection, based on different alternatives for code representation, each with its own merits and demerits. Treating code as plain text [10][15] makes the technique language-independent, but is very sensitive to the small differences that may be present between two very similar code fragments, so that only exactly matching clones can be found. Parse tree, syntax tree, or program dependence graph based techniques [5][18][19][20] give very accurate results but are too language-dependent. This is a critical factor, as there are hundreds of programming languages in use today, with some popular languages like C or COBOL having several dialects. A useful clone detector should be easily adaptable to various programming languages without requiring an expert in parsing technology [5]. Some of these techniques also pose scalability problems [6]. In [20], this problem is addressed by using a suffix tree representation of an abstract syntax tree; given the memory efficiency of suffix arrays as compared to suffix trees, it would be interesting to evaluate a suffix array implementation of this technique. Similarly, metrics-based techniques [25] give a quick overview of the cloning situation, but only fixed-granularity clones, like similar functions or classes, can be detected. A token-based clone detection approach provides a suitable level of flexibility for the task. It involves limited language dependence, is resilient to differences in code layout, and provides a good mechanism for detecting parameterized clones, where parametric differences are allowed between otherwise identical code fragments. Representative clone detection techniques based on tokenization of source code are Dup [2] and CCFinder [16]. Tokenization in Dup is line-based rather than absolute. After lexical analysis, a representative string for each line is produced.
Dup differentiates between non-parameter (NP) and parameter (P) symbols. Parameters include identifiers, constants, field names of structures, and macro names. Keywords like “while” or “if” and operators are not considered candidates for parameters. The encoded string for a line consists of one NP symbol and zero or more P symbols. The first occurrence of each P symbol is replaced by 0, and each later occurrence of a P symbol is replaced by the distance in the string since the previous occurrence of the same P symbol. For example, if A and B belong to NP and x, y belong to P, the p-string AxyBxAy is encoded as A00B3A4. In this way, the positions of the P symbols are also recorded. Only exact clones and those with a 1-1 correspondence between parameters are matched by this technique. The strings representing all the lines are concatenated to form a p-string. The matching algorithm is based on a data structure called the parameterized suffix tree, which is a generalization of the suffix tree. Only C code can be processed by Dup, but the technique is easily customizable to other languages. The clones detected by Dup cannot cross file boundaries but can cross
function boundaries. Dup has some mechanism to partially handle short repetitive segments, but in the case studies the table initialization code was removed manually. One basic limitation of line-based comparison is the modification of line breaks between otherwise similar code fragments. In languages like C++ and Java, line breaks have no real semantic meaning; using this approach, clones with relocated line breaks are either missed or detected as shorter clones. CCFinder [16] is another token-based clone detection tool. After lexical analysis, source files are transformed into a token string. CCFinder also differentiates between parameter and non-parameter tokens. Each identifier related to types, variables, and constants is replaced with a special token. CCFinder, however, does not consider the correspondence of parameters in two code clones. The alphabet in CCFinder is not bounded because of the concatenation of abutting tokens other than punctuator keywords, which on the other hand reduces the total number of tokens. The tokens of all source files are concatenated into a single token sequence to uniformly detect clones within a file and across files. Some transformations are also applied to this token string, depending on the language, to minimize the differences between similar code fragments: for example, removing namespace attribution, template parameters, initialization lists, and accessibility keywords, and marking function definition boundaries, for C++ code. CCFinder is also easily configurable to read input in different programming languages like C, C++, Java and COBOL. A suffix tree matching algorithm is used that runs in O(mn) space and time, where m is the maximum length of clones and n is the total length of the code. Some optimization techniques are applied to reduce the size of the token string: the token sequences are aligned to begin at tokens that mark the beginning of a statement, like #, {, class, if, else, etc.
Considering only these tokens as leading tokens reduces the resulting suffix tree size to one third. Another optimization is the removal of short repeated code segments from the input source. Very large files are split, and a divide-and-conquer strategy is applied to find duplicates in them. CCFinder detects clone pairs first, and then combines these clone pairs to form clone classes.
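Dup’s parameterized encoding described above can be sketched as a small function. This is an illustrative rendering of the encoding rule only (the function name and list-based representation are our own, not Dup’s):

```python
def pstring_encode(tokens, params):
    """Dup-style p-string encoding: NP symbols are kept as-is; the
    first occurrence of a P symbol becomes 0, and each later
    occurrence becomes the distance back to its previous occurrence."""
    last_pos = {}          # P symbol -> position of its previous occurrence
    encoded = []
    for i, tok in enumerate(tokens):
        if tok in params:
            encoded.append(i - last_pos[tok] if tok in last_pos else 0)
            last_pos[tok] = i
        else:
            encoded.append(tok)
    return encoded

# The paper's example: A, B are NP symbols; x, y are P symbols.
print(pstring_encode(list("AxyBxAy"), {"x", "y"}))
# -> ['A', 0, 0, 'B', 3, 'A', 4]  i.e. A00B3A4
```

Because each parameter is encoded relative to its previous occurrence rather than by name, two fragments that differ only by a consistent renaming of parameters produce identical p-strings, which is exactly what makes parameterized clones detectable.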
4. TOKENIZATION 4.1 Need for Flexibility The requirements for a clone detection technique are partly based upon the proposed usage of its output, for example, refactoring. As previous studies have shown, in certain cases language-level techniques like inheritance, generics or parameterized functions may not be sufficient to unify all types of clones, due to the strict dependency on the language rules [3][14]. An example of such clones is given in Figure 1, where two similar code portions defining overloaded operators in STL [32] differ parametrically in operators only.
template <class _Key, class _Compare, class _Alloc>
inline bool operator==(const set<_Key, _Compare, _Alloc>& __x,
                       const set<_Key, _Compare, _Alloc>& __y) {
  return __x._M_t == __y._M_t;
}
template