(c) 2003 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
To appear in Proceedings of the 11th IEEE International Workshop on Program Comprehension, May 10–11, 2003, Portland, Oregon
Towards a Clone Detection Benchmark Suite and Results Archive

Arun Lakhotia, Junwei Li, Andrew Walenstein and Yun Yang
Software Research Laboratory
Center for Advanced Computer Science
University of Louisiana at Lafayette
E-mail: {arun, walenste}@cacs.louisiana.edu

Abstract
Source code clones are copies or near-copies of other portions of code, often created by copying and pasting fragments of source code. This working session is concerned with building a communal research infrastructure for clone detection. Its intention is to build consensus on how to continue developing a benchmark suite and results archive for clone- and source-comparison-related research and development. The session is structured to foster discussion and debate over what should be collected in the archive and how the archive can best be used to improve clone detection research and techniques.
1. Introduction

Source code clones are copies or near-copies of code found in software. They are often created by copying, pasting, and then modifying functions or other code fragments. Several automated and semi-automated clone detection techniques have been devised, and recently a number of them have been compared empirically to see how well they perform [2, 7]. Typically this is done by viewing clone detection as an information retrieval (IR) task, so that the standard IR measures can be applied in the form of a detector’s precision and recall (e.g., see Kontogiannis [4]). In the case of clone detection, “recall” refers to the percentage of the actual clones that are found, and “precision” refers to the percentage of reported results that are correct, as opposed to “false positives” (code falsely reported to be clones).
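To make these measures concrete, the following sketch (ours, purely illustrative and not part of any archive) computes precision and recall under two simplifying assumptions: that clones are reported as unordered pairs of contiguous line ranges, and that an exact-match reference set of “true” clone pairs is available. The helper names (normalize, precision_recall) are our own and hypothetical.

# Minimal sketch: precision and recall of a clone detector's output,
# assuming both the detector's report and the reference set list clone
# pairs as ((file, start_line, end_line), (file, start_line, end_line)).

def normalize(pair):
    # Treat (A, B) and (B, A) as the same clone pair.
    return tuple(sorted(pair))

def precision_recall(reported, reference):
    reported = {normalize(p) for p in reported}
    reference = {normalize(p) for p in reference}
    true_positives = reported & reference
    precision = len(true_positives) / len(reported) if reported else 0.0
    recall = len(true_positives) / len(reference) if reference else 0.0
    return precision, recall

if __name__ == "__main__":
    # Hypothetical data: the detector reports two pairs, one of which is in
    # the reference set; the reference set contains two known pairs.
    reported = [(("a.c", 10, 30), ("b.c", 5, 25)),
                (("a.c", 40, 50), ("c.c", 1, 11))]
    reference = [(("a.c", 10, 30), ("b.c", 5, 25)),
                 (("d.c", 7, 27), ("e.c", 3, 23))]
    p, r = precision_recall(reported, reference)
    print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.50

Both of the sketch’s inputs are, of course, exactly what is at issue: it presupposes an agreed-upon reference set and a common reporting format.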
There are at least two main questions that plague detector comparisons using this style of research:

1. What is a clone? No objective, consensually agreed-upon definition exists, nor does one appear forthcoming. Burd and Bailey [2], for example, called their own clone assessment technique “subjective”. Without an objective definition, however, how does one evaluate precision and recall?

2. What subject system to use? Comparing clone detectors by their precision and recall will generally be fair only if all the detectors are run against identical subject systems (or subjects with comparable characteristics). However, published results have used a variety of subject systems, some of which are not freely and easily available.

One approach to reducing these impediments to research is to create a community-supported, open benchmark suite and results archive. The archive could contain test software systems, the results of various clone detectors, and “reference” statements of the clones that are believed to exist in the test systems. These two resources address the two problems noted above. IR research has very successfully generated a similar sort of benchmark suite in the form of the TREC database of test collections [10]. Similar repositories and benchmarks have been proposed in the past or are currently ongoing [5, 8, 9]. The effort also emulates and extends the work of others in creating “guinea pig” [3] or “reference corpora” [5, 6] for software engineering research and evaluation. The beginnings of this archive are already in place in the form of a collection of systems and results from a recent comparative study of clone detectors performed by Stefan Bellon and the group at the University of Stuttgart [1]. That study was the focus of the First Workshop on the Detection of Software Clones in 2002 [1]. The purpose of the present working session is to build consensus on how to continue building the archive; discussions will be used to shape future policies for collecting and maintaining it.

2. Working Session Topics
All topics related to clone detection research are welcome. However, the working session is formatted to help focus attention on outstanding questions regarding how to systematically augment the current archive. These questions include:
What is a clone? Past research has identified various “types” of clones based on formal characteristics of how they are similar (verbatim copies, identifier renamings, statement insertions, etc.) [4]. Such classifications are open to reinterpretation. For example, one might define two similar sections as clones only if the source can easily be refactored to eliminate the copies (e.g., using procedural abstraction). A more restricted definition would consider only duplicated code that should be refactored for engineering purposes. Also, does one consider macro-expanded source code (macro expansion often generates clones)? Multiply-included source files? Should one discard comments, or require clones to consist only of complete statements or functions? (An illustrative pair of near-copy fragments appears after these questions.)
What systems to collect? A collection of easily accessible and standardized data sets is critical for empirical evaluation and comparison. Ideally, the standardized data sets would include “representative” and “authentic” sources from various types of systems. “Representative” means that they should exemplify sources from a variety of engineering contexts; for example, one might wish to have degraded systems ripe for reengineering as well as well-oiled legacy assets, and samples from various languages and problem domains (real-time vs. business applications, etc.). “Authentic” means that they should be actual systems, or at least share characteristics of real industrial code. What might the archive include regarding: (1) multiple snapshots of systems as they evolve, (2) multiple versions of the same program, one with clones “eliminated” through abstraction techniques, (3) implementations of reference clone types in multiple languages, and (4) systems with seeded, known clones for use in statistical evaluations of precision and recall?
What is a clone detection result set? Different clone detectors report clones in different formats, and there can be substantial disagreement over what should count as a reported clone. A “lowest common denominator” approach reports non-overlapping, pairwise clones in terms of contiguous sequences of lines. Alternative approaches could instead consider possibly discontinuous, character-level clones, or overlapping or hierarchical clone relationships. It might also be helpful to go beyond the simple dichotomy of clone vs. non-clone and allow the reporting of partial or fuzzy clone matches, or to encode match confidence values. (A sketch of one possible result record appears after these questions.)

How to compare clone detection techniques? How does one fairly compare automated and semi-automated techniques? What about task-specific clone detection, i.e., finding clones relevant to some software engineering question (size, ability to reengineer, etc.)?
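To make the “What is a clone?” question concrete, the following pair of fragments (ours, purely illustrative, and not drawn from any subject system) shows a function and a near-copy in which identifiers have been renamed and one statement inserted. Whether, and at what granularity, the second fragment counts as a clone of the first is exactly the kind of judgment a reference corpus must encode.

# Original fragment.
def average(values):
    total = 0
    for v in values:
        total += v
    return total / len(values)

# Near-copy: identifiers renamed ("values" -> "samples", "total" -> "acc")
# and one statement inserted (the empty-input guard). A verbatim-only
# definition would reject this pair; a definition tolerant of renamings
# and insertions would accept it.
def mean(samples):
    acc = 0
    if not samples:        # inserted statement
        return 0.0
    for s in samples:
        acc += s
    return acc / len(samples)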
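As a sketch of what a “lowest common denominator” result record might look like, the following (again ours and hypothetical; the field names are assumptions, not an agreed interchange format) represents a reported clone as a pair of contiguous line ranges with an optional clone type and confidence value.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Fragment:
    path: str          # file containing the fragment
    start_line: int    # first line of the fragment (1-based, inclusive)
    end_line: int      # last line of the fragment (inclusive)

@dataclass(frozen=True)
class ClonePair:
    left: Fragment
    right: Fragment
    clone_type: str                     # e.g. "verbatim", "renamed", "gapped"
    confidence: Optional[float] = None  # detector's confidence, if any

# Example of a record a detector might emit:
pair = ClonePair(Fragment("src/a.c", 10, 30),
                 Fragment("src/b.c", 5, 25),
                 clone_type="renamed",
                 confidence=0.9)

Richer reports, such as discontinuous fragments, clone classes rather than pairs, or fuzzy matches, would require extending such a record, which is part of what the working session aims to settle.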
3. Workshop Format

The working session is organized with the intent of keeping it highly interactive and responsive to the interests of the attendees. We expect the working session to fall along these lines:

10 min  Presentation on the contents and organization of the current archive
15 min  3-4 brief overviews of key unresolved issues, with examples
30 min  Brainstorming session on those issues
20 min  Focusing session to harvest ideas
10 min  Recap and TODO list building
References

[1] S. Bellon and R. Koschke. Evaluation of automatic software clones project (University of Stuttgart). http://www.bauhaus-stuttgart.de/clones/.
[2] E. Burd and J. Bailey. Evaluating clone detection tools for use during preventative maintenance. In Second IEEE International Workshop on Source Code Analysis and Manipulation (SCAM'02), pages 36–43, 2002.
[3] R. C. Holt. Software guinea pigs (homepage). http://plg.uwaterloo.ca/~holt/guinea_pig.
[4] K. Kontogiannis. Evaluation experiments on the detection of programming patterns using software metrics. In Proceedings of the 4th Working Conference on Reverse Engineering (WCRE '97), pages 44–55, 1997.
[5] R. Koschke and T. Eisenbarth. A framework for experimental evaluation of clustering techniques. In Proceedings of the 8th International Workshop on Program Comprehension (IWPC'00), pages 201–210, 2000.
[6] A. Lakhotia and J. M. Gravley. Toward experimental evaluation of subsystem classification recovery techniques. In Second Working Conference on Reverse Engineering, pages 262–271, 1995.
[7] A. Marcus and J. I. Maletic. Identification of high-level concept clones in source code. In Proceedings of the 16th IEEE International Conference on Automated Software Engineering (ASE 2001), pages 107–114, 2001.
[8] S. Rugaber and L. M. Wills. Creating a research infrastructure for reengineering. In Third Working Conference on Reverse Engineering (WCRE-3), Nov. 1996.
[9] S. E. Sim, R. C. Holt, and S. Easterbrook. On using a benchmark to evaluate C++ extractors. In Proceedings of the 10th International Workshop on Program Comprehension, pages 114–123, 2002.
[10] Text REtrieval Conference (TREC). Homepage at http://trec.nist.gov.