CODE CLONE DETECTION USING STRING BASED TREE MATCHING TECHNIQUE
NORFARADILLA BINTI WAHID
UNIVERSITI TEKNOLOGI MALAYSIA
CODE CLONE DETECTION USING STRING BASED TREE MATCHING TECHNIQUE
NORFARADILLA BINTI WAHID
A project report submitted in partial fulfillment of the requirements for the award of the degree of Master of Science (Computer Science)
Faculty of Computer Science and Information System Universiti Teknologi Malaysia
OCTOBER 2008
To my beloved parents, fiancé and family.
ACKNOWLEDGEMENT
In preparing this report, I was in contact with many people, researchers and academicians. They have contributed towards my understanding and thoughts. In particular, I wish to express my sincere appreciation to my project supervisor, Dr Ali Selamat, for encouragement, guidance, criticism and friendship.
My fellow postgraduate students should also be recognized for their support. My sincere appreciation also extends to all my colleagues and others who have provided assistance on various occasions. Their views and tips are useful indeed. Without their continued support and interest, this thesis would not have been the same as presented here.
ABSTRAK
Pengklonan kod telah menjadi suatu isu sejak beberapa tahun kebelakangan ini selari dengan pertambahan jumlah aplikasi web dan perisian berdiri sendiri pada hari ini. Pengklonan memberi kesan yang sangat besar kepada fasa penyelenggaraan sistem kerana secara tidak langsung peningkatan bilangan pengulangan kod yang sama di dalam sesebuah sistem akan menyebabkan kompleksiti sistem turut meningkat. Terdapat banyak teknik pengesanan klon telah dihasilkan pada hari ini dan secara umumnya ianya boleh dikategorikan kepada pengesanan berasaskan jujukan perkataan, token, pepohon dan semantik. Tujuan projek ini adalah untuk mengetahui kemungkinan untuk menggunakan suatu teknik dari pemetaan ontologi untuk menyelesaikan masalah ini, tetapi kami tidak menggunakan ontologi di dalam pengesanan klon. Telah dibuktikan di dalam eksperimen awalan bahawa ia mampu untuk mengesan klon. Di dalam tesis ini kami menggunakan dua aras pengesanan. Aras pertama menggunakan 'pelombong sub-pepohon terkerap' di mana ia mampu mengesan sub-pepohon yang sama antara fail yang berbeza. Kemudian sub-pepohon yang sama dinyatakan dalam bentuk ayat dan persamaan antara kedua-duanya dikira menggunakan 'metrik ayat'. Daripada eksperimen, kami mendapati bahawa sistem kami adalah tidak bergantung kepada sebarang bahasa dan menghasilkan keputusan yang bagus dari segi precision tetapi tidak dari segi recall. Ia mampu mengesan klon serupa dan yang hampir sama.
ABSTRACT
Code cloning has been an issue in recent years as the number of available web applications and stand-alone software increases. The major consequence of cloning is that it puts the maintenance process at risk, as the many duplicated codes in a system practically increase the complexity of the system. Many code clone detection techniques can be found nowadays, and they can generally be grouped into string based, token based, tree based and semantic based approaches. The aim of this project is to find out the possibility of using a technique from ontology mapping to solve this problem, although we do not use a real ontology for the clone detection. It has been proven that there is such a possibility, as the technique manages to detect cloned code. In this thesis the clone detection uses two layers of detection, i.e. structural similarity and string based similarity. The structural similarity uses a subgraph miner, which is capable of finding the similar subtrees between different files. We then extract all elements of each such subtree and treat the elements as a string. A similarity metric is then applied to two strings from different files to determine whether they form a clone pair. From the experimental results, we found that the system is language independent and that the results are good in precision but not so good in recall. It is also capable of detecting the two main types of clones, i.e. identical clones and similar clones.
TABLE OF CONTENTS

DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRAK
ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
LIST OF SYMBOLS
LIST OF APPENDICES

1  INTRODUCTION
   1.1  Overview
   1.2  Background of the Problem
   1.3  Problem Statement
   1.4  Objectives of the Project
   1.5  Scope of the Project
   1.6  Thesis Outline

2  LITERATURE REVIEW
   2.1  Introduction
   2.2  Code Cloning
        2.2.1  Reasons of Code Cloning
        2.2.2  Code Cloning Consequences
        2.2.3  Code Cloning versus Plagiarism
        2.2.4  Code Cloning and the Software Copyright Infringement Detection
   2.3  Code Cloning in Web Applications
        2.3.1  Definition of Clones from Web Application Research View
        2.3.2  Source of Clones
   2.4  Existing Work of Code Cloning Detection
        2.4.1  String based
        2.4.2  Token based
        2.4.3  Tree based
        2.4.4  Semantic based
        2.4.5  Fingerprinting
        2.4.6  Analysis on Current Approaches
   2.5  The Semantic Web
        2.5.1  Architecture of the Semantic Web
        2.5.2  Web Ontology
        2.5.3  Web Ontology Description Languages
        2.5.4  Various Applications of Ontology
        2.5.5  Ontology Mapping
        2.5.6  Ontology Mapping Approaches
        2.5.7  The Ontology Mapping Technique
               2.5.7.1  String Metrics
               2.5.7.2  Frequent Subgraph Mining
               2.5.7.3  MoFa, gSpan, FFSM, and Gaston
               2.5.7.4  Representing Web Programming as Tree
   2.6  Clone Detection Evaluation
   2.7  Difference with Work by Jarzabek
        2.7.1  Clone Miner by Jarzabek
               2.7.1.1  Detection of Simple Clones
               2.7.1.2  Finding Structural Clones
        2.7.2  Comparison of Existing Work and Our Proposed Work

3  RESEARCH METHODOLOGY
   3.1  Introduction
   3.2  Proposed Technique of Clone Detection
        3.2.1  Structural Tree Similarity
        3.2.2  String Based Tree Matching
   3.3  Preprocessing
   3.4  Frequent Subgraph Mining
   3.5  String Based Matching
   3.6  Clone Detection Algorithm
   3.7  Clone Detection Evaluation

4  EXPERIMENTAL RESULT AND DISCUSSION
   4.1  Introduction
   4.2  Data Representation
        4.2.1  Original Source Program into XML Format
        4.2.2  Subtree Mining Data Representation
   4.3  Frequent Subtree Mining
   4.4  String Metric Computation
   4.5  Experimental Setup
   4.6  Experimental Results
   4.7  Comparison of Result Using Different Parameters

5  CONCLUSION
   5.1  Introduction
   5.2  Future Works
   5.3  Strength of the System

REFERENCES
Appendices A – C
LIST OF TABLES

TABLE NO.   TITLE

2.1   A summary of code cloning and plagiarism detection
2.2   Brief description of ontology languages
2.3   List of string metrics
3.1   Example of cross-table used to compare programs across two systems
3.2   Brief description of each frequent subgraph miner
4.1   Data for program testing
4.2   Experimental result using GSpan miner
4.3   Experiment using different parameter values
LIST OF FIGURES

FIGURE NO.   TITLE

1.1    A sample parse tree with generated characteristic vectors
2.1    Example of a pair of cloned code in traditional program
2.2    Tree-diagram for the reasons for cloning
2.3    Variation of clone detection research and the classification of detection
2.4    Architecture of the Semantic Web
2.5    Simple example of ontology
2.6    Simple example of mapping between two ontologies
2.7    Illustration of ontology mapping approaches
2.8    Tree representation of an XML source code
2.9    Clones per file
2.10   Frequent clone pattern with file coverage
2.11   Similar node structure between two XML code fragments
2.12   Difference of work by Jarzabek and Basit (2005) and our proposed work
3.1    Mapping between concepts of O'α and O'β
3.2    Diagrammatic view of clone detection technique
3.3    Preprocessing phase
3.4    Illustration of detected clone within two trees
3.5    A pair of source code fragments classified as nearly identical
3.6    Clone detection algorithm
4.1    Transformation of original PHP code into HTML code
4.2    XML form of the previous HTML code
4.3    A tree as list of nodes and edges
4.4    Example of tree as vertices and edges list
4.5    Frequent subtrees generated by graph miner
4.6    Code fragment containing original frequent subtree
4.7    Real output from the clone detection system
4.8    Recall and precision for GSpan-Jaro Winkler
4.9    Robustness of GSpan-Jaro Winkler
4.10   Computational time for GSpan-Jaro Winkler
4.11   Recall and precision for GSpan-Levenshtein Distance
4.12   Robustness for GSpan-Levenshtein Distance
4.13   Computational time for GSpan-Levenshtein Distance
4.14   Two close clones cannot be taken as a single clone
4.15   Precision result using different minimum support and threshold
4.16   Recall result using different minimum support and threshold
LIST OF ABBREVIATIONS

WA    -   Web Application
TS    -   Traditional Software
CCD   -   Code Clone Detection
PD    -   Plagiarism Detection
LIST OF SYMBOLS

Sim(s1, s2)       -   similarity between two strings, s1 and s2
Comm(s1, s2)      -   commonality between s1 and s2
Diff(s1, s2)      -   difference between s1 and s2
Winkler(s1, s2)   -   improvement value to improve the result
maxComSubString   -   the sum of the lengths of common substrings
length(s1)        -   length of s1
length(s2)        -   length of s2
uLens1            -   length of the unmatched substring from s1
uLens2            -   length of the unmatched substring from s2
p                 -   a parameter in the range 0 to ∞
θ                 -   a threshold
LIST OF APPENDICES

APPENDIX   TITLE

A   Project Activities
B   Existing Works of Code Clone Detection
C   Experimental Result Tables
CHAPTER 1
INTRODUCTION
1.1  Overview
As the world of computers develops rapidly, there is a tremendous need for software development for different purposes. As we can see today, the complexity of the software being developed differs from one system to another. Sometimes, developers take the easier way of implementation by copying fragments of existing programs and using the code in their own work. This practice is called code cloning. The habit of cloning can lead to other issues in software development, for example plagiarism and software copyright infringement (Roy and Cordy, 2007).
In most cases, in order to resolve these issues and to support better software maintenance, we need to detect the code that has been cloned (Baker, 1995). In web application development, the chances of cloning are even greater since so much open source software is available on the Internet (Bailey and Burd, 2005). A new application is sometimes just a 'cosmetic' variation of another existing system. There is a considerable amount of research on software code clone detection, but much less of it targets web-based applications.
1.2  Background of the Problem
Software maintenance has been widely accepted as the most costly phase of a software lifecycle, with figures as high as 80% of the total development cost being reported (Baker, 1995). As cloning is one of the contributors to this cost, software clone detection and resolution have received considerable attention from the software engineering research community, and many clone detection tools and techniques have been developed (Baker, 1995). However, when it comes to commercialization of software code, most software house developers tend to claim that their work is 100% done in house without using code copied from various sources. This makes it difficult for intellectual property copyright entities, such as SIRIM and patent searching offices, to establish the genuineness of the software source code developed by the company. There is a need to verify that the software source submitted for a patent or copyright application is genuine source code without any copyright infringement. Besides that, cloning also raises the issue of plagiarism. The simplest example is in the academic area, where students tend to copy their friends' work and submit assignments with only slight modifications.
Usually, in the software development process, there is a need for component reusability in both design and coding. Reuse in object-oriented systems is made possible through different mechanisms such as inheritance, shared libraries, object composition, and so on. Still, programmers often need to reuse components which have not been designed for reuse. This may happen during the initial stage of systems development and also when the software systems go through an expansion phase and new requirements have to be satisfied. In these situations, the programmers usually follow the low-cost copy-paste technique, instead of the costly redesign-the-system approach, hence causing clones. This type of code cloning is the most basic and widely used approach towards software reuse. Several studies suggest that as much as 20-30% of large software systems consist of cloned code (Krinke, 2001). The problem with code cloning is that errors in the original must be fixed in every copy. Other kinds of maintenance changes, for instance extensions or adaptations, must be applied multiple times as well. Yet it is usually not documented where code was copied. In such cases, one needs to detect the clones. For large systems, detection is feasible only with automatic techniques. Consequently, several techniques have been proposed to detect clones automatically (Bellon et al., 2007).
There are quite a number of works that detect similarity by representing the code as a tree or graph, and others that use string-based or semantic-based detection. Almost all clone detection techniques tend to detect syntactic similarity, and only some detect the semantic part of the clones. Baxter in his work (Baxter et al., 1998) proposes a technique to extract clone pairs of statements, declarations, or sequences of them from C source files. The tool parses the source code to build an abstract syntax tree (AST) and compares its subtrees by characterization metrics (hash functions). The parser needs a full-fledged syntax analysis for C to build the AST. Baxter's tool expands C macros (define, include, etc.) so that code portions written with macros can be compared. Its computational complexity is O(n), where n is the number of subtrees of the source files. The hash function makes it possible to do parameterized matching, to detect gapped clones, and to identify clones of code portions in which some statements are reordered. AST approaches are able to transform the source tree into a regular form, as we do in our transformation rules. However, AST-based transformation is generally expensive since it requires full syntax analysis and transformation.
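To make the idea of hash-based subtree comparison concrete, the following is a minimal sketch, not Baxter's actual tool: it hashes Python AST subtrees by their node-type structure only, so renamed identifiers still collide, and it reports groups of structurally identical subtrees. The minimum subtree size and the hashing scheme are illustrative assumptions.

import ast
from collections import defaultdict

def structure_hash(node):
    """Hash a subtree by node types only, so renamed identifiers still collide."""
    parts = [type(node).__name__]
    for child in ast.iter_child_nodes(node):
        parts.append(structure_hash(child))
    return hash(tuple(parts))

def subtree_size(node):
    return 1 + sum(subtree_size(child) for child in ast.iter_child_nodes(node))

def find_structural_clones(source, min_size=8):
    """Bucket subtrees by structure hash; buckets with more than one subtree
    are candidate clone groups."""
    buckets = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if subtree_size(node) >= min_size:        # skip trivially small subtrees
            buckets[structure_hash(node)].append(node)
    return [group for group in buckets.values() if len(group) > 1]

code = """
def foo(items):
    total = 0
    for x in items:
        total = total + x
    return total

def bar(values):
    acc = 0
    for v in values:
        acc = acc + v
    return acc
"""
for group in find_structural_clones(code):
    print([type(n).__name__ for n in group],
          "at lines", [getattr(n, "lineno", "?") for n in group])

Running the sketch reports the two function bodies (and their shared loop) as structurally identical, which is the same behaviour an AST hashing approach exploits on a much larger scale.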
In other work, Jiang et al. (2007) present an efficient algorithm for identifying similar subtrees and apply it to tree representations of source code. Their algorithm is based on a novel characterization of subtrees as numerical vectors in the Euclidean space R^n and an efficient algorithm to cluster these vectors with respect to the Euclidean distance metric. Subtrees whose vectors fall in one cluster are considered similar. They have implemented the tree similarity algorithm as a clone detection tool called DECKARD and evaluated it on large code bases written in C and Java, including the Linux kernel and the JDK. The experiments show that DECKARD is both scalable and accurate. It is also language independent, being applicable to any language with a formally specified grammar.
Figure 1.1: A sample parse tree with generated characteristic vectors (Jiang et al., 2007).
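The characteristic-vector idea can be illustrated with a small sketch (a simplification, not the DECKARD implementation): each function subtree is summarized by counts of selected node kinds, and two subtrees are reported as similar when the Euclidean distance between their vectors falls below a threshold. The chosen node kinds and the threshold are assumptions made for illustration.

import ast
import math

# Node kinds used as vector dimensions (an illustrative choice, not DECKARD's).
KINDS = (ast.Assign, ast.For, ast.While, ast.If, ast.Call, ast.BinOp, ast.Return, ast.Name)

def characteristic_vector(node):
    """Count occurrences of each chosen node kind in the subtree rooted at node."""
    vec = [0] * len(KINDS)
    for sub in ast.walk(node):
        for i, kind in enumerate(KINDS):
            if isinstance(sub, kind):
                vec[i] += 1
    return vec

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def similar_functions(source, threshold=2.0):
    """Pair up functions whose characteristic vectors lie close together."""
    tree = ast.parse(source)
    funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    vectors = {f.name: characteristic_vector(f) for f in funcs}
    names = list(vectors)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if euclidean(vectors[a], vectors[b]) <= threshold]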
In (Krinke, 2001), Krinke presents an approach to identify similar code in programs based on finding similar subgraphs in attributed directed graphs. This approach is applied to program dependence graphs and therefore considers not only the syntactic structure of programs but also the data flow within them (as an abstraction of the semantics). As a result, it is claimed that there is no trade-off between precision and recall: the approach is very good in both.
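A toy illustration of the underlying idea of attributed-subgraph matching (not Krinke's algorithm) can be written with the networkx library, which is assumed to be available; the node labels and edges below are hypothetical stand-ins for dependence-graph nodes.

import networkx as nx
from networkx.algorithms import isomorphism

# Two toy "program dependence graphs": nodes are statements labelled by kind,
# edges are dependences (hypothetical labels, not produced by a real PDG builder).
g1 = nx.DiGraph()
g1.add_nodes_from([(1, {"label": "entry"}), (2, {"label": "assign"}),
                   (3, {"label": "loop"}), (4, {"label": "assign"})])
g1.add_edges_from([(1, 2), (1, 3), (3, 4), (2, 4)])

g2 = nx.DiGraph()
g2.add_nodes_from([(10, {"label": "entry"}), (20, {"label": "assign"}),
                   (30, {"label": "loop"}), (40, {"label": "assign"}),
                   (50, {"label": "call"})])
g2.add_edges_from([(10, 20), (10, 30), (30, 40), (20, 40), (40, 50)])

# Does g2 contain a subgraph whose structure and node labels match g1?
matcher = isomorphism.DiGraphMatcher(
    g2, g1, node_match=isomorphism.categorical_node_match("label", None))
print(matcher.subgraph_is_isomorphic())   # True: g1's dependence structure recurs in g2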
Kamiya in one of his works (Kamiya et al., 2002) suggests the use of a suffix tree. In the paper, a suffix-tree matching algorithm is used to compute token-by-token matching, in which the clone location information is represented as a tree with shared nodes for leading identical subsequences, and clone detection is performed by searching the leading nodes on the tree. Token-by-token matching is more expensive than line-by-line matching in terms of computational complexity, since a single line is usually composed of several tokens. They proposed several optimization techniques especially designed for the token-by-token matching algorithm, which make the algorithm practical for large software.
Appendix B of this thesis describes briefly some existing techniques of code clone and plagiarism detection. It also discusses the strengths and weaknesses of each technique.
1.3  Problem Statement
As we can see from the previous works, some of the techniques are scalable and can detect more than one type of clone, but some of them face a trade-off in computational complexity. This may happen because most of the techniques apply expensive syntax analysis for the transformation. From the literature review, more than half of the existing techniques use tree-based detection, as it is more scalable. However, most of the techniques perform a single layer of detection, which means that after the transformation into a normalized representation, e.g. a tree or a graph, the process of finding similar code, i.e. code clones, is done directly by processing each node in the data. All possible clones have to be searched directly without any kind of filtering, which can cause a higher computational cost.
As ontology is widely used nowadays, we cannot deny its importance in current web technology. The major similarity between ontologies and clone detection work is that both can be represented as trees. Besides that, much work has been done on mapping different ontologies to each other, which essentially finds out which concepts of the first ontology are the same as those of the second one. This activity is almost the same as what needs to be done in detecting cloned code.
Since there are such similarities between the two problems, detecting clones in source code may be achievable in the same way as mapping ontologies. The research question of this thesis is therefore to identify the possibility of using an ontology mapping technique to detect clones in a web-based application. Obviously, no ontologies will be used in the experiments, since we are dealing with source code and not ontologies, but we will use the mapping technique to detect clones.
In order to achieve this aim, there are a few questions that need to be answered. What are the attributes or criteria that might possibly be cloned in web documents? Which approaches proposed in previous ontology mapping research can be used in a clone detection tool? What are the issues of the adopted approach and how can they be solved?
1.4  Objectives of the Project
The aim of this research is to develop a clone detection framework by adapting an existing ontology mapping technique. In order to achieve this aim, the following objectives must be fulfilled.

1. To analyze various techniques related to code clone detection that have been proposed by previous researchers.

2. To develop a clone detection program by using the ontology mapping technique that will be proposed in the project.

3. To test the program using recall and precision measurements as the main metrics.
1.5  Scope of the Project
The scope of the project is as follows:

1. The analysis will only cover various techniques of ontology mapping for detecting code cloning in web-based applications.

2. The proposed improved technique will be based on the best existing ontology mapping technique that is suitable for code clone detection.

3. The proposed technique should at least be capable of detecting identical clones and renamed clones.

4. The experiments will be done by testing the program with open source web applications available on the web.
1.6  Thesis Outline
The remainder of the report is organized as follows:

• Chapter II gives a review of code clone detection, ontology, and various existing techniques for code clone detection.

• Chapter III describes the overall methodology adopted to achieve the objectives of this research.

• Chapter IV presents the findings of this project, which consist of the experimental results of the proposed clone detection technique.

• Chapter V concludes the report and provides some suggestions for future work.
CHAPTER 2
LITERATURE REVIEW
2.1  Introduction
Software is commonly used in the computer systems of our daily life, and it is getting larger and more complex. As software grows bigger, the number of software engineers and programmers attached to its development also increases. The more people are involved in the software development life cycle, the more the complexity of the software may be influenced by their habits and styles of writing code.
In software development, reuse in object-oriented systems is made possible through different mechanisms such as inheritance, shared libraries, object composition, and so on. Still, programmers often need to reuse components which are not designed for reuse. This may happen during the initial stage of systems development and also when the software systems go through an expansion phase and new requirements have to be satisfied. In these situations, the programmers usually follow the low-cost copy-paste technique, instead of the costly redesign-the-system approach, hence causing clones. This type of code cloning is the most basic and widely used approach towards software reuse.
There are various reasons that encourage programmers to clone code. Whatever the reason, the act of copying and pasting leads to code cloning in the software, which in the end complicates software maintenance because of the duplicated code. This scenario happens in the implementation phase of stand-alone systems as well as web-based applications. Meanwhile, the growth of open source software also has a great impact on the intellectual property rights of the owner of the source code. There is a need to detect and validate the genuineness of a software code, to identify whether the code is actually copied or genuine, before a copyright status can be given to the applicant.
2.2  Code Cloning
Code duplication, i.e. copying a code fragment and then reusing it by pasting with or without modifications, is a well-known code smell in software maintenance. This type of reuse of existing code is called code cloning, and the pasted code fragment (with or without modifications) is called a clone of the original. Several studies show that duplicated code is basically the result of copying existing code fragments and then pasting them with or without minor modifications. People generally believe that the major cause of cloning is the act of copy and paste, though some say it may also happen accidentally. In some cases, a newly developed system is actually a 'cosmetic' variation of another existing system. This kind of case usually happens in web-based applications, where developers tend to modify only the appearance of the application or system by changing the background color, images, and so on.
Refactoring of the duplicated code is another prime issue in software maintenance although several studies claim that refactoring of certain clones is not desirable and there is a risk of removing them. However, it is also widely agreed that clones should at least be detected.
Several studies have shown that the cost of maintenance increases considerably wherever there are clones in the source code, compared with code without any clones. Code cloning has been defined in different research works, and some of them use different terminologies to refer to it. According to (Koschke, 2006), a clone is something that appears to be a copy of an original form; it is a synonym for duplicate. Often in the literature there is a misconception between code clones and redundant code. Even though code clones usually lead to code redundancy, not every redundant code is harmful; cloned code, on the other hand, is usually harmful, especially for the maintenance phase of the software development life cycle. Baxter in his outstanding work (Baxter et al., 1998) states that a clone is "a program fragment that is identical to another fragment", (Krinke, 2001) uses the term "similar code", (Ducasse et al., 1999) use the term "duplicated code", (Komondoor and Horwitz, 2001) also use the term "duplicated code" and use "clone" as an instance of duplicated code, and (Mayrand and Leblanc, 1996) use metrics to find "an exact copy or a mutant of another function in the system".
All these definitions of clones carry some kind of vagueness (e.g., "similar" and "identical"), and this imperfect definition of clones makes the clone detection process much harder than the detection approach itself. Generally, it can be said that a code clone pair is a pair of code fragments that are syntactically or semantically identical or similar. From all the arguments above, we can simplify clones into four types:

i. An exact copy without modifications (except for white space and comments), i.e. identical.

ii. A syntactically identical copy in which only variable, type or function identifiers have been changed, i.e. nearly identical.

iii. A copy with further modifications, in which statements have been changed, added, or removed, i.e. similar.

iv. A code portion that is partly similar to another code fragment; it may involve some deletion, modification and addition relative to the original code, i.e. a gapped clone.
Based on our understanding of (Ueda et al., 2002), we may group the second and third types into a single major type, namely renamed code clones. Renamed code clones still have a similar structure to each other. It is therefore part of this report that the proposed framework should at least be capable of detecting identical clones and renamed code clones.
Figure 2.1 shows an example of cloned code. Obviously the code in the example has the same code structure and the pair is considered similar.

int sum = 0;

void foo(Iterator iter){
    for(item = first(iter); has_more(iter); item = next(iter)){
        sum = sum + value(item);
    }
}

int bar(Iterator iter){
    int sum = 0;
    for(item = first(iter); has_more(iter); item = next(iter)){
        sum = sum + value(item);
    }
}

Figure 2.1: Example of a pair of cloned code in a traditional program.

2.2.1  Reasons of Code Cloning
There are a number of reasons why people tend to clone code. The most frequently reported reason is to ease the development of the source code. The developers see duplication at this phase as a cheap way to speed up the development process, but on the other hand it leads to expensive consequences during maintenance and may affect the performance of the system itself. Sometimes, programmers are forced to copy because of the limitations of their knowledge of the particular programming language.
The other scenario leading to code cloning is the planned duplication of code. For example, software developers may clone code to obtain the same quality of code fragments under strict performance requirements (Ekram et al., 2005). This type of cloning is desirable. Another desirable kind of cloning is code duplication between two architectural entities in order to maintain architectural clarity.

Code fragments may also be accidentally identical. Technically these are not clones, since they were not intentionally copied from each other, but clone detection tools may identify them as clones since they look similar. The tendency of employers to evaluate programmers' work by lines of code (LOCs) per day also sometimes pushes them towards code cloning. Programmers often believe that as long as the program works well, there is nothing left to be concerned about.
Other reported reasons for cloning are insufficient information on global change impact, a badly organized reuse process, time pressure, educational deficiencies, ignorance or shortsightedness, intellectual challenges (e.g., generics), professionalism/end-user programming (e.g., HTML, Visual Basic, etc.), and development process and organizational issues, e.g., distributed development organizations (Koschke, 2006). Figure 2.2 shows the reasons for cloning reported in the work done by (Roy, 2007), represented in the form of a tree.
Figure 2.2: Tree-diagram for the Reasons for Cloning (Roy, 2007). The tree groups the reported reasons under four main branches: development strategy (reuse approach and programming approach), maintenance benefits (risk avoiding and reflecting design decisions), overcoming underlying limitations (of the language and of the programmers), and cloning by accident.
2.2.2  Code Cloning Consequences
The preceding subsections have discussed several reasons why cloning occurs. While it is sometimes beneficial to practice cloning, code clones can also have serious impacts on the quality, reusability and maintainability of a software system, as well as on determining copyright authority for the developer (e.g. by SIRIM) and on plagiarism.
There are plausible arguments that code cloning increases maintenance effort. Changes must be made consistently multiple times if the code is redundant. The copied code fragments are usually not documented, so manual searching is needed if any bugs exist in the original code, since the bugs are copied into the cloned code as well. Therefore, the probability of bug propagation may increase significantly in the system (Li et al., 2006). In many cases, only the structure of the duplicated fragment is reused, and the developer is responsible for adapting the code to the current need. This leaves the system in a much more fragile condition, with further possibilities of errors or bugs. Furthermore, during analysis the same code must be read over and over again and compared to the other code, just to find out that this code has already been analyzed. To verify this, a rather expensive comparison between code fragments is needed to check their similarity, which could have been avoided if there were no code duplication in the program.
(Roy, 2001) also stated that code cloning may increase the probability of bad design, the difficulty of system improvement and modification, and the resource requirements. As a consequence, it becomes difficult to reuse part of the implementation in future projects. Besides that, because of the duplications in the code, one needs additional time to understand the existing code in order to add more functionality to the system. Code duplication introduces a higher growth rate of the system size. While system size may not be a big problem for some domains, others (e.g., compact devices) may require a costly hardware upgrade with a software upgrade. Compilation times will also increase if more code has to be translated, which has a harmful effect on the edit-compile-test cycle.
2.2.3  Code Cloning versus Plagiarism
Copy-paste-adapt strategies obviously lead to a lot of problems when software renovation needs to be done, but it is likely that these strategies cannot be avoided during programming.
One of the areas closely related to clone detection is plagiarism detection (Roy, 2001). Jarzabek and Rajapakse (2005) stated that plagiarism detection is one of the applications of code clone detection. In code cloning, existing code is copied and then reused, with or without modifications or adaptations, for various reasons. In plagiarism, on the other hand, the copied code is concealed intentionally and is therefore more difficult to detect. Clone detection techniques can be used in the domain of plagiarism detection if extensive normalization is applied to the source code before comparison. However, such normalization may produce many false positives (Roy, 2001). From this explanation, it can be said that one way to do plagiarism detection is by doing clone detection, i.e. plagiarism detection is an application of code clone detection. Clone detection tools such as the token-based CCFinder (Kamiya et al., 2002) and Moss (Schleimer, 2003) have been applied to detecting plagiarism.
To distinguish them clearly, we differentiate the two terms, code clone detection and plagiarism detection, as stated in (Roy, 2001). Plagiarism detection tools are designed to measure the degree of similarity between a pair of programs in order to detect the degree of plagiarism between them. The terms 'Replication Within Programs' and 'Replication Across Programs' were introduced to differentiate the detection. The first describes the situation where code has been copied and pasted one or more times within the same file, while the latter describes code cloning between files.
Table 2.1: A summary of code cloning and plagiarism detection

                                Clone Detection    Plagiarism Detection
Replication Within Programs           √
Replication Across Programs           √                    √
Clone detection techniques, on the other hand, may benefit from the research that has been done in the plagiarism area. This statement is based on the work done in (Bailey and Burd, 2002). In their work, they evaluated several code cloning and plagiarism detection tools. From their study, it was found that plagiarism detection tools show more or less similar precision and recall compared to the clone detection tools, even though these tools detect clones across files only. Since plagiarism detection tools are meant for comparing programs, they are not well suited for direct use in clone detection from a performance point of view.
2.2.4  Code Cloning and the Software Copyright Infringement Detection
In recent years, the issue of infringement in the software industry has gained international attention as the demand for software continues to grow. The growing presence of unauthorized reproduction of copyrighted products inhibits full potential growth and discourages creative activity. Copyright infringement occurs when we use or distribute information without permission from the person or organization that owns the legal rights to that information, i.e. copyrighted software in this particular topic. Software copyright is not essentially different from any other sort of copyright. However, there are certain aspects of copyright law that are specific to software, because there are practical differences between software and other things that can be copyrighted, e.g. books, drawings, sculptures, etc. Copyright law gives developers a high degree of control over the program that they create. In the case of a web application, all parts of the system are copyrighted, including an image on a web site or in a document. Illegally downloading music and pirating software are common copyright violations. While these activities may seem harmless, they can have serious legal and security implications.
As mentioned in the previous subsections, it is sometimes difficult to assure the originality of software. Often, developers declare that the software or application they have developed is 100% their own effort. However, the fact that copy and paste has become a 'style' in software development cannot be denied. A lot more consideration is needed before granting copyright authority to a developer, since some of the code might be copied from existing software. There might also be other issues that need to be taken into account instead of looking only at the source code.
The problem of detecting source code copyright infringement can be viewed as a problem of measuring code similarity between software systems. Clone detection tools can therefore be applied, or can easily be adapted, to detecting copyright infringement (Roy, 2001).
2.3  Code Cloning in Web Applications
Nowadays, the computer is a powerful tool for solving various kinds of problems in our everyday life, and the WWW has become the place where people find, share and exchange information all around the world. Today's web sites are not merely collections of static pages that only display information; they support many more tasks in critical domains such as e-business, e-finance, e-government and e-learning, using dynamic web pages with richer content retrieved from databases. These types of web applications require much more effort in their development life cycles and thus require much larger investments. It is also worth noting that web applications are not only meant for the Internet; with at least a local area network (LAN), we can still have a web application that people within the network can access.
Normally, web application development involves shorter development processes and fuzzy initial requirements, which leads to many latent changes over the development life cycle (Jarzabek and Rajapakse, 2005). Because of the need for shorter development time, there is a possibility of increased code cloning activity. Programmers are often forced to copy code from existing work so that they can shorten the time and ease their work.
As we can see, a substantial amount of research has been done in the area of code cloning, especially in traditional software (e.g. stand-alone applications developed using C, C++, etc.), over at least a decade. Meanwhile, such research is still in its infancy for web-based applications, as only a small number of studies are available. Most of the research revolves around code clone detection using different detection strategies. Calefato et al. (2004) conducted an experiment on a semi-automated approach for function clone detection; Di Lucca et al. (2002) introduced a detection approach based on similarity metrics to detect duplicated pages in web sites and applications implemented with HTML and ASP technology; and Lanubile and Mallardo (2003) introduced a simple semi-automated approach that can be used to identify cloned functions within the scripting code of web applications.
Some of this research adopted the approaches used for traditional software. The one appearing most frequently is the use of CCFinder (Kamiya et al., 2002) as the clone detector, which can detect exact clones and parameterized clones.
While most of the research discusses strategies for clone detection, Jarzabek and Rajapakse (2005) conducted a systematic study to find out the extent of cloning in the web domain compared with traditional software. They found that cloning in web applications is indeed substantial, exceeding the rates for traditional software. Jarzabek also introduced metrics that might be useful for similar studies.
2.3.1  Definition of Clones from Web Application Research View
Clones have been defined in slightly different ways by different researchers. These differences are usually related to the different detection strategies used and the different domains focused on by each researcher. Accordingly, there are clone definitions specific to the web domain.
Di Lucca et al. (2001) define HTML clones as pages that include the same set of tags, while Di Lucca et al. (2002) define clones as pages that have the same, or a very similar, structure. In (De Lucia et al., 2004), web page clones are classified into three types based on similarity in page structure, content, and scripts. In (Rajapakse and Jarzabek, 2005), since their study considers all text files for clone detection, not just HTML pages or script files, a simpler definition of clones is used, i.e., clones are similar text fragments.
2.3.2  Source of Clones
In order to make a complete and correct detection, we first need to know the categories of possible sources of clones that need to be considered. Stan Jarzabek in (Rajapakse and Jarzabek, 2005) mentioned such categories in his research paper, and the information is also used as an evaluation metric. One of the interests of his work is to know which of the following file categories contributes the most clones. The categories are as follows:

i. Static files (STA) – files that need to be delivered 'as is' to the browser, including markup files, style sheets and client side scripts (e.g., HTML, CSS, XSL, JavaScript).

ii. Server pages (SPG) – files containing embedded server side scripting; these generate dynamic content at runtime (e.g., JSP, PHP, ASP, ASP.NET).

iii. Templates (TPL) – files related to additional templating mechanisms used.

iv. Program files (PRG) – files containing code written in a full-fledged programming language (e.g., Java, Perl, C#, Python).

v. Administrative files (ADM) – build scripts, database scripts, configuration files.

vi. Other files (OTH) – files that do not belong to the other five types.

2.4  Existing Work of Code Cloning Detection
Code clone detection has been an active research area for almost two decades. Within that period, many tools and techniques have been invented, either for commercial use or for academic purposes. At the same time, a number of issues have been raised along the way, in terms of the number of clones detected, the types of detected clones, the recall and precision, the scalability, and the coupling to a language, i.e. language dependence or independence. Various researchers have shown that their tools can detect up to almost 70% of the clones within a particular source code.
According to Jiang et al. (2007), all the research in this area can be classified into four main bases: string based, token based, tree based and semantic based. Following this classification, we find that most of the detected clones can be involved in two general ways, namely syntactically and semantically. The first type of clone is usually found from the similarity of functions, including scripting (e.g. JavaScript, VBScript), classes, attributes, and so on. On the other hand, semantic clones concern the meaning of the content, i.e. the knowledge represented, the sequence of declarations and statements (Baxter et al., 1998), etc.
Figure 2.3 below shows the relationship between the classification of detection and the type of clones detected. Most of the reports show that existing work tends to find clones syntactically rather than semantically. Syntactic clone detection covers string-based up to tree-based works. Some tree-based works also show the ability to find clones semantically. Appendix B of this report presents some of the previous works in the clone detection area.
Figure 2.3: Variation of clone detection research and the classification of detection. The figure relates the four bases of detection (string based, token based, tree based and semantic based) to the two general kinds of clones detected (syntactic and semantic).
2.4.1 String based
String based techniques use basic string transformation and comparison algorithms, which makes them independent of programming languages. Techniques in this category differ in the underlying string comparison algorithm. Comparing calculated signatures per line is one possibility for identifying matching substrings (Ehrig and Sure, 2002). Line matching, which comes in two variants, is an alternative that is selected as the representative for this category because it uses general string manipulations. A minimal sketch of both variants is given after the following list.
i. Simple Line Matching: the first variant of line matching, in which both detection phases are straightforward. Only minor transformations using string manipulation operations, which require no or very limited knowledge about possible language constructs, are applied. Typical transformations are the removal of empty lines and white space. During comparison, all lines are compared with each other using a string matching algorithm. This results in a large search space, which is usually reduced using hashing buckets: before comparing all the lines, they are hashed into one of n possible buckets, and afterwards all pairs in the same bucket are compared. An example of this detection is in (Ducasse, 2002).

ii. Parameterized Line Matching: detects both identical as well as similar code fragments. The idea is that since identifier names and literals are likely to change when cloning a code fragment, they can be considered as changeable parameters. Similar fragments which differ only in the naming of these parameters are allowed. To enable such parameterization, the set of transformations is extended with an additional transformation that replaces all identifiers and literals with one common identifier symbol. The comparison then becomes independent of the parameters, so no additional changes are necessary to the comparison algorithm itself. Parameterized line matching is discussed in (Fox, 1998).
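The sketch below illustrates both variants under simplifying assumptions: whitespace normalization is the only transformation, identifiers and literals are parameterized with crude regular expressions (keywords are collapsed as well), and the bucket count is an arbitrary illustrative choice.

import re
from collections import defaultdict

def normalize(line):
    """Simple line matching transform: collapse whitespace only."""
    return " ".join(line.split())

def parameterize(line):
    """Parameterized line matching transform: replace literals and identifiers
    with common symbols (keywords are also collapsed in this rough sketch)."""
    line = normalize(line)
    line = re.sub(r'"[^"]*"|\'[^\']*\'', '""', line)   # collapse string literals
    line = re.sub(r"\b\d+(\.\d+)?\b", "0", line)        # collapse numeric literals
    line = re.sub(r"\b[A-Za-z_]\w*\b", "P", line)       # one common identifier symbol
    return line

def line_clones(files, transform=normalize, n_buckets=1024):
    """Hash transformed lines into buckets to narrow the search space, then
    group lines within a bucket that have identical transformed text."""
    buckets = defaultdict(list)
    for name, text in files.items():
        for lineno, raw in enumerate(text.splitlines(), start=1):
            key = transform(raw)
            if key:                                      # skip empty lines
                buckets[hash(key) % n_buckets].append((key, name, lineno))
    clones = defaultdict(list)
    for entries in buckets.values():
        for key, name, lineno in entries:
            clones[key].append((name, lineno))
    return {key: locs for key, locs in clones.items() if len(locs) > 1}

Calling line_clones with the default normalize transform gives simple line matching; passing transform=parameterize gives the parameterized variant.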
2.4.2  Token based
A program is lexed to produce a token sequence, which is scanned for duplicated token subsequences that indicate potential code clones. Compared to string-based approaches, a token-based approach is usually more robust against code changes such as formatting and spacing. CCFinder (Kamiya, et al., 2002) is one of the most well-known among token-based techniques.
Token based techniques use a more sophisticated transformation algorithm by constructing a token stream from the source code, and hence require a lexer. The presence of such tokens makes it possible to use improved comparison algorithms. Next to parameterized matching with suffix trees, which acts as the representative, we include (Kamiya et al., 2002) in this category because it also transforms the source code into a token structure which is afterwards matched. The latter tries to remove much more detail by summarizing uninteresting code fragments.
Parameterized Matching With Suffix Trees consists of three consecutive steps manipulating a suffix tree as the internal representation. In the first step, a lexical analyzer passes over the source text, transforming identifiers and literals into parameter symbols, while the typographical structure of each line is encoded in a non-parameter symbol. One symbol always refers to the same identifier, literal or structure. The result of this first step is a parameterized string or p-string. Once the p-string is constructed, a criterion is needed to decide whether two sequences in this p-string are a parameterized match or not. Two strings are a parameterized match if one can be transformed into the other by applying a one-to-one mapping renaming the parameter symbols. After the lexical analysis, a data structure called a parameterized suffix tree (p-suffix tree) is built for the p-string. The use of a suffix tree allows a more efficient detection of maximal parameterized matches. All that is left for the last step is to find maximal paths in the p-suffix tree that are longer than a predefined character length. Parameterized matching using suffix trees was introduced in (Baker, 1995), with Dup as an implementation example.
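The parameterization step and a naive duplicate search over the resulting token stream can be sketched as follows. The sketch uses Python's own tokenizer, collapses every identifier to one common symbol rather than Baker's one-to-one parameter renaming, and replaces the suffix tree with a fixed-length sliding window, so it only conveys the idea; the window length is an illustrative assumption.

import io
import tokenize
from collections import defaultdict

def p_string(source):
    """Build a parameterized token sequence: identifiers and literals become
    symbols (keywords are also collapsed here; a real tool would keep them)."""
    symbols = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            symbols.append(("ID", tok.start[0]))
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            symbols.append(("LIT", tok.start[0]))
        elif tok.type == tokenize.OP:
            symbols.append((tok.string, tok.start[0]))
    return symbols

def duplicated_windows(source, window=12):
    """Report repeated token subsequences of a fixed length
    (a stand-in for the maximal-match search in a p-suffix tree)."""
    toks = p_string(source)
    seen = defaultdict(list)
    for i in range(len(toks) - window + 1):
        key = tuple(sym for sym, _ in toks[i:i + window])
        seen[key].append(toks[i][1])        # remember the starting line of each window
    return {key: lines for key, lines in seen.items() if len(lines) > 1}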
2.4.3  Tree based
Parse tree based techniques use a heavyweight transformation algorithm, i.e. the construction of a parse tree. A program is parsed to produce a parse tree or abstract
syntax tree (AST) representation of the source program. Exact or close matches of subtrees can then be identified by comparing subtrees within the generated parse tree or AST. Alternatively, different metrics can be used to fingerprint the subtrees, and subtrees with similar fingerprints are reported as possible duplicates (Kontogiannis et
al., 1996). Because of the richness of this structure, it is possible to try various comparison algorithms as well.
Metric Fingerprints builds on the idea that you can characterize a code fragment using a set of numbers. These numbers are measurements which identify the functional structure of the fragment and sometimes the layout. The metric fingerprint technique can be divided in five steps, each with a well-defined task. However the algorithm behind each task may differ between implementations. Before we can characterize the functional structure of a code fragment with numbers, it’s wise to transform the source code into a representation that allows us to calculate such measurements efficiently. This transformation job then ends up with one large syntax tree. This tree is then split into interesting fragments. The choice of the type of fragments used is difficult because it affects the detection results. Most of the time, however, method and scope blocks are used as fragments since they are easily extracted from a syntax tree.
Afterwards the fragments are characterized through a set of measurements by measuring the values for a set of metrics, chosen in advance. This set of metrics can differ between various implementations, but most of the time it specifies functional properties. However there are implementations in which layout metrics are used as well. Finally, these sets of numbers are compared to each other. Depending on the implementation, algorithms with different levels of sophistication or power may be used. One possible approach calculates the Euclidean distance between each pair of fingerprints, considering fragments within zero distance as clones.
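A rough sketch of the metric fingerprint idea follows, assuming a small, arbitrarily chosen set of line-level measurements and an illustrative distance threshold; a real implementation would compute its metrics from the syntax tree rather than from raw text.

import math

def fingerprint(fragment):
    """Characterize a code fragment by a few simple counts (an illustrative metric set)."""
    lines = [ln for ln in fragment.splitlines() if ln.strip()]
    return (
        len(lines),                                          # non-empty lines
        sum(len(ln.split()) for ln in lines),                # rough token count
        sum(ln.count("(") for ln in lines),                  # call/parenthesis count
        sum(1 for ln in lines
            if ln.lstrip().startswith(("if", "for", "while"))),  # branches and loops
    )

def distance(fp1, fp2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fp1, fp2)))

def fingerprint_clones(fragments, threshold=2.0):
    """Report fragment pairs whose fingerprints lie within the distance threshold."""
    names = list(fragments)
    fps = {name: fingerprint(fragments[name]) for name in names}
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if distance(fps[a], fps[b]) <= threshold]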
2.4.4  Semantic based
Semantics-aware approaches have also been proposed. Komondoor and Horwitz (2001) suggest the use of program dependence graphs (PDGs) and program slicing to find isomorphic PDG subgraphs in order to identify code clones. They also propose an approach to group identified clones together while preserving the semantics of the original code, for automatic procedure extraction to support software refactoring. Such techniques have not scaled to large code bases.
2.4.5  Fingerprinting
A fingerprint of a document can be defined as the set of all possible document substrings of a certain length, and fingerprinting is the process of generating fingerprints. Fingerprinting techniques mostly rely on the use of k-grams. The process divides a document into k-grams and extracts a hash value from each one. The result of this process is a fingerprint that represents the document in each of its sub-parts of length k. The fingerprints of two documents can then be compared in order to detect plagiarism. There are two types of fingerprinting, i.e. full fingerprinting and selective fingerprinting.
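A minimal sketch of full fingerprinting with character k-grams, where two documents are compared through the overlap of their hashed k-gram sets; the value of k and the use of Jaccard overlap are illustrative choices, and a selective scheme would keep only a subset of the hashes.

def kgram_fingerprint(text, k=5):
    """Full fingerprint: the set of hashes of every k-character substring."""
    text = " ".join(text.split())                    # normalize whitespace
    return {hash(text[i:i + k]) for i in range(len(text) - k + 1)}

def fingerprint_similarity(doc1, doc2, k=5):
    """Jaccard overlap of the two fingerprints."""
    fp1, fp2 = kgram_fingerprint(doc1, k), kgram_fingerprint(doc2, k)
    if not fp1 or not fp2:
        return 0.0
    return len(fp1 & fp2) / len(fp1 | fp2)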
2.4.6  Analysis on Current Approaches
From all the clone detection works shown in Appendix B, we can form larger groups of techniques: the existing work can be classified into structural detection (i.e. tree based, token based, metric based) and string based detection.
As we can see in Appendix B, more clone detection has been done using tree representations than string or token detection, because tree-based clone detection tools are more robust. In the table, we group all approaches that use an abstract syntax tree (AST) or a program dependence graph (PDG) under the same roof, i.e. the tree-based approach. Robust here means that more types of clones can be detected rather than only identical clones. Most tree-based tools can detect identical clones, similar clones and near-similar clones. All these types of clones can appear, since one common programming style is to copy-paste-adapt from existing source code: programmers usually modify the code that has been copied to suit the needs of their program rather than using the code fragment directly. Besides that, tree-based clone detection tools are usually more scalable than the other approaches and manage to detect clones even in large source code. However, even though tree-based tools are recorded as detecting many types of clones, the overall results show that they are lower in recall and precision.
String based clone detection is limited to a few types of clones only. Generally, if a non-similar string exists within a sequence, the fragment is not considered a clone, so the major types that can be detected are exact clones and some nearly identical, i.e. similar, clones; it is hard to detect other types of clones. The strength of string based detection, however, is that it is well suited for work involving text written in many languages, including natural languages, i.e. it is language independent.
Among all the tools listed above, the tools by Baxter, Krinke and Jiang are among the most outstanding. Their techniques are simple and yet have the strengths of scalability, the avoidance of a trade-off between recall and precision, language independence, and so on. Some of the tools, on the other hand, aim to eliminate the cloned code for better software maintenance. In (Rajapakse and Jarzabek, 2005), for example, the authors used the composition-with-adaptation metaprogramming technique called XVCL. In their work, they designed a meta-representation so that they could produce the JDK Buffer classes in exactly the same form as they appear in the original Buffer classes, rather than redesigning the design, in order to allow the integration of the XVCL solution into contemporary programming methodologies.
Even though at least 70% of the available clone detection tools are aimed at detecting syntactic clones, the issue of detecting semantic clones has indeed become a concern to the research community. Unfortunately, no proper way of detecting semantic clones is available yet (Roy et al., 2007). An alternative approach to detecting such clones could be to apply extensive and intelligent normalizations and transformations to the source code. However, it is not clear how such normalizations and transformations can assist in detecting semantic clones; large scale empirical studies are required to verify this approach.
From the observation, we see that only two of the above tools consider the semantic view of clones, namely the tools by Komondoor and Krinke. Semantic clones in their tools mean all the clones that have similar functionality. If the functionalities of two code fragments are identical or similar, i.e. they have similar pre- and post-conditions, they are called semantic clones. The clones might be two or more code fragments that perform the same computation but are implemented through different syntactic variants.
2.5  The Semantic Web
It is obvious that the term ‘ontology’ has become a key word within modern computer science. It is becoming more important in fields such as knowledge management, information integration, cooperative information systems, information retrieval and electronic commerce. One application area which has recently seen an explosion of interest is the so called Semantic Web.
2.5.1  Architecture of the Semantic Web
The Semantic Web is an evolving extension of the World Wide Web (WWW) in which web content can be expressed not only in natural language, but also in a format that can be read and used by automated tools, thus permitting people and machines to find, share and integrate information more easily. It derives from W3C director Tim Berners-Lee's vision of the Web as a universal medium for data, information, and knowledge exchange.
In building one layer of the Semantic Web on top of another, there are some principles that should be followed; downward compatibility and upward partial understanding (Antoniou et al., 2003).
Figure 2.4: Architecture of the Semantic Web. The layers range from a document exchange standard and the coding of structure at the bottom, through a lightweight ontology language, up to the inference layers.
At the bottom we find XML, a language that lets one write structured Web documents with a user-defined vocabulary. XML is particularly suitable for sending documents across the Web. RDF is a basic data model, like the entity-relationship
model, for writing simple statements about Web objects (resources). The RDF data model does not rely on XML, but RDF has an XML-based syntax. Therefore, in Figure 2.4 it is located on top of the XML layer. RDF Schema provides modeling primitives for organizing Web objects into hierarchies. Key primitives are classes and properties, subclass and sub-property relationships, and domain and range restrictions. RDF Schema (RDF-S) is based on RDF. RDF Schema can be viewed as a primitive language for writing ontologies. But there is a need for more powerful ontology languages that expand RDF Schema and allow the representation of more complex relationships between Web objects.
The logic layer is used to enhance the ontology language further and to allow writing application-specific declarative knowledge. The proof layer involves the actual deductive process, as well as the representation of proofs in Web languages (from lower levels) and proof validation. Finally, trust will emerge through the use of digital signatures and other kinds of knowledge, based on recommendations by agents we trust, or by rating and certification agencies and consumer bodies. The Web will only achieve its full potential when users have trust in its operations (security) and in the quality of the information provided.
2.5.2 Web Ontology
The WWW now is widely used as a universal medium for information exchange. Semantic interoperability among different information systems in the WWW is limited due to information heterogeneity, and the non semantic nature of HTML and URLs. Ontologies have been suggested as a way to solve the problem of information heterogeneity by providing formal and explicit definitions of data. An ontology is a model of the world, represented as a tangled tree of linked concepts. Concepts are language-independent abstract entities, not words. They are expressed in this ontology using English words and phrases only as a simplifying convention.
The term 'ontology' is imported from philosophy, and the understanding of 'ontology' in computer science differs somewhat from the understanding of the term in traditional philosophy. Ontologies are developed to provide a machine-processable semantics of information sources that can be communicated between different agents (software and humans). The purpose of a semantic ontology is to improve automated text processing by providing language-independent, meaning-based representations of concepts in the world. The ontology shows how concepts are related (e.g., DOG and CAT are closely related, both being MAMMALs) and what properties each has (e.g., both CAT and DOG have FUR and a TAIL, but CAT can be the AGENT of HISS, whereas DOG can be the AGENT of BARK). Unlike words, concepts in an ontology are unambiguous, i.e. they have exactly one meaning.
Many definitions of ontologies have been given in the last decade, but the one that many researchers regard as best characterizing the essence of an ontology is based on the definition given by Gruber (1993):
“An ontology is a formal, explicit specification of a shared conceptualization”.
A 'conceptualization' refers to an abstract model of some phenomenon in the world which identifies the relevant concepts of that phenomenon. 'Explicit' means that the types of concepts used and the constraints on their use are explicitly defined. 'Formal' refers to the fact that the ontology should be machine readable. Figure 2.5 shows a simple example of an ontology. Put more simply, ontologies can be described as metadata schemas, providing a controlled vocabulary of concepts, each with explicitly defined and machine-processable semantics. By defining shared and common domain theories, ontologies help both people and machines to communicate concisely.
Generally, an ontology should provide descriptions of the elements: classes or 'things' in the various domains of interest, relationships among those 'things', and properties or attributes that 'things' should possess. Many formal definitions have been reported in the literature. One of the works (Maedche and Staab, 2002) proposed that an ontology can be described as a five-tuple O = {C, R, CH, rel, OA} where:
C and R are two disjoint sets, called the set of concepts and the set of relations.
CH ⊆ C × C is a concept hierarchy or taxonomy, where CH(C1, C2) indicates that C1 is a subconcept of C2.
rel : R → C × C is a function that relates the concepts non-taxonomically.
OA is a set of ontology axioms, expressed in an appropriate logical language.
Figure 2.5: Simple example of an ontology, showing concepts C and relationships R (Ehrig and Sure, 2002)
2.5.3 Web Ontology Description Languages
A few different classifications of ontologies have been proposed in the literature. According to (Breitman et al., 2006), ontologies can be classified according to their semantic spectrum, their generality and the information represented. With the assumption that ontologies are not explicitly installed in the web applications that are going to be tested, we take the semantic spectrum view of ontology. From this perspective, ontologies can range from lightweight to heavyweight, depending on the complexity and sophistication of the elements they contain.
Ontology description languages are specifically designed to define ontologies. As explained in the section above, there is a need for ontology languages more powerful and expressive than RDF and RDF-S that allow the representation of more complex relationships between Web objects. Several ontology description languages, such as SHOE, Oil, DAML, DAML+Oil, and OWL, were later defined based on RDF/RDF-S. Table 2.2 contains a brief description of these ontology languages.
Table 2.2: Brief description of ontology languages

SHOE (Simple HTML Ontology Extension): the first ontology description language created for the Semantic Web; extends HTML with new tags to semantically annotate Web pages (which can be used by software agents to improve searching).

Oil (Ontology Inference Layer): the result of the On-To-Knowledge project (funded by the European Community); formal semantics based on description logic.

DAML (DARPA Agent Markup Language): sponsored by DARPA (first release in August 2000); the goal was to develop a language by extending RDF/RDF-S, as well as tools aligned with the concept of the Semantic Web.

DAML+Oil: the integration of DAML and Oil; released in 2001 and submitted to the W3C; became the starting point for the Web Ontology Working Group (WebOnt).

OWL (Web Ontology Language): the result of the WebOnt group; offers greater machine interpretability of Web content than that supported by XML and RDF/RDF-S by providing additional vocabulary based on description logic; has three sublanguages, OWL Lite, OWL DL and OWL Full; became a set of W3C recommendations.
The layered model of the Semantic Web shown in Figure 2.4 puts the relationship among ontology description languages, RDF and RDF Schema, and XML into perspective (as described in the preceding subtopic). We can conclude from the explanation above that XML acts as the document exchange standard and RDF and RDF Schema as mechanisms to describe the resources available on the web; as such they may be classified as lightweight ontology languages (Breitman et al., 2006). Full ontology description languages appear in the fourth layer as a way to capture more semantics, while the topmost layer introduces an expressive rule language.
2.5.4 Various Applications of Ontology
Since the beginning of the nineties, ontologies have become a popular research topic investigated by several Artificial Intelligence research communities. The reason ontologies are becoming so popular is in large part due to what they promise: a shared and common understanding of some domain that can be communicated between people and application systems.
Ontologies have attracted interest from several Artificial Intelligence research communities, including knowledge engineering (Stutt, 1997), natural-language processing (Estival, 2004) and knowledge representation (Stevens et al., 2000). More recently, the notion of ontology has also become widespread in fields such as intelligent information integration (Fox et al., 1998), cooperative information systems (Vallet et al., 2005), information retrieval (McGuinness, 1999), and knowledge management. Most of this research is related to Semantic Web technology.
One of the current research areas within the Semantic Web is the semantic annotation of information sources. On-line lexical ontologies can be exploited as a priori common knowledge to provide easily understandable, machine-readable metadata. Nevertheless, the absence of terms related to specific domains causes a loss of semantics. In (Benassi et al., 2004), the authors present WNEditor, a tool that aims at guiding the annotation designer during the creation of a domain lexicon ontology, extending the pre-existing WordNet ontology. New terms, meanings and relations between terms are virtually added and managed while preserving WordNet's internal organization.
McGuinness uses ontology in the electronic commerce research stream (McGuinness, 1999). Broader domains in the field increase the need for thoughtful content organization and browsing support. Ontology in that project is used not only to support searching but also to enhance browsing and more active smart notification services. The paper also identifies some of the issues with existing ontology-enhanced e-commerce applications. Stevens et al. (2000) aim to introduce the reader to the use of ontologies within bioinformatics. A description of the type of knowledge held in an ontology was given. The paper was illustrated throughout with examples taken from bioinformatics and molecular biology, and a survey of current biological ontologies was presented. From the survey, we can see that the use to which an ontology is put largely determines its content. Stevens also described the process of building an ontology, introducing the reader to the techniques and methods currently in use and the open research questions in ontology development.
With the need for ontologies in different applications nowadays, a number of studies have also been carried out to address issues surrounding ontologies, e.g. the heterogeneity issue. Much research has focused on ontology mapping, as well as ontology matching, ontology alignment, ontology merging, ontology integration, etc. Each of these terms has a different definition and, of course, different research aims. Different approaches have been proposed in these works and each has its own unique attributes.
2.5.5 Ontology Mapping
Ontology mapping is seen as a solution provider in today's landscape of ontology research. As the number of ontologies that are made publicly available and accessible on the Web increases steadily, so does the need for applications to use them. A single ontology is no longer enough to support the tasks envisaged by a distributed environment like the Semantic Web. Multiple ontologies need to be accessed from different systems. The distributed nature of ontology development has led to dissimilar ontologies for the same or overlapping domains. Thus, various parties with different ontologies do not fully understand each other. To solve these problems, it is necessary to use ontology mapping geared for interoperability.
Given that no universal ontology exists for the WWW, work has focused on finding semantic correspondences between similar elements of different ontologies, i.e., ontology mapping. Automatic ontology mapping is important to various practical applications such as the emerging Semantic Web (Ehrig, 2006), information transformation and data integration (Dou and McDermott, 2005), query processing across disparate sources (Gašević and Hatala, 2006), and many others. Ehrig and Staab define ontology mapping as follows: "Given two ontologies O1 and O2, mapping one ontology onto another means that for each entity (concept C, relation R, or instance I) in ontology O1, we try to find a corresponding entity, which has the same intended meaning, in ontology O2". According to Ehrig, an ontology mapping represents a function between ontologies. The original ontologies are not changed; rather, additional mapping axioms describe how to express concepts, relations or instances in terms of the second ontology. Whereas alignment merely identifies the relation between ontologies, mapping focuses on the representation and the execution of the relations for a certain task.
Ontology mapping is also known as ontology alignment, semantic integration, and ontology merging in some cases, depending upon the application and intended outcome. Noy (2004) argues that research in ontology mapping is organized around three specific areas of investigation:

1. Mapping discovery: Given two ontologies, how do we find the similarities between them, and how do we determine which concepts and properties represent similar notions? This is the type that will be applied in the clone detection process. Some works on this kind of ontology mapping can be found in (Qian and Zhang, 2006a), (Qian and Zhang, 2006b), (Zongjiang et al., 2006), (Jin et al., 2007), (Zhang et al., 2008) and (Ichise, 2008).

2. Declarative formal representation of mappings: Given two ontologies, how do we represent the mappings between them to enable reasoning with mappings?

3. Reasoning with mappings: Once the mappings are defined, what do we do with them, and what types of reasoning are involved?
Figure 2.6: Simple example of mapping between two ontologies. The figure shows an example of an ontology mapping. Given ontology A and ontology B, the aim is to find the similar concepts between the first and the second hierarchies. Both ontologies are represented in the form of graphs. As we can see in the figure, by doing ontology mapping we learn that 'thing' can be mapped to 'object', 'car' can be mapped to 'automobile', etc. Using this information, we know that some of the information in ontologies A and B refers to the same concept semantics.
2.5.6 Ontology Mapping Approaches
There are a few approaches to ontology mapping in the literature. The approaches can be simplified into the four main categories below; Figure 2.7 illustrates them.

i. One-to-one approach: for each ontology, a set of translating functions is provided to allow communication with the other ontologies without an intermediate ontology, e.g. in (Calasanz et al., 2006).

ii. Single shared ontology: uses a shared ontology which serves as the ontology counterpart of a lingua franca. Resource ontologies are mapped onto the shared ontology, thus avoiding the potentially large number of individual resource mappings that may otherwise have to be defined, e.g. in (Visser et al., 1999).

iii. Multiple shared ontologies: uses the same idea as the single shared ontology, but in this category there may be several shared ontologies.

iv. Ontology clustering: resources are clustered together on the basis of similarities. Additionally, ontology clusters can be organized in a hierarchical fashion, e.g. in (Visser et al., 1999).
(Panels: (i) the one-to-one approach between Onto A and Onto B; (ii) the single shared-ontology approach; (iii) the multiple shared-ontology approach; (iv) ontology clustering, e.g. by language, with Dutch, French and Italian mirrors under one application.)
Figure 2.7: Illustration of ontology mapping approaches
2.5.7 The Ontology Mapping Technique
Quite a number of ontology mapping techniques exist in the literature today, and brief descriptions of a few of them have been given above. In order to use such a technique for our main problem, which is clone detection, we need to find one that can adapt to the problem environment. In this project, we are dealing with web documents that consist of more than one format. The basic nature of web application development is that a programmer tends to modify the cosmetics of the web appearance and sometimes makes partial changes to the scripting, and so on.
Current clone detection techniques are either tree based, string based, token based and so on. Tree-based techniques can deal with files of any size but produce rather poor results. String-based techniques are less complex than the others but can only detect exact clones because of the string matching strategy.
We need a technique that works fast and whose complexity is very low, leading to a fast matching procedure. Besides that, the algorithm should be intelligent enough to differentiate concepts. Consider for example the words "score" and "store": they represent two completely different concepts but resemble each other closely. Working with web applications always involves a lot of strings that represent information for the end user as well as functions for document processing, unlike traditional software that mainly consists of functions and procedures. A simple algorithm is preferable, because a complex algorithm that contains too many features and parameters can affect the performance of the mapping.
For this reason, we have considered the string metric proposed in (Stoilos et al., 2005) as one of the techniques adopted in this project. The authors argue that the similarity between two entities is related to their commonalities as well as their differences; thus, the similarity should be a function of both of these features. The metric is defined by the following equation:

Sim(s1, s2) = Comm(s1, s2) − Diff(s1, s2) + Winkler(s1, s2)        (1)

where Comm(s1, s2) stands for the commonality between s1 and s2, Diff(s1, s2) for the difference, and Winkler(s1, s2) for the improvement of the result using the method introduced by Winkler (1999).
The function of commonality is motivated by the substring string metric. In the substring metric the biggest common substring between two strings is computed. This process is further extended by removing the common substring and searching again for the next biggest substring until none can be identified. The sum of the lengths of these substrings is then scaled with the lengths of the strings. The intuition behind this extension of the substring metric is the following: in the field of Computer Science, researchers tend to use descriptive names for their variables or for the units that represent real world entities, and in other cases they tend to concatenate words to create new ones.

Comm(s1, s2) = 2 × Σi length(maxComSubStringi) / (length(s1) + length(s2))        (2)
As for the difference function, this is based on the length of the unmatched strings that result from the initial matching step. The authors believe that the difference should play a less important role in the computation of the overall similarity. The solution is based on the Hamacher product, which is a parametric triangular norm. This leads to the following equation:

Diff(s1, s2) = (uLens1 × uLens2) / (p + (1 − p) × (uLens1 + uLens2 − uLens1 × uLens2))        (3)
where p ∈ [0, ∞), and uLens1 and uLens2 represent the lengths of the unmatched substrings of the initial strings s1 and s2, scaled by the string lengths, respectively. Observe that the parameter p can be adjusted at will to give a different importance to the difference factor.
For Winkler(s1, s2), the improvement is based on work done by Winkler. The formula is as follows:

Winkler(s1, s2) = Φ + i × 0.1 × (1 − Φ), for i = 1, 2, 3, 4        (4)

where Φ is the similarity before the improvement and i is the length of the common prefix of the two strings, considered up to a maximum of four characters.
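To make the metric concrete, the following Python sketch implements one possible reading of equations (1)-(4); the function names, the greedy common-substring search and the default value p = 0.6 are our own assumptions, not part of (Stoilos et al., 2005).

```python
def _common_substrings_total(s1, s2):
    """Greedily strip the longest common substring until none remains;
    return the total matched length and the unmatched leftovers."""
    a, b, total = s1, s2, 0
    while True:
        best = ""
        for i in range(len(a)):
            for j in range(i + 1, len(a) + 1):
                if j - i > len(best) and a[i:j] in b:
                    best = a[i:j]
        if not best:
            break
        total += len(best)
        a = a.replace(best, "", 1)
        b = b.replace(best, "", 1)
    return total, a, b


def string_similarity(s1, s2, p=0.6, prefix_weight=0.1):
    """Eq. (1): commonality minus difference, plus the Winkler prefix boost."""
    if not s1 or not s2:
        return 0.0
    matched, rest1, rest2 = _common_substrings_total(s1, s2)
    comm = 2.0 * matched / (len(s1) + len(s2))                    # eq. (2)
    u1, u2 = len(rest1) / len(s1), len(rest2) / len(s2)           # scaled unmatched lengths
    diff = (u1 * u2) / (p + (1 - p) * (u1 + u2 - u1 * u2))        # eq. (3), Hamacher product
    base = comm - diff
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):                            # common prefix, at most 4 chars
        if c1 != c2:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)             # eq. (4)
```

For the example above, string_similarity("score", "store") returns a value below 1 because of the unmatched characters, even though the two words share a large common substring.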
We note that it might be good to test the program with other types of well-known similarity measurements as well. All these measurements are stated in the next subchapter.
After doing the initial experiment, we found that the above metric by itself is not enough to detect clones: we would need to find and compute the similarity of each and every node in all the source codes in order to find the clones. This is too expensive, since no filtering at all has been applied. So, as a solution, we use another mapping technique, as in (Todorov, 2008). This technique uses two layers of mapping: the first layer is used to detect structural similarity and the second one is used to detect the instance similarity between concepts in ontologies O1 and O2.
In that paper, the author presents a procedure for mapping hierarchical ontologies populated with instances taken from properly classified text documents. It combines a structural and an instance-based approach in order to yield concept-to-concept mapping assertions between two input ontologies. It can be successfully applied to finding correspondences between the elements of two directories.
First Phase: Structural Ontology Similarity
• Before applying a structure-based mapping technique, the author starts by "translating" the ontologies into graphs. Finding correspondences between elements of such graphs relates to the graph homomorphism problem.
• The focus is on hierarchical ontologies, i.e. a very narrow set of graphs (directed rooted trees), on which a distance function is then defined.
• The step is done by finding the maximal common subgraph, where a common subgraph G is maximal if there exists no other subgraph isomorphism from some G' to G1 and G2 that has more nodes than G; this is denoted mcs(G1, G2).
• A distance function is defined which will be used as an indicator of the structural similarity of two taxonomies. Let |G| denote the number of vertices in a graph G. The distance between two nonempty graphs G1 and G2 is defined as

d(G1, G2) = 1 − |mcs(G1, G2)| / max(|G1|, |G2|)
Second Phase: Identifying Concept Matches
• Extensional information is used by modelling concepts as "bags" of instances.
• This is a set-theoretic approach based on testing the intersection between classes.
• Let O1 = (G1, is_a) and O2 = (G2, is_a) be two ontologies, let IO1 and IO2 be the sets of instances belonging to the ontologies correspondingly, and let A ∈ C1 and B ∈ C2 be two concepts, one from each ontology, viewed as sets of instances. We consider A and B very similar when A ∩ B ≈ A ≈ B and not similar at all when A ∩ B = Ø.
• The match is computed using a Jaccard-style coefficient:

P(A, B) = (|A ∩O1 B| + |A ∩O2 B|) / (|IO1| + |IO2|)
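A minimal Python sketch of these two phases, under the simplifying assumptions that the size of the maximal common subgraph has been computed elsewhere (that computation is expensive and not shown) and that instances are plain hashable values; all names below are ours.

```python
def structural_distance(mcs_size, size_g1, size_g2):
    """Phase 1: d(G1, G2) = 1 - |mcs(G1, G2)| / max(|G1|, |G2|)."""
    return 1.0 - mcs_size / max(size_g1, size_g2)


def concept_match(instances_a, instances_b, all_instances_o1, all_instances_o2):
    """Phase 2: a Jaccard-style score in the spirit of the coefficient above,
    counting the instances shared by concepts A and B."""
    shared = len(set(instances_a) & set(instances_b))
    return 2 * shared / (len(set(all_instances_o1)) + len(set(all_instances_o2)))
```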
2.5.7.1 String Metrics
String metrics (also known as similarity metrics) are a class of textual metrics resulting in a similarity or dissimilarity (distance) score between a pair of text strings, used for approximate matching or comparison and in fuzzy string searching. For example, the strings "Sam" and "Samuel" can be considered similar to a degree, although they are not the same. A string metric provides a floating point number giving an algorithm-specific indication of similarity.
The most widely known (although rudimentary) string metric is Levenshtein distance (also known as edit distance), which operates on two input strings and returns a score equal to the number of insertions, deletions and substitutions needed to transform one input string into the other. Simplistic string metrics such as Levenshtein distance have since been expanded to include phonetic, token, grammatical and character-based methods of statistical comparison. A widespread application of string metrics is DNA and RNA sequence analysis, which is performed with optimised string metrics to identify matching sequences.
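As a brief illustration (not tied to any particular library), the classic dynamic-programming formulation of Levenshtein distance can be sketched in Python as follows:

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))               # distance from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                               # distance from prefix of s to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,         # delete from s
                            curr[j - 1] + 1,     # insert into s
                            prev[j - 1] + cost)) # substitute / copy
        prev = curr
    return prev[-1]

# levenshtein("Sam", "Samuel") == 3 (three insertions)
```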
String metrics are used heavily in information integration and are currently used in fraud detection, fingerprint analysis, plagiarism detection, ontology merging, DNA and RNA analysis, image analysis, evidence-based machine learning, database deduplication, data mining, Web interfaces (e.g. Ajax-style suggestions as you type), data integration, semantic knowledge integration, etc. The following table lists some well-known string metrics which will be used in the experiment.
Table 2.3: List of string metrics

1. Jaro-Winkler: dj = (1/3) × (m/|s1| + m/|s2| + (m − t)/m)

2. Levenshtein Distance: D(i, j) = min{ D(i−1, j−1) + d(si, tj) [substitute/copy], D(i−1, j) + 1 [insert], D(i, j−1) + 1 [delete] }

3. Needleman-Wunch Distance: D(i, j) = min{ D(i−1, j−1) + d(si, tj) [substitute/copy], D(i−1, j) + G [insert], D(i, j−1) + G [delete] }

4. Dice Coefficient: s = 2|X ∩ Y| / (|X| + |Y|)

5. Smith-Waterman Distance: D(i, j) = min{ 0 [start over], D(i−1, j−1) + d(si, tj) [substitute/copy], D(i−1, j) + G [insert], D(i, j−1) + G [delete] }

6. Jaccard Similarity: sim(X, Y) = (X · Y) / (|X||Y| − X · Y)

7. Monge Elkan Distance: match(A, B) = (1/|A|) × Σ(i=1..|A|) max(j=1..|B|) match(Ai, Bj)

8. Matching Coefficient: (a + d) / p

9. Euclidean Distance: ED = sqrt( Σ(i=1..n) (pi − qi)² )

10. L1 Distance: L1(q, r) = Σy |q(y) − r(y)|

11. Gotoh Distance: Dij = max{ Di−1,j−1 + d(ai, bj), maxk { Di−k,j − Wk }, maxl { Di,j−l − Wl }, 0 }

12. Jaro Distance: Jaro(s, t) = (1/3) × ( |s′|/|s| + |t′|/|t| + (|s′| − Ts′,t′) / (2|s′|) )

13. Soundex Distance: 1) retain the first letter of the word; 2) change occurrences of A, E, I, O, U, H, W and Y into 0 (zero); 3) change the rest of the letters according to the Soundex table; 4) remove all pairs of consecutive digits; 5) remove all zeros from the resulting string; 6) the result is the Soundex code.

14. Overlap Coefficient: OC(q, r) = |q ∩ r| / min{|q|, |r|}

15. Cosine Similarity: cos(q, r) = Σy q(y)·r(y) / ( sqrt(Σy q(y)²) × sqrt(Σy r(y)²) )

16. q-Gram: 1) "slide" a window of length q over the characters of a string to create a number of length-q grams; 2) the score is the count of identical q-grams over the total number of q-grams available.
2.5.7.2 Frequent Subgraph Mining
The problem of frequent subgraph mining has been studied intensively in the past years, with many applications in bioinformatics/ chemistry, computer vision and image and object retrieval, and Web mining. Algorithms are based on the a priori principle stating that a graph pattern can only be frequent in a set of transactions if all its subgraphs are also frequent.
Several algorithms have been developed to discover arbitrary connected subgraphs. One research direction has been to explore ways of computing canonical labels for candidate graphs to speed up duplicate-candidate detection. Various schemes have been proposed to reduce the number of generated duplicates. In particular, candidates that are not in canonical form need not be developed and investigated further. Some algorithms constrain the form of the sought subgraphs, some search for induced subgraphs (a pattern can only be embedded in a transaction if its nodes are connected by the same topology in the transaction as in the pattern). And some focus on “relational graphs” (graphs with a bijective label function).
Frequent subgraph mining considers a set G of isomorphically distinct graph patterns whose vertices and edges are labeled in a set L. Because two isomorphic patterns of G are necessarily equal, the subgraph isomorphism relation ≤G is an ordering relation over G. Given a database D of graphs labeled in L, the frequency freq(p) of a pattern p ∈ G is the number of elements of D which contain at least one subgraph isomorphic to p. A pattern is frequent if its frequency is greater than or equal to a given threshold.
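The frequency notion above can be made concrete with the following deliberately naive Python sketch, which counts pattern occurrences with networkx's VF2 matcher; real miners such as those discussed below avoid this brute-force subgraph isomorphism test.

```python
from networkx.algorithms import isomorphism

def frequency(pattern, database):
    """Number of graphs in `database` containing a (node-induced) subgraph
    isomorphic to `pattern`, with node and edge labels required to match."""
    same_label = lambda a, b: a.get("label") == b.get("label")
    count = 0
    for g in database:
        matcher = isomorphism.GraphMatcher(g, pattern,
                                           node_match=same_label,
                                           edge_match=same_label)
        if matcher.subgraph_is_isomorphic():
            count += 1
    return count

def is_frequent(pattern, database, threshold):
    return frequency(pattern, database) >= threshold
```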
In this project, we are going to use four of the most popular frequent subgraph miners, implemented on a common infrastructure: MoFa, gSpan, FFSM and Gaston (Meinl et al., 2006). All of these miners are included in the ParMol project (Meinl et al., 2006). The original motivation of ParMol was to find common features in large sets of molecules.
2.5.7.3 MoFa, gSpan, FFSM, and Gaston
All four fragment miners included in ParMol work on general, undirected graphs with labeled nodes and edges. They are all restricted to finding connected subgraphs and traverse the lattice in depth-first order.
i. MoFa (Molecule Fragment Miner, by Borgelt and Berthold in 2002)
MoFa has been targeted towards molecular databases, but it can also be used for arbitrary graphs. MoFa stores all embeddings. New subgraphs are built by extending old subgraphs with an edge (and a node if necessary). Extension is restricted to those fragments that actually appear in the database. Isomorphism tests in the database can be done cheaply by testing whether an embedding can be refined in the same way. MoFa uses a fragment-local heuristic close to the maximum source node extension described above and uses standard isomorphism testing to remove duplicates.
ii. FFSM (Fast Frequent Subgraph Mining, by Huan, Wang, and Prins in 2003)
FFSM represents graphs as triangle matrices (node labels on the diagonal, edge labels elsewhere). The canonical adjacency matrix (CAM) is used to detect duplicates. The matrix code is the concatenation of all its entries, left to right and line by line. Based on lexicographic ordering, isomorphic graphs have the same canonical code. New subgraphs are generated by merging CAMs that have special properties in common. In addition, FFSM needs a restricted extension operation: a new edge-node pair may only be added to the last node of a CAM. When FFSM joins two matrices of fragments to generate new subgraphs, at most two new structures result. After refinement generation, FFSM permutes matrix lines to check whether a generated matrix is in canonical form; if not, it can be pruned. FFSM stores embeddings to avoid explicit subgraph isomorphism testing.
iii. gSpan (Graph-based Substructure pattern mining, by Yan and Han in 2002)
gSpan uses a canonical form for graphs (called the dfs-code, for depth first search code) resulting from the rightmost path extension used to eliminate remaining duplicates. A depth first traversal of a graph defines an order in which the nodes and edges are visited; the concatenation of edge representations in that order is the graph's dfs-code. Refinement generation is restricted by gSpan in two ways:
o First, fragments can only be extended at nodes that lie on the rightmost path of the depth first search tree.
o Second, fragment generation is guided by occurrence in the appearance lists.
Since these two pruning rules cannot fully prevent isomorphic fragment generation, gSpan computes the canonical (lexicographically smallest) dfs-code for each refinement. Refinements with a non-minimal dfs-code can be pruned. Since gSpan only stores appearance lists for each fragment instead of embeddings, explicit subgraph isomorphism testing must be done on all graphs in these appearance lists.
iv. Gaston (GrAph/Sequence/Tree extractiON, by Nijssen and Kok in 2004)
Gaston stores all embeddings in order to generate only new subgraphs that actually appear in the database and to achieve fast isomorphism testing. The main insight of Gaston is that there are efficient ways to uniquely enumerate paths and trees; the last phase deals with general graphs. As Gaston first generates paths, then trees, and finally general graphs, it has a very special search strategy through the subgraph lattice. For each of the three steps, different and specialized methods to generate new subgraphs are used. For the last phase, Gaston defines a global order on cycle-closing edges to minimize the need for graph isomorphism tests. By considering fragments that are paths or trees first, and by only proceeding to general graphs with cycles at the end, a large fraction of the work can be done efficiently; only in that last phase does Gaston face the NP-completeness of the subgraph isomorphism problem. Duplicate detection is done in two phases: hashing to pre-sort, and a graph isomorphism test for final duplicate detection. Gaston can calculate the frequency of a subgraph either with isomorphism tests or with embedding lists.
2.5.7.4 Representing Web Programming as a Tree
Generally, when we do web programming we are actually writing lines of code with tags, and the tags have further properties such as contents, attributes and values. We can therefore also represent web programming in a hierarchical way. For example, the web source code in the following figure can be represented as a tree with hierarchical order if the code is well written, i.e. valid code with complete < and > symbols. The figure shows a simple example of how an XML code can be represented in the form of a tree.
As we are going to represent the code in XML syntax, we state here the formal definition of a directed rooted tree, since an XML tree is always directed and rooted. A directed tree is a connected acyclic graph G(V, E), where V(G) denotes the set of vertices of G and E(G) denotes the set of partially ordered pairs of vertices of G, called edges. A tree is called a rooted tree if there exists a vertex designated as the "root" and all edges have an orientation, either towards or away from the root. Now let Γ be a set of directed rooted trees. We assume that for every d ∈ D there exists G ∈ Γ so that |C| = |V|, where:
- |.| denotes set cardinality,
- d is a source file in the XML database D,
- C is the set of tag names in d.
There exists an isomorphism f : C → V, i.e. (ci, cj) ∈ Ed ⇔ (f(ci), f(cj)) ∈ E, where ci, cj ∈ d.
(a) An XML document: a catalog of two books, "XML in a Nutshell" by Elliotte Rusty Harold and W. Scott Means, priced 39.95, and "Who Moved My Cheese" by Spencer Johnson, M.D. and Kenneth H. Blanchard, priced 19.95. (b) The corresponding XML tree: a Document node with a DocumentType node and a catalog Element, whose book Elements (with id attributes) contain author, title and price Elements, Text nodes and a Comment node.
Figure 2.8: Tree representation of an XML source code.
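To make this concrete, the following sketch uses Python's standard xml.etree.ElementTree to turn a well-formed document into a nested dictionary of tags, attributes, text and children; the sample document is an abbreviated, hypothetical version of the catalog in Figure 2.8.

```python
import xml.etree.ElementTree as ET

def to_tree(element):
    """Recursively convert an XML element into a nested (tag, attrib, text, children) dict."""
    return {
        "tag": element.tag,
        "attrib": dict(element.attrib),
        "text": (element.text or "").strip(),
        "children": [to_tree(child) for child in element],
    }

sample = ("<catalog><book id='121'><title>XML in a Nutshell</title>"
          "<price>39.95</price></book></catalog>")
root = ET.fromstring(sample)
catalog_tree = to_tree(root)   # hierarchical view of tag names, attributes and contents
```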
2.6 Clone Detection Evaluation
As we have seen in previous research, there are plenty of clone detection techniques and corresponding tools; a comparison of these techniques and tools is therefore worthwhile in order to pick the right technique for a particular purpose of interest. There are several parameters with which the tools can be compared. These parameters are also known as clone detection challenges. In the following we list some of the parameters we use for comparing the different tools and techniques:
i. Portability: The tool should be portable in terms of multiple languages and dialects. With thousands of programming languages in use, and several dialects for many of them, a clone detection tool is expected to be portable and easily configurable for different types of languages and dialects, handling the syntactic variations of those languages.

ii. Precision: The tool should be sound enough that it detects few false positives, i.e., the tool should find duplicated code with high precision.

Precision, p = number of correctly found clones / number of all found clones

iii. Recall: The tool should be capable of finding most (or even all) of the clones of a system of interest. Often, duplicated fragments are not textually similar. Although editing activities on the copied fragments may disguise the similarity with the original, a cloning relationship may still exist between them. A good clone detection tool should be robust enough to identify such hidden cloning relationships so that it can detect most or even all of the clones of the subject system. (A small sketch of computing precision and recall is given after this list.)

Recall, r = number of correctly found clones / number of possibly existing clones

iv. Scalability: The tool should be capable of finding clones in large code bases, as duplication is most problematic in large and complex systems. The tool should handle large and complex systems with efficient use of memory. In this thesis, scalability is assessed by analyzing the computation time taken for different sizes of test data.

v. Robustness: A good tool should be robust with respect to the different editing activities that might be applied to a copied fragment, so that it can detect different types of clones with high precision and recall. In this thesis, robustness is assessed by listing the types of clones the respective clone detector finds and their frequencies.
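As noted above, both measures reduce to simple set arithmetic once the found clone pairs and a reference set of known clones are available; the sketch below uses our own hypothetical representation of clone pairs as tuples.

```python
def precision_recall(found_clones, reference_clones):
    """Compute precision and recall from sets of clone pairs."""
    found, reference = set(found_clones), set(reference_clones)
    correct = found & reference
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall

# A clone pair could be identified, for example, by the two fragments' locations:
# (("a.php", 10, 25), ("b.php", 40, 55))
```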
2.7 Difference from the Work by Jarzabek
In the next chapter, we will discuss our proposed methodology in detail. Basically, the proposed methodology adapts the approaches of two ontology mapping works. There are two main phases, i.e. a structural phase and a string similarity phase. Not much work has been done on detecting structural clones; one of the outstanding works is that of Jarzabek and Basit (2005).
In (Jarzabek and Basit, 2005), the authors claimed that their work is the first of its kind in analyzing patterns of cloned fragments to infer design-level similarities in a system. The original contribution of their work lies in formulating heuristics for inferring design-level similarities based on patterns of simple clones, and in applying a data mining approach to automate making the proper inferences. They also demonstrate that the method finds useful design-level similarities and scales up to large programs. The work concentrates on detecting similar code fragments, so-called simple clones, which leads to the larger contribution of detecting structural clones. The authors claimed that recurring patterns of simple clones, so-called structural clones, often indicate the presence of interesting design-level similarities.
More precisely, structural clones as defined by (Jarzabek and Basit, 2005) are patterns of inter-related classes emerging from the design and analysis spaces, patterns of components at the architecture level, and design solutions repeatedly applied by programmers to solve similar problems, whereas simple clones are contiguous segments of similar code such as class methods, or fragments of method implementations.
2.7.1 Clone Miner by Jarzabek
This subchapter briefly discusses the technique developed by Jarzabek and Basit (2005). The technique is capable of detecting both simple and structural clones, allowing a vast range of differences between them. Simple clones can differ in type parameters, keywords, variable or constant names, and operators.
The simplest form of a structural clone is a class. The authors are interested in classes differing in details of method implementation, method signatures, or the order in which methods are listed in the class body. If a class contains extra methods compared to other similar classes, the classes will still be considered structural clones. Beyond similar classes, they are also interested in patterns of classes and components that display similarities.
The reason why the authors are interested in such vast range of similarities is that their goal for clone detection is to unify clone classes with generic design solutions. They build generic design solutions with a meta-level method, called XVCL, which is capable of unifying both simple and structural clones, even with a wide range of differences among them. With XVCL meta-structures, any similarity patterns can be unified, whose unification is deemed beneficial from the engineering point of view (e.g., leads to simpler programs that are easier to maintain or reuse due to non-redundancy).
2.7.1.1 Detection of Simple Clones
In detecting simple clones, Jarzabek and Basit implemented token-based detection. They claimed that, with a negligible amount of pre-processing, clone detection based on raw text is language independent and free of pre-processing overhead, but it is very sensitive to the small differences that may be present between two very similar code fragments: it can only find exact clones and cannot be used for parameterized clone detection. After tokenizing the input source code files, a single token string is generated. An efficient suffix-array-based repeat finding algorithm, with some language-specific, heuristics-based optimizations, is implemented to detect the clones.
They propose a customizable tokenization strategy. In this scheme, a separate integer ID is assigned to each token class found in the source code. The classification of tokens is totally customizable. For example, if the user does not want to differentiate between the types (int, short, long, float, double), we can have the same ID to represent every member of the above set of types. An important parameter that the user needs to adjust for clone detection is the minimum length of the clones. If this threshold value is too large, few clones are reported. On the other hand, if the threshold value is too small, a large number of clones are reported, many of them being so small that no meaningful clone unification can be applied.
2.7.1.2 Finding Structural Clone
A clone class is a maximal set of code portions in which a clone relation holds between any pair of code portions. The detection of all clone classes corresponds directly to the computation of all non-extendible 'repeats' in a string, when the code is represented by a string of tokens. Sentinels for method and file boundaries make sure that no 'repeat' crosses these boundaries. The algorithm for finding these non-extendible 'repeats' returns its output in terms of the indexes in the token string for the beginning and end of each repeat. This information has to be translated into file name, line number and column number to be useful to the user and to be projected onto the source code browser. For this purpose, line number and column number information for each token is stored separately.
The basic output from the Clone Miner gives the number of total clone classes found and the details for each clone class. This includes its length in tokens, number of clone instances (members of this clone class), file ID for each clone instance, its beginning line number and column number, and its ending line number and column number. A sample extract of this format is shown in Figure 2.9. The first row says that the file with file ID 12 contains three clone instances belonging to clone class 9 and one instance from each of the clone classes 15, 28, 38, and 40. The interpretation is likewise for the other rows.
FILE ID    CLONE CLASSES PRESENT
…          …
12         9, 9, 9, 15, 28, 38, 40
13         12, 15, 40, 41, 43, 44, 44
…          …

Figure 2.9: Clones per file
Some of these clone patterns could be quite significant and cover a considerable part of some files. Such clusters of files that are covered by a significant clone pattern form the basic structural clones, and may also indicate higher design level similarities that can be extracted with proper domain analysis.
To measure file coverage by a clone pattern, Jarzabek and Basit calculate two metrics for each file containing the clone pattern: the File Percentage Coverage (FPC), which indicates the percentage of a file covered by a clone pattern, and the File Token Coverage (FTC), which gives the number of tokens in a file covered by the clone pattern. The first two rows in Figure 2.10 depict the clone pattern as explained above; the last two lines give the file ID with the FTC and FPC values for each file.
CLONE PATTERN    9, 9, 9, 15
SUPPORT          2

FILE ID    FTC     FPC
12         1175    29%
14         1175    50%

Figure 2.10: Frequent clone pattern with file coverage
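Both coverage values follow directly from token counts; a minimal sketch (our own function name, with an illustrative total-token figure) could look as follows.

```python
def file_coverage(covered_tokens, total_tokens):
    """FTC is the number of covered tokens; FPC is that count as a percentage of the file."""
    ftc = covered_tokens
    fpc = 100.0 * covered_tokens / total_tokens
    return ftc, fpc

# e.g. a file of roughly 4000 tokens with 1175 covered tokens gives FTC = 1175, FPC ≈ 29%
```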
2.7.2 Comparison of existing work and our proposed work.
As discussed in the previous subchapter, the Clone Miner by Jarzabek and Basit is capable of detecting both simple clones and structural clones.
In our work, we also wish to detect structural clones, but we construct our own definition of a structural clone: a structural clone is a pair of code fragments in web programming which have a similar node structure according to their tagging. Since we represent our web programs in an XML representation, the tagging we mention here is the XML tagging, so any code fragments which have similar tags will be accepted as structural clones. These structural clones are then taken as candidate simple clones. In order to get the simple clones, we read the nodes as strings and calculate the similarity of the strings using existing well-known string metrics. The following figure shows an example of a structural clone pair in real XML code.
(XML code A and XML code B share the same node structure, e.g. the tags <Address>, <Email> and <State>.)
Figure 2.11: Similar node structure between two XML code fragments
Referring to the figure below, we can see that the Clone Miner by Jarzabek and Basit looks for simple clones first to obtain a list of clone classes, and only then uses these classes to detect the structural clones. In simple words, a structural clone there refers to a clone of classes.
In our proposed technique, we reverse the order of detection, as we adapt ontology mapping work to clone detection. We detect structural clones first, and then find the simple clones from the structural clones, which we treat as candidate simple clone pairs.
((a) Jarzabek and Basit (2005): detect simple clones (token based) → list of clone classes → FIS (frequent itemset) → find structural clones. (b) Our proposed method: find frequent subtrees, i.e. subcode existing more than once, with an FSG (frequent subgraph miner) → list of "structural clone" subtrees (candidate clones) → detect simple clones.)

Figure 2.12: Difference between the work by Jarzabek and Basit (2005) and our proposed work
CHAPTER 3
RESEARCH METHODOLOGY
3.1 Introduction
The previous chapter has discussed several existing techniques of code clone detection. Most of the tools were developed for detection in traditional software. Some of the techniques then have been adopted in code clone detection within web application area.
As mentioned in the previous chapter, the aim of this project is to find out whether a technique from ontology mapping can be used to detect code clones across files. The main idea of ontology mapping is, of course, to map an ontology from one application onto an ontology from another application. For the purpose of this project, however, we need to redefine the notion of ontology, as we assume that ontologies are not explicitly installed in the systems and the ontologies themselves are not going to be used in detecting clones.
The remaining subtopics will discuss the organization of the research and the proposed technique of code clone detection.
3.2 Proposed technique of code clone detection
We start by defining the relation assumed by our model between ontologies and source code on the one hand, and source code and instances on the other hand. A set of documents can serve as a base from which to extract ontological information (Stumme and Maedche, 2001). In this model we represent the source codes using XML parse trees. We thus assume that the ontological information in this case is the set of all tag names in the XML trees, which we call concepts in this thesis, as stated in the following formal definition. The instances are all the similar concepts that are actually populated in the source code; an instance consists of the concept itself plus the attributes and value of that concept.
In ontology mapping, given two schemas A and B, one wants to find a mapping µ from the concepts of A into the concepts of B in such a way that, if a = µ(b), then b and a have the same meaning.
This clone detection is basically built on the same idea as ontology mapping.

Definition 1: If a = µ(b), then b and a have the same meaning, and hence a code clone is derived.
Our strategy is to do one-to-one mapping, since using specific shared ontologies might require a specific domain of knowledge for the different applications that need to be compared. The idea is to derive mappings from a candidate concept A to the concepts A' with the same names in the selected ontologies.
Based on the definition of ontology in Chapter 2, we adopt the following definition of a derivative ontology in our research. An ontology O' is composed of a 5-tuple:

Definition 2: Ontology, O' = {C', R', CH', rel', OA'} where:

C' is the set of concepts, one for each of the nodes in the tree.

CH' ⊆ C' × C' is a concept hierarchy or taxonomy, where CH'(C'1, C'2) indicates that C'1 is a subconcept of C'2.

rel' : R' → C' × C' is a function that relates the concepts non-taxonomically, where R' is the set of relations and R' = Ø.

OA' is the set of ontology axioms; here OA' consists of the properties of the concepts, in practice the contents of the tags, the attributes and the values.
Figure 3.2 shows the overall phases of clone detection. The key idea of the proposed technique is to combine detection based on structural information with instance-based detection, since both techniques have their own strengths and weaknesses, as in (Todorov, 2008) and as discussed in the previous chapter. The output of the process is a set of similar code fragments, i.e. clones of different types between two different systems.
In our framework we assume that the population phase has already taken place and that there exists a set of source codes such that:

i. It covers all concepts of the source code trees. "Covers" is understood as: instances of every concept can be found in at least one of the trees in the collection of source code trees, and every tree contains instances of at least one concept.

ii. A tree node is considered to be assigned to a concept node if and only if it provides instances of that concept with a cardinality higher than a fixed threshold (Figure 1).
In the sequel we will deal with hierarchical trees. We are concerned with studying their similarity on a purely structural level, so let us assume that only the concept nodes are labelled, but not the relations. Under these assumptions we provide the following definition of a hierarchical source code tree.
Definition 3: A hierarchical tree is a pair O' := (C, is_a), where C is a finite set whose elements are called concepts and is_a is a partial order on C.
We proceed to formalize the association of trees of different source codes. Let α be a set of hierarchical source code trees of system 1 and β a set of hierarchical source code trees of system 2 satisfying assumptions 1 and 2. Let γ : α → β be an injection from the set of source code trees of system 1 to the set of source code trees of system 2. For every subtree O'α ∈ α and subtree O'β ∈ β that can be mapped so that γ(O'α) = O'β, there exists an injection g : C_O'α → C_O'β which maps concepts in a source code tree of system 1 to concepts in a source code tree of system 2. The following figure shows an illustration of the mapping between trees.
Figure 3.1: Mapping between concepts of O'α and O'β
3.2.1 Structural Tree Similarity
As mentioned in the previous chapter, we perform two layers of tree comparison. The first layer concerns structural tree similarity. It effectively provides a kind of filtering for the model, since it finds the parts of the trees that are similar to each other before we do the actual similarity comparison.
As discussed in the literature review, an XML tree is in fact a directed rooted tree, which can be represented formally using the definition of a graph G = (V, E). So a source code tree can formally be represented by the following definition:

Definition 4: Let O'X be a source code tree. The graph corresponding to O'X is a directed rooted tree G(V(G), E(G)) such that (1) V(G) = C, and (2) E(G) ⊆ C × C such that there exists f where f(ci, cj) ∈ E_O'X ⇔ (ci, cj) ∈ is_a.
In (Todorov, 2008), the author used Bunke's graph distance metric to calculate the distance between source code structures based on the maximal common subgraph. We are not going to find the maximal common subgraph, since this is in general an NP-complete problem and it has been used several times in previous works on clone detection. So, instead of using the maximal common subgraph, we use the available frequent subgraph miners. Before that, we give a couple of definitions which are needed before introducing the distance ratio. The distance ratio is used to find out the number of programs that have a high structural similarity to each other. All definitions are given for general graphs and are also applicable to trees.
Definition 5: Graph isomorphism. A bijective function µ : V1 → V2 is a graph isomorphism from a graph G1(V1, E1) to a graph G2(V2, E2) if, for any v11, v12 ∈ V1, (v11, v12) ∈ E1 ⇔ (µ(v11), µ(v12)) ∈ E2.

Definition 6: Subgraph isomorphism. An injective function µ : V1 → V2 is a subgraph isomorphism from G1 to G2 if there exists a subgraph S ⊆ G2 such that µ is a graph isomorphism from G1 to S.
Definition 7: Graph distance ratio. Since we are using frequent subgraph mining, we simplify the distance between graphs G1 and G2 by using the following ratio. Let FG1 and FG2 be the sets of frequent subgraphs owned by G1 and G2, FG1 = {t1, t2, t3, …, tm} and FG2 = {t1, t2, t3, …, tn}. The distance between G1 and G2 can then be calculated as follows:

d(G1, G2) = 1 − #(FG1 ∩ FG2) / max(#FG1, #FG2)
The differences between this structural similarity and other existing work on code clone detection will be discussed in the next chapter.
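A minimal sketch of Definition 7, assuming the frequent subtrees of each program tree have already been mined and given hashable canonical identifiers (the identifiers t1, t2, … below are illustrative):

```python
def graph_distance_ratio(frequent_subtrees_g1, frequent_subtrees_g2):
    """Definition 7: 1 minus the shared frequent subtrees over the larger set."""
    f1, f2 = set(frequent_subtrees_g1), set(frequent_subtrees_g2)
    if not f1 and not f2:
        return 0.0
    return 1.0 - len(f1 & f2) / max(len(f1), len(f2))

# e.g. graph_distance_ratio({"t1", "t2"}, {"t1", "t3", "t4"}) == 1 - 1/3
```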
3.2.2 String based Tree Matching
In the sequel we will define a measure of similarity between two concepts of two different source code trees, O’x, based on the instances that they contain. One convenient way of making use of extensional information is to model concepts as bags of instances. More formally speaking, this is a set theoretic approach based on testing the intersection between classes.
Let O'α = (Cα, is_a) and O'β = (Cβ, is_a) be two source code trees, let IOα and IOβ be the sets of instances belonging to them correspondingly, and let A ∈ Cα and B ∈ Cβ be two concepts taken from subtrees of the two different trees, viewed as sets of instances. We consider A and B very similar when A ∩ B ≈ A ≈ B, and not similar at all when A ∩ B = Ø.
According to (Todorov, 2008), we do not have to commit to one particular definition of similarity, so instead of using only the Jaccard coefficient as in (Todorov, 2008), in this thesis we test all the similarity metrics mentioned in Chapter 2. The easiest way is to represent all instances of a subtree as a string and compare it with another similar subtree from the other source code tree. We assume that each element of the string is an element of the set of instances in A or of the set of instances in B. Using this model, three main things can be detected:

i. The degree of similarity between different subtrees, which can classify two fragments as identical or non-identical fragments of code. A pair is identical if its degree of similarity is equal to one, i.e. simm = 1.

ii. Non-identical fragments can be classified into nearly-identical and similar fragments. Both of these types should have a similarity larger than a certain threshold σ, i.e. simm >= σ. For nearly identical clones, simm > 0.7. (A small sketch of this classification is given after this list.)

iii. The degree of similarity between two different source code trees in terms of structural similarity. If it is more than a fixed threshold θ, we then calculate all the instance similarities and take the average similarity in order to know the degree of similarity between the two trees. So at the end of the detection we also see the number of source files that are very similar to each other.
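As mentioned in item ii, a candidate pair can be classified directly from its similarity score. The sketch below assumes a default value for the configurable threshold σ; only the 0.7 bound for nearly identical clones is taken from the text above.

```python
def classify_pair(simm, sigma=0.5):
    """Classify a candidate clone pair from its string similarity simm in [0, 1]."""
    if simm == 1.0:
        return "identical"
    if simm > 0.7:            # nearly identical, as defined above
        return "nearly identical"
    if simm >= sigma:         # similar; sigma is a configurable threshold
        return "similar"
    return "not a clone"
```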
(1st step, PREPROCESSING: transform web documents A and B into XML parse trees. 2nd step, structural similarity using a frequent subgraph miner: yields frequent subgraphs, indicating frequent substructures found in all program trees. 3rd step, string-based matching using string metrics: yields clone pairs, i.e. pairs with similarity > θ. 4th step, POSTPROCESSING: extract clones. 5th step: ANALYSIS.)

Figure 3.2: Diagrammatic view of the clone detection technique
3.3 Preprocessing
The initial idea is to perform detection with a combination of tree detection and string detection. For this reason, clone detection starts with preprocessing, where all documents are standardized into XML documents in order to obtain the tags and contents of each node. We are going to test the model on HTML, ASP, PHP and JSP systems. Web page documents from system A and system B need to be compared to detect cloning. The XML is then parsed into a tree.
To minimize the code of each file, all XML code is cleaned to eliminate useless lines of code, so that we can maximize the code comparison without comparing formatting information that is only used for presenting information to the end user. For each XML source code, the tag names are extracted and inserted into a file called the 'vocabulary', which is used for XML node matching. Duplicate entries in the vocabulary are then deleted from the list to minimize the number of entries.
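A rough sketch of this preprocessing step; the particular list of 'useless' formatting tags and the way the vocabulary is kept in memory are simplifying assumptions of ours.

```python
import xml.etree.ElementTree as ET

FORMATTING_TAGS = {"b", "i", "u", "font", "br"}   # assumed purely cosmetic tags

def clean(element):
    """Remove purely cosmetic child elements, keeping structural and content tags."""
    for child in list(element):
        if child.tag.lower() in FORMATTING_TAGS:
            element.remove(child)
        else:
            clean(child)

def build_vocabulary(xml_documents):
    """Collect the distinct tag names of all cleaned XML documents."""
    vocabulary = set()
    for text in xml_documents:
        root = ET.fromstring(text)
        clean(root)
        vocabulary.update(el.tag for el in root.iter())
    return vocabulary
```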
(Original web programs, file 1 … file n, are converted into XML documents; the XML code is cleaned by eliminating all useless tagging, i.e. formatting tags, giving a collection of clean XML documents; the tag names are collected into the VOCABULARY.)

Figure 3.3: Preprocessing phase
3.4 Frequent subgraph mining
The detection process then starts with the structural comparison of the trees. The comparison of nodes is done between O'α and O'β, which represent two different systems. After generating the frequent subgraphs, we store the subtrees shared by different programs or source codes in a cross table.
Table 3.1: Example of the cross-table used to compare programs across two systems

Program, p   p1          p2       …   …           …   …   pn
p1           t1, t2      -        -   t1, t3, t4  -   -   -
p2           -           t1, t2   -   -           -   -   -
…            -           -        -   -           -   -   -
…            -           -   t1, t3, t4   -   t1, t3, t4   -   -
pm           -           -        -   -           -   -   t1, t3, t5
For each subtree in the table generated by the frequent subgraph miner, we set the minimum size (i.e. the number of edges of the subtree) to 5 and the maximum to 6. This decision was made after an initial experiment which showed that, compared with other parameter values, the number of frequent subtrees generated is neither too large nor too small. String-based matching is then performed as described above. An example of a frequent subgraph shared by two trees is shown in Figure 3.4 below.
There are four well-known miners that we test in our experiment. They are described briefly in the following table.
Table 3.2: Brief description of each frequent subgraph miner

Miner: GSpan
Description: A software package for mining frequent graphs in a graph database. Given a collection of graphs and a minimum support threshold, gSpan is able to find all of the subgraphs whose frequency is above the threshold (Yan and Han, 2002).

Miner: FFSM
Description: Fast Frequent Subgraph Mining. Solves the frequent subgraph mining problem, which is, given a collection of graphs D and a threshold f between 0 (exclusive) and 1 (inclusive), to enumerate the subgraphs that occur in at least a fraction f of the graphs in D (Huan et al., 2003).

Miner: MoFa
Description: A program that automatically finds molecular substructures and discriminative fragments in a set of molecule descriptions, given some user-defined parameters. Designed in cooperation with Tripos, Inc., Data Analysis Research Lab, South San Francisco, CA, USA and the Working Group Neural Networks and Fuzzy Systems of the University of Magdeburg (Borgelt and Berthold, 2002).

Miner: Gaston
Description: Finds all frequent subgraphs using a level-wise approach in which first simple paths are considered, then more complex trees and finally the most complex cyclic graphs. In practice most frequent graphs are not very complex structures; Gaston uses this quick-start observation to organize the search space efficiently. To determine the frequency of graphs, Gaston employs an occurrence-list based approach in which all occurrences of a small set of graphs are stored in main memory (Nijssen and Kok, 2004).
[Figure 3.4 shows two program trees (Program 1 and Program 2) with their shared frequent subgraph highlighted.]

Figure 3.4: Illustration of frequent subgraph of two trees.
3.5 String based matching
Figure 3.5 presents an example of instance-based matching. In the example, the depth of fragment 1 and fragment 2 is equal to three, so the comparison using a string metric is carried out; in the actual experiment we use fragment sizes of five and six edges instead of a depth of three. If the similarity is above the threshold, the pair of strings is taken as a clone pair. Instead of using the vocabulary as in our initial experiment, we compare the similar subgraph structures found and recorded in the cross-table above.

As mentioned before, all elements of the subtree are treated as instances, i.e. including the node names, attributes, values, etc. For simplicity we take all the elements as a string in order to calculate the similarity between the set of instances of fragment 1 (set A) and the set of instances of fragment 2 (set B).
[Figure 3.5 shows two code fragments, both of depth 3, whose subtrees are converted to strings with toString(): fragment 1 yields the string "My name is Marry" and fragment 2 yields "My name is Bob"; the string similarity of the two strings is then computed.]

Figure 3.5: A pair of source code fragments classified as nearly identical.
The last stage of the code clone detection is post-processing. At this stage, all clones are extracted from the original code for further analysis.
3.6 Clone detection algorithm
The previous subtopics explained the proposed clone detection process. The process can be summarized in the general algorithm below:
Variable: threshold, p, minNode
begin
  Step 1: Define parameters threshold, p, minNode.
  Step 2: Convert all files of system A and system B into XML and clean the files.
  Step 3: Generate frequent subgraphs and record them in a cross-table, D.
  Step 4: For all subgraphs in D, begin-while
    Step 4.1: For all XML code files in system A, begin-while
      Step 4.1.1: For all XML code files in system B, begin-while
        Step 4.1.1.1: - read the XML nodes as a string
                      - compute the similarity using a string metric
        Step 4.1.1.2: If simm > threshold, select the string as a clone.
      end-while
    end-while
  end-while
  Step 6: Post-processing. Extract all clones.
  Step 7: Determine the programs which are highly similar.
end
Figure 3.6: Clone Detection Algorithm
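The nested loop of the algorithm can be illustrated with the following minimal Java sketch. Here stringSimilarity() is a placeholder for whichever string metric of Chapter 2 is configured (a token-based Jaccard similarity is used purely as an example), and the handling of the cross-table and of files is simplified; it is not the thesis implementation.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of the nested detection loop in Figure 3.6.
public class CloneDetector {

    public List<String[]> detect(List<String> subtreeStringsA,
                                 List<String> subtreeStringsB,
                                 double threshold) {
        List<String[]> clonePairs = new ArrayList<>();
        for (String a : subtreeStringsA) {       // subtree strings of system A
            for (String b : subtreeStringsB) {   // subtree strings of system B
                double simm = stringSimilarity(a, b);
                if (simm > threshold) {
                    clonePairs.add(new String[] { a, b });
                }
            }
        }
        return clonePairs;
    }

    // Placeholder metric: Jaccard similarity over whitespace-separated tokens.
    private double stringSimilarity(String s1, String s2) {
        Set<String> a = new HashSet<>(Arrays.asList(s1.split("\\s+")));
        Set<String> b = new HashSet<>(Arrays.asList(s2.split("\\s+")));
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 1.0 : (double) intersection.size() / union.size();
    }
}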
3.7 Clone detection evaluation
In the previous chapter we presented several choices for evaluating the performance of a clone detection tool. For this project, the evaluation is based on the metrics below, which were selected because they are the most commonly used for evaluating clone detection algorithms.
i. Clone type: lists the types of clones the respective clone detector finds.

ii. Recall: describes the number of correct clones found in comparison to the total number of existing clones:

      Recall, r = (number of correct clones found) / (number of possible existing clones)

iii. Precision: measures the number of correct clones found versus the total number of retrieved clones (correct and wrong).

iv. Time consumption: measures the time taken to run the detection.

v. Types of programming language that can be processed: the portability.

A small helper illustrating how recall and precision are computed is sketched after this list.
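For illustration, the two ratios can be computed as sketched below (a trivial helper, not part of the thesis implementation).

// Illustrative helper: recall and precision as defined above.
public final class Evaluation {
    public static double recall(int correctFound, int existingClones) {
        return existingClones == 0 ? 0.0 : (double) correctFound / existingClones;
    }
    public static double precision(int correctFound, int totalRetrieved) {
        return totalRetrieved == 0 ? 0.0 : (double) correctFound / totalRetrieved;
    }
}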
CHAPTER 4
EXPERIMENTAL RESULTS AND DISCUSSION
4.1 Introduction
This chapter primarily presents the results obtained by searching for clone pairs using a methodology inspired by ontology mapping work. As mentioned in the previous chapter, in the ontology mapping work the authors used a maximal common subgraph in the first layer and then calculated the instance similarity using the Jaccard coefficient. In our methodology, instead of using the maximal common subgraph, we use a frequent subgraph miner, as the maximal common subgraph problem is frequently reported to be NP-complete.
Generally, the methodology consists of four main stages: the preprocessing stage, frequent subtree generation using a frequent subgraph miner, subtree similarity computation, and extraction of clone pairs and analysis. Frequent subtree mining is used here to obtain candidate clone pairs that share a similar subtree structure, taking into account only the node tags and omitting any other elements of the tree such as attributes, labels or values.
Before discussing the code clone detection results in depth, we describe the experiments carried out in this project. The next two subsections discuss the preprocessing stage, in which the original source code is transformed into an inexpensive, standardized form and representation of the web source programs so that the data can be fed into the frequent subgraph miner.
4.2 Data representation
A few groups of system files were used for testing. We divided the data into three different sizes, i.e. small, medium and large, where all programs were taken from open source web applications. The original web programs are in HTML, ASP and PHP format in order to test the portability of our system; the system is portable if it manages to process different types of web programming languages.
As mentioned before, all programs need to be transformed into a standard form. In our system, we transform the programs into XML format, and the transformation is inexpensive. This is necessary because we need to make sure that all programs are in a valid form of tagging, so that we can extract every tag name of each XML tree.
4.2.1 Original source programs into XML format
The first step of data normalization is to convert all programs into HTML format. This ensures that the XML documents generated from them will have a valid form of tagging. For now, this is done using a freeware tool called AscToTab, which can transform any form of text into HTML or RTF format; this stage has to be performed manually. After these first transformations are done, our system converts the HTML programs into XML documents for further processing.
For any program which is not an HTML program, e.g. PHP and ASP, the program is treated as a text file. The transformation tool applies formatting tags onto the text code. Figure 4.1 shows an example of the transformation of an original PHP source file (dbase.php) into HTML code; part (b) of the figure is the HTML code generated by AscToTab, including its comment "Converted from C:\Documents and Settings\DiLLa DiLLoT\Desktop\dbase.php".

Figure 4.1: Transformation of original PHP code into HTML code
After the transformation into HTML, the code is then converted into XML to ensure its validity. This can be done automatically using our system, which provides a function to convert HTML into XML.
[Figure 4.2 shows the XML form of the HTML code from Figure 4.1, including the DOCTYPE declaration (HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REChtml40/loose.dtd"), the comments inserted by AscToTab ("Converted by AscToTab 4.0 - fully registered version", "Converted from C:\Documents and Settings\DiLLa DiLLoT\Desktop\dbase.php") and the user-specified TABLE structure.]

Figure 4.2: XML form of the previous HTML code
4.2.2 Subtree mining data representation
As we can see, the XML representation is not suitable to be fed directly into the frequent subgraph miner. As a solution, we represent the tree structure in a simpler form of data; this is important to reduce the complexity of the mining process.
The simplest way is to represent each tree as a list of nodes and edges. Before generating the data, we extract all node names (tags) in the XML code and treat them as a bag of concepts or vocabulary, as was done in Project I. The subgraph mining data is represented as a list of nodes and edges as in Figure 4.3 below, where t represents a tree, v a vertex and e an edge. The label of a node represents the node name (tag) of the XML element; however, instead of putting the node name itself in the list, we put its index in the vocabulary, as explained before.
t #
v 0 ...
v 1 ...
...
e ...
e ...
...

Figure 4.3: A tree as a list of nodes and edges
Figure 4.4 shows an example of the generated vocabulary and of a tree represented in the format of Figure 4.3. In the example below, [v 0 1] means that the node name of vertex 0 is title, and [e 0 1 1] means there is an edge between vertex 0 and vertex 1. The last digit 1 is a default label for all edges: since we are working with trees instead of general graphs, edge labels can be omitted. The data in Figure 4.4(b) is fed into the frequent subgraph miner. Four miners were used in our experiment; each graph miner was explained briefly in Chapter 3.
(a) Vocabulary / bag of concepts:
[0] text  [1] title  [2] a href  [3] p  [4] h1  [5] meta http-equiv  [6] head

(b) A tree represented as a list of vertices and edges:
t # s1.XML_10.xml
v 0 1
v 1 6
v 2 0
v 3 3
...
e 0 1 1
e 0 2 1
e 2 3 1
...

Figure 4.4: Example of a tree as a vertices and edges list
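A minimal sketch of this conversion is given below, assuming the XML files are loaded with the standard Java DOM API; the class and method names are illustrative and are not the thesis implementation.

import java.util.List;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Illustrative sketch: writes an XML tree in the "v"/"e" list format of
// Figure 4.3, using the index of each tag name in the vocabulary as label.
public class TreeListWriter {

    public static String toVertexEdgeList(Document doc, List<String> vocabulary,
                                           String treeName) {
        StringBuilder vertices = new StringBuilder();
        StringBuilder edges = new StringBuilder();
        traverse(doc.getDocumentElement(), -1, new int[] { 0 },
                 vocabulary, vertices, edges);
        return "t # " + treeName + "\n" + vertices + edges;
    }

    private static void traverse(Element element, int parentId, int[] nextId,
                                 List<String> vocabulary,
                                 StringBuilder vertices, StringBuilder edges) {
        int id = nextId[0]++;
        int label = vocabulary.indexOf(element.getTagName());
        vertices.append("v ").append(id).append(' ').append(label).append('\n');
        if (parentId >= 0) {
            // the edge label is always 1, as explained above
            edges.append("e ").append(parentId).append(' ')
                 .append(id).append(" 1\n");
        }
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                traverse((Element) child, id, nextId, vocabulary, vertices, edges);
            }
        }
    }
}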
4.3 Frequent Subtree Mining
As mentioned before, we use four well-known frequent subgraph miners to obtain the similar tree substructures that exist between files. We feed the data in the format of the previous example into a miner, and the miner generates all frequent subtrees (substructures) that exist among the files.
A few configuration parameters need to be set before mining:
i. minimumFrequencies: sets the minimum frequency (support) a subgraph must have in order to be reported. In the experiment we set this value as low as 10% so that the miner is able to find all similar substructures even when they do not appear very frequently in the code.

ii. minimumFragmentSize: sets the minimum size (in edges) a frequent subtree must have in order to be reported.

iii. maximumFragmentSize: sets the maximum size (in edges) a frequent subtree can have in order to be reported.

In the experiment we set the values of (ii) and (iii) to 5 edges. We selected this size after some preliminary experiments, in which this value was capable of generating an average number of subtrees. So instead of using the minimal depth of a subtree as in Project 1, we used the minimum and maximum fragment size. A small configuration sketch is shown below.
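As a simple illustration, the three parameters used in the experiments can be held in a small settings object like the sketch below; the field names follow the parameter names above and are not the API of any particular miner.

// Illustrative value object only: the field names follow the text above,
// not any particular miner's real configuration interface.
public class MinerSettings {
    double minimumFrequency = 0.10;   // 10% minimum support
    int minimumFragmentSize = 5;      // minimum subtree size in edges
    int maximumFragmentSize = 5;      // maximum subtree size in edges
}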
After executing the graph miner, a list of frequent subtrees is generated, together with the original trees that contain each subtree. As a summary, we can then generate a cross-table which contains the ids of all subtrees that are shared among different files in the two systems. Figure 4.5 shows an example of the miner output and of the resulting cross-table.
(a) Example of frequent subtrees generated (the GSpan subgraph miner found 76 frequent fragments):

t # 24
v 0 11
v 1 10
v 2 3
v 3 0
v 4 12
e 0 1 1
e 1 2 1
e 1 3 1
e 3 4 1
=> [2.0] [s2.XML_21.xml, s1.XML_10.xml]

t # 26
v 0 11
v 1 10
v 2 3
v 3 0
v 4 9
e 0 1 1
e 1 2 1
e 1 3 1
e 3 4 1
=> [2.0] [s2.XML_21.xml, s1.XML_10.xml]

t # 28
v 0 11
v 1 10
v 2 3
v 3 0
v 4 2
e 0 1 1
e 1 2 1
e 1 3 1
e 3 4 1
=> [2.0] [s2.XML_20.xml, s1.XML_11.xml]

(b) Example of the cross-table containing the subtree ids shared between different files:

>>SIMILAR SUBTREE CROSS-TABLE ; x = for system 1, y = for system 2
BetweenFile[0][0]: 77,72,70,68,66,64,38,36,34,32,30,
BetweenFile[0][1]: 176,171,169,155,153,151,149,147,145,143,141,139,137,132,130,128,116,114,112,110,108,106,104,102,100,93,91,89,87,85,77,72,70,68,66,64,56,54,52,50,48,46,44,42,40,38,36,34,32,30,26,24,
BetweenFile[1][0]: 192,184,164,162,160,120,118,95,77,72,70,68,66,64,62,60,58,38,36,34,32,30,28,
BetweenFile[1][1]: 77,72,70,68,66,64,38,36,34,32,30,

Figure 4.5: Frequent subtrees generated by the graph miner.
4.4 String metric computation
The challenging part of the system is to extract the original subtree from the original XML documents according to the frequent subtrees generated above. Once the original subtree is successfully extracted, it is taken as a string so that we can compute the similarity of the subtree with another subtree from another file using a string metric. This is essentially the same as what we did in Project 1.
In the initial project we found that this technique works for finding clone pairs. However, in the initial project, instead of using a frequent subgraph miner, we used the vocabulary elements to find subtrees rooted at a node whose name matches a vocabulary element; as discussed before, this was quite expensive. For the similarity computation we use all the string metrics stated in Chapter 2. The following example shows how a subtree is represented as a string before its similarity can be computed. If the bold italic lines in Figure 4.6 match the frequent subtree, the string equivalent would be as follows:
String s = HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REChtml40/loose.dtd" Converted from "C:\Documents and Settings\DiLLa DiLLoT\Desktop\dbase.php"
[Figure 4.6 shows the code fragment (the XML version of the converted dbase.php page) that contains the original frequent subtree from which the string above is extracted.]

Figure 4.6: Code fragment containing the original frequent subtree.
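As an example of one of the string metrics listed in the results, the following is a minimal sketch of a Levenshtein distance normalized into a similarity between 0 and 1, which could be applied to two such subtree strings; it is illustrative and not the thesis implementation.

// Illustrative sketch: Levenshtein distance normalized to a similarity in [0, 1].
public final class LevenshteinSimilarity {

    public static double similarity(String s1, String s2) {
        int maxLen = Math.max(s1.length(), s2.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) distance(s1, s2) / maxLen;
    }

    private static int distance(String s1, String s2) {
        int[][] d = new int[s1.length() + 1][s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) d[i][0] = i;
        for (int j = 0; j <= s2.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[s1.length()][s2.length()];
    }
}

Two subtree strings whose similarity exceeds the threshold θ = 0.7 would then be reported as a clone pair.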
4.5 Experimental setup
The implementation was done using Java as the base language. To support the program, we used a Java library named Chilkat Java, which is freely available on the Internet and offers features such as an XML parser and tree-walking ability. The conversion of web pages into standard XML and the generation of the vocabulary for mapping purposes are done automatically by the program. All executions were carried out on an Intel® Core Duo 1.86 GHz machine with 1.5 GB of RAM.
The following settings apply to the Ontology-Winkler Similarity part. We set the most lenient values for all of these parameters, as follows:
(1) The similarity threshold θ is set to 0.7.

(2) We define the similarity Sim as:

      Sim(s1, s2) = Comm(s1, s2) - Diff(s1, s2) + winkler(s1, s2)

where Comm(s1, s2) is the commonality between the two strings and Diff(s1, s2) is the difference between them. The equation was discussed in detail in Chapter 2.

(3) We omit the winkler(s1, s2) calculation from the equation to simplify the programming; its value is fixed at 0.1.

(4) The value of the parameter p is set to 0.6, as the original authors of this technique report that their experiments work best with this value.

An illustrative sketch of the combined similarity follows.
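A rough sketch of how this combined score could be assembled is shown below. For illustration only, Comm is approximated here by the length of the longest common substring relative to the average string length and Diff by the remaining unmatched proportion; the actual definitions are those given in Chapter 2, and winkler is fixed at 0.1 as stated above.

// Rough sketch only: comm and diff are simplified stand-ins for the
// definitions in Chapter 2; winkler is fixed at 0.1 as in the setup.
public final class CombinedSimilarity {

    private static final double WINKLER = 0.1;

    public static double sim(String s1, String s2) {
        double comm = comm(s1, s2);
        double diff = 1.0 - comm;        // assumed: the unmatched proportion
        return comm - diff + WINKLER;
    }

    // Assumed commonality: longest common substring length relative to
    // the average length of the two strings.
    private static double comm(String s1, String s2) {
        int longest = 0;
        int[][] table = new int[s1.length() + 1][s2.length() + 1];
        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                if (s1.charAt(i - 1) == s2.charAt(j - 1)) {
                    table[i][j] = table[i - 1][j - 1] + 1;
                    longest = Math.max(longest, table[i][j]);
                }
            }
        }
        double avgLen = (s1.length() + s2.length()) / 2.0;
        return avgLen == 0 ? 1.0 : longest / avgLen;
    }
}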
4.6 Experimental results
This subsection presents the results of our code clone detection using all the metrics discussed in the previous chapter. In this experiment, we consider any valid candidate clone as a clone, even if the code fragment is actually an accidental clone.
Several experiments were conducted to investigate the performance of our clone detection program. The experiments were executed using the same parameter settings and data setups as in the previous subchapter. We conducted the experiments using two open source management systems, as follows:
i. Module 1.1 – 54.9 MB, >4000 files
ii. Tutor 1.55 – 49.7 MB, >3000 files
For the testing, we randomly selected files from these two systems for comparison, and the testing was done with different numbers of files. With 1.5 GB of RAM, the number of files that can be processed is quite limited, allowing only small detections of fewer than 100 files. The testing was carried out in several groups; the following table shows the data used for each group.
Basically, we split the experiments into two types. The first examines the scalability of the program, and the second looks at the overall performance of the system.
In the first experiment we used fixed values for the threshold and the minimum support, namely 0.7 for the threshold and 10% for the minimum support.
Table 4.1: Data for program testing

Testing      Number of files (NOF)   Lines of code (LOC)
Testing #1   20                      325 lines
Testing #2   40                      675 lines
Testing #3   60                      891 lines
Testing #4   80                      1239 lines
Testing #5   100                     1394 lines
As noted, our experiments vary the frequent subgraph miner and the string metric (similarity coefficient) used. The following figure shows an example of the real output of a detection run of our program, which was written in Java.
>>COMPARE FILES:

*Compare file between:
D:/Documents/MASTER/4th SEM/Project II/Code Clone Detection/proc1/XML_10.xml
D:/Documents/MASTER/4th SEM/Project II/Code Clone Detection/proc2/XML_20.xml
#1: ThisisatestThisisatestThisisatestThisisatestThisisatest ThisisatestThisisatestThisisatestThisisatestThisisatest = 1.0

*Compare file between:
D:/Documents/MASTER/4th SEM/Project II/Code Clone Detection/proc1/XML_10.xml
D:/Documents/MASTER/4th SEM/Project II/Code Clone Detection/proc2/XML_20.xml
#1: ThisisatestThisisatestThisisatestThisisatestThisisatest ThisisatestThisisatestThisisatestThisisatestThisisatest

Figure 4.7: Real output from the clone detection system
We show the results of the experimental testing using our default subgraph miner, GSpan. As mentioned, the testing was done with the different file-set sizes given above; the remaining tables are shown in the appendix. We also show some graphical output using Jaro Winkler and Levenshtein Distance as the string metrics.
Table 4.2: Experimental result using GSpan miner

String metric                 Recall (%): 20  40  60  80  100   |   Precision (%): 20  40  60  80  100
Jaro Winkler                  80   85   76   77   76            |   100  100  100   99   98
Levenshtein Distance          85   83   80   78   76            |   100  100  100   99   98
Needleman-Wunch Distance      89   83   83   75   76            |   100  100  100   99   98
Dice Coefficient              100  95   92   88   79            |   100  100  100   99   98
Smith-Waterman Distance       70   68   70   64   -             |   100  100  100  100   98
Jaccard Similarity            90   77   78   78   70            |   100  100  100   99   97
Monge Elkan Distance          75   70   64   65   60            |   100  100  100   99   98
Matching Coefficient          80   80   64   64   61            |   100  100  100   99   98
Euclidean Distance            85   80   73   70   68            |   100  100  100   99  100
L1 Distance                   60   63   58   55   58            |   100  100  100   99   98
Gotoh Distance                65   60   58   55   57            |   100  100  100   99   98
Jaro Distance                 67   66   40   40   42            |   100  100  100   99   98
Soundex Distance              87   80   83   85   87            |   100  100  100   98   98
Overlap Coefficient           86   83   80   72   79            |   100  100  100   99   98
Cosine Similarity             75   70   73   70   73            |   100  100  100   99   98
q-Gram                        78   70   50   58   60            |   100  100  100   99   98
Ontology-Winkler Similarity   75   70   73   76   78            |   100  100  100   99   98
Table 4.2 (continued): Experimental result using GSpan miner

String metric                 Computation time (seconds): 20  40  60  80  100   |   Clone frequencies (% identical / % similar): 20  40  60  80  100
Jaro Winkler                  43.2   243.6  1504  3331  7200   |   83/17  70/30  65/35  65/35  68/32
Levenshtein Distance          50.8   257.3  1632  3462  7211   |   85/15  73/27  70/30  65/35  61/39
Needleman-Wunch Distance      54.6   267.2  1203  3023  7034   |   73/27  65/35  75/25  70/30  64/36
Dice Coefficient              78     300.1  1267  3023  7212   |   70/30  50/50  65/35  67/33  66/36
Smith-Waterman Distance       45     289.3  1452  3111  7345   |   93/7   85/15  89/11  73/27  76/24
Jaccard Similarity            34.4   240    1700  4674  8003   |   78/22  75/25  80/20  77/23  66/36
Monge Elkan Distance          67.1   387.6  1332  3100  7248   |   83/17  80/20  87/13  65/35  66/36
Matching Coefficient          45.2   221.2  1210  3111  7344   |   83/17  70/30  88/12  65/35  65/35
Euclidean Distance            44.8   207.3  1117  2900  6888   |   68/32  72/28  75/25  65/35  66/34
L1 Distance                   63.1   276.2  1630  3046  7211   |   95/5   99/1   97/3   68/32  64/36
Gotoh Distance                57.7   376.7  1817  3025  7254   |   92/8   90/10  95/5   83/17  75/25
Jaro Distance                 48.2   380.5  1330  3125  7215   |   84/16  70/30  65/25  60/40  62/38
Soundex Distance              45.1   412.8  1211  3028  7210   |   88/12  70/30  65/25  65/35  64/36
Overlap Coefficient           56.1   392.1  1440  3542  7331   |   97/3   70/30  65/25  65/35  66/36
Cosine Similarity             63.5   333.6  1331  3444  7347   |   83/17  70/30  65/25  65/35  50/50
q-Gram                        65     327.1  1267  3700  7409   |   98/2   70/30  65/25  65/35  68/32
Ontology-Winkler Similarity   63.5   333.6  1331  3997  7758   |   83/17  70/30  65/25  65/35  69/33

Legend:
- * indicates identical clones, ** indicates similar clones (shown above as identical/similar)
- Threshold θ is 0.7
i. GSpan-Jaro Winkler

[Figure 4.8 plots recall and precision (%) against document size (20, 40, 60, 80, 100) for the GSpan-Jaro Winkler combination.]

Figure 4.8: Recall and precision for GSpan-Jaro Winkler
[Figure 4.9 plots the proportions of identical and similar clones (%) against document size (20, 40, 60, 80, 100) for GSpan-Jaro Winkler.]

Figure 4.9: Robustness of GSpan-Jaro Winkler
[Figure 4.10 plots the computation time in seconds against document size (Test #1 to Test #3) for GSpan-Jaro Winkler.]

Figure 4.10: Computational time for GSpan-Jaro Winkler
ii. GSpan-Levenshtein Distance

[Figure 4.11 plots recall and precision (%) against document size (Test #1 to Test #3) for GSpan-Levenshtein Distance.]

Figure 4.11: Recall and Precision for GSpan-Levenshtein Distance
[Figure 4.12 plots the proportions of identical and similar clones (%) against document size (Test #1 to Test #3) for GSpan-Levenshtein Distance.]

Figure 4.12: Robustness for GSpan-Levenshtein Distance
[Figure 4.13 plots the computation time in seconds against document size (Test #1 to Test #3) for GSpan-Levenshtein Distance.]

Figure 4.13: Computational time for GSpan-Levenshtein Distance
The graphs above show the results of using the GSpan frequent subgraph miner. As can be seen, mining the similar structures with GSpan produces almost identical trends for these two string metrics: the values generated are very close to each other.
As shown in the graphs, all clone pairs found by our code clone detection system were true positives, which yields a precision of 100% for small as well as larger data sets. The trade-off appears in the recall: the system only manages to find a small number of clones, and most of the clones found are identical clones, so the limitation lies in searching for similar clones.
Another big issue shown in the data above is that the computation time increases rapidly as the number of documents increases, which is impractical for detecting clone pairs. More testing would be needed to find out whether this growth continues indefinitely as the number of documents increases.
Generally, there is not much difference between the trends of the graphs for the different frequent subgraph miners. The major difference is the overall computation time of the detection, as different frequent subgraph miners offer different performance in generating frequent subgraphs. The results show that Gaston offers the best computation time, followed by gSpan, FFSM and MoFa.
4.7 Limitation of the code clone detection program
As can be seen above, the overall results of our code clone detection program are not as good as expected. In general, for each subgraph miner and string metric used, the computation time increased rapidly as the number of source files increased. This is impractical for code clone detection or for any experiment in related areas, e.g. plagiarism detection.
Another big issue is that the results were not very good: there is a large trade-off between recall and precision. In terms of precision the program achieves very good results, but not in terms of recall, where only a part of all expected clones is found during the detection.
For analysis purposes, we identify a few points that may affect the overall results:
i. The computation time may be affected by the time taken in preprocessing to convert the original code into XML form.

ii. It may also be affected by the processing done by the subgraph miner. The miner generally generates all subtrees of the code, which can reach thousands of subtrees even for a small number of source files, before it identifies which subtrees are frequent.

iii. A machine with a higher specification is needed for testing, as the current machine is only capable of testing fewer than 100 source files at a time. We initially tested more than 100 files, but the computation time grew without bound.

iv. The program is only capable of detecting identical and nearly identical clones, since it uses string-based detection. The strength of string-based detection is that it can handle more languages, i.e. it is language independent, but its weakness is robustness: it is only able to detect identical and nearly identical clones.

v. The clones found in a particular test always have the same size, since we predefine the fragment size of the frequent subtrees in the frequent subgraph miner. Two clones may therefore be very similar and differ only by a node in a subtree. The following figure illustrates this scenario: assuming the shaded parts of the tree are taken as frequent subtrees by the subgraph miner and detected as clones in a source code, the nodes of the two frequent subtrees intersect and the subtrees could actually be taken as a single clone, but our program is unable to do so.
Figure 4.14: Two close clones (two clone subtrees in the same source code) that cannot be taken as a single clone
4.8 Comparison of results using different parameters
In the previous subchapter we presented results of running our clone detection program on web applications using a fixed threshold and minimum support of 0.7 and 10% respectively. Generally, the threshold is used to differentiate the types of clones in the results, while the minimum support indicates the minimum number of occurrences a subgraph must have in order for the subgraph miner to take it as a structural clone.
In this subchapter we show results obtained using different parameter values instead of fixed ones. The experiments use the GSpan-Jaro Winkler option in order to scale down the experiment size.
Table 4.3: Experiments using different parameter values

(a) File size = 20
Experiment          E1       E2      E3       E4       E5       E6
Threshold           0.7      0.7     0.7      0.6      0.6      0.6
Min Support         5        10      15       5        10       15
Computation time    41.516   43.2    45.797   41.828   38.859   39.99
Precision (%)       100      100     100      100      100      100
Recall (%)          95       80      85       95       83       83

(b) File size = 40
Experiment          E1       E2      E3       E4       E5       E6
Threshold           0.7      0.7     0.7      0.6      0.6      0.6
Min Support         5        10      15       5        10       15
Computation time    48.43    48.13   49.06    48.13    48.28    47.97
Precision (%)       100      100     100      100      100      100
Recall (%)          89       85      85       81       76       70

(c) File size = 60
Experiment          E1       E2      E3       E4       E5       E6
Threshold           0.7      0.7     0.7      0.6      0.6      0.6
Min Support         5        10      15       5        10       15
Computation time    48.29    4.89    51.766   48.12    48.28    48.43
Precision (%)       100      100     100      100      100      100
Recall (%)          86       76      80       81       76       70

(d) File size = 80
Experiment          E1       E2      E3       E4       E5       E6
Threshold           0.7      0.7     0.7      0.6      0.6      0.6
Min Support         5        10      15       5        10       15
Computation time    48.44    57.97   58.43    58.13    58.12    58.44
Precision (%)       100      99      100      100      99       100
Recall (%)          80       77      75       75       70       60

(e) File size = 100
Experiment          E1       E2      E3       E4       E5       E6
Threshold           0.7      0.7     0.7      0.6      0.6      0.6
Min Support         5        10      15       5        10       15
Computation time    0.813    0.875   0.828    0.906    0.86     0.828
Precision (%)       100      98      100      100      99       100
Recall (%)          79       76      70       65       60       53
[Figure 4.15 plots precision (%) against minimum support (5, 10, 15) for the five file-set sizes S1 to S5, (a) with threshold = 0.7 and (b) with threshold = 0.6.]

Figure 4.15: Precision result using different minimum support and threshold
[Figure 4.16 plots recall (%) against minimum support (5, 10, 15) for the five file-set sizes S1 to S5, (a) with threshold = 0.7 and (b) with threshold = 0.6.]

Figure 4.16: Recall result using different minimum support and threshold
As can be seen in the figures above, there is not much difference between the precision results even when different minimum supports and thresholds are used; almost all results reach 100% precision. However, as mentioned in Chapter 3, we take all found clones as true clones even if the fragments are actually accidental or deliberately copied, so there may be clones among them that are not useful.
While the precision figure shows a very flat graph, the recall results behave differently: as the number of files increases, the recall slowly decreases. The results are slightly better with a threshold of 0.7. This may be because, with a threshold of 0.6 or lower, we are exposed to more incorrectly found clones, as the program may mistakenly take any strings whose similarity reaches that level as clones.
CHAPTER 5
CONCLUSION
5.1 Introduction
As the number of web pages increases extensively over time, the number of possible clones among source code can also increase. Programmers tend to look for the easiest way to write code, and this may yield clones that put the maintenance of the system at risk.
The overall aim of this project is to examine the ability of an ontology mapping technique to detect clones between files of different systems. There has been much research on code clone detection, but none of it uses ontology mapping as part of the detection.
From the findings of the previous chapter, we know that it is possible to use the mapping technique to detect clones. However, the results are not very good, and the next step should of course be to refine the proposed methodology in order to obtain better results. The remainder of this chapter discusses the future work for this project.
5.2 Recommendations for Future Work
In order to obtain better clone detection results, we need to refine the methodology. Below are a few things that can be considered as the project moves on, aiming for better recall and precision.
i. Refine the process of generating the vocabulary.

ii. Refine the preprocessing phase, where the original code is transformed into standard code, to make sure that clones in all scripting and dynamic web page code, e.g. PHP and ASP, can be detected as well.

iii. In the process of mapping the tags using the vocabulary, extend the search to the end of every single page.

iv. Tune the subgraph miner so that the number of subtrees generated is more lenient without any redundancy of subtrees, etc.
5.3 Strengths of the system
The following are some strengths of our system:

i. It is capable of finding structural similarity among XML trees, i.e. structural clones.

ii. It achieves very good precision results in almost all data tests.
APPENDIX A
Project Activities

Project Schedule (Year 2008) – activities (plotted against the months January to November in the original Gantt chart):
- Comparative study of ontology mapping techniques for source code software cloning detection
- Test current/available tools
- Analyze the design of ontology mapping technique for source code software clone detection
- Identify the design of ontology mapping that is suitable for code cloning detection
- Generate a general framework of the clone detection technique using ontology mapping based on the strengths/weaknesses of the current tools
- Report writing and documentation
- Project presentation (Project 1)
- Develop the code clone detection program using the ontology mapping technique
- Test the developed program in order to comply with the cloning detection requirements
- Report writing and documentation
- Analysis of the results and discussion
- Project presentation (Project 2)

Project milestones:
- Literature review completed
- Analysis of existing approaches of ontology mapping
- Development of prototype program
- Project completion
APPENDIX B
Existing Works of Code Clone Detection

Year: 1992
Author/Tool: Baker / Dup
Supported languages: C, C++, Java
Domain/Type: CCD / Syntactic / String-based
Approach: code represented as a parameterized token string; comparison using suffix-tree based token matching
Strengths: detects exact and parameterized matches
Weaknesses: the line-by-line method has a weakness with line-structure modification

Year: 1996
Author/Tool: Mayrand / Covet
Supported languages: Java (TS)
Domain/Type: CCD / Syntactic / Metric-based
Approach: comparison of function metrics; metrics applied to the AST; the unit of measurement is the body of a function
Strengths: detects exact and near-miss clones (represented in an AST) based on the delta of the metric values (Level 1 to Level 8)
Weaknesses: -

Year: 1998
Author/Tool: Baxter / CloneDr
Supported languages: C, C++, COBOL, Java, Progress (TS)
Domain/Type: CCD / Syntactic / Tree-based
Approach: annotated parse tree generated by a compiler generator; uses a similarity threshold with which the user specifies how similar two subtrees should be
Strengths: can detect exact and near-miss clones; offers a practical means to remove detected clones (analysis and transformation)
Weaknesses: expensive, since it requires the full syntax

Year: 2001
Author/Tool: Krinke / Duplix
Supported languages: C (TS)
Domain/Type: CCD / Syntactic and semantic / Tree-based (PDG)
Approach: uses an extension of the AST (PDG); uses an iterative approach for detecting maximal similar subgraphs
Strengths: does not suffer a trade-off between precision and recall
Weaknesses: not scalable to large programs; non-polynomial complexity of the problem

Year: 2001
Author/Tool: Komondoor et al. / PDG-DUP
Supported languages: C, C++ (TS)
Domain/Type: CCD / Syntactic and semantic / Tree-based (PDG)
Approach: uses CodeSurfer to obtain the PDG; isomorphic PDG subgraph matching using backward slicing
Strengths: detects non-contiguous, reordered, intertwined clones
Weaknesses: -

Year: 2002
Author/Tool: Kamiya et al. / CCFinder
Supported languages: C, C++, COBOL, Java, Lisp, Plain Text (TS)
Domain/Type: CCD / Syntactic / Token-based
Approach: a sequence of tokens is produced by a scanner and transformed by language-specific transformation rules and by replacement of parameters; comparison of possible substrings
Strengths: exact or near-miss clones, possibly with gaps
Weaknesses: token-by-token matching is more expensive than line-by-line matching

Year: 2002
Author/Tool: Ueda et al.
Supported languages: -
Domain/Type: CCD / Syntactic / Token-based
Approach: detects gapped clones that are partly the same as the original with some different code portions, based on gap location information using the result of NG-clone; intended to improve GEMINI
Strengths: able to detect gapped clones even in very short clones
Weaknesses: -

Year: 2002
Author/Tool: Prechelt / JPlag
Supported languages: Java, C, C++, Scheme, NL text (TS)
Domain/Type: PD / Token-based
Approach: uses a greedy string approach
Strengths: -
Weaknesses: not scalable; works for small documents, e.g. student assignments

Year: 2003
Author/Tool: Jarzabek
Supported languages: Java
Domain/Type: -
Approach: offers refactoring to eliminate code clones using "composition with adaptation", called XVCL
Strengths: eliminates at least 68% of clones
Weaknesses: -

Year: 2004
Author/Tool: Wahler
Supported languages: Java, C++, Prolog (TS)
Domain/Type: CCD / Syntactic / Tree-based
Approach: code represented in XML; detects two types of clone (type 1 and type 2); uses a frequent itemset algorithm
Strengths: finds exact and parameterized clones at a more abstract level than the AST
Weaknesses: needs the clone pattern to be defined in the XML configuration file

Year: 2006
Author/Tool: Koschke et al.
Supported languages: -
Domain/Type: Tree-based
Approach: abstract syntax suffix tree
Strengths: -
Weaknesses: -

Year: 2007
Author/Tool: Jiang / DECKARD
Supported languages: any language with a formally specified grammar (TS)
Domain/Type: CCD / Syntactic / Tree-based
Approach: generates the AST; generates characteristic vectors to approximate structured information; clusters similar vectors with respect to the Euclidean space
Strengths: scalable; language independent
Weaknesses: -

Year: 2002
Author/Tool: Di Lucca / DiLucca Pro.
Supported languages: HTML client and ASP server pages (WA)
Domain/Type: PD
Approach: pages represented as sequences of tags; comparison using Levenshtein distance
Strengths: detects the same set of HTML tags/ASP objects (but the data may be different)
Weaknesses: -

Year: 2004
Author/Tool: Calefato
Supported languages: VBScript, JavaScript (WA)
Domain/Type: CCD / Syntactic
Approach: finds function clones in scripting code rather than code fragments; uses two stages: 1) automatic detection of potential function clones (using eMetric), 2) visual inspection of the selected script functions
Strengths: detects identical, nearly identical, similar and distinct clones
Weaknesses: needs human inspection

Legend: WA – web application, TS – traditional software, CCD – code clone detection, PD – plagiarism detection
APPENDIX C Experimental result tables
Table 1: Experimental result using FFSM miner

String metric                 Recall (%): 20  40  60  80  100   |   Precision (%): 20  40  60  80  100
Jaro Winkler                  80   85   76   77   76            |   100  100  100   99   98
Levenshtein Distance          85   83   80   78   76            |   100  100  100   99   98
Needleman-Wunch Distance      89   83   83   75   76            |   100  100  100   99   98
Dice Coefficient              100  95   92   88   79            |   100  100  100   99   98
Smith-Waterman Distance       87   80   83   85   87            |   100  100  100   99   98
Jaccard Similarity            90   77   78   78   70            |   100  100  100   99   98
Monge Elkan Distance          75   70   64   65   60            |   100  100  100   99   98
Matching Coefficient          87   80   83   85   87            |   100  100  100   99   98
Euclidean Distance            85   80   73   70   68            |   100  100  100   99   98
L1 Distance                   60   63   58   55   58            |   100  100  100   99   98
Gotoh Distance                65   60   58   55   57            |   100  100  100   99   98
Jaro Distance                 67   66   40   40   42            |   100  100  100   99   98
Soundex Distance              87   80   83   85   87            |   100  100  100   99   98
Overlap Coefficient           86   83   80   72   79            |   100  100  100   99   98
Cosine Similarity             75   70   73   70   73            |   100  100  100   99   98
q-Gram                        87   80   83   85   87            |   100  100  100   99   98
Ontology-Winkler Similarity   75   70   73   76   78            |   100  100  100   99   98

String metric                 Computation time (seconds): 20  40  60  80  100   |   Clone frequencies (% identical / % similar): 20  40  60  80  100
Jaro Winkler                  43.2   243.6  1504  3331  7200   |   83/17  70/30  65/35  65/35  68/32
Levenshtein Distance          50.8   257.3  1632  3462  7211   |   85/15  73/27  70/30  65/35  61/39
Needleman-Wunch Distance      54.6   267.2  1203  3023  7034   |   73/27  65/35  75/25  70/30  64/36
Dice Coefficient              78     300.1  1267  3023  7212   |   70/30  50/50  65/35  67/33  66/36
Smith-Waterman Distance       45     289.3  1452  3111  7345   |   93/7   85/15  89/11  73/27  76/24
Jaccard Similarity            34.4   240    1700  4674  8003   |   78/22  75/25  80/20  77/23  66/36
Monge Elkan Distance          67.1   387.6  1332  3100  7248   |   83/17  80/20  87/13  65/35  66/36
Matching Coefficient          45.2   221.2  1210  3111  7344   |   83/17  70/30  88/12  65/35  65/35
Euclidean Distance            44.8   207.3  1117  2900  6888   |   68/32  72/28  75/25  65/35  66/34
L1 Distance                   63.1   276.2  1630  3046  7211   |   95/5   99/1   97/3   68/32  64/36
Gotoh Distance                57.7   376.7  1817  3025  7254   |   92/8   90/10  95/5   83/17  75/25
Jaro Distance                 48.2   380.5  1330  3125  7215   |   84/16  70/30  65/25  60/40  62/38
Soundex Distance              45.1   412.8  1211  3028  7210   |   88/12  70/30  65/25  65/35  64/36
Overlap Coefficient           56.1   392.1  1440  3542  7331   |   97/3   70/30  65/25  65/35  66/36
Cosine Similarity             63.5   333.6  1331  3444  7347   |   83/17  70/30  65/25  65/35  50/50
q-Gram                        65     327.1  1267  3700  7409   |   98/2   70/30  65/25  65/35  68/32
Ontology-Winkler Similarity   63.5   333.6  1331  3997  7758   |   83/17  70/30  65/25  65/35  69/33

Legend:
- * indicates identical clones, ** indicates similar clones (shown above as identical/similar)
- Threshold θ is 0.7
Table 2: Experimental result using MoFa miner

String metric                 Recall (%): 20  40  60  80  100   |   Precision (%): 20  40  60  80  100
Jaro Winkler                  80   85   76   77   76            |   100  100  100   99   98
Levenshtein Distance          85   83   80   78   76            |   100  100  100   99   98
Needleman-Wunch Distance      89   83   83   75   76            |   100  100  100   99   98
Dice Coefficient              100  95   92   88   79            |   100  100  100   99   98
Smith-Waterman Distance       70   68   70   64   -             |   100  100  100   99   98
Jaccard Similarity            90   77   78   78   70            |   100  100  100   99   98
Monge Elkan Distance          75   70   64   65   60            |   100  100  100   99   98
Matching Coefficient          80   80   64   64   61            |   100  100  100   99   98
Euclidean Distance            85   80   73   70   68            |   100  100  100   99   98
L1 Distance                   60   63   58   55   58            |   100  100  100   99   98
Gotoh Distance                65   60   58   55   57            |   100  100  100   99   98
Jaro Distance                 67   66   40   40   42            |   100  100  100   99   98
Soundex Distance              87   80   83   85   87            |   100  100  100   99   98
Overlap Coefficient           86   83   80   72   79            |   100  100  100   99   98
Cosine Similarity             75   70   73   70   73            |   100  100  100   99   98
q-Gram                        78   70   50   58   60            |   100  100  100   99   98
Ontology-Winkler Similarity   75   70   73   76   78            |   100  100  100   99   98

String metric                 Computation time (seconds): 20  40  60  80  100   |   Clone frequencies (% identical / % similar): 20  40  60  80  100
Jaro Winkler                  43.2   243.6  1504  3023  7200   |   83/17  70/30  65/35  65/35  68/32
Levenshtein Distance          50.8   257.3  1632  3023  7200   |   85/15  73/27  70/30  65/35  61/39
Needleman-Wunch Distance      54.6   267.2  1203  3023  7200   |   73/27  65/35  75/25  70/30  64/36
Dice Coefficient              78     300.1  1267  3023  7200   |   70/30  50/50  65/35  67/33  66/36
Smith-Waterman Distance       45     289.3  1452  3023  7200   |   93/7   85/15  89/11  73/27  76/24
Jaccard Similarity            34.4   240    1700  3023  7200   |   78/22  75/25  80/20  77/23  66/36
Monge Elkan Distance          67.1   387.6  1332  3023  7200   |   83/17  80/20  87/13  65/35  66/36
Matching Coefficient          45.2   221.2  1210  3023  7200   |   83/17  70/30  88/12  65/35  65/35
Euclidean Distance            44.8   207.3  1117  3023  7200   |   68/32  72/28  75/25  65/35  66/34
L1 Distance                   63.1   276.2  1630  3023  7200   |   95/5   99/1   97/3   68/32  64/36
Gotoh Distance                57.7   376.7  1817  3023  7200   |   92/8   90/10  95/5   83/17  75/25
Jaro Distance                 48.2   380.5  1330  3023  7200   |   84/16  70/30  65/25  60/40  62/38
Soundex Distance              45.1   412.8  1211  3023  7200   |   88/12  70/30  65/25  65/35  64/36
Overlap Coefficient           56.1   392.1  1440  3023  7200   |   97/3   70/30  65/25  65/35  66/36
Cosine Similarity             63.5   333.6  1331  3023  7200   |   83/17  70/30  65/25  65/35  50/50
q-Gram                        65     327.1  1267  3023  7200   |   98/2   70/30  65/25  65/35  68/32
Ontology-Winkler Similarity   63.5   333.6  1331  3997  7758   |   83/17  70/30  65/25  65/35  69/33

Legend:
- * indicates identical clones, ** indicates similar clones (shown above as identical/similar)
- Threshold θ is 0.7
Table 3: Experimental result using Gaston miner

String metric                 Recall (%): 20  40  60  80  100   |   Precision (%): 20  40  60  80  100
Jaro Winkler                  80   85   76   77   76            |   100  100  100   99   98
Levenshtein Distance          85   83   80   78   76            |   100  100  100   99   98
Needleman-Wunch Distance      89   83   83   75   76            |   100  100  100   99   98
Dice Coefficient              100  95   92   88   79            |   100  100  100   99   98
Smith-Waterman Distance       70   68   70   64   -             |   100  100  100   99   98
Jaccard Similarity            90   77   78   78   70            |   100  100  100   99   98
Monge Elkan Distance          75   70   64   65   60            |   100  100  100   99   98
Matching Coefficient          80   80   64   64   61            |   100  100  100   99   98
Euclidean Distance            85   80   73   70   68            |   100  100  100   99   98
L1 Distance                   60   63   58   55   58            |   100  100  100   99   98
Gotoh Distance                65   60   58   55   57            |   100  100  100   99   98
Jaro Distance                 67   66   40   40   42            |   100  100  100   99   98
Soundex Distance              87   80   83   85   87            |   100  100  100   99   98
Overlap Coefficient           86   83   80   72   79            |   100  100  100   99   98
Cosine Similarity             75   70   73   70   73            |   100  100  100   99   98
q-Gram                        78   70   50   58   60            |   100  100  100   99   98
Ontology-Winkler Similarity   75   70   73   76   78            |   100  100  100   99   98

String metric                 Computation time (seconds): 20  40  60  80  100   |   Clone frequencies (% identical / % similar): 20  40  60  80  100
Jaro Winkler                  43.2   243.6  1504  3331  7200   |   83/17  70/30  65/35  65/35  68/32
Levenshtein Distance          50.8   257.3  1632  3462  7211   |   85/15  73/27  70/30  65/35  61/39
Needleman-Wunch Distance      54.6   267.2  1203  3023  7034   |   73/27  65/35  75/25  70/30  64/36
Dice Coefficient              78     300.1  1267  3023  7212   |   70/30  50/50  65/35  67/33  66/36
Smith-Waterman Distance       45     289.3  1452  3111  7345   |   93/7   85/15  89/11  73/27  76/24
Jaccard Similarity            34.4   240    1700  4674  8003   |   78/22  75/25  80/20  77/23  66/36
Monge Elkan Distance          67.1   387.6  1332  3100  7248   |   83/17  80/20  87/13  65/35  66/36
Matching Coefficient          45.2   221.2  1210  3111  7344   |   83/17  70/30  88/12  65/35  65/35
Euclidean Distance            44.8   207.3  1117  2900  6888   |   68/32  72/28  75/25  65/35  66/34
L1 Distance                   63.1   276.2  1630  3046  7211   |   95/5   99/1   97/3   68/32  64/36
Gotoh Distance                57.7   376.7  1817  3025  7254   |   92/8   90/10  95/5   83/17  75/25
Jaro Distance                 48.2   380.5  1330  3125  7215   |   84/16  70/30  65/25  60/40  62/38
Soundex Distance              45.1   412.8  1211  3028  7210   |   88/12  70/30  65/25  65/35  64/36
Overlap Coefficient           56.1   392.1  1440  3542  7331   |   97/3   70/30  65/25  65/35  66/36
Cosine Similarity             63.5   333.6  1331  3444  7347   |   83/17  70/30  65/25  65/35  50/50
q-Gram                        65     327.1  1267  3700  7409   |   98/2   70/30  65/25  65/35  68/32
Ontology-Winkler Similarity   63.5   333.6  1331  3997  7758   |   83/17  70/30  65/25  65/35  69/33

Legend:
- * indicates identical clones, ** indicates similar clones (shown above as identical/similar)
- Threshold θ is 0.7