An Approach to Identify Duplicated Web Pages

(draft)
Giuseppe Antonio Di Lucca°, Massimiliano Di Penta*, Anna Rita Fasolino°
[email protected], [email protected], [email protected]
(°) Dipartimento di Informatica e Sistemistica, Università di Napoli Federico II, Via Claudio, 21, 80125 Napoli, Italy
(*) RCOST – Research Centre on Software Technology, Università del Sannio, Dipartimento di Ingegneria, Piazza Roma, 82100 Benevento, Italy

Abstract

A relevant consequence of the unceasing expansion of the Web and of e-commerce is the growing demand for new Web sites and Web applications. As a result, Web sites and applications are usually developed without a formalized process: Web pages are directly coded in an incremental way, where new pages are obtained by duplicating existing ones. Duplicated Web pages, having the same structure and differing only in the data they include, can be considered as clones. The identification of clones may reduce the effort devoted to testing, maintaining and evolving Web sites and applications. Moreover, clone detection across different Web sites may reveal cases of possible plagiarism. In this paper we propose an approach, based on similarity metrics, to detect duplicated pages in Web sites and applications implemented with the HTML language and ASP technology. The proposed approach has been assessed by analyzing several Web sites and Web applications; the results obtained in some case studies are reported in the paper.

Keywords: Web engineering, Web site analysis, Web site metrics, source code clones, clone analysis, software metrics

1. Introduction

The rapid diffusion of the Internet and of the World Wide Web infrastructure is producing a considerable growth of the demand for new Web sites and Web Applications¹ (WAs). The software industry is facing this new opportunity under the pressure of a very short time-to-market and extremely high competition. As a result, Web sites and applications are usually developed without a formalized process. In the absence of disciplined analysis and design phases, Web pages are likely to be coded directly, in an incremental way. Moreover, to further reduce the time-to-market, new pages are obtained by reusing

¹ In general, a Web site may be thought of as a static site that may provide dynamic information too. A Web application provides the Web user with a means to modify the site status (e.g., by adding/updating site information). We will use the term Web application to refer to both Web sites and Web applications.

the code of existing pages, just by copy-and-paste operations, without explicitly documenting these code duplications. On the other hand, the proliferation of pieces of code with the same, or a very similar, structure is promoted by the lack of suitable reuse and delegation mechanisms in the languages generally used for implementing WAs. Since the presence of duplicated pages may increase software complexity, and augment the effort required to test, maintain and evolve the application, the detection of duplicated pages represents a feasible way to carry out testing or maintenance processes more efficiently. Moreover, the identification of duplicated pages across different Web sites may reveal cases of possible plagiarism. The portions of duplicated code are generally called clones, and clone analysis is the research area that investigates methods and techniques for automatically detecting them. Several approaches to clone analysis have been presented in the literature [Gri81, Ber84, Jan88, Hor90] with respect to traditional software systems. Exploring the portability of these approaches to the field of WAs is a relevant research topic to be addressed. Extending the existing clone analysis approaches first requires the concept of clone to be defined with respect to a WA. Since a WA is essentially composed of pages, it is possible to focus on clones made up of pages: two or more pages can be considered as clones if they have the same, or a very similar, structure, i.e., the code implementing the final rendering of the page in a browser, the business rule processing, the event management, etc. is the same, while they may differ only in the information they include (i.e., the information to be read/displayed from/to a user). In this paper an approach to detect duplicated pages in WAs is proposed. The approach is based on similarity metrics, and addresses the detection of clones made up of client or server static pages, implemented with the HTML language and ASP technology. Some specific features of HTML and ASP are used in the computation of the metrics, and in defining a distance measure between WA pages. The distance measure is used to determine the similarity degree of the pages: two pages will be considered clones if they are characterized by the same values of the defined metrics (i.e., their distance is zero). The approach requires the WA pages to be statically analyzed, and the pre-defined features of the pages to be extracted.

The validity of the proposed approach has been assessed by means of experiments involving several WAs, whose results are encouraging. In order to carry out the experiments, a prototype tool has been developed to automatically compute the distance between pages. The remainder of the paper is structured as follows: Section 2 provides a short background on clone analysis and the motivation of our research, while Section 3 presents our approach to the identification of duplicated pages in WAs. The experiments carried out to assess the approach are described in Section 4, and conclusive remarks are given in Section 5.

2. Background and motivations

2.1 Software clones and clone analysis

Duplicated or similar portions of code in software artifacts are usually called clones, and clone analysis is the research area that investigates methods and techniques for automatically detecting them. The research interest in this area was born in the '80s [Gri81, Ber84, Jan88, Hor90] and focused on the definition of methods and techniques for identifying replicated code portions in procedural software systems. The methods and techniques for clone analysis described in the literature focus either on the identification of clones consisting of exactly matching code portions (exact match) [Bak93, Bak95, Bak95b], or on the identification of clones consisting of code portions that coincide provided that the names of the involved variables and constants are systematically substituted (p-match, or parameterized match). The approach to clone detection proposed in [Bal99] and [Bal00] exploits the Dynamic Pattern Matching algorithm [Kon95, Kon96], which computes the Levenshtein distance [Lev66] between fragments of code. Baxter [Bax98] introduced the concept of near miss clone, i.e., a fragment of code that partially coincides with another one. Further approaches, such as the ones proposed in [May96, Kon97, Lag97, Pat99], exploit software metrics concerning the code control flow or data flow.

2.2 Web applications and software clones

A WA is composed of pages. It is possible to distinguish server pages, i.e., pages stored on the Web server (possibly containing server-side scripts), from client pages (pages as they are received by the browser). Client pages may, in turn, be classified as static or dynamic pages: a static page is saved in a file and its content is permanent, while a dynamic page is built by the server at run-time, according to a request from a client, and its content may be different each time it is generated. Web pages are usually coded using HTML and scripting languages; they may include applets and other classes and objects, as well as pictures, images, movies and sounds. Moreover, they contain the information to be shown/provided to/from a user; this information, made up of text, images, multimedia objects, etc., may be embedded in the page itself, or retrieved/stored from/in a file or database. Thus, in a Web page we can recognize a control component (i.e., the set of items - such as the HTML code, scripts and applets - determining the page layout, business rule processing, and event management) and a data component (i.e., the set of items - such as text, images and multimedia objects - determining the information to be read/displayed from/to a user). Web developers, when coding Web pages, usually create some initial pages, and then generate other pages by reusing the code of

the initial ones, especially the code implementing the page control component. We may think of the initial pages as page templates, used to compose the other pages of the WA. The developers duplicate these templates, and each copy is filled in with the information each actual page has to contain. Each page template may be considered as the control component of each actual page built from that template, while the added information is mainly the data component of that page. In such a way, we can identify groups of duplicated pages in a WA, each page deriving from the same template, having the same control component (i.e., showing the same rendering and functional behavior), and differing only in the data component. However, duplication operations do not necessarily involve a whole page: they may be limited to some portions of code, such as lines of HTML code, scripting blocks, and so on. Duplication operations involve both client and server static pages, even if page templates are mostly used for client pages. If a server page building dynamic pages is duplicated, then all the dynamic pages it can build are to be considered as clones of the pages built by the original server page. Moreover, people tend to reuse the basic structure of Web pages from an existing site to develop a new one. If this happens within the same organization, the phenomenon may be considered as a kind of reuse, and should therefore be promoted and supported. Otherwise, there may be a violation of copyright laws. In both cases, a clone detection method may be useful to:
• decide to store the most cloned pages into a repository, from which commonly used structures may be easily retrieved;
• suspect a case of plagiarism between two sites by detecting pages having a high level of similarity.
The search for clones in WAs may involve both static and dynamic pages. Since the control and the data component of a dynamic page depend on the sequence of events occurred at run-time, searching for clones in these pages would require dynamic analysis techniques. Conversely, the structure of a static page is predefined in the file that implements it, and clone detection can be carried out by statically analyzing the file. In this paper, we consider totally duplicated static Web pages as clones. In particular, we consider two types of duplicated pages: couples of static client pages having the same HTML control component, i.e., pages composed of the same set of

HTML tags, and couples of static server pages, coded using ASP and including the same set of ASP objects. As an example, Figure 1 shows a couple of duplicated static client pages: the pages have the same structure and rendering, and differ only in the data component they contain. Both pages show a column including seven buttons on the left side, a title in the upper part and a list of items in the central part; they differ only in the value of the title and in the values of the items included in the two lists. At the moment, a considerable growth in the size of Web sites and WAs can be observed, and the need to effectively maintain these applications is rapidly growing [Ric00, War99]. The identification of clones in a WA is a valuable activity to effectively support and reduce the effort of testing, maintaining and evolving it. Moreover, clones can be searched for in a WA to support its migration to different architectures and platforms, to cluster similar/identical structures into single modules, and to facilitate the process of

Figure 1: A couple of cloned pages

separating the content from the user interface (which may be a PC browser, a PDA, a WAP phone, etc.).

The effectiveness of clone analysis techniques, traditionally used to detect clones in procedural or object-oriented software, has to be assessed in the context of WAs, and suitable approaches for tailoring these techniques to this new context have to be investigated. In this paper, among the various approaches proposed in the literature for clone analysis, the technique based on the edit (Levenshtein) distance will be considered. Moreover, a frequency-based approach will be proposed, and the validity and effectiveness of both approaches will be discussed.

3. Metrics to detect duplicated Web pages

3.1 Detecting duplicated Web pages by the Levenshtein distance

The Levenshtein distance [Lev66, Ula72, Kur95] can be used to identify duplicated pages in a WA: WA pages may be considered as sequences of symbols, where each symbol corresponds to one element of the control component of the page. The computation of the Levenshtein distance requires that an alphabet of distinct symbols is preliminarily defined. Since our interest is to compute the degree of similarity of the control components of both client and server static pages, we have to define and use a different alphabet for each type of page, in order to take into account the different techniques and languages used to implement the control components of these pages.

3.1.1 Detecting duplicated client pages

The control component of a static client page is mainly implemented in HTML, thus a candidate alphabet for identifying duplicated static client pages will include one symbol for each HTML tag (in the example below, attribute names are also included in the alphabet). In this way, it is possible to extract from each static Web page a string composed of the symbols corresponding to the HTML tags in the page: the Levenshtein distance between couples of such strings will be computed and used to compare the couples of pages the strings were extracted from. As an example, let us consider two HTML code lines such as the following (the attribute values are immaterial here):

    <td width="120">
    <img src="fig1.gif" width="100" height="80"></td>
Using Table 1, where the first row reports a reduced HTML tag alphabet and the second row reports the symbol corresponding to each tag, by analyzing the two HTML code lines we can identify the following sequence of HTML tags:

    (td, width, img, src, width, height, /td)

and the corresponding string of symbols is:

    u = hifgieb

In the following we will use the term HTML-string to refer to the string of symbols associated with the sequence of tags extracted from HTML code. In a WA, an HTML-string is associated with each static client page.

Table 1: An example of HTML alphabet

    Tag      /div  /td  align  div  height  img  src  td  width
    Symbol    a     b    c      d    e       f    g    h   i
The Levenshtein distance of each couple of HTML-strings, i.e., of each couple of client static pages of the WA, will be computed. If the Levenshtein distance of two HTML-strings is zero, the corresponding two pages are clones; if the distance is greater than zero, but less than a sufficiently small, pre-defined threshold, the pages are candidates to be near miss clones; higher values indicate that the corresponding pages are largely different. Now, let us consider two further HTML code lines such as the following:

    <td width="120"><div align="center">
    <img src="fig2.gif" width="100" height="80"></div></td>

With reference to the same alphabet of the previous example, we can identify the following sequence of HTML tags:

    (td, width, div, align, img, src, width, height, /div, /td)

and then extract the HTML-string:

    v = hidcfgieab

The optimal alignment of u and v is:

    h i d c f g i e a b
    h i - - f g i e - b

and the Levenshtein distance between the strings u and v is D(u, v) = 3.
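For reference, the Levenshtein distance used here is the standard edit distance, defined for strings $u = u_1 \ldots u_m$ and $v = v_1 \ldots v_n$ by the well-known dynamic-programming recurrence (a textbook formulation, not stated explicitly in the paper):

\[ D(i,0) = i, \qquad D(0,j) = j, \]
\[ D(i,j) = \min\bigl( D(i-1,j) + 1,\; D(i,j-1) + 1,\; D(i-1,j-1) + [u_i \neq v_j] \bigr), \]

with $D(u,v) = D(m,n)$. For $u = \mathtt{hifgieb}$ and $v = \mathtt{hidcfgieab}$ the recurrence yields $D(u,v) = 3$, corresponding to the three deletions (d, c, a) visible in the alignment above.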

In order to improve the effectiveness of such an approach, the risk of detecting misleading similarities between pages, and the risk of not detecting meaningful similarities, have to be minimized. The first type of risk is due, for instance, to the set of attributes that characterize HTML tags. In fact, it is possible to find equal sequences of HTML attributes, each one referring to different tags; these sequences produce common sub-sequences in the HTML-strings, and thus a Levenshtein distance value lower than the actual one. This means that a lexical similarity is identified which does not correspond to an actual structural similarity between the involved pages. In such a case we can detect false positive near miss clones. The second type of risk is connected both with the problem of 'composite tags', i.e., sequences of tags producing a result equivalent to that of another single tag, and with the categories of tags that influence only the text format, such as the tags for text/character formatting, font selection and so on (e.g., the tags H1, H2, H3, etc.). These problems can be solved by refining the preliminary alphabet, let us call it A, which includes all the HTML tags: each composite tag in the alphabet is substituted with its equivalent single tag, obtaining the alphabet A'; then, the set of tags that just define the text/character formatting is eliminated, obtaining a new refined alphabet A''. The detection of duplicated static client pages can therefore be carried out according to the following process: (i) the HTML files are first parsed, the HTML tags are extracted and the composite tags are substituted with their equivalent ones; (ii) the resulting HTML-strings are composed of symbols from the A' alphabet; (iii) these strings are processed in order to eliminate the symbols not belonging to the A'' alphabet; (iv) the final HTML-strings are submitted to the computation of the Levenshtein distance: the resulting distance matrix includes the distance between each couple of analyzed HTML-strings.
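To make steps (i)-(iv) concrete, here is a minimal Python sketch of HTML-string extraction and distance computation. It is our own illustration, not the authors' prototype tool: the regular expressions, the function names and the reduced alphabet of Table 1 are simplifying assumptions.

    import re

    # Reduced alphabet of Table 1 (a full tool would use the refined A'' alphabet).
    ALPHABET = {'/div': 'a', '/td': 'b', 'align': 'c', 'div': 'd',
                'height': 'e', 'img': 'f', 'src': 'g', 'td': 'h', 'width': 'i'}
    TAG_RE = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>')
    ATTR_RE = re.compile(r'([a-zA-Z-]+)\s*=')

    def html_string(html):
        """Map the sequence of tags/attributes of a page to its HTML-string."""
        symbols = []
        for close, name, attrs in TAG_RE.findall(html):
            tag = ('/' if close else '') + name.lower()
            if tag in ALPHABET:
                symbols.append(ALPHABET[tag])
            if not close:
                symbols += [ALPHABET[a.lower()] for a in ATTR_RE.findall(attrs)
                            if a.lower() in ALPHABET]
        return ''.join(symbols)

    def levenshtein(u, v):
        """Classic O(len(u) * len(v)) dynamic-programming edit distance."""
        prev = list(range(len(v) + 1))
        for i, cu in enumerate(u, 1):
            curr = [i]
            for j, cv in enumerate(v, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (cu != cv)))   # substitution
            prev = curr
        return prev[-1]

    u = html_string('<td width="120"><img src="fig1.gif" width="100" height="80"></td>')
    v = html_string('<td width="120"><div align="center">'
                    '<img src="fig2.gif" width="100" height="80"></div></td>')
    print(u, v, levenshtein(u, v))  # hifgieb hidcfgieab 3

A full implementation would apply the A → A' → A'' refinement (composite-tag substitution and formatting-tag elimination) before mapping tags to symbols.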

3.2 Detecting duplicated client pages using a frequency-based metric

The detection of duplicated WA pages based on the Levenshtein distance is in general very expensive from a computational point of view: the computational complexity of the algorithm computing the Levenshtein distance is in fact O(n²), where n is the length of the longer string. An alternative method to detect clones in WAs is based on the occurrences (i.e., the frequency) of each HTML tag in the pages. The method requires that each client page is associated with an array whose components represent the frequencies of the HTML tags in that page: the size of the array is equal to the number of tags in the considered HTML alphabet, and the i-th component of the array provides the number of occurrences of the i-th tag in the associated page. We will call such an array an HTML-array. If we consider the same HTML code lines of the examples in the previous section, and their associated HTML-strings u and v, the corresponding HTML-arrays are reported in the second and third row of Table 2, respectively, whose first row reports the considered HTML alphabet. Given the arrays associated with each page, a distance function in a vector space can be defined, such as the linear (Manhattan) distance or the Euclidean distance. Duplicated pages will be represented by vectors having a zero distance, since they are characterized by the same frequency of each tag, while similar pages will be represented by vectors with a small distance. For the previous example, the linear distance (LD) of the two HTML-arrays is LD = 3, while the Euclidean one (ED) is ED = 1.732. We will use the ED distance to detect duplicated pages.

Table 2: An example of HTML-arrays

    Tag   /div  /td  align  div  height  img  src  td  width
    u      0     1    0      0    1       1    1    1   2
    v      1     1    1      1    1       1    1    1   2
Since different pages may exhibit the same tag frequencies without having the same sequence of tags, the risk of detecting false positive clones is higher than when performing clone detection by the Levenshtein distance. In particular, given a WA, the frequency-based metric detects all the clones identified by the Levenshtein distance, but it may detect further clones, and these latter ones may be false positives. However, the lower precision of the frequency-based metric is counterbalanced by its computational cost, which is lower than that of the Levenshtein distance.
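As an illustration, the following sketch (again our own, reusing the alphabet and the HTML-strings of the previous examples) builds the HTML-arrays of Table 2 and compares them by Euclidean distance:

    from collections import Counter
    from math import sqrt

    # Assumes the ALPHABET and html_string() of the previous sketch.
    TAGS = ['/div', '/td', 'align', 'div', 'height', 'img', 'src', 'td', 'width']
    ALPHABET = {'/div': 'a', '/td': 'b', 'align': 'c', 'div': 'd',
                'height': 'e', 'img': 'f', 'src': 'g', 'td': 'h', 'width': 'i'}

    def html_array(page_html_string):
        """Frequency of each alphabet tag in the page's HTML-string."""
        counts = Counter(page_html_string)
        return [counts[ALPHABET[t]] for t in TAGS]

    def euclidean(p, q):
        return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    au, av = html_array('hifgieb'), html_array('hidcfgieab')
    print(au)                 # [0, 1, 0, 0, 1, 1, 1, 1, 2] (row u of Table 2)
    print(av)                 # [1, 1, 1, 1, 1, 1, 1, 1, 2] (row v of Table 2)
    print(euclidean(au, av))  # 1.7320... = sqrt(3); the linear distance is 3

Both building the arrays and computing the distance take time linear in the page length and alphabet size, which is consistent with the much lower computation times reported in Section 4.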

3.3 Detecting duplicated server pages

Active Server Pages (ASP) is one of the technologies used to create server pages; we referred to it to define an approach for detecting duplicated static server pages, based on the computation of the Levenshtein distance. The built-in ASP objects, together with their methods, properties and collections, may characterize the control component of an ASP page. Thus, an ASP page may be thought of as a sequence of references to these elements. The underlying hypothesis is that, if two ASP pages present the same sequence of references to ASP features, they could have the same behavior. The symbol alphabet to be used for the Levenshtein distance computation will include all the built-in ASP object elements that can be referenced in an ASP page. ASP pages are analyzed and an ASP-string is extracted from each page, using an approach similar to the one used for HTML pages; for each couple of server pages, both the ASP distance and the HTML distance are computed, and pages having null distance will, again, be considered as clones. Only the Levenshtein distance has been considered for detecting duplicated static server pages.
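The following sketch illustrates how an ASP-string could be extracted. The list of built-in objects is that of classic ASP, while the symbol encoding, the regular expressions and the restriction to <% ... %> script blocks are our simplifying assumptions, since the paper does not detail the extractor:

    import re

    # Hypothetical extractor: ASP-strings from references to built-in objects.
    ASP_OBJECTS = ['Application', 'ObjectContext', 'Request', 'Response',
                   'Server', 'Session']
    SYMBOLS = dict(zip(ASP_OBJECTS, 'ABCDEF'))
    SCRIPT_RE = re.compile(r'<%(.*?)%>', re.DOTALL)
    REF_RE = re.compile(r'\b(' + '|'.join(ASP_OBJECTS) + r')\b')

    def asp_string(source):
        """Concatenate, in order, the symbols of the ASP objects referenced."""
        refs = []
        for block in SCRIPT_RE.findall(source):
            refs += REF_RE.findall(block)
        return ''.join(SYMBOLS[r] for r in refs)

    page = '<% name = Request.Form("user")\n   Response.Write(name) %>'
    print(asp_string(page))  # CD

The Levenshtein distance between two such ASP-strings can then be computed exactly as for HTML-strings.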

4. Case studies

Several WAs were analyzed using the proposed approach with the aim of assessing its feasibility and effectiveness. A prototype tool, which parses the HTML and ASP source files, extracts the HTML tags/ASP objects, produces the HTML/ASP-strings and automatically computes the distances between the pages, has been developed to support the experiments. Two kinds of experiments were carried out: the first one aimed to detect clones within a WA, while the second one aimed to detect cases of Web site plagiarism.

4.1 Clone detection within a WA

This section provides the results of a case study involving four WAs (named WA1, WA2, WA3, and WA4, respectively). WA1 implemented a 'juridical laboratory' with the aim of supporting the job of professional lawyers, and WA2 implemented the Web site of a research project involving some Italian universities and industries. WA3 realized the Web site of an Italian actress, and WA4 was a Web site reporting historical information about the Italian Middle Ages. While both client and server side source files were available for WA1 and WA2, only the source files of the static client pages of WA3 and WA4 were available; therefore, for the latter two applications we could only carry out the analysis to detect duplicated static client pages. The Levenshtein and Euclidean distances between each couple of HTML pages in WA1, WA2, WA3 and WA4 were computed, and the Levenshtein distance between the couples of ASP pages in WA1 and WA2 was computed too. Table 3 and Table 4 report the main results of the analysis we carried out for detecting duplicated HTML and ASP pages, respectively (a Pentium III 850 MHz PC was used for the computations). Table 3 also reports the number of clusters grouping sets of duplicated pages, all with the same template.

Table 3: Main results from HTML pages analysis

                                     Levenshtein Distance           Frequency Based
                                 WA1     WA2  WA3      WA4      WA1  WA2  WA3   WA4
    Nr. of HTML files            201     16   331      115      201  16   331   115
    Nr. of couples of clones     46      6    3022     20       46   6    3466  22
    Nr. of files having a clone  18      4    246      21       18   4    247   25
    Nr. of clusters              3       1    24       8        3    1    24    10
    Computation time             2h 50m  4s   21h 47m  14m      15s  1s   21m   3m

Table 4: Main results from ASP pages analysis

                                 WA1  WA2
    Nr. of ASP files             19   41
    Nr. of couples of clones     2    33
    Nr. of files having a clone  4    23
    Computation time             53s  2m

As to the detected HTML clones, the Levenshtein distance approach and the frequency-based one detected the same set of clones in both WA1 and WA2, while the sets of clones detected in WA3 and WA4 showed some small differences, and a few false positives were detected. For each application, the couples of clones detected by the Levenshtein distance were visualized with a browser in order to validate the results of the analysis: each couple actually implemented exact clones. Moreover, the couples of exact clones were further analyzed in order to group them into clusters composed of identical or very similar pages; each cluster grouped a set of pages all having the same template. For instance, one of the three clusters of WA1 grouped the pages representing the roots of the sub-trees of the Web site reachable from the home page of the application (the cloned pages shown in Figure 1 belong to this cluster); another cluster grouped the pages implemented by files with the same name 'Mainframe.htm'; the third cluster grouped the pages implemented by files with the same name 'Title.htm'. The largest cluster was found in WA3 and included 51 pages: these pages contained pictures and an illustrative text about some movies featuring the actress the site is devoted to. No exact clones other than the ones indicated in Table 3 were detected by a 'manual' verification. However, the Levenshtein and Euclidean distance matrices included couples of pages with a very low distance, which made them potential near miss clones. As an example, the Levenshtein distance matrix of WA3 included 2333 couples of pages with a distance greater than zero and less than 10: most of these pages were actually near miss clones.
The computation times of the frequency-based method were much lower than those required by the Levenshtein distance method, even if the former produced less precise results. Therefore, an opportunistic approach may be proposed: the frequency-based method would be used to preliminarily identify potential couples of clones, and the Levenshtein method would then be applied to these couples to detect the actual clones and reject the false ones. As to the ASP analysis, the results produced by the proposed method were validated by a 'manual' verification performed by code inspection. One of the two ASP clones detected in WA1 was actually a clone, while the second one was composed of two pages having the same portion of ASP code, but with large differences in the remaining code; therefore, these two pages could not be classified as clones, even if they contained some duplicated ASP code fragments. Similar considerations hold for the clones detected in WA2: some detected clones were false positives, even if many of them presented significant similarities. However, in both WAs no other clones were detected. Therefore, the Levenshtein distance was not as successful in finding clones in ASP pages as it was in detecting clones in HTML pages. However, this method may be used to identify couples of pages having the same ASP-strings, which are candidates to be clones, and the actual clones should be looked for among them.
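The opportunistic approach suggested above could be organized as in the following sketch, which reuses the hypothetical html_string, html_array, euclidean and levenshtein functions of the earlier sketches:

    def find_clone_couples(pages):
        """pages: dict mapping a page name to its HTML source.
        Returns the couples of pages detected as exact clones."""
        strings = {p: html_string(src) for p, src in pages.items()}
        arrays = {p: html_array(strings[p]) for p in pages}
        names = sorted(pages)
        clones = []
        for i, p in enumerate(names):
            for q in names[i + 1:]:
                # Phase 1: cheap frequency-based filter; equal HTML-arrays are
                # necessary (but not sufficient) for equal HTML-strings.
                if euclidean(arrays[p], arrays[q]) == 0:
                    # Phase 2: confirm with the exact, more expensive distance.
                    if levenshtein(strings[p], strings[q]) == 0:
                        clones.append((p, q))
        return clones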

5. Conclusions

In this paper an approach to clone analysis in the context of Web systems has been proposed. Pages of a WA having the same control component were considered as clones, even if they differed in the data component. Two methods for detecting duplicated WA pages - one exploiting the Levenshtein distance, and the other based on the frequency of the HTML tags in a page - have been defined and experimented with. During the experiments, the proposed methods detected clones among static Web pages, and a manual verification proved the methods' effectiveness. The methods produced comparable results, but with different computational costs. In the case of static HTML pages, the frequency-based method produced sets of couples of clones with few differences with respect to those obtained by applying the Levenshtein distance, but at a lower computational cost. Therefore, the set of clones detected by the frequency-based method may be further refined using the

Levenshtein distance method. In the case of ASP pages, the Levenshtein distance may just be used to identify an initial set of potential ASP page clones, among which to search for the actual ones. The proposed approach has also been successfully applied to identify a case of plagiarism. Further experimentation should be carried out to better validate the proposed methods. Moreover, approaches based on other suitable Web software metrics to identify clones, in particular among server pages, will be investigated. Clone detection makes it possible to highlight the reuse of patterns of HTML tags or ASP objects (i.e., recurrent structures among pages, implemented by specific sequences of HTML tags or ASP objects), and provides an approach to facilitate Web maintenance and the migration to a model where the content is separated from the presentation. Moreover, identifying clones facilitates the testing process of a WA, since it is possible to partition the pages into equivalence classes and to specify a suitable number of test cases accordingly.

References

[Bak93] Baker B. S., A theory of parameterized pattern matching: algorithms and applications, in Proceedings of the 25th Annual ACM Symposium on Theory of Computing, 71-80, May 1993.
[Bak95] Baker B. S., On finding duplication and near duplication in large software systems, in Proceedings of the 2nd Working Conference on Reverse Engineering, IEEE Computer Society Press, 1995.
[Bak95b] Baker B. S., Parameterized pattern matching via Boyer-Moore algorithms, in Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, 541-550, Jan 1995.
[Bal99] Balazinska M., Merlo E., Dagenais M., Lagüe B., Kontogiannis K., Measuring clone based reengineering opportunities, in Proceedings of the International Symposium on Software Metrics, METRICS '99, IEEE Computer Society Press, Nov 1999.
[Bal00] Balazinska M., Merlo E., Dagenais M., Lagüe B., Kontogiannis K., Advanced clone-analysis to support object-oriented system refactoring, in Proceedings of the Seventh Working Conference on Reverse Engineering, 98-107, Nov 2000.
[Bax98] Baxter I. D., Yahin A., Moura L., Sant'Anna M., Bier L., Clone detection using abstract syntax trees, in Proceedings of the International Conference on Software Maintenance, 368-377, IEEE Computer Society Press, 1998.
[Ber84] Berghel H. L., Sallach D. L., Measurements of program similarity in identical task environments, SIGPLAN Notices, 9(8):65-76, Aug 1984.
[Frak92] Frakes W. B., Baeza-Yates R., Information Retrieval: Data Structures and Algorithms, Prentice-Hall, Englewood Cliffs, NJ, 1992.
[Gri81] Grier S., A tool that detects plagiarism in PASCAL programs, SIGCSE Bulletin, 13(1), 1981.
[Hor90] Horwitz S., Identifying the semantic and textual differences between two versions of a program, in Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 234-245, June 1990.
[Jan88] Jankowitz H. T., Detecting plagiarism in student PASCAL programs, The Computer Journal, 31(1):1-8, 1988.
[Kon95] Kontogiannis K., De Mori R., Bernstein M., Merlo E., Pattern matching for design concept localization, in Proceedings of the 2nd Working Conference on Reverse Engineering, IEEE Computer Society Press, 1995.
[Kon96] Kontogiannis K., De Mori R., Merlo E., Galler M., Bernstein M., Pattern matching for clone and concept detection, Journal of Automated Software Engineering, 3:77-108, Mar 1996.
[Kon97] Kontogiannis K., Evaluation experiments on the detection of programming patterns using software metrics, in Proceedings of the 4th Working Conference on Reverse Engineering, 44-54, 1997.
[Kur95] Kurtz S., Fundamental Algorithms for a Declarative Pattern Matching System, Report 95-03, Technische Fakultät der Universität Bielefeld, Bielefeld, Germany, 1995.
[Lag97] Lagüe B., Proulx D., Merlo E., Mayrand J., Hudepohl J., Assessing the benefits of incorporating function clone detection in a development process, in Proceedings of the International Conference on Software Maintenance, 314-321, IEEE Computer Society Press, 1997.
[Lev66] Levenshtein V. I., Binary codes capable of correcting deletions, insertions, and reversals, Cybernetics and Control Theory, 10:707-710, 1966.
[May96] Mayrand J., Leblanc C., Merlo E., Experiment on the automatic detection of function clones in a software system using metrics, in Proceedings of the International Conference on Software Maintenance, 244-253, IEEE Computer Society Press, 1996.
[Pat99] Patenaude J. F., Merlo E., Dagenais M., Lagüe B., Extending software quality assessment techniques to Java systems, in Proceedings of the 7th International Workshop on Program Comprehension, IWPC '99, IEEE Computer Society Press, 1999.
[Ric00] Ricca F., Tonella P., Web Analysis: Structure and Evolution, in Proceedings of the International Workshop on Web Site Evolution, 76-86, 2000.
[Ula72] Ulam S. M., Some combinatorial problems studied experimentally on computing machines, in Zaremba S. K. (ed.), Applications of Number Theory to Numerical Analysis, 1-3, Academic Press, 1972.
[War99] Warren P., Boldyreff C., Munro M., The evolution of websites, in Proceedings of the International Workshop on Program Comprehension, 178-185, 1999.