Using Clustering to Support the Migration from ... - Semantic Scholar

9 downloads 5192 Views 217KB Size Report
Using Clustering to Support the Migration from Static to Dynamic Web Pages. Filippo Ricca and Paolo ..... ysis, exploiting Natural Language Processing (NLP) tech- niques. More details on this ..... Colline dell'Irpinia Centrale. 350/450m.
Using Clustering to Support the Migration from Static to Dynamic Web Pages Filippo Ricca and Paolo Tonella ITC-irst 38050 Povo (Trento), Italy ricca, tonella @itc.it 

tion can be separated. This has remarkably positive effects on the Web site maintainability and overall quality. In fact, if data are stored separately (e.g., in a database), and Web pages are created dynamically, based on a fixed template, the update of information can be directly performed on the separated data storage. Programs for the dynamic generation of the pages may also need to be modified, but still with the advantage that formatting is not intermixed with content. Moreover, using a single dynamic program to generate a whole set of pages ensures a consistent structure and formatting of the result. For multilingual Web sites, the advantages of the dynamic approach are even more evident, since these sites have the additional problem of the consistency of structure and content across languages [12]. The dynamic generation of the pages in all languages from a common template, as well as the separation of the multilingual content from the page structure, alleviate the consistency problem and simplify Web site evolution. Several existing Web sites have not yet moved to the next generation of dynamic sites. There are several examples of sites in which the products commercialized by a company are presented in static Web pages, where a given HTML structure is replicated for each product-card, with productspecific values in each page. In this paper we propose a (semi) automatic process aimed at identifying static Web pages that can be transformed into dynamic ones, with the variable parts (e.g., product descriptions in product cards) replaced by dynamically generated ones. The proposed approach exploits clustering to recognize a common structure of Web pages. This is particularly useful when the physical organization of the pages does not hint any natural way to identify them. The result of the clustering phase provides the candidate template, to be used in the dynamic generation of the pages. A comparison between original pages and template provides the records to be inserted into the database. Then, a script generates the migrated pages dynamically, from the template and the database. Manual intervention is limited to the refinement of the constructed template and database.

Abstract Web sites of the first generation consist typically of a set of purely static Web pages. Content and presentation are not separated, and a same page structure is replicated every time a similar organization of the information is devised. Such a practice poses several problems to the evolution of these sites. It is not easy to update the content, and each time the HTML structure is modified, the same changes have to be propagated to all replications. In this paper, an approach is proposed for the identification of the Web pages that are more amenable to be migrated into a dynamic version, in that they share a similar structure, filled in with a content organized according to a common scheme. Clustering is used for this purpose: a common template is extracted from the pages in the same cluster, and the variable information of the pages matching the template is migrated to a database. A server side program extracts the requested information from the data base, and generates dynamically the HTML pages to be displayed in the browser.

1 Introduction In the last few years, Web sites have become important assets for several companies, offering new possibilities to advertise products, and to provide information and e-commerce services. The study of their production, evolution, and quality control has correspondingly become a separate thread in software engineering research, given their peculiarities over traditional software. The first generation of Web sites consisted typically of a collection of static Web pages, containing advertising material, information about the company, and product-cards. Such Web sites have been progressively replaced by more dynamic sites, with functionalities devoted to accessing data interactively, performing transactions on-line, and exchanging information. Among the others, one of the advantages of dynamic Web sites is that content (data) and presenta1

Several works on Web application understanding and reengineering have been presented in the literature [2, 4, 6, 7, 11, 12]. The work most related to ours is [4]. In this work, the authors describe a system to extract some items, such as the textual content or the images, from the Web pages of an existing site and to store them into a database. The original Web pages are recreated by retrieveing the migrated information from the database. Replications of a same information are detected and removed. The main difference with our work is that the authors of [4] do not attempt to recognize a repeated structure in a subset of all available pages, and all text fragments are treated the same way: they become a record, stored in a flat database table, with (conceptually) only one column (record field), containing the textual content. On the contrary, we are able to treat different pieces of text in different ways. Some text portions are not moved to the database at all, since they are recognized as a part of the common template (for example, field labels in different product cards). Other text portions are recognized as being values of specific fields, which become entries of proper table columns. Instead of generating a record for each textual content, with only one content field, we generate a record for each migrated page, with record fields associated with the various fields that have been identified (for example, a record for a product card will have a field for the product name, a field for the product price, etc.). Only the pages that match the identified structure are migrated to the database (in [4] all pages are migrated). In a previous work [11], we have investigated the application of rewrite rules to Web applications, with the aim of restructuring them. In another previous work [12], we considered the problem of retrieving traceability links between multilingual portions of a Web site. The main objective was restructuring to ensure multilingual consistency and to simplify future evolution of the site. Clustering has several applications in software analysis and reverse engineering [8], and has been recently applied to Web applications [6] with the aim of supporting program understanding. Cloned Web pages are identified in [7] by exploiting an edit distance measure similar to ours, but used for different purposes. The paper is organized as follows: after giving some general notions on clustering, the next section explains how clustering has been used to recognize Web pages with similar structure. In particular, this section presents the clustering algorithm adopted and the similarity measure used. Section 3 describes the restructuring process. The process is first presented in the simpler case of a mono-lingual Web site, and then it is extended to the general case of a multilingual site. Experimental results are provided in Section 4, where the feasibility and usefulness of the proposed process are discussed. The last section is devoted to conclusions and future work.

2 Clustering Clustering is a general technique aimed at gathering the entities that compose a system into cohesive groups (clusters). Given a system consisting of entities which are characterized by a vector of properties and are connected by mutual relationships, there are two main approaches to clustering [1]: the sibling link and the direct link approach. In the sibling link approach, entities are grouped together when they possess similar properties, while in the direct link approach they are grouped together when the mutual relationships form a highly interconnected sub-graph. We investigate the usage of clustering with the purpose of recognizing cohesive groups of static Web pages that can be migrated into a database. In fact, if a set of Web pages comprises pages that share a common structure (template) and differ only for the specific values of the information displayed, it is possible to generate them dynamically through a server side script that fills-in a fixed template with the records extracted from a database. We expect that the static Web pages to be restructured into a database be similar (but not necessarily identical) from a structural point of view. The typical example is a set of product-cards, where only product-specific information varies. Clustering can be applied to recognize the common elements in Web pages that are potential candidates for restrcturing. Since such pages share a common structure, a natural choice of clustering technique is the sibling link approach, based on a structural measure of similarity between Web pages. When applying clustering in the sibling link approach, two fundamental questions need be answered [1]: When are two entities similar? What algorithm determines the clusters from the similarity measures? Given the nature of our problem, the structural edit distance [3] among the Web pages is a good choice to measure the dissimilarity of pages (its complement gives the required similarity measure). For example, product-cards have typically a low structural edit distance (high similarity). In the literature there exist several different clustering algorithms [13], with different properties. Hierarchical algorithms do not produce a single partition of the system. Their output is rather a tree, with the root consisting of one cluster enclosing all entities, and the leaves consisting of singleton clusters. At each intermediate level, a partition of the system is available, with the number of clusters increasing while moving downward in the tree. Divisive algorithms start from the whole system at the tree root, and then divide it into smaller clusters, attached as tree children. Alternatively, agglomerative algorithms start from singleton 2

      !#"$%&(')*  (')  + ,-( .+  ,/ 1023%4#65(7897 :; ;1*"!=):(?  !  !#"$%&')(*%+(@-7=!AB21!7%#B!7!A%2-C# !#"$%&')(!123(ED7% $ = 1@.-7=AF 17!% *0   !  ! 2 $8 "*'(92 $8 "*:(

Suggest Documents