Towards Reengineering Web Sites to Web-services Providers

Yingtao Jiang
Department of Computing Science, University of Alberta
Edmonton, T6H 2E8, AB, Canada
+1 780 492 3118, [email protected]

Eleni Stroulia
Department of Computing Science, University of Alberta
Edmonton, T6H 2E8, AB, Canada
+1 780 492 3520, [email protected]

Abstract

The web-services stack of standards is aimed at facilitating the development of web applications by integrating software components developed across organizational boundaries. This flexible integration relies on the specification of the components' services in terms of an open, XML-based standard, WSDL. A critical step in the process of reusing existing, WSDL-specified components is the availability of a multitude of such components. Automated reengineering methods for constructing web services out of functionalities already offered by existing web sites can therefore play an important role in facilitating the adoption of these standards. In this paper, we describe our work on reverse-engineering the interaction between web-site servers and client browsers into XML specifications, syntactically and semantically close to WSDL.

1. Motivation and Introduction

The World Wide Web is rapidly being adopted as the medium of collaboration among organizations. The web-services stack of standards is aimed at facilitating the development of web applications by integrating software components developed across organizational boundaries. It consists of a set of related specifications that define how reusable components should be specified (through the Web Service Description Language – WSDL [15]), how they should be advertised so that they can be discovered and reused (through the Universal Description, Discovery, and Integration API – UDDI [14]), and how they should be invoked at run time (through the Simple Object Access Protocol API – SOAP [8]). A critical step in the process of reusing WSDL-specified components is the availability of such components. To date, even though there are a number of tools for supporting web-service development, there are few publicly accessible services in the public UDDI registries. Some of the reasons for this sparseness may be the steep learning curve of the above-mentioned tools, or the as-yet unclear business

case for web-service deployment, or the relative lack of experience in establishing contractual agreements on the basis of such public services. All these reasons make the cost of developing the services high, compared to the potential return of their post-deployment usage, which is difficult to predict. This situation necessitates the development of automated re-engineering methods for constructing web services out of existing functionalities already offered through the web sites of organizations today. There have already been a number of projects aimed at this general objective [13, 17, 5, 16, 1]. All of them, however, adopt a code-migration approach. Given that distributed object platforms, such as EJB, are easy to migrate to web services using the currently available web-service building toolkits, these projects have set as their goal to develop methods for migrating legacy web-site implementations to such object-oriented frameworks, so that they can subsequently be further migrated to a web-services platform. In our work, we have adopted an alternative methodology. Instead of re-engineering the web-site code base, we have chosen to reverse engineer the "presentation layer" of the web application, in order to extract from its behavior the set of functionalities it currently delivers. The extracted functionalities can then be specified in terms of WSDL web-service specifications, and they can be deployed through proxies accessing the original web server and parsing its responses. Intuitively, we are exploring a user-interface migration approach, wrapping browser-accessible web-application interfaces with programmatically accessible specifications. This approach is an instance of our general interaction-reengineering methodology and is similar to the legacy-interface migration method we originally developed in the context of the CelLEST project (http://www.cs.ualberta.ca/~stroulia/CELLEST) [11, 12]. The intuition underlying our approach is simple: web sites already deliver "services" through pairs of browser-issued HTTP requests and corresponding HTML responses by the server; these pair-wise


interactions naturally correspond to the input and output messages of a web-service operation, as it would be described in this service's WSDL specification. Therefore, the objective of our method is to automatically discover such patterns of HTTP-request and HTML-response pairs by examining a multitude of examples of the interaction between client browsers and web servers. These patterns could potentially be web services. To decide which of them are indeed candidates, we have developed a set of heuristics to filter out spurious patterns, and a visualization environment so that a domain expert can better perceive and assess them. The last step of the process involves the semi-automatic editing of the discovered pattern elements into service data types, messages and operations.

The rest of this paper is organized as follows. Section 2 describes the overall service reverse-engineering process and the architecture of the "web-service discovery and reengineering" system we have developed to implement it. Section 3 presents a case study we have conducted to evaluate our service-discovery system and the method it implements. Section 4 places our work in context with other related research efforts. Finally, Section 5 concludes with a summary of our experience to date and our plans for future research.

2. The Web-Service Discovery and Reengineering System

Figure 1: The service discovery process.

Our web-service discovery and reengineering system consists of five components – shown as shadowed boxes in Figure 1 – each one responsible for one step of the overall process. The data consumed and produced by these components are shown as gray callouts. The document-collection component is responsible for exercising the web site and collecting a set of examples of its behavior, i.e., a set of pairs of HTTP requests/HTML responses. The responses are subsequently processed by the translator and transformed into a standardized format, conducive to pattern mining. The transformed responses are then forwarded to the pattern-mining component, which employs a set of algorithms to extract a set of patterns that are frequent in the collected document set. The patterns are then examined by a user of the visualization component to identify which ones actually correspond to useful services. Finally, the user can use the service-interface editor to translate the selected patterns into corresponding WSDL specifications.

2.1 The document-collection component

This component is responsible for the first step of the process, namely the collection of the web-site behavior examples from which the potential services will later be mined. It is essentially a web-site testing component: it automatically generates a sequence of similar HTTP requests to the web site and stores the resulting pairs of HTTP requests and corresponding HTML responses in a local repository. This component is configured through three XML files, described below.


2.1.1 Master configuration file

An example master configuration file, mainConfig.xml, is shown below. It consists of one or more site elements identifying the web sites to be reengineered. There are four sub-elements of a site element. The siteName element gives a unique, descriptive name to the web site in question. The requestProtocolLoc element specifies the second page-collection configuration file, i.e., a file in which the syntax of the HTTP request protocol for this web site is described (discussed in detail in subsection 2.1.2). The outputLoc element specifies the directory in the local file system where the collected HTML response documents should be stored. The inputSet element specifies the third page-collection configuration file, i.e., the file that describes the "test data" based on which the requests to the web site will be formulated (discussed in detail in subsection 2.1.3).

mainConfig.xml:

  <site>
    <siteName>YahooStockQuotes</siteName>
    <requestProtocolLoc>../../config/protocols/YahooStockQuotes.xml</requestProtocolLoc>
    <outputLoc>../../output/YahooStockQuotes/</outputLoc>
    <inputSet>../../input/YahooStockQuotes/input.xml</inputSet>
  </site>

2.1.2 Request-protocol configuration file

This file describes the HTTP request protocol by which client browsers access the web site. An example is shown in the reqProtocol.xml file below. Its syntax is quite simple. The document root element is request, and it consists of a method sub-element with two attributes: "type", which specifies the type of the HTTP request (in this example, GET), and "url", which specifies the location of the web site. The method element consists of a form element, which, in turn, consists of a series of parameter sub-elements. Each parameter element corresponds to an input variable of the request: the "name" attribute indicates the variable name of this input parameter in the HTTP request, and the "value" attribute indicates either a variable name in the input data set or a literal value, depending on the third attribute, "input". If "input" equals "yes", the actual input value will come from the input data set, and the corresponding variable name in the input data set is given by the value of the "value" attribute. If the value of the "input" attribute is "no", then the value of the "value" attribute is deemed literal. By default, the value of "input" is "no".

reqProtocol.xml:

  <request>
    <method type="GET" url="http://finance.yahoo.com/q">
      <form>
        <parameter name="s" value="symbol" input="yes"/>
        <parameter name="d" value="v1" input="no"/>
      </form>
    </method>
  </request>

For example, the reqProtocol.xml file above specifies the syntax of the HTTP request to the Yahoo web site to obtain the current value of a stock symbol. This is a GET request with two input parameters, named "s" and "d" respectively. "s" will take a different value from the "symbol" parameter of the input data set with each HTTP request, while "d" will use the literal value "v1" every time.

2.1.3 Input data set

The third configuration file of the document-collection process contains the actual data to be used in "testing" the web site, i.e., in formulating requests to it according to the request-protocol specification. In this example, the data set is shown in the file inputData.xml, which consists of a sequence of symbol elements:

inputData.xml:

  <symbol>MSFT</symbol>
  <symbol>AMD</symbol>
  <symbol>IBM</symbol>
  <symbol>KRK</symbol>
  <symbol>KVM</symbol>
  <symbol>ORCL</symbol>
  <symbol>OCLV</symbol>
  <symbol>HP</symbol>

Based on the information provided by the request protocol and the input data set, the page-collector component can generate a series of HTTP requests to the target web site and store the response HTML pages in the directory specified in the mainConfig.xml file. In our example, the values of the symbol elements in the inputData.xml file will be used as values for


the "s" parameter of the request to the Yahoo stock-quote web site (GET http://finance.yahoo.com/q), and the resulting HTML responses will be stored in the directory ../../output/YahooStockQuotes/.
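For illustration, a minimal page-collection loop along these lines could look as follows; this is our own sketch (the function and parameter names, and the use of Python's urllib, are assumptions), not the actual component.

import pathlib
import urllib.parse
import urllib.request

def collect(url, fixed_params, input_param, values, out_dir):
    """Issue one GET request per input value and store each HTML response."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for value in values:
        params = dict(fixed_params)
        params[input_param] = value
        request_url = url + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(request_url) as response:
            html = response.read()
        # store the request/response pair for later translation and mining
        (out / (value + ".html")).write_bytes(html)

# e.g. collect("http://finance.yahoo.com/q", {"d": "v1"}, "s",
#              ["MSFT", "AMD", "IBM"], "../../output/YahooStockQuotes/")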

2.2 The translator component

The translator component first uses the JTidy [3] package to "clean" the collected HTML documents. Then, it translates the clean documents into the so-called "DM format", which is usable by the pattern-mining component invoked next. In the DM format, part of the clean HTML document content is removed and the remaining elements are consistently translated into numbers. The HTML-to-DM translation process is governed by two input configuration parameters. The first one is the set of "interesting delimiters", identifying the HTML elements that will be retained. Intuitively, this file makes explicit some tacit knowledge about the domain of HTML-document design: usually, designers highlight interesting output information within certain HTML tags, or with a distinct font attribute, such as color. Other HTML elements, such as images and unstructured paragraphs, usually contain peripheral information, not directly related to the output expected by the user issuing the request. The "interesting delimiters" set is web-site independent. The second configuration parameter is the "landmark set". A landmark is defined as a word or phrase frequently used in a specific application domain. For example, in the stock-quote domain, the following phrases are usually found: "last trade", "market", "bid", "open", etc. Intuitively, these landmark phrases are expected to be used as labels in close proximity to the output information of the web-site HTML responses. Delimiter HTML tags and landmark phrases are the only parts of the original response content retained in the DM format. Furthermore, they are consistently translated into a numerical alphabet, since most pattern-mining algorithms assume such numerical sequences as input.
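To make the translation concrete, the following is a minimal sketch of the kind of processing the translator performs; it is our own illustration, not the actual implementation, and the function and parameter names are assumed.

import re

def to_dm(clean_html, interesting_tags, landmarks, alphabet):
    """Translate one cleaned HTML response into a DM-format numeric sequence:
    only tags in `interesting_tags` and phrases in `landmarks` are kept,
    and each retained token is consistently mapped to a number."""
    tokens = []
    for piece in re.split(r"(<[^>]+>)", clean_html):
        if piece.startswith("<"):
            parts = re.sub(r"[</>]", " ", piece).split()
            if parts and parts[0].lower() in interesting_tags:
                tokens.append(parts[0].lower())
        else:
            text = piece.lower()
            # keep landmark phrases found in the surrounding text
            tokens.extend(lm for lm in landmarks if lm in text)
    # the same token always maps to the same number across all pages of a site
    for token in tokens:
        alphabet.setdefault(token, len(alphabet) + 1)
    return [alphabet[token] for token in tokens]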

2.3 The pattern-miner component

The pattern-miner component consists of two sub-components, each one implementing a different sequential pattern-mining algorithm. The first one implements the Sequitur algorithm [4]. This algorithm compresses a string into a context-free grammar (without recursion) by inferring the grammar from the string. If there is structure and repetition in the input string, then the grammar may be very small compared to the original string, and the composition rules of the grammar essentially capture the frequently repeated

subsequences in the original string. Sequitur relies on two intuitive rules: first, no pair of adjacent symbols (digram) should appear more than once in the grammar (instead, it should be substituted with a composition rule); and second, every production rule should be used more than once (there should be no non-repeated rules). For example, Sequitur reduces the string abcabc to the grammar S → AA, with A → abc.

The second pattern-mining sub-component implements the IPM algorithm [2]. IPM is a sequential pattern-mining algorithm designed to discover patterns with insertion errors, i.e., patterns whose instances may not be exact replicates of the pattern itself but may contain a certain number – below a configurable threshold – of extraneous alphabet characters. This feature makes it especially suitable in situations where the input sequences may be noisy and a certain degree of flexibility is desired when inferring a pattern.

The sequential patterns produced by the two pattern-mining sub-components are filtered through a heuristic process to reduce the number of discovered patterns. The details of these heuristics are discussed in the case-study part of this paper. The ultimate output of the pattern-miner component is a set of "good" patterns, which together cover the parts of the web-site response documents that contain the information of interest to the user of the web site. Each pattern corresponds to a frequently occurring sequence of HTML tags and domain-specific landmarks, which is hypothesized to be a consistently structured part of the HTML response containing some of the desired information output of the request. Each pattern is also associated with a set of locations indicating where in the collected HTML response documents the pattern appears.
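To illustrate the occurrence criterion that insertion-tolerant mining relies on, the sketch below checks whether a pattern occurs in a DM-format sequence with at most a given number of extraneous symbols, and computes its support over a page collection. It is a simplified illustration of the matching notion, not the IPM algorithm itself; all names are ours.

def occurs_with_insertions(pattern, sequence, max_insertions):
    """True if `pattern` occurs in `sequence` in order, with at most
    `max_insertions` extraneous symbols interleaved between its elements."""
    for start, symbol in enumerate(sequence):
        if symbol != pattern[0]:
            continue
        matched, insertions = 1, 0
        for current in sequence[start + 1:]:
            if matched == len(pattern):
                break
            if current == pattern[matched]:
                matched += 1
            else:
                insertions += 1
                if insertions > max_insertions:
                    break
        if matched == len(pattern):
            return True
    return False

def support_rate(pattern, pages, max_insertions=2):
    """Fraction of DM-format page sequences in which the pattern occurs."""
    hits = sum(occurs_with_insertions(pattern, page, max_insertions) for page in pages)
    return hits / len(pages)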

2.4 The Pattern-Visualization Component

Usually, even after the heuristic pattern filtering, a substantial number of patterns remain, and most of them are still spurious. The pattern-visualization component is intended to highlight these patterns in the context of the HTML response documents in which they occur, so that a user can easily perceive which ones are actually useful, i.e., which ones contain the output information expected from the HTTP request. The user interface of the pattern-visualization component, shown in Figure 2, contains three frames. The frame on the left lists the discovered patterns. For each pattern (rule), the algorithm that discovered it is shown (all the rules shown in the figure have been discovered by the Sequitur algorithm), as well as its support rate, i.e., the number of its occurrences vs. the total number of examined pages (all the rules shown in the figure have 100% support, since they all have 28 occurrences in 28 collected pages).


Figure 2: The user interface of the pattern-visualization component.

The frame on the right is used to display a page returned by the web site, with the areas covered by the selected pattern highlighted. For example, in the figure we see that the selected Rule 145 represents a pattern whose occurrence in the displayed response page covers part of the tabular structure containing information about the last trade time, closing and opening prices, etc. The frame at the bottom contains a set of controls for selecting web pages with different parameters. Using the "Prev" and "Next" links, the user can see the occurrences of the same rule on other pages of the collection. In this manner, the user can perceive whether the rule covers consistent parts of the HTML responses with information of interest to the user. If this is the case, the pattern is useful for extracting (some of) the data expected as part of the return message of the potential web service. Finally, the submit button, in the middle of this frame, is used to submit the selected pattern and web page to the service-description editor component.

2.5 The Service-Description Editor Component

This component receives as input the patterns submitted through the pattern-visualization component and the corresponding HTML response documents in which they appear. It automatically calculates the locations of the patterns' instances in

these pages as XPath expressions (while at the same time generating a relative XPath within the pattern for each piece of data selected by the user) and opens an editor window for the user to edit each data element appearing in the highlighted pattern instance. A snapshot of the prototype implementation of the user interface of the service-description editor component is shown in Figure 3. To the left of the frame, a pop-up window displays each piece of highlighted information (i.e., content that is neither a landmark nor one of the chosen HTML tags) contained in the pattern instance, together with its XPath location in the selected page. The user can assign a name and a data type to each piece of information. For example, in the figure the information "17.02" is shown; the user has named it "LastTradePrice" and is about to select its type from the drop-down menu (currently showing the default data type, "integer"). In this manner, the user specifies the set of data types that the reengineered web service will deliver as parameters of the output messages of its operations. Based on the data types defined, this component also provides support for specifying the messages, operations and port types of the web service. These specifications, together with the information about the web site's URL, request protocol and input parameters – originally contained in the configuration files – constitute a WSDL description of the reengineered web service.
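As an illustration of the kind of specification the editor is meant to produce, a simplified, hand-written WSDL fragment for the stock-quote example might look as follows; the concrete names (StockQuote, GetQuote, the float type, and so on) and the exact structure are our assumptions, not generated output.

  <definitions name="StockQuote"
               targetNamespace="http://example.org/stockquote"
               xmlns="http://schemas.xmlsoap.org/wsdl/"
               xmlns:tns="http://example.org/stockquote"
               xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <message name="GetQuoteInput">
      <part name="symbol" type="xsd:string"/>        <!-- the "s" request parameter -->
    </message>
    <message name="GetQuoteOutput">
      <part name="LastTradePrice" type="xsd:float"/> <!-- data element named in the editor -->
    </message>
    <portType name="StockQuotePortType">
      <operation name="GetQuote">
        <input message="tns:GetQuoteInput"/>
        <output message="tns:GetQuoteOutput"/>
      </operation>
    </portType>
  </definitions>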


Figure 3: The user interface of the service-description editor.

This specification is implemented by a run-time component (currently under development) capable of invoking the original web site when it receives the input message – corresponding to the original HTTP request – and producing the output message by parsing the web site's HTML response. Given this WSDL specification, a remote client application can correctly access the web site through this run-time component and receive the desired stock-quote information. At this stage, the implementation of this last component is not yet complete. However, we have implemented components with similar functionality in the past, in the context of the TaMeX project (http://www.cs.ualberta.ca/~stroulia/TAMEX), and the completion of the implementation is technically straightforward.
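In outline, such a run-time proxy would replay the original HTTP request for each incoming input message and evaluate the stored XPath expressions against the parsed response. The sketch below illustrates the idea; the use of lxml, the function names, and the XPath value in the example are our assumptions, not the authors' implementation.

import urllib.parse
import urllib.request
from lxml import html  # assumption: any HTML parser with XPath support would do

def invoke(url, params, output_xpaths):
    """Replay the original HTTP request for an input message and build the
    output message by evaluating stored XPath expressions on the response."""
    request_url = url + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(request_url) as response:
        page = html.fromstring(response.read())
    output = {}
    for name, xpath in output_xpaths.items():
        nodes = page.xpath(xpath)
        output[name] = nodes[0].text_content().strip() if nodes else None
    return output

# e.g. invoke("http://finance.yahoo.com/q", {"s": "MSFT", "d": "v1"},
#             {"LastTradePrice": "//table[1]//td[2]"})  # illustrative XPath only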

3. The case study

In order to evaluate our web-service discovery and reengineering method and our prototype system, we have undertaken a series of case studies. In this paper, we report on the first of them, whose objective was to reverse engineer web sites dedicated to providing stock-quote information. We selected three commonly used online stock-quote web sites, namely Yahoo (http://finance.yahoo.com/), Pcquote (http://www.pcquote.com/stock/) and Lycos (http://finance.lycos.com/qc/default.aspx), and evaluated our prototype service-discovery system on them.

In order to evaluate the efficiency of the two algorithms and their performance with different numbers of collected pages and different desired support rates (support rate is defined as the ratio between the number of occurrences of a pattern and the number of sample pages), we chose to collect 5, 15 and 20 pages from each web site and to mine patterns with minimum support rates of 60%, 80% and 100%. For example, for the Lycos web site, we collected three sets of pages (5, 15, and 20) and ran both the Sequitur and the IPM algorithms on those pages. For the sake of clarity and compactness, in this paper we report only on some of the data, mostly the data we collected from our experimentation with the Lycos web site. The data we collected from the other two web sites exhibit the same kind of characteristics as the Lycos data we discuss in this section. For the Sequitur algorithm, the only input parameter required is the minimum number of occurrences of a pattern, and the algorithm generates patterns of arbitrary length with at least this number of occurrences. The IPM algorithm, on the other hand, requires as input the desired pattern-length range and the minimum number of occurrences, and returns as output all the maximal patterns that meet the occurrence and length constraints. Table 1 shows the number of patterns mined by the IPM algorithm with different input parameters


and sample page numbers. In the table, row labels indicate the desired pattern-length range: for example, the first row of the table reports on discovered patterns with minimum length 5 and maximum length 29. Column labels indicate the sample size and the minimum number of occurrences: for example, the first column of the table reports on patterns discovered in the 5-page collection with at least 3 occurrences. The same convention is used in the other tables and figures in this section. The number

of patterns mined by the Sequitur algorithm is shown in Table 2. In each cell of both tables, the first number is the total number of patterns generated by the algorithm, and the number inside the parentheses is the number of interesting patterns; we explain what an "interesting" pattern is later in this section. The corresponding execution times of IPM are shown in Figure 4. The execution times of the Sequitur algorithm, as compared to the average execution time of IPM, are reported in Figure 5.

Table 1: Raw/interesting number of patterns discovered by the IPM algorithm for the Lycos web site (columns: minimum occurrences / sample pages; rows: pattern-length range).

Pattern length   3/5        4/5        5/5        9/15       12/15      15/15      12/20      16/20      20/20
5~29             593(197)   499(167)   444(167)   593(197)   566(197)   444(167)   595(197)   568(197)   446(167)
30~49            577(237)   423(174)   340(174)   577(237)   530(237)   340(174)   578(237)   531(237)   341(174)
50~69            590(277)   363(174)   260(174)   590(277)   523(277)   260(174)   591(277)   524(277)   261(174)
70~89            590(311)   323(168)   215(168)   590(311)   503(311)   215(168)   590(311)   503(311)   215(168)
90~109           589(331)   282(148)   174(148)   589(331)   482(331)   174(148)   590(331)   468(331)   174(148)
110~129          589(351)   242(128)   134(128)   570(351)   462(351)   134(128)   589(351)   427(351)   134(128)

Table 2: Raw/interesting number of patterns discovered by the Sequitur algorithm (columns: minimum occurrences / sample pages).

Web site   3/5       4/5      5/5      9/15     12/15    15/15    12/20    16/20    20/20
Lycos      58(3)     36(0)    21(0)    4(0)     2(0)     2(0)     3(0)     2(0)     2(0)
Pcquote    159(8)    97(0)    47(0)    29(0)    16(0)    10(0)    16(0)    9(0)     5(0)
Yahoo      114(11)   67(3)    47(2)    10(0)    7(0)     3(0)     9(0)     5(0)     2(0)

Table 3: An example output of running the pattern-coverage heuristic (patterns of length 110~129).

Occurrences/pages   3/5   4/5   5/5   9/15   12/15   15/15   12/20   16/20   20/20
Patterns left       74    5     5     74     74      5       74      74      5

Figure 4: Execution time of the IPM algorithm on the Lycos web site (execution time in ms for each occurrences/pages setting, with one series per pattern-length range).



Figure 5: Comparison between Sequitur and the average execution time of IPM on the Lycos web site (execution time in ms for each occurrences/pages setting).

From these figures and tables we can see that Sequitur is very fast compared to IPM but, at the same time, it generates a much smaller number of patterns. Most of the patterns generated by Sequitur are short, with relatively low support rates. This is because Sequitur requires that all the pattern occurrences be identical; since HTML documents frequently contain "noise", such as unbalanced and extraneous tags, exact replications tend to be short and few. Moreover, Sequitur discovers a substantially smaller number of interesting patterns. For the IPM algorithm, Figure 4 shows that, given the same sample pages, the higher the required support, the faster the algorithm obtains the result. This is no surprise: with stricter support-rate requirements, more candidate patterns are pruned at the end of each iteration of the algorithm. We can also see from this figure another interesting characteristic of IPM: the execution time does not increase with the number of sample pages. In fact, the performance of the algorithm depends only on the number of patterns that actually exist in the sample pages and the number of different tokens that appear in them (a token here refers to everything left in the DM-format file generated by the translator component; a token can be either an HTML tag or a landmark).

Up to now, all the patterns we have discussed are raw patterns, i.e., patterns discovered by the mining algorithms without any further filtering. As we already discussed in Section 2.3, many of them are spurious, usually the result of accidental co-occurrences of HTML tags; therefore, a filtering process is necessary to eliminate as many

non-interesting and redundant patterns as possible, in order to reduce the user's effort in the subsequent phase of pattern validation through the visualization component of the system. Before discussing the filtering heuristics, we first need to define what makes a pattern "interesting". Intuitively, an interesting pattern contains part or all of the data of interest. For example, for the stock-quote web sites, one would be interested in data such as the "open" or "close" price of a certain stock, its "change" or "52-week range", etc. As discussed in Section 2.2, landmarks are words expected to be very close to the actually interesting data; we can therefore define the "interestingness" of a pattern as the number of landmarks it contains. If a pattern contains landmarks, it will most likely also contain the data provided by the web application that is of interest to the application users. Based on this concept of "interestingness" we developed the first and simplest heuristic, which removes only the patterns that do not contain any landmarks at all. In the cells of Table 1 and Table 2, the number inside the parentheses indicates the number of patterns left after removing non-interesting patterns. We can see that, even after removing the non-interesting patterns, a lot of IPM patterns remain. To further prune the patterns, we rely on the concept of "pattern coverage rate". The coverage rate of a pattern is defined as the ratio between the number of landmarks it contains and the total number of landmarks specified for this web site. The higher the coverage rate of a pattern, the more likely it is to contain more of the interesting data. So, to prune the originally discovered patterns, the user may specify a desired coverage rate, and the heuristic prunes all


patterns with a lower coverage rate and returns those with a higher coverage rate. For example, Table 3 shows the number of patterns left after requiring a pattern-coverage rate of 70% on the patterns reported in the last row of Table 1.

At first, one might assume that the higher the coverage rate, the better the pattern. But that is only true for each single pattern, not necessarily for a group of patterns. Remember, the ultimate objective of the process is to find a group of patterns (preferably the smallest such group) which together cover as much of the interesting data as possible. Simply setting the coverage rate high may result in better individual patterns, but may cause the whole group of patterns to cover fewer landmarks in total, i.e., the group coverage rate drops. For example, using the visualization component to validate the discovered patterns reported in Table 3, we found that the number of different landmarks covered by this group of patterns is actually smaller than that covered by a group of patterns with a lower individual coverage rate. It is our experience that, when the interesting data is dispersed over the pages in various small structures and there are irregularities interspersed among the interesting data, the coverage heuristic works well for short and medium-sized patterns but backfires for longer patterns. If all the interesting data are grouped in the same structure in the web page (e.g., in the same table), this heuristic works well. Currently, we use this heuristic on a trial-and-error basis. If the user finds that there are too many patterns to examine, they may set a coverage threshold, filter the patterns, and then examine whether the remaining patterns provide satisfactory coverage. If not satisfied with the pattern-group coverage, the user can try a lower coverage rate and examine the returned patterns again. To reduce the length of this trial-and-error process, we are currently working on new heuristics, which aim at returning the smallest group of patterns that provides the largest group coverage rate.

Table 2 shows the numbers of raw vs. interesting patterns mined by the Sequitur algorithm. We can see that this algorithm finds interesting patterns only for small sample sizes and low support rates. Compared to the IPM algorithm, its performance in finding interesting patterns is quite poor. However, we chose to investigate Sequitur as a "standard", widely used pattern-mining algorithm. At this point, it seems that IPM completely "dominates" Sequitur, but further experimentation is needed to conclusively establish whether Sequitur may contribute interesting patterns under different circumstances of HTML document structure.
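For concreteness, the coverage-rate heuristic described above can be sketched as follows; this is our own illustration, and the representation of a pattern as a dictionary with a "landmarks" entry is an assumption.

def coverage_rate(pattern_landmarks, site_landmarks):
    """Ratio between the distinct site landmarks contained in a pattern
    and the total number of landmarks specified for the web site."""
    return len(set(pattern_landmarks) & set(site_landmarks)) / len(site_landmarks)

def prune_by_coverage(patterns, site_landmarks, threshold):
    """Keep only the patterns whose coverage rate meets the user-supplied threshold."""
    return [p for p in patterns
            if coverage_rate(p["landmarks"], site_landmarks) >= threshold]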

4. Related research

The work discussed in this paper is, in fact, an extension of an ongoing project of our group, TaMeX. In the context of the TaMeX project, we have already developed an example-based learning method for extracting XML-based web-site wrapper specifications from specially annotated HTML response documents collected from a web-site server [9, 10]. (In the original TaMeX wrapper-construction process, the contents of interest had to be either manually highlighted on the example HTML document, or, alternatively, all the contents of interest could be provided as input and an automatic highlighting process would identify their locations in the example page.) This earlier reverse-engineering approach was labor-intensive, since it assumed that a user has to highlight the interesting parts of the HTML document that should be returned as output of the server functionality. In this extension, the role of the pattern-mining component is to discover these elements, thus eliminating this original requirement.

A substantial body of research has been done in the field of reverse engineering of web applications [1, 5, 6, 7, 13, 16, 17]. Most of these efforts aim at reengineering the whole web-site code implementing the application server and the user interface. They use a variety of different techniques and tools to extract a model of the web application in order to improve its architecture for better maintainability and easier migration to the web-services platform. Our work instead aims at developing a technique for wrapping existing web applications with WSDL descriptions, so that organizations with a presence on the "browser-accessible web" can easily enable programmatic access to their functionalities by other applications. In this sense, the work of Ricca and Tonella is quite similar; in [5] they propose a solution for migrating static web pages into dynamic web pages by using a clustering technique to extract a common template from the pages in the same cluster and to save the variable parts into a database. We also share a similar goal – attempting to find the common structures across web-page behavior – with the clone-detection work in [13], where island parsing is used to identify static clones in dynamic web pages.

5. Conclusions and Future Work

Our experience with the prototype service-discovery and reengineering system to date is by no means mature. We have collected some initial evidence on how such a tool might be deployed, and our case studies suggest that the approach is indeed promising. In the case study we discussed in this paper, we were indeed able to discover interesting



patterns corresponding to the web site's implicit stock-quote service. Once the last step of WSDL specification is complete, the newly specified services will easily be accessible through specialized proxies able to interpret the pattern and relative XPath locations in order to parse the current web-server responses.

Acknowledgements

The authors wish to thank Sze-Lai Mok and Edward Zadrozny for their contribution to the development of parts of the system. This research was supported by an IRIS grant.

References

1. G. Antoniol, G. Canfora, G. Casazza, A. De Lucia: Web Site Reengineering Using RMM. Proc. International Workshop on Web Site Evolution, 2000, pp. 9-16.
2. M. El-Ramly, E. Stroulia, P. Sorenson: Interaction-Pattern Mining: Extracting Usage Scenarios from Run-time Behavior Traces. Proc. Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002, Edmonton, Alberta, Canada.
3. JTidy, http://sourceforge.net/projects/jtidy/
4. C. G. Nevill-Manning, I. H. Witten: Identifying Hierarchical Structure in Sequences: A Linear-time Algorithm. Journal of Artificial Intelligence Research, 7:67-82, 1997.
5. F. Ricca, P. Tonella: Using Clustering to Support the Migration from Static to Dynamic Web Pages. Proc. 11th International Workshop on Program Comprehension, pp. 207-216, Portland, Oregon, USA, May 2003.
6. F. Ricca, P. Tonella, I. D. Baxter: Restructuring Web Applications via Transformation Rules. Proc. SCAM 2001, Workshop on Source Code Analysis and Manipulation, pp. 150-160, Firenze, Italy, 5-9 November 2001.
7. F. Ricca, P. Tonella: Understanding and Restructuring Web Sites with ReWeb. IEEE MultiMedia, Vol. 8, No. 2, pp. 40-51, April-June 2001.
8. Simple Object Access Protocol (SOAP), http://www.w3.org/TR/2003/REC-soap12-part0-20030624/
9. Q. Situ, E. Stroulia: Task-structure Based Mediation: The Travel-Planning Assistant Example. Proc. Thirteenth Canadian Conference on Artificial Intelligence (AI'2000), 14-17 May 2000, Montreal, Quebec, Canada.
10. E. Stroulia, J. Thomson, Q. Situ: Constructing XML-speaking Wrappers for Web Applications: Towards an Interoperating Web. Proc. 7th Working Conference on Reverse Engineering (WCRE'2000), 23-25 November 2000, Brisbane, Queensland, Australia, IEEE Computer Society.
11. E. Stroulia, M. El-Ramly, P. Sorenson: From Legacy to Web through Interaction Modeling. Proc. International Conference on Software Maintenance, October 3-6, 2002, Montreal, Canada, pp. 320-329, IEEE Press.
12. E. Stroulia, M. El-Ramly, P. Iglinski, P. Sorenson: User Interface Reverse Engineering in Support of Interface Migration to the Web. Automated Software Engineering Journal, 10(3):271-301, 2003, Kluwer Academic Publishers.
13. M. Synytskyy, J. R. Cordy, T. R. Dean: Resolution of Static Clones in Dynamic Web Pages. Proc. 5th International Workshop on Web Site Evolution, pp. 49-58, Amsterdam, The Netherlands, September 2003.
14. UDDI Technical White Paper, http://www.uddi.org/pubs/Iru_UDDI_Technical_White_Paper.pdf
15. Web Services Description Language (WSDL), http://www.w3.org/TR/wsdl
16. P. Tonella, F. Ricca, E. Pianta, C. Girardi: Using Keyword Extraction for Web Site Clustering. Proc. WSE 2003, 5th International Workshop on Web Site Evolution, pp. 41-48, Amsterdam, The Netherlands, September 22, 2003.
17. P. Tonella, F. Ricca: Dynamic Model Extraction and Statistical Analysis of Web Applications. Proc. WSE 2002, International Workshop on Web Site Evolution, pp. 43-52, Montreal, Canada, October 2002.
