Reverse Engineering of Web Based Systems

Abdelwahab Hamou-Lhadj, Abdeslam En-Nouaary, Khalid Sultan
Electrical and Computer Engineering Department
Concordia University, Montreal, Canada
{abdelw, ennouaar, k_sultan}@ece.concordia.ca

Abstract

Maintaining a poorly documented web application is not an easy task; software engineers must understand various parts of an application before they can make changes that preserve reliability and other system attributes. In recent years, there has been a noticeable increase in the number of studies that aim at reverse engineering web applications. These studies embody a rich set of techniques that differ in a variety of ways. In this paper, we study several techniques for reverse engineering of web applications. The objective is to understand the trends in this area as well as uncover key research questions that remain unaddressed.

Keywords: Reverse engineering, web applications, program comprehension, software maintenance

1. Introduction

A web application is defined as a software system that is developed using Internet technologies such as web browsers, Internet protocols, markup languages, and so on [3]. From a software engineering perspective, one would expect the development of web applications to follow the same rigorous and comprehensive engineering methods used in the development of traditional software. However, for a variety of reasons, this is rarely the case in practice. Developed under very tight deadlines and with much shorter development life cycles than usual, web applications often fail to comply with software engineering principles [13]. In addition, the design documentation associated with web applications is almost always nonexistent; when it does exist, it is usually out of date or incomplete.

The problem is further complicated by the fact that web designers and others knowledgeable of the system's design commonly move to new companies, taking valuable information about the system with them. These factors turn a web application into a software entity that one can use but does not necessarily understand. The problem becomes apparent when the time comes to maintain the application by adding new functionality, adapting it to a new platform, or improving its performance. In such situations, one needs to spend a considerable amount of time understanding what the parts of the application do, and how they interact with each other to implement particular functionality. Fortunately, there has been a noticeable increase in the number of studies on reverse engineering web applications (e.g., [4, 5, 6, 8, 9, 11, 15, 17]), with the ultimate objective of helping software engineers speed up the maintenance process. The common practice is to extract abstract views of a web application that software engineers can use to understand how the application is designed and why it is designed this way. These studies propose several techniques that vary in the level of abstraction of the extracted views, the notation used to represent the content of the views, and the type of analyses performed. In this paper, we survey the techniques proposed in several of these studies. The main contributions of the paper are twofold:

• Understand the current trends in reverse engineering of web applications.

• Discuss key research challenges that need to be addressed.

The remainder of this paper is organized as follows. In the next section, we present the techniques selected in this study. Web reverse engineering challenges are presented in Section 3. We conclude the paper in Section 4.

2. Techniques for Reverse Engineering Web Applications

Reverse engineering is defined as "the process of analyzing a subject system to identify the system's components and their inter-relationships and to create representations of the system in another form at a higher level of abstraction" [2]. There are different types of reverse engineering activities; the focus of web reverse engineering has essentially been on three: design and architecture recovery, user interface reverse engineering, and data reverse engineering. The following subsections discuss the studies selected in this paper, categorized according to these activities. The studies have been selected because they provide good coverage of existing techniques. The list is by no means exhaustive, but we believe it is reflective of the current trends in reverse engineering of web applications.

2.1. Design and Architecture Recovery

Design recovery techniques aim at recreating design abstractions from low-level implementation constructs. The objective is to help software maintainers understand what a program does and how it does it [2]. Architecture recovery is the process of recovering the architecture of a system in terms of its high-level components and their interrelationships [8].

One of the most comprehensive research projects in the area of web application design recovery is perhaps the work of Di Lucca et al. [5, 6]. The authors present an approach, and a tool called WARE (Web Application Reverse Engineering), to recover the static and behavioral views of a web application. The static views are depicted using an extension to UML class diagrams, introduced by Conallen [3], that supports typical web application elements such as frames, forms, control widgets, scripts, etc. The static analysis process encompasses several steps, the most important of which consists of parsing the various artifacts of a web application, namely, HTML files, the web application directory structure, scripting languages (JavaScript and VBScript), and Java Servlets. The second step consists of mapping each component (e.g., an HTML page, a form, a frame, a script) to a class, whereas links between components (e.g., hypertext links, submit operations, and so on) are represented as associations between the corresponding classes. Finally, parameters specific to a particular component are modeled as attributes of the corresponding class.

Dynamic analysis is used to analyze the way the components extracted by static analysis interact with each other to implement a specific scenario. This requires observing the execution of the application by inserting probes at various places in the source code. Events are collected while the web application executes and are traced back to the classes represented in the class diagram. Examples of events include visualizing an HTML page (or frame), the submission of a form, a link traversal, and so forth. Sequence diagrams are used to depict the messages exchanged between the components of the web application. Unlike static analysis, dynamic analysis requires human intervention to trigger the execution of a scenario (e.g., filling in and submitting forms). The interaction diagrams are given to domain experts for examination, and the behavioral aspects that involve external users are then modeled manually as use cases. A minimal sketch of the static page-to-class mapping is shown below.
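The following Python sketch illustrates this kind of static page-to-class mapping under our own naming conventions; it is an illustration of the idea, not WARE's actual implementation or Conallen's exact notation.

```python
# A minimal, illustrative sketch of WARE-style static mapping (not the
# actual WARE implementation): HTML components become "classes" and
# hyperlinks/submissions become "associations".
from html.parser import HTMLParser

class PageModelBuilder(HTMLParser):
    """Maps one HTML page into classes (page, forms, frames) and
    associations (links, form submissions)."""
    def __init__(self, page_name):
        super().__init__()
        self.page = page_name
        self.classes = [page_name]          # the page itself is a class
        self.associations = []              # (source, kind, target) triples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.associations.append((self.page, "link", attrs["href"]))
        elif tag == "form":
            form = f"{self.page}::form"
            self.classes.append(form)       # each form becomes its own class
            if "action" in attrs:
                self.associations.append((form, "submit", attrs["action"]))
        elif tag == "frame" and "src" in attrs:
            self.classes.append(f"{self.page}::frame")
            self.associations.append((self.page, "frame", attrs["src"]))

builder = PageModelBuilder("index.html")
builder.feed('<a href="about.html">About</a>'
             '<form action="login.php"><input name="user"></form>')
print(builder.classes)       # ['index.html', 'index.html::form']
print(builder.associations)  # [('index.html', 'link', 'about.html'),
                             #  ('index.html::form', 'submit', 'login.php')]
```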

Another important contribution to the area of web application design recovery is the work of Ricca and Tonella [15]. The authors present an approach and a tool, called ReWeb, that aims at facilitating the understanding and maintenance of web applications. Unlike the previous study, which uses UML as the main notation, Ricca and Tonella model a web site's structure as a directed graph, where the nodes represent HTML pages (or frames) and the edges represent links between the pages. Using the tool, maintainers can view the structure of the web application at different levels of abstraction. The main views supported by ReWeb are:

• System View: This tree-based view displays the directory structure of the web application. According to the authors, it has been shown to be useful in helping software engineers build an initial mental model of the site and detect structural problems at a high level.

• Structural View: This view provides more detail than the system view by displaying the web site's pages (and frames) and the links among them. This view helps a maintainer identify structural problems at the navigational and organizational level.

• History View: This view allows maintainers to keep track of the changes made to a web application and their impact on its structure.

• Report View: This is a textual view that shows various properties of a web application, such as the technology it uses, its dynamic pages, broken links, etc.

Finally, the last study discussed in this category is the work of Hassan and Holt [8, 9]. Unlike the previous techniques, which focus on design recovery, the authors present a semi-automatic approach for recovering the architecture of web systems. Their approach extracts the structure of a web application and shows the interaction between its high-level components, such as databases, distributed objects, and web pages. The approach's main steps are as follows. First, the source code and binaries are parsed to extract the components of the system and their interrelationships; for this purpose, several extractors have been developed to parse HTML, server-side scripts (such as Servlets), database access operations, source code (JavaScript and VBScript), and binary code. The second step consists of grouping the extracted entities into clusters in order to produce higher-level views. This step, which is guided by domain experts, relies on clustering heuristics based on file naming conventions, directory structure, or software metrics; the resulting clusters correspond to the high-level system components (a sketch of the directory-based heuristic follows). The authors state that they have successfully recovered the architecture of several large commercial and experimental web applications with over 200 distributed web objects and over 15 databases per application.
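Below is a small illustration of the directory-based clustering heuristic; the file paths are invented examples, and the real approach combines several heuristics under expert guidance.

```python
# Illustrative sketch of directory-based clustering, one of the heuristics
# Hassan and Holt mention for lifting extracted entities into high-level
# components. The artifact paths below are invented examples.
from collections import defaultdict
from pathlib import PurePosixPath

def cluster_by_directory(paths):
    """Group extracted artifacts into candidate components by directory."""
    clusters = defaultdict(list)
    for p in paths:
        clusters[str(PurePosixPath(p).parent)].append(p)
    return dict(clusters)

artifacts = ["shop/cart.jsp", "shop/checkout.jsp",
             "db/orders.sql", "js/validate.js"]
for component, members in cluster_by_directory(artifacts).items():
    print(component, "->", members)
# shop -> ['shop/cart.jsp', 'shop/checkout.jsp']
# db   -> ['db/orders.sql']
# js   -> ['js/validate.js']
```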

2.2. User Interface Recovery

User interface reverse engineering consists of abstracting low-level user interface elements into higher-level views. Typical applications include improving the usability of a web application or migrating its user interface to another platform.

In [11], Paganelli and Paternò present a method for analyzing the user interface of web applications, with the objective of automatically reconstructing the logical interaction between a web site's components. In particular, the purpose of the work is to recreate a task model that can serve as an interface design for the web site. They have developed a tool called WebRevenge that takes static web pages as input and returns a task model represented in the CTT (ConcurTaskTrees) notation, a tree notation developed by Paternò [12] to improve the design of user interfaces of interactive applications. The authors have designed rules that map a web site's elements to CTT constructs. The rules analyze the main elements of HTML web pages (e.g., buttons, links, checkboxes) and the relationships among them in order to build the task model corresponding to the web site under study. To validate their approach, the authors applied their technique to a small web site. Although the results are promising, the authors recognize that scalability issues might arise on a large web site. In addition, WebRevenge cannot capture the dynamic behavior of a web site, since dynamic pages are created on the fly.

In [17], Vanderdonckt et al. present a technique for reverse engineering a presentation model of a web page in order to make the web site accessible from multiple web-based devices, such as PDAs, cell phones, etc. Their approach is implemented in a tool called VAQUISTA (reVerse engineering of Applications by Questions, Information Selection, and Transformation Alternatives). VAQUISTA parses HTML files to extract user interface elements. These elements are then transformed into instances of standard presentation elements through mapping rules; for example, an HTML form can be decomposed into several control widgets. The resulting user interface model is expressed in XIML [14], a standard XML-based schema for specifying user interface elements. Using VAQUISTA, users have the flexibility to modify the mapping rules so as to generate user interface elements that satisfy the constraints of the target platform. Like WebRevenge, VAQUISTA relies on static HTML parsing only. A sketch of the mapping-rule idea shared by both tools follows.
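The sketch below illustrates the mapping-rule idea common to both tools: concrete HTML controls are rewritten as abstract interaction elements. The abstract vocabulary here is our invention; the actual tools target CTT and XIML, respectively.

```python
# Illustrative mapping rules in the spirit of WebRevenge/VAQUISTA: concrete
# HTML controls are rewritten as abstract interaction elements. The abstract
# vocabulary below is invented; the real tools use CTT and XIML notations.
from html.parser import HTMLParser

RULES = {            # concrete HTML tag -> abstract interaction element
    "input":  "edit_field",
    "select": "single_choice",
    "a":      "navigate",
    "button": "activate",
}

class AbstractUIBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "checkbox":
            self.elements.append("multiple_choice")  # rule refined by attribute
        elif tag in RULES:
            self.elements.append(RULES[tag])

b = AbstractUIBuilder()
b.feed('<a href="next.html">next</a><input type="checkbox"><select></select>')
print(b.elements)   # ['navigate', 'multiple_choice', 'single_choice']
```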

2.3. Data Reverse Engineering

Data reverse engineering is defined as "the use of structured techniques to reconstitute the data assets of an existing system" [1]. This reconstitution involves saving the data using a structured representation, such as a database.

Estiévenart et al. present an approach for extracting data from static HTML pages and migrating it into a database [4]. The approach consists of the following steps. First, web pages are grouped according to their informational content; the authors use the directory structure of the web application as the main grouping criterion. The next step consists of cleaning the HTML pages by removing common syntactical mistakes, such as missing quotes around attribute values. The third step, which is also the most important, consists of identifying concepts, and the relations among them, from the web pages. These concepts become candidate tables in a relational database, whereas the relations among the concepts end up as relations among the corresponding tables. This step is performed semi-automatically: a tool allows software maintainers to select a sample web page and identify concepts by browsing its content. The concepts are saved in a file called META. Based on this file, the tool automatically searches for similar concepts in the pages that belong to the same group as the sample page, a process that is also user guided. A META file is created for each group of pages. The next step is to integrate all META files into one XML schema, and the resulting XML file is then turned automatically into the new database. The approach was successfully applied to a small university web site. A sketch of the concept-extraction idea follows.
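The following sketch illustrates the core extraction idea on invented data: a concept pattern identified on a sample page is searched for in similar pages, and the matches populate a relational table. It does not reproduce the META-file mechanism of [4].

```python
# Illustrative sketch of the data-migration step: instances of a concept
# identified in a sample page are searched for in similar pages and loaded
# into a relational table. The regex, pages, and schema below are invented
# examples, not the META-file mechanism of Estiévenart et al.
import re
import sqlite3

PAGES = [  # invented stand-ins for a group of similar static pages
    "<h2>Dr. Smith</h2><p>Office: H-801</p>",
    "<h2>Dr. Jones</h2><p>Office: H-915</p>",
]
CONCEPT = re.compile(r"<h2>(?P<name>.*?)</h2><p>Office: (?P<office>.*?)</p>")

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE professor (name TEXT, office TEXT)")
for page in PAGES:
    for m in CONCEPT.finditer(page):
        db.execute("INSERT INTO professor VALUES (?, ?)",
                   (m["name"], m["office"]))
print(db.execute("SELECT * FROM professor").fetchall())
# [('Dr. Smith', 'H-801'), ('Dr. Jones', 'H-915')]
```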

3. Research Challenges

In this section, we discuss research challenges that have yet to be addressed in the area of web application reverse engineering. We focus on three main challenges, which we list here and explain in more detail in the subsequent sections:

• Coping with multiple web technology platforms.

• Dealing with scalability issues.

• Applying static versus dynamic analysis.

3.1. Coping with Multiple Platforms

Web systems tend to employ a large number of technologies, ranging from plain HTML pages to a plethora of scripting and server-side languages, including JavaScript, VBScript, Perl, CGI scripts, PHP, etc. In addition, the components of a web application are often written in various programming languages, run on multiple distributed machines, and use various database technologies to store and share data. Such a heterogeneous environment necessitates reverse engineering techniques that are specifically tailored to web applications. To start with, there is a need for effective parsers to extract data from the software artifacts that are specific to web applications. In addition, since the extracted information ultimately needs to be saved, it is important to work towards a standard schema (i.e., metamodel) for representing it. Traditional reverse engineering schemas such as the KDM (Knowledge Discovery Metamodel) [7] and the DMM (Dagstuhl Middle Metamodel) [10] are designed to support traditional programming languages (e.g., C++, Java) and offer no support for languages specific to web applications. These schemas will eventually need to be extended to model scripting-language constructs, HTML elements, and objects built in foreign environments such as Flash. It is also important to note that the new schema should support not only the static aspects of a web application but also its behavioral properties, since the interaction between a web application's components involves additional mechanisms such as communication through hyperlinks and web services. A sketch of what such an extended schema might capture follows.
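As a rough illustration, the sketch below shows the kind of entities and relations such an extended schema might capture; the names are our invention, not part of KDM or the DMM.

```python
# A minimal sketch of what a web-specific extension of a reverse-engineering
# schema might capture; the entity and relation kinds are our invention,
# not part of KDM or the Dagstuhl Middle Metamodel.
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    kind: str       # e.g. "html_page", "script", "form", "web_service"

@dataclass
class Relation:
    source: Entity
    target: Entity
    kind: str       # e.g. "hyperlink", "includes", "invokes"

@dataclass
class WebModel:
    entities: list = field(default_factory=list)
    relations: list = field(default_factory=list)

home = Entity("index.html", "html_page")
cart = Entity("cart.php", "script")
model = WebModel([home, cart], [Relation(home, cart, "hyperlink")])
print(model.relations[0].kind)   # hyperlink
```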

3.2. Dealing with Scalability Issues

The complexity and sheer size of typical web applications require efficient techniques for handling the volume of data generated by the analysis of such systems. Most of the aforementioned studies have been applied to small web sites, which makes their ability to scale up to large systems questionable. In addition, questions arise as to how best to visualize the abstract views extracted from large web applications. As shown in the previous section, existing tools use various notations, including UML diagrams, graph structures, and ConcurTaskTrees; however, these notations become very cumbersome when the amount of data to be displayed is large. Software visualization techniques should be investigated to determine the best way to represent web application views, with the ultimate objective of assisting software engineers during program understanding; one simple tactic is sketched below. It should be noted, however, that visualization is a complementary technique and needs to be used in conjunction with other program understanding techniques such as software metrics, source code analysis, etc. Finally, the development and maintenance of a large system involves more than one person; as such, there is a need to work towards collaborative reverse engineering environments that enable effective sharing of information.
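As a rough illustration of one such tactic, the sketch below collapses a page-level link graph into a directory-level graph so that a large site yields a smaller, viewable structure; the link list is an invented example.

```python
# Illustrative sketch of one scalability tactic: collapse a page-level link
# graph into a directory-level graph so that large sites remain viewable.
# The link list is an invented example.
from pathlib import PurePosixPath

links = [("shop/cart.html", "shop/pay.html"),
         ("shop/pay.html", "admin/report.html"),
         ("index.html", "shop/cart.html")]

def collapse(links):
    """Lift page-to-page edges to directory-to-directory edges."""
    edges = set()
    for src, dst in links:
        a = str(PurePosixPath(src).parent)
        b = str(PurePosixPath(dst).parent)
        if a != b:                 # drop edges internal to a component
            edges.add((a, b))
    return sorted(edges)

print(collapse(links))   # [('.', 'shop'), ('shop', 'admin')]
```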

3.3. Static Analysis vs. Dynamic Analysis

Most existing web reverse engineering techniques are limited to static analysis of the source code, paying less attention to dynamic analysis. However, static analysis can only reveal the static components of a web application and their interrelationships, whereas dynamic analysis is more useful for understanding the dynamics of a web application. Dynamic analysis, on the other hand, comes with its own set of challenges. First, there is a need to determine the dynamic properties (i.e., the events) that need to be observed. For traditional software systems, these events are generally well known; examples include routine calls, messages between processes, etc. Web applications, on the other hand, might require the investigation of additional events, depending on the type of analysis performed. For example, if the goal is to understand how particular HTML pages interact with each other, one could trace the links traversed between these pages; the resulting trace then needs to be modeled and fed to a trace analysis tool for examination (a minimal sketch of such a probe follows). The second issue is related to the sheer size of the data generated by executing a software system; there is a need for techniques that support the efficient exploration and analysis of large execution traces. Finally, it is important to investigate ways of combining static and dynamic views in one reverse engineering environment, since a comprehensive approach to program understanding typically involves both.
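As an illustration of what such a probe might look like, the sketch below logs each requested page in a WSGI application so that page-to-page traversals can be reconstructed offline; it is our example, not a tool from the surveyed papers.

```python
# A minimal sketch of a navigation probe: WSGI middleware that records each
# requested path with a timestamp so page-to-page traversals can be
# reconstructed offline. This is our illustration, not a tool from the
# surveyed papers.
import time

class NavigationTracer:
    def __init__(self, app, trace):
        self.app, self.trace = app, trace

    def __call__(self, environ, start_response):
        # Record the event before delegating to the wrapped application.
        self.trace.append((time.time(), environ.get("PATH_INFO", "")))
        return self.app(environ, start_response)

def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<a href='/next'>next</a>"]

trace = []
app = NavigationTracer(demo_app, trace)
# In a real setting the web server drives this; here we simulate one request.
app({"PATH_INFO": "/index"}, lambda status, headers: None)
print(trace)   # e.g. [(1716400000.1, '/index')]
```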

4. Conclusion

In this paper, we presented several reverse engineering approaches for web applications. These techniques aim at recovering the design, architecture, user interface elements, and data of web applications, with the objective of helping software engineers maintain web systems efficiently. We also discussed several research challenges that need to be addressed, such as dealing with the heterogeneous nature of web systems, investigating techniques to cope with scalability issues in large applications, and fostering the use of dynamic analysis as a complement to the views generated by static analysis.

5. References

[1] P. Aiken, Data Reverse Engineering: Slaying the Legacy Dragon, McGraw-Hill, 1996.

[2] E. J. Chikofsky and J. H. Cross, "Reverse Engineering and Design Recovery: A Taxonomy", IEEE Software, 7(1), 1990, pp. 13-17.

[3] J. Conallen, Building Web Applications with UML, Addison-Wesley, 1999.

[4] F. Estiévenart, A. François, J. Henrard, and J.-L. Hainaut, "A Tool-Supported Method to Extract Data and Schema from Web Sites", In Proc. of the 5th IEEE International Workshop on Web Site Evolution, Amsterdam, The Netherlands, 2003, pp. 3-11.

[5] G. A. Di Lucca, M. Di Penta, G. Antoniol, and G. Casazza, "An Approach for Reverse Engineering of Web-Based Applications", In Proc. of the 8th Working Conference on Reverse Engineering, Stuttgart, Germany, 2001, pp. 231-240.

[6] G. A. Di Lucca, A. R. Fasolino, and P. Tramontana, "Towards a Better Comprehensibility of Web Applications: Lessons Learned from Reverse Engineering Experiments", In Proc. of the 4th International Workshop on Web Site Evolution, Montréal, Canada, 2002, pp. 33-42.

[7] Knowledge Discovery Metamodel (KDM) 1.0 Specification, 2007. URL: http://www.omg.org/cgi-bin/doc?ptc/2007-03-15.

[8] A. E. Hassan and R. C. Holt, "Architecture Recovery of Web Applications", In Proc. of the 24th International Conference on Software Engineering, Orlando, Florida, USA, 2002, pp. 349-359.

[9] A. E. Hassan and R. C. Holt, "Towards a Better Understanding of Web Applications", In Proc. of the 3rd International Workshop on Web Site Evolution, Florence, Italy, 2001, pp. 112-116.

[10] T. C. Lethbridge, S. Tichelaar, and E. Ploedereder, "The Dagstuhl Middle Metamodel: A Schema for Reverse Engineering", Electronic Notes in Theoretical Computer Science, 94, 2004, pp. 7-18.

[11] L. Paganelli and F. Paternò, "A Tool for Creating Design Models from Web Site Code", International Journal of Software Engineering and Knowledge Engineering, 13(2), 2003, pp. 169-189.

[12] F. Paternò, Model-Based Design and Evaluation of Interactive Applications, Springer-Verlag, 1999.

[13] R. S. Pressman, "What a Tangled Web We Weave", IEEE Software, 17(1), 2000, pp. 18-21.

[14] A. Puerta and J. Eisenstein, "A Representational Basis for User Interface Transformations", In Proc. of the ACM CHI Workshop on Transforming the UI for Anyone, Anywhere, Seattle, WA, USA, 2001.

[15] F. Ricca and P. Tonella, "Understanding and Restructuring Web Sites with ReWeb", IEEE MultiMedia, 8(2), 2001, pp. 40-51.

[16] UML 2.0 Specification, 2005. URL: http://www.omg.org/technology/documents/formal/uml.htm.

[17] J. Vanderdonckt, L. Bouillon, and N. Souchon, "Flexible Reverse Engineering of Web Pages with VAQUISTA", In Proc. of the 8th Working Conference on Reverse Engineering, Stuttgart, Germany, 2001, pp. 241-252.