Converting Web Applications into Standard XML Web Services: Two Case Studies
Natheer Khasawneh
Mohammed A. Shatnawi, Mohammad Fraiwan
Department of Software Engineering Jordan University of Science and Technology Irbid, JORDAN [email protected]
Department of Computer Engineering Jordan University of Science and Technology Irbid, JORDAN [email protected], [email protected]
978-1-4244-8136-1/10/$26.00 ©2010 IEEE
Abstract— The Internet contains a tremendous number of valuable web applications that can be used in many systems. To use such applications from other systems, the interaction needs to be in a standard structured format such as an XML web service. In this paper, we present a method to convert current web applications into standard XML web services. The system design and implementation are presented. We applied the proposed system to two test cases: the Jordan University of Science and Technology (JUST) online course schedule and the Wiley product search engine.
Keywords: web services, web mining, content extraction, information integration, automated submission.
I. INTRODUCTION
A huge number of applications are available online today. Typically, the only way to use these applications is through a web browser with human interaction. Using web applications with limited human intervention is the aim of the next version of the current web standards, known as the semantic web (Web 3.0). For example, it is easy to retrieve a list of courses manually from the JUST online course schedule, but it is difficult to build a system that extracts course information automatically by executing the online form. To do this, an end-user would have to extract the desired information manually or implement special-purpose software for the job. Most existing technologies rely on human intervention and are often completely manual, which makes them difficult, time-consuming, and labor-intensive.
In this paper, we present a method to convert existing web applications into standard XML web services, which makes these applications easily accessible to other systems. To achieve this, we propose a flexible and generic web service that can easily access web content and return the result in a structured Simple Object Access Protocol (SOAP) format. Using web services simplifies and generalizes the extraction process and standardizes the communication message format for end-users through the use of Web Service Description Language (WSDL) and SOAP messages. Hence, the web application functionality can be distributed efficiently over the World Wide Web.
The rest of this paper is organized as follows. Section II gives an overview of related work. Section III explains the proposed system design in detail. Section IV presents the application of the system to the two case studies. Finally, Section V concludes the paper and outlines future work.
II. RELATED WORK
Since few websites currently provide remote service functionality [4], many implementations and research efforts have aimed to integrate web applications by accessing web resources regardless of whether the websites provide a remote service. Scrapbook [6] is one example: it extracts only text and clips the web page into a limited page, and it does not deal with dynamic pages. Pollock [8] is another system that can create a virtual web service from a query interface, but users still need to parse the returned HTML document. The authors in [5] proposed a hierarchical model that integrates all features extracted from web pages and learns their importance. That system is a template-independent model that extracts all fields from a web page and searches for the field most likely to contain the data specified by end-users. In [4], the authors developed an end-user programming tool called Marmite, which combines access to web page content and services. Marmite is currently implemented as a Firefox plug-in using JavaScript and XML User Interface Language (XUL), and it cannot be called remotely by end-users. In [2], the authors presented a method to integrate different web pages for personal use. They implemented a system that can integrate ordinary static HTML pages and dynamic pages whose contents are generated by client-side scripts. The same methodology as in [3] is adopted for the proposed system. However, the response in [2] is reformatted through a user-defined language called WACDL (Web Application Contents Description Language), an XML-based language that describes web content, the scopes of target contents, and the desired information to be fetched.
III. PROPOSED SYSTEM
In this paper, we propose a system to convert existing web applications into standard XML web services. The method was applied to two test cases: the JUST online course schedule and the Wiley search engine. Starting from known test cases simplifies the process of finding a generic framework that is capable of handling dynamic webpages. We consider an XML web service the best way to distribute our system's functionality over the Internet: end-users only need to consult the Web Service Description Language (WSDL) reference to start using the service. Once the extraction process is finished, a SOAP object that contains the response is returned to end-users for further processing.
As shown in Fig. 1, the system consists of two phases: a setup phase and an execute phase. Each phase is divided into several layers, where the output of each layer is consumed by the next one.
In the setup phase, each new webpage URL is processed in cooperation with end-users to define three main features. The first is a description of the desired output. The second is the full details of the input required from users; the different input fields are extracted to obtain this feature. The third is a definition of how the obtained result is converted to a structured XML format. Based on the selected features, the WSDL is constructed and saved in the services database for the execute phase.
The execute phase runs each time an end-user wants to extract web data. End-users can run this phase only for web pages that have already been learned in the setup phase. Users pass the intended webpage URL, and multiple actions are taken until the final result is returned as a structured XML SOAP response.
Figure 1. Overall System Architecture
I. Setup phase: Webpage URL; Select Inputs; Execute form/View result; Select Outputs; Generate Conversion Descriptions; Build Service (WSDL); Save I/O and conversion descriptions to Services Database.
II. Execute phase: Webpage URL; Look-up Services Database; Generate SOAP Request; Get Structured XML SOAP Response.
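As a rough illustration of how the two phases could fit together, the following Python sketch models the services database as an in-memory dictionary. It is not the authors' implementation: the class names, field names, and the placeholder WSDL string are all assumptions made for this example, and the actual form submission and HTML conversion are left out.

```python
from dataclasses import dataclass


@dataclass
class ServiceDescription:
    """What the setup phase records for one webpage (field names are illustrative)."""
    url: str
    inputs: list        # names of the form inputs selected by the end-user
    outputs: list       # names of the output fields selected by the end-user
    conversion: dict    # output field -> where to find it in the result page
    wsdl: str = ""      # WSDL text built from the selected features


class ServicesDatabase:
    """In-memory stand-in for the services database (an assumption of this sketch)."""

    def __init__(self):
        self._by_url = {}

    def save(self, description):
        self._by_url[description.url] = description

    def lookup(self, url):
        # The execute phase only works for pages learned during setup.
        if url not in self._by_url:
            raise KeyError(f"no service registered for {url}; run the setup phase first")
        return self._by_url[url]


def setup_phase(db, url, inputs, outputs, conversion):
    """Record the three features for a new webpage and store them with a WSDL stub."""
    description = ServiceDescription(url, list(inputs), list(outputs), dict(conversion))
    description.wsdl = f"<!-- WSDL placeholder for {url}: inputs={description.inputs} -->"
    db.save(description)
    return description


def execute_phase(db, url, arguments):
    """Look up a learned page and check that every required input was supplied."""
    description = db.lookup(url)
    missing = [name for name in description.inputs if name not in arguments]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    # A real system would submit the form and convert the returned HTML here.
    return {"url": url, "request": {name: arguments[name] for name in description.inputs}}
```

Calling `execute_phase` for a URL that was never passed to `setup_phase` raises a `KeyError`, which mirrors the restriction that only pages learned in the setup phase can be executed.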
IV. SYSTEM IMPLEMENTATION
In this section, we present the application of the proposed system to two test cases: the JUST online course schedule and the Wiley publisher product search engine.
A. JUST Courses Online Schedule
The JUST online course schedule is an online application that enables students to browse available courses by semester, faculty, department, and section status (opened or closed). The form uses the POST method to exchange data with the server. One main problem that arises with this form is the dependency between HTML controls: for example, when a student selects a faculty, the page automatically displays all the departments related to that faculty. We addressed this problem by posting the data as many times as there are controls, until the final HTML result is obtained. We implemented a web service method that executes the form automatically, without human intervention, according to the end-user's arguments. The final result is then formatted as a SOAP object for further processing on the client side. The following steps clarify the whole process:
Step 1: Retrieve the posted data from end-users.
Step 2: Post the first argument (semester) and wait for the other HTML fields to be filled.
Step 3: Automatically select the desired faculty.
Step 4: Wait for the department field to be filled and select the posted department.
Step 5: Choose whether to show only the opened sections, only the closed sections, or both.
Step 6: Create an HTML DOM tree.
Step 7: Pick only the needed nodes in the tree and build the structured data that contains information about each course.
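Steps 1 to 5 can be sketched as a loop that posts the form once per dependent control, carrying the earlier selections forward each time. This is only a minimal sketch under stated assumptions: `post_form` is a hypothetical callable standing in for the real HTTP POST, the control names are invented for illustration, and `fake_post` merely echoes the fields it receives so that the example runs without network access.

```python
def submit_dependent_form(post_form, ordered_fields, values):
    """Post the form once per dependent control, carrying earlier values forward.

    post_form      -- callable taking a dict of form data and returning HTML text
    ordered_fields -- control names in dependency order (hypothetical names)
    values         -- the end-user's value for each control
    """
    form_data = {}
    html = ""
    for name in ordered_fields:
        form_data[name] = values[name]
        # Each post lets the server fill in the controls that depend on `name`.
        html = post_form(dict(form_data))
    # The HTML returned by the last post is the final result page.
    return html


def fake_post(form_data):
    """Stand-in for the real HTTP POST; it reports which fields it received."""
    return "<div id='result'>" + ",".join(sorted(form_data)) + "</div>"


if __name__ == "__main__":
    fields = ["semester", "faculty", "department", "status"]
    user_values = {"semester": "1", "faculty": "2", "department": "3", "status": "open"}
    print(submit_dependent_form(fake_post, fields, user_values))
```

With four controls the form is posted four times, which matches the idea of posting the data once per control until the final HTML result is obtained.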
As shown in Fig. 1, all HTML form inputs are selected and extracted as the first step in the setup phase. The number of times that the form must be executed should be specified to get the correct result. After the desired HTML output is obtained, our proposed system extracts course information and generates structured data as shown in Fig. 6, and the data conversion process is stored as an XML schema description. The WSDL file shown in Fig. 2 is built at this phase and stored with the inputs, outputs, and conversion schema in a database for future use.
Each time the JUST course schedule URL is passed by end-users to retrieve course information, the URL enters the execute phase and the WSDL web service is fetched from the database. At this phase, end-users generate a SOAP request that includes their intended semester, faculty, department, and section status as parameters, and call the web service method to get the final SOAP response. The formats of both SOAP messages are shown in Fig. 4 and Fig. 5.
For example, suppose an end-user is interested in retrieving all courses belonging to the medicine department. The JUST web server executes the end-user's form request and returns the courses from the server database in HTML format. A sample of the returned HTML text is shown in Fig. 3. The bold text represents the intended data to be extracted for each course. The returned HTML text contains a division (div tag) with a defined ID. Inside the div, the server creates N tables, where N is the number of courses that match the end-user's filter. Each table contains the details of one course, which are extracted, structured, and finally sent as the SOAP response. Our proposed wrapper, which is implemented as a web service method, gets the inner HTML code of the div, builds a DOM tree, and iterates over all tables, which represent the actual course information.
2010 10th International Conference on Intelligent Systems Design and Applications
B. Wiley Products Search Engine
Wiley is a company that publishes books, journals, and encyclopedias, in print and electronically, and sells online products and services. We implemented a web service method that searches all products published by Wiley and returns the result in a structured format to be consumed by end-users. Besides retrieving a list of matching products, our proposed method can also sort the results and narrow the search to certain products.
The web service method takes three arguments: search query, sort method, and product type. End-users may leave the last two parameters blank to search for all products, unsorted. The result is returned from the Wiley search engine server as plain HTML text, which is then interpreted and processed by our web service method.
In the setup phase, the three HTML inputs (search query, sort method, and product type) are selected. The desired output locations and each product detail are specified, and the structured conversion process is generated according to the output specified by end-users. Finally, the WSDL schema that represents the actual search method is built and stored in the database with the conversion description for future use in the execute phase. Fig. 7 shows a portion of the WSDL file that presents the generated web service method that uses the Wiley search engine. On their side, end-users generate a SOAP request as described in Fig. 8, specify the three main arguments representing the selected inputs, and then make a web service call. Our proposed system responds to that call and sends the result back in the SOAP format shown in Fig. 9.
Figure 2. Part of JUST WSDL web service
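To illustrate the three-argument Wiley search method described in Section IV.B, the sketch below prepares the form data for a search in which the sort method and product type are optional. The field names `query`, `sort`, and `type` are hypothetical, since the paper does not list the actual form field names, and omitting blank arguments from the encoded data is an assumption of this sketch rather than documented behavior.

```python
from urllib.parse import urlencode


def build_search_form(query, sort_method="", product_type=""):
    """Return URL-encoded form data; blank optional arguments are left out.

    The field names below are illustrative, not the real Wiley form fields.
    """
    if not query.strip():
        raise ValueError("search query must not be empty")
    fields = {"query": query}
    if sort_method:
        fields["sort"] = sort_method
    if product_type:
        fields["type"] = product_type
    return urlencode(fields)


if __name__ == "__main__":
    # Search all product types, unsorted.
    print(build_search_form("web services"))
    # Narrow the search and request a sort order.
    print(build_search_form("xml", sort_method="title", product_type="book"))
```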
Figure 3. Returned HTML code for JUST courses (sample content: Line Number: 662110; Course Symbol: VM211; column headers Section, Days, Time, Hall, Capacity, Reg Students; sample cell values 12345 and 10:30**11:30 …)
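The extraction described for the returned HTML (a div with a defined ID that holds one table per course) could be sketched with Python's standard `html.parser` module as follows. This is an assumption-laden illustration rather than the authors' wrapper: the div ID `courses` and the sample markup are invented, only the two labelled values from Fig. 3 are reused, and nested tables are not handled.

```python
from html.parser import HTMLParser


class CourseTableParser(HTMLParser):
    """Collect the cell texts of every table inside the div with a given id."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.div_depth = 0    # greater than 0 while inside the target div
        self.in_cell = False
        self.courses = []     # one list of cell texts per table

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            # Enter the target div, or track a nested div inside it.
            if self.div_depth or dict(attrs).get("id") == self.div_id:
                self.div_depth += 1
        elif self.div_depth and tag == "table":
            self.courses.append([])
        elif self.div_depth and tag in ("td", "th") and self.courses:
            self.in_cell = True
            self.courses[-1].append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.courses[-1][-1] += data.strip()


# Hypothetical markup: the div id and table layout are assumptions.
SAMPLE_HTML = """
<div id="courses">
  <table>
    <tr><td>Line Number:</td><td>662110</td></tr>
    <tr><td>Course Symbol:</td><td>VM211</td></tr>
  </table>
</div>
<table><tr><td>outside the div, ignored</td></tr></table>
"""

if __name__ == "__main__":
    parser = CourseTableParser("courses")
    parser.feed(SAMPLE_HTML)
    for cells in parser.courses:
        print(cells)
```

Each entry in `parser.courses` corresponds to one table, so the list length equals the number of courses returned for the end-user's filter.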
Figure 4. Courses SOAP 1.1 request message format (parameter types shown: string, int, int, int, int)
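For readers unfamiliar with the message layout, the following sketch builds a SOAP 1.1 request envelope with Python's standard `xml.etree.ElementTree`. The envelope namespace is the standard SOAP 1.1 one, but the service namespace, the method name `GetCourses`, and the parameter names are hypothetical; the real names would come from the generated WSDL, which is not reproduced here.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/courses"              # hypothetical service namespace


def build_soap_request(method, params):
    """Wrap a method call and its parameters in a SOAP 1.1 envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{method}")
    for name, value in params.items():
        # Every parameter becomes a child element of the method element.
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    # Parameter names and values are illustrative only.
    print(build_soap_request(
        "GetCourses",
        {"semester": 1, "faculty": 2, "department": 3, "sectionStatus": 0},
    ))
```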
Figure 5. Courses SOAP 1.1 response message format (element types shown: int, int, string, string)
Figure 6. (sample values shown: 662110, 3, VM211, ANIMAL-HEALTH, 48, 48, 1, G2121, 12345)