Converting Web Applications into Standard XML Web Services: Two Case Studies

Natheer Khasawneh

Mohammed A. Shatnawi, Mohammad Fraiwan

Department of Software Engineering Jordan University of Science and Technology Irbid, JORDAN [email protected]

Department of Computer Engineering Jordan University of Science and Technology Irbid, JORDAN [email protected], [email protected]

Abstract— The Internet contains a tremendous number of valuable web applications that can be used by many systems. To use these applications with other systems, the interaction needs to be in a standard structured format such as an XML web service. In this paper, we present a method to convert current web applications into standard XML web services. The system design and implementation are presented. We applied the proposed system to two test cases: the Jordan University of Science and Technology (JUST) online course schedule and the Wiley product search engine.

Keywords—web services; web mining; contents extraction; information integration; automated submission.

I. INTRODUCTION

There are a huge number of applications available online these days. The only way to use these applications is via web browsers and through human interaction. Using web applications with limited human intervention is the new version of the current web standards, known as the semantic web (Web 3.0). For example, it is easy to manually retrieve a list of courses from the JUST online course schedule, but it is difficult to build a system that automatically extracts course information by executing the online form. To do this, an end-user would have to manually extract the desired information or implement special-purpose software to handle the same job. Most existing technologies are based on human intervention, which is often a completely manual process. Moreover, manual or static technology that depends on human interaction is difficult, time consuming, and requires extra effort. In this paper, we present a method to convert existing web applications into standard XML web services, which makes these applications easily accessible to other systems. To achieve this purpose, we propose a flexible and generic web service that can easily access web contents and return the result in a structured Simple Object Access Protocol (SOAP) format. Using web services clearly simplifies and generalizes the extraction process and standardizes the communication message format for end-users through the use of the Web Service Description Language (WSDL) and SOAP messages. Hence, the web application functionality can be distributed efficiently over the World Wide Web.

The rest of this paper is organized as follows. Section 2 gives an overview of related work. Section 3 explains the proposed system design in detail. The application of the system to the two case studies is presented in Section 4. Finally, Section 5 concludes our approach and gives future work.


II. RELATED WORK

Since few websites currently provide remote service functionality [4], many implementations and research efforts have been developed to integrate web applications by accessing web resources whether or not the websites provide a remote service. Scrapbook [6] is one example: it extracts only text and clips the web page into a limited page; however, it does not deal with dynamic pages. Pollock [8] is another system that can create a virtual web service from a query interface, but users still need to parse the returned HTML document. The authors in [5] proposed a hierarchical model that integrates all features extracted from web pages and learns their importance. The proposed system is a template-independent model that extracts all fields from a web page and searches for the best field that may contain the desired data specified by end-users. In [4], the authors developed an end-user programming tool called Marmite, which combines access to web page content and services. Marmite is currently implemented as a plug-in for the Firefox web browser using JavaScript and the XML User Interface Language (XUL), and it cannot be called remotely by end-users. In [2], the authors presented a method to integrate different web pages for personal use. They implemented a system that can integrate ordinary static HTML pages as well as dynamic pages whose contents are generated by client-side scripts. The same methodology as in [3] is adopted for the proposed system. However, the response in [2] is reformatted through a user-defined language called WACDL (Web Application Contents Description Language), an XML-based language that describes web content, the scopes of target contents, and the desired information to be fetched.

III. PROPOSED SYSTEM

In this paper, we propose a system to convert existing web applications into standard XML web services. The method was applied to two test cases: the JUST online course schedule and the Wiley search engine. Starting from known test cases simplifies the process of finding a generic framework capable of handling dynamic webpages. An XML web service is the best way to distribute our system's functionality over the Internet: end-users need only check the Web Service Description Language reference and start using the service. Once the extraction process is finished, a SOAP object containing the response is returned to end-users for further processing. As shown in Fig. 1, the system consists of two phases: a setup phase and an execute phase. These phases are divided into several layers, where the output of each layer is consumed by the next one. In the setup phase, each new webpage URL is processed in cooperation with end-users to define three main features. The first is a description of the desired output. The second is full details about the input required from users; different input fields are extracted to obtain this feature. The third is a definition of how the obtained result is converted to a structured XML format. Based on the selected features, the WSDL is constructed and saved in the services database for the execute phase. The execute phase runs each time an end-user wants to extract web data; end-users can run it only for web pages that were already learned in the setup phase. Users need to pass the intended webpage URL, and multiple actions are taken until the final result is returned as a structured XML SOAP response.
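As a rough illustration of what one setup-phase record might look like when saved to the services database; the paper does not publish its schema, so every field name below is an assumption:

```python
# Illustrative setup-phase record as it might be saved to the services
# database. All field names are assumptions; the paper does not publish
# its actual schema.
service_record = {
    "url": "https://example.edu/course-schedule",        # hypothetical URL
    "http_method": "POST",
    "inputs": [                                          # extracted form fields
        {"name": "semester",   "type": "int"},
        {"name": "faculty",    "type": "int"},
        {"name": "department", "type": "int"},
        {"name": "status",     "type": "int"},           # opened/closed/both
    ],
    "outputs": ["line_number", "course_symbol", "section",
                "days", "time", "hall", "capacity", "registered"],
    # description of how the returned HTML maps to structured XML
    "conversion": {"container": ("div", "results"), "record": "table"},
    "wsdl": "<definitions ...>",                         # generated WSDL text
}
```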

Figure 1. Overall system architecture. Setup phase: webpage URL → select inputs → execute form/view result → select outputs → generate conversion descriptions → build service (WSDL) → save I/O and conversion descriptions to services database. Execute phase: webpage URL → look up services database → generate SOAP request → get structured XML SOAP response.

IV. SYSTEM IMPLEMENTATION

In this section we present the application of the proposed system to two test cases: the JUST online course schedule and the Wiley publisher product search engine.

A. JUST Courses Online Schedule

The JUST online course schedule is an online application that enables students to browse available courses by semester, faculty, department, and section status (opened or closed). The form uses the POST method to exchange data with the server. One main problem that arises in this form is HTML control dependency (i.e., when a student selects his/her faculty, the page automatically displays all the departments related to that faculty). We addressed this problem by posting the data a number of times equal to the number of controls until the final HTML result is obtained. We implemented a web service method to execute the form automatically, without human intervention, according to the end-user's arguments. The final result is then formatted as a SOAP object for further processing on the client side. The following steps clarify the whole process:


• Step 1: Retrieve the posted data from end-users.
• Step 2: Post the first argument (semester) and wait for the other HTML fields to be filled.
• Step 3: Automatically select the desired faculty.
• Step 4: Wait for the departments field to be filled and select the posted department.
• Step 5: Choose whether to show only the opened sections, only the closed sections, or both.
• Step 6: Create an HTML DOM tree.
• Step 7: Pick only the needed nodes in the tree and build the structured data that contains information about each course.
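The paper does not publish its implementation, so the following is only a minimal Python sketch of Steps 1–5 under stated assumptions: the form URL and field names are hypothetical, and the requests library stands in for whatever HTTP client the authors used.

```python
import requests

FORM_URL = "https://example.edu/course-schedule"  # hypothetical form URL

def execute_schedule_form(semester, faculty, department, status):
    """Execute the dependent form (Steps 1-5): one POST per control."""
    session = requests.Session()
    fields = {"semester": semester}                 # Step 2: post the semester
    # Each POST lets the server fill the next dependent drop-down
    # (faculty -> departments -> status), so we re-post once per control.
    for name, value in (("faculty", faculty),       # Step 3
                        ("department", department), # Step 4
                        ("status", status)):        # Step 5
        session.post(FORM_URL, data=fields)         # server fills next control
        fields[name] = value
    return session.post(FORM_URL, data=fields).text # final HTML with courses
```

This mirrors the paper's fix of posting the data once per control until the final HTML result is obtained.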

As shown in Fig. 1, all HTML form inputs are selected and extracted as the first step in the setup phase. The number of times that the form must be executed is specified so that the correct result is obtained. After the desired HTML output is obtained, our proposed system extracts course information and generates structured data as shown in Fig. 6, and the process of data conversion is stored as an XML schema description. The WSDL file shown in Fig. 2 is built at this phase and stored with the inputs, outputs, and the conversion schema in a database for future use.

Each time the JUST course schedule URL is passed by end-users to retrieve course information, the URL enters the execute phase and the WSDL web service is fetched from the database. At this phase, end-users generate a SOAP request that includes their intended semester, faculty, department, and section status as parameters to call the web service method and get the final SOAP response. The formats of both SOAP messages are shown in Fig. 4 and Fig. 5. For example, suppose an end-user is interested in retrieving all courses belonging to the medicine department: the JUST web server executes the end-user's form request and returns the courses from the server database in HTML format. A sample of the returned HTML text is shown in Fig. 3; the bold text represents the intended data to be extracted for each course. The returned HTML text contains a division (div tag) with a defined ID. Inside the div, a number of tables (N), equal to the number of courses filtered by the end-user, is created by the server. Each table contains the details of one course, which are extracted, structured, and finally sent as the SOAP response. Our proposed wrapper, which is implemented as a web service method, gets the inner HTML code of the div, builds a DOM tree, and iterates over all tables, which represent the actual course information.

B. Wiley Products Search Engine

Wiley is a company that publishes books, journals, and encyclopedias, in print and electronically, and also sells online products and services. We implemented a web service method that searches for all products published by Wiley and returns the result in a structured format to be consumed by end-users. Besides retrieving a list of matching products, our proposed method also sorts and narrows the user's search to certain products. The web service method takes three arguments: search query, sort method, and product type. End-users may leave the last two parameters blank to search for all unsorted products. The result is returned from the Wiley search engine server as plain HTML text, which is then interpreted and processed by our web service method.

In the setup phase, the three HTML inputs (search query, sort method, and product type) are selected. The desired output locations and each detail about the product are specified, and the structured conversion process is generated according to the output specified by end-users. Finally, the WSDL schema that represents the actual search method is built and stored in the database with the conversion description for future use in the execute phase. Fig. 7 shows a portion of the WSDL file that presents the generated web service method that uses the Wiley search engine. From the end-user's side, end-users should generate a SOAP request as described in Fig. 9, specify the three main arguments representing the selected inputs, and then make a web service call. Our proposed system will immediately respond to that call and send the result back in the SOAP format shown in Fig. 10.

Figure 2. Part of JUST WSDL web service

Figure 3. Returned HTML code for JUST courses (sample rows include the line number, e.g., 662110; the course symbol, e.g., VM211; and the section, days, time, hall, capacity, and registered-students fields).
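As a minimal sketch of the wrapper's Steps 6 and 7 described above, assuming BeautifulSoup as the HTML parser; the div ID and the cell order inside each course table are assumptions, since the real markup of Fig. 3 did not survive extraction:

```python
from bs4 import BeautifulSoup

def extract_courses(html):
    """Steps 6-7: build a DOM tree and pick only the needed nodes."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="results")    # the div ID is an assumption
    courses = []
    for table in container.find_all("table"):     # one table per course
        cells = [td.get_text(strip=True) for td in table.find_all("td")]
        courses.append({                          # cell order is an assumption
            "line_number": cells[0],
            "course_symbol": cells[1],
            "section": cells[2],
            "days_time": cells[3],
            "hall": cells[4],
        })
    return courses
```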

Figure 4. Courses SOAP 1.1 request message format (one string and four int parameters).
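Because the original figure is lost, here is an illustrative SOAP 1.1 request with the same shape (one string plus four int parameters); the operation name, namespace, endpoint, and element names are all assumptions:

```python
import urllib.request

# The method name, namespace, endpoint, and element names below are
# assumptions for illustration; only the parameter list (one string URL
# plus four ints) follows the format noted in Fig. 4.
SOAP_REQUEST = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetCourses xmlns="http://example.org/schedule-service">
      <url>https://example.edu/course-schedule</url>
      <semester>1</semester>
      <faculty>12</faculty>
      <department>3</department>
      <status>1</status>
    </GetCourses>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    "http://example.org/schedule-service.asmx",   # hypothetical endpoint
    data=SOAP_REQUEST.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8",
             "SOAPAction": "http://example.org/schedule-service/GetCourses"},
)
with urllib.request.urlopen(request) as reply:
    print(reply.read().decode("utf-8"))           # structured XML SOAP response
```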


Figure 5. Courses SOAP 1.1 response message format (int and string fields for each course).

Figure 6. Structure of JUST class courses (sample record: line number 662110, section 3, course symbol VM211, course name ANIMAL-HEALTH, capacity 48, registered 48, hall G2121).

The details of each product are stored inside different HTML tags. For example, the product HTTP link can be found in an anchor tag located inside an HTML div belonging to the "product-title" class. However, one problem arises when end-users search for certain keywords that return a large number of matched products: Wiley automatically creates a pagination bar to traverse all the results. We addressed this problem by recursively opening all result pages. At each page, we build an HTML DOM tree that represents the product information details, extract only the needed data, and convert each product to a structured format. Finally, we send the result as a SOAP response message. Fig. 8 shows a sample of the HTML result after searching for the "MCSA" keyword.
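A minimal sketch of this recursive pagination handling: the "product-title" div class comes from the paper, while the result URL, the extracted text, and the "Next" link selector are assumptions.

```python
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def collect_all_products(url, session=None):
    """Recursively open each page of a paginated Wiley result set."""
    session = session or requests.Session()
    soup = BeautifulSoup(session.get(url).text, "html.parser")
    # "product-title" divs are named in the paper; what we pull out of
    # them here is a simplification.
    products = [div.get_text(strip=True)
                for div in soup.find_all("div", class_="product-title")]
    next_link = soup.find("a", string="Next")     # pagination bar link (assumed)
    if next_link is not None:                     # recurse until the last page
        products += collect_all_products(urljoin(url, next_link["href"]),
                                         session)
    return products
```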

V. FUTURE WORK

Having applied the proposed method to two examples of real web applications with different styles, layouts, HTML inputs, HTTP methods, and results, we introduce in this section a general model for building a framework that can handle webpages dynamically.

Figure 7. Part of Wiley WSDL file

Figure 8. Returned HTML code for Wiley products (sample fields include the author line "by Microsoft Official Academic Course", the date "July 2004, ©2007", and the price "£36.99 / €42.60").

Figure 9. Wiley products SOAP 1.1 request message format (string query; product type: All, Books, TextbooksAndCourseOfferings, Media, or Journals; sort method: Relevance, AzByTitle, AzByAuthor, or PublicationDate).
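Since end-users only need the WSDL reference to start using the service, a WSDL-driven client can generate the call for them. A sketch using the third-party zeep library (our choice, not the paper's); the WSDL location and operation name are assumptions, while the argument values are those enumerated in Fig. 9:

```python
import zeep  # third-party SOAP client, used here for illustration

# The WSDL location and operation name are assumptions; the argument
# values are the ones enumerated in Fig. 9.
client = zeep.Client("http://example.org/wiley-search?wsdl")
result = client.service.SearchProducts(
    query="MCSA",
    productType="Books",   # All, Books, TextbooksAndCourseOfferings, Media, Journals
    sortBy="Relevance",    # Relevance, AzByTitle, AzByAuthor, PublicationDate
)
print(result)              # structured response, per the format in Fig. 10
```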


Figure 10. Wiley products SOAP 1.1 response message format (string fields for each product).

Figure 11. Structure of Wiley products (sample record with date "July 2004, ©2007", price "£36.99 / €42.60", and format "Paperback").

To build a generic web extraction system, the first issue the system needs to deal with is how to extract contents. This depends on the representation format of the web pages; a robust representation can improve extraction correctness [5]. However, building such a framework is not a trivial task and faces many problems:
• If the target webpage contains a complex HTML form, there may exist dependencies among multiple inputs.
• Extracting information from webpages and distilling the data into a structured format.
• Merging data from separate HTML pages introduces a data integration problem.
• Finding all target HTML pages on a website (i.e., if the site returns paginated data, multiple navigation steps are required to collect all the results [9]).
• Dynamically generated HTML code, which is typically computed based on user inputs or JavaScript [9].

There is also a need to extract simple data items, for example when crawling online news article websites, which often contain an image with a semi-long text body; the extracted data would simply be represented in a structured XML format. Moreover, product webpages share common features and contain similar attributes such as title, price, and description. Such pages can easily be structured automatically and processed by a general extractor. In the future, the semantic web will ensure automatic metadata in a standard XML format that can be interpreted by all software agents. Nevertheless, current webpages do not have descriptions that would undoubtedly facilitate the content extractor's job. In order to build a wide-ranging extractor framework, the following issues should be taken into account:
• If the desired webpage contains a legible description, the framework should extract information efficiently, and this would increase the accuracy.
• To address the data integration problem, end-users should tell the system that the target site contains multiple sources and give enough description of each one.
• Manual metadata for webpage layout. For example, end-users may tell the framework that the intended data is found in particular HTML tags with certain classes or IDs (a sketch of such a descriptor follows this list).
• To overcome the difficulty of HTML input dependency, as in the JUST online course form, the number of times the form must be executed should be known before the extraction process starts, to ensure that the anticipated result is returned acceptably.
• A robust contents converter should be implemented to convert the information embedded within the HTML code into a structured format. Unfortunately, each website is built with a different layout, which in turn requires a different structure class.
• A global web classifier with a pre-defined training set can help improve the framework, because many websites belong to the same category, share common layouts, and have semi-structured data that could be processed by a certain extractor within the framework.
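A minimal sketch of the manual-metadata idea from the list above, assuming BeautifulSoup; all selector names in the descriptor are illustrative assumptions:

```python
from bs4 import BeautifulSoup

# An end-user supplies a layout descriptor, and one generic extractor
# serves any page that has such a description. Selector names are
# illustrative only.
descriptor = {
    "record": ("div", {"class": "product-title"}),
    "fields": {
        "title": ("a", {}),
        "price": ("span", {"class": "price"}),
    },
}

def generic_extract(html, descriptor):
    """Extract structured records from HTML using a layout descriptor."""
    soup = BeautifulSoup(html, "html.parser")
    tag, attrs = descriptor["record"]
    records = []
    for node in soup.find_all(tag, attrs=attrs):
        record = {}
        for field, (ftag, fattrs) in descriptor["fields"].items():
            match = node.find(ftag, attrs=fattrs)
            record[field] = match.get_text(strip=True) if match else None
        records.append(record)
    return records
```

The appeal of this design is that supporting a new website would only require a new descriptor, not a new extractor.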

REFERENCES

[1] Marek Kowalkiewicz, Tomasz Kaczmarek and Witold Abramowicz, "myPortal: robust extraction and aggregation of web content," in VLDB '06: Proceedings of the 32nd International Conference on Very Large Data Bases, 2006, pp. 1219-1222.
[2] Hao Han, Junxia Guo and Takehiro Tokuda, "Towards flexible integration of any parts from any web applications for personal use," in ComposableWeb '09, pp. 69-79.
[3] Hao Han and Takehiro Tokuda, "A method for integration of web applications based on information extraction," in Web Engineering, 2008. ICWE '08. Eighth International Conference, pp. 189-195, doi: 10.1109/ICWE.2008.29.
[4] Jeffrey Wong and Jason I. Hong, "Making mashups with Marmite: towards end-user programming for the web," ACM, New York, NY, USA, 2007, pp. 1435-1444.
[5] Jun Zhu, Zaiqing Nie and Ji-Rong Wen, "Simultaneous record detection and attribute labeling in web data extraction," ACM, New York, NY, USA, 2006, pp. 495-503.
[6] Atsushi Sugiura and Yoshiyuki Koseki, "Internet Scrapbook: automating web browsing tasks by demonstration," ACM, New York, NY, USA, 1998, pp. 6-18.
[7] Fabio Ciravegna, Alexiei Dingli, David Guthrie and Yorick Wilks, "Integrating information to bootstrap information extraction from web sites," in IJCAI-03 Workshop on Information Integration on the Web (IIWeb-03), Acapulco, Mexico, August 9-10, 2003.
[8] Yi-Hsuan Lu, Yoojin Hong and Jinesh Varia, "Pollock: automatic generation of virtual web services from web sites," in Proceedings of the 2005 ACM Symposium on Applied Computing, 2005, pp. 1650-1655.
[9] Jussi Myllymaki, "Effective web data extraction with standard XML technologies," in Proceedings of the Tenth International World Wide Web Conference, Hong Kong, 2001.
