Using XML in a Web-Oriented Information System

Petr Kroha, Faculty of Informatics, TU Chemnitz, 09107 Chemnitz, Germany

Lars Gemeinhardt, Faculty of Informatics, TU Chemnitz, 09107 Chemnitz, Germany

Abstract This contribution describes WEBIS, an information system that updates its data from specific Web pages. The problem that the structure of HTML pages can change unexpectedly is solved by a controlled transformation into XML. The resulting XML documents are stored in an XML repository for later data exploitation.

Keywords Network-based information system, HTML, wrapper, XML repository.

1

Introduction

Information systems need to be fed with data. In principle, the following phases of this process can be distinguished:

• data capture, i.e. identification and recording of source data,

• data entry, i.e. converting source data into a computer-readable form,

• data validation, i.e. checking the data for integrity and consistency,

• data input, i.e. actually accepting the computer-readable data for insertion into the database.

For data capture there are more and more possibilities to obtain data from automated electronic sources, such as databases on remote network servers that offer some kind of API. Very often these databases serve HTML pages intended for a human reader. The data can then be found and read by somebody who surfs the internet and clicks on the interesting items with a mouse.

Another possibility is to visit the URL of interest with a Java program and to process the content of the HTML pages automatically. This automated processing raises the following problems:

• Only HTML pages that guarantee validated data can be used.

• The HTML pages should have a fixed structure, because the semantics of the data extracted from them is given by their position on the page. Any change in the structure of an HTML page can bring chaos into the interpretation of the extracted data.

The first problem, that of a reliable source, can be solved by specifying a list of URLs that are in the public domain and very probably reliable (they do not contain false data or obsolete links), or by paying a subscription fee for HTML pages whose reliability is more or less guaranteed by a provider.

The second problem remains and becomes the main problem, because authors of HTML pages design their pages for human visitors only and feel free to change the structure at any time. To solve this problem, our data cannot be represented as a single value, as in HTML, but must be represented as a pair in which not only the value but also its semantics is given. This is exactly what XML makes possible, and it is the reason why we convert the content of some reliable HTML pages into XML. During this conversion we test whether the specified path through the HTML page is still as defined. If not, the conversion is not performed: the structure of the HTML page has changed, and the path to the data to be filtered out has to be defined anew.

The data captured on the internet as HTML pages are converted into XML, tested, validated, and stored as XML documents in an XML repository that is organized as an XML database. The mapping between HTML pages and XML documents is not necessarily 1-to-1. Often data from several HTML pages are presented in one XML document.

In our project WEBIS (WEB-oriented Information System), data concerning the stock exchange are regularly captured on the internet, processed as described above, and then stored as persistent documents in a temporal XML database. The investigation of historical trends is very important in WEBIS and is supported. Predefined queries containing additional filters are available, but ad hoc queries can also be asked. The goal is that the processing does not require human interaction.

This concept can be used generally, in the sense that everybody can specify and use an individual newspaper tailored to his or her needs, containing current and historical events and information after any form of processing.

The rest of this paper is organized as follows. Related work is described in Section 2. Section 3 explains the conversion from HTML into XML by means of wrappers. Section 4 describes the architecture of the information system WEBIS, and Section 5 the XML repository. The implementation is described in Section 6. Section 7 presents an example. The achieved results and future work are discussed in Section 8.

2

Related Work

There are many related approaches and systems. First, we mention the system XML-Broker [2]. It looks for data on the web using the robust wrapper Jedi, a DOM-based in-memory data warehouse, and a declarative query language for XML documents. This system focuses on input data processing, not on information processing as WEBIS does.

Next, the W4F system has to be mentioned [1]. Before we started to develop our own wrapper generator, we compared the wrapper generators Jedi and W4F. We found that W4F has all the features we need, and finally we got permission from its authors to use it in our system. This wrapper works with HTTP and SSL, cookies, and passwords; it has an error-tolerant parser, a simple but powerful grammar for extraction rules, good documentation, support by a wizard, etc. The functionality of the wrapper is specified using rules for retrieval, rules for extraction, and rules for mapping.

The next problem, how to store an XML repository as a database, is analyzed in [3]. There the idea of converting the XML schema into a database schema can be found (Fig. ??). Besides converting the schema, converting documents can also be necessary, depending on the storage structures of the database (Fig. ??). It has been commonly accepted that object-oriented databases offer some interesting features for storing highly structured objects. XML documents can be seen as objects in the PDOM representation (see the next section) and stored in an object-oriented database. Such a solution is offered by Ozone-DB. Another solution [2] uses a serialized XML document stored in a file.

As query languages [4] there is a choice between XQL/XPath, XML-QL, Quilt (implemented as Kweelt), and XSL. In general, the query languages suitable for querying XML documents form two groups: document-oriented languages and data-oriented languages. Document-oriented languages (XQL/XPath) cannot build a new structure from extracted data; data-oriented languages (XML-QL, Quilt, XSL) can.

3

From HTML-data to XML-data by using wrappers

HTML is an application of SGML (Standard Generalized Markup Language), and XML is a simplified subset of SGML. HTML is simpler. Unlike XML documents, HTML documents need not be well-formed (there need not be a closing tag for every opening tag). HTML focuses on the presentation of data without carrying their semantics. In XML, a semantic structure can be defined using a DTD (Document Type Definition). To illustrate this difference, consider the description of a table. In HTML, the format of the table is described (’TABLE’, ’TR’, ’TD’, ...), but the meaning of the values in the columns is contained only in the text of the column headers. In XML, the encoded meaning of the values in a column is part of the structure. We can speak about semistructured or irregularly structured data, because semantically corresponding data can be found even in documents written in different formats and patterns. In discussing XML, the following aspects need to be explained:

• How to access elements of an XML document. In principle there are two possibilities. The first, SAX (Simple API for XML), drives the processing of an XML document by events raised by the XML processor. Access to elements is only sequential, i.e. the document tree is linearized and traversed once. This has some important advantages, e.g. it requires less memory than the alternative, but it is not suitable for our purpose of processing stored data. The alternative, DOM (Document Object Model), generates and stores the complete object hierarchy. In our project we have to store XML documents persistently; this variant is denoted PDOM (Persistent Document Object Model).

• How to present XML documents. The languages XSL (eXtensible Stylesheet Language) and CSS (Cascading Style Sheets) are available.

• How to chain XML documents. Either XLink or XPointer can be used.

• How to describe the semantic structure. Usually a DTD is used, but because of some of its weaknesses XML Schema is currently used instead.
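The difference between the two access styles named in the first aspect can be illustrated with a short sketch. It uses Python's standard `xml.sax` and `xml.dom.minidom` modules rather than the Java libraries used in WEBIS. The document `DOC`, its element names, and its values are invented for illustration only.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical document; element names and values are made up.
DOC = "<quotes><index name='DAX'>6123.5</index><index name='Nasdaq'>2101.0</index></quotes>"


class NameCollector(xml.sax.ContentHandler):
    """SAX: react to events while the document streams by; no tree is kept."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, tag, attrs):
        # Called once per opening tag, in document order.
        if tag == "index":
            self.names.append(attrs["name"])


def names_via_sax(text):
    """Collect index names sequentially with an event handler."""
    handler = NameCollector()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.names


def names_via_dom(text):
    """DOM: build the complete tree first, then navigate it in any order."""
    tree = minidom.parseString(text)
    return [e.getAttribute("name") for e in tree.getElementsByTagName("index")]


print(names_via_sax(DOC))
print(names_via_dom(DOC))
```

Both functions return the same names. The SAX variant never holds more than the current event in memory, while the DOM variant keeps the whole hierarchy available, which is the property a persistent variant such as PDOM builds on.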

To get an XML document from one or more HTML pages we need a program called a wrapper, which has to solve the following problems:

• To cooperate with the protocols HTTP and SSL.

• To understand HTML (including all formatting data) and generate XML.

• To find and extract the data described in the query even when the structure of the HTML page changes.

• To access HTML pages protected by username and password.

• To navigate through portals to the data using links.

Writing a wrapper is a routine but complex task. Usually wrapper generators are used, and we did so as well.

Figure 1. Basic architecture of WEBIS (diagram not reproduced; labels in the figure: web, wrapper-generator, XML/HTML wrapper, web-monitor, wrapper-invocation, freshness-rules, XML query, mediator, repository, maintenance-rules, configuration, main-controller, interpretation-rules, interpretation (transformation data -> information), representation-rules (styles), configuration (user-feedback), HTML web, GUI, user-interface)
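The core of the conversion described in this paper, extracting a value by its path and refusing the conversion when the path no longer holds, can be sketched as follows. This is a minimal illustration in Python, not the rule language of W4F. The names `PageTreeBuilder`, `resolve`, `wrap`, the rule format, and the sample page with its value are all hypothetical, and the tiny tree builder assumes well-nested HTML.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class Node:
    """Minimal element node: tag name, child elements, collected text."""

    def __init__(self, tag):
        self.tag, self.children, self.text = tag, [], ""


class PageTreeBuilder(HTMLParser):
    """Builds a simple element tree; assumes well-nested HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def resolve(node, path):
    """Follow (tag, index) steps; return None if any step is missing."""
    for tag, index in path:
        matches = [c for c in node.children if c.tag == tag]
        if index >= len(matches):
            return None
        node = matches[index]
    return node


def wrap(html, rules):
    """Convert HTML to XML; return None when a rule's path no longer holds."""
    builder = PageTreeBuilder()
    builder.feed(html)
    builder.close()
    record = ET.Element("quote")
    for name, label, label_path, value_path in rules:
        label_node = resolve(builder.root, label_path)
        value_node = resolve(builder.root, value_path)
        if label_node is None or value_node is None or label_node.text != label:
            return None  # page structure changed: refuse the conversion
        ET.SubElement(record, name).text = value_node.text
    return ET.tostring(record, encoding="unicode")


# Made-up page and rule: the label cell must read "DAX", the value sits next to it.
PAGE = "<html><body><table><tr><td>DAX</td><td>6123.5</td></tr></table></body></html>"
ROW = [("html", 0), ("body", 0), ("table", 0), ("tr", 0)]
RULES = [("dax", "DAX", ROW + [("td", 0)], ROW + [("td", 1)])]

print(wrap(PAGE, RULES))
```

Checking a label cell next to the value is one simple way to test that the path still means what it meant when the rule was defined. If the page gains a column, the label check fails and `wrap` returns `None` instead of storing a value with the wrong semantics.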

4

Architecture of WEBIS

The architecture of WEBIS is shown in Fig. 1. A wrapper (constructed by a wrapper generator) finds on the internet the HTML pages of interest and the data on these pages that are to be extracted, transforms them into an XML document, and stores it in the XML repository. The wrapper is started (wrapper invocation) by the Web-monitor, which has a schedule of how often the repository should be updated. The wrapper thus completes all phases of input data processing mentioned in Section 1. The other parts of WEBIS serve querying. The user interface supports the user in formulating a query, which is interpreted in cooperation between the Interpreter and the Repository. Results are shown through the user interface using some style (representation rules). An important part of the system is the Configurator. It must hold information about the HTML pages (URL) and the data to be extracted from them (path), refresh rules for configuring the Web-monitor, maintenance rules for the repository (insert, update, delete), and interpretation rules for the Interpreter, which produces information from data. The Configurator makes WEBIS flexible.

5

XML repository

The architecture of the XML repository is shown in Fig. 2. Because we ultimately wanted to use both products (Ozone-DB and GMD IPSI), we have written a document manager that cooperates with a DOM interface and thus makes the storage management transparent to the repository management.

Figure 2. Basic architecture of the XML repository

Figure 3. Class hierarchy of the XML repository

(Diagrams not reproduced. Labels recoverable from Figures 2 and 3: XML-repository, query-management, repository-management, document-management, storage-management, XQL (GMD), XML-QL (AT&T), XPath (xalan), Quilt (Kweelt), XSL (xalan), DOM-interface, PDOM (GMD), PDOM (ozone), PDOM-implementation, ozone database, storage DOM, DOM ozone, GMD, filesystem (local / remote), instance-manager, instance-manager-ozone, instance-manager-GMD, XML-repository-default, XML-repository-ozone, XML-repository-GMD, document-manager, document-manager-ozone, document-manager-GMD; legend: interface, class, generalization, uses, physical use, logical use.)

6

Implementation

The information system WEBIS has been written in Java/JDK using the compiler ”jikes” instead of the standard ”javac”. We took advantage of the fact that many parts of such a system are already available for integration; whenever we found a suitable software stone for our mosaic, we asked its authors for a license. We have used: Ozone-DB and GMD IPSI as databases, AT&T XML-QL, AT&T Strudel, XML4J, OROMatch for XML-QL, Xalan for XSL and XPath, Kweelt for the query language Quilt, SUN Project for the XML parser, and Javaregex for the wrapper W4F.

7

Example

As an example, we have configured WEBIS to collect data about the stock exchange. There are reliable, public-domain HTML pages (e.g. cbs.marketwatch.com), and the semantics of these data is simple. The definitions for the wrapper generator were created with the W4F Wizard. The query was formulated in the language Quilt (Kweelt), and the results were formatted by a style described in XSL. Finally, we could use Microsoft Internet Explorer to display the results on the screen. We were looking for the following data: Dow Jones Industrial Average index, Nasdaq index, S&P index, US 10-year rates, DAX index, Euro Stoxx 50 index, Nemax 50 index, the Euro/$ exchange rate, and lists of positive and negative earnings surprises of the NYSE. Fig. 5 relates to the last data source. Results are shown in Fig. 6.
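To illustrate how a historical-trend query over stored documents might look, the following sketch builds a new result document from several snapshots, in the spirit of the data-oriented query languages discussed in Section 2. It uses Python's `xml.etree.ElementTree` rather than Quilt/Kweelt. The snapshot layout, the function name `history`, and all dates and values are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical repository content: one snapshot document per capture time.
# Dates and values are made up.
SNAPSHOTS = [
    "<snapshot date='2001-03-01'>"
    "<index name='DAX'>6200.0</index><index name='Nasdaq'>2180.0</index>"
    "</snapshot>",
    "<snapshot date='2001-03-02'>"
    "<index name='DAX'>6150.0</index><index name='Nasdaq'>2120.0</index>"
    "</snapshot>",
]


def history(snapshots, index_name):
    """Build a new <history> document holding one index across all snapshots."""
    result = ET.Element("history", name=index_name)
    for text in snapshots:
        doc = ET.fromstring(text)
        # ElementTree supports a limited XPath subset, including attribute tests.
        for node in doc.findall(f"./index[@name='{index_name}']"):
            ET.SubElement(result, "value", date=doc.get("date")).text = node.text
    return result


dax = history(SNAPSHOTS, "DAX")
print(ET.tostring(dax, encoding="unicode"))
```

The path expression only selects nodes, which is what a document-oriented language offers. Assembling the selected values into a new `<history>` element is the step that requires a data-oriented language, or, as here, surrounding program code.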

8


Conclusions and further work

Our goal was not to develop a commercial product but to obtain a tool for investigating the extraction of data from the internet and for investigating information systems based on XML databases. An information system of this kind needs reliable, easily accessible data whose semantics can be easily understood. Data about stocks are in the public domain and suitable for this purpose. However, WEBIS can be used for any purpose. It is only necessary to change the specification of the wrapper so that it names the HTML pages and the data to be extracted; the Web-monitor has to know how often the input data should be refreshed, and the Interpreter needs rules for converting data to information.
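The three things that have to be respecified for a new application domain (pages and paths, refresh rules, interpretation rules) can be pictured as one configuration record per source. The sketch below is an assumption about how such a record might look, written in Python. The class `Source`, its field names, the path notation, the example URLs, and the `trend` rule are all hypothetical and are not taken from the WEBIS implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Source:
    """One configured data source (all field names are illustrative)."""
    url: str                                  # HTML page to visit
    path: str                                 # where the value sits on the page
    refresh_seconds: int                      # freshness rule for the Web-monitor
    interpret: Callable[[float, float], str]  # interpretation rule: data -> information
    last_run: float = float("-inf")           # time of the last wrapper invocation


def due(sources, now):
    """Return the sources whose freshness rule says they must be refreshed."""
    return [s for s in sources if now - s.last_run >= s.refresh_seconds]


def trend(previous, current):
    """Example interpretation rule: turn two raw values into a statement."""
    if current > previous:
        return "rising"
    return "falling" if current < previous else "unchanged"


# Placeholder URLs and paths; times are seconds on an arbitrary clock.
sources = [
    Source("http://example.org/dax", "html.body.table[0].tr[1].td[2]", 3600, trend),
    Source("http://example.org/nasdaq", "html.body.table[0].tr[2].td[2]", 60, trend,
           last_run=100.0),
]

for source in due(sources, now=130.0):
    print("refresh", source.url)
```

A monitor loop would call `due` periodically, invoke the wrapper for each returned source, and update `last_run`. Moving WEBIS to another domain then amounts to replacing the list of `Source` records.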

Figure 4. Source HTML-page


Figure 5. Filtered data

References

[1] F. Azavant, A. Sahuguet. W4F User Manual. Tropea, 2000.
[2] Ronald Bourret. XML and Databases. TU Darmstadt, 1999.
[3] Ronald Bourret. XML Database Products. TU Darmstadt, 2000.
[4] M. Fernandez, J. Simeon, P. Wadler. XML Query Languages: Experiences and Exemplars. Bell, 2000.
[5] L. Gemeinhardt. Ein web-orientiertes Informationssystem. M.Sc. Thesis, Faculty of Informatics, TU Chemnitz, 2001.
[6] Jonathan Robie. XQL FAQ. Ibiblio, 2000.
[7] Arnaud Sahuguet. Querying XML in the New Millennium. Upenn, 2000.
