Building Databases with Information Extracted from Web ... - FIng

Building Databases with Information Extracted from Web Documents

Alejandro Gutiérrez, Regina Motz and Daniel Viera✼
Instituto de Computación, Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay
e-mail: [gutierre, rmotz]@fing.edu.uy, [email protected]

✼ This work was partially done while the author was working at Tecnologia Informática, Uruguay.

Abstract

We propose a mechanism to gather information about a user-specified domain from HTML documents and to store it in a local database. We also analyze how suitable this mechanism is for automatically propagating changes occurring in the structure of the user-specified domain as well as changes occurring in web documents.

1 Introduction

The World Wide Web (WWW) has become a major source of information about all areas of interest. Information brokers and global information management systems allow users to classify documents and offer capabilities for retrieving whole documents. However, in the case of integrated information needs, such as finding the cheapest computer among several online merchants or compiling a table of the films exhibited in each city of a region, users need to extract, synthesize and maintain information from several documents. These tasks require effort to organize the interesting data in a convenient form, and time to periodically download those documents in order to incorporate changes.

The aim of this work is to provide assistance in accomplishing those tasks. To this end, we describe a mechanism for instantiating a user-specified domain with information extracted from web documents. We also analyze how suitable that mechanism is for automatically propagating changes occurring in the structure of the user-specified domain as well as changes occurring in web documents.

The proposed mechanism gathers information about the user-specified domain from HTML documents and stores it in a local database. Using a local database allows users to submit queries, to further process the information and to increase the availability of the information. We use an object-oriented schema to represent the user-specified domain and to build the local database. The information extraction process is semiautomatic. It obtains data from HTML documents by means of a mapping, given by the user, between elements of the database schema and elements of the document, such as tables and lists.

We use existing techniques developed in the database area, such as those proposed in [HGMC+97, HFAN98] for extracting information from web documents, which are based on a language for the specification of extraction patterns that allows structured objects to be built from information contained in HTML documents. The mechanism described in this paper focuses on particular HTML markup (tables and lists) to enable an interactive way of defining the mapping between the elements in the web document and the entities in the user domain. The user domain is given by an object-oriented database schema. Once the mapping is defined, it is used to add the document's contents to the local database. Thus, the user can manipulate the local database and obtain information from web documents while taking advantage of a database-structured representation.

The rest of the paper is organized as follows. Section 2 presents the general approach and its components. Section 3 gives details about the information extraction process and presents an example applying the mechanism. Section 4 discusses a first analysis of the Change Management Process. Finally, Section 5 presents some conclusions and future work.

2 Our approach

Figure 1 shows the main components of our mechanism for extracting and dynamically maintaining data from HTML documents [Vie99]. There are two processes. The extraction process consists of the extraction and structuring of information from HTML documents, while the change management process performs the detection of changes in the source documents and their propagation to a local database.

Figure 1: The components of the mechanism. [Diagram labels: database schema; HTML documents; Extraction Process; mapping definition; Change Management Process; Mapping; Schema; Document-structure; user-specific domain (database); query, postprocessing, ...]

The mechanism we are proposing is intended to help users build a structured local store that integrates a set of previously identified HTML documents (e.g. after using a search tool or as the result of gathering documents provided by other people). By means of a database schema, an administrator user defines a structured representation of the domain of interest. That is, the schema provides a description of the relevant information the user wants to extract from web documents. Given an HTML document and a schema, the extraction process presents the table and list elements found in the document to the user in order to establish a mapping between those elements and the schema. Using this mapping, the system instantiates the local database with the relevant information from the web document. Periodically, the change management process retrieves the web documents and detects simple changes in them by using the mapping. So far, we have focused mainly on the definition and implementation of the extraction process.

3 The Extraction Process

The main task of the extraction process is the retrieval and structuring of the information contained in HTML documents in order to add it to a local database. Three elements are involved in that task: HTML documents, a schema representing the user-specific domain, and a mapping between the schema components and the information in the documents.

The schema is defined by the user. We chose ODMG [CB97], the standard for object-oriented databases, as the data model because of its rich set of modeling structures. Concerning the documents, we focus on easing the extraction of tables and lists, which appear embedded in the HTML code together with other kinds of data. The mapping is also defined by the user by means of a graphical user interface (GUI).
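To make the three elements concrete, the following is a minimal sketch in Python, not the Jedi-based implementation of the prototype. It assumes a hypothetical page fragment with one table, collects every table of a page in display order, and then applies a mapping from relational column names to (table index, column index) pairs. The names TableExtractor, apply_mapping and PAGE are illustrative assumptions, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects every <table> of a page as a list of rows of cell strings.

    Missing </td> and </tr> tags are tolerated: a new cell or row start
    implicitly closes the previous one. Nested tables are not supported.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, in display order
        self._rows = None  # rows of the table being read, or None
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._close_row()
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()
        elif tag == "table" and self._rows is not None:
            self._close_row()
            self.tables.append(self._rows)
            self._rows = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        if self._cell is not None:
            # Join the text pieces and normalize whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            if self._row:
                self._rows.append(self._row)
            self._row = None


def apply_mapping(tables, mapping):
    """Build tuples for one relation from one document table.

    mapping: {relational_column: (table_index, column_index)}.
    Assumption of this sketch: all columns come from the same table.
    """
    table_ids = {table_index for table_index, _ in mapping.values()}
    if len(table_ids) != 1:
        raise ValueError("this sketch maps one relation to one document table")
    rows = tables[table_ids.pop()]
    return [
        {name: row[col] for name, (_, col) in mapping.items()}
        for row in rows
        if all(col < len(row) for _, col in mapping.values())
    ]


# Hypothetical page fragment; the first row omits its closing tags on purpose.
PAGE = """
<table>
  <tr><td>Cinemetro 1<td>El Principe de Egipto
  <tr><td>Ejido 1</td><td>La vida es bella</td></tr>
</table>
"""

extractor = TableExtractor()
extractor.feed(PAGE)
extractor.close()
plays = apply_mapping(
    extractor.tables, {"cinema_name": (0, 0), "film_name": (0, 1)}
)
```

Keeping the mapping as plain data, separate from the parser, reflects the idea that the extractor is purely syntactic and that meaning comes only from the user-defined mapping.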

3.1 Architecture

Figure 2 describes the architecture used in the extraction process. The ODMG schema specification is recognized by a parser and presented to the user. Similarly, information about the data contained in tables and lists of the given HTML documents is extracted by a table and list extractor and also presented to the user. More precisely, this module extracts information about the structure of the given documents (e.g. the columns of a table) together with a sample of the data contained in them.

Both the ODMG parser and the table and list extractor are syntactic tools that only build intermediate objects representing the ODMG schema and the relevant information contained in the documents. So, we need a way to give meaning to the information extracted from a document with respect to the specific domain. This is done by defining a mapping between the ODMG schema and the structures extracted from the HTML document. Based on this mapping, information is loaded into the local database.

Figure 2: Architecture of the extraction process. [Diagram labels: odmg schema; HTML documents; odmg parser; table and list extractor; mapping definition; schema-documents mapping; user-specific domain (database)]

3.2 The Prototype

We implemented a prototype in Java based on the architecture shown in Figure 2. It basically consists of three modules, one for each of the components. The ODMG parser and the table and list extractor are written using Jedi (Java Extraction and Dissemination of Information) [HFAN98]. Jedi provides a language that incorporates definitions of parsing rules, which eases the task of programming parsers and processing the parsed data. The parsed data can be used to build Java objects. An interesting feature of Jedi is that parsing rules are "fault tolerant". This means flexible parsing, which is important when dealing with HTML documents, since in practice many of them do not fully comply with the HTML grammar.

Due to implementation considerations, we work with relational databases. Thus, the ODMG parser also translates the given schema into a relational one. This translation is designed using techniques described in [Ambl98]. Finally, the mapping is defined by the user through a GUI where the correspondence between data in tables and lists of the document and the relational schema can be defined.

More details about the prototype can be found in [Vie99].
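The translation of an object schema into a relational one can be pictured with a small Python sketch. It is not the prototype's translator and does not claim to reproduce the rules of [Ambl98]; it assumes one table per class and one associative table per many-to-many relationship. The SCHEMA dictionary, the function name to_relational and the direct assignment of SQL column types are hypothetical simplifications, and sqlite3 is used only to check that the generated statements are accepted by a database engine.

```python
import sqlite3

# Hypothetical in-memory description of an object schema about cinemas and
# films. In the prototype this information would come from the ODMG parser.
SCHEMA = {
    "classes": {
        "Cinema": {"cinema_name": "varchar(30)", "phone": "varchar(10)"},
        "Film": {"film_name": "varchar(30)"},
    },
    # (relationship table, (class, key attribute), (class, key attribute))
    "many_to_many": [
        ("Plays", ("Cinema", "cinema_name"), ("Film", "film_name")),
    ],
}


def to_relational(schema):
    """Return CREATE TABLE statements: one per class, one per relationship."""
    statements = []
    for cls, attrs in schema["classes"].items():
        cols = ", ".join(f"{name} {sqltype}" for name, sqltype in attrs.items())
        statements.append(f"create table {cls} ({cols})")
    for table, *ends in schema["many_to_many"]:
        cols = []
        for cls, key in ends:
            # The associative table repeats the key column of each class.
            sqltype = schema["classes"][cls][key]
            cols.append(f"{key} {sqltype} references {cls}({key})")
        statements.append(f"create table {table} ({', '.join(cols)})")
    return statements


ddl = to_relational(SCHEMA)

# Run the statements against an in-memory database to check they are valid.
conn = sqlite3.connect(":memory:")
for statement in ddl:
    conn.execute(statement)
tables = [
    row[0]
    for row in conn.execute(
        "select name from sqlite_master where type = 'table' order by name"
    )
]
```

Generating plain SQL text, rather than creating the tables directly, matches the idea of producing a schema file that can be loaded into any relational database.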

Figure 3: Web Page containing information about films and cinemas

3.3 An Example

This section describes an example where data is extracted from a web document in order to instantiate a user schema about the films exhibited in the cinemas of a city. The page shown in Figure 3 displays information about the films being exhibited at all the cinemas of a city. The information includes, among other things, each cinema's telephone number and the times at which each film is shown.

Assume that the following ODMG schema represents the user domain about cinemas and the films they exhibit.

    interface Cinema ( key CinemaName ) {
        attribute String cinema_name;
        attribute String phone;
        relationship Set<Film> Plays inverse Film::is_being_played_at;
    }

    interface Film ( key FilmName ) {
        attribute String film_name;
        relationship Set<Cinema> is_being_played_at inverse Cinema::Plays;
    }

The parser recognizes the ODMG schema and transforms it into a relational schema. The table and list extractor scans the document and extracts the table of cinemas contained in it, together with a sample of its associated data. The relational schema and the sample data from the page are then shown to the user in order to define the corresponding mapping between them. This step is shown in Figure 4. The relational schema is shown at the top of the screen, and the data retrieved from the page is shown in the table below it. For each column of each relational table, the user indicates its corresponding column and the document table from which it will be extracted. For this purpose, the user assigns to each column of each relational table a number that identifies the document's table and the column within it; tables are numbered according to their display order in the document. This interface was chosen only for testing the extraction process; user interface functionality has not been taken into account at this point.

Figure 4: Schema-Document Mapping

As output, this process returns two files: one containing SQL statements to create the relational schema, and another containing delimited text that is used to load the instance of the schema into the database. The following is the output for the running example.

    create table Cinema (
        cinema_name varchar(30),
        phone varchar(10)
    );

    create table Film (
        film_name varchar(30)
    );

    create table Plays (
        cinema_name varchar(30),
        film_name varchar(30),
        constraint FK_plays_Name_Cinema foreign key (cinema_name) references Cinema(cinema_name),
        constraint FK_plays_Name_Film foreign key (film_name) references Film(film_name)
    );

Below, we show a subset of the instance for the Plays table (fields are separated by the character ':').

    Cinemetro 1 : El Principe de Egipto
    Cinemetro 1 : Rescatando al soldado Ryan
    Cinemetro 2 : Patch Adams
    De las Americas : Todavia se lo que hicieron el verano pasado
    De las Americas : Haloween H20
    Ejido 1 : La vida es bella
    Ejido 2 : El divino Ned
    Ejido 3 : Bichos

Given the local database about films and cinemas, we can now obtain, for instance, the cinemas that exhibit the greatest number of films by means of the following SQL expressions.

    create view Cinema_NbofFilms (cname, phone, nboffilms) as
        select C.cinema_name, C.phone, count(*)
        from Plays P, Cinema C
        where P.cinema_name = C.cinema_name
        group by C.cinema_name, C.phone;

    select cname, phone
    from Cinema_NbofFilms C1
    where not exists (select *
                      from Cinema_NbofFilms C2
                      where C1.nboffilms < C2.nboffilms);

4 The Change Management Process

The extraction process consolidates data from web pages into a local database guided by an ODMG schema. However, an important characteristic of the web is its volatility, since the data and structure of pages may change at any time. In addition, data from documents is consolidated into an ODMG schema provided by the user in accordance with his/her requirements. Therefore, when the user's requirements change, the local database schema will also change.

In this section we first point out the possible ODMG schema changes. We then analyze the ways in which a web page can change and the possibilities for propagating these modifications to the local database using the proposed mechanism.

4.1 Changes on ODMG Schema

Databases are often subject to evolution at the schema level. Schema evolution usually happens for one of two reasons: (i) to correct a poor schema design by removing anomalies and redundancies from the schema, or (ii) because the domain being modeled is evolving. There has been active research on object-oriented schema evolution (see [BKKK87, Lerner95, Breche96, Bena99]). In general, structural modifications of an object-oriented database are grouped into two types: (i) changes to a class, and (ii) changes to the class lattice. Changes to a class consist of changes to its properties, such as changing the name or domain of an attribute, adding or dropping an attribute or method, etc. Changes to the class lattice include, e.g., adding or dropping a class, changing the superclass/subclass relationship between a pair of classes, and converting a relationship into a class.

For a finer distinction we follow a taxonomy of schema changes based on the schema evolution operators introduced by Banerjee et al. [BKKK87], which we extend with: (i) complex schema changes, such as objectification of attributes or relationships [Lerner95] and changes involving multiple classes, like merging or generalization of classes [Breche96], and (ii) particular schema modifications specially tailored to ODMG, e.g. encapsulation/de-encapsulation of attributes into/from a struct.

Moreover, it is broadly recognized that schema modifications can be categorized according to their results [Miller93, Berg97]. When changes imply a rebuilding of the schema due to the use of different constructs to represent the same information, we talk about pure structural changes, for example when an attribute is moved to another class. On the other hand, changes may imply a loss or an addition of information capacity, e.g. when removing an attribute from a class or adding a new one, respectively. When modifications imply a loss of information capacity we talk about capacity reducing changes, whereas in the opposite case we talk about capacity augmenting changes.

Figure 5 provides a taxonomy of ODMG schema modifications, where (A) indicates a capacity augmenting change, (R) indicates a capacity reducing change and (P) indicates a pure structural change.

Schema evolution is concerned with the study of schema changes and how they may affect other parts of the database. Accordingly, we define the semantics of each ODMG schema change by explaining its impact on the rest of the schema, on instances and on methods. In the following we present the specification of two schema modifications. The complete specification can be found in [Motz99].

• Changes to a class.
  − Rename a class, its attributes, relationships or operations. (P)
  − Add a new attribute, relationship or operation. (A)
  − Remove an existing attribute, relationship or operation. (R)
  − Change the specification of an attribute. (P)
  − Change an operation's signature. (P)
  − Partition an attribute into several attributes. (P)
  − Merge attributes into one new attribute. (P)
  − Encapsulation of attributes into a struct. (P)
  − De-encapsulation of attributes from a struct. (P)

• Changes to the class lattice.
  − Add a new class. (A)
  − Remove an existing class. (R)
  − Add a specialization edge. (P)
  − Remove a specialization edge. (P)
  − Generalization of classes and relationships. (P)
  − Partition a class into several classes. (P)
  − Merging of classes. (P)
  − Objectification of attributes, struct and relationships. (P)

Figure 5: A Taxonomy of ODMG Schema Changes

Add a new attribute, relationship or operation. The addition of a new attribute, relationship or operation x to an already existing class p is an information augmenting change.

Preconditions: Since the same rules apply to the addition of both an attribute and a relationship, we use here the generic term "property" to refer to both. Obviously, the property or operation to be added must have a distinct name, since we must preserve the uniqueness of properties and operations. Moreover, if the new property redefines an inherited one, the domain compatibility invariant needs to be maintained. This means that the new property must be a specialization of the inherited one. The same invariant must be maintained in all subclasses of the class where the new property is being added, i.e., locally defined properties must be specializations of the new one. (A relationship is more specialized than another if its cardinality is more restrictive than that of the other.)

Schema: This modification has no impact on the rest of the schema.

Instances: This modification leads to an instance modification operation. Adding an attribute to a class implies its addition to all instances of the class and its propagation to all subclasses. Such propagation stops at any class that locally redefines the attribute, because a local property overrides inherited ones. Because null values are not allowed in ODMG, the addition of an attribute to a class must be accompanied by a default value for that attribute. The addition of a relationship or an operation has no impact at the instance level.

Methods: The addition of a new property or operation to a class has no impact on methods when there is no inherited property with the same name in that class. Otherwise, all methods referring to the inherited property are marked in order to warn the developer, since the modification may have changed the behavior of the method. However, there may also be other kinds of methods that need to be modified, for example a method that "prints all attributes of a class". One possible solution is to provide the user with a list of the methods associated with the modified class and let her/him decide which methods require manual adaptation.

Remove an existing class. This modification removes a class p from the class lattice. It is annotated as an information reducing change.

Preconditions: An existing class can be removed only when it is a leaf of the inheritance graph. Removing a class in the middle of the inheritance graph may have two different semantics (and can be represented by two different evolution operations): either recursively remove all subclasses of the removed class, or remove the class and connect all its subclasses to its direct superclasses, whenever possible. Nevertheless, these two operations can be achieved by a combination of specialization edge deletion and class deletion.

Schema: All attributes elsewhere that reference the removed class must also be removed.

Instances: Removing a class implies the deletion of all instances of the class and the modification of all instances of other classes that refer to the deleted ones.

Methods: Removing a class requires removing all methods that directly or indirectly reference the class, either through one of its properties or through its name.

Schema modifications can be determined automatically, for example by comparing the local schema with the evolved one. Work in this direction has been carried out by Ambite and Knoblock [AK95] and by Goni et al. [GIMB97]. Here we assume that the modifications are supplied as part of the input.
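The idea of determining schema modifications by comparing the local schema with the evolved one can be illustrated with a deliberately naive Python sketch. It is not the approach of [AK95] or [GIMB97]. It assumes that a schema is reduced to a dictionary from class names to sets of attribute names, and it labels each detected change with the (A) or (R) annotation that Figure 5 gives to the corresponding operation. The function name diff_schemas and the attribute address and class Director in the example are hypothetical.

```python
def diff_schemas(old, new):
    """Compare two schemas given as {class_name: set_of_attribute_names}.

    Returns (operation, target, category) triples, where category is
    "A" for capacity augmenting and "R" for capacity reducing changes.
    A purely name-based comparison cannot recognize pure structural
    changes: a rename, for instance, is reported as a removal plus an
    addition.
    """
    changes = []
    for cls in sorted(new.keys() - old.keys()):
        changes.append(("add class", cls, "A"))
    for cls in sorted(old.keys() - new.keys()):
        changes.append(("remove class", cls, "R"))
    for cls in sorted(old.keys() & new.keys()):
        for attr in sorted(new[cls] - old[cls]):
            changes.append(("add attribute", f"{cls}.{attr}", "A"))
        for attr in sorted(old[cls] - new[cls]):
            changes.append(("remove attribute", f"{cls}.{attr}", "R"))
    return changes


local_schema = {
    "Cinema": {"cinema_name", "phone"},
    "Film": {"film_name"},
}
evolved_schema = {
    "Cinema": {"cinema_name", "address"},  # phone dropped, address added
    "Film": {"film_name"},
    "Director": {"name"},                  # new class
}

changes = diff_schemas(local_schema, evolved_schema)
```

The limitation noted in the docstring is one reason why supplying the modifications explicitly as input, as assumed here, is the simpler option.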

Propagation of Schema Modifications. We observe that, in general, the propagation of schema modifications can be accomplished by adopting one of the following alternatives, according to the information capacity of the changes:

• When a capacity reducing change occurs: The information extracted in previous steps is still sufficient; in fact, there is more information than is now required. In this case it is not necessary to rewrite the mappings.

• When a pure structural modification occurs: The mappings must be revised because the changes may induce a restructuring; e.g., if a table must be partitioned, then the mapping of the previous columns must be rewritten.

• When a capacity augmenting change occurs: The extraction process must be restarted from the beginning in order to identify the newly required data to be extracted. For example, when a new class is added to the ODMG schema, the extraction process needs to parse the page again, searching for information that can be mapped to this class.

4.2 Changes on Web Documents

We identify the following kinds of web document modifications:

• Changes that affect display. Examples of this kind of modification are changes to fonts, to background colors, or to the position of a table in the page. Changes of this kind affect the HTML code. However, they should not be taken as relevant changes since they do not affect data or structure.

• Changes that affect the structure of web documents. This is the case when tables and lists change their structure by adding or deleting columns, when a table disappears from a page, or when new ones are inserted.

• Changes that affect the data contents of a document.

Propagation of Web Document Modifications. Changes that affect display are easy to detect. On the other hand, changes that affect the structure of a web document are more complicated. The main problem is to define when two tables are equal, that is, when the current document and the previous one refer to the same concept in the user's domain. Inspecting the pure structure of the page, i.e. the number of tables and columns it contains, is not enough. For example, a page containing a table with three columns representing the code, description and price of a product can be modified into a page containing a table that also has three columns, but representing the code, description and stock of the products.

The schema-document mapping defined by the user gives semantics to the web document according to the ODMG schema (even when the columns do not have titles). In the previous example, the user maps the three columns representing code, description and price to the corresponding attributes in the ODMG schema. The new table in the modified document also has three columns, but are these columns semantically equivalent to the previous ones? Because the semantics of the table is given only by the user-defined mapping, the system needs the new mapping in order to recognize the user's new domain and in that way detect the change. Therefore, the detection of a change can be performed only by comparing the user-defined mappings. This means that the mechanism as defined so far cannot support the propagation of such changes. The problem is even worse when the columns do not have titles.

Therefore, in order to propagate document modifications to the local database, we need to manage not only the pure structure of the page but also "descriptors" of the tables, where by descriptors we mean either the titles or the semantics of the columns.

5 Conclusions

The main contribution of the paper lies in the provision of a mechanism for the semi-automatic generation of a local database that describes a user-specified domain and is instantiated from HTML sources. We also analyzed the problems of detecting and propagating schema and web-document changes to the local database.

In particular, the mechanism proposes an extraction process that allows building a local database representing the user's domain from data contained in tables and lists of HTML documents. This extraction process deals not only with tables whose columns have titles, but also with tables whose columns have no titles. This flexibility is achieved by allowing the user to define a mapping between the columns of the document's tables and the elements of the required schema. From the analysis of the propagation issues, we identify the need to extend the current mechanism. In this direction, we are working on adding an ontology to the system to support semantic equivalence between concepts. We believe that this approach will bring us a step closer to the challenge of automatically generating schema-document mappings and propagating changes.

Acknowledgments

We would like to thank Gonzalo Varalla (Chasque), Gabriel Fialco (Tecnología Informática) and Raul Ruggia (Instituto de Computación) for their support in the initial discussion of the engineer's degree project.

References

[AK95] J. L. Ambite and C. A. Knoblock, "Reconciling Distributed Information Sources", Working Notes of the AAAI Spring Symposium on Information Gathering in Distributed Heterogeneous Environments, Palo Alto, CA, 1995.

[Ambl98] S. Ambler, "Mapping Objects to Relational Databases", an AmbySoft Inc. white paper, 1998. http://www.AmbySoft.com/mappingObjects.html

[Bena99] B. Benatallah, "A Unified Framework for Supporting Dynamic Schema Evolution in Object Databases", Proc. of the Entity Relational Conference, 1999.

[BKKK87] J. Banerjee, W. Kim, H. Kim and F. Korth, "Semantics and Implementation of Schema Evolution in Object-Oriented Databases", in SIGMOD Record (Proc. Conf. on Management of Data), 16(3), 1987, pp. 311-322.

[Berg97] P. L. Bergstein, "Maintenance of Object-Oriented Systems during Schema Evolution", Theory and Practice of Object Systems, 3(3), 1997, pp. 1-28.

[Breche96] P. Breche, "Advanced Primitives for Changing Schemas of Object Databases", in CAiSE'96, May 1996.

[CB97] R. Cattell and D. Barry, "Object Database Standard: ODMG 2.0", Morgan Kaufmann, 1997.

[GIMB97] A. Goni, A. Illarramendi, E. Mena and J. M. Blanco, "Monitoring the Evolution of Databases in Federated Relational Database Systems", in CAiSE'97 Workshop on Engineering Federated Database Systems (EFDBS'97), June 1997.

[HGMC+97] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha and A. Crespo, "Extracting Semistructured Information from the Web", Proc. of the Workshop on Management of Semistructured Data, Tucson, Arizona, May 1997.

[HFAN98] G. Huck, P. Fankhauser, K. Aberer and E. Neuhold, "JEDI: Extracting and Synthesizing Information from the Web", COOPIS 98, New York, August 1998. IEEE Computer Society Press.

[Lerner95] B. Lerner, "A Model for Compound Type Changes", Technical Report 95-095, University of Massachusetts at Amherst, October 1995.

[Miller93] R. J. Miller, Y. E. Ioannidis and R. Ramakrishnan, "The Use of Information Capacity in Schema Integration and Translation", in Proc. of the 19th VLDB Conf., Dublin, Ireland, 1993.

[Motz99] R. Motz, "Mantenimiento Dinamico de un Esquema Integrado", PhD Thesis, Darmstadt University, Germany (forthcoming).

[Vie99] D. Viera, "Extracción y Mantenimiento Dinámico de Datos de la Web", Engineer's Degree Final Project, Facultad de Ingeniería, UdelaR, Uruguay, April 1999. http://www.fing.edu.uy/~csi/Proyectos/FinGrado/Finalizados/1998/InformeCompleto/1998_inf1.zip
