Web Services for Information Extraction from the Web

Benjamin Habegger
Laboratoire d’Informatique de Nantes Atlantique, University of Nantes, Nantes, France
[email protected]

Mohamed Quafafou
Institut des Applications Avancées de l’Internet, Marseille, France
[email protected]

Abstract
Extracting information from the Web is a complex task made up of different components, which can be either generic or specific to the task: downloading a given page, following links, querying a Web-based application via an HTML form and the HTTP protocol, querying a Web Service via the SOAP protocol, etc. Web Services which execute information extraction tasks therefore cannot simply be hard-coded (i.e. written and compiled once and for all in a given programming language). To build flexible information extraction Web Services we need to be able to compose different subtasks. We propose an XML-based language to describe information extraction Web Services as compositions of existing Web Services and specific functions. The usefulness of the proposed framework is demonstrated by three real-world applications. (1) Search engines: we show how to describe a task which queries Google’s Web Service, retrieves more information on the results by querying their respective HTTP servers, and filters them according to this information. (2) E-commerce sites: we build an information extraction Web Service giving access to an existing HTML-based online e-commerce application such as Amazon. (3) Patent extraction: a last example shows how to describe an information extraction Web Service which queries a Web-based application, extracts the set of result links, follows them, and extracts the needed information from the result pages. In all three applications the generated description can easily be modified and/or completed to respond better to the user’s needs and create value-added Web Services.

1. Introduction
The field of information extraction from the Web emerged with the growth of the Web and the multiplication of online data sources. Indeed, the Web can now be considered the world’s largest database. However, the data contained on the Web, generally in the form of HTML pages, is intended to be viewed in a browser by human users. The languages in which the data is given are presentation languages which give no idea of the semantics of the data contained in these pages. This enormous amount of data is therefore seemingly useless to automated programs. Nevertheless, the presentational format of the data gives clues about the structure of the underlying data. This is especially true when the pages are dynamically generated. In this case the data presented generally comes from a structured database, and the presentational format of the generated pages reflects this structure. This fact makes it reasonable to consider giving machines access to the data. This can lead to many applications, such as creating mediators that access multiple databases, creating information agents, and building finer-grained services which respond more precisely to a user’s information needs.
Existing Web Services might be able to respond to such information needs. However, existing Web Services are black boxes which have been hard-coded, in the sense that they have been written once and for all in a given language. Responding to a wide variety of information needs requires flexibility: building a specific service responding to a specific need should be made easy. This can be done by decomposing an information extraction task into smaller, simple tasks which can easily be automated. Existing research on Web Services has considered the problem of composing Web Services together [4]. In the case of information extraction tasks, however, one also needs to compose with specific components which are not Web Services, for example wrappers for accessing data sources which do not provide Web Services. In this paper we show how to decompose an information extraction task and build a new Web Service from it. We will also see that accessing a Web Service can be only one part of an information extraction task.
A set of basic information extraction operators is described. Their composition makes it possible to build many realistic information extraction tasks, as the given examples will show. In order to describe such tasks and build Web Services executing them, we propose an XML language. This language allows one to describe the set of operators needed for a specific information extraction task and to coordinate them. We have implemented a method which directly executes tasks described in our language.
This paper is organized as follows. Section 2 gives an overview of information extraction from the Web. Section 3 outlines the links between information extraction and Web Services. Section 4 describes how an information extraction task can be decomposed into a set of operators. Section 5 presents our proposal of an XML language for describing information extraction Web Services. In section 6 we present three concrete applications in which we used this language to respond to specific information needs. Finally, in section 7 we conclude and discuss future work.
2. Information extraction from the web
The aim of information extraction from the Web is to give machines access to online Web sources such as search engines, e-commerce sites, bibliography listings, etc. Using such sources normally requires users to fill in an HTML form to express their information needs. The filled-in form makes up a query which is sent to the Web server, which generates one or more result pages presumably containing the answer to the user’s information needs. Making such sources available to automated programs leads to a wide variety of applications, such as building shopping agents, allowing mediators to access Web data sources, or maintaining meta-search engines. To give access to such a source one needs to build a function called a wrapper, which (1) translates the query expressed in the internal language of the application into the source’s language and (2) transforms the resulting generated HTML pages into the application’s internal format [5]. Most research on information extraction from the Web reduces the problem to this wrapper construction task. Most of the time, this problem has itself been reduced to the extraction of the result items from a set of result pages. Existing methods which solve this problem include wrapper induction based on hand-labeled example documents [5, 7, 6], structure discovery [1, 2], knowledge-based wrappers [8] and user-oriented wrapper construction [3].
However, realistic information extraction tasks require more work than just querying and extracting the results. For example, extracting information from a site such as http://dblp.uni-trier.de requires downloading multiple pages which can be reached by following specific links on an index page. One solution might be to download all the pages manually, learn a wrapper for these pages and then apply this wrapper to them. However, if the same task needs to be repeated in the future, the updated pages will again have to be downloaded manually. This example shows that information extraction cannot be reduced to querying and retrieving results.
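The two halves of a wrapper described above, query translation and result transformation, can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the base URL, the `q` parameter name, the `class="result"` markup and the record layout are hypothetical and are not taken from any real source or from the system described in this paper.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


def build_source_query(internal_query, base_url="http://example.org/search"):
    """Step (1): translate an internal query (a dict) into the source's URL syntax.

    The base URL and the "q" parameter are hypothetical.
    """
    return base_url + "?" + urlencode({"q": internal_query["keywords"]})


class ResultParser(HTMLParser):
    """Step (2): collect title/url records from <a class="result"> links.

    The class="result" convention is an assumed page layout.
    """

    def __init__(self):
        super().__init__()
        self.records = []
        self._href = None   # href of the result link currently open, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "result":
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.records.append({"title": title, "url": self._href})
            self._href = None


def extract_results(html):
    """Transform a result page into the application's internal records."""
    parser = ResultParser()
    parser.feed(html)
    parser.close()
    return parser.records
```

A hand-written parser like this one is exactly what wrapper induction methods aim to generate automatically; the sketch only shows the interface such a wrapper exposes.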
A set of pages each containing a list of items. In most cases when extracting a list of result items, for example from e-commerce sites or search engines, the list to extract is contained on one or more pages. Usually the source generates a sequence of pages which the user browses one by one by following a "next" link. Extracting from such a set of pages can often be done by applying the wrapper constructed for one of the pages. However, it happens that some pages do not contain all the formats in which an item can be found. Multiple pages might therefore be needed to construct the wrapper. This does not, however, particularly change the wrapper construction problem. We propose a solution to this problem in [3].
Multiple pages each containing one item. Some online sources, such as the CIA World Fact Book (footnote 1), describe one item per page. Building a wrapper for such sources requires the analysis of a subset of the pages of the source. This is because the construction of a pattern is usually based on generalizations: a single page yields a wrapper that is too specific for the source.
Fully handling a web data source. Up to now, research in information extraction from the Web has been mostly limited to the extraction of results from a set of documents. However, in order to give fully automated access to a Web data source, two other tasks need to be handled: querying the source and retrieving the set of result documents. A simple example is the access to a search engine. First, the user needs to give his/her query terms. Secondly, a first page of results is presented to him/her. The following pages are accessible by successively following a "next" link. To give fully automated access to such online sources one needs to automate these querying and result retrieval tasks as well as the extraction of the results from each page.
3. Web services and information extraction
With the emergence of Web Services, giving computers access to Web-based applications might no longer seem meaningful. This would be true if all services accessible through Web-based applications also offered Web Service access. This is certainly not the case yet and will presumably not be the case in the future, since it implies maintaining two different accesses in parallel: one for applications, the other for browsers. Furthermore, the information returned by a Web Service may not be adapted to the user’s needs or to application-specific needs. This adaptation of information to the user’s needs can be considered an information extraction task. Also, in the context of information extraction, flexibility is necessary to respond to the user’s needs and/or the devices used (e.g. mobile technology). However, in the current state, Web Services are black boxes which are hard-coded, in the sense that they are written in a given programming language and compiled once and for all. This hard coding limits the possible modifications, refinements and reuse of these Web Services. Also, from the user’s point of view, the Web Service may not directly respond to his/her needs. For example, Google’s Web Service offers to query the Web for documents given a set of query terms. One can imagine a case where a user is using the service from a mobile phone and therefore only wants access to documents adapted to this device. Moreover, using such a service as is generates unnecessarily high costs, since it requires downloading unwanted information. Many other examples of specific needs can be imagined. We therefore need to be able to build dedicated services for information extraction.
Existing Web Services can be useful when executing an information extraction task since they give access to information in a computer-accessible manner. Using a Web Service removes the need to analyze generated pages. However, it is still necessary to have access to the semantics of the data generated by the Web Service. This is eased by the availability of a Web Service description (i.e. a WSDL document). We propose to compose existing Web Services with predefined information extraction operators in order to build new information extraction Web Services.

1 http://www.cia.gov/cia/publications/factbook/
4. Decomposition of web information extraction tasks
The Web has evolved from a set of hyper-linked pages to dynamically generated Web sites, and now introduces Web Services. While facilitating the process of information extraction, these evolutions have not yet led to the flexibility necessary to execute realistic information extraction tasks. Firstly, the needed information may not be directly accessible: the user has to fill in forms and follow several links before getting to the information he/she needs. Secondly, it may not be provided in a single place: it is often found on several different sites and displayed on many different pages. Thirdly, it cannot be used as is: the page on which it is found also contains much useless information. Fourthly, a Web Service directly responding to the user’s information needs is not always available: for example, the user might find a Web Service offering TV listings but only be interested in the movies to be broadcast.
An example information extraction task is to query and retrieve the results from an e-commerce site such as Amazon. Figure 1 shows such a task being executed manually. In the first screen the user is connected to the Amazon.fr index page and follows the DVD link, which leads to the second screen. Then he/she follows a link giving access to an advanced query form found on the third screen. There he/she fills in the form by completing the actor field with the string "Robin Williams", submits it and obtains the result page of the last screen. From there he/she has to manually extract the information of interest (DVD title, main actor, date, and price in EUR) for each result. The objective of our work is to allow an easy automation of this process. This can be done by decomposing the complex task into simple elementary tasks such as finding a link, downloading a page, etc.
4.1. Information extraction operators
With each basic subtask we can associate a basic operator. Some operators are generic, such as querying, fetching or parsing, while others are specific to the task. Most of the time the generic set of operators is sufficient to extract the desired information. We have currently identified the following set of basic generic operators, each of which is instantiated by setting a set of parameters. Each operator takes an object as input and returns a (possibly empty) list of objects as its output.
HTTP query building. A first operator is the HTTP query building operator. An HTTP query is composed of three parts: a query method, a base URL and a set of key/value pairs. Applying an HTTP query building operator consists in building these three parts from the parameters the operator is given. This operator builds a list containing a unique item: the HTTP query.
Fetching. A fetching operator takes as input either a URL or an HTTP request and downloads the document referred to. Its output is either a list containing an HTTP response as its unique item or an empty list in case of an error.
Web Service querying. A Web Service querying operator takes as input a set of parameters and outputs the result of calling a predetermined Web Service with these parameters. Two of the parameters are the location of the Web Service’s description (i.e. its WSDL file) and the method to call. This operator generates a list containing a unique item: the SOAP envelope returned by the Web Service.
Parsing. A parsing operator takes an XML or HTML document, parses it and returns a DOM object. This object model gives highly flexible access to the different parts of an XML/HTML document. This operator returns either a list containing a unique item, the DOM object, or an empty list in case of a parsing error.
Filtering. A filter operator selects its input according to a predetermined predicate, which is defined by a set of tests. Any input object satisfying the predicate is returned; all other input is discarded. This operator therefore returns either a list containing the input item as its unique element or, if the input does not match the predicate, an empty list.
Figure 1. Manually executed information extraction task, shown as four screens (1) to (4)
Extracting. An extraction operator returns subparts of its input. Which subparts to extract is determined by an expression which is applied to the input. For example, given the DOM representation of an HTML page and the //a/@href XPath expression, the resulting extraction operator returns the links contained in the input document. This operator can generate a list containing zero, one or more items. The returned list is composed of all the subparts of the input object matching the operator’s expression.
Transforming. A transformation operator changes the format of its input. When the input is an HTML/XML document (or its DOM representation) the transformation can be described by an XSL stylesheet. This operator returns a list containing the transformed item as its unique object.
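The common shape of these operators (one object in, a possibly empty list out) can be sketched in a few lines of Python. This is an illustrative approximation rather than the implementation described in this paper: all function names are hypothetical, the standard-library ElementTree stands in for a DOM, it only accepts well-formed XML (real HTML would need a more tolerant parser), and since ElementTree's XPath subset cannot select attribute nodes, the path `.//a` plus an attribute lookup approximates `//a/@href`.

```python
import xml.etree.ElementTree as ET


def make_query_builder(method, base_url):
    """HTTP query building: returns a one-item list holding the three query parts."""
    def build(params):
        return [{"method": method, "url": base_url, "params": dict(params)}]
    return build


def parse(document):
    """Parsing: returns [tree] on success, [] on a parsing error."""
    try:
        return [ET.fromstring(document)]
    except ET.ParseError:
        return []


def make_filter(predicate):
    """Filtering: returns [item] if the predicate holds, [] otherwise."""
    def apply(item):
        return [item] if predicate(item) else []
    return apply


def make_extractor(path, attribute):
    """Extracting: returns the given attribute of every element matching path."""
    def apply(tree):
        values = (element.get(attribute) for element in tree.iterfind(path))
        return [value for value in values if value is not None]
    return apply
```

Because every operator returns a list, an error (empty list) and a multi-valued extraction are handled uniformly by whatever consumes the output.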
4.2. Coordination of the operators
In order to build a complete information extraction task it is necessary to coordinate the basic tasks. This is simply done by telling each task what to do with its results. For example, after having built a query, the next step is to fetch the query result. This can be done by setting up a query task and a fetching task and telling the query task to send its results to the fetching task. Whenever the query task receives input and builds a new query, it sends the generated query to the fetching task.
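The query-to-fetch example can be sketched as a small push-based network in Python. The `Task` class, its method names and the in-memory `PAGES` table standing in for the network are all assumptions made for this sketch; they are not the coordination mechanism of the system described here.

```python
from urllib.parse import urlencode


class Task:
    """Wraps an operator and forwards each of its results to the next tasks."""

    def __init__(self, name, operator):
        self.name = name
        self.operator = operator   # callable: object -> list of objects
        self.targets = []          # tasks that receive this task's results
        self.results = []          # results kept when there is no target

    def send_to(self, other):
        self.targets.append(other)
        return other               # allows chaining: a.send_to(b).send_to(c)

    def receive(self, item):
        for output in self.operator(item):
            if self.targets:
                for target in self.targets:
                    target.receive(output)
            else:
                self.results.append(output)


# Hypothetical stand-in for the Web: maps a URL to the page it would return.
PAGES = {"http://example.org/search?actor=Robin+Williams": "<html>results</html>"}

query = Task("query", lambda actor: ["http://example.org/search?" + urlencode({"actor": actor})])
fetch = Task("fetch", lambda url: [PAGES[url]] if url in PAGES else [])

query.send_to(fetch)             # tell the query task where its results go
query.receive("Robin Williams")  # the built query is pushed on to the fetch task
```

After the last line, `fetch.results` holds the fetched page; an input whose URL is unknown simply produces an empty list and nothing is forwarded.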
4.3. Examples of information extraction tasks
In the following we describe three example tasks which use both Web Services and specific methods.
Google via its Web Service. The objective of this task is to obtain the modification date, size and type of the results returned by Google for a query. The results are obtained by using Google’s doGoogleSearch Web Service. However, they do not contain the wanted information, namely the type of each document, its last modification date and its content size. This information can be obtained by sending an HTTP HEAD request to the server hosting the page. To carry out this task, we first need a Web Service querying operator which knows where the Google service is located, which method to call and how to translate incoming data into a suitable parameter list for the Web Service call. Secondly, we need an XML parsing operator to give us a DOM representation of the obtained SOAP message. Then we need an extraction operator which knows how to extract the list of result URLs from this message. To obtain the information on each URL, a fetching operator is necessary to query the host server of the document the URL points to. Finally, we need an extraction operator to keep the desired information for each result (i.e. the URL, its modification date, its size and its type). Figure 2 gives the coordination graph of this task.

Figure 2. Google extraction task network

Extracting DVD listings from Amazon. In this case the objective is to use information extraction to build a Web Service for an existing classic Web-based source. In our example this source is Amazon. This involves accessing the query page, posting a query, retrieving each of the result pages and extracting from these pages the information on each result item, namely the DVD title, its date, the main actor of the movie, and the price of the DVD. This extraction can be done by applying a wrapper built specifically for the source. We used the algorithms in [3] to automatically learn this wrapper, which can be directly integrated into our system as an external operator. One of the difficulties in accessing Amazon is its cookie-less session tracking system. When browsing the index page of the Amazon site, a key is generated and included in every URL sent back to the user agent. This key is needed to access the other pages of the site and to query the site. An extraction task accessing Amazon therefore needs to simulate user browsing by fetching the index page, following the links to the query page, retrieving the action URL (which contains the generated session key) of the form in that page and posting the user’s query to this URL.
We therefore need the following operators: (1) a first fetch operator which retrieves Amazon’s index page and initiates a new session, (2) an extraction operator which extracts the DVD URL, (3) a second fetch operator which retrieves the DVD index page, (4) an extraction operator which extracts the link to a page containing an advanced query form, (5) a third fetch operator to retrieve the advanced query page, (6) an extraction operator which extracts the action attribute of the advanced query form, (7) an HTTP query building operator which transforms the user’s query and the extracted action URL into an HTTP request, (8) a fourth fetch operator which retrieves the first result document obtained with this query, (9) the external operator which extracts the result instances from a result page, and (10) an extraction operator which extracts the next link URL. The coordination of these operators is given in Figure 3. The first seven operators are simply chained in sequence. They initiate a session and give access to a valid form action URL. The remaining operators repeat a classic fetch-next-and-extract loop. Note that an automatically generated wrapper was easily integrated into the task by creating an external operator.

Figure 3. Amazon DVD extraction task network

Figure 4. Patent extraction task network
Retrieving information on patents from the Web. Another example information extraction task is that of extracting patents from an online source accessible through an HTML form. This form leads to a first result list page which contains a link to other result list pages. Each result list page contains a list of links to documents, each describing one patent. These patent pages contain information on the patents such as their title, the inventors, the assignees, their international classification number, etc.
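The fetch-next-and-extract loop that the Amazon and patent tasks rely on can be sketched as follows in Python. The markup conventions (`class="item"`, `class="next"`, an `<h1>` title), the regular expressions and the `fetch` callable are hypothetical choices for this illustration; a real task would use a learned wrapper and an HTTP fetching operator instead.

```python
import re

# Hypothetical page conventions, assumed only for this sketch.
ITEM_LINK = re.compile(r'<a class="item" href="([^"]+)"')
NEXT_LINK = re.compile(r'<a class="next" href="([^"]+)"')
TITLE = re.compile(r"<h1>(.*?)</h1>")


def extract_item(page):
    """Extract one record from an item page (here: only its title)."""
    return [{"title": title} for title in TITLE.findall(page)]


def crawl_results(first_url, fetch, max_pages=50):
    """Follow 'next' links from first_url and extract every linked item.

    fetch is a fetching operator: it maps a URL to [page] or [] on error.
    """
    items = []
    seen = set()
    url = first_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        pages = fetch(url)
        if not pages:
            break                                  # fetch error: stop the loop
        page = pages[0]
        for link in ITEM_LINK.findall(page):       # links to the item pages
            for item_page in fetch(link):
                items.extend(extract_item(item_page))
        next_links = NEXT_LINK.findall(page)       # link to the next result page
        url = next_links[0] if next_links else None
    return items
```

The `seen` set and the `max_pages` bound are defensive additions of this sketch so that a source whose "next" links form a cycle cannot make the loop run forever.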
5. XML-based description of information extraction web services
An information extraction Web Service is a set of coordinated basic operators. In the XML description each basic operator is represented by an XML element. The attributes of this element and its content fully instantiate the operator.
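To make the idea of instantiating an operator from an XML element concrete, the Python sketch below reads a made-up description and turns it into a callable operator. Only the element names query, parameters and param and the name attribute follow this paper's description of the query operator; the service root element, the method and from-input attributes and the parameter names are invented for illustration and should not be read as the actual syntax of the language.

```python
import xml.etree.ElementTree as ET

# Hypothetical description: "service", "method" and "from-input" are invented here.
DESCRIPTION = """
<service name="dvd-search">
  <query name="build-query" method="POST">
    <parameters>
      <param name="field-actor" from-input="actor"/>
      <param name="sort">price</param>
    </parameters>
  </query>
</service>
"""


def load_query_operator(element):
    """Instantiate a query-building operator from its XML element."""
    fixed = {}        # parameters whose value is given in the description
    from_input = {}   # parameters whose value is read from the input object
    for param in element.iterfind("parameters/param"):
        name = param.get("name")
        if param.get("from-input") is not None:
            from_input[name] = param.get("from-input")
        else:
            fixed[name] = (param.text or "").strip()

    def build(input_object):
        params = dict(fixed)
        for name, key in from_input.items():
            params[name] = input_object[key]
        return [{"method": element.get("method", "GET"), "params": params}]

    return build


root = ET.fromstring(DESCRIPTION)
build_query = load_query_operator(root.find("query"))
```

Calling `build_query({"actor": "Robin Williams"})` then yields a one-item list holding the request method and the merged fixed and input-derived parameters.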
5.1. Describing the operators
First of all, we need to describe the set of operators. In our XML language each operator is associated with an XML element. Each operator is set up by declaring the values of a set of parameters, which is done by adding child elements to the operator element. The name attribute of the element gives the name of the operator.
query. The query element builds an HTTP request from a set of parameters which either have fixed values or come from the input object. Figure 5 gives an example query operator for Amazon. When an HTTP request is sent to a Web server, a query can be associated with the request. It takes the form of a set of attribute-value pairs. These are set with the param elements under parameters. The attribute names correspond to the value of the name attribute. The value
call : doGoogleSearch("XXX", "utf8")