WIRE A WWW-based Information Retrieval and Extraction System

Sudhir Aggarwal, Fuyung Hung, and Weiyi Meng
Department of Computer Science
State University of New York at Binghamton
Binghamton, NY 13902
email: {sudhir, meng}@cs.binghamton.edu

Abstract

Locating and retrieving specific data from the World Wide Web (WWW) is an important problem. Existing search engines often return too much useless data and are generally incapable of automatically extracting specific information such as names and email addresses. In this paper we describe WIRE, a WWW-based Information Retrieval and Extraction system whose goal is to accurately retrieve and organize specific information from the World Wide Web. WIRE employs several innovative techniques. First, queries of WIRE are tree structured. This not only provides an order in which Web pages are to be searched/retrieved but also provides a context for more accurate retrieval. Second, WIRE employs a library of search templates based on the structure of HTML files to extract specific information. These templates can be complemented by user-provided search examples and patterns for better results. Third, WIRE has a filter mechanism to filter out undesired information to further improve retrieval accuracy.

1 Introduction

In just a few years the World Wide Web has become an essential and pervasive information resource. Locating and retrieving specific desired data from this environment is, however, becoming problematic, because information providers are largely free to structure and organize data as they wish and because there is very little central control and management of the data. There are typically two approaches to finding information in the WWW environment: browsing, exemplified by Yahoo, and searching, which is based on using a search engine. Most search engines, such as Alta Vista, Hotbot and Microsoft Index Server, allow the user to find Web pages of interest through the use of keywords provided by the user. These search engines have the following common features. First, they are designed for naive users using keyword-based queries. Second, Web pages are retrieved based on the content of the page alone. In other words, how the pages are organized (e.g., whether two pages are hyper-linked and are in the same organization) is seldom a factor in determining whether a page should be retrieved. Third, the search engines make use of inverted file indexes constructed in advance. While the inverted file is important for efficient processing of user queries, it also dictates that search queries must be keyword based.

A problem common to current search engines is that too many irrelevant pages are typically returned. Some newer search engines attempt to overcome this problem by providing better summarizing capabilities (e.g., WebCompass) or filtering techniques (e.g., Microsoft Index Server).

In this paper we describe WIRE, a WWW-based Information Retrieval and Extraction System. WIRE is an effort to develop a search engine that can accurately retrieve and organize specific information from the World Wide Web. WIRE does not maintain a pre-built index database for its searches. WIRE has the following distinguishing features:

1. Search terms are organized into a tree that we call the query tree. Much of the data in a typical organization has a hierarchical structure. For example, a university home page usually has hyperlinks that point to academic units, which in turn point to departments, etc. See Figure 1 for an example query for finding telephone numbers and e-mail addresses of computer science faculty at Binghamton University. WIRE uses this natural hierarchical organization to improve the efficiency and accuracy of the search. First, it provides an order in which the Web pages are to be searched/retrieved. By following a specified path, the search engine can quickly narrow down the search space and obtain the desired data. Second, it provides a context for more accurate retrieval. For example, with the above hierarchy, we will not retrieve the telephone numbers and e-mail addresses of faculty not in the computer science department even if their Web pages contain relevant words such as {Binghamton, university, computer, science, ...}. In comparison, existing search engines support only flat queries.

2. WIRE is designed to extract specific information such as telephone numbers and addresses from Web pages using a library of search templates based on the structure of HTML files. WIRE also permits users to provide examples and patterns that help in the search and information extraction. The examples and patterns are associated with the search terms of the query tree and are organized into a paradigm tree that is similar in structure to the query tree. The capability of extracting specific information can be useful in many applications. Existing search engines do not support such a capability.

3. WIRE has a sophisticated filter mechanism to filter out undesired information to further improve retrieval accuracy. A filter condition can be local, global or structural. The filter conditions are organized as a restriction tree.

[Query tree with node labels: Binghamton University; Computer Science Department; Faculty; Telephone Number; E-mail Address]

Figure 1 An Example Query Tree

W3QS [7] is a WWW information retrieval system that has some features similar to WIRE. However, there are major differences, as W3QS works typically at the Web page level and is not primarily designed to extract specific information for the user. The input to the search is significantly different in WIRE, which uses a hierarchical tree, and W3QS, which uses either an SQL-like language called W3QL or a form-based GUI. Another major difference is that WIRE uses templates both to locate and to extract specific information and thus need not rely solely on a user-provided pattern. W3QS provides the ability to follow links based on user input to search more deeply if needed, but this mechanism is not as general as the related terms approach of WIRE. Finally, WIRE has a substantially more powerful filtering mechanism to obtain the desired specific information.

Several other studies on designing query languages/facilities for semi-structured/unstructured or hyperlinked data have been reported recently [2,4,6,8,10]. Through the use of a query tree and a restriction tree, WIRE has a practical query language that is easy to use.

Recently, several edge-labeled trees have been proposed to organize less structured data [1,2,11]. The flexibility of edge-labeled trees is excellent for representing data that is less structured and that changes frequently. Trees are also ideal structures for browsing. The tree structures in [2,11] mix schema information and data items in the same tree. In contrast, the tree structure used in WIRE is based on our SCOPE structure [1], which separates schema information (i.e., interpretation) and data items (i.e., tokens) into different trees (i.e., interpretation tree and token tree) and thus allows greater flexibility.

The rest of the paper is organized as follows. In section 2 we describe the software architecture of WIRE and the tree structures needed as input for a structured search. In section 3 we discuss the retrieval and extraction algorithm used by WIRE. We also describe the template library that is used to facilitate the search and extraction of specific data. In section 4 we report the results of an example of the use of WIRE. Finally, in section 5, we present some concluding remarks.

2 Architecture of WIRE and the Search Tree

The high level architecture of WIRE is depicted in Figure 2.

[Architecture diagram with component labels: User; User Interface; Pattern Library; Search Tree; Web Server; WWW; Template Library; Retrieval and Extraction; Data Dictionary; Token Tree; Display]

Figure 2 Architecture of WIRE

The user interface is used to help the user describe the query tree and the auxiliary information (such as patterns) necessary to formulate a query. The pattern library, consisting of a number of commonly used patterns such as those for email addresses and telephone numbers, makes it much easier for users to formulate queries. The query tree and its auxiliary information will together be termed the search tree. The search tree is used as input to the retrieval and extraction component. This component locates potentially useful Web pages from the World Wide Web and extracts and organizes the desired information from these Web pages. The retrieval and extraction are aided by a system-provided template library and several on-line data dictionaries. The results of the search are then organized into a token tree that is viewable by a standard browser such as Netscape.

2.1 Query Tree

WIRE uses trees to organize data in a manner similar to that of the SCOPE system [1]. Data is organized as information labeling the edges of three parallel trees called the interpretation tree, the representation tree and the token tree. Corresponding edges in the three trees are identified automatically. A token is an individual instance of data such as ‘John’ or ‘45’. An interpretation is an (arbitrary) meaning, such as ‘address’, assigned to a set of data tokens. A representation, such as string, is a storage property common to a set of data tokens with the same interpretation. A special representation type called index is used to indicate that the corresponding interpretation edge has a set of token edges associated with it. See Figure 3. When the user provides a query term that is of type index, a special string edge will be created automatically as its child with the label “name”.

[Three panels. (a) Interpretation/Query Tree, edge labels: university, name, department, address, name, faculty, name, e-mail. (b) Representation Tree, edge labels: index and string. (c) Token Tree, token edges u1, u2, d11, ..., d1k, d21, ..., d2m, f1, f2, with tokens including SUNY at Binghamton; Binghamton, NY 13902; Harvard; Computer Science; Sudhir Aggarwal; Weiyi Meng; and two e-mail addresses]

Figure 3 Parallel Trees

During the retrieval and extraction process tokens are added to the token tree. If the representation type of a target token is index, then an index number will be automatically generated and stored in the corresponding token edges in the token tree. In Figure 3c, u1, u2, u3, ... are token edges for the retrieved universities and correspond to the university edge in the query tree. Note that since the set of token edges is hierarchically organized, the set of child tokens for u1 corresponding to department (d11, d12, ...) can be distinguished from the set of child tokens corresponding to department for u2 (d21, d22, ...). See Figure 3c.

2.2 Auxiliary Information

Auxiliary information can be used to improve the search process and the quality of the retrieved data. It is represented as information on the edges of the query tree. Examples (such as Computer Science) and patterns (such as the seven-digit structure of a telephone number) are one type of auxiliary information. We call this type of auxiliary information the paradigm tree. Another type of auxiliary information can be used to restrict or filter the retrieved information (such as retrieving only the Biology and Chemistry departments). This auxiliary information is also defined on edges of the query tree and is termed the restriction tree.

2.2.1 Paradigm Tree

In WIRE, an index edge indicates that multiple tokens satisfying the search term of that edge are to be expected. For example, the search term “department” indicates that multiple departments for each university are expected as result tokens. In order to find the list of items from Web page(s) corresponding to an index edge, WIRE iteratively tries various templates from the template library (see section 3). WIRE allows the user to provide examples of the result tokens for each index edge. The examples are used as follows. First, the examples can be used to identify correct templates for extracting result tokens. The correctly identified template can in turn be used to extract other result tokens. Second, having retrieved a list of tokens, examples can be used to verify whether this list is correct or not. A good example is a token that is likely to be among the retrieved set of tokens.

While examples are useful for index edges, patterns are more useful for non-index edges. We use PERL to express patterns, and our search engine also uses PERL for specifying templates and extracting data. For example, the address pattern that matches a typical state and zip code (e.g., NY 13902) would be \w{2} \d{5}. WIRE also uses an on-line pattern dictionary to store predefined sets such as the 50 U.S. states, referenced by @state.
Figure 4 shows a possible paradigm tree corresponding to the query tree of Figure 3a.

[Paradigm tree annotations: “Harvard” or “Princeton”; “Computer Science” or “Mathematics”; @city, @state \d{5}; state = AL|AR|...|WA; city = Addison|...|York; [\.\w]+@[\.\w]+]

Figure 4 An Example Paradigm Tree
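As a rough illustration of how the patterns quoted above behave, the following minimal sketch applies two of them with Python's re module, which accepts the same syntax for these particular expressions. The paper's system uses PERL; the helper name first_match and the sample string are assumptions made only for this example.

```python
import re

# Patterns quoted in the text and in Figure 4. They are written in Perl
# syntax, which Python's re module also accepts for these expressions.
STATE_ZIP = re.compile(r"\w{2} \d{5}")      # e.g. "NY 13902"
EMAIL = re.compile(r"[\.\w]+@[\.\w]+")      # e.g. "meng@cs.binghamton.edu"


def first_match(pattern, text):
    """Return the first substring of text matching pattern, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


# Hypothetical page text assembled from details given in the paper.
sample = ("Department of Computer Science, Binghamton, NY 13902, "
          "email: meng@cs.binghamton.edu")

state_zip_token = first_match(STATE_ZIP, sample)
email_token = first_match(EMAIL, sample)
```

Note that \w{2} accepts any two word characters, not only state abbreviations, which is one reason a dictionary-backed reference such as @state can give tighter matches than the bare pattern.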

2.2.2 Restriction Tree

The use of query terms and the paradigm tree helps make the search efficient but does not provide sufficient precision, because too many incorrect tokens may be retrieved. WIRE has a sophisticated filtering capability integrating three types of restriction mechanisms. First, WIRE supports local restrictions, similar to other search engines, that are based on the contents of the individual tokens retrieved. Examples of local restrictions are substring, subset and URL. For instance, when a “URL restriction” is imposed on a query edge of type index, the token must appear in the anchor name of a URL. Second, WIRE supports global restrictions such as “unique”, which is imposed on a set of tokens to require all retrieved tokens to be unique. The global restrictions are applied after all tokens (corresponding to an index edge) are extracted from Web pages. Third, WIRE uses structural restrictions that are based on the structure of the query tree. For example, a “multiple restriction” imposed on a query edge requires that the parent edge in the token tree be disqualified unless more than one token corresponding to this query edge is retrieved. Figure 5 shows a possible restriction tree corresponding to the query tree of Figure 3a.

[Restriction tree annotations: substring = “Binghamton”, URL; unique; substring = “Computer Science”, URL, multiple, unique]

Figure 5 Restriction Tree
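The three kinds of restriction can be sketched as small functions. This is only an illustrative reading of the description above, not WIRE's implementation: the function names are hypothetical, “unique” is interpreted here as dropping duplicate tokens, and the sample department with a single faculty member is invented for the example.

```python
def passes_substring(token, required):
    """Local restriction: the token itself must contain the given substring."""
    return required.lower() in token.lower()


def apply_unique(tokens):
    """Global restriction (assumed reading): drop duplicates, keeping order."""
    return list(dict.fromkeys(tokens))


def apply_multiple(children_by_parent):
    """Structural restriction: disqualify a parent token unless more than
    one child token was retrieved for the restricted query edge."""
    return {parent: children
            for parent, children in children_by_parent.items()
            if len(children) > 1}


# Local then global restriction on university tokens.
universities = ["Binghamton University", "Harvard University",
                "Binghamton University"]
kept_universities = apply_unique(
    [u for u in universities if passes_substring(u, "Binghamton")])

# Structural restriction on the faculty edge; "Sample Department" is
# a made-up entry that should be disqualified.
faculty_by_department = {
    "Computer Science": ["Sudhir Aggarwal", "Weiyi Meng"],
    "Sample Department": ["Only One Name"],
}
qualified_departments = apply_multiple(faculty_by_department)
```

Keeping the three kinds separate mirrors the order described in the text: local checks look at one token, global checks run once all tokens for an edge are known, and structural checks need the parent and child token sets together.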

2.3 User Interface

The graphical user interface of WIRE is implemented in Java and allows a user to enter the information discussed above. See [5] for details.

3 Retrieval and Extraction Algorithm

The starting point of the retrieval and extraction algorithm is the root edge in the query tree. The algorithm traverses the edges of the query tree in a breadth-first manner, locating appropriate tokens corresponding to each query edge traversed. The algorithm treats index edges and non-index edges differently. For non-index edges, the algorithm attempts to find a single token. For index edges, the algorithm expects to retrieve multiple tokens. The algorithm also treats the first child edge of an index edge (always labeled “name”) differently. The retrieved token for this edge is actually determined during the search of its parent index edge.

Extracting specific information from Web pages and determining which hyperlinks to use next is a difficult problem because people have many different ways of organizing the information on the pages. WIRE uses the following two approaches to overcome this problem.

• WIRE makes extensive use of the fact that the information on Web pages is described in HTML and is thus highly structured. For example, when searching for an index edge where multiple tokens are expected, we exploit the fact that such tokens are likely to appear in a list structure or a table format. We have developed a library of templates that can capture and extract information from these styles. It is clearly impossible to have prestored templates for all possible styles. A key idea of ours is to use the examples provided by the user to dynamically determine a template, based on the assumption that other tokens will follow the same local style as the example.

• WIRE emulates a certain aspect of human word understanding during the search process, based on the fact that humans know synonyms for words and also know related terms for words. For example, when WIRE is looking for a word such as “university” it can also use the synonym “college” if necessary. Also, when looking for departments, it can use related terms such as “academics” and make use of the fact that departments may be found on the Web pages linked through these related terms. Although synonym dictionaries are commonly available, we are not aware of any general related terms dictionaries.

The starting Web page for a WIRE search is called the initial seed page and is typically provided by the user through the URL of the page. For each query edge, WIRE needs one or more starting Web pages from which to begin searching for tokens related to that edge. Such a page is called a seed page for that edge. During the search we traverse each edge in the query tree. We load the seed page from its location (which is stored in the token tree) and search for tokens using the information provided in the query tree. If we find useful tokens on the seed page, then it is called an object page and it automatically becomes the seed page for its child edges.

The token retrieval and extraction algorithm is outlined below:

1. Input the search tree and provide a starting Web page.
2. Traverse each edge in the query tree using breadth-first search starting from the root edge.
3. If the current edge is an index edge, call the search_index() procedure; otherwise call the search_string() procedure.
4. If an appropriate token is not found, call the relative_search() procedure (this uses related terms to continue the search).
5. Filter out disqualified tokens using the conditions in the restriction tree.
6. Display the search results in a Web browser window.
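The control flow of steps 2 to 5 can be sketched as follows. This is a simplified outline under stated assumptions, not the authors' code: the QueryEdge class, the run_query function and the canned lookup table are hypothetical, the three search procedures are passed in as stand-in callables, seed-page handling is omitted, and filtering is applied per edge for brevity.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class QueryEdge:
    label: str
    kind: str                                   # "index" or "string"
    children: list = field(default_factory=list)


def run_query(root, search_index, search_string, relative_search, keep):
    """Breadth-first traversal of the query tree (steps 2 to 5)."""
    results = []
    queue = deque([root])
    while queue:
        edge = queue.popleft()
        # Step 3: index edges and non-index edges use different procedures.
        finder = search_index if edge.kind == "index" else search_string
        # Step 4: fall back to related terms when nothing was found.
        tokens = finder(edge) or relative_search(edge)
        # Step 5: drop tokens disqualified by the restriction tree.
        results.append((edge.label, [t for t in tokens if keep(edge, t)]))
        queue.extend(edge.children)
    return results


# Query tree shaped like Figure 3a.
faculty = QueryEdge("faculty", "index",
                    [QueryEdge("name", "string"), QueryEdge("e-mail", "string")])
department = QueryEdge("department", "index",
                       [QueryEdge("name", "string"), faculty])
root = QueryEdge("university", "index",
                 [QueryEdge("name", "string"), department,
                  QueryEdge("address", "string")])

# Stand-in procedures returning canned token labels instead of reading pages.
canned = {"university": ["u1"], "faculty": ["f1", "f2"]}
results = run_query(
    root,
    search_index=lambda edge: canned.get(edge.label, []),
    search_string=lambda edge: [],
    relative_search=lambda edge: ["d11"] if edge.label == "department" else [],
    keep=lambda edge, token: True,
)
visit_order = [label for label, _ in results]
```

Passing the procedures in as parameters keeps the traversal independent of how pages are fetched and matched, which is where most of the real work described in sections 3.1 to 3.3 would sit.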

3.1 The search_index() Procedure

This procedure applies certain templates in a given order from the template library in an attempt to match possible list structures. Templates that use more available information (such as user-provided examples) and those that have heuristically been determined to have a greater success rate are applied first. The most frequently used list structures provided in HTML are [three list tag names lost in extraction].

People frequently use leading text in front of a list. They also tend to highlight the leading text in a large font or in an emphasized format such as boldface. The leading text is likely to be adjacent to the related list. We thus additionally restrict the leading text to be within five lines above the list. For example, when looking for the query edge department, we would expect to find the target department tokens in a list, and the word department would likely appear in the leading text. When the user provides a paradigm tree, we can use the examples to identify whether this is the correct list or not. If one of the keywords is found in the emphasized leading text and some item in the list matches one of the examples, then it is likely that this is the correct list. We then extract all of the items as tokens.

We define an emphatic tag array to match the leading text. This array consists of [a series of HTML tag names lost in extraction]. An example template that matches both keywords and examples survives only as the fragment @keywords (footnote 1); the HTML tags of the template were lost in extraction.
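The heuristic described in this section (a keyword in the leading text within five lines above a list, plus at least one list item matching a user example) can be sketched as below. Since the original template and tag names did not survive, everything here is an assumption for illustration: the function name, the choice of ul and ol as the list tags, the use of regular expressions rather than WIRE's PERL templates, and the sample page. The sketch also ignores whether the leading text is emphasized.

```python
import re


def extract_list_tokens(html, keywords, examples, window=5):
    """Return the items of the first list whose leading text mentions a
    keyword and which contains at least one user-supplied example."""
    list_re = re.compile(r"<(ul|ol)\b[^>]*>(.*?)</\1>", re.I | re.S)
    item_re = re.compile(r"<li\b[^>]*>(.*?)</li>", re.I | re.S)
    wanted = {example.lower() for example in examples}
    for match in list_re.finditer(html):
        # Leading text: at most `window` lines directly above the list.
        leading = "\n".join(html[:match.start()].splitlines()[-window:]).lower()
        # Strip inner tags (such as anchors) from each list item.
        items = [re.sub(r"<[^>]+>", "", raw).strip()
                 for raw in item_re.findall(match.group(2))]
        has_keyword = any(k.lower() in leading for k in keywords)
        has_example = any(item.lower() in wanted for item in items)
        if has_keyword and has_example:
            return items
    return []


# Hypothetical page fragment.
page = """<h2>Academic Departments</h2>
<ul>
<li><a href="/cs">Computer Science</a></li>
<li><a href="/math">Mathematics</a></li>
<li><a href="/bio">Biology</a></li>
</ul>
<h2>Quick links</h2>
<ul><li>Home</li><li>Contact</li></ul>"""

departments = extract_list_tokens(page, ["department", "division"],
                                  ["Computer Science"])
```

Requiring both signals is what lets a single good example, such as “Computer Science”, pull in the rest of the list while unrelated lists on the same page are skipped.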

3.2 The search_string() Procedure

This procedure is designed to obtain tokens for non-index edges. It essentially consists of two separate procedures: search_no_pattern(), used if a pattern is not provided, and search_pattern(), used if a pattern is provided. Both procedures start with the seed page for the edge and try to match each template in a procedure-specific set of templates from the template library. For search_no_pattern() we have found only one template that has been useful in extracting tokens; its HTML tags were lost in extraction, and only the fragment ( @keywords *)? survives. For search_pattern(), if a pattern is provided and it contains @xxx, we look up xxx in the on-line pattern dictionary and substitute each of its possible values in turn for @xxx in the original pattern. We then use the new pattern in the template under consideration to extract the token from the Web page.
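The @xxx substitution step can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary contents are a tiny invented subset, the function names expand_pattern and search_with_dictionary are hypothetical, and the expanded pattern is applied directly to the text rather than being embedded in a template as WIRE does.

```python
import itertools
import re

# Tiny illustrative stand-in for the on-line pattern dictionary.
PATTERN_DICTIONARY = {
    "state": ["NY", "PA", "NJ"],
    "city": ["Binghamton", "York"],
}


def expand_pattern(pattern, dictionary):
    """Replace every @name reference with each dictionary value in turn,
    returning one concrete regular expression per combination."""
    names = re.findall(r"@(\w+)", pattern)
    if not names:
        return [pattern]
    expanded = []
    for combo in itertools.product(*(dictionary[name] for name in names)):
        concrete = pattern
        for name, value in zip(names, combo):
            concrete = concrete.replace("@" + name, re.escape(value), 1)
        expanded.append(concrete)
    return expanded


def search_with_dictionary(text, pattern, dictionary):
    """Try each expanded pattern and return the first match, or None."""
    for concrete in expand_pattern(pattern, dictionary):
        match = re.search(concrete, text)
        if match:
            return match.group(0)
    return None


state_patterns = expand_pattern(r"@state \d{5}", PATTERN_DICTIONARY)
address = search_with_dictionary("SUNY at Binghamton, Binghamton, NY 13902",
                                 r"@city, @state \d{5}", PATTERN_DICTIONARY)
```

An e-mail pattern such as [\.\w]+@[\.\w]+ passes through unchanged, because its @ is followed by a bracket rather than a dictionary name.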

Footnotes to Section 3.1: (1) A keyword array containing all synonym terms. (2) An example array containing all examples.

3.3 The relative_search() Procedure

One of the more important aspects of WIRE’s design is the ability to do a “relative search” using related terms. If our query tree structure matches the structure of the target Web pages, then the procedures discussed above are sufficient when searching for tokens corresponding to a query edge term. Unfortunately, this ideal situation does not often hold. Consider the query tree shown in Figure 6a, which indicates that department information is expected to appear either on the university page or on pages reachable from the university page by one-step hyperlinks. In reality, however, the departments may actually appear several hyperlinks “deeper” from the university page. See Figure 6b. In order to find the departments under such circumstances, we construct potential query trees, hoping that one will match the real WWW structure. The potential query trees are constructed by incorporating additional edges, with labels from related terms of the edge under consideration (e.g., department), on a path from the university edge to the department edge.

[Two panels. (a) User query tree, edge labels: university, name, department, address, name. (b) Real WWW structure, edge labels: university, name, academic, address, name, department, name]

Figure 6 Mismatch between query tree and real WWW structure
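One way to picture the construction of potential query trees is to enumerate candidate paths between the parent edge and the target edge, inserting related terms as intermediate edges. The sketch below is an assumed reading of the description, not the paper's procedure: the function name, the max_extra bound and the choice to try every ordering of the related terms are all hypothetical. The related terms “academic” and “school” are the ones the paper reports using for department.

```python
from itertools import permutations


def potential_paths(parent, target, related_terms, max_extra=2):
    """Enumerate candidate edge paths from parent to target, inserting up
    to max_extra related terms as intermediate edges. The user's original
    path (no extra edges) comes first."""
    paths = []
    for extra in range(max_extra + 1):
        for middle in permutations(related_terms, extra):
            paths.append([parent, *middle, target])
    return paths


candidate_paths = potential_paths("university", "department",
                                  ["academic", "school"])
```

Ordering the candidates from shortest to longest means the user's own query structure is tried first, and deeper structures such as university, academic, school, department are only explored when the direct path yields nothing.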

4 An Example of the Use of WIRE

In this section, we illustrate the use of WIRE through an experiment. We wish to retrieve all the departments in Binghamton University and all the faculty for each department. For each department, we are only interested in the name of the department. For each faculty member we wish to retrieve the name, telephone number and email address. The search tree for this experiment is shown in Figure 7. In the paradigm tree, we give examples of possible faculty names. Whenever English names are likely to be among the names retrieved by a query, using popular names will help the search algorithm to identify correct lists and tokens. In the paradigm tree, we also provide patterns for email addresses and telephone numbers. At Binghamton University, departments are actually categorized into schools, and the schools themselves reside under the academics homepage. Thus the structure of the query does not directly match the structure of the Web pages at Binghamton University. With “academic” and “school” as related terms of department, the relative search procedure is able to correctly locate department home pages. We also use “division” as a synonym term of department and “professor” as a synonym term of faculty.

The initial seed page to start the search is http://www.binghamton.edu/.

[Four panels. (a) Query Tree, edge labels: department, name, faculty, name, telephone, e-mail. (b) Representation Tree, edge labels: index and string. (c) Paradigm Tree: example names Michael, Richard, William, David; patterns \d{3}-\d{4} and [\.\w]+@[\.\w]+. (d) Restriction Tree: URL, multiple]

Figure 7 Search tree for the university database

In our experiments, we considered two cases based on slightly different variations of the paradigm tree. In the first case, the search used the paradigm tree of Figure 7c except that we did not use the examples (of names). The second case used the paradigm tree with the examples. By comparing the results of the two cases, the effectiveness of using examples for this experiment can be measured. We use the following terms to report the search results:

• #Tokens: The total number of desired (correct) tokens.
• #Found: The total number of tokens found.
• #Correct: The number of correct tokens found.
• #Wrong: The number of wrong tokens found.
• #Missing: The number of correct tokens not found.
• Success Rate (SR) = #Correct / #Tokens.
• Error Rate (ER) = #Wrong / #Tokens.
• Missing Rate (MR) = #Missing / #Tokens.

The search results for all tokens are summarized in the following table, where #T, #F, #C, #W and #M abbreviate #Tokens, #Found, #Correct, #Wrong and #Missing. The values in parentheses refer to the use of examples. When examples were used, a better success rate was achieved.

              #T         #F         #C         #W       #M        SR         ER         MR
Dept Name     32         31         29         2        3         91%        6%         9%
Faculty Name  477 (477)  391 (491)  336 (432)  56 (60)  142 (46)  70% (91%)  12% (13%)  30% (10%)
Email         112        102        96         6        7         86%        5%         6%
Tel           102        103        95         8        7         93%        8%         7%
Total         723 (723)  627 (727)  556 (652)  72 (76)  159 (63)  77% (90%)  10% (11%)  22% (9%)

5 Conclusions

WIRE has some new features not found in previous search tools:

• Queries are tree structured through query, paradigm and restriction trees.
• WIRE is capable of extracting specific information from Web pages through the use of templates, patterns, examples, and restrictions.
• WIRE has a sophisticated filtering mechanism consisting of local restrictions, global restrictions, and structural restrictions.

WIRE has a powerful graphical interface that allows a user to easily define a search query and returns the results in a browseable form that is easy to understand. We are currently exploring extensions to WIRE to make it even more useful for a variety of different applications.

References

[1] S. Aggarwal, I. Kim, and W. Meng. “Database Exploration with Dynamic Abstractions”. Proc. of the 5th International Conference on Database and Expert System Applications (DEXA ’94), pp. 621-630, 1994.
[2] P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu. “A Query Language and Optimization Techniques for Unstructured Data”. ACM SIGMOD, 1996.
[3] L. Colby, T. Griffin, L. Libkin, I. Mumick, and H. Trickey. “Algorithms for Deferred View Maintenance”. ACM SIGMOD, 1996.
[4] M. Consens and A. Mendelzon. “Expressing Structural Hypertext Queries in GraphLog”. Second ACM Conference on Hypertext, 1989.
[5] F. Hung. “WIRE - A WWW-based Information Retrieval and Extraction System”. M.S. thesis, Department of Computer Science, State University of New York at Binghamton, 1997.
[6] R. Goldman and J. Widom. “Dataguides: Enabling Query Formulation and Optimization in Semi-structured Databases”. VLDB, 1997.
[7] D. Konopnicki and O. Shmueli. “W3QS: A query system for the World Wide Web”. VLDB, http://www.cs.technion.ac.il/~W3QS, 1995.
[8] L. Lakshmanan, F. Sadri, and I. Subramanian. “A Declarative Language for Querying and Restructuring the Web”. Sixth International Workshop on Research Issues in Data Engineering (RIDE’96), 1996.
[9] A. Mendelzon, G. Mihaila, and T. Milo. “Querying the World Wide Web”. PDIS, 1996.
[10] T. Minohara and R. Watanabe. “Queries on Structure in Hypertext”. Conference on Foundations of Data Organization (FODO’93), 1993.
[11] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. “Object Exchange across Heterogeneous Information Sources”. IEEE International Conference on Data Engineering, 1995.
