Adapting Web Information Extraction Knowledge via Mining Site-Invariant and Site-Dependent Features

TAK-LAM WONG, City University of Hong Kong
WAI LAM, The Chinese University of Hong Kong
We develop a novel framework that aims at automatically adapting previously learned information extraction knowledge from a source Web site to a new unseen target site in the same domain. Two kinds of features related to the text fragments of Web documents are investigated. The first kind is called a site-invariant feature. These features are likely to remain unchanged in Web pages from different sites in the same domain. The second kind is called a site-dependent feature. These features differ across Web pages collected from different Web sites, while they are similar in Web pages originating from the same site. In our framework, we derive the site-invariant features from the previously learned extraction knowledge and the items previously collected or extracted from the source Web site. The derived site-invariant features are exploited to automatically seek a new set of training examples in the new unseen target site. Both the site-dependent features and the site-invariant features of these automatically discovered training examples are considered in the learning of new information extraction knowledge for the target site. We conducted extensive experiments on a set of real-world Web sites collected from three different domains to demonstrate the performance of our framework. For example, by just providing training examples from one online book catalog Web site, our approach can automatically extract information from ten different book catalog sites, achieving an average precision and recall of 71.9% and 84.0% respectively, without any further manual intervention.
Categories and Subject Descriptors: H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; I.2.6 [Artificial Intelligence]: Learning—Induction

The work described in this article is substantially supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project Nos. CUHK 4179/03E and CUHK 4193/04E) and the Direct Grant of the Faculty of Engineering, CUHK (Project Codes 2050363 and 2050391). This work is also affiliated with the Microsoft-CUHK Joint Laboratory for Human-centric Computing and Interface Technologies.

Authors’ addresses: T.-L. Wong, Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong; email: [email protected]; W. Lam, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong; email:
[email protected].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].

© 2007 ACM 1553-5399/07/0200-ART6 $5.00. DOI 10.1145/1189740.1189746 http://doi.acm.org/10.1145/1189740.1189746

ACM Transactions on Internet Technology, Vol. 7, No. 1, Article 6, Publication date: February 2007.
General Terms: Algorithms, Design

Additional Key Words and Phrases: Wrapper adaptation, Web mining, text mining, machine learning

ACM Reference Format: Wong, T.-L. and Lam, W. 2007. Adapting Web information extraction knowledge via mining site-invariant and site-dependent features. ACM Trans. Intern. Tech. 7, 1, Article 6 (February 2007), 40 pages. DOI = 10.1145/1189740.1189746 http://doi.acm.org/10.1145/1189740.1189746
1. INTRODUCTION

The vast amount of online documents on the World Wide Web provides a good resource for users to search for information. One common practice is to make use of search engines. For example, a potential customer may browse different bookstore Web sites with the aid of a search engine, hoping to gather precise information such as the title, authors, and selling price of some books. A major problem of online search engines is that the unit of the search results is an entire Web document. Human effort is required to examine each of the returned entries to extract the precise information. Automatic information extraction systems can automate this task by effectively identifying the relevant text fragments within the document. The extracted data can also be utilized in many intelligent applications such as an online comparison-shopping agent [Doorenbos et al. 1997] or an automated travel assistant [Ambite et al. 2002].

Unlike free texts (e.g., newswire articles) and structured texts (e.g., texts in rigid format), online Web documents are semi-structured text documents with a variety of formats containing a mix of short, weakly grammatical text fragments, mark-up tags, and free texts. For example, Figure 1 depicts a portion of an example of a Web book catalog.¹ To automatically extract precise data from semi-structured documents, a commonly used technique is to make use of wrappers. A wrapper normally consists of a set of extraction rules that can identify the text fragments in the documents. In the past, human experts analyzed the documents and constructed the set of extraction rules manually. This approach is costly, time-consuming, tedious, and error-prone. Wrapper induction aims at automatically constructing wrappers by learning a set of extraction rules from manually annotated training examples.
For instance, a user can specify some training examples containing the title, authors, and price of the book records in the Web document shown in Figure 1 through a GUI. The wrapper induction system can automatically learn the wrapper from these training examples, and the learned wrapper is able to effectively extract data from the documents of the same Web site. Different proposed techniques demonstrate that wrapper induction achieves very good extraction performance [Califf and Mooney 2003; Ciravegna 2001; Downey et al. 2005; Freitag and McCallum 1999; Muslea et al. 2001; Soderland 1999]. One major limitation of existing wrapper induction techniques is that a wrapper learned from a particular Web site cannot be applied to a new unseen Web site, even in the same domain. For instance, suppose we have learned

¹ The URL of this Web site is http://www.powells.com.
Fig. 1. A portion of a sample Web page of a book catalog.
the wrapper for the Web site shown in Figure 1. In this article, it is called a source Web site. Figure 2 shows another book catalog collected from a Web site different from the one shown in Figure 1.² Although both the source site (Figure 1) and the new Web site (Figure 2) contain information about book records, the wrapper learned for the source site cannot be applied directly to this new Web site because their layouts are typically quite different. To automatically extract data from the new site, we must construct another wrapper customized to it. Hence, separate human effort is necessary to collect training examples from the new site and invoke the wrapper induction process again. In this article, we develop a novel framework that can fully automate the adaptation and eliminate this human effort. The problem is called wrapper adaptation: it aims at automatically adapting a previously learned wrapper from a source Web site to a new unseen site in the same domain. Under our model, if some attributes in a domain share a certain amount of similarity among different sites, the wrapper learned from the source Web site can be adapted to the new unseen sites without any human intervention. As a result, the manual effort for preparing training examples in the overall process is substantially reduced.

We have previously developed an algorithm called WrapMA [Wong and Lam 2002] for solving the wrapper adaptation problem. WrapMA can adapt the previously learned extraction knowledge from one Web site to another new unseen Web site in the same domain. The main drawback of WrapMA is that human effort is still required to scrutinize the intermediate data during the adaptation phase. In this article, we present a novel method called Information Extraction Knowledge Adaptation (IEKA) for solving the wrapper adaptation problem. IEKA is a fully automatic method requiring no manual effort. The idea of IEKA is to analyze the site-dependent features and the site-invariant features of the Web pages in order to automatically seek a new set of training examples from the new unseen site. A preliminary version was reported in Wong and Lam [2004b]. In this article, we substantially enhance the framework by modeling the dependence among various kinds of knowledge and site-specific features in the Web environment. An information-theoretic approach to analyzing the DOM (Document Object Model) structure of Web pages is also incorporated for seeking training examples in the new site more effectively. The performance of our new approach, IEKA, is very promising, as demonstrated in the extensive experiments described in Section 9. For example, by just providing training examples from one online book catalog Web site, our approach can automatically extract information from ten different book catalog sites, achieving an average precision and recall of 71.9% and 84.0% respectively, without any further manual intervention.

² The URL of the Web site is http://www.halfpricecomputerbooks.com.

Fig. 2. A portion of a sample Web page about a book catalog collected from a Web site different from the one shown in Figure 1.

2. PROBLEM DEFINITION AND MOTIVATION

Consider the Web site shown in Figure 1. Figure 3 depicts an excerpt of the HTML text of this page. Suppose we want to automatically extract information such as the title, authors, and prices of the books from this Web site. We can construct a wrapper for this Web site to achieve this task. To learn the wrapper, we first manually annotate some training examples, similar to the one depicted in Table I, via a GUI. We employ our wrapper learning system (HISER), which considers the text fragments of the data item as well as the
Fig. 3. An excerpt of the HTML texts for the Web page shown in Figure 1.

Table I. Sample of Manually Annotated Training Examples from the Web Page Shown in Figure 1

Item          Item value
Book Title:   Programming Microsoft Visual Basic 6.0 with CDROM (Programming)
Author:       Francesco Balena
Final Price:  59.99
surrounding text fragments [Lin and Lam 2000]. The learned wrapper is composed of a hierarchical record structure and extraction rules. For example, Figure 4 shows the learned hierarchical record structure and Table II shows one of the learned extraction rules for the book title. A hierarchical record structure is a tree-like structure representing the relationship of the items of interest. The root node represents a record that consists of one or more items. An internal node represents a certain fragment of its parent node. An internal node can be a repetition, which may consist of other subtrees or leaf nodes. A repetition node specifies that its child can appear repeatedly in the record. A leaf node represents an attribute item of interest. Each node of the hierarchical record structure is associated with a set of extraction rules. An extraction rule contains three components: the left and right pattern components contain the left and right delimiters of the item, and the target pattern component describes the semantic meaning of the item.

After obtaining the wrapper, we can apply it to the other pages from the same Web site to automatically extract items. The learned wrapper can effectively extract items from the Web pages of the same site. However, it cannot extract any item if we directly apply it to a new unseen site in the same domain, such as the one shown in Figure 2. Figure 5 depicts an excerpt of the HTML texts for this page. The failure of extraction is due to the difference between the layout formats of the two Web sites; the learned wrapper becomes inapplicable to the new site. In order to automatically extract information from the new site, one could learn a wrapper for the new site by manually collecting another set of training examples. Instead, we propose and develop our IEKA framework to tackle this problem automatically.
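As a rough sketch of how such a delimiter-based wrapper operates, assuming a deliberately simplified rule representation (HISER's actual rules, as in Table II, also carry semantic components and are organized per node of the record structure):

```python
import re

# Hypothetical record structure for the book domain (cf. Figure 4):
# a record consists of a title, a price, and a repetition of authors.
record_structure = {"record": ["title", "price", {"repetition": "author"}]}

def apply_rule(html, left, right):
    """Apply a simplified extraction rule: the left and right pattern
    components are plain delimiter strings; the text between the first
    occurrence of `left` and the following `right` is the extracted item."""
    m = re.search(re.escape(left) + r"(.*?)" + re.escape(right), html, re.DOTALL)
    return m.group(1).strip() if m else None

# Illustrative page fragment in the style of Figure 3 (not the actual HTML).
page = '<b><u>Programming Microsoft Visual Basic 6.0</u></b> by Francesco Balena <i>$59.99</i>'
title = apply_rule(page, "<b><u>", "</u></b>")  # delimiters learned from examples
price = apply_rule(page, "<i>$", "</i>")
```

A real target pattern component would additionally test the semantics of the matched fragment (e.g., capitalization of title words) rather than relying on delimiters alone, which is precisely the part that survives a change of site.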
root
├── title
├── price
└── repetition(author)
    └── author

Fig. 4. The learned hierarchical record structure for the Web page shown in Figure 1.
Fig. 5. An excerpt of the HTML texts for the Web page shown in Figure 2.
There are several characteristics of IEKA. The first is that IEKA utilizes the previously learned extraction knowledge contained in the wrapper of the source Web site. For example, the extraction rule depicted in Table II shows that the majority of the book titles contain alphabetic words starting with a capital letter. Such knowledge provides useful evidence for information extraction from a new unseen site in the same domain. However, it is not directly applicable to the new site due to the difference between the contexts of the two Web sites. We refer to such knowledge as weak extraction knowledge. The second characteristic of IEKA is that it makes use of the items previously extracted or collected from the source site. These items can contribute to deriving training examples for the new unseen site. One major difference between this kind of training example and ordinary training examples is that the former only consists of information about the item content, while the latter contains information about both the content and the context of the Web pages. We call this property partially specified.

Based on the weak extraction knowledge and the partially specified training examples, IEKA first derives those site-invariant features that remain largely unchanged across different sites. For example, one kind of site-invariant feature consists of patterns about the attributes, such as capitalization. Another kind of site-invariant feature is the orthographic information of the attributes. Next, a set of training example candidates is selected by analyzing the DOM structures of the Web documents of the new unseen site based on an information-theoretic approach. Machine learning methods are then employed to automatically discover some machine-labeled training examples from the set of candidates, based on the site-invariant features. Table III depicts samples of the automatically discovered machine-labeled training examples from the new unseen site shown in Figure 2.
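Such orthographic, site-invariant features can be sketched as follows (the feature names and exact definitions here are illustrative, not IEKA's actual feature set):

```python
def orthographic_features(fragment):
    """Compute simple orthographic features of a candidate text fragment."""
    tokens = fragment.split()
    return {
        "num_tokens": len(tokens),
        "frac_capitalized": sum(t[0].isupper() for t in tokens) / len(tokens),
        "frac_alphabetic": sum(t.isalpha() for t in tokens) / len(tokens),
        "has_digit": any(c.isdigit() for c in fragment),
    }

# A book title from the new site exhibits feature values similar to
# titles from the source site, even though its surrounding markup differs.
features = orthographic_features("C++ Weekend Crash Course, 2nd edition")
```

Because these values depend only on the item content, not on the page layout, a classifier trained on them in the source site remains usable on candidates drawn from the new unseen site.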
Both site-invariant and site-dependent features of the machine-labeled
Table II. A Sample of a Learned Extraction Rule for the Book Title for the Web Page Shown in Figure 1

Left pattern component: Scan Until(, SEMANTIC), Scan Until(“”, TOKEN), Scan Until(“”, SEMANTIC), Scan Until(“”, TOKEN).
Target pattern component: Contain(), Contain()
Right pattern component: Scan Until(“”, TOKEN), Scan Until(“”, TOKEN), Scan Until(“”, TOKEN), Scan Until(, SEMANTIC).

Table III. Samples of Machine-Labeled Training Examples Obtained by Adapting the Wrapper from the Web Site Shown in Figure 1 to the Web Site Shown in Figure 2 Using Our IEKA Framework

Example 1
  Book Title:   C++ Weekend Crash Course, 2nd edition
  Author:       Stephen Randy Davis
  Final Price:  23.99

Example 2
  Author:       Steve Oualline
  Final Price:  31.96
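The Scan Until operations in these rule tables can be read as a cursor advancing through the page's token stream until a given delimiter is consumed. A minimal sketch under that assumed semantics (the token values, and the exact behavior of HISER's rule language, are illustrative):

```python
def scan_until(tokens, pos, target):
    """Advance the cursor just past the first occurrence of `target`
    at or after `pos`; return the new position, or -1 if not found."""
    for i in range(pos, len(tokens)):
        if tokens[i] == target:
            return i + 1
    return -1

# A left pattern component chains several Scan Until calls; the target
# pattern component is then matched starting at the final cursor position.
tokens = ["<tr>", "<td>", "<b>", "C++", "Primer", "</b>", "</td>"]
pos = scan_until(tokens, 0, "<td>")
pos = scan_until(tokens, pos, "<b>")  # cursor now at the book title tokens
```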
training examples will then be considered in the learning of the new wrapper for the new target site. The newly discovered hierarchical record structure for the new site is the same as the one shown in Figure 4. Table IV shows the set of adapted extraction rules for the book title. The newly learned wrapper can be applied to extract items from the Web pages of this new site.

3. RELATED WORK

Research efforts on information extraction from various kinds of textual documents, ranging from free texts to structured documents, have been investigated [Chawathe et al. 1994; Srihari and Li 1999]. Among different extraction approaches, a wrapper is a common technique for extracting information from semi-structured documents such as Web pages [Kushmerick and Thomas 2002]. In the past few years, many wrapper learning systems that aim at constructing wrappers by learning from a set of training examples have been proposed [Blei et al. 2002; Ciravegna 2001; Cohen et al. 2002; Freitag and McCallum 1999; Hogue and Karger 2005; Hsu and Dung 1998; Kushmerick 2000a; Lin and Lam 2000; Muslea et al. 2001; Soderland 1999]. These approaches can automatically learn wrappers from a set of training examples, and the learned wrapper can effectively extract items from the corresponding Web sites. However, they suffer from two common drawbacks. First, when the layout of a Web site changes, the learned wrapper typically becomes obsolete and useless. This is referred to as the wrapper maintenance problem. Second, the learned wrapper can only be applied to the Web site from which the training examples come. In order to learn a wrapper
Table IV. The Set of Extraction Rules for Extracting the Book Title from the Web Page Shown in Figure 2 Obtained by Adapting the Wrapper from the Web Site Shown in Figure 1 Using Our IEKA Framework

Rule 1
Left pattern component: Scan Until(, TOKEN), Scan Until(“”, TOKEN), Scan Until(“”, TOKEN), Scan Until(“”, SEMANTIC).
Target pattern component: Contain(), Contain()
Right pattern component: Scan Until(“”, TOKEN), Scan Until(“”, TOKEN), Scan Until(“”, TOKEN), Scan Until(“”, TOKEN).

Rule 2
Left pattern component: Scan Until(, TOKEN), Scan Until(“”, TOKEN), Scan Until(“”, TOKEN), Scan Until(“”, SEMANTIC).
Target pattern component: Contain(), Contain()
Right pattern component: Scan Until(“”, TOKEN), Scan Until(“”, TOKEN), Scan Until(“”, TOKEN), Scan Until(“”, TOKEN).
for a different Web site, separate manual effort is required to prepare a new set of training examples.

Wrapper maintenance aims at relearning the wrapper if the wrapper is found to be no longer applicable. Several approaches have been developed to address the wrapper maintenance problem. RAPTURE [Kushmerick 2000b] verifies the validity of a wrapper using a regression technique. A probabilistic model is built from the items extracted while the wrapper is known to operate correctly. After the system operates for a period of time, the items extracted are compared against the model. If the extracted items are found to be largely different, the wrapper is believed to be invalid and needs to be relearned. However, RAPTURE can only partially solve the wrapper maintenance problem since it cannot learn a new wrapper automatically. Lerman et al. [2003] developed the DataPro algorithm to address this problem. It learns patterns from the extracted items. For example, one pattern learned from business names such as “Cajun Kitchen” represents a word containing alphabets only followed by another word starting with a capital letter. When the layout of
the Web site is changed, the DataPro algorithm will automatically label a new set of training examples by matching the learned patterns in the new Web page. The patterns are mainly composed of display format information such as the lower case and upper case of the items. However, it is doubtful that the items have the same display format in the old and new layouts of the Web site.

Several approaches have been designed to reduce the human effort in preparing training examples. These approaches have an objective similar to wrapper adaptation. Bootstrapping algorithms [Ghani and Jones 2002; Riloff and Jones 1999] are well-known methods for reducing the number of training examples. They normally initiate the training process with a set of seed words and incorporate the unlabeled examples in the training phase. However, bootstrapping algorithms assume that those seed words must be present in the training data, which can lead to ineffective training. For example, the word “Shakespeare” may appear in the title, or as the author, of a book.³ DIPRE [Brin 1998] attempts to find the occurrences of some concept pairs, such as title/author, in the documents to obtain training examples by finding text fragments that exactly match the user inputs. Once sufficient training examples are obtained, it learns extraction patterns from these training examples. DIPRE can reduce the effect of incorrect initiation in bootstrapping. However, it can only work on site-independent concept pairs such as title/author. It cannot extract site-dependent concept pairs such as title/price, because that would require the prices of a particular book to be the same in different Web sites and the prices from different sites to be known in advance. Moreover, quite a number of concept pairs need to be prepared in advance in order to obtain sufficient training examples. Cotesting [Muslea et al. 2000] is a semi-automatic approach for reducing the number of examples in the training phase.
The idea of cotesting is to learn different wrappers from a few labeled training examples. One may learn a wrapper by processing the Web page forward and learn another by processing the same Web page backward. These wrappers are then applied to the unlabeled examples. If the wrappers label an example differently, users are asked to manually label such inconsistent examples. The newly labeled examples are then added to the training set, and the process iterates until convergence. However, such an active learning approach can only partially reduce the human work.

ROADRUNNER [Crescenzi et al. 2001], DeLa [Wang and Lochovsky 2003], and MDR [Liu et al. 2003] are approaches developed to completely eliminate the human effort in extracting items from Web sites. The idea of ROADRUNNER is to compare the similarities and differences of the Web pages. If two different strings occur in the same corresponding positions of two Web pages, they are believed to be items to be extracted. DeLa discovers repeated patterns of the HTML tags within a Web page and expresses these repeated patterns as regular expressions. The items are then extracted in a table format by parsing the Web page against the discovered regular expressions. MDR first discovers the data regions in the Web page by building the HTML tag tree and making use of

³ The word “Shakespeare” appears in the title of the book “Shakespeare – by Michael Wood” and in the author of the book “Romeo And Juliet – by William Shakespeare”.
string comparison techniques. The data records in each data region are extracted by applying heuristic knowledge of how people commonly present data objects in Web pages. These three approaches do not require any human involvement in training and extraction. However, they suffer from one common shortcoming: they do not consider the type of the information extracted, and hence the items extracted by these systems require human effort to interpret their meaning. For example, if the extracted string is “Shakespeare,” it is not known whether this string refers to a book title or a book author.

Wrapper adaptation aims at automatically adapting previously learned extraction knowledge to a new unseen site in the same domain. This can significantly reduce the human work in labeling training examples for learning wrappers. In principle, wrapper adaptation can solve the wrapper maintenance problem. It can also be applied to other intelligent tasks [Lam et al. 2003; Wong and Lam 2005]. Golgher and da Silva [2001] proposed to solve the wrapper adaptation problem by applying a bootstrapping technique and a query-like approach. This approach searches for exact matches of items in the new unseen Web page. However, their approach shares the same shortcomings as bootstrapping. In essence, it assumes that the seed words, which refer to the elements in the source repository in their framework, must appear in the new Web page. Cohen and Fan [1999] designed a method for learning page-independent heuristics for extracting items from Web pages. Their approach is able to extract items in different domains. However, a major disadvantage of this method is that training examples from several Web sites must be collected to learn such heuristic rules. KNOWITALL [Etzioni et al. 2005] is a domain-independent information extraction system. Its idea is to make use of online search engines and bootstrap from a set of domain-independent and generic patterns from the Web.
It can extract the relation between instances and classes, and the relation between superclasses and subclasses. However, one limitation of KNOWITALL is that the proposed generic patterns cannot solve the multislot extraction problem, which aims at extracting records containing one or more attribute items.

The machine-labeled training example discovery component of our proposed framework is related to the research area of object identification or duplicate detection, which aims at identifying matching objects from different information sources. Tejada et al. [2001, 2002] developed a system called Active Atlas to solve the object identification problem. They designed a method for learning the weights of different string transformations. The identification of matching objects is then achieved by computing a similarity score between the attributes of the objects. MARLIN [Bilenko and Mooney 2003] is another object identification system; it is based on a generative model for computing string distance with affine gaps, and it applies SVMs to compute the vector-space similarity between strings. Cohen defined the similarity join of tables in a database containing free text data [Cohen 1999]. The database may be constructed by extracting data from Web sites. The idea is to consider the importance of the terms contained in the attributes and compute the cosine similarity between the attributes of the tuples. However, one major difference between our machine-labeled training example discovery and these object identification methods is that
[Figure 6 diagram: within a Web site domain, content knowledge α (domain dependent and site invariant) and the item knowledge β it contains give rise to the site-invariant features f_I of a Web page, while the context knowledge γ (site dependent) of each Web site gives rise to the site-dependent features f_D.]

Fig. 6. Dependence model of text data for Web sites for a particular domain.
machine-labeled training example discovery identifies the text fragments, which likely belong to the items of interest, within a Web page collected from the new unseen site, while object identification determines the similarity between records that are obtained or extracted in advance. Moreover, the goal of machine-labeled training example discovery is to identify the text fragments belonging to the items of interest, not to integrate information from different information sources. The techniques used in object identification are not applicable since it is common that the source Web site and the new unseen site do not contain shared records.

4. OVERVIEW OF IEKA

4.1 Dependence Model

Our proposed adaptation framework is called IEKA (Information Extraction Knowledge Adaptation). It is designed based on a dependence model of the text data contained in Web sites. Figure 6 shows the dependence model for a particular domain. Typically, there are different Web sites containing data records. Within a particular Web site, there is a set of Web pages containing some data items. For example, in the book domain, there are many bookstore Web sites. Each of these Web sites contains a set of Web pages, and each page displays items such as title, authors, price, and so on. Sometimes, a Web page is obtained by supplying a keyword to the internal search engine provided by the Web site.

Associated with each domain, there exists some content knowledge, denoted by α. This content knowledge contains the general information about the data items of the domain. For example, in the book domain, α refers to the knowledge that each book consists of items such as title, authors, and price. Within α, there is more specific knowledge, called item knowledge, associated with the items to be extracted.
For instance, the item title is associated with particular item knowledge, denoted by β, which refers to knowledge about the title: for example, a title normally consists of a few words, and some of the words may start with a capital letter. It is obvious that α and β are domain dependent. For example, the knowledge for the book domain and the knowledge for the consumer electronics appliance domain are different. α and β are also regarded as site-invariant since
such knowledge does not change across different Web sites. There is another kind of knowledge, called context knowledge, denoted by γ. Context knowledge refers to context information such as the layout format of a Web site. Different Web sites have different contexts γ. For example, in the book domain, the book title is displayed after the token “Title:” in one Web site, whereas the book title is displayed at the beginning of a line in another.

In a particular Web page, we differentiate two types of features. The first type is called the site-invariant feature, denoted by f_I. f_I is mainly related to the item content within the Web page and is dependent on α and β. For example, f_I can represent the text fragments regarding the title of a book. Due to this dependence on α and β, f_I remains largely unchanged in the Web pages from different Web sites in the same domain. The other type of feature is called the site-dependent feature, denoted by f_D. For example, f_D can represent the text fragments regarding the layout format of the title of a book in a Web page. Specifically, the titles of the books shown in Figure 1 are bolded and underlined. f_D is dependent on the context knowledge γ associated with a particular Web site. f_D is also dependent on β because each item may have a different context. As the context knowledge γ of different Web sites differs, the resulting f_D also differs for Web pages collected from different sites. However, the f_D of Web pages originating from the same site are likely unchanged because they depend on the same γ.

In wrapper induction, we attempt to learn the wrapper by manually annotating some training examples in the Web site. These training examples consist of the site-invariant features and the site-dependent features of the Web pages. Wrapper induction is thus a process of learning information extraction knowledge from the site-invariant and site-dependent features of the pages of the Web site.
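The distinction can be illustrated with a small sketch (the feature definitions and pages here are illustrative, not IEKA's actual formulation): the content features f_I of an item are identical across sites, while the context features f_D, taken from the surrounding markup, differ from site to site.

```python
import re

def f_I(item_text):
    """Content (site-invariant) features, derived from the item text alone:
    here simply (number of tokens, number of capitalized tokens)."""
    tokens = item_text.split()
    return (len(tokens), sum(t[0].isupper() for t in tokens))

def f_D(html, item_text):
    """Context (site-dependent) features: the markup tags immediately
    delimiting the item within one particular page."""
    m = re.search(r"(<[^>]+>)" + re.escape(item_text) + r"(<[^>]+>)", html)
    return m.groups() if m else None

source_page = "<b><u>Design Patterns</u></b>"         # hypothetical source site
target_page = "<font size=+1>Design Patterns</font>"  # hypothetical new site

content = f_I("Design Patterns")  # identical on every site displaying this title
# f_D(source_page, ...) and f_D(target_page, ...) differ, reflecting
# the different context knowledge gamma of the two sites.
```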
The learned wrapper can effectively extract information from the other pages of the same Web site because the site-invariant and site-dependent features of these Web pages depend on the same α and γ respectively. However, the wrapper learned from a source Web site cannot be directly applied to a new unseen Web site because the site-dependent features of the Web pages in the new unseen site depend on a different γ.
4.2 IEKA Framework Description
Our IEKA framework tackles the problem by making use of the site-invariant features as clues to solve the wrapper adaptation problem. IEKA first identifies the site-invariant features of the Web pages of the new unseen site. This is achieved by exploiting two pieces of information in the source Web site to derive the site-invariant features. The first piece of information is the extraction knowledge contained in the previously learned wrapper. The other piece of information is the items collected or extracted in the source Web site. To perform information extraction for a new Web site, the existing extraction knowledge contained in the previously learned wrapper is useful since the site-invariant features are likely applicable. However, the site-dependent features cannot be used since they are different in the new site. As mentioned in Section 2, we call such knowledge weak extraction knowledge. The items previously extracted or collected in the source Web site embody rich information about the item
ACM Transactions on Internet Technology, Vol. 7, No. 1, Article 6, Publication date: February 2007.
[Figure 7 depicts the pipeline of IEKA for an unseen target Web site: the previously learned extraction knowledge contained in the wrapper and the items previously extracted or collected from the source Web site feed the potential training text fragment identification stage (DOM analysis and modified K-nearest neighbours classification), followed by the machine-labeled training example discovery stage (content classification model and lexicon approximate matching), whose machine-labeled training examples drive the wrapper learning component to produce a new wrapper for the target Web site.]
Fig. 7. The major stages of IEKA.
content. For example, these extracted items contain some characteristics and orthographic information about the item content. These items can be viewed as training examples for the new site. However, they are different from the ordinary training examples because the former only contain information about the site-invariant features, while the latter contain information about both the site-invariant features and site-dependent features. As mentioned in Section 2, we call this property partially specified. By deriving the site-invariant features from the weak extraction knowledge and the partially specified training examples, IEKA employs machine-learning methods to automatically discover some training examples from the new Web site. These newly discovered training examples are called machine-labeled training examples. The next step is to analyze both the site-invariant features and site-dependent features of those machine-labeled training examples of the new site. IEKA then learns the new information extraction knowledge tailored to the new site using a wrapper learning component. Figure 7 depicts the major stages of our IEKA framework. IEKA consists of three stages employing machine-learning methods to tackle the adaptation problem. The first stage of IEKA is the potential training text fragment identification. In this stage, we employ an information-theoretic approach to analyze the DOM structures of the Web pages of the unseen Web site. The informative nodes in the DOM structure can be effectively identified. Next, the weak extraction knowledge contained in the wrapper from the source site is utilized to identify appropriate text fragments in these informative nodes as the potential training text fragments for the new unseen site. This stage considers the site-dependent features of the Web pages as discussed above. Some auxiliary example pages are automatically fetched for the analysis of the site-dependent features. 
A modified K-nearest neighbours classification model is developed for effectively identifying the potential training text fragments. The second stage is the machine-labeled training example discovery. It aims at scoring the potential training text fragments. Those “good” potential training text fragments will become the machine-labeled training examples for learning the new wrapper for the new site. This stage considers the site-invariant features of the partially specified training examples. An automatic text
[Figure 8 shows a tree whose root has child nodes book_title, a repetition node over author, and price; the price node has child nodes list_price and final_price.]
Fig. 8. The hierarchical record structure for the book information shown in Figure 2.
fragment-classification model is developed to score the potential training text fragments. The classification model consists of two components. The first component is the content classification component. It considers several features to characterize the item content. The second component is the approximate matching component, which analyzes the orthographical information of the potential training text fragments. In the third stage, based on the automatically generated machine-labeled training examples, a new wrapper for the new Web site is learned using the wrapper learning component. The wrapper learning component in IEKA is derived from our previous work [Lin and Lam 2000], a brief summary of which is given in the following.
4.3 Wrapper Learning Component
A wrapper learning component discovers information extraction knowledge from training text fragments. We employ a wrapper learning algorithm called HISER, described in our previous work [Lin and Lam 2000]. In this article, we only present a brief summary of HISER. HISER is a two-stage learning algorithm. The first stage induces a hierarchical representation for the structure of the records. This hierarchical record structure is a tree-like structure that can model the relationship between the items of the records. It can model records with missing items, multi-valued items, and items arranged in unrestricted order. For example, Figure 8 depicts a sample hierarchical record structure representing the records in the Web site shown in Figure 2. The record structure in this example contains a book title, a list of authors, and a price. The price consists of a list price and a final price. There is no restriction on the order of the nodes under the same parent. A record can also have any item missing. The multiple occurrence property of author is modeled by a special internal node called repetition. Each node in the hierarchical record structure is associated with a set of extraction rules. These extraction rules are automatically learned in the second stage of HISER. An extraction rule consists of three parts: the left pattern component, the right pattern component, and the target pattern component. Table V depicts one of the extraction rules for the final price for the Web document in Figure 2. Both the left and right pattern components make use
Table V. A Sample of an Extraction Rule for the Final Price for the Web Document Shown in Figure 2
Left pattern component: Scan Until(“Our”, TOKEN), Scan Until(“Price”, TOKEN), Scan Until(“:”, TOKEN), Scan Until(“”, SEMANTIC).
Target pattern component: Contain()
Right pattern component: Scan Until(“ ”, TOKEN), Scan Until(“ ”, TOKEN), Scan Until(“”, TOKEN), Scan Until(“”, SEMANTIC).
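The rule in Table V can be read procedurally. The following sketch is our own illustration, not the authors' implementation: token matching is simplified to string equality, with no semantic classes, and the token list is invented.

```python
# Sketch of applying a HISER-style left pattern component (our
# simplification). Scan Until consumes tokens until a match is found.

def scan_until(tokens, start, target):
    """Consume tokens from `start` until `target` matches; return the
    index just past the matching token, or None if there is no match."""
    for i in range(start, len(tokens)):
        if tokens[i] == target:
            return i + 1
    return None

def apply_left_pattern(tokens, left_pattern):
    """Run each Scan Until in turn; the item starts where the last ends."""
    pos = 0
    for target in left_pattern:
        pos = scan_until(tokens, pos, target)
        if pos is None:
            return None
    return pos

tokens = ["<b>", "Our", "Price", ":", "</b>", "$", "39.99"]
start = apply_left_pattern(tokens, ["Our", "Price", ":"])
print(tokens[start:])  # ['</b>', '$', '39.99']
```

The right pattern component would be applied analogously, scanning forward from the item to locate its right delimiter.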
[Figure 9 shows the semantic classes ALL, HTML_TAG, TEXT, DIGIT, PUNCT, FLOAT, HTML_OBJECTS_TAG, HTML_SPACE, HTML_FONT_TAG, and HTML_IMG_TAG arranged in a hierarchy rooted at ALL, covering the tokens “Our”, “Price”, “:”, and “39.99”.]
Fig. 9. Examples of semantic classes organized in a hierarchy.
of a token scanning instruction, Scan Until(), to identify the left and right delimiters of the item. The token scanning instruction instructs the wrapper to scan and consume any token until a particular matching token is found. The argument of the instruction can be a token or a semantic class. The target pattern component makes use of an instruction, Contain(), to represent the semantic class of the item content. An extraction rule-learning algorithm is developed based on a covering-based learning algorithm. HISER first tokenizes the Web document into a sequence of tokens. A token can be a word, a number, a punctuation mark, a date, an HTML tag, a specific ASCII sequence such as “&nbsp;”, which represents a space in HTML documents, or some domain-specific content such as manufacturer names. Each token is associated with a set of semantic classes that are organized in a hierarchy. For example, Figure 9 depicts the semantic class hierarchy for the text fragment “Our Price: 39.99” from Figure 5 after tokenization. HISER learns the extraction rules by performing lexical and semantic
generalization, until effective extraction rules are discovered. The details of HISER can be found in our previous work [Lin and Lam 2000].
5. POTENTIAL TRAINING TEXT FRAGMENT IDENTIFICATION
In our IEKA framework, the first stage is the potential training text fragment identification component. This stage bears some resemblance to the research area of object identification or duplicate detection, which aims at identifying matching objects from different information sources [Bilenko and Mooney 2003; Cohen 1999; Tejada et al. 2001, 2002]. However, it differs from object identification or duplicate detection in three aspects. The first aspect is that IEKA identifies text fragments within Web pages collected from the new unseen site, whereas object identification determines the similarity between records that have been obtained or extracted in advance. The second aspect is that IEKA identifies the text fragments belonging to the items of interest in the new site, while the aim of object identification is to integrate data objects from different information sources. The third aspect is that the source Web site and the new unseen site may not contain any common object. For instance, an object identification task determines whether the records “Art’s Deli” and “Art’s Delicatessen”, collected from two different restaurant information sources, refer to the same restaurant [Tejada et al. 2002]. These two records are stored in a database in advance. In contrast, our approach identifies the text fragment “Practical C++ Programming, 2nd Edition”, a substring of the entire HTML text document, in the Web page shown in Figure 2. Moreover, the source site may not even display this book in its Web pages. Therefore, the techniques developed for object identification are not applicable.
In our IEKA framework, potential training text fragments refer to the text fragments, collected from the Web pages of the new unseen Web site, that likely belong to one of the items of interest. Notice that the potential training text fragments identified at this stage are not yet classified as any particular item of interest. In the next stage, some of the potential training text fragments will be classified as different items and used as the machine-labeled training examples for learning the new wrapper for the new site in the last stage of IEKA. The idea of this stage is to analyze the site-dependent features and the site-invariant features of the new site. The DOM structure representation of the Web pages is utilized to identify the useful text fragments in the new site. A modified K-nearest neighbours method is employed to select the potential training text fragments.
5.1 Auxiliary Example Pages
IEKA will automatically generate some machine-labeled training examples in one of the Web pages of the new unseen Web site. We call the Web page from which the machine-labeled training examples are to be automatically collected the main example page M. Relative to a main example page, auxiliary example pages A(M) are Web pages from the same Web site, but containing different categories of item contents. For example, in the book domain, M may contain items about
Fig. 10. A portion of a sample Web page about networking books.
programming books. A(M) may contain items about networking books. Note that the main and auxiliary example pages are collected from the same site and hence the site-dependent features f_D of these Web pages depend on the same context knowledge γ, as described in Section 4.1. As the main example page and the auxiliary example pages contain different item contents, the text fragments regarding the item content differ across the Web pages, while the text fragments regarding the layout format are very similar. This observation gives a good indication for locating the potential training text fragments. Auxiliary example pages can easily be obtained automatically from different pages in a Web site. One typical method is to supply different keywords or queries automatically to the internal search engine provided by the Web site. For instance, consider the book catalog associated with the Web page shown in Figure 2. This Web page is generated by automatically supplying the keyword “PROGRAMMING” to the search engine provided by the Web site. If a different keyword such as “NETWORKING” is automatically supplied to the search engine, a new Web page, as shown in Figure 10, is returned. Only a few keywords are needed for a domain and they can easily be chosen in advance. The Web page in Figure 10 can be regarded as an auxiliary example page relative to the Web page in Figure 2. Figures 5 and 11 show excerpts of the HTML text documents associated with the Web pages shown in Figures 2 and 10 respectively. The bolded text fragments are related to the item content, while the remaining text fragments are related to the layout format. The text fragments related to the item content are very different in different Web pages, whereas the text fragments related to the layout format are very similar.
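Collecting a main example page and auxiliary example pages by querying a site's internal search engine can be sketched as follows. The endpoint URL and the query parameter name are hypothetical; real sites vary.

```python
# Hypothetical sketch: obtain the main example page M and auxiliary
# example pages A(M) by supplying different keywords to a site's
# internal search engine. The URL template and parameter are invented.
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "http://example-bookstore.com/search?"  # hypothetical endpoint

def build_query_url(keyword):
    """URL for a keyword query against the site's search engine."""
    return SEARCH_URL + urlencode({"q": keyword})

def fetch_result_page(keyword):
    """Fetch the search-result page for one keyword."""
    with urlopen(build_query_url(keyword)) as resp:
        return resp.read().decode("utf-8", errors="replace")

# Usage (would issue real HTTP requests):
# main_page = fetch_result_page("PROGRAMMING")   # main example page M
# aux_pages = [fetch_result_page(k) for k in ("NETWORKING", "DATABASE")]
```

Only the query-building step is shown executing; the fetches are left as comments since they depend on a live site.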
Fig. 11. An excerpt of the HTML texts for the Web page shown in Figure 10.
[Figure 12 shows element nodes with text-node children such as “Author: Curtis Frye”, “Published:”, “2004”, “List Price:”, “49.99”, and “Microsoft Excel ...”.]
Fig. 12. Part of the DOM structure representation for the Web page shown in Figure 2.
5.2 DOM Structure Analysis
A Web page can be represented by a DOM (Document Object Model)4 structure. A DOM structure is an ordered tree consisting of two types of nodes. The first type of node is called an element node, which represents HTML tag information. These nodes are labeled with the element name such as “”, “”, and so on. The other type of node is called a text node, which holds the text displayed in the browser and is labeled simply with the corresponding text. Figure 12 shows part of the DOM structure representation for the Web page shown in Figure 2. We develop an algorithm that can effectively locate the informative text nodes in the DOM structure. For each text node in the DOM structure, we define the path as the string created by concatenating the node labels from the first ancestor to the n-th ancestor, where n is a predefined value. For example, as shown in Figure 12, the paths for the text nodes labeled with “Published:” and “List Price:” are both equal to “ ” and the path for the text node labeled with “Microsoft Excel 2003 Programming inside Out” is “ | ” when n is set to 4. Note that each path may locate more than one text node in the DOM structure. We define the probability that
4 The details of the Document Object Model can be found in http://www.w3.org/DOM/.
Fig. 13. An outline of the DOM structure path-finding algorithm.
the term w_i occurs in the text nodes located by the path p as:

P(w_i, p) = N(w_i, p) / Σ_j N(w_j, p)

where N(w_i, p) is the number of occurrences of w_i in all the text nodes located by p. Next, we define the path entropy, E(p), as follows:

E(p) = − Σ_i P(w_i, p) log P(w_i, p).    (1)
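Equation (1) can be computed directly from token counts. The sketch below uses our own minimal DOM representation and whitespace tokenization, both of which are assumptions; HISER's actual tokenizer is richer.

```python
import math
from collections import Counter, defaultdict

class Node:
    """Minimal DOM node: element nodes have children, text nodes a string."""
    def __init__(self, label, children=(), text=None):
        self.label, self.children, self.text = label, list(children), text

def text_nodes_by_path(root, n=4):
    """Map each path (labels of the first n ancestors, nearest first,
    joined by '|') to the texts of the text nodes it locates."""
    located = defaultdict(list)
    def walk(node, ancestors):
        if node.text is not None:
            located["|".join(ancestors[:n])].append(node.text)
        for child in node.children:
            walk(child, [node.label] + ancestors)
    walk(root, [])
    return located

def path_entropy(texts):
    """E(p) from Equation (1), with whitespace tokenization."""
    counts = Counter(w for t in texts for w in t.split())
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

page = Node("html", [Node("body", [
    Node("b", [Node(None, text="List Price:")]),
    Node("i", [Node(None, text="Practical C++ Programming")])])])
located = text_nodes_by_path(page)
print({p: round(path_entropy(t), 3) for p, t in located.items()})
```

A path whose entropy grows when auxiliary pages are added to the forest is the kind of path the algorithm of Figure 13 retains.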
Note that E(p) can be calculated from more than one DOM structure by treating all the DOM structures as a forest; each P(w_i, p) is then calculated by considering all the text nodes located by p in the forest. Figure 13 shows an outline of our path-finding algorithm. The objective of this algorithm is to identify the paths that can locate some informative text nodes in the DOM structure. It first creates the DOM structures for the main example page M and all the auxiliary example pages A(M). Next, all the paths in the DOM structure, dom_M, for the main example page M will be identified. For each of these paths, the path entropy computed from dom_M alone and the path entropy computed from dom_M together with the DOM structures of A(M) are calculated. If the latter exceeds the former by a threshold, δ, this path will be included in the return path set. The rationale of Step 8 in the algorithm is that entropy is a measure of the randomness of a distribution. Recall that the main and auxiliary example pages consist of different site-invariant features. If the underlying path can locate text nodes consisting of site-invariant features, the term distribution under this path will become more complex when more pages are considered. On the other hand, if the underlying path can only locate text nodes corresponding to site-dependent features, the term distribution under this path will likely remain unchanged when more pages are considered, because the site-dependent features largely remain unchanged across different Web pages within a Web site. Hence, the resulting return path set will contain the paths that locate “complex” text nodes that highly likely consist of site-invariant features. For example, the path
“ | ” is one of the returned paths found by our path-finding algorithm for the Web page shown in Figure 2.
5.3 Modified K-Nearest Neighbour Classification Model
The text fragments within the text nodes located by the paths returned by the above algorithm become useful text fragments. Although the paths found by our algorithm can effectively identify useful text nodes containing site-invariant features, these paths may also incorrectly locate some other text nodes because each path may locate more than one text node at the same time. We develop a modified K-nearest neighbours classification model for filtering out these incorrect text fragments. Recall that the previously learned wrapper from the source Web site consists of the left pattern component, the right pattern component, and the target pattern component. This wrapper is not fully applicable in the new unseen target site due to the difference between the site-dependent features f_D of the source site and the target site. However, the target pattern component, which contains the semantic classes of the items and is regarded as the weak extraction knowledge of the source site, can be utilized for discovering useful text fragments in the new target site. Based on the weak extraction knowledge, we can obtain the set UTF(M) from the main example page M of the new target site. UTF(M) is the set of useful text fragments in M; each such text fragment contains the same set of semantic classes as the one contained in the target pattern component of the previously learned wrapper. From an auxiliary example page A(M), we can also obtain the set UTF(A(M)). As explained in Section 5.1, the text fragments regarding the item content in the main example page are less likely to appear in the auxiliary example pages, while the text fragments regarding the layout format will probably appear in both the main example page and the auxiliary example pages.
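The construction of UTF(M) from the weak extraction knowledge can be sketched as follows; the semantic-class tagger here is a crude stand-in of our own, not HISER's actual semantic hierarchy.

```python
def semantic_classes(fragment):
    """Crude stand-in for HISER's semantic tagging (our assumption)."""
    classes = set()
    for tok in fragment.split():
        if tok.isdigit():
            classes.add("DIGIT")
        elif tok.replace(".", "", 1).isdigit():
            classes.add("FLOAT")
        elif tok.isalpha():
            classes.add("TEXT")
    return classes

def useful_text_fragments(fragments, target_classes):
    """UTF: the fragments whose semantic classes match those in the
    target pattern component of the previously learned wrapper."""
    return [f for f in fragments if semantic_classes(f) == target_classes]

fragments = ["39.99", "Curtis Frye", "Our Price : 39.99"]
print(useful_text_fragments(fragments, {"FLOAT"}))  # ['39.99']
```

For a price item whose target pattern contains FLOAT, only the bare price fragment survives; fragments mixing layout text with the price carry extra classes and are excluded.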
Note that our objective is to retain the text fragments corresponding to the site-invariant features in M. Hence, all the elements in UTF(A(M)) are treated as negative instances. Each instance in the modified K-nearest neighbours classification model is represented by a set, t_i, containing the unique words in the text fragment. Suppose we have two text fragments t_1 and t_2. We define the similarity between these two text fragments, sim(t_1, t_2), as follows5:

sim(t_1, t_2) = |t_1 ∩ t_2| / max(|t_1|, |t_2|)    (2)
where t_1 ∩ t_2 denotes the intersection of the sets t_1 and t_2, and |t| denotes the number of elements in the set t. Some existing methods for object identification make use of the term frequency-inverse document frequency (TF-IDF) method to assign weights to
5 We have also tried different similarity measurements such as cosine similarity. We found that the similarity measurement described in Equation 2 has slightly better performance.
each term of the attributes of the objects [Cohen 1999; Bilenko and Mooney 2003; Tejada et al. 2001, 2002]. TF-IDF assigns higher weights to the important terms that are frequent in the document and infrequent in the whole corpus. Matching of important terms therefore gives a higher degree of confidence in matching of the objects. However, TF-IDF is not suitable for our classification problem. As distinct from ordinary classification problems, our goal is to identify the negative instances in UTF(M), that is, the incorrect or useless text fragments. Moreover, we only have a set of negative instances as training examples. These incorrect or useless text fragments normally contain unimportant terms that appear repeatedly in the main example page and the auxiliary example pages. As a result, the TF-IDF weighting method is not suitable for our problem. The goal of our classification model is to classify the potential training text fragments from UTF(M). To achieve this task, for each element in UTF(M), we first find the K nearest neighbours in UTF(A(M)), based on our similarity measure given in Equation 2. If the average similarity between the element in UTF(M) and its K nearest neighbours in UTF(A(M)) exceeds a threshold, θ, it will be classified as a negative instance. This is because a useful text fragment is unlikely to appear repeatedly in both the main example page and the auxiliary example pages. On the other hand, if the similarity is below θ, it will be classified as a potential training text fragment. Once the potential training text fragments for an item are identified, they will be processed by the text fragment-classification model in the machine-labeled training example discovery stage. Those “good” text fragments become the machine-labeled training examples for the new site.
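The similarity of Equation (2) and the modified K-nearest neighbours filter can be sketched as follows; the values of K and θ are illustrative assumptions, as are the example fragments.

```python
def sim(t1, t2):
    """Word-set similarity of two text fragments (Equation 2)."""
    s1, s2 = set(t1.split()), set(t2.split())
    return len(s1 & s2) / max(len(s1), len(s2))

def potential_fragments(utf_main, utf_aux, k=3, theta=0.6):
    """Keep elements of UTF(M) whose average similarity to their
    K nearest neighbours in UTF(A(M)) stays below theta."""
    kept = []
    for frag in utf_main:
        nearest = sorted((sim(frag, a) for a in utf_aux), reverse=True)[:k]
        if sum(nearest) / len(nearest) < theta:
            kept.append(frag)
    return kept

utf_m = ["Practical C++ Programming", "List Price :"]
utf_a = ["List Price :", "List Price :", "Home Networking Basics"]
print(potential_fragments(utf_m, utf_a))  # ['Practical C++ Programming']
```

The layout fragment “List Price :” recurs in the auxiliary pages and is filtered out as a negative instance, while the item-content fragment survives.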
6. DISCOVERY OF MACHINE-LABELED TRAINING EXAMPLES
As mentioned in Section 4, the partially specified training examples refer to the items previously extracted or collected in the source Web site. The rationale for using the partially specified training examples is that the item content can be represented by the site-invariant features. The partially specified training examples are used to train a text fragment-classification model that can identify “good” text fragments among the potential training text fragments. This text fragment-classification model consists of two components, which consider two different aspects of the item content. One aspect is the characteristics of the item content. For example, in the consumer electronics domain, a model number of a DVD player usually contains tokens mixing letters and digits and starts with a capital letter. The content classification component considers several such features and can be trained. The trained model can effectively characterize the item content. The second aspect is the orthographic information of the item content. For example, the model numbers of the products may share some characters found in the training examples. The approximate matching component makes use of the orthographic information of the item content to help classify the machine-labeled training examples.
6.1 Content Classification Component
We identify some features for characterizing the content of the items. A classification model can then be learned to classify the “good” potential training text fragments. The features used are as follows:
F1: the number of characters in the content
F2: the number of tokens in the content
F3: the average number of characters per token
F4: the proportion of the number of digit numbers to the number of tokens
F5: the proportion of the number of floating point numbers to the number of tokens
F6: the proportion of the number of alphabetic characters to the number of characters
F7: the proportion of the number of upper case characters to the number of characters
F8: the proportion of the number of lower case characters to the number of characters
F9: the proportion of the number of punctuation marks to the number of characters
F10: the proportion of the number of HTML tags to the number of tokens
F11: the proportion of the number of tokens starting with a capital letter to the number of tokens
F12: whether the content starts with a capital letter
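The features F1–F12 can be computed directly from a text fragment. The sketch below makes its own assumptions about tokenization, whitespace handling, and what counts as an HTML tag:

```python
import string

def content_features(fragment, tokens):
    """Compute features F1-F12 for a text fragment. Tokenization and
    the test for "HTML tag" are our simplifying assumptions."""
    chars = fragment.replace(" ", "")
    n_char, n_tok = len(chars), len(tokens)
    def is_float(t):
        try:
            float(t)
            return True
        except ValueError:
            return False
    return {
        "F1": n_char,
        "F2": n_tok,
        "F3": n_char / n_tok,
        "F4": sum(t.isdigit() for t in tokens) / n_tok,
        "F5": sum(is_float(t) for t in tokens) / n_tok,
        "F6": sum(c.isalpha() for c in chars) / n_char,
        "F7": sum(c.isupper() for c in chars) / n_char,
        "F8": sum(c.islower() for c in chars) / n_char,
        "F9": sum(c in string.punctuation for c in chars) / n_char,
        "F10": sum(t.startswith("<") for t in tokens) / n_tok,
        "F11": sum(t[:1].isupper() for t in tokens) / n_tok,
        "F12": fragment[:1].isupper(),
    }

feats = content_features("Practical C++ Programming",
                         ["Practical", "C++", "Programming"])
```

Vectors of these values, computed for positive and negative item content examples, are what the SVM described below is trained on.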
These features attempt to characterize the format of the items. Some of the features were also used in Kushmerick [2000b]. With the above feature design, a classification model can be learned from a set of training examples. The content classification model returns a score, f_1, which indicates the degree of confidence that a fragment is a “good” potential training text fragment. f_1 is normalized to a value between 0 and 1. The content classification model is trained from a set of training examples composed of positive item content examples and negative item content examples. The positive item content examples are the partially specified training examples in the source site. From the main page of the source Web site, M_s, we can obtain UTF(M_s) as described in Section 5. Those elements in UTF(M_s) which are not in the set of positive item content examples become the negative item content examples. Next, the values of the features F_i (1 ≤ i ≤ 12) of each positive and negative item content example are computed. To learn the content classification model, we employ Support Vector Machines (SVM) [Vapnik 1995].
6.2 Approximate Matching Component
French et al. [1997] discussed the effectiveness of approximate word matching in information retrieval. As mentioned above, the objective of the approximate
matching component is to boost the confidence of those potential text fragments that share some text similarities with the previously collected items. To enhance robustness, we make use of edit distance [Gusfield 1997] and design a two-level approximate matching algorithm to compare the similarity between two strings. At the lower level, we compute the character-level edit distance of a given pair of tokens. At the upper level, we compute the token-level edit distance of a given pair of text fragments. We illustrate our algorithm by an example. Suppose we obtain a potential training text fragment of the model number “PANASONIC DVDCV52” and a particular previously collected item content “PAN DVDRV32K”. (These two model numbers are actually obtained from two different Web sites in our consumer electronics domain experiment. They refer to the same brand of products, but with different model numbers.) At the lower level, we compute the character-level edit distance between two tokens, with the cost of insertion, deletion, and modification of a character all equal to one. The character-level edit distances computed are then normalized by the length of the longer token. For example, the normalized character-level edit distance between “PAN” and “PANASONIC” is 0.667. At the upper level, we compute the token-level edit distance between a potential training text fragment and a partially specified training example, with the cost of insertion and deletion of a token equal to one, and the cost of modification of a token equal to the normalized character-level edit distance between the tokens. The token-level edit distance is then normalized by the larger number of tokens between the potential training text fragment and the partially specified training example. For instance, the normalized token-level edit distance between “PANASONIC DVDCV52” and “PAN DVDRV32K” is 0.521.
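The two-level matching described above can be sketched with a standard dynamic-programming edit distance. Taking the token-level modification cost to be the normalized character-level distance reproduces the figures quoted in the text:

```python
def edit_distance(a, b, sub_cost=lambda x, y: 0 if x == y else 1):
    """Generic edit distance; insertion and deletion each cost 1."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                          # deletion
                          d[i][j - 1] + 1,                          # insertion
                          d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return d[m][n]

def char_dist(t1, t2):
    """Normalized character-level edit distance between two tokens."""
    return edit_distance(t1, t2) / max(len(t1), len(t2))

def token_dist(f1, f2):
    """Normalized token-level edit distance between two fragments;
    modifying a token costs its normalized character-level distance."""
    t1, t2 = f1.split(), f2.split()
    return edit_distance(t1, t2, sub_cost=char_dist) / max(len(t1), len(t2))

print(round(char_dist("PAN", "PANASONIC"), 3))                    # 0.667
print(round(token_dist("PANASONIC DVDCV52", "PAN DVDRV32K"), 3))  # 0.521
```

Both levels run in O(mn) time via the same dynamic-programming table.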
Both the character-level and token-level edit distances can be efficiently computed by dynamic programming. The score, f_2, of a potential training text fragment is then computed as follows:

f_2 = max_i D'(c, l_i)    (3)

where D'(c, l_i) = 1 − D(c, l_i) and D(c, l_i) is the normalized token-level edit distance between the potential training text fragment, c, and the i-th partially specified training example.
7. NEW WRAPPER LEARNING FOR THE UNSEEN WEB SITE
In the machine-labeled training example discovery stage, the scores from the content classification component and the approximate matching component are computed. The final score, Score(c), of each potential training text fragment c is given by:

Score(c) = λ f_1 + (1 − λ) f_2    (4)

where f_1 and f_2 are the scores obtained in the content classification component and the approximate matching component respectively; λ is a parameter controlling the relative weight of the content classification and approximate matching components, and 0 < λ < 1. After the scores of the potential training text fragments are computed, IEKA will select the “good” potential training text fragments as machine-labeled
training examples for the new unseen site. The N-best potential training text fragments will be selected as the machine-labeled training examples. Users could optionally scrutinize the discovered training examples to improve their quality. However, in our experiments, we did not conduct any manual intervention and the adaptation was conducted in a fully automatic way. After obtaining a set of machine-labeled training examples, IEKA makes use of the wrapper learning component HISER, derived from our previous work [Lin and Lam 2000] and briefly presented in Section 4.3, to learn a wrapper tailored to the new Web site. A small refinement of HISER is performed to suit the new requirement. The set of machine-labeled training examples is different from a set of user-labeled training examples because the former may contain inaccurate training examples. Noise in the training example set tends to cause overgeneralization of the extraction rules due to the scoring criteria of the extraction rule-learning algorithm in Lin and Lam [2000]. To cope with this effect, we introduce a metarule for restricting the number of token generalizations in the extraction rule induction algorithm. Each extraction rule may then cover fewer training examples due to the restriction on its generalization power. This does not degrade the extraction performance of the learned wrapper, as each extraction rule set in the wrapper may contain more extraction rules to broaden its coverage. This metarule can avoid the overgeneralization effect. The newly learned wrapper is tailored to the new Web site and can be applied to the remaining pages of the new site for information extraction.
8. CASE STUDY
In this section, we present a complete case study to illustrate the steps of adapting the wrapper from the source site to a new unseen site using IEKA. Refer to the scenario mentioned in Section 2.
We first train a wrapper for the source site shown in Figure 1 using training examples similar to the one depicted in Table I. The trained wrapper is then applied to other pages from the same Web site to automatically extract items. The extraction performance of the learned wrapper is very promising. Both precision and recall for extracting the title from other pages of the Web site are 99.0%. The precision and recall for extracting the author are 80.2% and 97.0% respectively. Both precision and recall for extracting the price are 100%. Though the learned wrapper can effectively extract items from the Web pages of the same site, it cannot extract any item when we directly apply it to a new site, shown in Figure 2. We apply our IEKA framework to adapt the extraction knowledge to this new site. At the potential training text fragment identification stage, IEKA first analyzes the DOM structure of the page, as presented in Section 5.2, to generate the useful text fragments from the Web page of the new site. This is achieved by identifying the paths in the DOM structure that can locate the informative text nodes in the Web page of the new site. Table VI shows samples of the paths discovered and the associated useful text fragments generated. After that, the modified K-nearest neighbours classification method is applied to identify the potential training
Table VI. Samples of the Paths Discovered and the Associated Useful Text Fragments Generated for the Web Page Shown in Figure 2
[Each row of the table pairs a discovered path with its associated useful text fragments, such as “Stephen Randy Davis”, “Herb Schildt”, “Steve Oualline”, “2003”, and “2002”.]