Data & Knowledge Engineering 68 (2009) 107–125


An unsupervised method for joint information extraction and feature mining across different Web sites

Tak-Lam Wong a,*, Wai Lam b

a Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong
b Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong

Article info

Article history:
Received 23 November 2007
Received in revised form 6 August 2008
Accepted 26 August 2008
Available online 19 September 2008

Keywords: Text mining; Web mining; Machine learning; Graphical models

Abstract

We develop an unsupervised learning framework which can jointly extract information and conduct feature mining from a set of Web pages across different sites. One characteristic of our model is that it allows tight interactions between the tasks of information extraction and feature mining. Decisions for both tasks can be made in a coherent manner leading to solutions which satisfy both tasks and eliminate potential conflicts at the same time. Our approach is based on an undirected graphical model which can model the interdependence between the text fragments within the same Web page, as well as text fragments in different Web pages. Web pages across different sites are considered simultaneously and hence information from different sources can be effectively leveraged. An approximate learning algorithm is developed to conduct inference over the graphical model to tackle the information extraction and feature mining tasks. We demonstrate the efficacy of our framework by applying it to two applications, namely, important product feature mining from vendor sites, and hot item feature mining from auction sites. Extensive experiments on real-world data have been conducted to demonstrate the effectiveness of our framework.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

With the rapid growth of Internet technology, which removes the geographical barrier between customers and vendors, online shopping is becoming increasingly common. Vendors set up their own Web sites providing basic information about a product, typically highlighting certain important product features for marketing purposes. Customers can shop over the Internet by browsing different vendor Web sites to learn more about their desired products. For example, Fig. 1 depicts a Web page collected from a vendor Web site1 showing information about a digital camera. This page contains several blocks of information. The uppermost block next to the photo provides a list of special or outstanding features of the camera; for instance, one major product feature is "12.8-megapixel". The table below the photo lists some basic product features. Though online shopping brings much convenience to customers, information conveyed by a single vendor may be biased. For example, Fig. 2 depicts a Web page collected from a Web site2 different from the one shown in Fig. 1. This page also describes some characteristics of the same digital camera shown in Fig. 1. Although the two pages refer to the same product and share some common product features, they also convey some different product features in their content.

* Corresponding author. Tel.: +852 3163 4259; fax: +852 2603 5024. E-mail addresses: [email protected] (T.-L. Wong), [email protected] (W. Lam).
1 The URL of this Web site is http://www.circuitcity.com.
2 The URL of this Web site is http://www.crutchfield.com.
0169-023X/$ - see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.datak.2008.08.009


Fig. 1. A sample of a Web page showing the details of a digital camera collected from a vendor Web site.

For example, "2.5 in. LCD Screen Size" and "12.8-megapixel" are common product features found in both vendor sites. On the other hand, the product feature "Included software" is found in Fig. 1, but not in Fig. 2. Typically, customers need to browse the Web sites of multiple vendors manually to obtain the commonly found important product features. From a vendor's point of view, it is beneficial to be aware of the important product features displayed on other vendors' sites, because it is advantageous to know how competitors present the products. Obviously, manually browsing the product information from a large number of different sites is time-consuming and tedious. This problem raises the need for information extraction, which aims at automatically extracting precise and useful text fragments from textual documents on different sites. Such extraction methods can be applied to extract the product features from different vendor Web sites. Apart from information extraction, there is also a need to conduct mining in order to achieve the goal of the problem; for example, we wish to identify the product features which are important. The tasks of information extraction and mining from multiple sites can be done in a mutually reinforcing manner.

One common assumption of existing approaches to information extraction from Web pages is that the information contained in each page on different sites is considered independently. Hence the information extraction task is conducted separately, without any interaction among different sites. For example, consider the Web pages shown in Figs. 1 and 2. These Web pages are collected from two different Web sites and they contain the product feature information for the same digital camera. To extract the product features, existing methods treat these pages separately. Considering these two pages jointly, however, has mutual benefit, since they contain useful clues which allow each extraction task to help the other. For instance, one product feature extracted from Fig. 1 is "12.8-megapixel". This discovered product feature strengthens the confidence in extracting the text fragment "Effective Megapixel Count 12.8" in Fig. 2 and classifying it as a product feature. From the layout format of Fig. 2, we can then infer that some product features are organized in a table format, and one can extract more product features such as "LCD Screen Size (inches) 2.5 in." and "Manual Focusing Yes". Similarly, the extraction results in Fig. 2 can in turn guide the extraction in Fig. 1. Consequently, the extraction processes for different pages can interact with each other, conducting extraction and mining collectively.

Another shortcoming of many existing information extraction learning techniques is the need for manually provided training examples. A substantial amount of human effort is required to prepare a large set of training examples in advance. This task is obviously tedious and time-consuming, and requires a high level of expertise. Moreover, owing to differences in layout format between Web pages, a model trained for a particular Web site is not able to extract information from Web pages collected from other Web sites. To extract information from other Web sites, one requires separate human effort to prepare another set of training examples.


Fig. 2. A sample of a Web page, collected from a vendor Web site different from the one in Fig. 1, showing the details of a digital camera.

We propose an unsupervised learning approach which can jointly extract information and conduct feature mining from a set of Web pages across different sites. One characteristic of our approach is that it allows tight interactions between the tasks of information extraction and feature mining. The decisions involved in information extraction and feature mining can be made in a coherent manner, yielding solutions that satisfy both tasks while eliminating potential conflicts. Our framework is designed based on an undirected graphical model, namely, Conditional Random Fields (CRFs) [29]. A CRF is a probabilistic model which can capture the interdependence between neighbouring text fragments within the same Web page, as well as between text fragments in different Web pages. We formulate the problem as a label assignment problem on this graphical model. Multiple Web pages can be considered using a unified model, and the information can be extracted by labeling the tokens collectively in a single framework.

The graphical model is automatically constructed by analyzing several clues from the Web documents. One useful clue is the layout format information of the Web pages. For example, the product features in vendor Web sites are normally organized in a certain regular format such as a list or table. Referring to Fig. 2, the product features are organized in a list format and in the same block. Such layout format information can help identify the text fragments to be extracted. We also consider the DOM3 structure of the Web pages. The DOM structure can effectively represent the structure of a Web page and provides very useful information for product feature extraction. Our approach also allows easy incorporation of external knowledge. Sometimes, users already have some prior knowledge about the domain from which the information is extracted, or knowledge about the mining tasks. For example, in the important product feature mining application, customer reviews can be treated as a form of prior knowledge in the modeling and learning process. We derive a set of features related to the layout format, the prior knowledge, and the DOM structure in our framework. An expectation-maximization (EM) based adaptive learning method is designed for parameter estimation in the model.

To demonstrate the robustness and efficacy of our framework, we apply it to two different applications, namely, important product feature mining from vendor Web sites and hot item feature mining from auction Web sites. For each application, we have conducted extensive experiments using a number of real-world Web sites in two different domains. The results are very encouraging, demonstrating the effectiveness of our framework.

3 DOM refers to the Document Object Model designed by W3C. The details of DOM can be found at http://www.w3.org/DOM/.


Wong et al. developed an earlier framework for feature mining [47,48]. The technique proposed in this paper is more robust, capable of dealing with more sophisticated interactions between different sites. The framework is also more general and can be easily applied to different applications. Moreover, we have conducted a more complete set of experiments to evaluate our framework.

The remaining content of this paper is organized as follows. Section 2 introduces two applications of our framework. Section 3 presents some related work. Section 4 describes the modeling and the learning algorithm employed in our framework. Section 5 introduces the application of our framework to one task, namely, important feature mining, and presents the experimental results. Section 6 describes the application of our framework to another task, namely, hot item feature mining from different auction Web sites, and presents the corresponding experimental results. We draw conclusions and discuss some potential extensions of our framework in Section 7.

2. Applications of our framework

We apply our framework to two applications in this paper. The first application is important product feature mining, which aims at extracting product features and discovering important product features for a particular product from different vendor Web sites. Important product features refer to product features that are located in the foremost viewable position, or shown in bold text, in italics, or in a colour different from the majority of normal text, and that are found in most of the vendor Web sites. The foremost viewable position refers to the part of a Web page which can be seen by users without scrolling. For marketing purposes, vendors are likely to place the product features to which most customers pay attention in a noticeable position, or display them in some perceivable format. The layout format of the product features can therefore be evidence for determining important product features. Customer reviews of products in online discussion forums can also be treated as prior knowledge for discovering important product features, since they may contain clues reflecting the customers' needs. For example, Table 1 shows sample customer reviews collected from an online discussion forum. These reviews mention some important product features, such as the LCD size, of the same digital camera model shown in Fig. 1. Such reviews can be collected automatically by information extraction wrappers [9,18]. The important product features discovered can be utilized in other intelligent tasks. Specifically, we make use of such information for mining similar products, which can reduce the human effort in analyzing the characteristics of different products.

We use a running example to illustrate our first application. It considers a set of Web pages collected from different vendor sites; Figs. 1 and 2 are two samples taken from this set. Some texts of customer reviews, such as those in Table 1, are collected and used as prior knowledge. Our approach can automatically extract a list of product features and the associated feature values, such as "Megapixel gross (13.3)" and "LCD Screen Size (2.5 in.)" from Fig. 1, and "Standard Resolution Capacity (446 images/512 MB)" and "CCD Size (3:2 CMOS)" from Fig. 2. At the same time, it can also identify, via our learning framework over the set of Web sites, that "LCD Screen Size (2.5 in.)" is relatively more important. One piece of evidence is that the product feature "LCD size" is repeatedly mentioned in many Web sites. This LCD size feature is also of interest to customers, as exemplified in Table 1. The set of product features discovered is very useful for customers and vendors to understand more about the characteristics of the product. In our application, we make use of these discovered important product features to conduct mining for similar digital camera products. One advantage of using important product features is that the similarity is less likely to be affected by "uninformative" product features such as "Standard Resolution Capacity".

We apply our framework to build the second application, namely, hot item feature mining from auction Web sites. Hot items refer to items that receive a large number of bids from potential buyers in the auction sites. Hot item feature mining aims at discovering the product features that most hot items possess. The features discovered can help sellers decide the starting bid price of their items. They can also help potential buyers make sensible decisions in bidding for items. The details of this application are presented in Section 6.

3. Related work

Various approaches have been proposed to extract information from semi-structured documents such as Web pages [1,5,10,18,25,28,40,42]. An information extraction wrapper is one of the promising techniques for extracting precise text fragments from Web documents [27]. A wrapper is normally composed of a set of extraction rules. Recently, wrapper induction methods have been proposed to automatically learn extraction rules from a set of training examples [8,36]. One major shortcoming of existing wrapper induction techniques is that the learned wrapper can only extract information from the same information source from which the training examples were collected.

Table 1
Sample customer reviews automatically collected from an online discussion forum (Digital Photography Review, http://www.dpreview.com)

1. Easy, extreme power, extreme AF, extreme quality
2. Beautiful large LCD, large image buffer
3. Very low noise, photos at 1600 ISO are like those at ISO 200-400 of other cameras...


For example, if the extraction rules are learned from training examples collected from a particular Web site, the learned extraction rules cannot be applied to extract information from other sites. Flesca et al. proposed a method to classify Web pages based on HTML tag structure and choose appropriate learned wrappers to extract information [17]. New wrappers can then be learned for the unclassified Web pages. Their method can only partially solve the problem. Moreover, the extraction rules learned can only extract the information specified in the training examples. For instance, if we only annotate the start time, end time, location, and speaker in the training examples in the seminar announcement domain, the learned wrapper can only extract these four attributes. Other useful information, such as the title of the seminar, will be lost.

Existing supervised wrapper learning approaches require human effort to prepare labeled examples for training wrappers. Several techniques have been proposed to reduce the need for training examples. Embley et al. proposed an approach to extracting information from Web pages by making use of a predefined ontological model of the domain [14]. Though their approach does not require training examples, manual effort and domain knowledge are still needed to define the domain ontology. Wong and Lam proposed a method for reducing the human effort by adapting the extraction knowledge from one information source to other previously unseen information sources [44]. However, the human work in preparing training examples cannot be completely eliminated.

Various techniques have been developed for fully automatic information extraction from Web pages without using any training examples. IEPAD is a system aiming at extracting information by recognizing the repeated patterns inside Web pages [4]. A system known as MDR can discover the data region in a Web page by making use of the repeated patterns in HTML tag trees [30]. Heuristics are then applied to extract useful information from the data region. However, both IEPAD and MDR assume that the input Web pages contain multiple records and repeated patterns. ROADRUNNER is another system making use of repeated patterns for information extraction [11]. The idea behind ROADRUNNER is that the Web pages of some Web sites are generated by an automatic Web page generation program; their layout formats are similar although the content of the Web pages differs. It exploits such evidence and recognizes the repeated patterns that appear in the Web pages. Hedley et al. proposed a two-phase sampling method to extract information from hidden Web documents whose data are stored in back-end databases [22]. The idea is to supply queries to search engines to sample a set of Web pages of a site. The templates which generate the Web pages are detected from repeated patterns and the query-related data are extracted. One common limitation of these approaches is that the Web pages are required to be generated from the same template or to have similar layout formats. This may not hold for Web pages collected from different sources. Crescenzi et al. partially solved this problem by developing an approach to clustering Web pages based on their structures [12]. Their approach, however, focuses on Web pages originating from the same site, and information extraction from different sources is not addressed. Chuang et al. proposed a method to synchronize the data extracted by wrappers from multiple sources [7].
The idea is to make use of unsupervised wrapper learning techniques to extract data from different Web pages. Since the data extracted may contain inconsistent information,4 the inconsistency is reduced by minimizing an objective function concerning the segmentation of the text fragments from different sources by means of encoding techniques. TextRunner is a system aiming at automatically extracting relations from Web pages on the Internet [2]. Grenager et al. applied hidden Markov models and exploited prior knowledge to extract information in an unsupervised manner [21]. However, the quality of the data extracted in fully automatic mode is unlikely to be suitable for subsequent data mining tasks. Another system, called Armadillo, has been developed aiming at extracting information from different sources without training examples [6]. However, the method relies on seeds extracted from structured sources or user-defined lexicons.

One common limitation of the existing information extraction approaches is that they can only extract information, but cannot conduct any mining task [16,24]. To discover new patterns or knowledge, a separate mining process is required, and the errors encountered in the extraction step unavoidably accumulate in the mining step. Recently, various techniques for collectively extracting information and conducting data mining have been proposed [32]. For example, Wellner et al. proposed an approach for extracting different fields in citations and solving the citation matching problem using Conditional Random Fields (CRFs) [43]. McCallum and Wellner also proposed an approach for extracting proper nouns and linking the extracted proper nouns using a single model [33]. Culotta et al. developed a method exploiting CRFs to extract entities from text and discover their relations [13]. Bunescu and Mooney proposed the use of relational Markov networks to collectively extract information from documents [3]. One major difference between these methods and our approach is that our approach is an unsupervised method and does not require any training examples.

Some existing methods related to product feature mining and auction Web site mining have been developed. Probst et al. [38] proposed a semi-supervised algorithm to extract attribute-value pairs from text descriptions. Their approach aims at handling free-text descriptions by making use of natural language processing techniques; hence, it cannot be applied to Web documents which are composed of a mix of HTML tags and free text. Morinaga et al. proposed an approach to mining product reputation from online customer reviews [34]. Their idea is to determine the polarity of the sentences describing the opinions of the customers by making use of a set of syntactic and linguistic rules. However, one limitation of their approach is that it cannot extract the features of products.

4 For example, a field of a book record from a particular Web site may contain the publication date information, while a field of a book record from another Web site may contain both the publication date information and the book format information. According to Chuang et al. [7], the two extracted fields of the records should be matched to refer to the same field.


Hu and Liu [23,31] proposed a system for summarizing customer reviews on products posted on Web sites. They aim at classifying sentences with subjective orientation by making use of subjective words such as "good" and "perfect". Popescu and Etzioni [37] also investigated this research problem. They first made use of the extraction system KnowItAll [15] to extract the explicit features of the product. The extracted explicit features are then utilized to identify the opinion or orientation of the reviews. Both of these methods apply linguistic techniques and focus on sentences which are largely grammatical. Moreover, existing opinion mining techniques fail to utilize the collected opinions to automatically predict the importance of the product features. Regarding the problem of mining from auction Web sites, Ghani and Simmons proposed an approach for predicting the price of items at the end of the bidding period by making use of different features extracted from the auction site [19,20]. They consider different kinds of features, including seller features, item features, auction features, and temporal features. Machine learning techniques are then employed to predict the end-price of the items. Wong and Lam attempted to extract and summarize the product features and the associated feature values of hot items from multiple auction Web sites [46]. The idea of their work is to extract the product features and the associated values using hidden Markov models; a graph mincut algorithm is then employed to identify the hot items in the auction sites by considering the extracted data. All of these existing methods suffer from one major shortcoming in that the tasks of extraction and mining are conducted without interaction: errors made in the extraction process are likely to be accumulated in the mining task.

4. Model description and learning approach

4.1. Problem definition

Consider a set of N Web pages denoted as P = {P_1, P_2, ..., P_N}. Each page P_i can be represented by a DOM structure denoted as D_i ∈ D. A DOM structure is an ordered tree containing two types of nodes. The first type, tag nodes, captures the presentation structure and layout format of HTML documents. The second type, text nodes, contains the text fragments to be displayed in the browser. As a result, a set of text fragments represented by X^i = {X^i_1, X^i_2, ..., X^i_{F_i}} can be collected based on the text nodes in D_i, where F_i refers to the number of text fragments in P_i. Each text fragment X^i_j can be considered a sequence of tokens represented by a sequence of observable variables (X^i_{j,1}, X^i_{j,2}, ..., X^i_{j,L^i_j}), where L^i_j denotes the number of tokens in the text fragment. In the sequence, the kth token is associated with an unobservable variable Y^i_{j,k}, which represents the label of the token. In other words, each token can be represented by the tuple (X^i_{j,k}, Y^i_{j,k}). Given a particular domain, we define R to be the prior knowledge about the domain.

4.1.1. Information extraction problem

Suppose the Web page shown in Fig. 1 and the corresponding DOM structure are denoted by P_i ∈ P and D_i ∈ D, respectively. The Web page can be broken down into a set of F_i text fragments X^i, collected from the text nodes in D_i. For example, "Canon EOS 5D Digital Camera" and "12.8-megapixel" are samples of text fragments in Fig. 1. The kth token in the sequence X^i_j, where 1 ≤ j ≤ F_i, is represented as X^i_{j,k} and associated with a label denoted as Y^i_{j,k}. Each Y^i_{j,k} can take one of a set of predefined values. In the important product feature mining application, each token can be labeled with one of three labels: product feature, product feature value, and normal text. For instance, the token observed as "12.8-megapixel" is labeled as product feature value. The prior knowledge of the domain is represented by a set of observable variables denoted by R. For example, customer reviews can be treated as prior knowledge in the important product feature mining application because they may mention some product features of a digital camera; the terms contained in the customer reviews can be represented by R. The goal of information extraction is to predict the labels of tokens based on the observations. In principle, we aim at determining Y* such that the probability P(Y = Y* | X, R, D) is maximized, where X and Y refer to the observations and labels of tokens from the Web pages in P, D denotes the set of DOM structures, and R refers to the prior knowledge of the domain.

4.1.2. Feature mining problem

Let V be a set of unobserved variables related to some non-trivial knowledge about a domain. Suppose the values of V depend on the observations and the labels of all the tokens (X^i_{j,k}, Y^i_{j,k}). For instance, in Fig. 1, the sequences "12.8-megapixel" and "Megapixel gross 13.3" are about product feature values. Each sequence can be associated with an unobservable variable in V showing whether the sequence is related to important product features. The variables in V are interdependent, since similar product features are likely to have similar importance.
They are also interdependent with the observations and the labels of the tokens, because normally the important product features are mentioned by many different Web sites, positioned in a perceivable area, or displayed with some special layout format such as bold text. For example, "12.8-megapixel" and "2.5 in. LCD screen" are listed in the topmost portion next to the camera photo in Fig. 1. We define the feature mining problem as determining the values of V* such that the probability P(V = V* | X, Y) is maximized.

4.1.3. Joint information extraction and feature mining problem

Intuitively, the information extraction and feature mining problems can be tackled in separate steps. However, since the values of both Y and V are unknown, the prediction of Y in the information extraction problem may contain errors, and these errors will be accumulated in the prediction of V in the feature mining problem. As a result, we define the joint information extraction and feature mining problem as follows: the objective is to find the values of both Y* and V* such that the joint probability P(V = V*, Y = Y* | X, R, D) is maximized. The joint problem solves the two tasks together and obtains a solution satisfying both. In this paper, we propose a probabilistic undirected graphical model to address this problem.
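Before moving to the model, the following is a minimal sketch of how the text fragments X^i defined in Section 4.1 could be collected from a page's DOM, using Python's standard html.parser. The set of skipped tags and the whitespace tokenization are illustrative assumptions, not the paper's exact preprocessing.

```python
from html.parser import HTMLParser

class TextFragmentCollector(HTMLParser):
    """Collects the text-node contents of an HTML page in document order,
    approximating the text fragments X^i_1, ..., X^i_{F_i} drawn from D_i."""

    SKIP = {"script", "style"}  # tag nodes whose text is never displayed

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.fragments = []  # each fragment is a token list (X^i_{j,1}, ...)

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        tokens = data.split()  # naive whitespace tokenization (an assumption)
        if tokens and self._skip_depth == 0:
            self.fragments.append(tokens)

collector = TextFragmentCollector()
collector.feed("<html><body><b>12.8-megapixel</b>"
               "<table><tr><td>LCD Screen Size</td><td>2.5 in.</td></tr>"
               "</table></body></html>")
print(collector.fragments)
# [['12.8-megapixel'], ['LCD', 'Screen', 'Size'], ['2.5', 'in.']]
```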
4.2. Modeling via undirected graph

In our framework, we formulate the tasks of joint information extraction and feature mining as a single graph labeling problem using Conditional Random Fields (CRFs). A CRF is a discriminative framework based on an undirected graphical model [29]. One advantage of CRFs is that they can model the interdependence between entities without the need to know their actual causality. Another advantage is that, unlike the naive Bayes model or other generative models which assume that the features are independent, CRFs allow the use of a number of overlapping or dependent features. Moreover, much of the literature shows that discriminative models can achieve promising performance in practice.

Each node in the graph represents a variable and each edge represents the interdependence between the connected variables. Consider our first application of important product feature mining from different vendor Web sites. Fig. 3 shows a simplified CRF model automatically constructed given a set of Web pages concerned with the same product; the graph is much larger when dealing with real data. There are two kinds of nodes: shaded nodes represent observable variables, while unshaded nodes represent unobservable variables. In page P_i ∈ P, the jth text fragment is represented by (X^i_j, Y^i_j). In the sequence, each Y^i_{j,k} is connected to Y^i_{j,k-1}, X^i_j and D_i as shown in Fig. 3, where 2 ≤ k ≤ L^i_j, since the label of each token, the labels of the neighbouring tokens, the observation of the sequence, and the DOM structure of the page are interdependent. For presentation clarity, we introduce another set of unobservable variables denoted by W^i, referring to the set of mentioned product features in the Web page P_i. For instance, in Fig. 1, "12.8-megapixel" and "Megapixel gross 13.3" are mentioned product features. Obviously, the mentioned product features are interdependent with the observations of the sequences and the labels of the tokens. The important product features are represented by the set of unobservable variables denoted by V. It is interdependent with the mentioned product features, as well as the observations of the Web pages, because normally the important product features are mentioned by many different Web sites, positioned in a perceivable area, or displayed with some special layout format such as bold text. The prior knowledge R of the domain is connected to all the Y^i_{j,k} and X^i_j since they are interdependent. For example, customer reviews can be treated as prior knowledge in the important product feature mining application because they may mention some product features of a digital camera. As a result, R can be a set of random variables representing the occurrence of words which exist in some customer reviews. Fig. 3 also shows another sequence (X^i_l, Y^i_l) collected from the same page P_i, and two different sequences, (X^{i'}_p, Y^{i'}_p) and (X^{i'}_q, Y^{i'}_q), collected from a page P_{i'} ∈ P where i ≠ i'. Once the undirected graph is constructed, the conditional probability of a particular configuration of the hidden variables, given the values of all the observed variables, can be written as follows:

$$P(y \mid x) = \frac{1}{Z(x)} \prod_{C(x,y) \in \mathcal{C}(x,y)} \Phi(C(x,y)) \qquad (1)$$

Fig. 3. Our proposed graphical model for important feature mining from multiple Web pages. Note that R should be connected to all X, Y. The edges between R and each of X and Y are not fully shown to avoid clutter.


where x and y denote the set of observable variables and the set of unobservable variables, respectively, C(x, y) denotes a clique of the graph, and 𝒞(x, y) refers to the set of cliques; a clique is defined as a maximal complete subgraph. Φ(C(x, y)) denotes the clique potential for C(x, y). Z(x) is called the partition function, defined as

$$Z(x) = \sum_{y} \prod_{C(x,y) \in \mathcal{C}(x,y)} \Phi(C(x,y)) \qquad (2)$$

We define the clique potential as a linear exponential function as follows:

$$\Phi(C(x,y)) = \exp\left(\sum_{i} \gamma_i f_i(x, y)\right) \qquad (3)$$

where f_i(x, y) and γ_i are the ith binary feature and the associated weight, respectively. For example, in the digital camera domain, f_i(x, y) is equal to one if the underlying token is "megapixel" and the class label is product feature, and zero otherwise. Hence, Eq. (1) can be written as follows:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\left(\sum_{i} \gamma_i f_i(x, y)\right) \qquad (4)$$
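As an illustration of Eqs. (2)-(4), the following toy sketch scores label sequences with two assumed feature functions and weights, and normalizes by computing Z(x) through brute-force enumeration. The features, weights, and label set are invented for illustration; real graphs are far too large for this enumeration, which motivates the message passing scheme discussed next.

```python
import math
from itertools import product

LABELS = ["feature", "value", "normal"]

# Two illustrative feature functions with assumed weights gamma_i.  Each sums
# a binary, position-level feature over the sequence, as in Eq. (3).
def f0(x, y):  # token "megapixel" labeled as a product feature
    return sum(1 for t, l in zip(x, y) if t == "megapixel" and l == "feature")

def f1(x, y):  # a "value" label immediately following a "feature" label
    return sum(1 for a, b in zip(y, y[1:]) if a == "feature" and b == "value")

FEATURES = [(f0, 1.5), (f1, 0.8)]

def potential(x, y):
    # Eq. (3): the exponential of the weighted feature sum.
    return math.exp(sum(g * f(x, y) for f, g in FEATURES))

def prob(x, y):
    # Eq. (4), with Z(x) of Eq. (2) computed by enumerating all
    # |LABELS|^len(x) labelings -- feasible only for toy inputs.
    Z = sum(potential(x, yp) for yp in product(LABELS, repeat=len(x)))
    return potential(x, y) / Z

x = ["megapixel", "12.8"]
print(prob(x, ("feature", "value")))  # the highest-probability labeling here
```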

Given the set of γ_i, one can find the optimal labeling of the unobserved variables of the graph by conducting inference. The graph typically admits a very large number of label combinations for the unobservable variables, so direct computation of the probability of a particular labeling is infeasible. The inference can instead be computed by a message passing algorithm, known as the sum-product algorithm, by transforming the graph into a junction tree or factor graph [26]. In this paper, we employ the factor graph approach to infer the values of the hidden variables. Given a factor graph without cycles, each node can be treated as the root of a tree. The depth d of a node is defined as the maximum number of edges for a message passed from the node to the furthest leaf node, and the depth d_max of the tree is the maximum such depth. Exact inference can be conducted in a single pass of messages, and hence d_max iterations are required. For a factor graph with cycles, the algorithm achieves approximate inference and, in practice, converges in a few iterations [35,41].

Suppose each variable node in the factor graph has M states. The computation of a message μ_{t→f}(x) from a variable node t to a factor node f requires (N(t) − 1) × M multiplications, where N(t) is the number of factor nodes connected to t. The computation of a message μ_{f→t}(x) from a factor node f to a variable node t involves a summation over the N(f) − 1 variables other than t, each of which can take M different states; each term in the summation is the product of the incoming messages μ_{t'→f}(x) over all variable nodes t' connected to f except t. As a result, this computation requires on the order of (N(f) − 1) × M^{N(f)} operations, and the overall complexity of the sum-product algorithm on the factor graph is exponential in N(f), the number of variable nodes connected to the factor node f. During the construction of the factor graph from the original graph of our framework, N(f) is bounded by the size of the largest clique, which is 5 (e.g., the clique connecting Y^i_{j,1}, Y^i_{j,2}, X^i_j, W^i and R). Moreover, the values of the observable variables are known, so we do not need to enumerate all combinations of the observable variables. As a result, inference in our framework can be conducted efficiently. In addition, since the computation of the message at each node only considers information from the neighbouring nodes, the computations for different nodes can be done in parallel, and distributed computation can be applied to accelerate the inference. By finding the configuration of the hidden variables achieving the highest conditional probability stated in Eq. (1), the desired important product features and product feature values can then be discovered.
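To illustrate the message passing computation, the sketch below runs exact sum-product on the simplest special case, a chain of labels, where the algorithm reduces to the familiar forward-backward recursions; each message here costs O(M^2), consistent with the complexity analysis above. The potential tables are assumed values; the paper's actual factor graph has larger cliques and cycles and therefore uses the loopy, approximate variant.

```python
import numpy as np

def chain_sum_product(unary, pairwise):
    """Exact sum-product on a chain: unary[t, m] is the potential of token t
    taking state m; pairwise[m, n] couples neighbouring labels Y_t, Y_{t+1}.
    Returns per-token marginals and the partition function Z(x)."""
    T, M = unary.shape
    fwd = np.zeros((T, M))          # messages passed left-to-right
    bwd = np.ones((T, M))           # messages passed right-to-left
    fwd[0] = unary[0]
    for t in range(1, T):
        fwd[t] = unary[t] * (fwd[t - 1] @ pairwise)      # O(M^2) per message
    for t in range(T - 2, -1, -1):
        bwd[t] = pairwise @ (unary[t + 1] * bwd[t + 1])  # O(M^2) per message
    Z = fwd[-1].sum()
    marginals = fwd * bwd / Z       # P(Y_t = m | x) for every t, m
    return marginals, Z

# Three tokens, M = 3 states (feature / value / normal), assumed potentials.
unary = np.array([[4.0, 1.0, 1.0],
                  [1.0, 3.0, 1.0],
                  [1.0, 1.0, 2.0]])
pairwise = np.ones((3, 3)) + np.eye(3)  # mild preference for label continuity
marg, Z = chain_sum_product(unary, pairwise)
print(marg.round(3), Z)
```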
4.3. Incorporating prior knowledge

As described before, prior knowledge is helpful for discovering important information in the domain. For example, in the important product feature mining application, important features are normally placed in the foremost viewable position or displayed in some special formats such as bold text. Recall that a CRF is characterized by a set of binary features and associated weights in Eq. (4). We can then easily incorporate prior knowledge by choosing the initial values of some weights before invoking our EM-based voted perceptron algorithm described in Section 4.4. For instance, suppose we know that the term "resolution" is likely to belong to a product feature of a digital camera. We can set a larger initial value for γ_i if f_i(x, y) denotes the binary feature that is equal to one if the token is "resolution" and its label is product feature. We introduce a set of such feature functions about the relationship between tokens, prior knowledge, and labels in our framework.

There are two types of prior knowledge used in our framework. The first type is derived from a set of customer reviews about the same domain collected from online discussion forums. Consider a term e appearing in the texts from the set of customer reviews. We can compute the normalized edit distance between e and tok, namely dist(e, tok), where tok refers to a particular token in the sequence. For each distinct term e in the texts, we define one binary feature function. This function equals one if dist(e, tok) is less than a predefined threshold θ and tok is considered an important product feature; it equals zero otherwise. For example, the review in Table 1 contains the term "LCD". We then design one binary feature function that equals one if dist("LCD", tok) is less than θ and the token is considered an important product feature. The initial weights for these feature functions are set to a higher value.

Another type of prior knowledge is related to the layout format of the Web pages. Similarly, we define a set of feature functions capturing the relationship between tokens, layout formats, and labels. For example, we define a function that equals one if the token is in some special layout format, such as bold or coloured, and is labeled as an important product feature; it equals zero otherwise. The initial weights for such feature functions are also set to a higher value.
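A sketch of the first type of prior-knowledge feature function follows. The Levenshtein distance normalized by the longer string's length is one plausible reading of dist(e, tok); the threshold value and the label name are assumptions made for illustration.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def dist(e, tok):
    """Normalized edit distance in [0, 1] (one plausible normalization)."""
    return levenshtein(e.lower(), tok.lower()) / max(len(e), len(tok), 1)

def make_review_feature(term, theta=0.3):
    """Binary feature: 1 iff dist(term, tok) < theta and tok is labeled as
    an important product feature; its initial weight gamma_i is set high."""
    def f(tok, label):
        return 1.0 if label == "important_feature" and dist(term, tok) < theta else 0.0
    return f

f_lcd = make_review_feature("LCD")
print(f_lcd("LCD", "important_feature"))   # 1.0
print(f_lcd("lens", "important_feature"))  # 0.0
```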

4.4. Unsupervised learning algorithm

Recall that our approach is an unsupervised learning method: the actual labels of the unobservable variables are not known, so we cannot apply existing CRF learning algorithms which estimate the weights γ_i associated with each f_i in Eq. (4) [29,39]. To tackle this problem, we develop an expectation-maximization (EM) based voted perceptron algorithm as shown in Fig. 4. In this algorithm, |Tra|, x^(j), ŷ^(j),k and γ_i^k denote the number of training examples, the jth training example, the predicted labeling for the jth training example in the kth iteration, and the weight for the ith feature function in the kth iteration, respectively. In the E-step of our algorithm, we estimate the probability of the labelings of the unobservable variables. In the M-step, we employ the voted perceptron algorithm augmented with the following weight updating function:

$$\gamma_i^{k+1} = \gamma_i^k + \rho \left\{ \sum_{y'} f_i\big(x^{(j)}, y'\big)\, P\big(y' \mid x^{(j)}; \gamma^k\big) - f_i\big(x^{(j)}, \hat{y}^{(j),k}\big) \right\} \qquad (5)$$

where ρ denotes the learning rate of the algorithm. The rationale of our learning approach is described below in detail. Suppose we have a set of training examples denoted by Tra for which the actual labels of the variables are known. We define the log likelihood function as follows:

$$L(\gamma) = \sum_{j} \log P\big(y^{(j)} \mid x^{(j)}\big)$$
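The following toy sketch illustrates one weight update of Eq. (5), brute-forcing the E-step expectation over all labelings of a short sequence. The feature functions, initial weights, learning rate ρ, and label set are assumptions; the full algorithm of Fig. 4 (not reproduced in this excerpt) iterates such updates over all examples in Tra with vote averaging.

```python
import math
from itertools import product

LABELS = ["feature", "value", "normal"]
RHO = 0.1  # learning rate rho (an assumed value)

# Two illustrative feature functions and their current weights gamma^k.
def f0(x, y): return sum(t == "megapixel" and l == "feature" for t, l in zip(x, y))
def f1(x, y): return sum(a == "feature" and b == "value" for a, b in zip(y, y[1:]))
features, gamma = [f0, f1], [0.5, 0.5]

def score(x, y):
    # Unnormalized potential of Eq. (4).
    return math.exp(sum(g * f(x, y) for f, g in zip(features, gamma)))

def em_perceptron_step(x):
    labelings = list(product(LABELS, repeat=len(x)))
    Z = sum(score(x, y) for y in labelings)
    # E-step: expected feature counts under the current model P(y' | x; gamma^k).
    expected = [sum(f(x, y) * score(x, y) / Z for y in labelings) for f in features]
    # Current prediction y_hat: the highest-scoring labeling under gamma^k.
    y_hat = max(labelings, key=lambda y: score(x, y))
    # M-step, Eq. (5): gamma_i^{k+1} = gamma_i^k + rho * (E[f_i] - f_i(x, y_hat)).
    for i, f in enumerate(features):
        gamma[i] += RHO * (expected[i] - f(x, y_hat))

em_perceptron_step(["megapixel", "12.8"])
print(gamma)
```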
