Knowl Inf Syst · DOI 10.1007/s10115-007-0078-2 · Regular Paper

Learning to extract and summarize hot item features from multiple auction web sites Tak-Lam Wong · Wai Lam

Received: 12 April 2006 / Revised: 14 November 2006 / Accepted: 26 January 2007 © Springer-Verlag London Limited 2007

Abstract It is difficult to digest the poorly organized and vast amount of information contained in auction Web sites, which are fast changing and highly dynamic. We develop a unified framework which can automatically extract product features and summarize hot item features from multiple auction sites. To deal with the irregularity in the layout format of Web pages and harness the uncertainty involved, we formulate the tasks of product feature extraction and hot item feature summarization as a single graph labeling problem using conditional random fields. One characteristic of this graphical model is that it can model the inter-dependence between neighbouring tokens in a Web page, tokens in different Web pages, as well as various information such as hot item features across different auction sites. We have conducted extensive experiments on several real-world auction Web sites to demonstrate the effectiveness of our framework.

Keywords Information extraction · Web mining · Conditional random fields

The work described in this paper is substantially supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project Nos.: CUHK 4179/03E and CUHK 4193/04E) and the Direct Grant of the Faculty of Engineering, CUHK (Project Codes: 2050363 and 2050391). This work is also affiliated with the Microsoft-CUHK Joint Laboratory for Human-centric Computing and Interface Technologies.

T.-L. Wong (B), Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong, e-mail: [email protected]

W. Lam, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, Hong Kong


1 Introduction The rapid development of Internet technology has effectively removed the geographical barriers between different communities. In particular, the easily accessible World Wide Web creates a profit-generating marketplace for sellers and a convenient shopping environment for customers. One example is the online auction business, such as the Web site ebay.com. In auction sites, sellers can place items such as a brand new digital camera or a second-hand MP3 player for bidding. Potential buyers can then browse auction sites and bid for their favorite items by offering a price they are willing to pay. Once the bidding period ends, the potential buyer who has offered the highest bidding price can purchase the item. In the past decade, online auction Web sites have become increasingly popular. According to the press release from ebay.com, they currently have 147 million community members and approximately 50 million items for sale at any given time.1 There are several reasons accounting for the popularity of the online auction business. One reason is that sellers can reduce their costs by making use of the online auction environment. Since a huge number of visitors browse online auction sites every day, sellers do not need to set up their own Web sites to promote their products. On the other hand, potential buyers can offer prices depending on their budgets and have a chance to purchase an item at a lower price if they bid with the right asking price at the right time. Every minute, a huge number of sellers and potential buyers are attracted to online auction Web sites. These auction sites contain a tremendous number of items from different categories listed for bidding with continuously changing prices. Moreover, there exists mutual influence among different participants and items in auction sites.
For example, the number of bids for a particular item can be seriously affected if another similar product is listed for bidding at a lower price. As a result, online auction sites become fast changing, highly dynamic, and complex systems. For instance, a digital camera may receive a large number of bids ranging from a few US dollars to a few hundred US dollars in just one or two days. Therefore, acquiring up-to-date and accurate information from auction Web sites offers many benefits to both sellers and potential buyers. Though online auction Web sites bring much benefit and convenience to sellers and potential buyers, the massive amount of continuously changing information they contain makes it difficult to digest and analyze. For example, when a seller intends to place an item for bidding, he/she is required to set a start bidding price. Some sellers may set the start bidding price based on their subjective expectations. This can easily result in the start bidding price being set too high, so that the chance of the item being sold is very slim, or being set too low, so that the return decreases. Other sellers may manually analyze the items currently listed for bidding and their prices before setting the start bidding price. However, this manual analysis of the vast amount of information is time-consuming and tedious. Besides sellers, it is also beneficial for potential buyers to obtain up-to-date, detailed, and accurate information to assist their decisions. For example, before bidding for a particular item, the potential buyer may study the description of the item and of other similar items

1 The article was posted on May 25th, 2005 and was accessible at http://biz.yahoo.com/bw/050525/255399.html?.v=1.


listed for bidding. After some investigation, he/she can then decide on the amount of money for the bid. Due to the highly dynamic and fast changing nature of online auction Web sites, rapid decisions are essential. If a potential buyer spends too long on analysis, he/she may either lose the opportunity to buy the item, or need to pay a higher price. We develop a framework which can automatically extract and summarize hot item features across different auction Web sites to assist sellers and buyers in decision making. One objective of our framework is to characterize the popularity of an item listed for bidding. Intuitively, a hot item is an item which attracts many potential buyers. However, a measurement of popularity that considers only the number of bids is not appropriate. As described before, the number of bids of a particular item can be severely affected by other similar items listed for bidding at a lower price. Although the former item attracts less interest from potential buyers, both should be considered hot items because they are actually similar to each other. Another reason is that potential buyers like to bid on a listed item just before the end of the bidding period. According to Auction Software Review, about one-fourth of the items receive only one bid at the end of the auction period, and potential buyers like to place their bids in the last minute [2]. Consequently, if hot items are characterized only by the number of bids, it is likely that some actual hot items will be ignored. Therefore, we characterize popularity based on the product features of the items. For example, a product feature of a digital camera can be "4 megapixel resolution". Normally, items listed for bidding come with descriptions provided by sellers. Figure 1 shows a sample Web page containing an item collected from an auction site. It contains a list describing the features of the digital camera.
Our approach can automatically discover product features from the different descriptions provided by sellers. However, the format of descriptions can range from regular formats such as tables to unstructured free text, making the extraction task difficult. For example, Fig. 2 depicts another Web page collected from the same auction site as the one in Fig. 1. Though both Web pages are about digital cameras, the layout format of the descriptions provided by the sellers is very different.

Fig. 1 A sample of Web page about the auction of a digital camera collected from ebay.com


Fig. 2 Another sample of Web page about the auction of a digital camera collected from ebay.com

To deal with the irregularity of Web pages and harness the uncertainty involved, we formulate the product feature extraction task and the hot item feature summarization task as a single graph labeling problem using conditional random fields [19]. One characteristic of this graphical model is that it can model the inter-dependence between neighbouring tokens in a Web page, as well as between tokens in different Web pages. As a result, Web pages collected from different Web sites can be considered under a coherent model, improving the extraction quality. This also leads to another characteristic: various information, such as the hot item feature information, can be easily integrated into the graphical model structure. We have conducted extensive experiments on several real-world auction Web sites to demonstrate the effectiveness of our framework. We reported a preliminary investigation of the problem of extracting and summarizing hot item features across different auction sites in our previous work [30]. The work described in this paper is substantially extended from that work. The remainder of the paper is organized as follows. Section 2 presents existing work related to hot item feature extraction and summarization. Next, an overview of our framework is given in Sect. 3. Section 4 describes the modeling and the learning algorithm employed in our framework. Section 5 describes our text fragment identification method, which segments an HTML document into a set of text fragments. Section 6 presents the experimental results on hot item feature extraction and summarization. Finally, we draw conclusions and present several directions for future work in Sect. 7.

2 Related work Ghani and Simmons describe closely related work on end-price prediction in auction Web sites [13,14]. They predict the price of items at the end of the bidding


period using four different kinds of features. The first kind of features is related to the sellers, such as the seller rating. The second kind is related to the auction, such as the first bid price of the item. The third kind is related to the item itself; it consists of indicators of the occurrence of certain phrases, such as "like new", in the title. The last kind, called temporal features, is obtained from the recent history of the same item. They compared different machine learning techniques, such as neural networks and decision trees, for end-price prediction based on these features. Our proposed framework differs from their work in several aspects. First, the objective of their approach is to predict the end-price, whereas our framework extracts and summarizes hot item features. Second, their approach assumes that each item placed for bidding is independent. However, as mentioned in Sect. 1, items actually have a substantial level of mutual influence. Hu and Liu [15] investigate the task of summarizing customer reviews posted on Web sites. Their work is similar to sentiment classification [32]. Their objective is to classify sentences with subjective orientation. They make use of opinion terms such as "perfect" and "good" as clues and extract frequent features of the product from reviews. Popescu and Etzioni [26] conduct research on this problem. They first make use of the extraction system KnowItAll [10] to extract explicit features of the product. The extracted explicit features are then utilized to identify the opinion or orientation of reviews. Both of these methods apply linguistic techniques and focus on sentences which are largely grammatical. In contrast, the work proposed in this paper discovers product features of hot items from descriptions provided by sellers in auction Web sites. Such descriptions vary greatly in layout format, ranging from rigid tables to free text.
Our work also differs from research on text summarization [22], whose objective is to produce a text summary from text documents, and from work on summarizing databases [27], which basically mines frequent item sets in a transaction database. For semi-structured documents such as Web pages, different information extraction techniques have been proposed [3,11,18,25]. A wrapper is a popular information extraction method; it usually consists of a set of extraction rules which can identify attributes of interest from Web documents. Several machine learning methods have been developed for automatic wrapper generation by learning extraction models from training examples, and they achieve promising results [1,6,8,12,17,31]. However, all these methods suffer from one common shortcoming: the learned wrapper can only extract the attributes specified in the training examples. For example, if we annotate only the start time, end time, location, and speaker in the training examples of the seminar announcement domain, the learned wrapper can only extract these four attributes. Other useful information, such as the seminar title, will not be extracted. In our previous work, we extended the traditional information extraction technique to discover new attributes in Web pages [29]. It should be noted that our objective of hot item feature extraction and summarization differs from that of ordinary information extraction, since our goal is not only to extract product features, but also to generate a summary of hot items across different auction Web sites. Some techniques have been developed for fully automatic information extraction from Web pages without using any training examples. IEPAD [5] and MDR [21] make use of repeated patterns in a Web page for extraction. A recent method proposed by Li et al. [20] represents a Web document as a DOM structure and discovers the data schema by detecting the largest common subtrees.
These methods can only handle Web pages containing multiple records. However, a Web page normally


consists of only one item for bidding in auction sites. RoadRunner does not require Web pages to contain multiple records [9]. However, it requires the Web pages to have a similar layout format, which is rare in auction Web sites. Recently, various techniques have been proposed for collectively conducting information extraction and data mining [23]. For example, Wellner et al. propose an approach for extracting different fields in citations and solving the citation matching problem using conditional random fields [28]. McCallum and Wellner also propose an approach for extracting proper nouns and linking the extracted proper nouns using a single model [24]. Bunescu and Mooney study the use of relational Markov networks to collectively extract information from documents [4]. One major difference between these methods and our approach is that our approach considers Web pages collected from different auction Web sites rather than documents from a single information source.

3 Overview of our framework We develop a unified framework which can extract product features and discover hot item features from multiple auction Web sites collectively. Our model is based on the undirected graphical model known as Conditional Random Fields (CRF) [19], which is a discriminative probabilistic model. A strength of CRF is that it can model the inter-dependence between different entities without knowing the actual causality between them. Unlike the naive Bayesian approach, another advantage of CRF is that it allows the use of dependent and overlapping features. Moreover, many recent works show that discriminative approaches generally achieve better performance than generative approaches. A Web page can be considered as a set of text fragments which correspond to sentences, rows in a table, items in a list, etc. Each of these text fragments can be regarded as a sequence of tokens. In information extraction, tokens are labeled with different tags. For example, we can design tag labels such as "product feature" and "normal text" to denote the different semantic meanings of tokens. The problem is then reduced to assigning the appropriate tag to each token. An undirected graph can be automatically constructed representing the conditional dependence among the observations and the tag labels of tokens. Next, the labeling can be accomplished by conducting inference on the graph structure. Table 1 shows a sample text fragment from a Web page of an auction site. In this example, the first row is the original text fragment extracted from the Web page and the second row gives the tag labels of the tokens, where f and n represent product feature and normal text respectively. To extract the product features, the goal is to identify the tags of the tokens. One challenging issue of this problem is that descriptions provided by different sellers are organized in different formats, as shown in Figs. 1 and 2.
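The token-tagging formulation can be illustrated with a small sketch (the data structures and the dictionary-lookup labeler below are illustrative only; the actual model assigns tags jointly by inference on the graph, as described in Sect. 4):

```python
# Illustrative sketch of the graph-labeling formulation: a text fragment
# is a sequence of tokens, and extraction reduces to assigning each token
# a tag -- "f" (product feature) or "n" (normal text).

FEATURE, NORMAL = "f", "n"

def label_fragment(tokens, feature_terms):
    """Toy per-token labeler: tag a token 'f' if it belongs to a known
    feature phrase. The real model infers tags jointly with a CRF rather
    than independently per token."""
    return [FEATURE if t.lower() in feature_terms else NORMAL for t in tokens]

# Example fragment following Table 1
fragment = ["Movie", "Mode", ":", "Quick", "Time", "Motion", "JPEG"]
known = {"movie", "mode", "quick", "time", "motion", "jpeg"}
tags = label_fragment(fragment, known)
# ":" is ordinary punctuation, so it receives the "normal text" tag
```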
Table 1 A sample of text fragment collected from a Web page of an auction site with each token labeled

Token |   | Movie | Mode |   | : | Quick | Time | Motion | JPEG
Tag   | n | f     | f    | n | n | f     | f    | f      | f

Therefore, it poses


a difficulty for extracting accurate text fragments related to product features. Another characteristic is that the product features to be extracted are not specified in advance. For example, the product feature "Movie mode" in the example depicted in Table 1 may be a previously unseen product feature found in only very few items listed for bidding. We tackle this problem by considering the clues embodied in the layout format of the individual Web page. For example, the Web page shown in Fig. 1 presents some product features arranged in a quite regular format. This regularity provides very useful information for solving the extraction task. However, as mentioned before, since the format of descriptions provided by sellers varies greatly, such clues cannot be obtained from other Web pages or through training. We solve this problem by designing an Expectation-Maximization (EM) based training algorithm to discover new product features. The idea of this EM based training algorithm is that we treat the tokens to be labeled in the testing set as unlabeled data characterized by their content and context characteristics. For instance, the words "movie" and "mode" are content characteristics. The context characteristics include layout format such as boldness, font size, or capitalization of tokens in Web pages. For example, the text fragment "4.0 megapixel CCD . . ." is formatted as a list item in Fig. 1. Our EM based training algorithm can make use of both content and context characteristics to discover previously unseen product features from different Web pages. Another property of our undirected graphical model is that, besides modeling the conditional dependence between the observations and the tag labels of tokens, it can also easily integrate various information, such as the hot item feature information in the auction Web sites. This is achieved by representing such information with nodes in the graph. A single graph can be automatically constructed.
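The content and context characteristics can be sketched as simple binary feature extractors (the feature names and the specific layout cues encoded here are illustrative assumptions, not the paper's exact feature set):

```python
# Sketch of content features (the token text itself) and context features
# (layout cues such as boldness, capitalization, or list membership) for a
# single token in a Web page. Feature names are hypothetical.

def token_features(token, is_bold=False, in_list=False):
    """Return a dict of binary features for one token."""
    return {
        "content=" + token.lower(): 1,                 # content characteristic
        "context:capitalized": int(token[:1].isupper()),
        "context:bold": int(is_bold),                  # layout format cue
        "context:in_list": int(in_list),               # e.g. inside <li>
    }

feats = token_features("Movie", is_bold=True, in_list=True)
```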
The product feature extraction and the hot item feature mining can then be carried out under this unified model by collaboratively conducting inference on the automatically generated graph structure. The advantage of using a unified model for the two tasks is that it allows tight interaction between them, removing the unnecessary boundary between the tasks. A global solution can then be obtained by optimizing the quality of both tasks while eliminating conflicts. Figure 3 shows the system overview of our framework, which consists of two major components. Our framework takes as input a set of Web pages, such as the ones in Figs. 1 and 2, collected from different auction sites. Each of these Web pages is first segmented into a set of text fragments by the first component, namely, text fragment identification. One sample of an identified text fragment is shown in Table 1. Each text fragment corresponds to a line, a row in a table, an item in a list,

Fig. 3 The system overview of our framework: Web pages from multiple auction sites are fed to the text fragment identification component (DOM structure analysis), whose output goes to the hot item feature extraction and summarization component (undirected graphical model with adaptive parameter training), producing the hot item features summary

Table 2 Samples of text fragments contained in the summary produced by our framework in the digital camera domain

1. Digital Zoom 8X
2. For Mac or Windows
3. 15–25 fps (for 640 × 480 pixels)
4. 2.0 TFT LCD Screen
5. About 120 g (without battery and SD card)
6. Add SD flash memory cards up to 1 GB to store over 1,000 pictures
7. Bundled Kits: Camera Bag

etc. in Web pages. This component utilizes the Document Object Model (DOM)2 structure representation of an HTML document to identify such text fragments from Web pages. After obtaining the text fragments, the second component, namely, hot item feature extraction and summarization, is invoked. The objective of this component is to extract product features from the Web pages and produce a summary of hot item features. Among the text fragments identified by the first component, some may contain tokens about product features, whereas others may contain uninformative or unrelated texts. We employ the graphical model and technique described above to identify hot item features. A summary is then generated to assist users in decision making. Table 2 shows samples of text fragments contained in the summary produced by our framework in the digital camera domain.
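The two-component pipeline just described can be sketched at a high level (the function bodies are placeholders for illustration only; the actual components are described in Sects. 4 and 5):

```python
# High-level sketch of the pipeline: text fragment identification followed
# by hot item feature extraction and summarization. Both function bodies
# are toy stand-ins for the real components.

def identify_fragments(page_text):
    """Stand-in for the DOM-based text fragment identification (Sect. 5)."""
    # Trivial line splitter for illustration only.
    return [line.strip() for line in page_text.splitlines() if line.strip()]

def summarize_hot_features(fragments):
    """Stand-in for the CRF-based extraction/summarization (Sect. 4)."""
    # Toy heuristic: keep fragments that look like spec lines (contain digits).
    return [fr for fr in fragments if any(ch.isdigit() for ch in fr)]

pages = ["Digital Zoom 8X\nGreat seller!", "2.0 TFT LCD Screen\nFree shipping"]
summary = []
for page in pages:
    summary.extend(summarize_hot_features(identify_fragments(page)))
```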

4 Graphical model for hot item feature mining 4.1 Model formulation In CRF, each node in the graph represents a variable and each edge represents the inter-dependence between the connected variables. Suppose we collect a set of Web pages from auction Web sites and we wish to discover hot item features. Figure 4 shows a simplified CRF model automatically constructed for the hot item feature mining task. The size of the graph is much larger when dealing with real data. There are two kinds of nodes. Shaded nodes represent observable variables while unshaded nodes represent unobservable variables. Suppose we have a collection of Web pages P. As mentioned above, a Web page M ∈ P can be regarded as a set of text fragments denoted by S_M, and each text fragment is considered as a sequence of tokens. For a particular sequence A ∈ S_M, each token carries two kinds of information. The first kind is the observation of the token, such as its content characteristics and context characteristics. This information can be observed and is represented by the observable variable X^A. The second kind is the labeling information of the token. In product feature extraction, each token is labeled with either product feature or normal text. This information is hidden and is represented by the unobservable variable Y^A. Notice that X^A and Y^A actually represent sequences of variables X_i^A and Y_i^A respectively, where 1 ≤ i ≤ L and L denotes the number of tokens in the sequence A. A node denoted by W^A represents the

2 The details of the Document Object Model can be found in http://www.w3.org/DOM/.

Fig. 4 Our proposed conditional random fields model for product feature extraction and hot item feature mining across different auction Web sites

identified product features in the sequence A. Each Y_i^A is connected to Y_{i-1}^A, Y_{i+1}^A, X^A, and W^A as shown in Fig. 4, since the tag label of each token is inter-dependent with the tag labels of the neighbouring tokens, the observation of the sequence, and the product features. There is another unobservable node, denoted Z^A, which refers to the hot item features found in the sequence. An observable variable denoted by α^M in Fig. 4 represents the number of bids for the item listed in page M. In page M, Z^A is connected with W^A and X^A because a hot item feature is related to the observation and the product feature found in the sequence. Z^A is also connected to α^M because a hot item feature is inter-dependent with the number of bids of the item listed in page M. For example, a product feature is likely to be a hot item feature if the item receives a high number of bids from potential buyers. In Fig. 4, the sequences B, C ∈ S_N are collected from the same page N ∈ P with N ≠ M. As mentioned in Sect. 1, a hot item is related not only to its number of bids, but also to other items listed for bidding. Therefore, X^B and Z^B, as well as X^C and Z^C in page N, are also connected to Z^A in page M. Once the undirected graph is constructed, the conditional probability of a particular configuration of the hidden variables, given the values of all observed variables, can be written as follows:

P(y|x) = \frac{1}{Z} \prod_{C(x,y) \in \mathcal{C}(x,y)} \Phi(C(x,y))    (1)


where x and y are the set of observable variables and the set of unobservable variables respectively, and \mathcal{C}(x, y) refers to the set of cliques of the graph. A clique is defined as a maximal complete subgraph. \Phi(C(x, y)) refers to the clique potential for the clique C(x, y). Z is called the partition function and is defined as:

Z = \sum_{y} \prod_{C(x,y) \in \mathcal{C}(x,y)} \Phi(C(x,y))    (2)

We define the clique potential as a linear exponential function:

\Phi(C(x,y)) = \exp\Big( \sum_i \gamma_i f_i(x,y) \Big)    (3)

where f_i(x, y) and \gamma_i are the ith binary feature and its associated weight respectively. For example, in the digital camera domain, f_i(x, y) equals one if the underlying token is "resolution" and the tag label is product feature, and equals zero otherwise. Hence, Eq. 1 can be written as:

P(y|x) = \frac{1}{Z} \exp\Big( \sum_i \gamma_i f_i(x,y) \Big)    (4)
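The log-linear form of Eqs. 3 and 4 can be verified numerically on a toy two-token sequence (the feature weights below are made-up values for illustration, and the brute-force enumeration of labelings stands in for proper inference):

```python
import math
from itertools import product

# Toy CRF over a 2-token sequence with binary tags ("f"/"n").
# score(y) = sum_i gamma_i * f_i(x, y);  P(y|x) = exp(score(y)) / Z  (Eqs. 3-4)

tokens = ["Movie", "Mode"]
# Each (token, tag) indicator acts as one binary feature f_i; the gamma
# values are fabricated for this sketch.
weights = {("Movie", "f"): 1.5, ("Mode", "f"): 1.2}

def score(tags):
    return sum(weights.get((tok, tag), 0.0) for tok, tag in zip(tokens, tags))

labelings = list(product("fn", repeat=len(tokens)))
Z = sum(math.exp(score(y)) for y in labelings)          # partition fn (Eq. 2)
prob = {y: math.exp(score(y)) / Z for y in labelings}   # Eq. 4
best = max(prob, key=prob.get)                          # most likely labeling
```

Brute-force enumeration is exponential in the sequence length; as noted below, real inference uses message passing on a junction tree or factor graph.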

Given the set of \gamma_i, one can find the optimal labeling of the unobserved variables of the graph by conducting inference. The graph typically admits a huge number of combinations of labels for all the unobservable variables. Hence, direct computation of the probability of a particular labeling of the unobservable variables is infeasible. The inference can be carried out by the message passing algorithm, also known as the sum-product algorithm, by transforming the graph into a junction tree or factor graph [16]. By finding the configuration of hidden variables achieving the highest conditional probability stated in Eq. 1, hot item features can then be discovered from the Web pages. 4.2 Adaptive training of CRF Learning in CRF refers to estimating the value of the weight \gamma_i associated with each f_i in Eq. 4. Suppose we have a set of training examples denoted by Tra for which the actual labels of the variables are known. We define the log likelihood function as follows:

L(\gamma) = \sum_{j=1}^{|Tra|} \Big( \sum_i \gamma_i f_i(x^{(j)}, y^{(j)}) - \log Z \Big)    (5)

where |Tra| and (x^{(j)}, y^{(j)}) denote the number of training examples and the jth training example respectively. The maximum likelihood approach aims at finding the set of \gamma_i which maximizes Eq. 5. It can be shown that Eq. 5 is convex and achieves its maximum when the following condition holds:

\frac{\partial L(\gamma)}{\partial \gamma_i} = \sum_{j=1}^{|Tra|} f_i(x^{(j)}, y^{(j)}) - \sum_{j=1}^{|Tra|} \sum_{y'} f_i(x^{(j)}, y') P(y'|x^{(j)}) = 0    (6)

Therefore, one can obtain the set of \gamma_i achieving the maximum of Eq. 5 by using iterative methods such as conjugate gradient methods or the voted perceptron algorithm [7]. In particular, Fig. 5 shows the outline of the voted perceptron algorithm

Fig. 5 The outline of the supervised voted perceptron learning algorithm for CRF

for learning the parameters. In essence, the voted perceptron algorithm estimates the weights by iteratively minimizing the following expression:

\Big| \sum_{j=1}^{|Tra|} f_i(x^{(j)}, y^{(j)}) - \sum_{j=1}^{|Tra|} f_i(x^{(j)}, \hat{y}^{(j)}) \Big|    (7)

where \hat{y}^{(j)} is the predicted labeling using the current weights. However, recall that one objective of our framework is to extract previously unseen product features contained in Web pages. To achieve this, we exploit the clues embodied in the context characteristics, such as the layout format of the extracted data. However, the extracted data cannot be used directly because they involve uncertainty. To tackle this problem, we treat the extracted data as unlabeled data and develop an expectation-maximization (EM) based voted perceptron algorithm, as shown in Fig. 6. In the E-step of our algorithm, we estimate the probability of the labeling of the unobservable variables. In the M-step, we employ the voted perceptron algorithm augmented with the following weight updating function:

\gamma_i^{k+1} \leftarrow \gamma_i^k + \rho \Big( \sum_{y'} f_i(x^{(j)}, y') P(y'|x^{(j)}; \gamma^k) - f_i(x^{(j)}, \hat{y}^{(j),k}) \Big)    (8)

Fig. 6 The outline of our EM based voted perceptron learning algorithm for CRF


Compared with the algorithm stated in Fig. 5, our EM based voted perceptron algorithm estimates the weights by iteratively diminishing the following expression:

\Big| \sum_{j=1}^{|Tra|} \sum_{y'} f_i(x^{(j)}, y') P(y'|x^{(j)}; \gamma^*) - \sum_{j=1}^{|Tra|} f_i(x^{(j)}, \hat{y}^{(j)}) \Big|    (9)

The first term of Eq. 9, i.e., \sum_{j=1}^{|Tra|} \sum_{y'} f_i(x^{(j)}, y') P(y'|x^{(j)}; \gamma^*), is the expected value of f_i, and it approaches the first term of Eq. 7, i.e., \sum_{j=1}^{|Tra|} f_i(x^{(j)}, y^{(j)}), when the data set is sufficiently large.

5 Text fragment identification

As mentioned in Sect. 3, the objective of the text fragment identification component is to segment an HTML document into a set of text fragments. Each text fragment corresponds to a line, a row in a table, an item in a list, etc. To achieve this, we first make use of the DOM structure representation of the HTML document to identify a set of text fragment candidates. However, some of the text fragment candidates identified may contain information related to item features, while others may contain uninformative texts such as advertisements, navigation menus, footers, or copyright statements. We develop a method for filtering out these uninformative text fragment candidates; the remaining ones become the text fragments on which the inference described in Sect. 4 is conducted.

Suppose there are two Web pages originating from the same auction site, but coming from two different domains. We observe that uninformative texts normally appear in both pages, while text fragments about item features appear in only one page. For instance, a text fragment coming from a Web page in the digital camera domain may contain tokens about the resolution of a particular digital camera. It is not likely that such tokens also appear in Web pages coming from the MP3 player domain. This provides a very useful clue for filtering out the uninformative text fragments in Web pages.

Figure 7 shows a portion of the DOM structure representation for the Web page depicted in Fig. 1. A DOM structure is an ordered tree consisting of two kinds of nodes. The first kind, namely, element nodes, contains the HTML tag information of the Web page. These nodes are labeled with HTML tag names such as "<LI>". The second kind is called text nodes, which are labeled with the texts displayed in browsers. For example, Fig. 7 contains text nodes labeled with "4.0-megapixel CCD . . ." and "1.8-inch color LCD monitor".

We identify a set of HTML tags which are useful for segmenting the HTML document into text fragments. Table 3 shows the HTML tags used and their functions. For example, the HTML tag "<BR>" denotes a new line in an HTML document and can be used to signal the beginning of a new text fragment. Our approach for identifying text fragment candidates traverses the DOM structure of the HTML document in a depth-first manner. If the visited node belongs to one of the tags in Table 3, it signals the start of a new text fragment. After the traversal, we obtain a set of text fragment candidates representing the document. For example, "Basic Features", "4.0-megapixel CCD . . .", "1.8-inch color LCD monitor", and "Copyright © 1995–2006 eBay Inc." are samples of text fragments identified from the Web page shown in Fig. 1.
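The segmentation step above can be sketched with Python's standard-library HTML parser, which visits nodes in document order (equivalent to a depth-first DOM traversal). The tag set mirrors the functions listed in Table 3; the class and function names, and the sample markup, are illustrative assumptions rather than the paper's implementation:

```python
from html.parser import HTMLParser

# Tags (per Table 3) that signal the start of a new text fragment.
SEGMENT_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "p", "br", "li", "th", "tr"}

class FragmentExtractor(HTMLParser):
    """Starts a new text fragment whenever a segmenting tag is visited;
    all other text accumulates into the current fragment."""
    def __init__(self):
        super().__init__()
        self.fragments = [[]]

    def handle_starttag(self, tag, attrs):
        if tag in SEGMENT_TAGS:
            self.fragments.append([])

    def handle_data(self, data):
        if data.strip():
            self.fragments[-1].append(data.strip())

def text_fragment_candidates(html):
    parser = FragmentExtractor()
    parser.feed(html)
    return [" ".join(frag) for frag in parser.fragments if frag]

page = "<ul><li>4.0-megapixel CCD</li><li>1.8-inch color LCD monitor</li></ul>"
print(text_fragment_candidates(page))
# → ['4.0-megapixel CCD', '1.8-inch color LCD monitor']
```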

    Extracting and summarizing features from auction Web sites






      Fig. 7 A portion of the DOM structure for the Web page shown in Fig. 1

Table 3 HTML tags used in identifying text fragments in an HTML document

HTML tag                   Function
<H1>, <H2>, ..., <H6>      Add a heading
<P>                        Start a new paragraph
<BR>                       Start a new line
<LI>                       Add a list item
<TH>                       Add a table heading
<TR>                       Add a row in a table

Each text fragment candidate can be represented by a set of distinct tokens. Let $t_i(p)$ denote the $i$th text fragment candidate in the Web page $p$. We define the similarity between the $i$th text fragment from page $p$ and the $j$th text fragment from page $q$ as follows:

$$sim(t_i(p), t_j(q)) = \frac{|t_i(p) \cap t_j(q)|}{\max\{|t_i(p)|, |t_j(q)|\}} \qquad (10)$$

where $|t|$ denotes the number of elements in the set $t$.

The outline of the text fragment identification algorithm is depicted in Fig. 8. Our method first collects Web pages from two different domains in the same auction site. This can be easily achieved by making use of the search engines provided by auction sites and querying with different keywords such as "Digital Camera" and "MP3 Player". Let $p$ and $q$ be the two Web pages collected from different domains. These Web pages can then be represented by DOM structures. Next, the sets of text fragment candidates representing $p$ and $q$ can be obtained as described above. Recall that uninformative texts normally repeat in both pages, while text fragments about item features appear in only one page. Therefore, we compute the similarity between all pairs of candidates obtained from $p$ and $q$. If the similarity between a pair exceeds a predefined threshold $\theta$, both candidates are removed from the sets of text fragment candidates. Finally, the remaining candidates are those dissimilar

    T.-L. Wong, W. Lam

    Fig. 8 The outline of our text fragment identification algorithm

    text fragments and they are selected for discovering hot item features as described in Sect. 4.
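The filtering step above can be sketched as follows, assuming whitespace tokenization and an illustrative threshold of θ = 0.5 (the paper does not fix θ here); all names are hypothetical:

```python
def sim(ti, tj):
    """Eq. 10: token-set overlap normalized by the larger fragment."""
    ti, tj = set(ti), set(tj)
    return len(ti & tj) / max(len(ti), len(tj))

def filter_uninformative(frags_p, frags_q, theta=0.5):
    """Drop candidates that are near-duplicated across the two domains
    (similarity above theta); the dissimilar remainder are kept as the
    text fragments likely to describe item features."""
    tok_p = [f.split() for f in frags_p]
    tok_q = [f.split() for f in frags_q]
    drop_p = {i for i, ti in enumerate(tok_p)
              if any(sim(ti, tj) > theta for tj in tok_q)}
    drop_q = {j for j, tj in enumerate(tok_q)
              if any(sim(ti, tj) > theta for ti in tok_p)}
    return ([f for i, f in enumerate(frags_p) if i not in drop_p],
            [f for j, f in enumerate(frags_q) if j not in drop_q])

p = ["4.0-megapixel CCD", "Copyright 1995-2006 eBay Inc."]
q = ["60 GB hard drive", "Copyright 1995-2006 eBay Inc."]
print(filter_uninformative(p, q))
# → (['4.0-megapixel CCD'], ['60 GB hard drive'])
```

The copyright line, repeated verbatim in both domains, is filtered out, while the domain-specific feature fragments survive.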

6 Experimental results

We conducted extensive experiments on three real-world auction Web sites in two domains, namely, the digital camera domain and the MP3 player domain, to demonstrate the effectiveness of our framework. The three auction Web sites are www.ebay.com, auctions.yahoo.com, and www.ubid.com. In each domain, we collected 50 Web pages from each of the auction sites for evaluation. Each Web page contains an item listed for bidding whose remaining bidding period is less than an hour. We conducted two sets of experiments to evaluate our approach to product feature extraction and hot item feature summarization.

We manually annotated the product features in the Web pages. These annotated product features served as the gold standard in our evaluation. In each domain, we randomly chose 5 pages whose items received at least one bid from potential buyers in each of the Web sites (a total of 15 Web pages) to produce the set of training examples used to train our model as described in Sect. 4.2. The trained model is then applied to the remaining Web pages to extract the product features of items. Recall (R), precision (P), and F-measure (F) are adopted as evaluation metrics. Recall is defined as the number of product features the system correctly identified divided by the total number of actual product features. Precision is defined as the number of product features the system correctly identified divided by the total number of features it extracts. F-measure is defined as 2PR/(P + R).

Table 4 depicts the extraction performance of our approach. Our approach achieves about 81 and 75% average precision and recall, respectively, in the digital camera domain, and 76 and 74% average precision and recall, respectively, in the MP3 player domain. This shows that our approach can effectively leverage the content and context characteristics to extract product features.
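The metrics above can be computed with a few lines of code; the sketch below assumes extracted and gold-standard features are comparable as exact strings (the feature values shown are invented examples):

```python
def evaluate(extracted, gold):
    """Precision, recall, and F-measure for feature extraction:
    P = |correct| / |extracted|, R = |correct| / |gold|, F = 2PR / (P + R)."""
    correct = len(set(extracted) & set(gold))
    p = correct / len(extracted)
    r = correct / len(gold)
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f

gold = {"4.0-megapixel CCD", "3x optical zoom", "1.8-inch LCD"}
extracted = {"4.0-megapixel CCD", "1.8-inch LCD", "free shipping"}
p, r, f = evaluate(extracted, gold)
print(round(p, 2), round(r, 2), round(f, 2))
# → 0.67 0.67 0.67
```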
Next, we employ our framework to generate the summary of hot item features in the digital camera and MP3 player domains. To increase comprehensibility, we generate the summary by outputting text fragments containing hot item features instead of

Table 4 Experimental results of our approach to extracting product features (P, R, and F refer to precision, recall, and F-measure, respectively)

           Digital camera           MP3 player
           P      R      F          P      R      F
ebay       0.77   0.62   0.69       0.71   0.64   0.67
yahoo      0.87   0.88   0.87       0.75   0.78   0.76
ubid       0.78   0.75   0.76       0.81   0.79   0.80
Ave.       0.81   0.75   0.78       0.76   0.74   0.75

Ave. refers to the average extraction performance

Table 5 Some of the text fragments containing hot item features in the digital camera and MP3 player domains

Domain                   Text fragments about hot item features
Digital camera domain    Digital Zoom 8X; For Mac or Windows; 15–25 fps (for 640 × 480 pixels); 2.0 TFT LCD Screen; About 120 g (without battery and SD card); Add SD flash memory cards up to 1 GB to store over 1,000 pictures; Bundled Kits: Camera Bag
MP3 player domain        Charge and sync via one simple USB 2.0 connection; Apple iPod; Battery life: Up to 10 h; Condition: New; USB 2.0 cable; PC and Mac Compatible; 3.5 mm Headphone Jack

individual tokens. Table 5 shows some of the text fragments extracted. We manually investigated the items listed for bidding in the auction Web sites and found that over 70% of the items receiving at least one bid from potential buyers contain at least three of the hot item features reported in the summary. This demonstrates that the generated summary is very helpful for auction Web site participants.

7 Conclusions and future work

We have developed a unified framework which is able to extract and summarize hot item features across different auction Web sites. Our system can assist sellers and potential buyers in decision making. One challenge of this problem is to extract information from product descriptions whose layout formats vary greatly among different sellers. We formulate the problem as a single graph labeling problem employing conditional random fields. The solution is then obtained by conducting inference on the graph. One characteristic of our framework is to extract previously unseen


product features by making use of the clues embodied in the layout format of text fragments. We have designed an EM based voted perceptron algorithm to conduct training. Extensive experiments on several real-world auction Web sites have been conducted to demonstrate the effectiveness of our framework.

We intend to extend our framework in several directions. One possible extension is to incorporate the prior knowledge of users into our framework. Very often, users may have some prior knowledge about the domain in advance. For example, they may know that resolution and optical zoom are common features of digital cameras. Such domain knowledge may be represented in the form of an ontology and incorporated into the training and inference in our framework. We intend to develop a mechanism which allows users to incorporate domain knowledge easily.

Another possible direction is to apply our framework to other data mining problems. The undirected graphical model described in this paper is a general model capturing the dependence between different variables. It can be applied to other data mining problems such as important product feature mining. Normally, in an online vendor Web site, each product has a list of features. Some of these features can be regarded as important features because they are special characteristics of the particular product. Important feature mining aims at extracting product features from different online vendor Web sites and identifying those important features.


Tak-Lam Wong received B.Eng., M.Phil., and Ph.D. degrees from the Chinese University of Hong Kong in 2001, 2003, and 2006, respectively. He is currently with City University of Hong Kong. His research interests lie in the areas of Web mining, data mining, information extraction, machine learning, and knowledge management.

Wai Lam received a Ph.D. in Computer Science from the University of Waterloo. He obtained his B.Sc. and M.Phil. degrees from the Chinese University of Hong Kong. After completing his Ph.D., he conducted research at Indiana University Purdue University Indianapolis (IUPUI) and the University of Iowa. He then joined the Chinese University of Hong Kong, where he is currently a professor. His research interests include intelligent information retrieval, text mining, digital libraries, machine learning, and knowledge-based systems. He has published articles in IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Transactions on Knowledge and Data Engineering, ACM Transactions on Information Systems, etc.
