Hot Item Mining and Summarization from Multiple Auction Web Sites

Tak-Lam Wong and Wai Lam
Department of Systems Engineering and Engineering Management
The Chinese University of Hong Kong, Shatin, Hong Kong
{wongtl,wlam}@se.cuhk.edu.hk

Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05). 1550-4786/05 $20.00 © 2005 IEEE

Abstract

Online auction Web sites are fast changing, highly dynamic, and complex, as they involve a tremendous number of sellers and potential buyers as well as a huge number of items listed for bidding. We develop a two-phase framework which aims at mining and summarizing hot items from multiple auction Web sites to assist decision making. The objective of the first phase is to automatically extract the product features and product feature values of the items from the descriptions provided by the sellers. We design an HMM-based learning method to train an extended HMM model which can adapt to the unseen Web page from which the information is extracted. The goal of the second phase is to discover and summarize the hot items based on the extracted information. We formulate the hot item mining task as a semi-supervised learning problem and employ the graph mincuts algorithm to accomplish this task. The summary of the hot items is then generated by considering the frequency and the position of the product features mentioned in the descriptions. We have conducted extensive experiments on several real-world auction Web sites to demonstrate the effectiveness of our framework.

1. Introduction

Due to the tremendous number of sellers and potential buyers, online auction Web sites are fast changing, highly dynamic, and complex systems. A great number of items from different categories are listed for bidding at any time, and their prices change quite frequently. For example, some items in the digital camera category can attract a large number of bids ranging from a few US dollars to a few hundred US dollars in just one or two days. Moreover, the items placed for bidding have mutual influences. For instance, the chance of an item being sold may be seriously affected if another similar item is placed for bidding with a lower bidding price. Therefore, acquiring up-to-date and accurate information from the auction Web sites offers many potential benefits.

We develop a framework which can automatically mine and summarize hot items from multiple auction Web sites to assist the sellers and the buyers in decision making. One objective of our framework is to characterize the popularity of an item listed for bidding. Intuitively, a hot item is an item which attracts many potential buyers for bidding. As mentioned before, it is quite common that a potentially hot item may have few or no bids because another similar item is listed for bidding at a lower price. Both of these items actually attract many buyers' interest and should be considered hot items. As a result, the popularity of the items cannot be measured by the number of bids alone. Our approach for characterizing the popularity of the items is based on the product features and the associated product feature values of the items.

Our approach can automatically extract the product features and the product feature values from the descriptions provided by the sellers. This extraction task is a challenging problem since the format of the descriptions varies greatly, ranging from regular formats such as tables to unstructured free text. To reliably extract the product features and the product feature values, we employ Hidden Markov Models (HMMs). One property of our HMM is that we make use of two different kinds of states. The first kind of states, called the content states, models the content characteristics of the product features and the product feature values, such as the words or terms used in the description. The second kind of states, called the context states, models the context characteristics, such as the formatting and visual layout used in the description.

Our framework is able to extract product features and the associated product feature values from multiple auction Web sites and conduct analysis on such a collection of information. We make use of such information to identify the popularity of the items. In the auction Web sites, the items with a high number of bids are automatically regarded as hot items, while the items with few or no bids are modeled as unlabeled data. In essence, hot item mining can then be formulated as a semi-supervised classification problem. We employ the graph mincuts algorithm to accomplish this mining task.

Another objective of our framework is to summarize the features of the hot items. Our summarization approach considers two kinds of information. The first kind is the frequency of occurrence of the product features and the product feature values mentioned in the descriptions of the hot items. The second kind is the location of the product feature in the description. A summary of the hot items is then generated according to an importance value. We have conducted extensive experiments on several real-world auction Web sites to demonstrate the effectiveness of our framework.

(The work described in this paper was substantially supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project Nos: CUHK 4179/03E and CUHK 4193/04E) and a CUHK Strategic Grant (No: 4410001). This work was also partially supported by the Microsoft-CUHK Joint Laboratory.)

2. Related Work

A closely related work is end-price prediction for auction Web sites [3]. The objective of end-price prediction is to make use of different features to predict the price of an item at the end of the bidding period. Our proposed framework differs from their work in several aspects. First, the objective of their approach is to predict the end price, whereas our framework mines and summarizes hot items. Second, they assume that each item placed for bidding is independent. However, as mentioned in Section 1, the items actually have a lot of mutual influences.

The summarization of the product features of hot items is related to the research topic of text summarization [6]. The objective of text summarization is to produce a text summary. Another related work, done by Hu and Liu [4], aims at summarizing customer reviews posted on Web sites. Their work is also similar to sentiment classification [8]. They aim at extracting sentences with subjective orientation and make use of subjective words such as "good" and "perfect" as clues. Different from the above summarization work, the goal of the summarization presented in this paper is to generalize and digest the product features of hot items from multiple auction Web sites.

In our proposed framework, there is a task of automatically extracting the product features and the associated product feature values from the descriptions provided by the sellers in the auction Web sites. This task is related to the problem of relation extraction from textual documents. A major characteristic of relation extraction is that it extracts the attributes of interest from the documents and the relations between the attributes. For example, Snowball [1] and the method proposed by Zelenko et al. [9] attempt to extract person-affiliation as well as organization-location relations from free texts which are mostly grammatically correct. However, in the Web environment, the texts are usually not grammatically correct, and hence these two methods cannot be directly applied. Recently, various techniques have also been proposed to extract information from semi-structured documents such as Web pages [5, 7]. However, the objective of ordinary information extraction is still quite different from the issue of relation extraction of product features and product feature values.

------------------------------------------------------------------------
# Two-stage HMM learning algorithm
Input:  A set of training examples collected from different Web pages;
        a Web page from which the product features and product feature
        values are to be extracted
Output: A set of extracted product features and product feature values
Algorithm:
Stage 1:
  1. Train an HMM with the structure shown in Figure 2, using
     Equations 1 and 2 to calculate the emission and transition
     probabilities.
  2. Extract the product features and product feature values from the
     Web page using the HMM trained at Stage 1.
Stage 2:
  3. Label the product features extracted at Stage 1 with both the
     product feature content state and the product feature context state.
  4. Label the product feature values extracted at Stage 1 with both the
     product feature value content state and the product feature value
     context state.
  5. Extend the HMM model structure to the one shown in Figure 3.
  6. Train the extended HMM:
     6.1 For the content states, calculate the emission and transition
         probabilities using Equations 1 and 2 respectively.
     6.2 For the context states, calculate the emission and transition
         probabilities using Equations 3 and 4 respectively.
  7. Extract the product features and product feature values from the
     Web page using the HMM trained at Stage 2.
  8. Return the extracted product features and product feature values.
------------------------------------------------------------------------

Figure 1. The outline of the two-stage HMM learning algorithm.
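The training steps of the algorithm in Figure 1 boil down to counting labeled tokens and smoothing the counts. The following is a minimal sketch of such add-one (Laplacian) estimation; the sequence format, the vocabulary-size denominator for emissions, and the use of the number of observed destination states as the transition denominator are illustrative assumptions rather than the paper's exact implementation.

```python
from collections import Counter, defaultdict

def train_hmm(labeled_seqs, vocab_size):
    """Estimate add-one (Laplacian) smoothed emission and transition
    probabilities from state-labeled token sequences.

    labeled_seqs: list of [(word, state), ...] sequences.
    """
    emit = defaultdict(Counter)   # emit[state][word] -> count
    trans = defaultdict(Counter)  # trans[state][next_state] -> count
    for seq in labeled_seqs:
        prev = "start"
        for word, state in seq:
            emit[state][word] += 1
            trans[prev][state] += 1
            prev = state
        trans[prev]["end"] += 1

    def emission_prob(word, state):
        total = sum(emit[state].values())
        # Add-one smoothing over an assumed vocabulary of vocab_size words.
        return (emit[state][word] + 1) / (total + vocab_size)

    def transition_prob(s, s2):
        total = sum(trans[s].values())
        # m: destination states observed for s (the paper uses the number
        # of possible destination states of the model structure).
        m = len(trans[s])
        return (trans[s][s2] + 1) / (total + m)

    return emission_prob, transition_prob
```

Because every count is incremented by one before normalizing, words and transitions never seen in training still receive a small non-zero probability.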

3. Product Feature Extraction

Our approach is a two-phase framework for mining and summarizing hot items in multiple auction Web sites. The objective of the first phase is to extract the product features and product feature values of the items. We employ Hidden Markov Models to achieve this task. The objective of the second phase is to mine and summarize hot items from multiple auction Web sites by making use of the extracted product features and feature values. It is formulated as a semi-supervised learning problem, and we tackle it with the graph mincuts algorithm.

Figure 1 depicts the outline of our HMM-based learning algorithm. The first stage corresponds to Steps 1 and 2 of the learning algorithm. At this stage, we consider the HMM model structure shown in Figure 2. We denote by P(w | s) the emission probability that a particular word w is generated from a particular state s. The transition probability is denoted by P(s_t | s_{t-1}), where t is the position of the token in the sequence. The Laplacian smoothing method is used to calculate the emission probability and transition probability as follows:

P(w | s) = (count of w observed at state s + 1) / (total count of words observed at state s + |V|)   (1)

P(s' | s) = (number of transitions from state s to state s' + 1) / (total number of transitions leaving state s + m)   (2)

where |V| denotes the vocabulary size and m is the number of possible destination states of s, as shown in Figure 2.

Figure 2. The HMM model structure used at stage one of our learning algorithm. The f_content, v_content, and n states denote the product feature content state, the product feature value content state, and the normal text state respectively. The start and end states are two special states representing the start and end of the token sequence.

The second stage corresponds to Steps 3 to 8 of the algorithm shown in Figure 1. Figure 3 depicts the extended HMM model. The newly added product feature context state can utilize the context information to extract unseen product features. Similar reasoning applies to the product feature value content state and the design of the product feature value context state. The emission and transition probabilities of the content states are calculated using Equations 1 and 2 in Step 6.1. Suppose there are L distinct layout formats in the Web page, and let l denote a particular layout format. Step 6.2 calculates the emission and transition probabilities of the context states as follows:

P(l | s) = (count of layout format l observed at state s + 1) / (total count of layout formats observed at state s + L)   (3)

P(s' | s) = (number of transitions from state s to state s' + 1) / (total number of transitions leaving state s + m)   (4)

The learned HMM model is then re-applied to the unseen Web page to extract the product features and product feature values which were not extracted in the first stage.
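Applying the trained HMM to an unseen page amounts to decoding each token sequence into its most likely state sequence. A standard Viterbi decoder can serve as a sketch of this step; the state names and the log-probability tables below are illustrative assumptions, not the paper's exact implementation.

```python
import math

def viterbi(tokens, states, log_emit, log_trans):
    """Return the most likely state sequence for a token sequence.

    log_emit[s][w] and log_trans[s][s2] hold log-probabilities; unseen
    words and transitions fall back to a small floor value.
    """
    FLOOR = math.log(1e-9)
    # best[s] = (log-prob of the best path ending in state s, that path)
    best = {s: (log_trans.get("start", {}).get(s, FLOOR)
                + log_emit.get(s, {}).get(tokens[0], FLOOR), [s])
            for s in states}
    for w in tokens[1:]:
        nxt = {}
        for s in states:
            # Extend the best predecessor path into state s.
            score, path = max(
                ((best[p][0] + log_trans.get(p, {}).get(s, FLOOR), best[p][1])
                 for p in states),
                key=lambda t: t[0])
            nxt[s] = (score + log_emit.get(s, {}).get(w, FLOOR), path + [s])
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]
```

Tokens decoded into the product feature content state and the product feature value content state would then be emitted as extracted feature/value pairs.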

4. Hot Item Mining and Summarization

We formulate hot item mining as a semi-supervised learning problem. The idea is that items with a high number of bids are automatically classified as hot items, while items with few or no bids are regarded as unlabeled data. We employ the graph mincuts algorithm to achieve this task [2]. Each item in the auction Web sites can be represented by the set of extracted product features and the associated product feature values.

Figure 3. The extended HMM model structure used at stage two of our learning algorithm. The f_context and v_context states denote the product feature context state and the product feature value context state. These two states model the layout format, such as the boldness of the tokens, the font size of the tokens, etc.

Suppose there are n distinct product features extracted. Let x_j(i) denote the j-th product feature value of an item i; x_j(i) equals either the extracted j-th product feature value, or NULL if the value for the j-th feature is not present in the underlying Web page. The similarity between two items i and i' is calculated as follows:

sim(i, i') = 1 + |{ j : x_j(i) = x_j(i'), and neither x_j(i) nor x_j(i') equals NULL }|   (5)

One is added so that the constructed graph is a complete graph, which prevents any isolation of vertices. The procedure for our hot item mining approach is as follows:

Step 1: Construct a graph in which each vertex represents an item and the weight of each edge represents the similarity between the corresponding items.
Step 2: Add two more vertices, called the hot vertex and the cold vertex, to the graph.
Step 3: Choose a fixed number of items with the highest number of bids as seed hot items. Add edges with infinite weight connecting the hot vertex to these items.
Step 4: Find a fixed number of items which have zero bids and the lowest average similarity to the seed hot items. These items are regarded as seed cold items. Add edges with infinite weight connecting the cold vertex to these items.
Step 5: Find a cut with the smallest total sum of weights. A cut is defined as the removal of some edges so that the graph is separated into two partitions, one containing the hot vertex and the other containing the cold vertex. The items in the partition containing the hot vertex are then considered hot items, while the others are considered cold items.
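The mining procedure above can be sketched compactly. In this sketch, each item is assumed to be a dictionary of feature values, and an Edmonds-Karp max-flow routine stands in for whichever mincut solver the cited method [2] actually uses; all identifiers are illustrative.

```python
from collections import deque

INF = float("inf")

def similarity(a, b):
    """Equation 5 style: 1 + number of shared non-null feature values."""
    return 1 + sum(1 for f in a if f in b and a[f] == b[f] and a[f] is not None)

def mine_hot_items(items, hot_seeds, cold_seeds):
    """items: dict id -> {feature: value}. Returns the set of ids on the
    hot side of a minimum s-t cut, computed via Edmonds-Karp max flow."""
    ids = list(items)
    S, T = "__hot__", "__cold__"
    cap = {v: {} for v in ids + [S, T]}
    # Step 1: complete item graph weighted by similarity; both directions,
    # so the undirected min cut equals the directed max flow.
    for i, u in enumerate(ids):
        for v in ids[i + 1:]:
            w = similarity(items[u], items[v])
            cap[u][v] = cap[v][u] = w
    # Steps 2-4: tie seed items to the hot/cold terminals with infinite weight.
    for h in hot_seeds:
        cap[S][h] = INF
    for c in cold_seeds:
        cap[c][T] = INF

    def bfs_augmenting_path():
        parent = {S: None}
        queue = deque([S])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if v not in parent and c > 0:
                    parent[v] = u
                    if v == T:
                        return parent
                    queue.append(v)
        return None

    # Step 5: push flow until no augmenting path remains; the residual
    # graph then encodes the minimum cut.
    while True:
        parent = bfs_augmenting_path()
        if parent is None:
            break
        path, v = [], T
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= bottleneck
            cap[v][u] = cap[v].get(u, 0) + bottleneck
    # Vertices still reachable from the hot terminal form the hot partition.
    reachable, queue = {S}, deque([S])
    while queue:
        u = queue.popleft()
        for v, c in cap[u].items():
            if c > 0 and v not in reachable:
                reachable.add(v)
                queue.append(v)
    return {v for v in reachable if v in items}
```

With one seed hot item and one seed cold item, the cut naturally pulls unlabeled items toward whichever side they share more feature values with, which is exactly the intuition behind Steps 3-5.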

After mining the hot items from multiple auction Web sites, a summary is generated from the discovered hot items. This summary reports the frequent and important product features and product feature values of the hot items. Suppose there are n distinct product features, each with a number of distinct product feature values. In the Web page containing an item, let pos_j be the order in which the j-th product feature and its associated product feature value are extracted. For example, pos_j = 1 and pos_k = 2 if they are the first and the second extracted product feature and product feature value, and pos_j = 0 if the j-th feature is not found in the Web page. For each distinct product feature value v of the j-th product feature, we define freq_j(v) as the number of items containing the j-th product feature with v as the associated product feature value. The importance of a pair of product feature and product feature value is then defined in terms of this frequency and this position (Equation 6). The top-ranked product features and the associated product feature values constitute the summary of the hot items.

              Digital Camera               MP3 Player
           Precision     Recall        Precision     Recall
ebay      0.75 (0.42)  0.57 (0.08)   0.68 (0.31)  0.55 (0.12)
yahoo     0.86 (0.53)  0.93 (0.09)   0.73 (0.23)  0.83 (0.11)
ubid      0.82 (0.97)  0.68 (0.15)   0.81 (0.54)  0.77 (0.12)
Average   0.81 (0.64)  0.73 (0.11)   0.76 (0.36)  0.71 (0.12)

Table 1. The experimental results of our approach to extracting product features and product feature values. The figures outside the brackets are the average extraction performance of applying the HMM model re-trained in the second stage of our learning algorithm to the Web site shown in the first column. The figures inside the brackets are the extraction performance of the traditional HMM model obtained in the first stage, without retraining in the second stage.
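The ranking step can be sketched as follows. The exact importance formula of Equation 6 is not reproducible from the text, so as an illustrative assumption this sketch weights each pair's frequency by the inverse of its average extraction position, so that frequent, early-mentioned pairs rank highest.

```python
from collections import defaultdict

def summarize(hot_items, top_k=100):
    """Rank (feature, value) pairs of the hot items by frequency and position.

    hot_items: one list per item of (feature, value) pairs, in the order
    they were extracted from the page. The importance formula here
    (frequency / average position) is an assumption, not Equation 6 itself.
    """
    freq = defaultdict(int)      # how many items mention the pair
    pos_sum = defaultdict(int)   # sum of extraction positions of the pair
    for pairs in hot_items:
        for pos, (feat, val) in enumerate(pairs, start=1):
            freq[(feat, val)] += 1
            pos_sum[(feat, val)] += pos

    def importance(pair):
        avg_pos = pos_sum[pair] / freq[pair]
        return freq[pair] / avg_pos

    return sorted(freq, key=importance, reverse=True)[:top_k]
```

Any monotone combination of freq and pos could be substituted for the importance function without changing the surrounding pipeline.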

5. Experimental Results

We conducted extensive experiments on three real-world auction Web sites in two domains, namely the digital camera domain and the MP3 player domain, to demonstrate the effectiveness of our framework. The three auction Web sites are www.ebay.com, auctions.yahoo.com, and www.ubid.com. In each domain, we collected 50 Web pages from each of the auction sites for the evaluation. Each Web page contains an item listed for bidding whose bidding period ends within an hour. We conducted three sets of experiments to evaluate our approaches to product feature and product feature value extraction, hot item mining, and hot item summarization. We manually annotated the product features and product feature values in the Web pages; these annotations served as the gold standard in our evaluation. In each domain, we randomly chose 5 pages from each of the Web sites (a total of 15 Web pages) to produce the set of training examples for training the HMM model as described in Section 3. The trained HMM model was then applied to the remaining Web pages to extract the product features and product feature values for testing. Table 1 depicts the extraction performance of our approach. It illustrates that our approach can leverage the valuable context information in extracting the product features and product feature values.


Product Feature          Product Feature Value
resolution               3.2 Megapixel
camera type              point & shot
condition                new
exposure compensation    +/-2.0ev in 0.5ev step increments
white balance            auto, daylight, tungsten, fluorescent
focal length             5.1 - 15.3 millimeters
self-timer               10 seconds
flash modes              auto / forced on / forced off / slow synchro

Table 2. Some of the important product features and product feature values discovered in the digital camera domain using our approach.

To evaluate our hot item mining approach, we first identify the items having more than five bids in each domain before the end of the bidding period. They are regarded as the true hot items in this experiment. Next, we apply our hot item mining method described in Section 4 to discover hot items. The mined hot items are then compared with the true hot items. Coverage, defined as the portion of the true hot items that are discovered by the system, is adopted as the evaluation metric. In our mining method, we need to fix the number of seed hot items and seed cold items chosen in Steps 3 and 4 respectively; we set both to 5. Our approach achieves very satisfactory results: the coverage in the digital camera domain and the MP3 player domain is 1.00 and 0.97 respectively.

A summary containing the top important product features and product feature values of the mined hot items is then generated according to Section 4. We fix the summary size to 100 pairs, and each of the reported product features and product feature values is evaluated manually to determine its importance. We find that over 70% of the reported product features and product feature values are useful. Table 2 shows some of the important product features and product feature values in the digital camera domain.
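The coverage metric used above is a simple set overlap; a minimal sketch (identifiers illustrative):

```python
def coverage(mined_hot, true_hot):
    """Portion of the true hot items that the system discovers."""
    true_hot = set(true_hot)
    return len(true_hot & set(mined_hot)) / len(true_hot)
```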

References

[1] E. Agichtein and L. Gravano. Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth International Conference on Digital Libraries, pages 85-95, 2000.
[2] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML-2001), pages 19-26, 2001.
[3] R. Ghani. Price prediction and insurance for online auctions. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD-2005), to appear, 2005.
[4] M. Hu and B. Liu. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD-2004), pages 168-177, 2004.
[5] N. Kushmerick and B. Thomas. Adaptive information extraction: Core technologies for information agents. In Intelligent Information Agents R&D in Europe: An AgentLink Perspective, pages 79-103, 2002.
[6] I. Mani and M. Maybury. Advances in Automatic Text Summarization. MIT Press, Cambridge, MA, 1999.
[7] T. L. Wong and W. Lam. A probabilistic approach for adapting information extraction wrappers and discovering new attributes. In Proceedings of the 2004 IEEE International Conference on Data Mining (ICDM-2004), pages 257-264, 2004.
[8] J. Yi and W. Niblack. Sentiment mining in WebFountain. In Proceedings of the 21st International Conference on Data Engineering (ICDE-2005), 2005.
[9] D. Zelenko, C. Aone, and A. Richardella. Kernel methods for relation extraction. Journal of Machine Learning Research, 3:1083-1106, 2003.