A Shopping Agent That Automatically Constructs Wrappers for Semi-Structured Online Vendors

Jaeyoung Yang1, Eunseok Lee2, and Joongmin Choi1

1 Department of Computer Science and Engineering, Hanyang University, 1271 Sa-1-dong, Ansan, Kyunggi-do 425-791, Korea
{jyyang, [email protected]

2 School of Electrical and Computer Engineering, Sungkyunkwan University, 300 Chunchun-dong, Suwon, Kyunggi-do 440-746, Korea

Abstract. This paper proposes a shopping agent with a robust inductive learning method that automatically constructs wrappers for semi-structured online stores. Strong biases assumed in many existing systems are weakened so that the real stores with reasonably complex document structures can be handled. Our method treats a logical line as a basic unit, and recognizes the position and the structure of product descriptions by finding the most frequent pattern from the sequence of logical line information in output HTML pages. This method is capable of analyzing product descriptions that comprise multiple logical lines, and even those with extra or missing attributes. Experimental tests on over 60 sites show that it successfully constructs correct wrappers for most real stores.

1 Introduction

A shopping agent is a mediator system that extracts product descriptions from several online stores on a user's behalf. Since the stores are heterogeneous, a wrapper, that is, a procedure for extracting the content of a particular information source, must be built and maintained for each store. A wrapper generally consists of a set of extraction rules and the code to apply those rules [5]. In some systems such as TSIMMIS [4] and ARANEUS [2], the extraction rules for the wrapper are written by humans. Wrapper induction [5] has been suggested as a way to build the wrapper automatically by learning from a set of the resource's sample pages. However, most previous systems were unable to cover many real stores since they relied on strong biases that impose too many restrictions on the structure of the documents that can be analyzed. For example, ShopBot [3] assumes that product descriptions reside on a single line, and HLRT [5] cannot handle noise such as missing attributes. The STALKER [6] algorithm deals with missing or out-of-order items, but it is not fully automatic in the sense that users need to be involved in the preparation of training examples. ARIADNE [1] is a semi-automatic wrapper generation system, but its power of automatic wrapper learning is limited since its heuristics are obtained mainly from the users rather than through learning.

In this paper, we propose a shopping agent with a simple but robust inductive learning method that automatically constructs wrappers for semi-structured online stores. Strong biases that have been assumed in many systems are weakened so that real-world stores can be handled. Product descriptions may comprise multiple logical lines and may have extra or missing attributes. Our method treats a logical line as a basic unit, and assigns a category to each logical line. The HTML page of a product search result is converted into a sequence of logical line categories. The main idea of our wrapper learning is to recognize the position and the structure of product descriptions by finding the most frequent pattern containing the price information. This pattern is regarded as the extraction rule of the wrapper.

2 Overview of Comparison Shopping Agent

Our wrapper learning method is implemented in a prototype comparison shopping agent called MORPHEUS. The overall architecture of MORPHEUS is shown in Fig. 1. It consists of several modules including the wrapper generator, the wrapper interpreter, and the uniform output generator.

Fig. 1. The overall architecture of MORPHEUS

The wrapper generator is the main learning module that constructs a wrapper for each store. It learns two things. First, it learns how to query a particular store by recognizing its query scheme: an HTML page containing a searchable input box is analyzed and a query template is generated. Second, it learns how to extract a store's content: product descriptions in the store's search result pages are recognized and their repeating pattern is determined.

The wrapper interpreter is the module that executes learned wrappers to get the current product information. It forms the actual queries by combining each store's query template with the keywords that the user typed in, and sends them to the corresponding store sites. The search results from the stores are then collected and fed to the uniform output generator, which integrates the search results from several stores and generates a uniform output.
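The way the wrapper interpreter combines a query template with user keywords can be illustrated with a minimal Python sketch. The template layout (`action_url`, `keyword_param`, `fixed_params`), the function name `build_query_url`, and the example store URLs are all assumptions made for illustration; the paper does not specify how MORPHEUS represents a query template.

```python
from urllib.parse import urlencode

def build_query_url(template: dict, keywords: str) -> str:
    """Fill a (hypothetical) learned query template with the user's keywords."""
    params = dict(template["fixed_params"])       # parameters the form always sends
    params[template["keyword_param"]] = keywords  # the searchable input box
    return template["action_url"] + "?" + urlencode(params)

# Hypothetical templates for two stores; names and URLs are illustrative only.
templates = {
    "store-a": {
        "action_url": "http://store-a.example/search",
        "keyword_param": "q",
        "fixed_params": {},
    },
    "store-b": {
        "action_url": "http://store-b.example/find.cgi",
        "keyword_param": "keyword",
        "fixed_params": {"category": "books"},
    },
}

# One actual query per store, all built from the same user keywords.
queries = {name: build_query_url(t, "java programming")
           for name, t in templates.items()}
```

Keeping the template as data rather than code means that learning a new store only adds an entry, which matches the role the text gives to the wrapper generator.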

3 Learning Wrappers for Online Stores

One key function of the wrapper generator is to learn the format of product descriptions in result pages from successful searches. Each page contains one or more product descriptions that matched the sample query. A product description is composed of a sequence of items that describe the attributes of the product. For example, a bookstore displays search results in which the attributes of a product include the book title, the author, the price, and/or the reader's review. Wrapper learning has to find the starting and ending positions of the list of product descriptions in the entire result page, and to recognize the pattern of a product description. To do this, our method is divided into three phases.

In the first phase, the HTML source of the page is broken down into logical lines. A logical line is conceptually similar to a line that the user sees in the browser, so the algorithm recognizes each logical line by examining HTML delimiter tags.

The second phase of the algorithm categorizes each logical line and assigns it the corresponding category number. Currently, we maintain 5 categories, TEXT, PRICE, LTAG, TITLE, and TTAG, whose category numbers are 0, 1, 2, 3, and 8, respectively. Here, TITLE denotes the product name, PRICE denotes the price, TTAG denotes table tags, LTAG denotes the HTML tags other than TTAG used in logical line breaking, and TEXT denotes a general string that is not recognizable as one of the above four categories. We use simple heuristics for this category assignment. For example, TITLE is assigned to a logical line when the line contains one of the keywords in the sample query, and PRICE is assigned by recognizing a dollar sign $ (or some other symbol that represents the price unit) and a digit. Fig. 2 shows the HTML source of a product description obtained from the Amazon bookstore by the query "Korea", along with the category numbers assigned to its logical lines.

    Advances in Cryptology-Asiacrypt '96 : International Conference on the Theory and Applications of Cryptology and Information Security Kyongju, Korea,)
    by K. Kim(Editor), Tsutomu Matsumoto (Editor). Paperback (November 1996)
    Our Price:$73.95

    Category numbers of the logical lines: 3 2 0 8 8 8 8 1

Fig. 2. An HTML source for a book and the categories for logical lines

After the categorization phase, the entire page can be expressed as a sequence of category numbers. The third phase of our algorithm then finds a repeating pattern in this sequence. It first finds the pattern of each product description unit (PDU) and counts the frequency of each distinct pattern to get the most frequent one. The next candidate PDU is found by searching for PRICE first and then backtracking in the sequence to search for TITLE, even though the TITLE attribute is generally assumed to appear before the PRICE attribute in a PDU. This is because the heuristics for recognizing PRICE are more reliable than those for recognizing TITLE. The subsequence of logical lines from TITLE through PRICE becomes the resulting pattern of a PDU. Pseudocode for this algorithm is given below.

    seq ← the input sequence of logical line categories;
    seqStart ← 1;         /* initial position for pattern search */
    numCandPDUs ← 0;      /* number of distinct PDU patterns */

    while (true) {
        /* find the next candidate PDU */
        priceIndex ← findIndex(seq, seqStart, PRICE);
        titleIndex ← findIndexReverse(seq, priceIndex, TITLE);
        currentPDU ← substring(seq, titleIndex, priceIndex);
        if (currentPDU == NULL) then
            exit the while loop;                 /* no more PDUs */
        if (currentPDU is already stored in candPDUs array) then
            the frequency count of currentPDU is incremented by 1;
        else {
            save currentPDU in candPDUs array;
            increment numCandPDUs by 1;          /* a new PDU pattern */
        }
        seqStart ← priceIndex+1;                 /* starting point for searching next PDU */
    }

    mostFreqPDU ← the element of candPDUs array with maximum frequency;
    return(mostFreqPDU);
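The second and third phases can be sketched in runnable Python as follows. This is an illustration under stated assumptions, not the MORPHEUS implementation: the tag sets and regular expressions in `categorize`, the choice to test PRICE before TITLE, the bounded backtracking, and the decision to skip a PRICE that has no preceding TITLE (the pseudocode exits the loop instead) are all choices made here. The `anchor` and `start` parameters are likewise hypothetical names.

```python
import re
from collections import Counter
from typing import List, Optional

# Category numbers as given in the text.
TEXT, PRICE, LTAG, TITLE, TTAG = 0, 1, 2, 3, 8

# Assumed heuristics; the paper does not give its exact tag lists or patterns.
TABLE_TAG = re.compile(r"^</?(table|tr|td|th)\b", re.I)
ANY_TAG = re.compile(r"^<[^>]+>$")
PRICE_RE = re.compile(r"\$\s*\d")  # a dollar sign followed by a digit

def categorize(line: str, query_keywords: List[str]) -> int:
    """Assign a category number to one logical line (phase 2)."""
    s = line.strip()
    if TABLE_TAG.match(s):
        return TTAG
    if ANY_TAG.match(s):
        return LTAG
    if PRICE_RE.search(s):          # checked before TITLE: the more reliable cue
        return PRICE
    if any(k.lower() in s.lower() for k in query_keywords):
        return TITLE
    return TEXT

def most_frequent_pdu(seq: List[int], anchor: int = PRICE,
                      start: int = TITLE) -> Optional[str]:
    """Return the most frequent TITLE..PRICE pattern as a digit string (phase 3)."""
    counts: Counter = Counter()
    pos = 0
    while True:
        try:
            price_idx = seq.index(anchor, pos)   # search forward for PRICE
        except ValueError:
            break                                # no more PDUs
        title_idx = None
        # Backtrack for TITLE, but not past the end of the previous PDU.
        for i in range(price_idx - 1, pos - 1, -1):
            if seq[i] == start:
                title_idx = i
                break
        if title_idx is not None:
            pattern = "".join(str(c) for c in seq[title_idx:price_idx + 1])
            counts[pattern] += 1
        pos = price_idx + 1                      # continue after this PRICE
    return counts.most_common(1)[0][0] if counts else None

# Two PDUs shaped like the Amazon example, one shorter PDU, plus header/footer noise.
sample = [0, 3, 2, 0, 8, 8, 8, 8, 1, 8, 3, 2, 0, 8, 8, 8, 8, 1, 3, 2, 1, 0]
print(most_frequent_pdu(sample))  # 32088881
```

Because every category number is a single digit, joining them into a string gives an unambiguous pattern key that can be counted directly.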

For Amazon, the learned PDU pattern is 32088881, as shown in Fig. 2. In the shopping stage, the wrapper interpreter module applies the learned PDU pattern to normalize noisy PDUs whose attributes differ from it, by ignoring extra attributes or putting dummy values in place of missing attributes.
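The text does not say how noisy PDUs are matched against the learned pattern. One possible reading is a greedy left-to-right alignment, sketched below in Python. The function name `align_to_pattern`, the `(category, text)` pair representation, and the `"N/A"` dummy value are assumptions made for this sketch only.

```python
from typing import List, Tuple

def align_to_pattern(pattern: str, pdu: List[Tuple[int, str]],
                     dummy: str = "N/A") -> List[str]:
    """Greedily align a noisy PDU to the learned pattern.

    pattern: learned PDU pattern as a digit string, e.g. "3201".
    pdu:     logical lines of one PDU as (category, text) pairs.
    Returns one text value per pattern slot.
    """
    values = []
    i = 0  # next unread line of the noisy PDU
    for digit in pattern:
        wanted = int(digit)
        j = i
        while j < len(pdu) and pdu[j][0] != wanted:
            j += 1                      # candidate extra lines
        if j < len(pdu):
            values.append(pdu[j][1])    # slot filled; skipped lines are ignored
            i = j + 1
        else:
            values.append(dummy)        # slot missing in this PDU
    return values

# A PDU with an extra TEXT line, and one with the TEXT line missing.
extra = [(3, "Java Book"), (0, "extra blurb"), (2, "<br>"),
         (0, "by A. Author"), (1, "$10.00")]
missing = [(3, "Java Book"), (2, "<br>"), (1, "$10.00")]
aligned_extra = align_to_pattern("3201", extra)
aligned_missing = align_to_pattern("3201", missing)
```

A greedy scan is simple but can misalign PDUs in which a category repeats; a full sequence alignment would be more robust at the cost of extra complexity.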

4 Experimental Results

We implemented MORPHEUS and built a Web interface, shown in Fig. 3(a), through which the user can select a store to be learned. Learned stores are added to the store list from which comparison shopping can be done. To evaluate the performance, we tested MORPHEUS on 62 real online stores to see whether correct wrappers can be generated. We assume that a proper test query is given in the learning phase so that an output page with reasonably many matched products is produced. To verify whether the correct wrapper is generated, the result of wrapper learning is displayed as in Fig. 3(b). This display shows the learned PDU pattern along with the product names and their corresponding prices. If this data is consistent with the data obtained by directly accessing the store site, the wrapper is regarded as correct.

Fig. 3. MORPHEUS Interfaces: (a) Main interface, (b) Learned result

Table 1 shows the test data for some of the 62 sites that have been tested. During the test, we collected relevant information for each output page of a site, such as the test query used in learning, the learned PDU pattern, the number of PDUs, and the number of MF-PDUs (most frequent PDUs). In this experiment, the proposed wrapper generation algorithm worked satisfactorily, succeeding for 58 of the 62 stores. A few sites such as www.dsports.com failed to yield a PDU pattern because their output pages contain unnecessary product information in the header.

Table 1. Experiment data during wrapper generation for some of the 62 sites

Store URL                Test query  PDU pattern              No. of total PDUs  No. of MF-PDUs
www.more.com             gift        388088088088121          7                  7
www.jewelryweb.com       ring        32022228880881888808821  18                 18
www.softwarebuyLine.com  school      320808020808088081       40                 29
www.1cache.com           video       32088021                 10                 10
www.etronixs.com         video       32088021                 10                 10
www.egghead.com          compaq      38080881                 44                 36
www.bookbay.com          java        3202021                  17                 17
intertain.com            java        321                      133                133
www.more.com             gift        388088088088121          7                  7
www.amazon.com           java        32088881                 50                 26
...                      ...         ...                      ...                ...

5 Discussion and Future Work

We have developed a robust method for automatic wrapper generation in the domain of comparison shopping, and the test results have shown that it successfully constructs correct wrappers for most real stores. The characteristics of our method in comparison to previous research are summarized as follows. First, the strong biases assumed in many existing systems are weakened so that real stores with reasonably complex document structures can be handled. Second, we do not exploit domain knowledge. This makes the learning algorithm simple and domain independent, and it still works satisfactorily. Third, learning in MORPHEUS is fast since it does not incorporate a separate module for removing redundant fragments such as the header, tail, and advertisements.

There are also some limitations in our current system. First, we have assumed that a proper keyword is given for the test query by humans. Heuristics for providing a proper test query automatically should be investigated. Second, each product description must contain the price attribute. We think that this is not a severe restriction since most stores that produce semi-structured product information include the price attribute, with only a few exceptions. Nonetheless, this restriction may reduce the generality of the algorithm since it cannot be applied to other domains that do not require price information. One solution might be to specify the feature attribute that must exist in a product description as a parameter to the algorithm, rather than hard-coding it in the program. Third, we only extract the price information from a product description that may contain several other attributes. Extracting non-price information by exploiting proper domain knowledge (or an ontology) is in progress.

References

1. Ambite, J., Ashish, N., Barish, G., Knoblock, C., Minton, S., Modi, P., Muslea, I., Philpot, A., Tejada, S.: ARIADNE: A System for Constructing Mediators for Internet Sources. ACM SIGMOD International Conference on Management of Data (1998) 561–563
2. Atzeni, P., Mecca, G., Merialdo, P.: Semi-structured and Structured Data in the Web: Going Back and Forth. ACM SIGMOD Workshop on Management of Semistructured Data (1997) 1–9
3. Doorenbos, R., Etzioni, O., Weld, D.: A Scalable Comparison-Shopping Agent for the World Wide Web. First International Conference on Autonomous Agents (1997) 39–48
4. Hammer, J., Garcia-Molina, H., Nestorov, S., Yerneni, R., Breunig, M., Vassalos, V.: Template-based wrappers in the TSIMMIS system. ACM SIGMOD International Conference on Management of Data (1997) 532–535
5. Kushmerick, N., Weld, D., Doorenbos, R.: Wrapper Induction for Information Extraction. International Joint Conference on Artificial Intelligence (1997) 729–735
6. Muslea, I., Minton, S., Knoblock, C.: A Hierarchical Approach to Wrapper Induction. Third International Conference on Autonomous Agents (1999) 190–197
