Fig. 2. An HTML source for a book and the categories for logical lines (figure not reproduced here; the category sequence shown is 3 2 0 8 8 8 8 1, and the last logical line is "Our Price:$73.95").

After the categorization phase, the entire page can be expressed as a sequence of category numbers. The third phase of our algorithm then finds a repeating pattern in this sequence. It first finds the pattern of each product description unit (PDU) and counts the frequency of each distinct pattern to get the most frequent one. To find the next candidate PDU, the algorithm first searches forward for PRICE and then backtracks in the sequence to search for TITLE, even though the TITLE attribute is generally assumed to appear before the PRICE attribute in a PDU. This is because the heuristics for recognizing PRICE are more reliable than the heuristics for recognizing TITLE. The subsequence of logical lines between TITLE and PRICE becomes the resulting pattern of a PDU. Pseudocode for this algorithm is given below.

    seq ← the input sequence of logical line categories;
    seqStart ← 1;          /* initial position for pattern search */
    numCandPDUs ← 0;       /* number of distinct PDU patterns */
    while (true) {
        /* find the next candidate PDU */
        priceIndex ← findIndex(seq, seqStart, PRICE);
        titleIndex ← findIndexReverse(seq, priceIndex, TITLE);
        currentPDU ← substring(seq, titleIndex, priceIndex);
        if (currentPDU == NULL) then exit the while loop;    /* no more PDUs */
        if (currentPDU is already stored in candPDUs array) then
            the frequency count of currentPDU is incremented by 1;
        else {
            save currentPDU in candPDUs array;
            increment numCandPDUs by 1;                      /* a new PDU pattern */
        }
        seqStart ← priceIndex + 1;    /* starting point for searching next PDU */
    }
    mostFreqPDU ← the element of candPDUs array with maximum frequency;
    return(mostFreqPDU);

For Amazon, the learned PDU pattern is 32088881, as shown in Fig. 2. In the shopping stage, the wrapper interpreter module applies the learned PDU pattern to normalize noisy PDUs whose attributes differ from the pattern, ignoring extra attributes and inserting dummy values for missing ones.
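The pattern search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MORPHEUS implementation: the category codes TITLE = 3 and PRICE = 1 are inferred from the learned patterns (which begin with 3 and end with 1), the pattern is taken to include both the TITLE and the PRICE line, the backward search is bounded so that it never crosses into the previous PDU, and a PRICE with no preceding TITLE is skipped rather than ending the search. The function names are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

# Assumed category codes (inferred from the learned patterns, not confirmed).
TITLE = 3
PRICE = 1


def find_index(seq: Sequence[int], start: int, category: int) -> Optional[int]:
    """Return the first index >= start holding `category`, or None."""
    for i in range(start, len(seq)):
        if seq[i] == category:
            return i
    return None


def find_index_reverse(seq: Sequence[int], start: int, category: int,
                       lower_bound: int) -> Optional[int]:
    """Scan backwards from `start` down to `lower_bound`; return index or None."""
    for i in range(start, lower_bound - 1, -1):
        if seq[i] == category:
            return i
    return None


def most_frequent_pdu(seq: Sequence[int], title: int = TITLE,
                      price: int = PRICE) -> Optional[Tuple[int, ...]]:
    """Return the most frequent TITLE..PRICE subsequence, or None if none exists."""
    counts: Counter = Counter()
    seq_start = 0  # 0-based, unlike the 1-based pseudocode
    while True:
        # Search forward for PRICE first, since it is recognized more reliably.
        price_index = find_index(seq, seq_start, price)
        if price_index is None:
            break  # no more PDUs
        # Backtrack to the nearest TITLE, staying inside the unscanned region.
        title_index = find_index_reverse(seq, price_index, title, seq_start)
        if title_index is not None:
            # Assumption: the pattern includes both end points.
            counts[tuple(seq[title_index:price_index + 1])] += 1
        seq_start = price_index + 1  # continue after this PRICE
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Made-up page: header noise, two PDUs like Fig. 2, one shorter PDU, footer.
example = [8, 0,
           3, 2, 0, 8, 8, 8, 8, 1,
           0,
           3, 2, 0, 8, 8, 8, 8, 1,
           3, 2, 0, 8, 1,
           8]
pattern = most_frequent_pdu(example)
print("".join(map(str, pattern)))  # prints 32088881
```

Passing the anchor categories as parameters, rather than hard-coding them, keeps the sketch usable for sequences whose required attribute is something other than the price.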
4 Experimental Results

We implemented MORPHEUS and built a Web interface, shown in Fig. 3(a), so that the user can select a store to be learned. Learned stores are added to the store list from which comparison shopping can be done. To evaluate the performance, we tested MORPHEUS on 62 real online stores to check whether correct wrappers can be generated. We assume that a proper test query is given in the learning phase so that an output page with reasonably many matched products is produced. To verify whether the correct wrapper is generated, the result of wrapper learning is displayed as in Fig. 3(b). In this display, the learned PDU pattern is shown along with the product names and their corresponding prices. If this data is consistent with the data obtained by directly accessing the store site, the generated wrapper is regarded as correct.

Fig. 3. MORPHEUS interfaces: (a) main interface, (b) learned result.

Table 1 shows the test data for some of the 62 sites that were tested. During the test, we collected relevant information for each output page of a site, such as the test query used in learning, the learned PDU pattern, the number of PDUs, and the number of MF-PDUs (most frequent PDUs). In this experiment, the proposed wrapper generation algorithm works satisfactorily, succeeding for 58 of the 62 stores. For a few sites, such as www.dsports.com, the algorithm failed to obtain a PDU pattern because the output page contains some unnecessary product information in its header.

Table 1. Experiment data during wrapper generation for some of the 62 sites

    Store URL                 Test query   PDU pattern               No. of total PDUs   No. of MF-PDUs
    www.more.com              gift         388088088088121           7                   7
    www.jewelryweb.com        ring         32022228880881888808821   18                  18
    www.softwarebuyLine.com   school       320808020808088081        40                  29
    www.1cache.com            video        32088021                  10                  10
    www.etronixs.com          video        32088021                  10                  10
    www.egghead.com           compaq       38080881                  44                  36
    www.bookbay.com           java         3202021                   17                  17
    intertain.com             java         321                       133                 133
    www.more.com              gift         388088088088121           7                   7
    www.amazon.com            java         32088881                  50                  26
    ...                       ...          ...                       ...                 ...

5 Discussion and Future Work

We have developed a robust method for automatic wrapper generation in the domain of comparison shopping, and the test results have shown that it successfully constructs correct wrappers for most real stores. The characteristics of our method in comparison with previous research are summarized as follows. First, the strong biases assumed in many existing systems are weakened so that real stores with reasonably complex document structures can be handled. Second, we do not exploit domain knowledge. This makes the learning algorithm simple and domain independent, and it still works satisfactorily. Third, learning in MORPHEUS is fast since it does not incorporate a separate module for removing redundant fragments such as the header, tail, and advertisements.

There are also some limitations in our current system. First, we have assumed that a proper keyword is given for the test query by humans. Heuristics for providing a proper test query automatically should be investigated. Second, each product description must contain the price attribute. We think that this is not a severe restriction, since most stores that produce semi-structured product information contain the price attribute, with only a few exceptions. Nonetheless, this restriction may reduce the generality of the algorithm, since it cannot be applied to other domains that do not require the price information. One solution might be to specify the feature attribute that must exist in a product description as a parameter to the algorithm, rather than hard-coding it in the program. Third, we only extract the price information from a product description that may contain several other attributes.
Extracting non-price information by exploiting proper domain knowledge (or the ontology) is in progress.