Discovering Objects in Dynamically-Generated Web Pages

James Caverlee, David Buttler, and Ling Liu

Georgia Institute of Technology
College of Computing
Atlanta, GA 30332, U.S.A.

ABSTRACT As the web grows, more and more content is being hidden from the reach of traditional search engines. In this paper, we present THOR, a scalable and efficient tool to mine objects from this hidden web. With precision and recall over 90%, THOR automatically extracts objects of interest from dynamically-generated web pages. Customized object-identification algorithms are then applied to locate the "interesting" objects in each page. We show that dynamically-generated pages tend to be a homogeneous subset of pages found on the Web, and that these pages may be separated into distinct clusters of structurally-similar pages. Using this homogeneity across clusters along with traditional information retrieval techniques, we propose a two-phase clustering scheme consisting of a page clustering algorithm and a fragment clustering algorithm. Using this scheme, we can identify object-rich fragments of each page with an average of over 90% precision and over 95% recall.

Categories and Subject Descriptors H.3.3 [Information Systems]: Information Search and Retrieval—clustering, search process, selection process

General Terms Algorithms, Experimentation

Keywords information retrieval, clustering, web structure, hidden web

1. INTRODUCTION

The exponential growth of dynamic web pages coupled with the inability of most modern search engines to search dynamic content has created a situation in which more and more content is hidden. Accessing, categorizing, and making this information available in the same way current search engines make static web pages available is a challenge to


current search technology. By search engine, we mean the composite of several underlying technologies, including a crawler to discover information, a page analyzer to categorize and index information, and a search front-end to receive queries from the user and deliver relevant results. To unlock the Web's hidden content, a search engine over dynamic content will require at least three features: an efficient means of generating web pages with dynamic content, a scalable architecture, and a robust method of analyzing and indexing dynamic web pages. The emphasis of this paper is on the third point, though the system design considers all of the points mentioned. Our system, called THOR, focuses on analyzing dynamically-generated web pages to identify the "important" portions of each page and to extract the relevant objects in each "important" portion. By "important" portions, we mean the portions of the page that are relevant to the query that generated the page in the first place. Consider that many dynamically-generated web pages tend to contain content irrelevant to the particular search query. For example, a search for "Harry Potter" on Amazon.com yields a results page loaded with content unrelated to Harry Potter, like the standard navigation bar, standard links to partner websites, and other boilerplate. The good news is that the results page does include a series of objects related to the original search, in this case a list of titles of Harry Potter books along with relevant links, prices, and reviews. It is our contention that the relevant information to be indexed by a search engine is not the page in its entirety, but these query-relevant object-rich regions. A system to discover the important objects in a dynamically-generated page must contend with at least two major challenges. First, to automatically extract the object-rich regions of a page, a dynamic search system should be independent of any particular manual (or even semi-automated) encoding of the structure of a page and of any particular representation schema. For obvious reasons, it is both inefficient and impractical to tie the performance of a search engine over billions of pages to the manual encoding of potentially millions of sites. Second, a dynamic search system should be robust enough to handle the inherent heterogeneity of the Web. Web pages tend to differ not only in content and links, but also in tag structure and in the underlying technologies that enable the pages (be it HTML, XML, JavaScript, or other web technologies). The Web continues to grow along all of these axes of variability, resulting in ever-increasing complexity. Hoping for a "silver bullet" to bring order to the dynamic web (through XML, for example) is naive and impractical.

While some technologies may bring relative order to some segments of the Web, it is reasonable to expect that the vast majority of the Web will continue to be a patchwork of heterogeneous pages. In spite of this inherent heterogeneity, the Web, when considered piecemeal, does display significant homogeneity. For example, a series of different keyword searches on Amazon.com will result in a series of results pages, each displaying different query-related content, but each with a similar content layout and a similar page presentation structure. The navigation bar, advertisements, and other boilerplate often co-occur across all of the pages. This homogeneity in layout and structure may be leveraged to automatically and efficiently extract objects from dynamically-generated web pages. THOR is designed to provide a high level of precision and recall and to operate independently of hand-tuned solutions for specific sites. It is successful at identifying the object-rich portions of pages by leveraging the relationships among similar pages through the structural clustering of HTML tag trees and by exploiting the similarity among the content layouts of similar pages within clusters. Traditional information retrieval methods are employed to explore the two-level clustering of web pages by their tag structure similarity and content layout similarity. The novelty of THOR in page content analysis and dynamic object identification can be summarized as follows:
• THOR considers each page not in isolation nor solely by a page's text and links, but in the context of all the similarly structured pages from a particular web site;
• it is flexible enough to process large numbers of web sites, yet specialized enough to generate page-specific information that is helpful in recognizing and parsing the important objects;
• THOR can automatically cluster structurally similar pages with a respectable success rate; and
• it is capable of identifying the relevant objects in a page, not just indexing the entire content of the page.
In the following sections, we present the design ideas and technical details behind THOR, beginning in Section 2 with a description of the fundamental representations and algorithms used by THOR. In Section 3, we present the overall system architecture. Experimental results are discussed in Section 4, and related work in Section 5. We conclude in Section 6 with our final thoughts and suggestions for future extensions.

2. THOR PAGE CONTENT ANALYZER

2.1 Design Ideas

The Web as a whole is a mélange of content, structure, and organization. In the small, however, particular subsets of the Web tend to be very closely related both in terms of structure and content. Let us explore this idea of relatedness a bit more closely now. A single design firm may produce a web site for a physician and one for an attorney. The sites, though concerning very different subjects, are structured similarly both in site layout and in page structure. Clearly, these sites are related structurally to some degree.

BarnesAndNoble.com and Amazon.com both contain many pages devoted to books common to both sites. Though laid out differently, the two sites are closely related in terms of content. Unfortunately, the cost of identifying and leveraging these forms of relatedness using naive similarity metrics becomes prohibitive when scaling a tool to the size of the Web. Here we present some key insights that make the challenge of incorporating relatedness into our search engine more manageable. It is our experience that many web sites tend to structure dynamically-generated pages in a similar fashion, often reusing a standard template. For now, let us consider all pages generated by a particular HTML form from a single domain as potentially structurally related. But these dynamically-generated pages will tend to differ somewhat in HTML tag structure depending on the category of their output – be it an empty results page, an exception page, or a normal results page. Certainly among larger sites like Google.com, Amazon.com, and IBM.com, this is true. Then, for a particular category of results pages – say, a set of normal results pages from Amazon.com – cross-page content information may yield clues as to which fragments of each page contain the "interesting" data. Some fragments contain information that is similar across all pages in the cluster, while other fragments are dynamically-generated in response to a particular user query. This suggests a two-phase clustering approach that takes advantage of these levels of relatedness. First, we will cluster a collection of dynamically-generated pages into different page clusters based on the page templates discovered – to separate the structurally different exception pages from the normal results pages at Amazon.com, for example. We refer to this phase as the page clustering phase. Second, we will assume the content of page fragments across structurally-similar page clusters belongs to a single cluster, and use intra-cluster similarity metrics to filter out common content and to home in more effectively on the "interesting" content fragments. We call this second phase the fragment clustering phase. Now that we have previewed some of the key details of THOR, it is necessary to describe more formally some of the ideas presented so far.

2.2 Modeling Web Pages as Tag Trees

There are two primary web page representations used by THOR. First, a page is transformed into a well-formed tag tree consisting of tag nodes and content nodes. Then the tag nodes and the content of each candidate subtree are separately transformed into a vector space representation. For each page, we may divide the underlying markup source into tags and text. By tag, we mean all of the characters between an opening bracket "<" and the matching closing bracket ">", where each tag has a tag name (e.g., BR, HEAD, or TD) and a set of tag attributes. The text is the sequence of characters between consecutive tags. To convert an HTML or XML page into a tag tree requires that the page conforms to a basic notion of being well-formed [15]. The requirements of a well-formed page include, but are not limited to, the following: all start tags, including standalone tags, must have a matching end tag; all attribute values must be in quotes; tags must strictly nest; and so on. These requirements are necessary so that all pages may be consistently transformed into a tag tree representation.

Figure 1: Raw HTML for a sample web page. (The listing shows a simple HTML document: a HEAD containing the TITLE "Sample Page", and a BODY containing an H1 heading "Object A", an HR rule, and an H1 heading "Object B".)

Figure 2: Tag tree representation. (The tree has an HTML root with HEAD and BODY children; HEAD contains TITLE with the leaf "Sample Page"; BODY contains H1 with leaf "Object A", HR, and H1 with leaf "Object B".)

Pages that do not satisfy these criteria are automatically transformed into well-formed pages using a standard conversion tool such as Tidy [11]. Once a page is well-formed, it may be transformed into a tag tree representation consisting of tag nodes and leaf nodes. A tag node consists of all the characters from a particular start tag to its corresponding end tag, and is labeled by the name of the start tag. A leaf node consists of all the characters between a start tag and its corresponding end tag or between an end tag and the next start tag; we label a leaf node by its content. In Figure 1, the TITLE element containing "Sample Page" is an example of a tag node, while the text "Sample Page" by itself is an example of a leaf node. Now we may formally define the notion of a tag tree. Let T = (V, E) be the tag tree of a web page W, where V = V_T ∪ V_C, V_T is a finite set of tag nodes, V_C is a finite set of content nodes, and E ⊂ (V × V) is the set of directed edges. Figure 2 is the tag tree representation of this example page, and each node in this tag tree represents a subtree fragment of the page. To complete the transformation to a tag tree, we must define the notion of a path from one node to another. From the root node (the HTML node in this case), there exists a path to every other node defined by the page. As a result, we may describe any node in a tag tree by its unique path. The path html[1].head[1].title[1] uniquely identifies the path from the root node to the title node, where the numbers in brackets indicate the order of the child in the tag tree. Similarly, the node corresponding to the first H1 heading ("Object A") may be denoted by the path html[1].body[2].h1[1]. Critical to the analysis in the following sections is the notion of a minimal subtree. We call a subtree anchored at node u, u ∈ V, a minimal subtree with the property P if it is the smallest subtree that meets the following condition: there is no other subtree, say subtree(w), w ∈ V, which satisfies both the property P and the condition that u is an ancestor of w. A minimal subtree will correspond to our notion of the potentially object-rich fragments of a web page.

2.3 Transforming Tag Trees into Vector Space

Using techniques common to the information retrieval community, we may represent both the tags and the content of each web page as a vector of terms and weights [12, 13]. THOR uses two different vector space representations: a tag signature of each page for clustering structurally-similar pages, and a content signature for each subtree within a page for the cross-page content analysis. The vector space representation of the tags allows for measurements of the similarity between pages in terms of tag structure. The vector space representation of the component subtrees allows for measurements of the similarity between component subtrees in terms of size, shape, depth, and breadth. In typical information retrieval applications, a document is usually preprocessed to remove any commonly occurring words (called stop words; examples often include "the", "a", and "you") and to transform each word into its base form by removing prefixes and suffixes (called stemming [10]). The document can then be represented as a vector of terms and weights, where the weight is initially assigned to be the frequency of the term's occurrence within the document. A document containing n words could be described by:

\[ document_i = \{ (term_1, weight_1),\ (term_2, weight_2),\ \ldots,\ (term_n, weight_n) \} \]

The vector for a particular document may then be normalized using term-frequency inverse-document-frequency (TFIDF), a technique that reweighs all term vectors based on the characteristics of all the documents across the entire document space. In the case of clustering web pages by tag signatures, the space of all documents is the set of all pages generated by a particular search form, and the space of terms for a given document is the set of representative tags of the document. In the case of clustering the content, the set of all pages is the set of structurally-similar pages, and the space of terms for a document is the set of keywords that summarize the document and distinguish it from others in the collection. We use a variation of a fairly standard version of TFIDF that prescribes that the weight for term k in document i be:

\[ w_{ik} = \frac{tf_{ik}}{tf_{k,\max}} \cdot \log\!\left(\frac{N}{n_k} + 1\right) \]

where tf_ik = frequency of term_k in document_i; tf_k,max = maximum frequency of term_k across the document space; N = total number of documents in the document space; and n_k = number of documents in the document space that contain term_k. We then normalize each vector. TFIDF weights terms highly if they frequently occur in relevant pages, but infrequently occur across the corpus of documents. Conversely, if a term occurs in every document then its weight will be low and, hence, the term will be a poor discriminator.

We use the cosine similarity metric (or normalized dot product) over these tag and content signatures, which is essentially an inverse distance metric over the vector space:

\[ similarity(document_i, document_j) = \sum_k w_{ik}\, w_{jk} \]

where w_ik and w_jk are the weights for term k in document i and document j, respectively. Orthogonal vectors, for example, will be completely dissimilar (i.e., similarity(document_i, document_j) = 0). Combining this similarity metric with the TFIDF weight system leads to a situation in which terms like "Price" and "Order" that occur across many Amazon.com pages will not perversely force two otherwise dissimilar vectors to be considered similar.
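As an illustration of the weighting and similarity computations just described (a sketch under the reconstructed formulas above, not code from THOR), the following computes TFIDF weights over sparse term-frequency vectors and their normalized dot product; all function names are hypothetical.

```python
# Sketch: TFIDF weighting and cosine (normalized dot product) similarity
# over sparse term-frequency dictionaries.
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of Counter(term -> raw frequency). Returns unit-length TFIDF vectors."""
    N = len(docs)
    df = Counter()                       # number of documents containing each term
    for d in docs:
        df.update(d.keys())
    tf_max = Counter()                   # maximum frequency of each term across the space
    for d in docs:
        for t, f in d.items():
            tf_max[t] = max(tf_max[t], f)
    vectors = []
    for d in docs:
        w = {t: (f / tf_max[t]) * math.log(N / df[t] + 1) for t, f in d.items()}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        vectors.append({t: v / norm for t, v in w.items()})
    return vectors

def cosine(u, v):
    """Normalized dot product over sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

docs = [Counter({"html": 1, "body": 1, "h1": 2, "hr": 1}),
        Counter({"html": 1, "body": 1, "p": 3})]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))
```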

In the next two subsections, we first describe the design idea of the page clustering algorithm and its role in separating content-rich pages from exception or error pages. Then we describe the fragment-based clustering algorithms for cross-page content analysis and how the fragment-level clustering discovers object-rich content regions.

2.4 Clustering Pages by Page Templates

Empirically, we have observed that dynamically-generated pages from a particular site tend to fall into one of a number of closely related structural patterns. By structural patterns we mean the tag-tree layout of a page. Often, a site may use several templates – one for error pages, one for results that span several pages, and one for results that span a single page. Knowing this, THOR segments all of the pages from a particular site into related clusters. Since THOR is designed to be a fully automated system, it is infeasible to build a supervised learning classifier for each site – the Web is too large and dynamic, and the cost of labeling training data is prohibitive. As a result, it is necessary to rely on unsupervised clustering techniques. In the pathological case, each of n pages would belong to a distinct cluster. In this case, the assumption of relatedness across pages generated by a particular site or even by a particular dynamic-page generating mechanism would fail and the benefits of our system would be nullified. But for extreme cases like this, regular indexing techniques should suffice. There are several possible options for the representation used to cluster pages, from converting trees into simple strings, so that string similarity algorithms can be used, to direct tree comparisons where nodes and links are weighted differently. In the first prototype implementation of THOR, we chose a simple yet effective representation: the normalized tag signature of a page. For a page, we define its normalized tag signature as the normalized vector of tags and tag occurrences. Continuing our example from Figure 1, the normalized tag signature of the sample page would be:

\[ TagSignature_{NORM} = \{ (\texttt{<html>}, 0.33),\ (\texttt{<head>}, 0.33),\ (\texttt{<title>}, 0.33),\ (\texttt{<body>}, 0.33),\ (\texttt{<h1>}, 0.67),\ (\texttt{<hr>}, 0.33) \} \]

Given a collection of tag signatures representing a set of domain-specific dynamically-generated pages, there are several options for clustering [8]. We have chosen k-means since it is simple yet effective and efficient. K-means works by

initially generating k random tag signature cluster centers. Each tag signature is then assigned to the cluster with the closest center. New centers are calculated based on the centroid of each cluster (see Section 2.5 for the details of the centroid calculation). The cycle of calculating centroids and assigning tag signatures to clusters repeats until no cluster centroid changes. In Section 4, we present experimental results that support our choice of representation and our choice of clustering algorithm. Due to space restrictions, we omit the concrete algorithm for constructing the normalized tag signature and the details of the k-means algorithm for clustering web pages; readers who are interested in further details may refer to our technical report [5].
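The page clustering phase might be sketched as follows. This is only a stand-in for the WEKA clusterer that THOR actually uses: it builds normalized tag signatures and runs a small hand-rolled k-means with cosine similarity, and the example pages and parameter choices are purely illustrative.

```python
# Sketch: k-means over normalized tag signatures using cosine similarity.
import random
from collections import Counter

def tag_signature(tag_counts):
    """Normalize a Counter of tag occurrences into a unit-length signature."""
    norm = sum(c * c for c in tag_counts.values()) ** 0.5
    return {t: c / norm for t, c in tag_counts.items()}

def centroid(signatures):
    """Average vector of a group of signatures."""
    total = Counter()
    for s in signatures:
        total.update(s)
    return {t: v / len(signatures) for t, v in total.items()}

def cosine(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def kmeans(signatures, k=2, iters=20, seed=0):
    random.seed(seed)
    centers = random.sample(signatures, k)
    assignment = [0] * len(signatures)
    for _ in range(iters):
        # assign each signature to the most similar center
        assignment = [max(range(k), key=lambda c: cosine(s, centers[c]))
                      for s in signatures]
        # recompute centers as cluster centroids (keep old center if a cluster empties)
        centers = [centroid([s for s, a in zip(signatures, assignment) if a == c]
                            or [centers[c]])
                   for c in range(k)]
    return assignment

pages = [Counter({"html": 1, "head": 1, "title": 1, "body": 1, "h1": 2, "hr": 1}),
         Counter({"html": 1, "head": 1, "title": 1, "body": 1, "p": 4}),
         Counter({"html": 1, "body": 1, "h1": 3, "hr": 2})]
print(kmeans([tag_signature(p) for p in pages], k=2))
```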

2.5 Discovering Object-Rich Content Fragments

Once the collection of dynamically-generated pages has been clustered into several groups of structurally-similar pages based on the page templates discovered, the structurally different exception pages can be separated from the set of normal results pages. For the clusters of normal results pages, cross-page content analysis can be conducted to yield clues as to which fragments of each page contain the "interesting" data, namely the fragments that are dynamically-generated in response to a particular user query. We refer to such fragments as the query-based object-rich content regions in a page. Formally, such fragments are the minimal subtrees of the page which contain the "interesting" data. Therefore, the problem of discovering the object-rich content regions in a page can be reduced to the problem of accurately identifying the minimal object-rich subtrees of a page. In THOR, we conduct cross-page content analysis of pages in a page-cluster by analyzing subtrees using an average similarity metric based on the cosine-similarity measure. Similar to the normalized tag signature representation used in the page clustering described in Section 2.4, we use the normalized content signature of a subtree as the representation for the subtree analysis. If we again consider Figure 1, note the subtree rooted at the BODY node, denoted by the path html[1].body[2]. Its content consists of two occurrences of "Object", one of "A", and one of "B". We may represent this content signature as:

\[ ContentSignature_{NORM} = \{ (\textrm{``Object''}, 0.82),\ (\textrm{``A''}, 0.41),\ (\textrm{``B''}, 0.41) \} \]

To exploit the content of each subtree, it is necessary to define a canonical representation of each candidate subtree for a particular cluster of web pages. In a vector space, the canonical representation can be defined in terms of the centroid of a group of subtrees, where the centroid is defined as the average vector of the group of subtrees. Let i denote the identifier of a cluster of web pages, and let n be the number of terms used to define the centroid of cluster i. Each term is shared among the k pages in the cluster, and each page gives a weight (such as a term frequency) to the term. The centroid of cluster i can then be defined as a vector of n elements, each consisting of a term_j (j = 1, . . . , n) and the average weight of the term over the k pages:

\[ centroid_i = \left\{ \left(term_1, \tfrac{1}{k}\sum_{p=1}^{k} w_{1p}\right),\ \left(term_2, \tfrac{1}{k}\sum_{p=1}^{k} w_{2p}\right),\ \ldots,\ \left(term_n, \tfrac{1}{k}\sum_{p=1}^{k} w_{np}\right) \right\} \]

where w_jp denotes the weight of term_j in page p of the cluster.

To calculate the centroid, one must identify common subtrees across all pages of the cluster; the centroid for a particular subtree would be meaningless if the subtree existed in only a subset of the cluster's pages. For a particular cluster of similarly-structured domain-specific pages, it is reasonable to expect some subtrees (like the subtrees corresponding to advertisements and the navigation bar) to be relatively static, while the object-rich portions should vary dramatically from page to page. That is, in the vector space, most non-object-rich subtrees should be clustered closely around their centroid; object-rich subtrees should be more widely dispersed. Using the cosine-similarity metric based on the centroid, the object-rich portion may therefore be defined as the subtree that is most dissimilar to its centroid. It is relatively easy to extend this analysis to allow for the identification of multiple object-rich subtrees. Unlike sites like Google.com that have only one object-rich subtree, many sites like Amazon.com and CNN.com tend to have multiple "interesting" subtrees. Some efforts in the past have been quite successful at identifying one object-rich portion for indexing, but have failed to extend to the more general case. A similarity cutoff may be determined experimentally, below which any subtree may be classified as object-rich. Readers who are interested in further details may refer to our technical report [5].
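The fragment clustering phase can be sketched in the same spirit. Assuming each candidate subtree path occurs in every page of the cluster, the sketch below builds content signatures, computes the per-path centroid, and ranks paths from most to least dissimilar to their centroid; the data and function names are illustrative, not THOR's.

```python
# Sketch: rank candidate subtree paths by dissimilarity to their centroid.
from collections import Counter

def content_signature(words):
    counts = Counter(words)
    norm = sum(c * c for c in counts.values()) ** 0.5
    return {w: c / norm for w, c in counts.items()}

def centroid(signatures):
    total = Counter()
    for s in signatures:
        total.update(s)
    return {t: v / len(signatures) for t, v in total.items()}

def cosine(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def rank_object_rich(subtree_text_by_path):
    """subtree_text_by_path: {path: [word list for that subtree on each page]}.
    Returns paths ordered from most to least likely object-rich."""
    scores = {}
    for path, texts in subtree_text_by_path.items():
        sigs = [content_signature(t) for t in texts]
        c = centroid(sigs)
        # low average similarity to the centroid = widely dispersed = likely object-rich
        scores[path] = sum(cosine(s, c) for s in sigs) / len(sigs)
    return sorted(scores, key=scores.get)

example = {
    "html[1].body[2].table[1]": [["harry", "potter", "book"], ["linux", "kernel", "guide"]],
    "html[1].body[2].div[3]":   [["home", "help", "cart"], ["home", "help", "cart"]],
}
print(rank_object_rich(example))   # the varying table ranks ahead of the static div
```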

In this section we have discussed our two-phase clustering scheme for clustering pages by their page templates and for cross-page content analysis using fragment-level clustering to discover object-rich regions in web pages. In the next section, we briefly describe the THOR system and the implementation architecture that supports this two-phase clustering scheme. Then, in Section 4, we report our experiments showing the effectiveness of the two-phase clustering approach and the benefits of the THOR system.

3. THOR SYSTEM OVERVIEW

3.1 System Architecture

The system architecture is illustrated in Figure 3. The system works in four major steps:
• Web page generation, cleaning, and transformation;
• Clustering of structurally-similar pages;
• Identification of object-rich fragments; and
• Identification of object separators for each subtree.

Page Generation and Cleaning. Although the efficient and representative generation of dynamic pages is not the focus of this paper, it is an important feature in the development of any search engine over dynamic content. For this work, a random selection of dictionary words is used as input to the search forms located on various web sites. Of course, we can imagine potentially more effective schemes to generate pages using perhaps a selection of terms from related or linked pages, but we leave this as an avenue of future research.

Each page generated is then placed in a set corresponding to the site and form from which it was generated, so book pages from Amazon.com's book search form are distinct from DVD pages from the DVD search form. Each page is then cleaned using a markup normalization tool, such as Tidy [11], so that it is well-formed. Finally, each page is converted into a tag tree.

Clustering Structurally-Similar Pages. Each tag tree is then converted into its corresponding tag signature as described in Sections 2.2 and 2.3. The k-means clustering algorithm is applied and each page is assigned to its appropriate cluster.

Identifying Object-Rich Fragments. For a particular cluster, each subtree of each page is transformed into its content vector. Subtrees are ranked in descending order of their similarity to the centroid for that particular subtree. We identify the object-rich fragments as the subtrees most dissimilar to their centroids.

Identifying Object Separators. Once object-rich subtrees have been identified, it is necessary to determine how the actual objects are divided within the subtree. In a table, the table row tag <tr> may be the relevant object separator, but often the separation of objects is not so trivial. We have previously developed a methodology for finding the correct object separator tag using a two-step process. First, certain tags within the minimal subtree are labeled as candidate object separator tags; a number of tags are frequently used to identify object boundaries in the various types of content structure found in HTML pages: for example, the paragraph tag <p> for paragraph structure, the table row tag <tr> for table structure, and the list item tag <li> for list structure. Then, the characteristics (tag patterns, paths, size, and so forth) of these candidate tags are analyzed to determine the correct object separator tag. We have designed a set of algorithms that produces a ranked list of object separators based on characteristics like tag appearance counts, standard deviation, identifiable tags, partial path count, and sibling count. Once the separator tags have been identified, objects are extracted from the raw text data of the web page. An object is defined as the fragment between two adjacent object separator tags. The object construction algorithm analyzes the common hypertext structure of the list of candidate objects extracted and determines whether any adjacent fragments should be combined into a single object. See [4] for a detailed discussion of these techniques.
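As a simplified illustration of this last step (it does not reproduce the separator-ranking heuristics of [4]), the sketch below splits the children of an object-rich subtree into objects given an already-chosen separator tag.

```python
# Sketch: group a subtree's direct children into candidate objects,
# starting a new object at every occurrence of the separator tag.
import xml.etree.ElementTree as ET

def split_objects(subtree_xml, separator_tag):
    root = ET.fromstring(subtree_xml)
    objects, current = [], []
    for child in root:
        if child.tag == separator_tag and current:
            objects.append(current)
            current = []
        text = " ".join(t.strip() for t in child.itertext() if t.strip())
        if text:
            current.append(text)
    if current:
        objects.append(current)
    return [" ".join(parts) for parts in objects]

rows = ("<table><tr><td>Harry Potter 1</td><td>$10</td></tr>"
        "<tr><td>Harry Potter 2</td><td>$12</td></tr></table>")
print(split_objects(rows, "tr"))   # two objects, one per table row
```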

3.2 Implementation Details

    The software underlying the THOR architecture is written in Java 1.4, and relies on two key libraries: the JTidy implementation of the Tidy cleaning algorithm and the University of Waikato’s WEKA k-means clustering package [11, 14].

3.3 Parameters

In THOR, there are several parameters that influence the effectiveness and the final precision and recall of the algorithms for identifying the query-specific object-rich content regions in dynamically-generated web pages. First, there is the number of clusters used to group pages from a single web site.

Figure 3: THOR System Architecture. (Diagram summary: Step 1, prepare documents: HTML files are cleaned into well-formed documents and converted into tag trees. Step 2, cluster pages by type: tag extraction and weighting produces a set of weighted tag vectors that the clustering tool groups into clustered pages. Step 3, cluster subtrees: subtree extraction and weighting produces weighted content term vectors and subtree centroids, and cluster elimination yields the data-rich subtrees. Step 4, extract objects: Omini object extraction produces the extracted objects.)
Typically, we expect this number to be small; most forms only generate two types of data: a list of results matching the input terms, or an exception page when no results match. Some web sites may support additional intermediate non-query-related pages, such as the NCBI BLAST delay page. Experimentally, we have found that we can obtain good results by setting the number of clusters between two and five. The second parameter is the number of candidate subtrees per page to examine within a page-cluster to discover the minimal object-rich subtrees. The simplest solution is to examine all possible subtrees; however, it is computationally prohibitive to match and analyze every possible subtree in a page. Our experiments have shown that an average web page contains in excess of 1,000 nodes, and it is expensive to execute the subtree clustering algorithm for each node of every page in a given cluster of web pages. In the first prototype implementation of THOR, we have limited the analysis to the subtrees considered most promising in terms of a set of tree characteristics (fanout, content size, tag count, link density, and so on).
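A rough sketch of this pruning step is shown below; the characteristics follow the list above, but the thresholds are illustrative assumptions rather than THOR's tuned values.

```python
# Sketch: keep only candidate subtrees whose fanout, content size, and link
# density suggest they could hold repeated result objects.
import xml.etree.ElementTree as ET

def is_promising(node, min_fanout=3, min_chars=100, max_link_density=0.5):
    fanout = len(list(node))
    text = " ".join(t.strip() for t in node.itertext() if t.strip())
    links = sum(1 for _ in node.iter("a"))
    tags = sum(1 for _ in node.iter())
    link_density = links / tags if tags else 0.0
    return (fanout >= min_fanout and len(text) >= min_chars
            and link_density <= max_link_density)

def candidate_subtrees(root):
    """Return the subtrees worth running the fragment clustering over."""
    return [node for node in root.iter() if is_promising(node)]

root = ET.fromstring("<div>" + "<p>result item with enough text to matter</p>" * 4 + "</div>")
print(len(candidate_subtrees(root)))   # only the enclosing div qualifies
```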

4. EXPERIMENTS

In this section we report three sets of experiments designed to evaluate the THOR system. The first set of experiments is dedicated to studying the effectiveness of the first phase of the algorithm – the page clustering phase. We assess five different page representations and various clusterer settings. In the second set of experiments we investigate the effectiveness of the fragment detection and object extraction phase. In our final experiment, we examine THOR's flexibility in identifying multiple object-rich fragments on a single page. For all of the experiments, we generated 100 words from the standard Unix dictionary, plus 10 nonsense words. Using a breadth-first crawl of the Web starting at the author's homepage and Google.com, we identified over 3,000 unique search forms. We randomly selected 50 of the 3,000 search forms. We then passed each word into the search forms, and the resulting dynamically-generated pages were cached locally, resulting in a set of 5,500 pages for analysis and testing. Since the focus of this research is on the page content analyzer, not object separator identification, we have focused

our experimental results on assessing the quality of THOR at identifying the object-rich subtrees. Our previous research [4] into discovering good object separators in a subtree shows that there is a direct correlation between the minimal object-rich subtree identification phase and the object boundary discovery and object extraction phase: if the correct subtree is not identified, the object separation and extraction will definitely fail, and if the correct subtree is identified, the object separation and extraction will succeed over 90% of the time.

4.1 Page Clustering

In the first set of experiments we examine the page clustering phase of our two-phase scheme. Since the success of any clusterer is driven by the underlying representations, we have chosen to investigate five different page representations:
• the tags of the page
• the top-10 keywords of the page
• the entire text of the page
• a mixture of tags and the top-10 keywords of the page
• a mixture of tags and the entire text

To evaluate each approach, we generated the vector space representation as described in Section 2, and then ran each collection through a k-means clusterer set for two clusters. We summarize the results in Table 1. We labeled a page as correctly clustered if it was in a cluster with at least half the pages being of the same type (either a normal results page or an exception page).

Table 1: Page Clustering Comparison
                          Pages Correctly Clustered   Std Dev
  Tags Only                         95%                  5%
  Top-10 Keywords                   88%                 12%
  Text Only                         81%                 18%
  Tags and Top-10 Words             91%                  8%
  Tags and All Text                 87%                 11%
  Random                            78%                   –

Our results are a reflection of the power of each representation to differentiate the pages. As we would expect, using only the text of the page to cluster yields very poor results. We expect pages with dynamic content to vary greatly in their content – using only text creates a sparse scatter of pages in vector space from which it is difficult to find natural clusters. On the other hand, the tags of a group of pages tend to better indicate the kind of page – a "No Results" page from IBM.com will have a substantially different tag signature from a normal results page. Using the top-10 keywords performs adequately, as do the mixture representations, but in all cases these alternatives underperform the tag signature. We also note that the poorer performers also possess a larger standard deviation, indicating that the clusterer was more scattershot from site to site. As a baseline, we also indicate the results of a random clustering of pages. In our collection, one cluster had on average 85 members and the other cluster had 25, so a completely random assignment would on average mis-cluster 22% of the pages. Based on these results, we choose the tag signature as our fundamental page representation.

As a further illustration of the effectiveness of the tag signature, consider Table 2, which lists the average tag signature for the 34 "No Results" pages and the 76 normal results pages for IBM.com. On inspection of these tag signatures, we can see why the clustering was so successful. The tag signatures differ both in the presence of certain tags and in the relative contribution of each tag to the overall signature. For example, three tags appear in the "No Results" signature but not in the normal results pages; conversely, one tag appears in the normal results pages but not in the "No Results" pages.

Table 2: Average Tag Signatures for IBM. (The table lists, for each tag, its average weight in the normal results pages and in the "No Results" pages; both signature vectors have length 1.00.)

Additionally, we conducted extensive experiments with various clusterers – from k-means to a version of expectation maximization to a version of CLASSIT. We also experimented with various cluster settings, such as varying the number of clusters. We found that k-means performed as well as or better than all other clusterers, and that varying the cluster number resulted in only minor changes to the overall performance of the system. Additionally, we are encouraged by the simplicity and efficiency of k-means. If we set the number of clusters greater than the number of actual clusters, the clustering algorithm will merely generate
    more refined clusters. This is not a problem in our context, since object extraction is dependent only on the quality of each cluster; a sufficiently good cluster will yield reasonable results regardless of the grain of the cluster.

4.2 Fragment Detection

In our second set of experiments, we assessed precision and recall over the pages with actual query-related objects present. We measured precision as the percentage of recommended subtrees that THOR correctly labeled as object-rich. We measured recall as the percentage of all recommended subtrees that contained the object-rich subtree at least as a subset. Table 3 summarizes our results for the single most object-rich subtree case. We expect that these performance figures may be improved significantly in future versions of THOR through algorithmic and parameter optimizations.

Table 3: Single Object-Rich Fragment
            Precision   Recall
  Average     92.2%     96.3%

On inspection of the mis-labeled pages, we discovered that THOR was sometimes confused by pages with a region of dynamic non-query-related data. For example, some pages generate an advertisement region that varies somewhat across the space of pages. As a result, the intra-cluster content analysis may incorrectly identify the dynamic advertisement as an object-rich region. It is interesting to note that five of the 50 sites analyzed had previously been incorrectly labeled by Omini, a heuristics-based object extraction system [4]. For each of these five sites (clockstop, goto, newsblip, booksite, and k9country), THOR was successful at identifying the appropriate object-rich subtree.

4.3 Extracting Multiple Object-Rich Fragments

In our final experiment, we further evaluated THOR's effectiveness at identifying multiple object-rich portions. This experiment runs over five of the 50 page collections. Each of the five sets of pages contained at least two object-rich regions. As a result, we considered THOR's top two recommendations instead of just the top-ranked one. We summarize our results in Table 4.

Table 4: Multiple Object-Rich Fragments
            Precision   Recall
  Average     90.5%     95.5%

For Amazon.com books, for example, we found that the top two object-rich subtrees identified by THOR were the main results section and the "See Related Items" sidebar. The "See Related Items" sidebar is filled with query-related items from other Amazon.com stores, like the toys, DVDs, and music sections of Amazon.com. Finding these non-obvious object-rich regions greatly improves the power and success rate of THOR for automated object extraction from dynamically-generated web pages.

5. RELATED WORK

    Object extraction is closely related to the answer extraction problem in the context of natural language question answering [1, 3]. Efficiently extracting objects from the

Web provides a foundation for answer extraction through an increased scope of possible data sources to query and by filtering out the noise inherent in the Web – noise like the navigation bars, advertisements, and other boilerplate. The object-extraction problem has been previously explored by Omini [4]. Rather than rely on inter-page similarities and differences as THOR does, Omini analyzes each page independently of all others. The drawback of this work is that it concentrates on a single fragment of the page from which to draw data objects, and cannot distinguish relevant content from other types of complex page regions, such as navigation bars. Current pages are more complex, containing multiple data regions and many complex non-data regions, which THOR is able to accurately detect and then either extract from or discard as appropriate. The WHIRL system [6] uses tag patterns and textual similarity of items stored in a deductive database to extract simple lists or hot lists (lists of hyperlinks). The authors present several methods for identifying interesting structures in a web page. The system relies on previously acquired information in its deductive database in order to recognize data in target pages. For data extraction across the heterogeneity of the Web, this approach is infeasible. Bar-Yossef and Rajagopalan have identified the distinct subsections of a web page as pagelets, an idea somewhat similar to our minimal subtree formulation [2]. They suggest that the pagelet is the proper unit for information retrieval, due to the pervasive use of templates in generating web pages. Their identification of pagelets relies on very close similarity. In contrast, our subtree identification process finds equivalent subtrees across a constrained set of pages where the content is dissimilar. Crescenzi et al. have presented the RoadRunner algorithm for automatically extracting data from web pages [7]. Their algorithm compares two pages generated by the same query form and constructs a regular expression based on the differences between the two pages. When applied to many sites, this approach quickly breaks down because of minor variations in non-data portions of pages that are subsequently identified as attributes. The problem of classifying web pages has been previously tackled by Glover et al. and others. Glover et al. use web structure to classify pages in the larger space of all web documents [9]. While their approach is appropriate for assessing the textual relevance of the content of a document to a query, it is inappropriate for clustering pages from the same domain, or for identifying the appropriate sections of a document.

6. CONCLUSIONS AND NEXT STEPS

    The identification and extraction of relevant objects from dynamically-generated web pages is a powerful tool necessary for the realization of a dynamic search engine. The main thrust of this paper is the two-level clustering of web pages based on tag structure similarity and content layout similarity. THOR’s two-phase clustering scheme capitalizes on many IR techniques to exploit local homogeneity in dynamically-generated web pages through the relationships that hold across pages. As our extensive experiments have shown, THOR is an effective platform from which to power a next-generation search engine for searching and indexing dynamic web content. Our research continues along the direction of improving the coverage and success rate of THOR. We are interested in

improving the sampling quality of the dynamically-generated web pages. Simply supplying random words to a search form is a reasonable first step, but we would expect that a more context-sensitive selection of search terms could yield better coverage of the entire space of possible dynamically-generated pages. We are also interested in further exploring other page representations and clustering methods, as well as the further refinement of the IR tools utilized.

7. ACKNOWLEDGMENTS

Special thanks to Ashwin Ram for many helpful suggestions and to Chris Sprague for advice in implementing THOR's clustering engine. The third author is partially supported by NSF, DoE, and DARPA.

8. REFERENCES

[1] S. Abney, M. Collins, and A. Singhal. Answer extraction. In Proceedings of ANLP'00, April 2000.
[2] Z. Bar-Yossef and S. Rajagopalan. Template detection via data mining and its applications. In Proceedings of WWW'02, May 2002.
[3] J. Burger et al. Issues, tasks and program structures to roadmap research in question & answering (Q&A), March 2002.
[4] D. Buttler, L. Liu, and C. Pu. A fully automated object extraction system for the Web. In Proceedings of ICDCS'01, April 2001.
[5] J. Caverlee, D. Buttler, and L. Liu. Discovering objects in dynamically-generated web pages. Technical report, Georgia Institute of Technology, January 2003.
[6] W. Cohen. Recognizing structure in web pages using similarity queries. In Proceedings of AAAI'99, July 1999.
[7] V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards automatic data extraction from large web sites. In Proceedings of VLDB'01, September 2001.
[8] R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley and Sons, New York, 2001.
[9] E. J. Glover, K. Tsioutsiouliklis, S. Lawrence, D. M. Pennock, and G. W. Flake. Using web structure for classifying and describing web pages. In Proceedings of WWW'02, May 2002.
[10] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[11] D. Raggett. Clean up your web pages with HTML TIDY, 1999.
[12] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. In Readings in Information Retrieval. Morgan Kaufmann, San Francisco, CA, 1997.
[13] G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
[14] I. H. Witten and E. Frank. Data Mining: Practical machine learning tools with Java implementations. Morgan Kaufmann, San Francisco, 1999.
[15] World Wide Web Consortium. Well formed XML documents, 2000.
