Mapping the semantics of Web text and links

Filippo Menczer
School of Informatics and Department of Computer Science
Indiana University, Bloomington
[email protected]

ABSTRACT


Content and link information from the Web is used by search engines to crawl, index, retrieve, and rank pages. The correlations between similarity measures based on these cues and the semantic associations between pages are therefore crucial in determining the performance of any search tool. A great deal of research is under way to understand how to automatically extract semantic information from Web pages by mining their text and links. Here I quantitatively analyze the relationship between content, link, and semantic similarity measures across a massive number of Web page pairs. Maps of semantic similarity across textual and link similarity domains help visualize the potential and limitations of content and link analysis for relevance approximation, and provide us with a way to analyze whether and how text and link based measures should be combined. Highly heterogeneous topical maps suggest that link and content analysis should be specialized based on search context. Finally, I show how semantic maps can be used to evaluate the performance of search engines in a semi-supervised fashion, by identifying a single relevant page for a given query. The methodology is illustrated by graphing precision-recall plots for three commercial search engines based on TREC queries.

Categories and Subject Descriptors

H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.3.4 [Information Storage and Retrieval]: Systems and Software—Performance evaluation (effectiveness)

Keywords

Web search, semantic maps, content and link similarity, precision, recall, semi-supervised evaluation

1. INTRODUCTION

Search engines must match a user's information needs with the meaning of Web pages. This requires a number of conditions to be satisfied: (i) the pages that are relevant to the user must have been crawled and indexed; (ii) the engine must be able to recognize and retrieve all of the relevant pages from its index; and (iii) the engine must rank the relevant pages so as to show the best ones to the user. These conditions all assume correlations between page meaning and observable cues such as keywords and hyperlinks. Such assumptions obviously hold to varying degrees for different users, queries, and search engines. To what extent can search engines rely on these assumptions, in general and for specific topics? In this paper we address these questions by modeling the two classes of cues used by virtually all search engines — text and links — and the similarity relationships that exist between pages with respect to such cues. We then study empirically how these relationships map onto the semantic similarity between pages. This way we explore in a single framework what page content and links say about each other, and what they say about meaning.

Consider two objects p and q. An object can be a page or a query; for simplicity I will refer to objects as pages in this paper. To satisfy the conditions listed above, a search engine must compute a semantic similarity function σs(p, q) that establishes the degree to which the meanings of the two pages are related. While people are good at computing σs, i.e., at assessing relevance, we can only approximate this function with computational methods. The performance and success of a search engine depend in great part on the sophistication and accuracy of the σs approximations implemented in its crawling, retrieval, and ranking algorithms. Understanding the limitations of such approximations is therefore crucial for the design of better search tools for the Web.

Semantic similarity is generally approximated from two main classes of cues in the Web: lexical cues (textual content) and link cues (hyperlinks). Content similarity metrics traditionally used by search engines to rank hits are derived from the vector space model [39], which represents each document or query by a vector with one dimension for each term and a weight along that dimension that estimates the term's contribution to the meaning of the document. The cluster hypothesis behind this model is that a document lexically close to a relevant document is also semantically relevant with high probability [42]. The latest generation of search engines integrates content and link metrics to improve ranking and crawling performance through better models of relevance.



The best known example is Google: pages are retrieved based on their content and ranked based on, among other factors, the PageRank measure, which is computed offline by query-independent link analysis [5]. Links are also used in conjunction with text to identify hub and authority pages for a certain subject [24], determine the reputation of a given site [37], and guide search agents crawling on behalf of users or topical search engines [34, 35, 11, 31, 36, 41]. Finally, link analysis has been applied to identify Web communities [19, 26, 15]. The hidden assumption behind all of these retrieval, ranking, crawling, and clustering algorithms that use link analysis to make semantic inferences is a correlation between the graph topology of the Web and the meaning of pages, or more precisely the conjecture that one can infer what a page is about by looking at its neighbors. This link-cluster hypothesis has been implied or stated in various forms [19, 3, 9, 14, 13] and confirmed empirically [33].

1.1 Contributions

In this paper I quantitatively explore the relationships between the content, link, and semantic topology of the Web at a fine level of resolution. The basic idea is to measure the correlations between σs, σc, and σl, where the two latter functions are similarity metrics based on lexical content and link cues, respectively. This study has the following goals:

1. Study the distribution and correlation of similarity measures based on content, links, and (human-assessed) meaning of Web pages.
2. Estimate the quality of the cues about meaning that one can obtain from local text and link analysis.
3. Explore whether and how σc and σl should be combined to better approximate σs.
4. Analyze the sensitivity of these relationships to the topical context of a user's information needs.
5. Illustrate a practical application of mapping σc and σl into σs, namely a semi-supervised evaluation methodology for search engines.

1.2 Background

This is by no means the first effort to draw a connection between Web topologies driven by content and link cues, or between either of these and semantic characterizations of pages. Recently, for example, theoretical models have been proposed to unify content and link generation based on latent semantic and link eigenvalue analysis [1, 12]. The approach presented here is more empirical, and simply aims to discover the actual correlations between these different sources of evidence. Navigation models for efficient Web crawling have provided a context for the study of functional relationships between link probability and forms of content [30] or semantic similarity [25, 11, 30]. The author has also analyzed the dependence of link topology on content similarity to interpret the Web's emergent link degree distribution through local, content-driven generative models [30, 32]. The findings in the present study are more geared toward tangible applications to Web information retrieval.

Several studies have related link and content similarity in the context of hypertext document classification [9, 8, 26, 18]. Here I am not asking how to classify Web pages, but rather what the text and links of a pair of pages tell us about their semantic relatedness. Closely related questions are explored in [10, 22]. In previous work [33] I have looked at the decay in content similarity as one crawls away from a seed page, showing that there is a strong negative correlation between link distance and content similarity. That type of analysis has important limitations, however. For one, knowledge of link distance requires exhaustive breadth-first search, which makes it expensive to crawl very far from the seed page (say, more than three links away). In addition, the choice of seed pages can bias the crawl dynamics considerably; for example, starting from a popular hub or authority page such as a Yahoo category will give quite different results than starting from some obscure personal homepage.

1.3 Outline

In part to overcome the above difficulties, and in part to analyze the relationship between text, links, and meaning on the Web at a finer granularity, here I treat content and link similarity as independent sources of evidence for estimating semantic similarity. These metrics, along with the experimental setup for data collection and analysis, are discussed in Section 2. Section 3 reports summary correlation statistics and background distributions of the three similarity metrics. In the remainder of the paper I use both projections and maps to visualize the correlations between content, link, and semantic topology (Section 4) and to study the dependence and distribution of semantic similarity as a function of content and link similarity (Sections 5 and 6). The heterogeneity of semantic maps across topical domains is discussed in Section 7. Finally, I illustrate the semi-supervised evaluation methodology in Section 8.

2. METHODOLOGY

Given the goal of mapping pairwise relationships, the first step is to sample a set of pages that are representative of the Web at large and for which independent semantic information is available, along with the content and link data that is locally accessible by crawling the pages. The second step is a brute-force approach: for each pair of pages p, q, measure σc(p, q), σl(p, q), and σs(p, q). Before describing each of these three measures in detail, let us review how the data was collected.

The Open Directory Project (ODP, http://dmoz.org) classifies a large number of URLs in a topical hierarchy. Compared to Yahoo, the ODP is less biased toward commercial content, or toward any type of content, because it is maintained by a large number of self-organized volunteer editors; compared to other Web directories based on volunteer editors, such as About.com and LookSmart, the ODP makes all of its data freely available through periodic RDF dumps. We started from all the URLs in an RDF dump of the ODP. For language consistency we eliminated the "World" branch, which classifies non-English pages. For classification consistency we also eliminated the "Regional" branch, which replicates the main topical tree for many geographical locations. This left 896,233 URLs organized into 97,614 topics. We sampled 10,000 URLs uniformly from each of the 15 top-level branches, resulting in a final set of 150,000 URLs belonging to 47,174 topics. All of these URLs corresponded to working links to HTML pages available via the HTTP protocol. The pages were crawled, preprocessed, and stored locally. For efficiency, only the first 20 KB of each page were downloaded, with a timeout of 10 seconds.

2.1 Content similarity

Let us define a content similarity

\sigma_c(p_1, p_2) = \frac{\sum_{k \in p_1 \cap p_2} w_{kp_1} w_{kp_2}}{\sqrt{\left(\sum_{k \in p_1} w_{kp_1}^2\right) \left(\sum_{k \in p_2} w_{kp_2}^2\right)}}    (1)

where (p1, p2) is a pair of Web pages and wkp is the frequency of term k in page p. This is the "cosine similarity" function, traditionally used in information retrieval because of a number of nice mathematical properties. For example, it does not suffer from the dimensionality bias that makes L-norms inappropriate as distance metrics in high-dimensional, sparse spaces such as the word vector space [39]. The use of simple term frequency in place of more sophisticated TF-IDF weighting schemes in Equation 1 is due to the need for a document representation that is insensitive to different page samples and to different topic subsets: we do not want to bias the content similarity measure by any assumption about specific word frequency distributions, since we are sampling pages from the Web at large. Noise words are eliminated [16] and other words are conflated using the standard Porter stemmer [38]. For each pair of pages in our sample, σc ∈ [0, 1] is readily computed from their textual representation, without any global knowledge of the collection of pages in the sample. One could of course explore alternative content similarity measures; however, our preliminary experiments indicate that other commonly used measures, such as the Jaccard coefficient, do not affect the analysis in the remainder of the paper.
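For concreteness, here is a minimal Python sketch of Equation 1 over raw term-frequency vectors. The tokenizer is simplified, and the `stopwords` set and `stem` hook stand in for the noise-word list [16] and the Porter stemmer [38] used in the actual preprocessing; they are assumptions, not the paper's exact pipeline.

```python
import math
import re
from collections import Counter

def term_weights(text, stopwords=frozenset(), stem=lambda t: t):
    """Raw term-frequency vector w_kp; stopwords removed, terms conflated."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(stem(t) for t in tokens if t not in stopwords)

def sigma_c(w1, w2):
    """Cosine similarity between two term-frequency vectors (Equation 1)."""
    dot = sum(w1[k] * w2[k] for k in w1.keys() & w2.keys())
    norm = math.sqrt(sum(v * v for v in w1.values()) *
                     sum(v * v for v in w2.values()))
    return dot / norm if norm else 0.0
```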

2.2 Link similarity

Let us define a link similarity

\sigma_l(p_1, p_2) = \frac{|U_{p_1} \cap U_{p_2}|}{|U_{p_1} \cup U_{p_2}|}    (2)

where Up is the set containing the URLs of p's outlinks, of p's inlinks, and of p itself. The outlinks are obtained from the pages themselves, while a set of at most 20 inlinks to each page in the sample is obtained by submitting a link query with the page URL to a search engine (we used the Google Web API, http://www.google.com/apis/). Link similarity is really a neighborhood function, measuring the local clustering between two pages: a high value of σl indicates that the two pages belong to a clique of pages. Related measures are often used in link analysis to identify a community around a topic. This measure includes co-citation [40] and co-reference (or bibliographic coupling [23]) as special cases, but it also counts directed paths of length ℓ ≤ 2 links. Such directed paths are important because they could be navigated by a user or crawler. For each pair of pages in our sample, σl ∈ [0, 1] is readily computed from their link sets.

One could of course explore alternative link similarity measures. For example, using only outlinks in Up would yield σl = 1 between mirrored pages, which is intuitively correct (in this case σl is analogous to co-reference). However, mirrors are not present in the sample studied here, and on the other hand co-citation and direct links are intuitively important to capture the semantic clues that authors encode in links.
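A corresponding sketch of Equation 2; assembling the neighborhood sets from the crawled outlinks and the (at most 20) inlinks returned by the search engine is assumed to happen elsewhere.

```python
def neighborhood(url, outlinks, inlinks, max_inlinks=20):
    """U_p: the page's own URL, its outlink URLs, and up to 20 inlink URLs.
    `inlinks` is assumed to be a list of URLs returned by a link query."""
    return {url} | set(outlinks) | set(inlinks[:max_inlinks])

def sigma_l(u1, u2):
    """Jaccard link similarity between two URL neighborhoods (Equation 2)."""
    union = u1 | u2
    return len(u1 & u2) / len(union) if union else 0.0
```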

2.3 Semantic similarity

Let us define a semantic similarity

\sigma_s(p_1, p_2) = \frac{2 \log \Pr[t_0(p_1, p_2)]}{\log \Pr[t(p_1)] + \log \Pr[t(p_2)]}    (3)

where t(p) is the topic containing p in the ODP, t0 is the lowest common ancestor of p1 and p2 in the ODP tree, and Pr[t] represents the prior probability that any page is classified under topic t. In practice we compute Pr[t] offline for every topic t in the ODP by counting the fraction of pages stored in the subtree rooted at node t, out of all the pages in the tree (that is, all 896,233 unfiltered pages, not just the 150,000 sampled pages). The path from the root to t0 is a measure of the meaning shared between the two topics, and therefore of what relates the two pages. Conversely, the paths between t0 and the two page topics are a measure of what distinguishes the meanings of the two pages. This semantic similarity measure clearly relies on the existence of a hierarchical organization that classifies all of the pages being considered. It is a straightforward extension of the information-theoretic similarity measure [28], designed to compensate for the fact that the tree can be unbalanced, both in terms of its topology and in terms of the relative size of its nodes. For a perfectly balanced tree, σs corresponds to the familiar tree distance measure. One could of course explore semantic similarity measures based on alternative Web directories or ontologies. Sampling pages from the ODP guarantees that semantic information for each page is available and reliable — it is assessed by human editors rather than estimated by automatic content or link analysis methods.

There is a caveat. The ODP ontology is more complex than a simple tree. Some categories have multiple criteria to classify subcategories. For example, the "Business" category is subdivided by types of organizations (cooperatives, small businesses, major companies, etc.) as well as by areas (automotive, health care, telecom, etc.). Furthermore, the ODP has various types of cross-reference links between categories, so that a node may have multiple parent nodes, and even cycles are present. Finally, the structure is constantly being updated (see http://rdf.dmoz.org/rdf/tags.html for the RDF tags used in the latest ODP ontology). Therefore, one could extract many parallel tree structures from the ODP ontology at any given time. Unfortunately, while semantic similarity measures based on trees are well studied [17], the design of well-founded similarity measures for objects stored in the nodes of arbitrary graphs is an open problem. A few empirical measures have been proposed, for example based on minimum cut/maximum flow algorithms [29], but no information-theoretic measure is known. For the present study we have simplified the analysis by eliminating all cross-reference links, thus reducing the directory to a single, prototypical tree. For each pair of pages in our sample, σs ∈ [0, 1] is readily computed by looking up the topics of the two pages in such a tree.

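A minimal sketch of Equation 3, assuming the priors Pr[t] have been precomputed offline and that `parent` maps each topic to its parent (with the root mapping to None); both names are hypothetical, and pages are assumed to be classified below the root so the denominator is nonzero.

```python
import math

def sigma_s(t1, t2, prior, parent):
    """Information-theoretic similarity on a topic tree (Equation 3).
    prior[t] = Pr[t] is the fraction of all directory pages filed in the
    subtree rooted at t (prior[root] = 1, so sigma_s = 0 whenever the
    lowest common ancestor is the root)."""
    def lineage(t):
        path = []
        while t is not None:
            path.append(t)
            t = parent[t]  # root maps to None
        return path
    ancestors2 = set(lineage(t2))
    t0 = next(t for t in lineage(t1) if t in ancestors2)  # lowest common ancestor
    return 2 * math.log(prior[t0]) / (math.log(prior[t1]) + math.log(prior[t2]))
```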

Figure 1: Histograms plotting the distributions of content, link, and semantic similarity (number of pairs, on a logarithmic scale, versus similarity value; one curve each for σc, σl, and σs).


3. CORRELATIONS AND DISTRIBUTIONS

Of all the pairs of pages in the ODP sample, only those with all three similarity measures well defined are considered valid. For example, a pair is discarded if a page times out, making σc and σl undefined, or if the inlinks of a page are not available, making σl undefined. There are over 3.8 × 10^9 valid pairs. For each of the three similarity metrics, the unit interval is divided into 100 bins. For each of the resulting 10^6 bins, the number of pairs with values corresponding to the bin's (σc, σl, σs) tuple is counted. From this information a number of interesting statistics and visual maps can be derived.

Let us start with some simple numbers. The Pearson's correlation coefficients between pairs of similarity metrics are ρ(σc, σl) = 0.10, ρ(σc, σs) = 0.11, and ρ(σl, σs) = 0.08. Interestingly, these correlations are not as strong as one might have predicted. However, given the enormous number of pairs, they are highly significant positive correlations. Incidentally, the latter two numbers quantify the validity of the cluster and link-cluster hypotheses.

Figure 1 shows the distribution of each individual similarity metric. All three metrics appear to have a roughly exponential distribution, with a striking fit in the case of content similarity. Semantic similarity values in the range 0.01 ≤ σs ≤ 0.13 are not represented because of the limited depth of the ODP tree. These distributions tell us that most pairs tend to have very small values for all similarity measures, with each distribution peaked at zero. This is not surprising; given two random pages, we do not expect them to be lexically similar, closely clustered, or semantically related. The very small number of pairs with high similarity values is the main reason for the low correlation coefficients.
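A sketch of the binning step described above: each valid pair contributes one count to a 100 × 100 × 100 histogram indexed by its (σc, σl, σs) values. The function name and numpy dependency are assumptions.

```python
import numpy as np

def bin_similarities(pairs, bins=100):
    """Count pairs in a bins^3 histogram over (sigma_c, sigma_l, sigma_s).
    `pairs` yields (sc, sl, ss) tuples, each value in [0, 1]."""
    counts = np.zeros((bins, bins, bins), dtype=np.int64)
    for sc, sl, ss in pairs:
        # clamp 1.0 into the last bin
        i, j, k = (min(int(v * bins), bins - 1) for v in (sc, sl, ss))
        counts[i, j, k] += 1
    return counts
```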

4. JOINT DISTRIBUTION MAPS

Figure 2: Joint distribution maps for pairs of similarity metrics (σl versus σc, σs versus σc, and σs versus σl). Colors represent the number of pairs in each bin, on a logarithmic scale from black (10^0 pairs) to yellow (10^4 pairs and above); white represents missing data (no pairs).

A much richer picture of the similarity distributions and their relationships can be obtained by considering the joint distributions, plotting 2-D histograms as maps. For each pair of similarity metrics, Figure 2 maps the joint distribution of those two metrics. The first observation is that, predictably given the individual distributions, most pairs are near the origin (σc = σl = σs = 0). However, there is some correlation structure between σc and σl that is not accounted for simply by the individual distributions. Specifically, for σl > 0 the peak density of content similarity occurs at σc > 0. For example, the peak is around 0.4 ≲ σc ≲ 0.6 at σl ≈ 0.6. Further, the local peak around σc ≈ σl ≈ 0.9 would not be predictable if the background distributions were independent, given the rarity of those values. The map therefore sheds light on the positive correlation between σc and σl: pages that are similar in content do tend to be clustered in link space. This result confirms our previous measurements [33]; however, it also demonstrates that the correlation is weak. The correlations between σc and σs and between σl and σs are not as obvious from the other joint distribution maps in Figure 2. These maps on one hand refine the cluster and link-cluster hypotheses, but on the other hand demonstrate that the correlations are difficult to detect given the background similarity distributions.
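As a sketch of how one such panel can be rendered from the binned counts above (matplotlib assumed; the σs axis is summed out for the σl-versus-σc panel, and empty bins are masked so they appear white):

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

def joint_map(counts, axis=2, xlabel="$\\sigma_c$", ylabel="$\\sigma_l$"):
    """Plot a 2-D joint distribution by marginalizing one similarity axis."""
    hist2d = counts.sum(axis=axis)          # e.g. sum over the sigma_s bins
    masked = np.ma.masked_equal(hist2d, 0)  # masked bins = no pairs
    plt.imshow(masked.T, origin="lower", extent=(0, 1, 0, 1),
               norm=LogNorm(vmin=1, vmax=1e4), cmap="hot", aspect="auto")
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.colorbar(label="pairs")
    plt.show()
```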

5. COMBINING CONTENT AND LINK SIMILARITY

In information retrieval the effectiveness of a document ranking system can be assessed, if the relevant set is known, using the standard precision and recall measures and the tool of precision-recall plots. While it would be extremely interesting to evaluate how effectively Web pages could be ranked based on, say, content or link similarity, this is generally impossible because relevant sets are unknown in the Web. However, one can think of using a page as a query (as in "query by example" retrieval systems) and ranking all other pages, then using semantic similarity data to assess the ranking. Our data supports this approach, assuming the above procedure is repeated with every page used as an example. Let us define functional precision and recall as follows:

P(f, \beta) = \frac{\sum_{p,q\,:\,f(\sigma_c(p,q),\,\sigma_l(p,q)) \geq \beta} \sigma_s(p, q)}{\left|\{p, q : f(\sigma_c(p,q), \sigma_l(p,q)) \geq \beta\}\right|}    (4)

R(f, \beta) = \frac{\sum_{p,q\,:\,f(\sigma_c(p,q),\,\sigma_l(p,q)) \geq \beta} \sigma_s(p, q)}{\sum_{p,q} \sigma_s(p, q)}    (5)

where f is some function of σc and σl that expresses how content and link similarity are to be combined in order to estimate semantic similarity. Pairs are then ranked by f. The threshold β is used as an independent rank parameter, which allows us to compute P and R for an increasing number of pairs (as β decreases). The simplest way to combine content and link similarity is a linear combination, as expressed by the function f(σc, σl) = ασl + (1 − α)σc. The intersection of the f(σc, σl) plane with the (σc, σl) plane is a line with equation ασl + (1 − α)σc = β, where α determines the slope and β the intercept. The special cases α = 0 and α = 1 correspond to pure content-based and link-based ranking, respectively. Figure 3 illustrates the case α = 0.5 with 10 lines corresponding to various values of β. Content and link similarity can also be combined in non-linear ways. Here we consider three non-linear functions, also illustrated in Figure 3: the product f(σc, σl) = σcσl, the maximum f(σc, σl) = max(σc, σl), and a strategy whereby we first filter based on content similarity (keeping pairs (p, q) such that σc(p, q) > 0.5) and then rank by link similarity. Such a "link-after-content-filtering" strategy is expressed by

f(\sigma_c, \sigma_l) = \begin{cases} \sigma_l & \text{if } \sigma_c > 0.5 \\ 0 & \text{otherwise.} \end{cases}

Figure 3: Illustrations of four combinations of σc and σl. Clockwise from top left: linear, product, max, and link-after-content-filtering. The contour lines correspond to various β thresholds.

Figure 4: Functional precision-recall plots for rankings based on various linear combinations of content and link similarity (curves: best, σl, 2/3 σl + 1/3 σc, 1/2 σl + 1/2 σc, 1/3 σl + 2/3 σc, σc; recall on a logarithmic scale).

The functional precision-recall plots in Figure 4 are based on linear combinations of link and content similarity for various values of the slope α. Since most of the σs "mass" occurs near the origin (σc = σl = 0) and the most interesting region is that of low recall (the "top hits"), recall is better visualized on a logarithmic scale. The plots based on just one cue (α = 0 and α = 1) show that ranking by link similarity produces better precision at low recall levels, while ranking by content similarity produces better precision at high recall levels. This is consistent with the use of link analysis in ranking by search engines, since most users only look at a few hits and perceive precision as a more immediate performance indicator than recall. However, combinations of content and link similarity yield better compromise rankings. In particular, using any amount of content information in addition to link analysis improves precision at both low recall levels (where link information alone is noisy) and high recall levels (where link information alone is useless). This demonstrates that if a search engine could efficiently rank pages based not only on query-independent link analysis but also on content analysis (or on content-dependent link analysis [3]), it could achieve significantly higher precision at low recall, i.e., more relevant hits on the first page.

Figure 5: Functional precision-recall plots for rankings based on various non-linear combinations of content and link similarity (curves: best, σlσc, max(σl, σc), σl if σc > 0.5).

The functional precision-recall plots in Figure 5 are based on the three non-linear functions of link and content similarity described above. The max function is not competitive, while the product function performs about as well as the best linear combination (α = 2/3). The "filtering" strategy loses some pairs and therefore performs more poorly at high recall, but for this price it achieves the best precision at low recall. This strategy is vaguely reminiscent of the approach used by search engines like Google, where pages are first filtered based on whether they contain the query terms, and then ranked by some combination of factors that include content-independent link analysis, namely PageRank. For comparison, Figures 4 and 5 also show the best performance achievable. This is obtained by ranking (σc, σl) bins by their localized precision (defined in Section 6). It can be seen that at any given recall level the best combination of content and link similarity does quite well compared to the optimal strategy. However, which strategy is best depends on the recall level. At high recall, text similarity is important; around R ≈ 0.01 pure link similarity is almost optimal; and at low recall the best strategies combine content and link similarity.

One last observation on the functional precision-recall plots in Figures 4 and 5 is that these curves are not monotonically decreasing. This is due to the fact that many pairs ranked highest by content and link similarity are not semantically related. There is significant noise. As more pairs are considered, precision may increase before it decreases. This is not unusual for the Web, where interpolation is often used to obtain monotonically decreasing precision-recall curves (cf. [7], p. 55). Interpolation is not used here, to highlight the noisy nature of similarity cues.
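The functional precision-recall curves can be computed directly from the pairwise similarity arrays; a minimal sketch, assuming `sc`, `sl`, and `ss` are aligned numpy arrays and `f` is any combination function such as a linear blend:

```python
import numpy as np

def functional_pr(sc, sl, ss, f, betas):
    """Functional precision and recall (Equations 4 and 5) for combination f.
    sc, sl, ss: aligned 1-D arrays of pairwise similarities; betas: thresholds."""
    scores = f(sc, sl)
    s_tot = ss.sum()
    curve = []
    for beta in betas:
        kept = scores >= beta
        n = int(kept.sum())
        if n:
            s = ss[kept].sum()
            curve.append((s / s_tot, s / n))  # (recall R, precision P)
    return curve

# Example: the alpha = 2/3 linear combination, sweeping beta from high to low.
# curve = functional_pr(sc, sl, ss, lambda c, l: (2/3) * l + (1/3) * c,
#                       np.linspace(1.0, 0.0, 101))
```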

6. SEMANTIC MAPS

To visualize how accurately semantic similarity can be approximated from content and link cues, we need to map the σs landscape as a function of σc and σl . There are two different types of information about σs that can be mapped for any given (σc , σl ) coordinates. Averaging within a bin, i.e. normalizing local semantic similarity by the number of pairs in each bin, highlights the expected values of σs and is akin to precision. Summing within a bin, i.e. normalizing local semantic similarity by the total number of pairs, captures the relative mass of semantically similar pairs and is akin to recall. Let us therefore define localized precision and recall as follows:

P(s_c, s_l) = \frac{S(s_c, s_l)}{N(s_c, s_l)}    (6)

R(s_c, s_l) = \frac{S(s_c, s_l)}{S_{tot}}    (7)

where

S(s_c, s_l) = \sum_{p,q\,:\,\sigma_c(p,q)=s_c,\,\sigma_l(p,q)=s_l} \sigma_s(p, q)    (8)

N(s_c, s_l) = \left|\{p, q : \sigma_c(p,q) = s_c, \sigma_l(p,q) = s_l\}\right|    (9)

S_{tot} = \sum_{p,q} \sigma_s(p, q)    (10)

and (s_c, s_l) is a coordinate value pair for (σc, σl).
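In code, Equations 6 and 7 reduce to elementwise operations on the binned totals; a minimal sketch, assuming `S_sum` and `N` are 100 × 100 numpy arrays accumulated per (σc, σl) bin (both names hypothetical):

```python
import numpy as np

def localized_maps(S_sum, N):
    """Localized precision and recall maps (Equations 6 and 7).
    S_sum[i, j] = S(s_c, s_l): summed semantic similarity in the bin;
    N[i, j] = N(s_c, s_l): number of pairs in the bin."""
    P = np.where(N > 0, S_sum / np.maximum(N, 1), np.nan)  # NaN = missing data
    R = S_sum / S_sum.sum()                                # normalized by S_tot
    return P, R
```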

Figure 6 maps localized R and P as functions of content and link similarity coordinates.

Figure 6: Semantic maps of recall (top) and precision (bottom) for all pairs of sample Web pages. For readability, recall is visualized on a logarithmic scale between 10^-8 and 10^-3 or above. Note that, given the very different distributions of P and R, it is not useful to combine them into a single F-measure [42]: R values span 8 orders of magnitude in the unit interval, therefore they dominate the product P · R in F, and the F-measure map looks almost identical to the recall map.

These semantic maps provide a much richer and more complex account of the information about meaning that can be inferred from text and link cues. A number of observations are in order. Note first that in general the recall map is qualitatively similar to the distribution maps (cf. Figure 2), because the distributions of content and link similarity are so heavily skewed toward the origin. Since the majority of pairs occur near the origin, the same holds for most of the semantically related pairs, as shown by the high recall. However, all this relevant mass is washed away in a sea of unrelated pairs, so precision near the origin is negligible. This creates a serious challenge for search engines: achieving high recall costs dearly in terms of precision and leads to user frustration. While an emphasis on precision is a very reasonable approach for a search engine, it must be stressed that the cost of this choice in terms of recall is very high in the Web. An analogous, but much weaker, effect is observable for very high content and link similarity (σc ≈ σl ≈ 0.9). Here there is a moderate peak in recall which does not correspond to a peak in precision: there are a few highly similar and highly clustered pages here, but they are not necessarily highly related. This cluster is explained in Section 7.

Focusing on the precision map, for very high content similarity (σc ≳ 0.8) there is significant noise, making it difficult to get a clear signal from link cues. There are many relevant pages in this region, but they cannot be identified from link analysis. Maybe these are cases where authors do not know of related pages, or do not want to point to the competition. There are also many pairs in this region that are not semantically related, so very high content similarity is not a reliable signal. For medium-high content similarity (0.4 ≲ σc ≲ 0.8), we observe a surprising basin of low precision for low link similarity (0.1 ≲ σl ≲ 0.4). Such an inversion could indicate that in this region a few common links may be misleading — think of all the unrelated pages linking to popular portals and software download sites. An alternative explanation originates from recalling that our semantic similarity measure is based on a particular tree view of the ODP ontology. It is possible that the basin corresponds to pairs of pages that are in reality more semantically related than our measure reveals. For example, if one page about email marketing is classified under the "Computers" branch while another is under the "Business" branch, σs = 0 even though the two are semantically related. Thus such a basin may be symptomatic of the limitations of any particular tree ontology. In the same medium-high content similarity range, precision reaches many very high peaks for maximum link similarity (σl ≳ 0.7). Such peaks correspond to text and link similarity values for which there are very few pairs, but those few correspond to highly related pages. This helps understand the good precision at low recall achieved by the "filtering" strategy in Section 5: such a strategy mines the high precision region effectively.


7. HETEROGENEITY ACROSS TOPICS

Similarity data has thus far been visualized with all pairs grouped together, irrespective of what the pages are about. The resulting snapshots cannot tease apart the fundamental differences between the ways in which authors create and link content in different topical areas. To analyze the heterogeneity of the semantic maps and the utility of the results outlined above, I repeated the brute-force analysis for each of the 15 top-level topics under which the pages are classified in the ODP. This time a pair of pages is considered only if both pages belong to the same topic. The topics and some summary statistics are shown in Table 1.

Table 1: Top-level topics from the Open Directory and summary statistics from pairwise similarity analysis. Pearson's correlation coefficients above the all-pairs values (shown in the last row for reference) are in bold.

Topic         Pairs       ρ(σc, σl)  ρ(σc, σs)  ρ(σl, σs)
Adult         2.5 × 10^6    0.15       0.11       0.07
Arts          2.1 × 10^7    0.06       0.11       0.09
Business      0.6 × 10^7    0.06       0.08       0.13
Computers     2.3 × 10^7    0.08       0.12       0.22
Games         1.4 × 10^7    0.12       0.04       0.05
Health        2.3 × 10^7    0.10       0.16       0.07
Home          2.2 × 10^7    0.50       0.35       0.16
Kids & Teens  2.0 × 10^7    0.13       0.16       0.08
News          3.2 × 10^6    0.39       0.30       0.48
Recreation    1.6 × 10^7    0.08       0.16       0.11
Reference     2.5 × 10^7    0.16       0.19       0.27
Science       2.3 × 10^7    0.11       0.14       0.05
Shopping      1.9 × 10^7    0.06       0.11       0.14
Society       1.7 × 10^7    0.08       0.14       0.07
Sports        1.9 × 10^7    0.14       0.19       0.09
All-Pairs     3.8 × 10^9    0.10       0.11       0.08

All Pearson's correlation coefficients are significantly positive, although most are small. A few interesting exceptions are in the "Home," "News," and "Reference" categories. For example, "Home" has the highest correlation between content and link similarity and between content and semantic similarity; the majority of these sites are about recipes — the words used in recipe pages are very good clues about the meaning of the pages! It is also reassuring to confirm that news sites choose words and place links consistently with the meaning of pages. And it is no surprise that "Reference" pages do a good job linking to related sites.

Figure 7 shows the localized recall maps for the 15 top-level ODP topics, while Figure 8 shows the corresponding localized precision maps.

Figure 7: Semantic recall maps for each top-level topic. Recall is visualized on a log color scale between 10^-7 and 10^-1.

Figure 8: Semantic precision maps for each top-level topic.

It is immediately apparent that there is indeed a significant level of heterogeneity in these maps, qualitatively as well as quantitatively. Looking at the semantic recall maps, the topics that display higher correlation coefficients are those for which non-zero recall values extend further away from the origin, toward high content and link similarity. One exception is the lonely peak in the σc ≈ σl ≈ 1 corner of the "Adult" recall map.


This is explained as a clique of spammer sites [6], so called because they are designed to fool the ranking algorithms of search engines by boosting link and content analysis. But in topics such as "Home," "News," and "Reference" it is clear that semantic similarity is correlated with both content and link similarity, and therefore text and links are informative cues about the meaning of pages. While for all topics most of the semantically related pairs occur near the origin, there are significant local optima and ridges that extend away from the origin for the above-mentioned topics as well as for "Computers." However, the recall topology is different for each topic. No topic displays the negative basin of the general recall map obtained using all page pairs. This is consistent with the hypothesis that the basin is due to cross-topic (σs = 0) pairs.

Semantic precision maps display even greater heterogeneity, and differ significantly from the general precision map. With the exception of "Adult," all topics have visible regions of high precision (shown in yellow) with various sizes, shapes, and locations. A couple of topics have large, well localized high-precision regions: for "Computers" this is found at low σc and high σl, while for "News" it is at high σc and high σl. These observations are quite simplistic, but they are meant to highlight the diverse semantic inferences that can be drawn from text and link cues depending on the topical context of a search.

To analyze how different combinations of content and link similarity would perform in ranking pages within different topical contexts, the analysis of Section 5 was repeated for some of the combination functions and for each top-level topic. The resulting functional precision-recall plots are shown in Figure 9. These plots confirm that some topics are more amenable to semantic approximation based on content and link analysis. As expected, the topics with the highest correlation between content, link, and semantic similarity (e.g., "Home" and "News") are intrinsically easier, in that the optimal precision-recall curves are very high. These, along with some other topics (e.g., "Computers," "Reference," and "Shopping"), are also those for which content and link analysis best approximate the best achievable performance. It is noteworthy that different semantic cues are most helpful in different contexts. For example, in the "Computers" and "Home" topics, link similarity is the best cue for maximizing precision at low recall. In other topics (e.g., "News") combinations of content and link similarity perform best. And in some cases, such as "Games" and "Sports," all functions perform equally poorly.

Figure 9: Select functional precision-recall plots for each top-level topic.

8. SEMI-SUPERVISED EVALUATION OF SEARCH ENGINES

Semantic maps give us a tool to estimate the relatedness between any two pages. If we know what one page is about, we can estimate the degree to which another arbitrary page is about the same topic. This can be done by automatic measurements of similarity based on locally observable content/link cues. Extending the idea, if we know one relevant page for a query, we can estimate the relevance of any given set of pages. This suggests a straightforward application of semantic maps to the evaluation of search engines.

In machine learning, semi-supervised methods are techniques for extracting knowledge from data when only a very small fraction of the data is labeled so as to provide examples to the learning system [4]. Typically in supervised learning a sufficient number of labeled examples is available for training. When a training set is not available, a few examples can be labeled by hand and then more examples can be labeled automatically by a bootstrapping process. The resulting (approximated) training set is used in a supervised setting. Here we liberally borrow the semi-supervised metaphor and apply it to an evaluation task rather than a learning task.

As mentioned above, evaluation of retrieval systems requires knowledge of relevant sets, which are unavailable on the Web. The traditional approach is to have users manually assess the relevance of all documents in a collection. This is difficult due to the Web's size and dynamic nature. However, even if it were feasible to assess the relevance of all pages indexed by a search engine, recall still could not be measured because of the many relevant pages potentially unknown to the search engine due to its limited coverage [27]. Using the semi-supervised approach, suppose a single highly relevant page r is known for a given query. Such a page can be identified by hand at relatively low cost. Through the semantic maps, we can imagine bootstrapping a virtual relevant set made of pages highly related to r. The idea is to estimate the semantic similarity between the retrieved (hit) set and such a virtual relevant set, from measures of content and link similarity. Finally, we can approximate the precision and recall of the entire hit set.

8.1 Evaluation algorithm

For a query q, identify a highly relevant page rq. Then consider each page p in the hit set Hq obtained by a search engine in response to q. Let sc(p, q) = σc(p, rq) (from Equation 1) and sl(p, q) = σl(p, rq) (from Equation 2). To estimate the precision and recall of a hit set we can use the localized semantic similarity measures computed from our Web sample to build the precision and recall maps (cf. Equations 6 and 7). Formally, let H_q^m = {p_1 · · · p_m} ⊂ H_q be the set of top m hits in H_q. Then estimate the set precision and recall for H_q^m as follows:

P(H_q^m) \approx \frac{1}{m} \sum_{i=1}^{m} \frac{S(s_c(p_i, q), s_l(p_i, q))}{N(s_c(p_i, q), s_l(p_i, q))}    (11)

R(H_q^m) \approx \frac{\sum_{i=1}^{m} S(s_c(p_i, q), s_l(p_i, q))}{S_{tot}}    (12)

where S(·), N(·), and S_tot are defined in Equations 8, 9, and 10, respectively. Finally, averaging over a set of queries Q:

P(m) = \frac{1}{|Q|} \sum_{q \in Q} P(H_q^m)    (13)

R(m) = \frac{1}{|Q|} \sum_{q \in Q} R(H_q^m)    (14)

In other words, the relevance of every hit is assessed by estimating its semantic similarity to the known relevant page. This estimation is done by measuring the content and link similarity between the hit and the relevant page. These measures are used as coordinates into the general precision and recall maps. The precision and recall levels thus computed, P(m) and R(m), can be plotted against each other using m as an independent rank parameter.

The semantic maps used in this evaluation are based on the assessments made by the experts who classify pages in the ODP, and as such they are as close as we can get to relevance assessments without conducting user studies. However, I am not arguing that the semi-supervised evaluation method is as accurate as what could be obtained from user assessments; it is merely a methodology that can be used when evaluation is needed but user studies are infeasible.

8.2 Comparison of three search engines

Table 2: TREC queries used in the evaluation of the three search engines, with URLs of relevant homepages.

Query (q)                                   Homepage (rq)
Linux Documentation Project                 www.tldp.org/
DogHouse Technologies, Inc.                 www.dogtech.com/
Victoria, Australia                         www.vicnet.net.au/
New England Journal of Medicine             content.nejm.org/
CNET                                        www.cnet.com/
Kennedy Space Center                        www.ksc.nasa.gov/
Australian Democrats                        www.democrats.org.au/
Sheraton Hotels Latin America               www.geographia.com/sheraton/
Haas Business School                        www.haas.berkeley.edu/
Manly, Australia                            www.manlyweb.com.au/
Harvard Graduate Student Council            hcs.harvard.edu/~gsc/
Texas Department of Human Services          www.dhs.state.tx.us/
Cable Wireless, Inc.                        www.cw.com/
Brent Council                               www.brent.gov.uk/
Solaris certification                       suned.sun.com/US/certification/solaris/
Ada information clearinghouse               www.adaic.org/
Lockwood Memorial Library                   ublib.buffalo.edu/libraries/units/lml/
Maine Office of Tourism                     www.visitmaine.com/home.php
HKUST Computer Science Dept.                www.cs.ust.hk/
Wah Yew Hotel                               www.fastnet.com.au/hotels/zone4/my/...
NCREL                                       www.ncrel.org/
Ohio Public Library Information Network     www.oplin.lib.oh.us/
SUNY Buffalo                                www.buffalo.edu/
Sarah J. Johnson United Methodist Church    hwmin.gbgm-umc.org/sarahjjohnsonny/
Chicago NAP                                 www.aads.net/main.html
Graduate Theology Union                     www.gtu.edu/
Beachwood Motel, Orchard Beach, Maine       www.beachwood-motel.com/
Chemistry, Leeds University                 www.chem.leeds.ac.uk/
Donald W. Shaw Real Estate                  www.donaldshaw.com/
Douglas Barclay Law Library                 www.law.syr.edu/lawlibrary/lawlibrary.asp

To test the proposed methodology, I applied it to the evaluation of three large commercial search engines: Google, Teoma, and MSN. The choice was due to the fact that Google was accessible through the Google Web API and the other two were accessible through HTTP agents in compliance with the Robot Exclusion Standard. Most other commercial search engines disallow access by agents. A set Q of 30 queries was randomly chosen among the TREC (http://trec.nist.gov) homepage finding topics. These are realistic Web queries because they are short, and they are also selected in such a way as to guarantee that there exists at least one Web "homepage" that is ideally relevant for each query. Such relevant homepages were manually identified for each of the 30 queries, as shown in Table 2. Each of the 30 queries was fed to each of the search engines, and the top 10 hits returned by each search engine were stored. For each search engine and query, the content and link similarity coordinates (sc, sl) of each hit were computed by comparing the hit with the query's relevant homepage according to Equations 1 and 2. For link similarity, 10 inlinks were obtained for each hit using the Google Web API. To illustrate this step, Figure 10 plots some of the top hits in similarity space.

Figure 10: Content and link similarity coordinates of the top three hits returned by the three search engines for each of the 30 TREC queries. The points in the top right-hand corner correspond to the cases in which the relevant homepage is ranked among the top three hits.

For each search engine, query, and level m (1 ≤ m ≤ 10), the similarity coordinates of the top m hits were then mapped into semantic similarity values to obtain the estimates of query precision and recall according to Equations 11 and 12. Finally, these values were averaged across the |Q| = 30 queries to obtain the mean precision and recall estimates according to Equations 13 and 14.

Figure 11: Precision-recall plots for the three search engines, based on semi-supervised evaluation.

The resulting P(m) and R(m) were used to draw the precision-recall plots in Figure 11. The precision-recall plot for Google is consistent with our intuition about this search engine's emphasis on precision. The PageRank score allows Google to achieve very high precision at low recall levels. This seems to be a winning strategy, since few users search past the few highest-ranked hits. While Google's precision decreases rapidly with increasing recall, it does not become as low as that of the other two search engines we evaluated. However, Google's recall remains low compared to the other two search engines. Teoma's performance is complementary to Google's. At low recall its precision is lower, but eventually Teoma achieves significantly higher recall — over twice the maximum recall obtained by Google or MSN. Like Teoma, MSN is outperformed by Google's precision at low recall, and eventually it achieves higher recall than Google. However, MSN is consistently outperformed by Teoma at every precision/recall level.


Precision and recall can be combined into the F-measure [42]:

F(m) = 2\, \frac{P(m)\, R(m)}{P(m) + R(m)}    (15)

Figure 12: F-measure versus rank parameter m for the three search engines, based on semi-supervised evaluation.

Figure 12 plots F(m) for the three search engines. We see that for m < 10 Teoma dominates based on this metric. The low recall achieved by Google hurts its performance from this perspective. However, because Google maintains high precision, its F-measure surpasses that of the other two search engines at m = 10.

9. DISCUSSION

Understanding how semantic information can be mined from the content and links of Web pages is key to advancing search technology. This paper reported on what is, to the best of my knowledge, the first large-scale, brute-force effort to map semantic association in a topological space generated from content and link based metrics. A massive amount of data collected from billions of pairs of Web pages has been used to build a number of semantic maps that visualize the relationship between content and link cues and the inferences about page meaning that they allow. While this paper focuses on retrieval and ranking, the results presented here, and the semantic map as a general visualization and analysis tool, can also provide useful insight for crawling and clustering applications.

The use of semantic maps to estimate relevance, and thus evaluate search engines, is a natural application of this compiled knowledge. The semi-supervised evaluation procedure can be performed with only minimal effort from human users, who need to identify only a few relevant pages. While I have shown how to apply the methodology with a single labeled page, the evaluation accuracy should of course improve when more than one page is labeled by users as relevant. If the context of the search is known, the evaluation method can be specialized by using topical semantic maps. The idea of restricting computation to sets of pages within topics has also been proposed for metrics such as PageRank [21]. The obvious advantage of the proposed evaluation scheme is that it

is cheap and easy to implement once the semantic maps are compiled. Further variations can be explored by compiling semantic maps based on similarity measures different from the ones described here, e.g., semantic similarity derived by integrating alternative ontological techniques [20]. Another possible application of semantic maps is as "topical signatures." Link analysis is effective at identifying communities of related pages [19, 26, 15], but labeling such communities automatically has proved difficult. Given a set of related pages, a semantic map could easily be constructed and then matched against a library of topic signature maps to label the entire set.

This paper has only begun to catalog the Web regularities that can be discovered with the semantic map approach. One can obtain different maps by considering different definitions of content, link, or semantic similarity, different hierarchical classifications or more general ontologies, or different cues altogether. The most urgent direction for future research is to extend the semantic similarity measure from purely hierarchical directories (trees) to the general case of graphs. This would allow for the use of additional semantic information encoded in all the ODP links, including cross-references and "see also" (related) links. We are currently working to extend the information-theoretic measure of Equation 3 to general graphs. This will enable us to repeat the analysis in this paper with a more realistic and robust model of semantic relatedness between pages.

A number of normative conclusions can be drawn from the results presented here. The general semantic maps provide us with some guidelines for mining the Web's link and text features (cf. Section 6). Consider the precision map in Figure 6. Identifying semantically related pages (pairs, to be precise) with high precision, say in the context of a ranking algorithm, is a non-trivial optimization problem with many local optima — regions of high precision surrounded by regions of low precision. Traditional filtering techniques are inappropriate; consider the "filtering" strategy discussed in Section 5. This strategy effectively covers the hot region, but it also covers the lower precision region for very high σc (returning poor hits) and improperly assigns low rank to the relevant pages scattered in the high σc region with low σl. While this may currently be close to the best strategy for a search engine to achieve high precision at low recall, it is far from optimal. Could we do better by combining σc and σl in other ways? Our results suggest that the answer depends on at least two important context factors. First, based on the functional precision-recall plots of Figures 4 and 5, one must determine the recall level appropriate for the users of a particular search engine. While any combination results in both false positives and false negatives because of the many local optima, different combinations of content and link similarity approximate the optimal ranking in different recall ranges. Second, the appropriate ranking strategy depends on the topical context of the search, as illustrated by Figure 9. One contribution of this work toward the design of improved ranking algorithms for search engines is the quantification of the dependence between different types of relevance/semantic cues and different types of search context. Another contribution is the quantification of the intrinsic difficulty of search problems in different topical contexts. The results of Section 7 tell us that it is significantly harder to achieve strong search/clustering performance in topics such as "Games" and "Sports" than in topics such as "News." It should therefore come as no surprise that news is the first topical domain successfully tackled by large search engines, namely Google (http://news.google.com/) and more recently Yahoo! (http://news.yahoo.com/) — the easy problems are attacked first. The research presented here points to our future search challenges and helps us identify targets of opportunity for research in Web information retrieval.

Finally, this empirical study underlines — if it were necessary — the limits of "one-size-fits-all" solutions in Web information retrieval. No single approach will work best in every case, i.e., in every user's information context. While this may appear to be a trivial comment, it is worth remarking on the enormous resources commercial search engines are currently devoting to improving universal search techniques, compared to the little effort devoted to specialized search tools. Partly due to the Internet economy of scale, the search engine business has consolidated into a handful of large companies that dominate the market and compete to widen their share on the same universal search turf rather than focusing on specialized niche domains. This work provides quantitative evidence that even with better content and link analysis algorithms, the performance of universal search engines will remain sub-optimal. On the other hand, the context of individual users could be harnessed to build user-centric models of semantic relationships between pages. For example, users who use bookmark managers to store important URLs have a semantic hierarchy readily available for local analysis and personalization of search results. This is one of the directions that we are beginning to explore within a peer, collaborative, distributed crawling and searching framework [2].

Acknowledgements

I am grateful to S. Chakrabarti, J. Kleinberg, L. Adamic, P. Srinivasan, N. Street, and A. Segre for many helpful comments; to the ODP and TREC for making their data publicly available; and to Google for their permission to use the Web API extensively. This work is funded in part by NSF CAREER Grant No. IIS-0348940.

10. REFERENCES

[1] D. Achlioptas, A. Fiat, A. Karlin, and F. McSherry. Web search via hub synthesis. In Proc. 42nd Annual IEEE Symposium on Foundations of Computer Science, pages 500–509, Silver Spring, MD, 2001. IEEE Computer Society Press.
[2] R. Akavipat, L.-S. Wu, and F. Menczer. Small world peer networks in distributed Web search. In Alt. Track Papers and Posters Proc. 13th International World Wide Web Conference, pages 396–397, 2004.
[3] K. Bharat and M. Henzinger. Improved algorithms for topic distillation in hyperlinked environments. In Proc. 21st ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 104–111, 1998.
[4] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann, 1998.

[5] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks, 30(1–7):107–117, 1998.
[6] A. Broder, S. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the Web. Computer Networks, 33(1–6):309–320, 2000.
[7] S. Chakrabarti. Mining the Web: Discovering knowledge from hypertext data. Morgan Kaufmann, San Francisco, 2003.
[8] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In Proc. ACM SIGMOD Intl. Conf. on Management of Data, 1998.
[9] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. Kleinberg. Automatic resource compilation by analyzing hyperlink structure and associated text. Computer Networks, 30(1–7):65–74, 1998.
[10] S. Chakrabarti, M. Joshi, K. Punera, and D. Pennock. The structure of broad topics on the Web. In D. Lassner, D. De Roure, and A. Iyengar, editors, Proc. 11th International World Wide Web Conference, pages 251–262, New York, NY, 2002. ACM Press.
[11] S. Chakrabarti, K. Punera, and M. Subramanyam. Accelerated focused crawling through online relevance feedback. In D. Lassner, D. De Roure, and A. Iyengar, editors, Proc. 11th International World Wide Web Conference, pages 148–159, New York, NY, 2002. ACM Press.
[12] D. Cohn and T. Hofmann. The missing link — A probabilistic model of document content and hypertext connectivity. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems 13, pages 430–436, Cambridge, MA, 2001. MIT Press.
[13] B. Davison. Topical locality in the Web. In Proc. 23rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 272–279, 2000.
[14] J. Dean and M. Henzinger. Finding related pages in the World Wide Web. Computer Networks, 31(11–16):1467–1479, 1999.
[15] G. Flake, S. Lawrence, C. Giles, and F. Coetzee. Self-organization of the Web and identification of communities. IEEE Computer, 35(3):66–71, 2002.
[16] C. Fox. Lexical analysis and stop lists. In Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.
[17] P. Ganesan, H. Garcia-Molina, and J. Widom. Exploiting hierarchical domain structure to compute similarity. ACM Transactions on Information Systems, 21(1):64–93, 2003.
[18] L. Getoor, E. Segal, B. Taskar, and D. Koller. Probabilistic models of text and link structure for hypertext classification. In Proc. IJCAI Workshop on Text Learning: Beyond Supervision, 2001.
[19] D. Gibson, J. Kleinberg, and P. Raghavan. Inferring Web communities from link topology. In Proc. 9th ACM Conference on Hypertext and Hypermedia, pages 225–234, 1998.
[20] R. Guha, R. McCool, and E. Miller. Semantic search. In Proc. 12th International World Wide Web Conference, pages 700–709, New York, NY, 2003. ACM Press.
[21] T. Haveliwala. Topic-sensitive PageRank. In D. Lassner, D. De Roure, and A. Iyengar, editors, Proc. 11th International World Wide Web Conference, New York, NY, 2002. ACM Press.
[22] T. Haveliwala, A. Gionis, D. Klein, and P. Indyk. Evaluating strategies for similarity search on the Web. In D. Lassner, D. De Roure, and A. Iyengar, editors, Proc. 11th International World Wide Web Conference, New York, NY, 2002. ACM Press.
[23] M. Kessler. Bibliographic coupling between scientific papers. American Documentation, 14:10–25, 1963.
[24] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[25] J. Kleinberg. Small-world phenomena and the dynamics of information. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2002. MIT Press.
[26] S. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for emerging cyber-communities. Computer Networks, 31(11–16):1481–1493, 1999.
[27] S. Lawrence and C. Giles. Accessibility of information on the Web. Nature, 400:107–109, 1999.
[28] D. Lin. An information-theoretic definition of similarity. In J. Shavlik, editor, Proc. 15th Intl. Conference on Machine Learning, pages 296–304, San Francisco, CA, 1998. Morgan Kaufmann.
[29] W. Lu, J. Janssen, E. Milios, and N. Japkowicz. Node similarity in networked information spaces. In Proceedings of the Conference of the IBM Centre for Advanced Studies on Collaborative Research (CASCON'01), page 11. IBM Press, 2001.
[30] F. Menczer. Growing and navigating the small world Web by local content. Proc. Natl. Acad. Sci. USA, 99(22):14014–14019, 2002.
[31] F. Menczer. Complementing search engines with online Web mining agents. Decision Support Systems, 35(2):195–212, 2003.
[32] F. Menczer. The evolution of document networks. Proc. Natl. Acad. Sci. USA, 101:5261–5265, 2004.
[33] F. Menczer. Lexical and semantic clustering by Web links. Journal of the American Society for Information Science and Technology, 2004. Forthcoming.
[34] F. Menczer and R. Belew. Adaptive retrieval agents: Internalizing local context and scaling up to the Web. Machine Learning, 39(2–3):203–242, 2000.
[35] F. Menczer, G. Pant, M. Ruiz, and P. Srinivasan. Evaluating topic-driven Web crawlers. In D. H. Kraft, W. B. Croft, D. J. Harper, and J. Zobel, editors, Proc. 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 241–249, New York, NY, 2001. ACM Press.
[36] F. Menczer, G. Pant, and P. Srinivasan. Topical web crawlers: Evaluating adaptive algorithms. ACM Transactions on Internet Technology, 4(4), 2004. Forthcoming.

[37] A. Mendelzon and D. Rafiei. What do the neighbours think? Computing Web page reputations. IEEE Data Engineering Bulletin, 23(3):9–16, 2000.
[38] M. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[39] G. Salton and M. McGill. An Introduction to Modern Information Retrieval. McGraw-Hill, New York, NY, 1983.
[40] H. Small. Co-citation in the scientific literature: A new measure of the relationship between documents. Journal of the American Society for Information Science, 42:676–684, 1973.
[41] P. Srinivasan, G. Pant, and F. Menczer. A general evaluation framework for topical crawlers. Information Retrieval, 2004. Forthcoming.
[42] C. van Rijsbergen. Information Retrieval. Butterworths, London, 1979. Second edition.