Title Page

A Study of the Structure of the Web Amol Deshpande, Randy Huang, Vijayshankar Raman1 , Tracy Riggs, Dawn Song, Lakshminarayanan Subramanian

University of California, Berkeley

{amol, rhuang, rshankar, tracyr, dawnsong, ...}@cs.berkeley.edu
EECS Computer Science Division
387 Soda Hall #1776, Berkeley, CA 94720-1776
Phone: (510) 642-1863

Fax: (510) 642-5615

Abstract

The World Wide Web is a huge, growing repository of information on a wide range of topics. It is also becoming important, commercially and sociologically, as a place of human interaction within different communities. In this paper we present an experimental study of the structure of the Web. We analyze link topologies of various communities, and patterns of mirroring of content, on 1997 and 1999 snapshots of the Web. Our results give insight into patterns of interaction within communities and how they evolve, as well as patterns of data replication. We also describe the techniques we have developed for performing complex processing on this large data set, and our experiences in doing so. We present new algorithms for finding partial and complete mirrors in URL hierarchies; these are also of independent interest for search and redirection. In order to study and visualize link topologies of different communities, we have developed techniques to compact these large link graphs without much information loss.

1 Contact Author


A Study of the Structure of the Web

1 Introduction

The World Wide Web is a huge, growing repository of information on a wide range of topics. It is also becoming important, commercially and socially, as a place for interaction in many communities. Studying the structure of the Web can give us a better understanding of how information is organized in this medium, and of how people interact in it.

In this paper we present an experimental study of two aspects of the structure of the Web. We analyze link topologies of various communities – well-knit collections of web pages on a topic, e.g., a “web-ring” of Yale alumni. We also study patterns of mirroring of content, on 1997 and 1999 snapshots of the Web.

Previous work on studying communities has focused on first predicting possible community topologies based on stochastic models of the Web link structure, and then enumerating occurrences of these topologies as communities [9, 10]. We take the inverse approach. We first extract web pages for a diverse set of communities by classifying web page content, and then study, through visualization, the topology of the subgraphs induced by these pages on the Web link structure. These topologies, when matched with the nature of the corresponding communities, give insight into how these communities could have evolved, and what the patterns of interaction within them might be.

Our experimental results approximately match, and extend, previous results obtained from analytical models. We find that most communities exhibit a common underlying structure that is a generalized version of the bipartite core postulated in [9]. However, we also find that the particular characteristics of different communities strongly influence their structures. This leads to interesting differences in some aspects of the communities’ topologies.

Besides studying the link topology of different communities, we also study patterns of mirroring of content on the web, circa 1997 and 1999. We investigate mirroring at different levels in the URL hierarchy and across different domains. Information about replication indirectly tells us about the frequency of access to different communities, and how this has changed over time.

Methodology

We start with a tape-resident copy of the Web circa 1997, derived from a crawl by Alexa. It occupies 14 tapes and holds 600GB of data. In addition, we use a copy of the URLs from a Web crawl done at Stanford in June 1999. We scan the tape, simultaneously classifying the web pages into a few selected communities and extracting link graphs for these communities.

[Figure 1: Methodology for the study. URLs from the 1997 Alexa Web snapshot and the 1999 Stanford crawl feed the mirror-finding algorithms; the 1997 snapshot is also run through a classifier to produce community link graphs, which are studied via graph collapsing, visualization, and structural analysis.]

We also extract a list of URLs for use in finding mirrors. We then use an iterative combination of graph collapsing (Section 4), visualization, and structure analysis (Section 5) to study the link graph of each community. Graph collapsing compacts the link graph so as to make visualization tractable; the aim is to do this without losing its important topological features. Collapsing may also help us identify topologies in the link graph by smoothing out minor localized variations and making the graph denser. Structure analysis consists of evaluating important properties of the graph, such as in-degrees and out-degrees or the number of components, or counting occurrences of simple subgraphs such as cliques. We must emphasize that we iterate many times between collapsing, visualization, and structure analysis. Visualizing a structure may show that it is too complex in some regions, forcing us to collapse further; parameter values obtained from structure analysis may suggest characteristics of the topology that need to be verified by visualization.

Besides studying the topologies of various communities, we also investigate patterns of mirroring on the Web. Our algorithms for detecting mirrors focus primarily on similarity in URL hierarchies, since this has been found to be more efficient than entirely content-based methods [2]. We study two snapshots of URLs on the web: one from the 1997 copy described above, and one from the June 1999 crawl at Stanford.

Challenge of large data size

The main challenge in this study is tackling the large, tape-resident dataset comprising 600 GB spread over 14 tapes. The Web is also growing rapidly, and we want the techniques used in our study to scale to larger versions. The data size imposes two main difficulties.

First, we need to process this gigantic, tape-resident dataset with limited memory. The tape-resident nature means that our algorithms must involve only a few scans of the complete input, and cannot assume the ability to perform random access on it. Moreover, since tape scanning is a slow, sequential process, all processing on the complete input must be pipelined with the tape scan and must keep up with the tape speed; otherwise we would have to run the tape many times. The limited memory size means that our algorithms must have a significantly sub-linear memory footprint.

Second, we need to analyze the topologies of potentially large link graphs of different communities. Many techniques have been developed to study large graphs analytically and to evaluate various parameters for them [9, 5]. However, since we want to study their topologies, we need to visualize these link graphs. It is well known that visualizing graphs meaningfully is extremely hard even for moderate-sized graphs (many hundreds of edges) [15, 4, 1], so we need to compact the link graphs without losing important structural properties.
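To make the pipelining constraint concrete, the following is a minimal streaming sketch, not the actual system: each record is consumed exactly once as it comes off the tape, and only small running state is kept. The record format and the classify and extract_links callbacks are hypothetical placeholders.

    # Sketch of a single sequential scan, with classification and link extraction
    # pipelined into the same pass. Record format and callbacks are hypothetical.
    def read_records(stream):
        """Yield (url, html) pairs, assuming one tab-separated record per line."""
        for line in stream:
            url, _, html = line.partition("\t")
            yield url.strip(), html

    def scan(stream, classify, extract_links):
        url_list, community_edges = [], {}
        for url, html in read_records(stream):       # one pass over the tape
            url_list.append(url)                      # input for mirror finding
            for community in classify(url, html):     # must keep up with tape speed
                edges = community_edges.setdefault(community, [])
                edges.extend((url, target) for target in extract_links(html))
        return url_list, community_edges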

Contributions

Besides reporting the results of our study, we also describe the techniques we have developed to tackle these problems, and our experiences in doing so. The solutions we have developed for processing the data include simple but efficient heuristics for classification, and powerful algorithms for finding mirrored content in URL hierarchies. For visualizing large link graphs, we have developed ways of collapsing a graph without significantly altering its important topological features. We believe that these solutions are of independent use in other applications. Our graph collapsing techniques can be adapted to study any large graph. Likewise, finding mirrors on the Web is useful for balancing load across replicated mirrors through request redirection, and for collapsing duplicates in search results [3].

1.1 Outline

In Section 2 we discuss related work in this area. We briefly discuss how we classify webpages into communities in Section 3. In Section 4 we discuss methods for collapsing link graphs without much loss of topological information. In Section 5 we describe the structures we found in different communities and give qualitative explanations. We present efficient algorithms for finding mirrors, and the results of our study on the 1997 and 1999 datasets, in Section 6. We conclude with directions for future work in Section 7.

2 Related Work

There has been a lot of work in the last few years on finding aggregate properties of web pages, such as page sizes, access rates, popularity, life span and so on (e.g., [5, 11]); [5] presents a nice survey of the literature in this area. Correspondingly, there has been some work on finding numerical properties of the link structure of the Web [12, 10]. However, there has been little work on analyzing the structures of communities on the Web. The closest related work is that done in the Clever project at IBM [6, 8, 9], which looks at mining knowledge bases from the Web. They posit a stochastic model for the growth of the Web and derive a hubs-and-authorities model for the structures of communities on the Web. Occurrences of this analytically derived structure are enumerated as communities. In contrast, we first experimentally find communities by classification and then find structures in them. The hubs-and-authorities model states that communities exhibit a bipartite core structure – bipartite subgraphs where each node in one partition links to every node in the other partition. In Section 5 we compare the common topology we find in many communities with the bipartite core structure.

There have been some recent algorithms for finding mirrors on the Web. The initial research focused on finding mirrors based on web page content alone [3]. However, it has recently been demonstrated [2] that it is easier to find candidate mirror pairs from the URL hierarchy (which is much smaller than the full Web content), and then check the resulting smaller set for content similarity. Finding potential mirror pairs in the URL hierarchy of the Web involves identifying pairs of nodes (where a node could be at any level in a web site) that have similar features in the sub-trees rooted at them, according to some feature set. Bharat and Broder [2] present an algorithm for this problem with a simplistic definition of a feature: they treat each pair of adjacent URLs (i.e., a parent and child pair) in the sub-tree under a node as a feature of that node. Our algorithms handle more descriptive feature sets: the sub-tree under a node as a whole, or the individual paths to leaves in this sub-tree. In addition, our algorithms can find mirrors at multiple levels of the URL hierarchy, as opposed to the algorithm in [2], which can only find mirrors among host web sites, i.e., top nodes in the URL hierarchy.

3 Finding Communities

In this section we describe how we identified communities on the Web by classifying web pages based on their content. The important constraint here is to classify web pages on the fly as they are read from the tape, without slowing down the tape scan.

We first need to select salient features that can be used to distinguish different communities. We use the standard document frequency (DF) thresholding technique for feature selection, since it has been shown to be simple and fast, yet quite reliable in comparison to more expensive approaches [14]. The idea is to choose as features for a community the terms with a high frequency in its document set. We used a hand-classified training set of 180 documents to select features, with a DF threshold of 40%; noise words were thrown out based on a list of commonly used words.
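As an illustration of DF thresholding (a sketch only; the tokenizer, stopword list, and toy documents below are ours, not the ones used in the study), features are the terms whose document frequency in the training set exceeds the threshold:

    import re
    from collections import Counter

    def df_features(training_docs, stopwords, df_threshold=0.4):
        """Select terms that appear in at least df_threshold of the documents."""
        doc_freq = Counter()
        for text in training_docs:
            terms = set(re.findall(r"[a-z]+", text.lower())) - stopwords
            doc_freq.update(terms)              # count each term once per document
        cutoff = df_threshold * len(training_docs)
        return {term for term, df in doc_freq.items() if df >= cutoff}

    docs = ["crocodile habitat and diet", "crocodile farm photos", "nile crocodile diet"]
    print(df_features(docs, stopwords={"and"}))  # {'crocodile', 'diet'} (order may vary)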

k Nearest Neighbors (kNN) is a commonly used technique for classification, but we found it to be too inefficient for our needs: we want our classifier to be pipelined with the tape scan without slowing it down, and we found the kNN algorithm to be the bottleneck. Hence we use the simple, lightweight heuristic described below.

First, we manually choose a small set of “must-have” keywords for each class. These keywords help us prune a significant fraction of documents. For each document that passes the keyword test, we examine its feature vector with respect to the feature vector for the class. A document is classified as a member of a class if either (a) of all the features in the class, a fraction above a certain threshold Tclass appear in the document, or (b) of all the features found in the document, a fraction above a certain threshold Tdoc appear in the class. Pages with a significant amount of content (e.g., an authority page) are likely to match condition (a), whereas pages that have few words (e.g., hub pages, photo galleries) are likely to match condition (b). We experimented with this heuristic using different thresholds on a hand-classified training set to find optimal values.

A serious problem with this classifier is that it is completely content-based, and does not use link structure information. This could lead it to miss some hub sites that do not contain any keywords (e.g., a hub with graphic buttons linking to resource pages). This is an intrinsic problem of running a classifier pipelined with the tape scan, since link traversals would constitute random accesses on the tape. We intend to resolve this by separately extracting the link structure and features, and then augmenting the classification using the link structure, as in [7].
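A minimal sketch of the two-threshold heuristic (the threshold values, keyword sets, and the representation of features as term sets are illustrative assumptions, not the settings used in the study):

    def classify(doc_terms, must_have, class_features, t_class=0.2, t_doc=0.5):
        """Assign a document to a class via keyword pruning plus the two-threshold test.

        doc_terms: set of terms in the document (stands in for its feature vector);
        must_have: manually chosen keywords; class_features: features for the class.
        """
        if not (doc_terms & must_have):                 # keyword test prunes most documents
            return False
        overlap = doc_terms & class_features
        cond_a = len(overlap) >= t_class * len(class_features)  # content-rich (authority) pages
        cond_b = len(overlap) >= t_doc * len(doc_terms)         # sparse pages (hubs, galleries)
        return cond_a or cond_b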

4 Graph Collapsing

In this section, we discuss graph collapsing algorithms for reducing the complexity of the graph to aid visualization. We describe two kinds of collapsing: logical collapsing, which infers semantic relationships between web pages based on URLs and tries to cluster related webpages into logical web sites, and structural collapsing, which performs a lossless compaction of the link graph by mapping certain structures into single nodes.

4.1 Logical Graph Collapsing

Logical collapsing is used to collapse the many related web pages within the URL hierarchy of a domain (such as www.ibm.com) into a single logical web site. We believe that this is important for learning overall community topologies that are not skewed by local variations. For example, consider the homepage of a person who has links to a page with his favorite cooking recipes and one with his favorite movies; these pages should be treated as a single entity when analyzing the link structure. There are many plausible metrics for automatically identifying at what levels in the URL hierarchy to collapse pages on a site. The simplest approach, which we have adopted, is to collapse all pages on a web site, as in [9].


Figure 2: Collapsing isolated trees and cliques

Figure 3: Incorrect structural collapsing

However, in some cases many unrelated web pages may be collapsed together (e.g., consider geocities.com). We have therefore also studied two other heuristic metrics for collapsing subtrees in the URL hierarchy of a website: by number of descendants and by number of children. In the former, we collapse maximal subtrees whose size is less than a threshold. In the latter, we identify nodes whose number of children is below a threshold and collapse each such node with its children. The hard problem is figuring out ideal thresholds, and we intend to study this further. Implementing these heuristics is simple: since the parent-child relationship is based solely on URLs, all three metrics can be evaluated in a single pass over a sorted list of URLs.
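Both heuristics need only per-node counts of children and descendants, which can be accumulated in a single pass over the sorted URL list; the sketch below (assuming '/'-separated URL paths, and not the exact implementation used in the study) illustrates the counting:

    from collections import Counter

    def subtree_sizes(sorted_urls):
        """One pass over sorted URLs, counting children and descendants per node."""
        children, descendants = Counter(), Counter()
        for url in sorted_urls:
            parts = url.strip("/").split("/")
            parent = "/".join(parts[:-1])
            children[parent] += 1                            # one direct child of its parent
            for depth in range(1, len(parts)):
                descendants["/".join(parts[:depth])] += 1    # one descendant of each ancestor
        return children, descendants

    ch, de = subtree_sizes(sorted(["a.com/x", "a.com/x/1", "a.com/x/2", "a.com/y"]))
    print(ch["a.com"], de["a.com/x"])   # 2 children under a.com, 2 descendants under a.com/x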

4.2 Structural Graph Collapsing

As argued earlier, an important challenge in studying the topologies of link graphs is compacting them to visualizable sizes. In this section we describe a method for collapsing a graph without losing much structural information about it. An additional advantage of such collapsing is that it may smooth out minor variations, and allow us to identify approximate structures in the graph that could not be identified by directly visualizing the original graph.

The main idea is to collapse certain subgraphs into single nodes (as in Figure 2). To avoid losing structural information we must choose subgraphs such that (a) the collapsed graph has structural features similar to the original graph, and (b) details of structures obtained from the collapsed graph can be extrapolated to details of the actual structures in the original graph. Although property (a) is hard to formalize, we identify an important rule for collapsing that we believe is a necessary condition for it: if we collapse subgraphs in a (directed) graph G to form a graph H, any path P in H must be convertible to a path in G when the collapsed nodes along P are expanded. Figure 3 shows why this condition is important: the collapsing there significantly changes the features of the original graph, and the cycle formed after collapsing is not a true cycle at all. We call subgraphs that satisfy the above property collapsible subgraphs. When we collapse these subgraphs, we store with each collapsed node a summary of the contents of the subgraph that it represents. This summary can be used to approximately convert any path in H to a path in G.


Figure 4: Collapsing Trees

Figure 5: Tree 3 is not collapsible

We believe that this information can also be used to convert details of most structures found in H into details in the original graph G.

Identifying Collapsible Subgraphs

A simple collapsible subgraph is an isolated tree – a tree that is “isolated” save for a single cut-edge (e.g., tree (3) in Figure 2). Other clearly collapsible subgraphs include cliques and cycles, since there is a path between any two nodes in these; indeed, any bi-connected subgraph is collapsible. However, there are several problems with using these structures for collapsing. First, cliques are the ideal subgraph to collapse, since the collapsed node can summarize the subgraph by simply maintaining its vertex count; unfortunately, cliques are very rare in the directed link graphs of Web communities. Second, bi-connected subgraphs are extremely hard to find in large link graphs: the most direct method, a depth-first search, needs random access to the nodes of the graph, and these graphs may not fit in memory. Third, a cycle of length n can be found by an n-way self (equi) join of the adjacency relation of the link graph (the adjacency relation of a graph is the relation corresponding to its adjacency matrix, with attributes being the from-node and to-node of each edge); however, computing this join for long cycles is time-consuming, since it involves many joins.

Therefore we use a different subgraph for collapsing, a more general version of the isolated tree. A single-entry-point tree is a (rooted) tree in which no node except the root has any links from outside the tree. Figures 4 and 5 show why such a tree is collapsible and why this condition is necessary. The example can be generalized to the following result (proof in Appendix A).

Theorem 1: A single-entry-point tree is a collapsible subgraph.

Collapsing single-entry-point trees

In general this problem can be viewed as a special kind of transitive closure over the adjacency relation of the link graph. However, for efficiency we did not use a DBMS to evaluate the transitive closure; instead

we use the following algorithm.

Input: a list of edges of the form (from-node, to-node), sorted by to-node.

1. Find all loners – nodes with in-degree 1. Since the edge list is sorted by to-node, this is trivial.

2. Find the nodes to which the loners are collapsed. This involves finding the ultimate ancestor of each loner in the graph induced by the loners. If the number of such nodes is small enough that this graph fits in memory, we use an in-memory index to traverse the links and reach the ancestor. Otherwise, we make repeated passes over the graph, each time collapsing a loner with its parent, until the graph has no loners. This requires as many passes as the maximum height of the collapsible trees in the graph, which is typically quite small.

3. Make a pass over the original edge list, replacing each loner node with the node to which it is collapsed.

In our implementation, we require 8 bytes per edge, and the number of edges in this graph is the same as the number of nodes with in-degree one. Thus with 8MB of memory we can easily keep the entire graph in memory as long as the number of nodes with in-degree 1 is less than 1 million. This condition was satisfied for all the link graphs that we studied.
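A compact in-memory sketch of this procedure (node names and the edge-list representation are illustrative; the tape-friendly variant would use the repeated-pass strategy of step 2 instead of the pointer walk below):

    def collapse_single_entry_trees(edges):
        """edges: list of (from_node, to_node). Returns the collapsed edge list."""
        indeg = {}
        for _, to in edges:
            indeg[to] = indeg.get(to, 0) + 1
        # Step 1: loners are nodes with in-degree exactly 1; record their unique parent.
        parent = {to: frm for frm, to in edges if indeg[to] == 1}
        # Step 2: walk each loner up to its first non-loner ancestor.
        def ancestor(node):
            seen = set()
            while node in parent and node not in seen:   # guard against loner cycles
                seen.add(node)
                node = parent[node]
            return node
        # Step 3: rewrite the edge list, dropping edges internal to collapsed trees.
        collapsed = {(ancestor(f), ancestor(t)) for f, t in edges}
        return sorted((f, t) for f, t in collapsed if f != t)

    edges = [("hub", "root"), ("hub", "a"), ("root", "a"), ("a", "b"), ("a", "c"), ("b", "d")]
    print(collapse_single_entry_trees(edges))   # [('hub', 'a')]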

5 Topologies Found in Various Communities

In this section we describe some studies we have performed on the link graphs of different communities, and the kinds of structures we have found. We study the structure of each community by visualizing its collapsed link graph, using the VCG visualization tool [13], which enables fast visualization of moderately sized graphs. We first present a common structure we found in many communities, and then describe particular features of individual communities. For brevity we present the latter results for a few representative communities only. We use a fairly diverse set, consisting of an animal, a novel, a celebrity, an outdoor hobby, and a sport: Crocodiles, Lord of the Rings, Michael Jordan, Mountain Biking, and Tennis. Our goal in choosing these communities was to pick a set that (we suspected) would encompass a wide range of structures. We believed that the first would be a small but well-spread community of scientists, Lord of the Rings would consist of many personal fan pages, Michael Jordan would consist of fan pages as well as professional sports sites, Mountain Biking would have a few authoritative sites with most of the information, and Tennis would be a huge community with a wide variety of sites.

[Figure 6: The Layered DAG and Hubs-and-Authorities topologies. The hubs-and-authorities topology consists of hubs (lists of links to many authorities) and authorities on topics for the community. The layered DAG topology consists of arbitrary pages that link to resource pages, resource nodes with high in- and out-degree, and authorities on specialized related areas.]

5.1 Features common to all communities

As the visualizations in Figures 7, 8, 9, and 10 show, many of the communities have significant numbers of separate components. Each component is mostly acyclic, and can be viewed as a DAG. VCG’s default setting is to lay out this DAG so that edges go from top to bottom, with the number of crossings between edges minimized. This layout is helpful because it has a direct semantic correspondence, which we discuss next. Note that there is a danger that the topology we visualize may be an artifact of a particular visualization method; to avoid this trap, we have confirmed our findings by visualizing with different layout settings in VCG.

The Layered DAG topology

We find that all our communities exhibit a layered DAG topology, as shown in Figure 6. This is an approximate version of the hubs-and-authorities bipartite core model first proposed in [9], which we discussed in Section 2. The bipartite core model is too rigid for most communities, and many of them have no bipartite cores of reasonable size. Instead they exhibit the layered structure we describe next.

The middle layers of this DAG correspond to resource nodes for the community. These have high out-degrees and in-degrees, and so the links are quite dense between consecutive layers. In general, in-degree increases and out-degree decreases from the top layers to the bottom layers, reflecting a gradual shift in nodes from hubs (such as www.yahoo.com) to specialized authorities. Many of these middle-layer nodes are general resources (like www.wimbledon.org or www.tennis.com for tennis) that are both hubs and authorities; they contain a wealth of information and so have many in-links, but also point to many other useful sites.

Figure 7: The community for Lord of the Rings

Figure 8: A subset of the community for Michael Jordan

The top layers contain isolated pages that link to a small number of resource pages, such as Bob’s page for his pet, with a reference to a resource page for the pet, or Mary’s Pete Sampras page with links to some tennis resources. The bottom layers contain relatively isolated authorities on related areas, which contain only rare pointers from resource pages. For example the bottom layers of the tennis community contain pages of small racquet manufacturers.

5.2 Features particular to individual communities

Besides the common layered DAG topology, many communities have special characteristics that cause them to exhibit particular structures. We describe these features for some communities.

Animal: Crocodile: This community is representative of animal communities in general; we have found similar results with other animal communities such as Iguanas and Bearded Dragons. The link graph is very sparse, with many disconnected components, each containing a small number of nodes (a two- or three-layer DAG). This isolation seems to be an intrinsic feature of the interaction in communities relating to pets. The individual components consist of a wide variety of commercial sites, such as travel websites or sites selling these pets, which do not link to each other, probably to ensure a “locked-in” market for themselves and prevent customers from switching. Moreover, there are many personal pages with pictures of pets that do not link to any external sites.

Outdoor Hobby: Mountain Biking: The mountain biking community also has many separate components, each a multi-level (5–6 level) DAG. This componentization is due to two factors. First, this community contains pages on many semantically different topics, such as bicycles versus motor-bikes, mountain biking the sport versus the bicycle as an object, and biking information versus bike companies. Second, many of the DAGs correspond to geographically distinct sub-communities, each covering mountain biking information in a specific geographic area; these sub-communities do not typically link to each other because much of the information (terrain, climate, local bike shops, etc.) is localized. The intermediate-layer nodes are hubs (like www.yahoo.com), extensive individual resource pages (like world.std.com/~jimf/biking), professional resource pages (like www.mountainzone.com), and so on. The lower-layer nodes are typically links to bike companies or companies making bike spares.

Figure 9: A subset of the community for Mountain Biking

Figure 10: A subset of the community for Tennis

Sport: Tennis: This was the largest community we studied, and it contained some massive, 16-to-17-layer DAGs. The middle layers contained resource pages like www.wimbledon.org and www.tennis.com. The top layers consist of incidental references to tennis from general sports pages, personal web pages, and bulletin board postings. The bottom layers consist of authorities on specialized topics like ball and racquet manufacturers.

Book: Lord of the Rings: This graph contains only one main component (DAG), centered around the Tolkien computer games page (http://www.lysator.liu.se/tolkien-games/) as an authority. There are other small components linking to other Tolkien game pages, as well as a component linking to “Lord of the Rings” curio sellers. The top-layer nodes are individual fan pages pointing to these authorities.

Celebrity: Michael Jordan: The main hubs are directories like Yahoo, sports sites, and sites selling basketball memorabilia. The top-layer sites are mainly individual magazine and news articles. We could not find any Jordan resource site; apparently Michael Jordan was not popular on the Web in 1997!


6 Finding Mirrors in the Web

In this section we describe our investigation of mirrored content on the Web. We present new algorithms for detecting complete and partial mirrors, and the results of running them on the 1997 and 1999 datasets (described in Section 1). Our algorithms look for similarities in URL hierarchies, rather than similarities in the content of web pages. As noted in [2], this is an effective way of reducing the set of candidate mirror pairs to a small set on which a content-based algorithm (such as [3]) can be run to eliminate spurious mirrors.

Finding potential mirror pairs in the URL hierarchy of the Web involves identifying pairs of nodes (where a node could be at any level in a web site) that have similar features in the sub-trees rooted at them, according to some feature set. Our algorithms use as the feature set the set of all paths from the root to the leaves of the subtree rooted at a node. As argued in Section 2, this set completely describes the structure of the sub-tree, in contrast to the adjacent-nodes feature set used in [2]. Moreover, our algorithms can find mirrors at any level of the URL hierarchy. We believe that this is important: there are significant numbers of mirrors at levels below the top level of the URL hierarchy (i.e., among descendants of hosts), and of mirrors across levels in the URL hierarchy. Indeed, our results show that mirrors are growing at a much faster rate at these lower levels.

We present three algorithms for finding partial and complete mirrors in a given list of all URLs. The first, Hash Tree Signatures, is a simple yet efficient way of detecting mirrors that maps the entire sub-tree under a node into a single value; while it is highly efficient and can be computed incrementally, it is very rigid and can only find exact mirrors. Hence we develop a second algorithm, ReverseURL, that handles partial overlaps by ranking a mirror pair based on the extent of similarity in the subtrees rooted at its nodes; however, it is not incremental, and there is a possibility of an explosion in the size of intermediate results. Finally, we have developed an algorithm that uses hash-based shingling to obtain a much smaller feature set for each potential host; this feature set probabilistically characterizes the sub-tree under a host, yet can still find approximate mirrors.

6.1 Hash Tree Signatures

This algorithm computes a hash signature for each node based on the structure of the sub-tree rooted at the node, and the URLs in the subtree. Once the hash signatures for all the nodes are computed, we find nodes with equal hash values by sorting on the hash value. The collision-free property of the hash signature guarantees that these nodes have exactly identical subtree structures.

Let s(n) denote the URL of a node n. Assume that a node n has a set of children C(n) = {a_i} and that k = |C(n)|. Define a hash function f for n as follows: if C(n) is empty, f(n) = X; otherwise, f(n) = h(<f(a_{b_1}), f(a_{b_2}), ..., f(a_{b_k})>), where the sequence {f(a_{b_j})}, 1 <= j <= k, is sorted by the value f(a_{b_j}), and h is a hash function with a good collision-free property.

Analysis: We can easily prove that if f(n1) = f(n2), then n1 and n2 have the same subtree structures, assuming that the hash function is collision free. After sorting the dataset by URL, this hash function can easily be computed for each node in a URL hierarchy in a single pass. The memory requirement is only a stack whose size is the maximum depth of the URL hierarchy.
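A sketch of the single-pass computation over the lexicographically sorted URL list, using a stack of pending ancestors (Python's built-in hash over tuples stands in for the collision-resistant hash h, and the leaf constant X is arbitrary; this is an illustration, not the original implementation):

    def tree_signatures(sorted_urls, h=hash, X="leaf"):
        """One pass over '/'-separated URL paths sorted so parents precede children.
        Returns {node: signature}; equal signatures imply identical subtree structure."""
        sigs, stack = {}, []                      # stack holds (path, [child signatures])

        def finalize(path, child_sigs):
            sig = h(X) if not child_sigs else h(tuple(sorted(child_sigs)))
            sigs[path] = sig
            if stack:
                stack[-1][1].append(sig)          # report the signature to the parent

        for url in sorted_urls:
            while stack and not url.startswith(stack[-1][0] + "/"):
                finalize(*stack.pop())            # this subtree is complete
            stack.append((url, []))
        while stack:
            finalize(*stack.pop())
        return sigs

    urls = ["a", "a/b", "a/b/c", "a/b/d", "a/x", "m", "m/p", "m/p/q", "m/p/r", "m/y"]
    sigs = tree_signatures(sorted(urls))
    print(sigs["a"] == sigs["m"])                 # True: identical subtree structures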

Let l(n) denote the length of the longest path between a node n and a leaf. Instead of sorting the list of nodes and hash values by hash value to identify mirrors, we sort the records using l(n) as the primary key and the hash value as the secondary key, because this shows us mirror pairs involving big subtrees before mirror pairs involving their smaller “sub-subtrees”. This is important to avoid reporting unnecessary mirrors: for instance, if the two URLs a/b/c and e/f/g are mirrors, we want to report only the top-level mirror pair a/b/c :: e/f/g, and not other mirror pairs like a/b/c/d :: e/f/g/h.

Incremental behavior: Since the Web is continually growing, it is important that the mirroring algorithms we develop be incremental; that is, it must be easy to add a new set of URLs to the existing set and find mirrors without recomputing from scratch. Computing hash signatures is intrinsically incremental, since the hash signature for a new web site is independent of the structure of other sites. However, if the new URLs that we add are children of existing URLs, we may need to update previously computed hash values. There are two ways of doing this. We could use a trie to index the hash values for the URLs, so that when a new URL is added we can traverse the trie to find all nodes whose hash values are affected (because they are ancestors of this URL). Alternatively, we could maintain a list of URLs and their hash values sorted by URL, so that we can merge the main list with a “delta” list in one pass.

6.2 Reverse URL with Ranks

As we already indicated, the earlier algorithm has the disadvantage that it only finds subtrees that have exactly the same structure. In this section, we present a variant that allows us to find approximate structural similarities. The main idea behind the algorithm is that the subtree rooted under a node in the URL hierarchy is uniquely characterized by the set of all paths from it to its leaves. For example, the subtree under a node a/b with descendants a/b/c, a/b/d, and a/b/c/e is characterized by the paths c/e and d. (We assume that all intermediate nodes in the hierarchy exist; otherwise we would have to store these absentee nodes separately.) Hence we can view similarity in the subtrees under two nodes as a large intersection in their paths to leaves.

The key idea for efficiently computing this intersection for all pairs of nodes is that a path to a leaf that is common to two nodes can be viewed as a common prefix of the two nodes if the URLs are reversed. Note that this reversal is done at the granularity of a directory name and not a character: e.g., reversing the two URLs a/b/c and d/e/b/c into c/b/a and c/b/e/d and sorting will allow us to find a and d/e as having the common leaf path c/b.

In a single pass through the sorted list of reversed URLs, we output each pair of nodes with a common leaf path. However, a single pair may appear at different positions in the output, once for each common leaf path. Hence we need to re-sort the list and compute an aggregate rank for each node pair, corresponding to its number of common paths to leaves. Separately, we find the number of descendants of every node (by a single pass through the sorted list of URLs). The ratio of the rank to the minimum of the two URLs’ descendant counts determines the extent of similarity between their subtrees; we use the minimum because a small site often mirrors the content of a larger site.

Incremental Behavior: A serious problem with this algorithm is that we need the complete set of URLs to determine the size of the overlap between pairs of nodes; every addition of newly crawled sites therefore requires complete recomputation. A second problem that we have encountered is an explosion in the intermediate set of possible mirror pairs that we need to rank.
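The following in-memory sketch illustrates the ReverseURL idea on a toy URL list (the real algorithm works by sorting reversed URLs on disk; here we simply group nodes by shared leaf paths, and the number of leaf paths stands in for the descendant count in the denominator):

    from collections import Counter, defaultdict

    def leaf_paths(urls):
        """Map each internal node to the set of leaf paths in its subtree."""
        urls = set(urls)
        leaves = [u for u in urls
                  if not any(v.startswith(u + "/") for v in urls if v != u)]
        paths = defaultdict(set)
        for leaf in leaves:
            parts = leaf.split("/")
            for i in range(1, len(parts)):
                paths["/".join(parts[:i])].add("/".join(parts[i:]))
        return paths

    def ranked_mirror_pairs(urls):
        paths = leaf_paths(urls)
        by_path = defaultdict(list)                   # shared leaf path -> nodes having it
        for node, ps in paths.items():
            for p in ps:
                by_path[p].append(node)
        rank = Counter()
        for nodes in by_path.values():                # each shared path raises a pair's rank
            for i, a in enumerate(nodes):
                for b in nodes[i + 1:]:
                    rank[(a, b)] += 1
        return {pair: r / min(len(paths[pair[0]]), len(paths[pair[1]]))
                for pair, r in rank.items()}

    urls = ["x.org/docs/a/1", "x.org/docs/a/2", "x.org/docs/b",
            "m.net/copy/docs/a/1", "m.net/copy/docs/a/2"]
    print(ranked_mirror_pairs(urls))    # the pair x.org/docs :: m.net/copy/docs scores 1.0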

6.3 Subtree Shingling

In this section, we present an algorithm that addresses both the exactness problem of the Hash Tree Signatures algorithm and the lack of incremental behavior of the ReverseURL algorithm. The key idea is to improve the ReverseURL algorithm by computing hash signatures on each path to a leaf, and using shingling [3] to approximately rank node pairs by the intersection of these path signatures.

For a node with k paths to leaves p1, p2, ..., pk, we can compute path signatures h(p1), h(p2), ..., h(pk) by hashing the paths. To find node pairs with a large intersection in these paths, we use a probabilistic technique called shingling, first used in [3]. The idea is to convert this set of path signatures into a smaller set of hash values. We choose n hash functions h1, ..., hn and compute the path signatures under each. For each hash function hj we store the minimum of these path signatures, sj = min{hj(pi) : 1 <= i <= k}. The shingle of a subtree T is defined as the set of minimum path signatures, σ(T) = {sj : 1 <= j <= n}.

For two trees T1 and T2, if the percentage of matching values in the two sets σ(T1) and σ(T2) is higher than a threshold, then the algorithm classifies the two trees as candidates for approximate mirror sites.

Performance Analysis: This algorithm consists of two stages. We first compute the shingle values for each subtree; in the second stage, we scan through the shingle values of the different subtrees and output possible approximate mirror sites. The first stage can be done in one pass. In the second stage, the shingle value for each hash function hj is output to a separate file Fj. We sort each file Fj by the shingle values sj(T), where T stands for a subtree.
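A sketch of the shingle computation; for illustration we derive the n hash functions by salting a single SHA-1 hash, which the paper does not specify — any family of independent hash functions would do.

    import hashlib

    def shingle(leaf_paths, n_hashes=12):
        """Min-hash shingle of a subtree: for each salted hash function, keep the
        minimum value over the subtree's leaf-path signatures."""
        def h(j, path):
            digest = hashlib.sha1(f"{j}:{path}".encode()).digest()
            return int.from_bytes(digest[:8], "big")
        return tuple(min(h(j, p) for p in leaf_paths) for j in range(n_hashes))

    def common_minima(sig1, sig2):
        """Number of hash functions whose minima agree for the two subtrees."""
        return sum(a == b for a, b in zip(sig1, sig2))

    t1 = ["a/1", "a/2", "b", "c/d"]
    t2 = ["a/1", "a/2", "b", "c/e"]          # differs in one leaf path
    print(common_minima(shingle(t1), shingle(t2)), "of 12 minima agree")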


Figure 11: Probability that shingling will accept two subtrees as mirrors, as a function of the fraction of paths to leaves in which they differ (accept rate vs. disagree rate, with an accept threshold of 4 out of 12 hash functions)

We then need to find intersections in these shingles for subtree pairs. For each subtree T, in every file Fj we output the list of all subtrees that have the same shingle value as T to a file Gj. After performing this operation, we sort each of the output files Gj based on the list of URLs with the same shingle value. We then merge the corresponding files and count the number of common shingles for possible mirror pairs. We report mirror pairs with a count above a certain threshold as approximate mirror pairs.

Precision Analysis: Since shingling is a probabilistic matching algorithm, we can analytically compute the probability that it will output two subtrees in the URL hierarchy as mirrors. This probability is a function of the number of paths to leaves in which the two subtrees differ. Figure 11 plots this probability given that we use 12 hash functions and a match threshold of 4, i.e., we require that two subtrees have identical shingle values sj for at least 4 of the hash functions. We can tune this acceptance rate depending on how approximate we want our mirrors to be.
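The curve in Figure 11 can be reproduced with a simple binomial model: if r is the probability that a single hash function's minima agree for the two subtrees (for min-hashing this is the resemblance of their path sets, which falls as the disagree rate grows), then acceptance requires at least 4 of the 12 independent minima to agree. The mapping r = 1 - disagree rate below is our simplifying assumption, not a formula stated in the paper.

    from math import comb

    def accept_probability(r, n_hashes=12, threshold=4):
        """P(at least `threshold` of n_hashes independent minima agree),
        given per-hash agreement probability r."""
        return sum(comb(n_hashes, k) * r**k * (1 - r)**(n_hashes - k)
                   for k in range(threshold, n_hashes + 1))

    for disagree in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
        r = 1.0 - disagree                        # simplifying assumption
        print(f"disagree rate {disagree:.1f} -> accept probability {accept_probability(r):.3f}")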

6.4 Experimental Results

We ran our mirroring algorithms on two datasets: the 1997 dataset described in Section 1, and a dataset from a June 1999 crawl at Stanford. The main features of these datasets are summarized in Figure 12. We cleaned the datasets by removing queries, port numbers, cgi-bin scripts, and duplicates. The outputs of our algorithms are very similar, and hence we report the results for the ReverseURL algorithm.

Mirrors of different types

In our first study, we ran the mirroring algorithms on the 1997 and the 1999 data sets and counted the number of mirrors at each level of the URL hierarchy. In other words, we computed the number of mirrors that are web hosts, their children (Level 1), and their grandchildren (Level 2).

Figure 12: Characteristics of the data sets

                      1997 data    1999 data
  Number of URLs      35800000     107000000
  Cleaned URLs        19150000     46260000
  Hosts               164156       1028992
  Level 1 children    2450112      4536304
  Level 2 children    4396152      11686855

Figure 13: Number of mirrors

  Among               1997 data set  1999 data set
  Hosts               589            2571
  Level 1 children    1172           4134
  Level 2 children    601            2364
  .com/               1553           6153
  .net/               121            857
  .org/               187            1131
  Total               2362           9069

The outputs of our algorithms are shown in Figure 13. We also measure the number of mirrors in three important domains on the web, namely .com/, .net/, and .org/. These readings give us some insight into how mirrors are distributed across different levels in the URL hierarchy, and how they have grown from 1997 to 1999.

The first observation we make is that the percentage of mirrors decreases as the level increases: the fraction of hosts that are mirrors is much greater than the fraction of level 1 nodes that are mirrors, which is in turn greater than the fraction of level 2 nodes that are mirrors (note that the absolute numbers of mirrors do not follow this rule, due to a drastic increase in the total number of nodes at different levels). For example, in the 1997 data set, the percentage of hosts that are mirrors is 0.38% and the percentage of level 1 nodes that are mirrors is 0.05%. A similar trend can be observed in the 1999 Stanford data set.

Second, we observe an uneven growth in mirrors at different levels. Whereas the percentage of hosts that are mirrors has decreased from 0.38% in 1997 to 0.25% in 1999, the percentage of level 1 mirrors has increased from 0.05% to 0.09%. This uneven growth is largely due to a sharp increase in the total number of hosts without a corresponding increase in the number of level 1 nodes. We believe that this is due to a large number of new, partially developed websites. In [2], the authors detected 10% of the hosts to be mirrors; however, that is the percentage of mirrors among hosts with more than 100 descendants only, whereas we compute the percentage of mirrors among all the nodes crawled.

In a related experiment, we determine the number of mirrors that belong to the important domains on the web, namely .com/, .net/, and .org/. We find that a huge percentage of the mirrors are .com/ websites: in the 1997 data set, .com/ sites formed 65.7% of the mirrors, and in the 1999 data set they form 64% of the mirrors. We can see that this percentage has not changed much from 1997 to 1999. However, the percentage of .net/ and .org/ mirrors has increased from 1997 to 1999, revealing an increase in the importance of these domains.

Intra-domain and Cross-domain mirrors

In another experiment, we calculated the distribution of mirror pairs for various domains. We took three main domains for study purposes: .com/, .net/, and .org/.

Figure 14: Cross- and intra-domain mirrors

  Mirror Pairs     1997 data  1999 data
  Total            13674      21841
  .com/ - .com/    4001       6114
  .net/ - .net/    30         204
  .org/ - .org/    57         401
  .com/ - .net/    474        1630
  .com/ - .org/    697        1873
  .net/ - .org/    51         308

An intra-domain mirror pair refers to a mirror pair of two URLs that belong to the same domain; in a cross-domain mirror pair, the two URLs belong to different domains. In the 1997 data set, the intra- and cross-domain pairs of .com/, .net/, and .org/ contribute around 39% of the entire set of mirror pairs. This has increased to 49% in the 1999 data set, which can be attributed to a great increase in .com-.net and .com-.org cross-domain mirrors. We observe that cross-domain mirrors have increased significantly compared to intra-domain mirrors. This probably reflects a breakdown in the traditional domain boundaries: with domain names filling up fast, more and more commercial sites are using domains like .net and .org.

Mirror Size Distribution

Another interesting property of mirrors that can be studied is the distribution of the sizes of mirror sites. We define the size of overlap between two mirrors to be the number of descendants that are common to the subtrees of the two URLs. From Figures 15 and 16, we can infer that a huge percentage of the mirror pairs have a small overlap (10-600), with the rest of the mirror pairs distributed across different sizes. There is approximately an exponential decrease in the number of mirror pairs with a linear increase in the size of the overlap. This suggests that most mirrored sites are fairly small. We observe that this behavior has not changed much from 1997 to 1999.

Spurious Mirrors

During our study we had considerable difficulty with spurious mirrors among bulletin boards, many of which share a similar structure. We have removed these mirrors, and the results we report do not take them into account.

Figure 15: Mirror size distribution on the 1997 dataset (number of mirrors vs. size of subtree overlap)

Figure 16: Mirror size distribution on the 1999 dataset (number of mirrors vs. size of subtree overlap)

7 Conclusions & Future Work

Information on a wide variety of topics is being recorded electronically on the World Wide Web at a dizzying pace. In addition, more and more communities of people are using the Web as a medium for collaboration


and interaction, by creating communities of web pages on pertinent topics. In this paper we have described a study of the structure of the Web: we analyze the topologies of different communities on the Web, and also patterns of data mirroring.

Our experimental results on community topologies show that many communities exhibit an underlying layered DAG topology, an approximate version of the bipartite core model proposed earlier. This finding is reassuring in that it affirms previous analytical hypotheses. At the same time, we also find significant variations in different communities’ topologies based on their individual characteristics.

We have also described new algorithms for finding mirrors based on just the URL hierarchy of web sites. We have run these on URL lists from 1997 and 1999, and have studied the patterns of mirroring and how they have changed. We also quantify the numbers of intra- and cross-domain mirrors.

Besides the experimental results, our work has also forced us to develop techniques for analyzing such large datasets. We have developed ways of compacting large graphs for visualization, and different algorithms for finding mirrors among web pages. We believe that these techniques will be of independent interest in other applications.

Future Work

The Web has great potential for data analysis, and there is much work to do to study it. We have studied some selected communities and have found fairly interesting structures. This needs to be extended to more communities; a large collection of topologies for different communities will help us better evaluate models for the evolution of communities on the Web. Particularly interesting would be a model for the Web that handles intra-site links differently from inter-site links.


A second direction for further research is exploiting the results of such studies to improve access to information on the Web. As described earlier, community topologies can be used to cluster and improve search results, and mirror listings can be used to reduce duplication in search results and for web page request redirection.

Another interesting dimension for further study is temporal information about web pages. We can use timestamp information to validate some hypotheses about the roles different nodes play in the link structure of a community. For example, if we find that nodes in the middle layers were created early, and that nodes in the top layers were created much later, it will support our contention that the nodes in the top layers are links from less relevant pages to important resource pages. Likewise, it will be interesting to collapse pages into sites based not on URL information (such as number of descendants or children), but instead on proximity in creation time: we can collapse sub-trees within a DAG whose nodes all have similar timestamps, assuming that these correspond to a site.

Acknowledgments: Michael Chu, Husain Muzafar, and Jon Kuroda helped us set up machines for our experiments. We would like to thank David Gibson and Sridhar Rajagopalan for getting us the dataset from IBM. We also thank Alex Berg for sharing some of his code and thoughts with regard to data cleaning. We are thankful to the Stanford Webbase group for giving us their URL lists. Computing and network resources were provided through NSF RI grant CDA-9401156. V. Raman was supported by a Microsoft Fellowship. A. Deshpande was supported by grant CONTROL:442427-21389.

References

[1] G. Battista et al. Annotated bibliography on graph drawing algorithms. Computational Geometry: Theory and Applications, 4(5):235-282, 1994.
[2] K. Bharat and A. Broder. Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content. WWW8 / Computer Networks 31(11-16), 1999.
[3] A. Broder et al. Syntactic Clustering of the Web. WWW, 1997.
[4] F. Brandenburg. Nice drawings of graphs are computationally hard. In Visualization in Human-Computer Interaction, LNCS 439, 1988.
[5] P. Burchard. Statistical Properties of WWW. http://www.concentric.net/~aleph0/www/stats.
[6] S. Chakrabarti et al. Mining the link structure of the World Wide Web. IEEE Computer, to appear.
[7] S. Chakrabarti et al. Enhanced hypertext categorization using hyperlinks. SIGMOD, 1998.
[8] J. Kleinberg. Authoritative sources in a hyperlinked environment. JACM, 1999.
[9] R. Kumar et al. Extracting large scale knowledge bases from the web. VLDB, 1999.
[10] R. Kumar et al. Trawling the web for emerging cyber-communities. WWW, 1999.
[11] J. Pitkow. Summary of WWW characterizations. WWW, 1998.
[12] E. Spertus. ParaSite: Mining Structural Information on the Web. WWW, 1997.
[13] G. Sander and I. Lemke. VCG tool development.
[14] Y. Yang and J. O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. Intl. Conf. on Machine Learning, 1997.
[15] T. Munzner. Laying out large directed graphs in 3D hyperbolic space. Proc. IEEE Symp. on Information Visualization, 1997.

A Informal Proof of Theorem 1

Theorem 1: A single-entry-point tree is a collapsible subgraph.

Proof: Consider a graph G from which all collapsible trees are collapsed to generate a graph H. Suppose that P is a path in H from a to b, and that it contains an intermediate node v that is a collapsed node. If v corresponds to a subgraph T of G with root r, no node in T except r has an in-link from outside T. Hence the edge in P that enters v must actually be an in-link to r in G. Suppose that the edge in P that leaves v corresponds to an edge from x to y in G; then x is in T and y is not in T. Since T is connected, there must be a path Q in T from r to x. Therefore the path

  a --(along P)--> r --(along Q)--> x --> y --(along P)--> b

is the path we need from a to b in G.
