Detecting near-replicas on the Web by content and hyperlink analysis

Ernesto Di Iorio, Michelangelo Diligenti, Marco Gori, Marco Maggini, Augusto Pucci
Dipartimento di Ingegneria dell'Informazione, Università di Siena, Italy

ABSTRACT

The presence of replicas or near-replicas of documents is very common on the Web. Whilst replication can improve information accessibility for the users, the presence of near-replicas can hinder the effectiveness of search engines. We propose a method to detect similar pages, in particular replicas and near-replicas, which is based on a pair of signatures. The first signature is obtained by a random projection of the bag-of-words representation of the page contents. The second signature, referred to as the Hyperlink Map, is computed by a recursive equation which exploits the connectivity among the Web pages to encode the page context. The experimental results show that on the given dataset replicas and near-replicas can be detected with a precision and recall of 93%.

1. INTRODUCTION

The success of a search engine mainly depends on its capability to provide satisfactory answers to the users' queries in the first positions of the result set. In order to reduce the effect of information flooding, many criteria have been used to define a ranking among the returned documents. However, another important issue is the redundancy of the information on the Web. Documents on the Web can be replicated many times on different hosts, or the same document can be accessed by different URLs because of aliases. Partial or complete mirrors of sites are quite common to ease the access to popular resources and to distribute the network and server load. Replicas and near-replicas waste storage in the search engine indexes, they reduce the quality of the query results by replicating the same information, and they can also hinder the effectiveness of ranking techniques based on link analysis.

In [1] different techniques to detect mirror sites are analyzed and compared. The proposed algorithms do not consider the page contents, but are based on features such as the IP addresses of the sites, the hostnames, the structure of the path in the URL, and the similarities in the connectivity to external hosts for the candidate mirrors. A different approach, based on hash signatures of the page contents, is presented in [3]. In this paper we propose a technique for finding lists of similar documents, and in particular replicas and near-replicas, based on a pair of signatures which take into account both the document contents and the hyperlink structure.

2. THE DOCUMENT SIGNATURES

Each document is mapped to a pair of signatures. The Random Projection (RP) signature maps the document contents to a low-dimensional real-valued vector, preserving the notion of similarity in the original space (i.e. near representations are mapped to near points in the projected space). The Hyperlink Map (HM) signature is then obtained by propagating the RP signatures through the hyperlinks between the pages, providing a method to distinguish pages also on the basis of the pages they link to. Appropriate index structures for multidimensional data (e.g. R-trees) can be used to reduce the cost of nearest-neighbor and range searches using the RP and HM signatures.
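The paper does not prescribe a particular index; as a minimal sketch under stated assumptions, the snippet below uses a k-d tree from SciPy (a different structure than the R-trees mentioned above) to run an infinity-norm range search over toy signatures. The sizes and the radius are illustrative:

    import numpy as np
    from scipy.spatial import cKDTree

    # Toy data: 1000 pages with K = 5 signature components each (sizes illustrative)
    signatures = np.random.rand(1000, 5)
    tree = cKDTree(signatures)

    # Indices of pages whose signature lies within an L-infinity ball of radius r
    # around the query, i.e. inside a hypercube of side 2r centered on the query
    matches = tree.query_ball_point(signatures[0], r=3.5e-3, p=np.inf)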

2.1 The Random Projection Signature

Recently, Random Projection has been proposed as a technique for dimensionality reduction [4]. A random projection from $N$ to $K$ dimensions is obtained by a $K \times N$ projection matrix $\Pi$ whose entries $\pi_{ij}$ may be initialized using a uniform random distribution in $[-1, +1]$. The Random Projection (RP) signature $R^p$ of the page $p$ in the repository is computed from the random projection $\hat{R}^p = \Pi D_p$ of the bag-of-words representation $D_p$ of $p$. Given the vector $\hat{R}^m$ such that $\hat{R}^m_j = \min_p \hat{R}^p_j$, $j = 1, \ldots, K$, the RP signature is obtained by normalizing the vectors $\hat{R}^p - \hat{R}^m$ in order to have each component of the signature vectors in the interval $[0, 1]$. In the experimental evaluation we investigated the effect of the dimension $K$, showing that good accuracy can be obtained using very low values of $K$. For the dataset used in the experiments we found that no substantial improvements can be achieved by choosing $K > 5$.
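A minimal sketch of this computation follows, assuming dense NumPy arrays and toy dimensions; the paper does not spell out the final rescaling step, so dividing by the global maximum is used here as one way to land in $[0, 1]$:

    import numpy as np

    N = 100_000   # bag-of-words dimension (about 2 million words in the paper)
    K = 5         # signature dimension (K > 5 gave no substantial improvement)

    rng = np.random.default_rng(0)
    # K x N projection matrix Pi with entries drawn uniformly from [-1, +1]
    Pi = rng.uniform(-1.0, 1.0, size=(K, N))

    def rp_signatures(D):
        """D: (num_pages, N) matrix of bag-of-words vectors, one row per page.
        Returns the (num_pages, K) matrix of RP signatures with components in [0, 1]."""
        R_hat = D @ Pi.T                    # random projections R_hat^p = Pi D_p
        R_hat = R_hat - R_hat.min(axis=0)   # subtract the component-wise minimum R_hat^m
        return R_hat / R_hat.max()          # rescale into [0, 1] (assumed normalization)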

2.2 The Hyperlink Map Signature

In order to encode the information related to the context of the page in the hyperlinked environment, we can define an iterative equation which combines the signatures of the neighboring pages, using a scheme similar to the computation of PageRank [2]. A simple implementation of this scheme is given by the following system of equations:

$$x_j^p(t+1) = r_j^p + \gamma \cdot \sum_{q \in \mathrm{ch}(p)} x_j^q(t) \cdot r_j^q, \qquad j = 1, \ldots, K \qquad (1)$$

where $x_j^p(t)$ is the $j$-th component of the signature of page $p$, $r_j^p$ is the $j$-th component of the RP signature of page $p$, $\gamma$ is a damping factor which modulates the effect of the signature propagation from the pages linked by page $p$, and $t$ is the iteration step. At the first iteration, $x_j^p(0) = r_j^p$. The vector $X^p(t) = [x_1^p(t), \ldots, x_K^p(t)]'$ represents the Hyperlink Map (HM) signature of page $p$ at iteration $t$. Finally, in order to obtain values in the interval $[0, 1]$, the vectors $X^p(t)$ are multiplied by $\frac{1}{x_M}$, where $x_M = \max_{j=1,\ldots,K,\; p=1,\ldots,N} x_j^p(t)$. At each iteration, equation (1) propagates the HM signatures from each node to its parents, combining both the page contents, encoded by the $r_j^q$ values, and the structure of the page neighborhood. In order to include the contribution of the descendants which are $s$ links away from a given page, the equation should be iterated at least $s$ times. Significant signatures can be computed using only a few iterations (3–5 in the experiments).
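A minimal sketch of this iteration, assuming the RP signatures are stacked in an (N, K) array and children[p] lists the pages linked by p; the value of γ is not reported in the paper, so the default below is purely illustrative:

    import numpy as np

    def hm_signatures(R, children, gamma=0.5, iterations=4):
        """R: (N, K) matrix of RP signatures; children[p] are the indices of
        the pages linked by page p (the set ch(p) in equation (1))."""
        X = R.copy()                        # x^p(0) = r^p
        for _ in range(iterations):         # 3-5 iterations sufficed in the experiments
            X_new = R.copy()
            for p, ch in enumerate(children):
                for q in ch:
                    # x_j^p(t+1) = r_j^p + gamma * sum_{q in ch(p)} x_j^q(t) * r_j^q
                    X_new[p] += gamma * X[q] * R[q]
            X = X_new
        return X / X.max()                  # rescale so every component is in [0, 1]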

3. EXPERIMENTAL RESULTS

We collected about 300,000 documents by crawling the Web. The bag-of-words representation of each page was based on a dictionary containing about 2 million words. In the experiments, two documents $d_1$ and $d_2$, having RP signatures $R^1$ and $R^2$ and HM signatures $X^1$ and $X^2$, were considered to collide with $(C, D)$ tolerance iff $\|X^1 - X^2\|_\infty < C$ and $\|R^1 - R^2\|_\infty < D$. Thus two documents collide if their RP and HM signatures lie in two hypercubes with side lengths smaller than $C$ and $D$, respectively.
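A minimal sketch of this collision test follows; the value of C below is a made-up placeholder, while D = 3.5e-3 is one of the threshold values explored in Figure 1:

    import numpy as np

    def collide(r1, r2, x1, x2, C, D):
        """(C, D)-tolerance test: both the HM and the RP signatures of the
        two pages must be close in the infinity norm."""
        return np.max(np.abs(x1 - x2)) < C and np.max(np.abs(r1 - r2)) < D

    # Brute-force grouping of colliding pages, O(N^2); a real system would
    # prune candidate pairs with the range searches of Section 2.
    def colliding_lists(R, X, C=1e-2, D=3.5e-3):
        groups = []
        for i in range(len(R)):
            group = [j for j in range(i + 1, len(R))
                     if collide(R[i], R[j], X[i], X[j], C, D)]
            if group:
                groups.append([i] + group)
        return groups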

We measured the accuracy of the algorithm in terms of recall and precision. The recall is defined as the number of detected (near-)replicas divided by the total number of (near-)replicas present in the dataset. Since it is not known a priori which documents are replicated in the dataset, we manually sampled 100 pages having at least one replica or near-replica and used this set as a reference. Thus, the recall was computed as the fraction of the documents in this set returned by the algorithm. The precision is defined as the number of documents that have a (near-)replica in the dataset divided by the total number of documents returned by the algorithm. The precision was evaluated by checking all the lists of colliding documents returned by the algorithm using the cosine correlation of the bag-of-words representations; this check is slow but highly accurate in evaluating document similarity.

[Figure 1: Precision-Recall plots. The different curves correspond to different settings of the C and D thresholds. For each curve the value of one threshold is fixed while the other varies along the curve.]

The curves in Figure 1 show how the precision and the recall vary with respect to the choice of the two thresholds $C$ and $D$. The three curves depending on $C$ show the improvement provided by the use of the HM signature for three different values of $D$. The fourth curve ($C = 1$) reports the behavior of the algorithm when only the RP signature is used. For all values of $D$, reducing the threshold $C$ increases the precision while reducing the recall. In all cases, for appropriate values of the threshold $C$, the precision is significantly improved with a negligible reduction in recall. This motivates the combined use of the RP and HM signatures. The break-even point is 90% for the RP signature alone and 93% for the combination of the RP and HM signatures.

An interesting case is that of pages related to Web forums or repositories of newsgroup messages. These sites show high redundancy because they allow different views of the same contents. The following is an example of two URLs which refer to the same message:

ask.slashdot.org/askslashdot/02/07/16/1727256.shtml?tid=127
developers.slashdot.org/article.pl?sid=02/07/16/1727256

Another example detected in the dataset consists of a list of messages ordered by date, by subject, by thread, or by author. Even though the pages differ in their internal organization and in some small details, the four pages have essentially the same informative content:

mail.gnu.org/pipermail/libtool/1999-January/date.html
mail.gnu.org/pipermail/libtool/1999-January/subject.html
mail.gnu.org/pipermail/libtool/1999-January/thread.html
mail.gnu.org/pipermail/libtool/1999-January/author.html

4. CONCLUSIONS

We proposed the use of two low-dimensional signatures to effectively detect replicas and near-replicas among a set of Web pages. The two signatures encode the page contents and the page hypertextual context, respectively. In particular, by using the HM signature, documents are compared not only on the basis of their contents but also on the basis of their hypertextual context. The experimental results show that the use of both signatures can yield high recall and precision in finding replicas and near-replicas. This approach can be extended to broader definitions of similarity between pages than replication or near-replication.

5. REFERENCES

[1] K. Bharat, A. Z. Broder, J. Dean, and M. R. Henzinger. A comparison of techniques to find mirrored hosts on the WWW. Journal of the American Society for Information Science, 51(12):1114–1122, 2000.
[2] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th World Wide Web Conference (WWW7), April 1998.
[3] J. Cho, N. Shivakumar, and H. Garcia-Molina. Finding replicated Web collections. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 355–366, 2000.
[4] S. Vempala and R. I. Arriaga. An algorithmic theory of learning: robust concepts and random projection. In 40th Annual Symposium on Foundations of Computer Science, pages 616–623, 1999.
