Exploiting Bidirectional Links: Making Spamming Detection Easier∗

Yan Zhang, Qiancheng Jiang, Lei Zhang, Yizhen Zhu

Department of Machine Intelligence, Peking University, Beijing 100871, China
Key Laboratory on Machine Perception, Ministry of Education, Beijing 100871, China

{zhy, jqc, zhangl, zhuyz}@cis.pku.edu.cn

∗Supported by NSFC under Grant No. 60673129 and 60773162, and by the 863 Program under Grant No. 2007AA01Z154.
†Corresponding author

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CIKM’09, November 2–6, 2009, Hong Kong, China. Copyright 2009 ACM 978-1-60558-512-3/09/11 ...$10.00.

ABSTRACT

Previous anti-spamming algorithms based on link structure suffer from either the weakness of the page value metric or the vagueness of the seed selection. In this paper, we propose two page value metrics, AVRank and HVRank. These two “values” of all web pages can be well assessed by using the information in bidirectional links. Moreover, with the help of bidirectional links, it becomes easier to enlarge the propagation coverage of seed sets. We further discuss the effectiveness of combinations of these two metrics, such as their quadratic mean. Our experimental results show that with these two metrics, our method can filter out spam sites and identify reputable ones more effectively than previous algorithms such as TrustRank.

Categories and Subject Descriptors
H.3.3 [Information System]: Information Storage and Retrieval - Information Search and Retrieval

General Terms
Algorithms, Experimentation

Keywords
Bidirectional Links, Seed Selection, Link Spamming

1. INTRODUCTION

Link spamming is the nastiest adversary among all the spamming artifices [5]. To combat the rampant link spamming tricks, many link-based anti-spamming techniques have been proposed. TrustRank [6] assumes that good web pages seldom point to bad ones, so good pages can propagate their trust via their outgoing links. Other approaches hold that a page should be penalized for pointing to bad pages; Parent Penalty [14] and Anti-Trust Rank [9] are typical methods of this kind. However, each of these methods uses only one direction of links, either incoming or outgoing. Besides, they deliver just a single score to judge a page’s value, which is a weakness in the fight against spammers.

The other task of our work is seed selection. Most anti-spamming methods, such as TrustRank and Parent Penalty, are kinds of biased PageRank, so the seed set is crucial for them. To the best of our knowledge, seed selection is a manual and time-consuming process [6, 14]. Moreover, none of these methods takes the propagation coverage of the seed set into consideration, although it is critical to the final result. In this paper, we study the propagation ability of seeds and demonstrate that an automatically selected large seed set is not only time-saving but also more powerful than a carefully selected small seed set.

2. RELATED WORK

Web spamming has received a lot of attention recently [1, 3, 4, 6]. Henzinger et al. identify web spamming as one of the biggest challenges to web search engines [7]. Gyöngyi et al. propose the TrustRank algorithm to combat web spam [6]. Wu et al. improve TrustRank by using topical information [16]. Wu et al. also propose an approach based on propagating negative value among pages with stricter rules [14]. Later they propose an approach that combines trust and distrust [15]. Metaxas et al. discuss an anti-propagandistic method for anti-spamming [11]. Becchetti et al. use link-based characteristics to build a classifier automatically [1]. Motivated by the heat diffusion model, Yang et al. propose a ranking algorithm called DiffusionRank [17].

3. EXPLOITING BIDIRECTIONAL LINKS

3.1 Assessing Page Values

Inspired by the HITS algorithm [8, 2], we distinguish two types of highly valuable pages: authorities and hubs. Therefore, as stated in our previous work [18], we can evaluate a page’s authority value and hub value simultaneously. The two metrics are AVRank (AV for Authority Value) and HVRank (HV for Hub Value), which can be regarded as generalizations of the authority score and hub score in the HITS algorithm. As in the TrustRank and HITS algorithms, the scores of these two metrics are split equally by the number of links. Accordingly, the two metrics can be formalized as follows:

$$\mathrm{AVRank}(p) = \sum_{q_i:\ q_i \text{ points to } p} \left[ \alpha \cdot \frac{\mathrm{AVRank}(q_i)}{ON(q_i)} + (1-\alpha) \cdot \frac{\mathrm{HVRank}(q_i)}{ON(q_i)} \right]$$

$$\mathrm{HVRank}(p) = \sum_{r_i:\ r_i \text{ pointed to by } p} \left[ \beta \cdot \frac{\mathrm{AVRank}(r_i)}{IN(r_i)} + (1-\beta) \cdot \frac{\mathrm{HVRank}(r_i)}{IN(r_i)} \right]$$

where ON(q_i) is the number of outlinks of page q_i and IN(r_i) is the number of inlinks of page r_i. The parameters α (0 ≤ α ≤ 1) and β (0 ≤ β ≤ 1) denote the ratio of each part. AVRank and HVRank reduce to the authority and hub scores in HITS when α = 0 and β = 1, so the HITS algorithm is a special case of ours. The proof that the computation converges is given in [19].
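As an illustration only (it is not taken from the paper), the two update rules of Section 3.1 can be iterated on a small directed graph as in the sketch below. The function name `av_hv_rank`, the uniform initialization, the synchronous updates, and the fixed iteration count are all assumptions made for this sketch. The seed-biased variant used for anti-spamming and the convergence argument of [19] are not reproduced here.

```python
from collections import defaultdict


def av_hv_rank(edges, alpha=0.5, beta=0.5, iterations=50):
    """Iterate the AVRank/HVRank updates on (source, target) link pairs.

    Assumptions of this sketch: uniform start values, synchronous updates,
    a fixed number of iterations, and no seed bias.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_nbrs = defaultdict(list)  # pages that a page points to
    in_nbrs = defaultdict(list)   # pages that point to a page
    for source, target in edges:
        out_nbrs[source].append(target)
        in_nbrs[target].append(source)

    av = {node: 1.0 / len(nodes) for node in nodes}
    hv = dict(av)
    for _ in range(iterations):
        new_av, new_hv = {}, {}
        for p in nodes:
            # AVRank(p): contributions from pages q that point to p,
            # each split by ON(q), the outlink number of q.
            new_av[p] = sum(
                (alpha * av[q] + (1 - alpha) * hv[q]) / len(out_nbrs[q])
                for q in in_nbrs[p]
            )
            # HVRank(p): contributions from pages r that p points to,
            # each split by IN(r), the inlink number of r.
            new_hv[p] = sum(
                (beta * av[r] + (1 - beta) * hv[r]) / len(in_nbrs[r])
                for r in out_nbrs[p]
            )
        av, hv = new_av, new_hv
    return av, hv
```

On a three-page cycle every page keeps the same score, and with α = 0 and β = 1 the update follows the HITS pattern in which authority is collected from hubs and hub value from authorities.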

3.2 “Back-and-Forth” Surfer Model

It is well known that PageRank has a good explanation through the “random surfer model” [12]. However, this model cannot depict most users’ browsing behavior nowadays. Instead of only following hyperlinks, users often bounce back with the browser’s “back button” and choose a new page randomly. Marcin Sydow uses this “random surfer with back step” model [13] to improve the PageRank algorithm. We call it the “back-and-forth” model in our research. Our new metrics match the “back-and-forth” model well. According to formula (A.4) in Appendix A of [19]:

$$\mathrm{AVRank} = \left[ \alpha M^T + (1-\alpha)\beta\, M^T N^T + (1-\alpha)(1-\beta)\beta\, M^T (N^T)^2 + \cdots \right] \cdot \mathrm{AVRank}$$

The first part, α M^T, corresponds to a random visit, which happens with probability α. The second part is (1 − α)β M^T N^T. The product of the matrices M^T and N^T is still a column-stochastic matrix, and the element (M^T N^T)[i, j] is the transition probability from page p_i to p_j by first going back to the previous page from p_i and then clicking through to p_j. According to the “back-and-forth” surfer model, the probability of this kind of visit is (1 − α)β. The remaining parts can be explained in the same way with the “back-and-forth” model.

4. SEED SELECTION

As mentioned above, seed selection is of significant importance for these biased PageRank algorithms to combat web spamming. In this section, we discuss how to select the seeds based on the web structure so as to maximize the seed propagation coverage.

4.1 Propagation Coverage

The first concern about a seed set is its propagation ability, i.e., what percentage of pages or sites can be covered from the seed set. In fact, the propagation coverage of a seed set is determined to some extent by the structure of the web graph. According to previous work [10] and our experiments, the web graph has the structure shown in Figure 1, which includes IN, OUT, SCC, Tendrils (including Tunnels) and ISLANDs. The WCC (weakly connected component) consists of all of the parts except ISLANDs. In general, carefully selecting reputable sites to form a small seed set is a widely used method. Most famous and reputable sites are in SCC. However, when we only choose sites located in SCC as seeds, the propagation coverage of the seed set is predetermined: SCC and OUT if propagating by outlinks, or SCC and IN if propagating by inlinks. Table 1 compares the different methods and indicates that bidirectional links really help. In practice, sites located in some main ISLANDs as well as those in SCC can be selected as seeds to maximize the propagation coverage.

[Figure 1: Structure of web graph. Labeled components: IN, SCC, OUT, Tendrils.IN, Tendrils.OUT, Tunnel, ISLANDs]

Table 1: Comparison of propagation coverage among different methods

Position of Seed Set   Propagation     Coverage
SCC                    Forward         SCC, OUT
IN                     Forward         SCC, OUT, some IN/Tendrils
OUT                    Forward         some OUT
SCC                    Backward        SCC, IN
IN                     Backward        some IN
OUT                    Backward        IN, SCC, some OUT/Tendrils
SCC/IN/OUT             Bidirectional   most WCC (with AVRank)
SCC/IN/OUT             Bidirectional   full WCC (AVRank & HVRank)

Furthermore, we find that seeds can be selected more easily once their location is no longer a limitation. For example, all of the sites in the .gov and .edu domains can be considered as seeds, which makes the process quite efficient. We will further demonstrate in Section 5 that such an automatically selected large seed set is more powerful than a manually selected small one.

4.2 Bad Seeds

Besides using a good seed set to propagate trust, bad seeds can also be used for propagating distrust [9]. Spam sites tend to link to each other to form an alliance, and therefore a “qualified” bad seed set is helpful for spam detection. In our previous work [20], we illustrate how to select bad seeds of “high value”. In this study, we will show that a bad seed set cannot help a large good seed set, though it does help a small one as in [20]. Thus we do not need such a seed set at all.
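The notion of propagation coverage from Section 4.1 can be made concrete with a plain reachability computation. The sketch below is an assumption-laden simplification rather than the authors' procedure. It treats a site as covered if it can be reached from some seed by following outlinks (forward), inlinks (backward), or links in either direction (bidirectional), and it ignores the actual score values. The function name `coverage` and the toy graph are hypothetical.

```python
from collections import defaultdict, deque


def coverage(edges, seeds, mode):
    """Return the set of sites reachable from the seeds.

    mode is "forward" (follow outlinks), "backward" (follow inlinks),
    or "bidirectional" (follow links in either direction).
    """
    forward = defaultdict(set)
    backward = defaultdict(set)
    for source, target in edges:
        forward[source].add(target)
        backward[target].add(source)

    covered = set(seeds)
    queue = deque(seeds)
    while queue:
        site = queue.popleft()
        neighbours = set()
        if mode in ("forward", "bidirectional"):
            neighbours |= forward[site]
        if mode in ("backward", "bidirectional"):
            neighbours |= backward[site]
        for nxt in neighbours - covered:
            covered.add(nxt)
            queue.append(nxt)
    return covered


# Toy graph: "i" is in IN, "s1" and "s2" form the SCC, "o" is in OUT,
# and "x", "y" form an island that no link connects to the rest.
toy_edges = [("i", "s1"), ("s1", "s2"), ("s2", "s1"), ("s2", "o"),
             ("x", "y"), ("y", "x")]
scc_seed_forward = coverage(toy_edges, {"s1"}, "forward")
scc_seed_both = coverage(toy_edges, {"s1"}, "bidirectional")
```

With a seed in the SCC, forward propagation reaches only the SCC and OUT, while bidirectional propagation reaches the whole weakly connected component. The island stays uncovered unless one of its own sites is a seed, which is why seeds from main ISLANDs are worth adding.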

5. EXPERIMENT

5.1 Data Set

To evaluate our method, we conducted experiments on a set of pages crawled by the Tianwang Search Engine (developed by the Network Lab, Peking Univ.). It contains 13,285,711 pages from 358,245 hosts. Since the experiments were performed at site level, this data set is slightly larger than that of WEBSPAM-UK2007 (which has 114,529 hosts). There are many tiny islands (35.56%) in our data set; however, this does not affect our analysis. To select a small seed set, the top 500 sites sorted by AVRank score and the top 500 sites sorted by HVRank score are picked out separately. The sites that appear in both groups are chosen as seed candidates. After manually evaluating these candidates, we finally select 38 out of 46 sites as our small seed set X.

Choosing the large seed set is time-saving. All 9,003 education sites and 14,663 government sites in our data set are placed into the large seed set, which is named L in the experiments. We use Kendall’s τ correlation to measure the similarity between AVRank/HVRank and PageRank when varying the values of α and β. Finally we choose α = β = 0.5, because this setting not only embodies symmetry but also has a sound visit explanation.
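For readers unfamiliar with the measure, the following is a minimal sketch of Kendall’s τ between two score assignments over the same sites. The paper does not say which variant or implementation was used. This sketch assumes the simple tau-a form, which counts concordant and discordant pairs and makes no correction for ties. The function name `kendall_tau` is hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two dicts that score the same sites.

    Assumption of this sketch: tied pairs count as neither concordant
    nor discordant, and no tie correction is applied.
    """
    sites = sorted(scores_a)
    concordant = 0
    discordant = 0
    for s, t in combinations(sites, 2):
        # Positive product: both metrics order the pair the same way.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(sites) * (len(sites) - 1) // 2
    return (concordant - discordant) / total_pairs
```

A value of 1 means the two metrics rank all sites identically and −1 means the rankings are exactly reversed. The quadratic pair loop is only practical for small samples; a data set of the size used here would need an O(n log n) implementation.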

5.2 Combining AVRank and HVRank

We consider four prevalent forms for the mean of AVRank and HVRank:

• arithmetic mean
• geometric mean
• harmonic mean
• quadratic mean (root mean square)

The Kendall’s τ correlation coefficient between each mean and PageRank is considered. As shown in Table 2, the result of the arithmetic mean is very close to PageRank, while the other three are at almost the same level. When manually checking the top results, we find that the result of the quadratic mean is better than the other three. Therefore the quadratic mean is chosen as our method of combining AVRank and HVRank in the following experiments.

Table 2: Kendall’s τ correlation between different mean values and PageRank

              arithmetic   geometric   harmonic   quadratic
Kendall’s τ   0.95         0.51        0.52       0.50

5.3 The Propagation Coverage of Two Good Seed Sets

The propagation coverage of TrustRank and our method with the two seed sets is given in Table 3. Sn(s) denotes the number of sites that can be covered from seed set s. Compared with the propagation coverage using the small seed set X, the improvement from using L is not so remarkable. The reason is that there are many tiny islands in our data set, which inevitably limit the propagation of seeds.

Table 3: Propagation coverage with X and L

Method           Sn(X) (Percentage)   Sn(L) (Percentage)
TrustRank        110769 (30.92%)      123383 (34.44%)
AVRank           132427 (36.97%)      142605 (39.81%)
HVRank           193986 (54.15%)      202078 (56.41%)
Quadratic Mean   230857 (64.44%)      238015 (66.44%)

5.4 Anti-Spamming

In order to compare anti-spamming ability, a set of sample sites with known types is needed. We settle on a sample of 1000 sites, which gives enough information while remaining manageable. Using a method similar to that in TrustRank [6], a list of sites in descending order of their PageRank scores is generated and segmented into 20 buckets. Each bucket contains a different number of sites, with scores summing up to 5 percent of the total PageRank score. We construct a sample set of 1000 sites by selecting 50 sites at random from each bucket. Then we perform a manual evaluation to determine their categories. Each site is classified into one of the following categories: reputable, spam, pure directory, and personal blog. A sample site that uses any spamming technique is put in the spam category. Unavailable sites are discarded and substitutes are reselected in the sampling process. The distribution of our sample is presented in Figure 2.

[Figure 2: Distribution of categories in the evaluation sample. Categories shown: Spam, Blog, Directory, Reputable]

5.4.1 The Demotion of Spam Sites and Non-spam Sites with Different Seed Sets

For clear comparison, we show the demotion of spam sites and non-spam sites with different seed sets when using the AVRank score. As Figure 3 shows, spam sites are demoted much more than non-spam sites no matter which seed set is used. There is even an obvious division at value 4 between spam sites and non-spam sites. The large seed set L performs much better than the small seed set X: the demotion of spam sites is always above 4 when using L.

[Figure 3: Bucket-level demotion of AV-Rank scores. Axes: Bucket, Demotion. Series: Good Site (small seed set), Good Site (large seed set), Spam Site (small seed set), Spam Site (large seed set)]

5.4.2 Effectiveness of a Large Seed Set

In this section, we demonstrate the results of anti-spamming using seed set L. From Figure 4, we can see that:

(1) All of the methods exhibit good performance in the first 6 buckets. Most spam sites are demoted to the last 5 buckets.

(2) Our method is better than TrustRank in most of the first 9 buckets. The Quadratic Mean in particular has an excellent result, with 100% non-spam sites in the first 6 buckets.
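The bucket construction used throughout Section 5.4 (sites sorted by descending score, then cut into buckets that each hold about an equal share of the total score) can be sketched as follows. This is an illustrative reading of the description, not the authors' code. The function names, the rule that a site goes into the bucket where the running total stood before it was added, and the fixed random seed are all assumptions.

```python
import random


def make_buckets(scores, num_buckets=20):
    """Split sites, sorted by descending score, into equal-mass buckets.

    Each bucket collects roughly 1/num_buckets of the total score, so
    high-score buckets hold few sites and low-score buckets hold many.
    Boundary handling is an assumption of this sketch.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    target = sum(scores.values()) / num_buckets
    buckets = [[] for _ in range(num_buckets)]
    cumulative = 0.0
    for site, score in ranked:
        index = min(int(cumulative / target), num_buckets - 1)
        buckets[index].append(site)
        cumulative += score
    return buckets


def sample_sites(buckets, per_bucket=50, seed=0):
    """Draw up to per_bucket sites uniformly at random from every bucket."""
    rng = random.Random(seed)
    return [rng.sample(bucket, min(per_bucket, len(bucket))) for bucket in buckets]
```

With 20 buckets and 50 sites drawn per bucket this yields the 1000-site sample described in Section 5.4, provided every bucket holds at least 50 sites.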

5.4.3 With a Mixed Seed Set

In this part, a bad seed set B is combined with the large seed set L. B consists of 43 bad sites, selected with the same method as in [20]. The initial values of these bad seeds are equal and sum to −1. The result of employing such a mixed seed set is shown in Figure 5. Unfortunately, there is no improvement in terms of combating spam. The main reason is that our algorithm with a large seed set can do the job well enough. In this scenario, combining it with a bad seed set, especially a small one, does not help. However, we believe that a large bad seed set of “high quality” would be useful for combating spam.

[Figure 4: Good sites in Trust-Rank, AV-Rank and Quadratic Mean buckets with L. Panels: Left: TrustRank; Middle: AVRank; Right: Quadratic Mean. Horizontal axis: buckets 1 to 20; vertical axis: 0% to 100%]

[Figure 5: Good sites in AV-Rank Buckets with L and B]
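The initialization of the mixed seed set in Section 5.4.3 can be illustrated with a short sketch. Only the bad-seed rule (equal values that sum to −1) comes from the text. The good-seed rule used here (equal values that sum to +1, in the style of a uniform TrustRank seed distribution) is an assumption, as is the function name `mixed_seed_vector`.

```python
def mixed_seed_vector(sites, good_seeds, bad_seeds):
    """Build initial scores for a run that mixes good and bad seeds."""
    overlap = set(good_seeds) & set(bad_seeds)
    if overlap:
        raise ValueError(f"sites listed as both good and bad: {sorted(overlap)}")

    good_value = 1.0 / len(good_seeds)  # assumption: good seeds share +1 equally
    bad_value = -1.0 / len(bad_seeds)   # from the text: equal values summing to -1

    vector = {site: 0.0 for site in sites}  # non-seed sites start at zero
    for site in good_seeds:
        vector[site] = good_value
    for site in bad_seeds:
        vector[site] = bad_value
    return vector
```

With the 43 bad seeds of B, each bad seed would start at −1/43 under this rule.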

6. CONCLUSION

In this paper, we propose two new metrics, AVRank and HVRank, to measure a page’s value. By exploiting bidirectional links, these two metrics can work more effectively. We also analyze the shortcomings of small seed sets and demonstrate that using a large seed set achieves better performance. In addition, choosing a large number of seeds automatically is actually easier than choosing a small number of seeds manually and carefully. The experimental results show that our proposed method performs better than TrustRank, especially with the quadratic mean of AVRank and HVRank. AVRank and HVRank can also be used for purposes other than anti-spamming. For example, we find that ranking Wikipedia pages with these two metrics gives more reasonable results. We believe they can also be used in common page ranking after removing the irrelevant links. These problems are left as our future work.

7. REFERENCES

[1] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based characterization and detection of web spam. In AIRWeb ’06, Seattle, WA, USA, August 2006.
[2] A. Borodin, G. O. Roberts, J. S. Rosenthal, and P. Tsaparas. Finding authorities and hubs from link structures on the world wide web. In WWW ’01, pages 415–429, Hong Kong, 2001.
[3] G. Geng, Q. Li, and X. Zhang. Link based small sample learning for web spam detection. In WWW 2009, Madrid, Spain, April 2009.
[4] Z. Gyöngyi and H. Garcia-Molina. Link spam alliances. In VLDB ’05, pages 517–528, Trondheim, Norway, 2005.
[5] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In AIRWeb ’05, Chiba, Japan, 2005.
[6] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with trustrank. In VLDB ’04, pages 576–587, Toronto, Canada, 2004.
[7] M. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. SIGIR Forum, 36(2), 2002.
[8] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[9] V. Krishnan and R. Raj. Web spam detection with anti-trust rank. In AIRWeb ’06, Seattle, WA, USA, August 2006.
[10] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. The Web as a graph. In Proceedings of 19th PODS, pages 1–10, Dallas, USA, 2000.
[11] P. T. Metaxas and J. DeStefano. Web spam, propaganda and trust. In AIRWeb ’05, Chiba, Japan, May 2005.
[12] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford University, 1998.
[13] M. Sydow. Random surfer with back step. In WWW ’04 on alternate track papers and posters, pages 352–353, New York, NY, USA, 2004.
[14] B. Wu and B. D. Davison. Identifying link farm spam pages. In WWW ’05, pages 820–829, Chiba, Japan, 2005.
[15] B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote web spam. In Proceeding of Models of Trust for the Web (MTW’06), Edinburgh, Scotland, 2006.
[16] B. Wu, V. Goel, and B. D. Davison. Topical trustrank: using topicality to combat web spam. In WWW ’06, pages 63–72, Edinburgh, Scotland, 2006.
[17] H. Yang, I. King, and M. R. Lyu. Diffusionrank: a possible penicillin for web spamming. In SIGIR ’07, pages 431–438, Amsterdam, The Netherlands, 2007.
[18] L. Zhang, Y. Zhang, Y. Zhang, and X. Li. Exploring both content and link quality for anti-spamming. In CIT ’06, Seoul, Korea, 2006.
[19] Y. Zhang, Q. Jiang, L. Zhang, and Y. Zhu. Deeply exploiting link structure: Setting a tougher life for spammers. Technical report, Peking University, 2009. http://www.cis.pku.edu.cn/faculty/system/zhangyan/papers/CPV.pdf.
[20] L. Zhao, Q. Jiang, and Y. Zhang. From good to bad seeds: Making spam detection easier. In CIT ’08, Sydney, Australia, 2008.