Link Based Spam Algorithms in Adversarial Information Retrieval

Alex Goh Kwang Leng, Ravi Kumar P, Ashutosh Kumar Singh, Anand Mohan*
Department of Electrical and Computer Engineering, Curtin University, Sarawak Campus, Miri, Malaysia.
[email protected], [email protected], [email protected]
*National Institute of Technology, Kurukshetra, India
[email protected]

Web spam has become one of the most pressing challenges and threats to Web search engines. The contest between search systems and those who try to manipulate them gave rise to the field of adversarial information retrieval. In this paper, we set up several experiments comparing HostRank and TrustRank to show how effectively TrustRank combats Web spam, and we also report a comparison of different link based Web spam detection algorithms.

Keywords: Adversarial Information Retrieval, Web Spam, Link Based Spam Algorithms, HostRank, TrustRank

INTRODUCTION

Web search engines play an important role in retrieving information on the World Wide Web. Web users generally see only the top few pages of the search engine results. This drives content providers to use improper methods to get their sites ranked at the top of the results. Such practices have evolved into Web spam, also known as spamdexing (Gyöngyi et al. 2006), i.e. achieving higher rankings in search engine results than the rankings deserved. Manual spam detection is expensive, slow and difficult to automate. Understanding spamming techniques and countering them is critical to the success of search engines.


Perkins (2001) published a white paper on the classification of search engine spam; Henzinger et al. (2003) identified Web spam as one of the most important challenges to Web search engines and showed its impact. According to Gyongyi et al. (2005), there are two kinds of spamming techniques: term spamming and link spamming. Term spamming includes content spamming and meta spamming. Much research has been done to combat the different kinds of Web spam, but in this paper we focus on link spamming. Gyongyi et al. (2006), Fetterly et al. (2004), Wu et al. (2005), Drost et al. (2005) and Becchetti et al. (2006) did extensive research on identifying Web spam pages based on link structure, and new link based detection algorithms are constantly being proposed. Gyongyi et al. (2004) proposed TrustRank, which is based on the intuition that good sites seldom point to spam sites and that trust propagates through the link structure of the Web. Derivatives of TrustRank have since been proposed and shown to be more effective, such as Anti-TrustRank by Krishnan et al. (2006), Topical TrustRank by Wu et al. (2006), DiffusionRank by Yang et al. (2007) and Link Variable TrustRank by Chen et al. (2008). Other related spam detection algorithms, such as SpamRank proposed by Benczur et al. (2005), ParentPenalty by Wu et al. (2005), Truncated PageRank by Becchetti et al. (2006) and DirichletRank by Wang et al. (2008), are briefly discussed. For our experiments, we implemented HostRank, described by Eiron et al. (2004) and Arasu et al. (2002), compared it with the TrustRank algorithm (Gyöngyi et al. 2004) on a large dataset, and show how effectively TrustRank filters out spam sites. This paper is organized as follows: Section 2 introduces various link based spam detection algorithms, Section 3 presents the experimental results and Section 4 concludes the paper with a couple of future directions.

LINK BASED SPAM DETECTION ALGORITHMS

There are a number of Web spam detection algorithms. In this section we describe the important link based spam detection algorithms.

A. TrustRank


Yang et al. (2007) noted that TrustRank has a strong theoretical relation to PageRank. The algorithm semi-automatically separates reputable good pages from spam, and trust flows through the link structure of the good pages to identify additional good pages. The intuition behind TrustRank is that good pages seldom point to bad pages. TrustRank starts by selecting seeds. Seed selection is done by applying inverse PageRank to the dataset; the rationale is to obtain pages that would be most useful for identifying additional pages. Another heuristic is to pick pages with high PageRank: a high PageRank page is most probably pointed to by other high PageRank pages, so it also propagates trust further. The results are ranked in descending order, and the good pages among the top L pages are chosen as the good seed set, because trust flows only from the good seed set. TrustRank then normalizes the distribution vector and applies equation (1), which is similar to PageRank with some minor changes:

t* = α · T · t* + (1 − α) · d    (1)

In equation (1), α is the decay factor, usually set to 0.85, T is the transition matrix, and d is the distribution vector after normalization. Just like PageRank, this is an iterative algorithm, run for M iterations.
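To make the iteration concrete, the following is a minimal Python sketch of the biased PageRank computation in equation (1). It is not the authors' implementation; the transition matrix T is assumed to be column-stochastic (T[i, j] = 1/outdegree(j) when page j links to page i), and seed_idx is assumed to hold the indices of the hand-verified good seeds.

```python
import numpy as np

def trustrank(T, seed_idx, alpha=0.85, iters=50):
    """Biased PageRank from equation (1): t* = alpha*T*t* + (1-alpha)*d."""
    n = T.shape[0]
    d = np.zeros(n)
    d[seed_idx] = 1.0 / len(seed_idx)   # normalized static score distribution
    t = d.copy()                        # start the computation at the seeds
    for _ in range(iters):              # M iterations, as in the paper
        t = alpha * (T @ t) + (1 - alpha) * d
    return t
```

On the toy graph of Figure 1 below, running this sketch with the single seed F should leave pages not reachable from F with near-zero trust, which is the behaviour the figure illustrates.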

Figure 1. Simple Web Graph with PageRank and TrustRank results

Assuming a decay factor α of 0.85, M = 50 iterations and L = 3 with s+ = {F} and s− = {D, E}, Figure 1 illustrates the results of both PageRank (upper, non-bold) and TrustRank (lower, bold). Good page F propagates trust to pages A, B and C, so they receive high TrustRank values, while pages D and E receive low values. Page F is promoted for being a good page, while pages D and E are punished for being bad pages.

B. Derivatives of TrustRank Algorithms

a. Anti-TrustRank

Figure 2. Simple Web Graph with Good Pages (blue) and Bad Pages (red)

The Anti-TrustRank algorithm uses the same approximate isolation principle as TrustRank, but anti-trust is propagated in the reverse direction, along incoming links, from a seed set of spam pages. A page is categorized as a spam page if its Anti-TrustRank score exceeds a given threshold. For example, in Figure 2, assuming page A is in the spam seed set, anti-trust would propagate to page B, and from page B on to pages C and F; any of these pages whose score exceeds the threshold is marked as spam. First, Anti-TrustRank evaluates the dataset with the PageRank algorithm and selects a seed set of spam pages with high PageRank; spam pages with high PageRank are most likely pointed to by other spam pages with high PageRank. This way, Anti-TrustRank achieves fast reachability and earlier detection of high PageRank spam pages. Next, Anti-TrustRank runs the biased PageRank algorithm on the transposed matrix representing the Web graph, seeded with the spam set. Finally, pages are ranked in descending order by score to estimate the spam content: pages with scores greater than the given threshold are marked as spam. Anti-TrustRank is able to report that pages from which its seed set can be reached over short paths are untrustworthy. The authors also found that the average rank of spam pages calculated by Anti-TrustRank is higher than that calculated by TrustRank. In summary, Anti-TrustRank has the added benefit of returning spam pages with high precision. The intuition is that by starting from seed spam pages with high PageRank, walking backward should lead to a good number of further high PageRank spam pages.

b. Topical TrustRank

The seed selection function in the TrustRank algorithm has a bias towards certain communities. The Web consists of large repositories covering many different topics, and the seed set used by TrustRank does not cover every topic that exists on the Web. To address these issues, inspired by Topic Sensitive PageRank (Haveliwala 2003), Wu et al. (2006) proposed Topical TrustRank, which uses topical information to partition the seed set and calculates a trust score for each topic separately. Given a seed set, Topical TrustRank divides it into partitions corresponding to the topics. The governing equation is given in (2):

( Σ_{i=1}^{n} m_i ) · t = Σ_{i=1}^{n} ( m_i · t_i )    (2)

This equation is a version of the Linearity theorem proved by Jeh and Widom (2003). Assume we have a seed set T. It can be partitioned into n subsets T_1, T_2, …, T_n, where subset T_i contains m_i (1 ≤ i ≤ n) seeds. Let t denote the TrustRank scores calculated using T as the seed set and t_i (1 ≤ i ≤ n) the TrustRank scores calculated using T_i as the seed set. The equation states that the product of the TrustRank score and the total number of seeds equals the sum, over the partitions, of the product of each partition-specific score and the number of seeds in that partition. The equation can be transformed into:

t = Σ_{j=1}^{n} ( m_j / Σ_{i=1}^{n} m_i ) · t_j    (3)

The authors introduced two techniques, called simple summation and quality bias, to combine the generated topical trust scores into a single measure of trust for a page. Simple summation adds up the per-topic trust scores to produce the Topical TrustRank score. Quality bias, on the other hand, also takes into account the average PageRank value of the seed pages of each community. The authors further proposed three seed selection improvements for the Topical TrustRank algorithm: seed weighting, seed filtering and finer topic hierarchy. In seed weighting, each node is assigned a constant value proportional to its quality; in other words, some seed pages carry more trust than others. In seed filtering, the quality of a page is measured using PageRank or Topical TrustRank scores, and low quality seed pages, which might include spam pages, are filtered out to improve performance. For finer topic hierarchy, topic directories usually provide a tree structure for each topic; involving finer topics makes the calculation expensive, but a finer topic hierarchy would be ideal for categorizing the Web. Since simple summation involves a tradeoff, the authors experimented with quality bias and with the combination of seed weighting, seed filtering and finer topic hierarchy. Topical TrustRank provided a reduction of 19% – 43.1% in spam sites compared to TrustRank.
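As an illustration of the partitioned computation in equations (2) and (3), here is a hedged Python sketch of Topical TrustRank built on top of the trustrank sketch above. The seeds_by_topic mapping and the optional pagerank vector are assumed inputs: with pagerank supplied, the combination applies the quality bias weighting, otherwise it falls back to simple summation.

```python
import numpy as np

def topical_trustrank(T, seeds_by_topic, pagerank=None, alpha=0.85, iters=50):
    """Run one TrustRank per topic partition of the seed set, then combine
    the per-topic scores into a single trust value per page."""
    combined = np.zeros(T.shape[0])
    for topic, seeds in seeds_by_topic.items():
        t_topic = trustrank(T, seeds, alpha, iters)  # sketch defined earlier
        if pagerank is None:
            combined += t_topic                      # simple summation
        else:
            # quality bias: weight each topic by its seeds' mean PageRank
            combined += np.mean(pagerank[seeds]) * t_topic
    return combined
```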

c. DiffusionRank

Motivated by the structure of the Web and the phenomenon of heat diffusion, Yang et al. (2007) proposed DiffusionRank, a generalization of PageRank that additionally has the ability to reduce the effect of link manipulation. Heat diffusion is a physical phenomenon in which heat always flows from positions of high temperature to positions of low temperature. The authors identified two respects in which PageRank is susceptible to Web spam: it is over-democratic and input-independent. The belief behind PageRank is that all pages are born equal: every page has a vote summing to one. Over-democracy becomes a problem when a large number of new pages point to a page, since all new pages have the right to vote. As for input-independence, PageRank is an iterative algorithm that converges to the same result regardless of its starting vector; this makes it impossible to choose an input that avoids Web spam, such as large values for trusted pages and small or even negative values for spam sites. The heat diffusion model has the advantage of avoiding both the over-democratic and the input-independent properties of PageRank. The authors therefore proposed DiffusionRank to view the Web from another perspective and calculate ranking values accordingly. The DiffusionRank iteration is defined in (4):

h ← (1 − γ/M) · h + (γ/M) · ( α · A · h + (1 − α) · (1/n) · 1 )    (4)
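The following Python sketch shows one possible reading of the discrete iteration in (4); since the published formula did not survive extraction cleanly, the exact update rule here is an assumption. A is assumed column-stochastic, and h0 carries the initial heat (unit heat on trusted pages, zero elsewhere).

```python
import numpy as np

def diffusionrank(A, h0, gamma=1.0, alpha=0.85, M=100):
    """Discrete heat diffusion: each of the M small steps moves a gamma/M
    fraction of the heat along the random-surfer operator
    alpha*A + (1-alpha)*(1/n)*1; the rest of the heat stays in place."""
    n = A.shape[0]
    h = h0.astype(float).copy()
    for _ in range(M):
        surfer = alpha * (A @ h) + (1 - alpha) * h.sum() / n
        h = (1 - gamma / M) * h + (gamma / M) * surfer
    return h
```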

DiffusionRank has four advantages: two closed forms, group-group relations, graph cut and anti-manipulation. The two closed forms are a discrete form and a continuous form; the former has the advantage of fast computation while the latter can be analyzed easily from a theoretical standpoint. DiffusionRank can detect group-to-group relations easily because the amount of heat flowing from one group to another is easy to interpret. It can also partition the Web graph into communities by assigning positive and negative values among the communities. Lastly, DiffusionRank reduces the effect of link manipulation because trusted Web pages are assigned unit heat while all others are assigned zero heat; the authors claim a manipulated Web page will keep a low rank until it is pointed to by several good pages.

d. Link Variable TrustRank

Chen et al. (2008) proposed the Link Variable TrustRank algorithm (LVTrustRank), which combines the idea of using "bursts" of linking activity as a suspicious signal (Shen et al. 2006) with the original TrustRank algorithm described earlier. When the link structure of a spam site changes drastically within a short period of time, LVTrustRank uses this as an opportunity to measure trust from the variance of the link structure and detect spam sites. Spammers tend to add links to the pages they want to promote, so Shen et al. (2006) introduced the inlink growth rate (IGR), the ratio of the number of newly acquired incoming links of a site to its number of original incoming links. The metric is defined in (5):

IGR = ( |S_in(t_1)| − |S_in(t_0) ∩ S_in(t_1)| ) / |S_in(t_0)|    (5)


Here t_0 and t_1 are two points in time, S_in(t_0) is the set of inlinks of a site at time t_0 and S_in(t_1) is the set of its inlinks at time t_1. IGR is a good indicator of the variance of a spam site's inlink structure. LVTrustRank computes TrustRank scores t*_1 and t*_2 at the two points in time and uses IGR as the ratio of the variance of the link structure. A joint formula computes the final trust score across the two snapshots, defined in (6):

t*_f = ( t*_1 + t*_2 · (1/IGR) ) / 2    (6)
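A small Python sketch of the two metrics follows. The set-based IGR computation mirrors equation (5); the combination in lv_trust follows our reconstruction of equation (6) above, which should be treated as an assumption since the published formula was garbled in extraction.

```python
def inlink_growth_rate(inlinks_t0, inlinks_t1):
    """Equation (5): newly gained inlinks at t1, relative to the number of
    inlinks the site had at t0. Inputs are sets of source-site ids."""
    gained = len(inlinks_t1 - inlinks_t0)
    return gained / len(inlinks_t0)

def lv_trust(t_star_1, t_star_2, igr):
    """Reconstructed equation (6): average the two TrustRank snapshots,
    discounting the later one by 1/IGR so that bursty inlink growth
    (large IGR) lowers the final trust score. Requires igr > 0."""
    return (t_star_1 + t_star_2 * (1.0 / igr)) / 2.0
```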

LVTrustRank performs well at detecting Web spam based on the variance of the link structure. However, some spam sites never change their link structure, and LVTrustRank cannot detect these. Nevertheless, the idea introduced by Shen et al. (2006) of using the variance of link structure to detect spam can be explored further.

C. Other Spam Detection Algorithms

BadRank (eFactory 2002) works as the opposite of PageRank, based on the "linking to bad neighbors" principle: a page's BadRank is high if it points to other pages with high BadRank. Instead of using incoming links like PageRank, BadRank uses the outgoing links of a page. Liang et al. (2007) proposed R-SpamRank which, to the best of our knowledge, works like the opposite of TrustRank. R-SpamRank stands for reverse spam rank; it initially uses a blacklist of spam Web pages as seeds and then expands it by applying a formula similar to BadRank. The authors claim the algorithm is ideal for detecting spam pages in a link farm. Benczúr et al. (2005) assume that the distribution of the incoming links of a trustworthy page should not be overly concentrated and should follow a power law distribution. On this basis they proposed SpamRank, which penalizes pages whose PageRank share is suspicious and personalizes PageRank on the penalties. Wu et al. (2005) introduced a technique to identify link farm spam pages consisting of three steps: generation, expansion and ranking. First the algorithm generates a spam seed set from pages' common incoming and outgoing links. The authors then proposed ParentPenalty to expand the seed set, on the assumption that if a page points to a bunch of bad pages, it is likely a bad page itself. Lastly, the authors rank the Web graph after down-weighting the corresponding elements of the adjacency matrix. Truncated PageRank (Becchetti et al. 2008) introduces a damping function, in place of the damping factor, for ranking a large Web graph. Link farms can promote pages easily by simply pointing to them, so the intuition is to use a damping function that ignores the direct contribution of the first levels of links. Truncated PageRank is able to demote spam by decreasing the contribution of neighbors that are topologically close to the targeted page (a sketch of this truncation appears at the end of this section). Wang et al. (2008) identified the "zero-out-link" and "zero-one gap" problems in PageRank, which could potentially be exploited to manipulate PageRank results through link spamming. The authors proposed DirichletRank, based on Bayesian estimation with a Dirichlet prior, and proved it more resistant and more stable under link perturbation. Zhang et al. (2009) explored bidirectional links and proposed two page value metrics, AVRank and HVRank, to ease spam detection. AVRank and HVRank are inspired by TrustRank and the HITS algorithm, and also expand the seed set for trust propagation. The authors also showed that a large, automatically identified seed set works better than a manually identified one.
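To make the Truncated PageRank idea above concrete, here is a hedged Python sketch of a damping function that zeroes out the first few link levels. It follows the path-expansion view of PageRank, PR = Σ_t damping(t) · (mass after t steps), and is a simplification rather than the authors' exact formulation (their normalization constant is omitted).

```python
import numpy as np

def truncated_pagerank(T, trunc=2, alpha=0.85, iters=50):
    """Accumulate the path-length expansion of PageRank but give zero
    damping to paths of length <= trunc, so directly created link-farm
    neighbours contribute nothing to the score."""
    n = T.shape[0]
    walk = np.full(n, 1.0 / n)       # uniform start: mass after 0 steps
    score = np.zeros(n)
    for t in range(1, iters + 1):
        walk = T @ walk              # probability mass after t steps
        if t > trunc:                # damping(t) = 0 for t <= trunc
            score += (1 - alpha) * alpha**t * walk
    return score
```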

EXPERIMENTS

In this section, we demonstrate how TrustRank outperforms HostRank in combating spam sites. Before presenting the results, we discuss the data, measurements and algorithms.

A. Dataset

To evaluate the algorithms, experiments were performed on the WEBSPAM-UK2006 dataset (Castillo et al. 2006) provided by the Laboratory of Web Algorithmics, Università degli Studi di Milano, with the support of the DELIS EU-FET research project. The original dataset consists of more than 77 million pages and over 3 billion edges. However, we consider only the hostgraph, which has 11401 sites and 730k edges. The labeling of the host sites was done by a group of volunteers and is given in two sets: set 1 was released in July 2006 and set 2 in June 2007. The labels assigned are "normal", "spam", "borderline" or "cannot classify". The distribution of the labels from the combined sets is shown in Figure 3.

Figure 3. Distribution of the number of pages

For our experiments, we removed all sites labeled "borderline" and "cannot classify", because those sites might decrease the accuracy of the results. The resulting hostgraph consists of 10254 sites and 637k edges. In what follows, "normal" sites are referred to as good sites and "spam" sites as bad sites, as shown in Figure 3. The dataset is sufficient to compare how HostRank and TrustRank combat Web spam. In the next section, we describe the measurements used to evaluate our experiments.

B. Measurements

The measurements carried out in this paper are:

• Pairwise orderedness
• Precision, Recall and F-measure
• Percentage of good and bad sites in each bucket
• Propagation coverage
• Comparison of other algorithms

We use pairwise orderedness, introduced by Gyongyi et al. (2004), to evaluate HostRank and TrustRank with respect to the ordered trust property. We arrange the sites by their rank results in descending order and compute the pairwise orderedness for the top n pages, progressively increasing n until we reach the pairwise orderedness of all host sites (a minimal sketch of this computation is given at the end of this section). The second measurement is precision and recall, introduced by Baeza-Yates and Ribeiro-Neto (1999). Precision measures relevance whereas recall measures completeness; both metrics are widely used for evaluating the quality of retrieved documents. We also include the F-measure, the harmonic mean of precision and recall. The third measurement evaluates the percentage of good and bad sites in each HostRank and TrustRank bucket. The sites are arranged by rank in descending order and distributed equally into 20 buckets, with the 20th bucket taking any remaining sites, and we then compute the ratio of good sites to bad sites in each bucket. Note that the number of sites in each bucket is the same for HostRank and TrustRank. The fourth measurement shows the propagation coverage of the TrustRank algorithm on different seed sets, i.e. how widely the trust has propagated. We run the measurements on TrustRank with seed sets of three sizes (50, 75 and 100) to show that the algorithm works better with a larger seed set. Finally, we compare several more Web spam detection algorithms based on their seed selection and briefly explain their trust or distrust propagation. The algorithms compared are TrustRank, ParentPenalty, Anti-TrustRank, Topical TrustRank, R-SpamRank, DiffusionRank and Link Variable TrustRank.
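For the first measurement, a minimal Python sketch of the pairwise orderedness computation follows; the O(n²) pair enumeration is for clarity rather than efficiency, and the exact tie-handling in Gyongyi et al. (2004) may differ.

```python
from itertools import combinations

def pairwise_orderedness(ranked_labels):
    """ranked_labels[i] is 1 for a good site, 0 for a bad site, with
    position i being the i-th best-ranked site. A good/bad pair is
    violated when the bad site is ranked above the good one."""
    pairs = violations = 0
    for upper, lower in combinations(ranked_labels, 2):
        if upper != lower:                  # same-label pairs impose no order
            pairs += 1
            if upper == 0 and lower == 1:   # bad site ranked above good site
                violations += 1
    return 1.0 - violations / pairs if pairs else 1.0
```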

C. Methodology

HostRank is described in detail by Eiron et al. (2004) and Arasu et al. (2002). HostRank is very similar to the PageRank algorithm (Brin and Page 1998), except that instead of ranking individual pages it operates on host sites. For example, if A and B are host sites and C, one of the subsections of B, points to A, we simply say that B points to A. The HostRank algorithm is defined in (7).

HR(A) = (1 − d)/N + d · ( HR(T_1)/C(T_1) + … + HR(T_n)/C(T_n) )    (7)

HR(A) is the HostRank of host A, HR(T_i) is the HostRank of host T_i which points to A, C(T_i) is the number of outgoing links of T_i, N is the number of host sites and d is a damping factor which can be set between 0 and 1 (usually 0.85). HostRank is an iterative process that calculates the HostRank values of all host sites. Compared to PageRank, HostRank has a computational advantage and is more resistant to rank manipulation.
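As an illustration of the page-to-host collapse described above, the following Python sketch builds a hostgraph from page-level links; the URL-based host extraction is an assumed convention, not the authors' preprocessing code.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def collapse_to_hostgraph(page_edges):
    """Turn page-level edges (src_url, dst_url) into host-level edges:
    any page of host B linking to any page of host A becomes B -> A,
    as in the example where B's subsection C pointing to A is treated
    as B pointing to A. Links within the same host are dropped."""
    host_edges = defaultdict(set)
    for src_url, dst_url in page_edges:
        src_host = urlsplit(src_url).netloc
        dst_host = urlsplit(dst_url).netloc
        if src_host != dst_host:
            host_edges[src_host].add(dst_host)
    return host_edges
```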

Figure 4. TrustRank Algorithm

The other algorithm used in the experiments is the TrustRank algorithm developed by Gyöngyi et al. (2004), shown in Figure 4. First, the algorithm selects trustworthy seeds using inverse PageRank: the sites with the highest inverse PageRank scores are the ones most likely to propagate trust with the widest coverage, which helps identify additional good pages. Human experts then manually evaluate good and bad sites as a replacement for the Oracle function. A static score distribution vector is produced and normalized, and the TrustRank scores are computed. In our experiments, the decay factor α for both HostRank and TrustRank is set to 0.85 and both algorithms run for 50 iterations. Furthermore, TrustRank is run with 50, 75 and 100 good seeds to show that it is more effective with a larger seed set.
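The seed selection step of Figure 4 can be sketched as PageRank on the reversed graph. This is a minimal illustration, assuming T_rev is the column-stochastic transition matrix of the Web graph with every link direction flipped; the top-L candidates would then go to the human assessor standing in for the Oracle function.

```python
import numpy as np

def select_seed_candidates(T_rev, L=100, alpha=0.85, iters=50):
    """Inverse PageRank: pages that can reach many other pages score high
    on the reversed graph, making them good candidates for trust seeds."""
    n = T_rev.shape[0]
    u = np.full(n, 1.0 / n)
    s = u.copy()
    for _ in range(iters):
        s = alpha * (T_rev @ s) + (1 - alpha) * u
    return np.argsort(-s)[:L]   # indices of the L highest-scoring candidates
```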

D. Experimental Results


Figure 5. Pairwise Orderedness

Figure 6. Precision on HostRank and TrustRank algorithms (50, 75 and 100 seeds)

Figure 7. Recall on HostRank and TrustRank algorithms (50, 75 and 100 seeds)


Figure 8. F-measure on HostRank and TrustRank algorithms (50, 75 and 100 seeds)

Figure 5 displays the pairwise orderedness of HostRank and of TrustRank with 50, 75 and 100 seeds. At the end of the curves, HostRank reaches a pairwise orderedness of 0.886, while TrustRank reaches 0.892 (50 seeds), 0.897 (75 seeds) and 0.902 (100 seeds). High pairwise orderedness at an early stage is important because Web users only consider the top pages of the results; the results show that TrustRank outperforms HostRank, and among the TrustRank runs the one with 100 seeds has the highest pairwise orderedness. Figures 6, 7 and 8 illustrate the precision, recall and F-measure of HostRank and of TrustRank with 50, 75 and 100 good seeds. TrustRank heavily dominates HostRank on all three measurements, especially precision. Web users only look at the first few pages returned by a search engine, so it is important to return relevant results early. At the 10th bucket, TrustRank achieves a precision above 0.9; among the TrustRank runs, the one with the largest seed set (100 seeds) dominates, with 0.946 precision, 0.548 recall and 0.694 F-measure at the 10th bucket. Figure 9 compares HostRank and TrustRank on the ratio of good to bad sites in each bucket. The dark green bars denote the good sites under HostRank while the dark blue bars denote those under TrustRank with 50 seeds; the empty space above each bar represents the bad sites in that bucket. As observed, TrustRank effectively increases the rank values of good sites, placing more good sites in the first 4 buckets. The 4th HostRank bucket contains as much as 43% bad sites. Returning good results in the early buckets matters because those buckets hold the most relevant results.

Figure 9. Percentage of good sites in HostRank and TrustRank (50 seeds) buckets

Figure 10. Percentage of bad sites in HostRank and TrustRank (50 seeds) buckets

Figure 10 compares the percentage of bad sites under the TrustRank algorithm with 50, 75 and 100 seeds. TrustRank with 100 good seeds outperforms the rest because more good seeds reach wider coverage: with a larger good seed set, more trusted sites are promoted, leaving bad pages behind.


Table 1. Propagation Coverage

Algorithm              Sn(S)          Sn(SG)         Sn(SB)       tG      tB
TrustRank (50 seeds)   8244 (80.4%)   6607 (64.4%)   1637 (16%)   92.9%   7.1%
TrustRank (75 seeds)   8384 (81.8%)   6762 (65.7%)   1642 (16%)   94.3%   5.7%
TrustRank (100 seeds)  8435 (82.3%)   6792 (66.2%)   1643 (16%)   95.6%   4.4%

The propagation coverage of the TrustRank algorithm with 50, 75 and 100 seeds is given in Table 1. Sn(S) denotes the number of sites covered from the seed set; Sn(SG) denotes the number of good sites and Sn(SB) the number of bad sites reached. tG denotes the percentage of trust propagated to good sites and tB the percentage propagated to bad sites. Among the runs, TrustRank with 100 good seeds propagated trust to the most good sites (66.2% of all sites).

Table 2. Comparison of Web Spam Detection Algorithms

Algorithm                 Year   Good Seed Set   Bad Seed Set   Trust/Distrust Propagation
TrustRank                 2004   ✓                              Trust is propagated from a good seed set selected by inverse PageRank, which can reach many other pages.
ParentPenalty             2005                   ✓              Pages pointing to more spam pages than a threshold value are likely to be spam pages themselves.
Anti-TrustRank            2006                   ✓              Distrust is propagated in the reverse direction from a spam seed set with high PageRank.
Topical TrustRank         2006   ✓                              Trust propagation is based on topical information, partitioned by the different topics on the Web.
R-SpamRank                2007                   ✓              Distrust propagates from a spam seed set to detect more spam sites via inverse PageRank.
DiffusionRank             2007   ✓                              Uses the same trust propagation as TrustRank.
Link Variable TrustRank   2008   ✓                              Uses the same trust propagation as TrustRank.

Table 2 lists Web spam detection algorithms that use different seed sets to propagate either trust or distrust to other pages. To the best of our knowledge, TrustRank (Gyongyi et al. 2004) is the first work that introduced seed selection to identify additional useful pages. On the other hand, ParentPenalty, proposed by Wu et al. (2005), is the first that uses spam seeds to identify spam pages.


DiffusionRank by Yang et al. (2007) and Link Variable TrustRank proposed by Chen et al. (2008) borrowed the same seed selection procedure from the original TrustRank algorithm.

CONCLUSION

We implemented TrustRank and compared it in detail with HostRank on a large dataset, and discussed four derivatives of TrustRank: Anti-TrustRank, Topical TrustRank, DiffusionRank and Link Variable TrustRank. The comparison showed that the HostRank algorithm is prone to spamming, so it is better to combine spam detection algorithms with ranking algorithms to produce fairer and better ranking results. The seed selection step of TrustRank can be researched further, as it is the key to identifying additional pages, and the interplay between dampening and splitting for trust propagation can also be explored further. Most algorithms use either a good seed set or a bad seed set for propagation; we believe that combining a good seed set to propagate trust with a bad seed set to propagate distrust would be a promising way to combat Web spam. Machine learning could also be brought in to assist in discovering Web spam.

REFERENCES

Perkins, A. 2001. The Classification of Search Engine Spam. http://www.silverdisc.co.uk/articles/spam-classification.
Arasu, A., Novak, J., Tomkins, A. and Tomlin, J. 2002. PageRank Computation and the Structure of the Web: Experiments and Algorithms. In Proc. of the Eleventh International World Wide Web Conference.
Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison-Wesley.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S. and Baeza-Yates, R. 2008. Link Analysis for Web Spam Detection. ACM Transactions on the Web (TWEB), 2(1): 1-42.
Benczur, A. A., Csalogany, K., Sarlos, T. and Uher, M. 2005. SpamRank – Fully Automatic Link Spam Detection. In Proc. of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb).


Brin, S. and Page, L. 1998. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 30(1-7): 107-117.
Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M. and Vigna, S. 2006. A Reference Collection for Web Spam. SIGIR Forum, 40(2): 11-24.
Chen, Q., Yu, S. N. and Cheng, S. 2008. Link Variable TrustRank for Fighting Web Spam. In Proc. of the IEEE International Conference on Computer Science and Software Engineering.
Drost, I. and Scheffer, T. 2005. Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam. In Proc. of the European Conference on Machine Learning, 96-107.
eFactory 2002. Pr0 – Google's PageRank 0 Penalty. http://pr.efactory.de/e-pr0.shtml.
Eiron, N., McCurley, K. S. and Tomlin, J. A. 2004. Ranking the Web Frontier. In Proc. of the Thirteenth International Conference on World Wide Web, 309-318.
Fetterly, D., Manasse, M. and Najork, M. 2004. Spam, Damn Spam, and Statistics: Using Statistical Analysis to Locate Spam Web Pages. In Proc. of WebDB.
Gyongyi, Z., Garcia-Molina, H. and Pedersen, J. 2004. Combating Web Spam with TrustRank. In Proc. of the 30th International Conference on Very Large Data Bases (VLDB), Toronto.
Gyongyi, Z. and Garcia-Molina, H. 2005. Web Spam Taxonomy. In Proc. of the First International Workshop on Adversarial Information Retrieval on the Web.
Gyongyi, Z., Berkhin, P. and Garcia-Molina, H. 2006. Link Spam Detection Based on Mass Estimation. In Proc. of the 32nd International Conference on Very Large Data Bases, 439-450.
Haveliwala, T. H. 2003. Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search. IEEE Transactions on Knowledge and Data Engineering, 15(4): 784-796.
Henzinger, M. R., Motwani, R. and Silverstein, C. 2003. Challenges in Web Search Engines. In Proc. of the International Joint Conference on Artificial Intelligence.
Krishnan, V. and Raj, R. 2006. Web Spam Detection with Anti-TrustRank. In Proc. of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb).
Liang, C., Ru, L. and Zhu, X. 2007. R-SpamRank: A Spam Detection Algorithm Based on Link Analysis. Journal of Computational Information Systems, 3(4): 1705-1712.


Shen, G., Gao, B., Liu, T. and Feng, G. 2006. Detecting Link Spam Using Temporal Information. In Proc. of the Sixth International Conference on Data Mining (ICDM), Hong Kong.
Wang, X., Tao, T., Sun, J. T., Shakery, A. and Zhai, C. 2008. DirichletRank: Solving the Zero-One-Gap Problem of PageRank. ACM Transactions on Information Systems, 26(2).
Wu, B. and Davison, B. D. 2005. Identifying Link Farm Spam Pages. In Proc. of the 14th International World Wide Web Conference, 820-829.
Wu, B., Goel, V. and Davison, B. D. 2006. Topical TrustRank: Using Topicality to Combat Web Spam. In Proc. of the 15th International World Wide Web Conference (WWW), Edinburgh.
Yang, H., King, I. and Lyu, M. R. 2007. DiffusionRank: A Possible Penicillin for Web Spamming. In Proc. of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Amsterdam.
Zhang, Y., Jiang, Q., Zhang, L. and Zhu, Y. 2009. Exploiting Bidirectional Links: Making Spamming Detection Easier. In Proc. of the 18th ACM Conference on Information and Knowledge Management.


This paper is an accepted manuscript in Cybernetics and Systems: An International Journal. The original published version is available at http://www.tandfonline.com/doi/abs/10.1080/01969722.2012.707491 or via DOI: 10.1080/01969722.2012.707491.
