TPRank: Contend Web Spam with Trust Propagation

Alex Goh Kwang Leng a, Ashutosh Kumar Singh a, Ravi Kumar P. a, Anand Mohan b

a Department of Electrical and Computer Engineering, Curtin University, Sarawak Campus, Miri, Malaysia
b National Institute of Technology, Kurukshetra, India
[email protected],
[email protected],
[email protected],
[email protected] The quantity and quality of the seed sets are the key factors for the success of propagation based anti-Web spam techniques. This kind of approach is simple and yet effective, but the manual evaluation of seed sets is expensively time-consuming. For this reason, the manual evaluation process becomes vital and valuable. In this paper, we propose Trust Propagation Rank (TPRank) that automatically propagates trust to demote Web spam based on small amount of reputable and spam seeds. Moreover, the proposed algorithm is extended to Trust Propagation (TP) Spam Mass in detection of Web spam. Experiments are done on two public available datasets - WEBSPAMUK2006 and WEBSPAM-UK2007, and the results shown both TPRank and TP Spam Mass outperform the state of the art TrustRank in demotion up to 10.623% and Spam Mass algorithm in detection up to 43.216%.
Keywords: Adversarial Information Retrieval, Web Spam Filtering Algorithms, Trust Propagation, TrustRank, Spam Mass
INTRODUCTION

Web spam has always been a threat that misleads Web search engines' results. It uses deceitful tricks to raise the ranking of particular pages above their deserved ranks in exchange for financial gain (Gyongyi and Garcia-Molina 2005). Contending with Web spam is tedious work because Web pages are growing significantly, and so are Web spam and its spamming techniques.
In recent years, anti-Web spam techniques have constantly been proposed. Among them, the trust and distrust propagation model is the most efficient, as its results are consistently promising (Zhang et al. 2011). An example of a trust propagation algorithm is TrustRank (Gyongyi et al. 2004), which uses a set of evaluated reputable seeds to detect additional trustworthy pages. Examples of distrust propagation algorithms are Anti-TrustRank (Krishnan and Raj 2006) and Spam Mass (Gyongyi et al. 2006). Anti-TrustRank starts from a set of spam seeds and propagates distrust to detect more spam pages, while Spam Mass works on top of PageRank (Brinkmeier 2006) and TrustRank to detect spam pages.

Trust and distrust propagation based anti-Web spam techniques can be divided into two categories: Web spam demotion and Web spam detection. Trust is propagated to detect additional trustworthy pages and hence demote Web spam; demotion can act as a counter-bias to reduce possible rank boosts from spam (Gyongyi et al. 2004). Distrust, on the other hand, is propagated to detect additional spam pages, which helps Web search engines remove them as early as possible.

In this paper, we propose Trust Propagation Rank (TPRank), which calculates trust scores for all pages based on a limited evaluation of reputable and spam seeds in order to demote Web spam. To enhance the algorithm, we single out "ugly" pages, because separating "ugly" pages from pure good pages avoids promoting spam pages. Furthermore, spam pages are punished with a zero rank so that they cannot affect the ranks of other pages. In addition, we modify the Spam Mass (Gyongyi et al. 2006) algorithm into Trust Propagation (TP) Spam Mass to detect Web spam. Experiments are conducted on two publicly available datasets, WEBSPAM-UK2006 (Castillo et al. 2006) and WEBSPAM-UK2007 (Yahoo! 2007), and the results show that TPRank outperforms TrustRank by up to 10.623% in the demotion of Web spam and TP Spam Mass outperforms Spam Mass by up to 43.216% in the detection of Web spam.
RELATED WORK

WEBSPAM-UK2006 (Castillo et al. 2006), WEBSPAM-UK2007 (Yahoo! 2007) and EU-2010 (Benczúr et al. 2010) are publicly available datasets provided to advance research in Web spam detection. A taxonomy of Web spam explaining spamming techniques was written by Gyongyi and Garcia-Molina (2005). In addition, Wu and Davison (2005a) conducted detailed research on cloaking and redirection, and more recently Wang et al. (2011) proposed Dagger to counter cloaking.

Here we present a few anti-Web spam algorithms that are related to our work. BadRank (Sobek 2002) propagates distrust from a given blacklist to measure the negative characteristics of a page; TrustRank (Gyongyi et al. 2004) works on the intuition that good pages seldom point to bad pages; ParentPenalty (Wu and Davison 2005b) detects link farms based on the incoming and outgoing links of Web pages; Spam Mass (Gyongyi et al. 2006) estimates the impact of link spamming on a page's ranking; Anti-TrustRank (Krishnan and Raj 2006) detects Web spam starting from high-PageRank spam seeds; Topical TrustRank (Wu et al. 2006b) propagates trust based on topical information on the Web, corresponding to different topics; Wu et al. (2006a) combined the trust and distrust models to promote as well as demote sites; DiffusionRank (Yang et al. 2007) reduces the effect of link manipulations; Nie et al. (2007) incorporated both spam and non-spam measures to improve upon TrustRank; R-SpamRank (Liang et al. 2007) assigns new values to spam seeds to detect more spam; Link Variable TrustRank (Qi et al. 2008) incorporates the variance of link structure and combines it with TrustRank to propagate trust further; AVRank and HVRank (Zhang et al. 2009) exploit bidirectional links to measure a page's value; and Zhang et al. (2011) proposed Trust-Distrust Rank to measure a page's value based on the fact that a page has both trustworthy and untrustworthy sides. A detailed comparison can be found in (Leng et al. 2012).

The following works are significant in this field but less related to ours. Fetterly et al. (2004), based on the fact that certain classes of spam pages are machine-generated, proposed a statistical approach to identifying Web spam. Becchetti et al. (2008) implemented a C4.5 decision tree classifier, using their earlier Truncated PageRank (Becchetti et al. 2006) as one of the features, and achieved a 0.585 true positive rate and a 0.037 false positive rate on the WEBSPAM-UK2006 dataset. Wang et al. (2008) proposed DirichletRank, which calculates probabilities using Bayesian estimation with a Dirichlet prior and solves the zero-one gap problem that could potentially be exploited for spamming. Based on the graph neural network model (Scarselli et al. 2009a, 2009b), Noi et al. (2010) used probability mapping GraphSOMs and graph neural networks, a connectionist model that extends neural network methods to graph processing, to counter spam, achieving an F-measure of 0.9169 and a ROC AUC of 0.9301 on the WEBSPAM-UK2007 dataset. Abernethy et al. (2008) presented WITCH, a support vector machine classifier that detects Web spam using both link and content features; it achieves 0.928 for AUC10% and 0.963 for AUC100% on the WEBSPAM-UK2006 dataset. In recent years, genetic programming has increasingly been used to detect link spam: Xiaofei et al. (2010) applied genetic programming to the WEBSPAM-UK2006 dataset and improved spam classification F-measure by 11%, accuracy by 4% and recall by 26% compared to support vector machines (SVM), while Li et al. (2011) generated 10 new features from genetic programming that are equivalent to classifiers using the standard 138 transformed link-based features.
WEB MODEL

For the convenience of the later discussion, we use the following definitions and representations. Vertices ($V$): Web pages or Web hosts in the Web model. Edges ($E$): the connections between two vertices. In-degree ($d^-(V)$): the number of edges pointing to $V$. Out-degree ($d^+(V)$): the number of edges pointing from $V$. Set of evaluated vertices ($V_E$): vertices labelled as either reputable or spam. Set of unevaluated vertices ($V_X$): vertices that are not labelled or evaluated. Set of reputable vertices ($V_R$): vertices with relevant and trustworthy content. Set of spam vertices ($V_S$): vertices with irrelevant, spam content. Set of pure good vertices ($V_G$): reputable vertices with no edge pointing to spam. Set of ugly vertices ($V_U$): reputable vertices that nevertheless point to spam.

A graph $G(V, E)$ represents the interconnections between the entities in a Web model, where $V$ is the set of vertices and $E$ is the set of edges connecting them. Graphs divide into two main kinds: the undirected graph $G(U)$ and the directed graph $G(D)$. Let $e_1$ and $e_2$ be the initial and terminal points of an edge. The edge of an undirected graph is represented as an unordered set, $e = \{e_1, e_2\} \in E$, where $\{e_1, e_2\} = \{e_2, e_1\}$. The edge of a directed graph is represented as an ordered pair, $e = (e_1, e_2) \in E$, where $(e_1, e_2) \neq (e_2, e_1)$. The degree of a vertex is the number of edges adjacent to it and divides into two categories, the in-degree $d^-(V)$ and the out-degree $d^+(V)$.

In this paper, the vertices $V$ are divided into evaluated vertices $V_E$ and unevaluated vertices $V_X$, such that $V = (V_E, V_X)$. The evaluated vertices are further categorized into reputable vertices $V_R$ and spam vertices $V_S$, and the reputable vertices are distinguished into pure good vertices $V_G$ and ugly vertices $V_U$. In other words, $V_E = (V_R, V_S)$ where $V_R = (V_G, V_U)$, and the unevaluated vertices form the set of unknown vertices $V_X$. Therefore, in set form, the general Web model can be represented as $V = \{V_G, V_U, V_S, V_X\}$, where the vertices consist of pure good, ugly, spam and unknown pages, respectively.
A Web model can be considered at several levels, such as the page level and the host level. At the page level, the Web graph is denoted by $G_p(V_p, E_p)$, where the set of page vertices $V_p$ is composed of pure good pages $G_p$, spam pages $S_p$, ugly pages $U_p$ and unknown pages $X_p$; that is, $V_p = \{G_p, S_p, U_p, X_p\}$. The set of edges $E_p$ contains the connections between pages, $E_p = \{(e_1, e_2) \mid e_1, e_2 \in V_p\}$. At the host level, the Web graph is denoted by $G_h(V_h, E_h)$, where $V_h$ is the set of host vertices, composed of pure good hosts $G_h$, spam hosts $S_h$, ugly hosts $U_h$ and unknown hosts $X_h$, and $E_h$ is the set of edges between hosts. Note that a host consists of the set of Web pages under the same domain name. Two host vertices $h_1$ and $h_2$ are connected, $(h_1, h_2) \in E_h$, if some pages under $h_1$ point to pages under $h_2$. The Web model can be represented in matrix form as follows:
Transition Matrix, $M$:

$$M(a, b) = \begin{cases} \dfrac{1}{d^+(b)} & \text{if } (b, a) \in E \\ 0 & \text{otherwise} \end{cases}$$

Inverse Transition Matrix, $U$:

$$U(a, b) = \begin{cases} \dfrac{1}{d^-(b)} & \text{if } (a, b) \in E \\ 0 & \text{otherwise} \end{cases}$$
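To make these definitions concrete, the following is a minimal sketch, assuming the Web graph is stored as a dictionary mapping each vertex to its list of out-links; the function name and the sparse-map representation are illustrative, not part of the paper.

```python
# A minimal sketch of the transition matrix M and the inverse transition
# matrix U as sparse maps; the dict-of-out-links graph representation is
# an illustrative assumption.
from collections import defaultdict

def build_matrices(out_links):
    """out_links maps each vertex to the list of vertices it points to."""
    in_links = defaultdict(list)
    for src, dsts in out_links.items():
        for dst in dsts:
            in_links[dst].append(src)

    M = defaultdict(dict)  # M[a][b] = 1/d+(b) if (b, a) in E
    U = defaultdict(dict)  # U[a][b] = 1/d-(b) if (a, b) in E
    for b, dsts in out_links.items():
        for a in dsts:  # edge (b, a): b's trust is split over its out-links
            M[a][b] = 1.0 / len(dsts)
    for b, srcs in in_links.items():
        for a in srcs:  # edge (a, b): transition weight on the reversed graph
            U[a][b] = 1.0 / len(srcs)
    return M, U
```

$M$ is the usual PageRank transition matrix, while $U$ is the transition matrix of the graph with reversed links, which is what inverse PageRank (used later for seed selection) operates on.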
ALGORITHMS

In this section, we cover all aspects of the algorithm, including ugly vertices, trust score calculation and the handling of spam vertices. Furthermore, we explain Trust Propagation Rank (TPRank), which demotes Web spam, and Trust Propagation (TP) Spam Mass, which detects it.
A. Ugly Vertices

Assuming the Web graph is at the page level, TrustRank follows the intuition that reputable pages seldom point to spam pages and that trust flows accordingly. However, this does not always hold on the real Web: spammers can obtain many incoming links from reputable pages through indecent means (Qi et al. 2008). One way of doing this is to leave comments on externally editable pages such as blogs and Wikipedia. In our work, we distinguish such pages as ugly pages $U_p$, apart from the pure good pages $G_p$: ugly pages are reputable pages that point to spam pages. They are one of the reasons spam pages get promoted so easily.

The assessment of ugly vertices $V_U$ can be done after the evaluation of reputable vertices $V_R$ and spam vertices $V_S$. For every reputable vertex, if any of its outgoing edges points to a spam vertex, the vertex is categorized into the set of ugly vertices $V_U$; otherwise it belongs to the set of pure good vertices $V_G$. The sketch below illustrates this split.
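This split can be implemented directly; below is a minimal sketch, assuming the same out-link dictionary as in the earlier sketch and treating the seed sets as Python sets (names are illustrative).

```python
# A minimal sketch of the seed split described above; the data structures
# are illustrative assumptions, not the paper's implementation.
def split_reputable_seeds(reputable, spam, out_links):
    """Partition reputable seeds into pure good (V_G) and ugly (V_U)."""
    pure_good, ugly = set(), set()
    for v in reputable:
        # A reputable vertex with at least one out-link to a known spam
        # vertex is classified as ugly; otherwise it is pure good.
        if any(t in spam for t in out_links.get(v, [])):
            ugly.add(v)
        else:
            pure_good.add(v)
    return pure_good, ugly
```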
B. Trust Score Calculation

We introduce a new trust score calculation for the unknown vertices $V_X$, which makes use of the ugly vertices introduced earlier. The trust score is computed as

$$t_p = \frac{\sum_{q : (q, p) \in E} t_q}{i_G + i_X} \qquad (1)$$

where $t_p$ is the trust score of an unknown vertex $p$, $i_G$ is the number of pure good in-neighbours, $i_X$ is the number of unevaluated in-neighbours, and the sum runs over the incoming vertices $q$. The new trust score of page $p$ is thus calculated from the trust scores of its incoming links. Among the incoming links, spam and ugly vertices are simply ignored, because their trust is not trustworthy: they might be pointing to spam pages. Examples of the trust score calculation are shown in Figure 1.
Figure 1. Examples of Trust Score Calculation
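As a complement to Figure 1, here is a minimal sketch of one application of Eq. (1), assuming an in-link dictionary and the seed partitions from the previous sketches; the helper name and signature are illustrative.

```python
# A minimal sketch of one application of Eq. (1) to a vertex p; spam and
# ugly in-neighbours are excluded, as the text describes.
def trust_score(p, in_links, trust, pure_good, unevaluated, spam, ugly):
    useful = [q for q in in_links.get(p, []) if q not in spam and q not in ugly]
    i_G = sum(1 for q in useful if q in pure_good)    # pure good in-neighbours
    i_X = sum(1 for q in useful if q in unevaluated)  # unevaluated in-neighbours
    if i_G + i_X == 0:
        return 0.0
    return sum(trust.get(q, 0.0) for q in useful) / (i_G + i_X)
```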
C. Handling Spam Vertices

During the assessment of the seed set, both reputable and spam seeds are evaluated. Often only one of the seed sets is used to propagate trust or distrust; for example, TrustRank uses only the reputable seed set, leaving the spam seed set unused. Seed sets are expensive to evaluate, so one should make good use of both. TrustRank has shown that reputable vertices receive high trust scores while spam vertices receive low ones. But even with low scores, spam vertices can work together to boost one target page; in other words, spam vertices can still affect other vertices. ParentPenalty (Wu and Davison 2005b; Wu et al. 2006a) penalizes reputable vertices that point to spam vertices. However, we argue that reputable vertices might point to spam vertices unintentionally, for instance when spammers leave comments that make reputable pages point to them. In our proposed method, once a vertex is known to be spam, we punish it by assigning it a zero rank. This way it has no chance of affecting reputable vertices with low trust scores, and it will not be ranked even if other vertices point to it.
D. Trust Propagation Rank

We propose Trust Propagation Rank (TPRank), a Web spam demotion algorithm that works similarly to TrustRank but propagates trust further from the same limited set of evaluated seeds. Unlike TrustRank, TPRank uses both the reputable seed set and the spam seed set to demote spam. The seeds are selected by inverse PageRank, so as to choose seeds whose trust propagates with the widest coverage (Gyongyi et al. 2004). The equation for inverse PageRank can be written as

$$\mathbf{s} = \alpha \cdot U \cdot \mathbf{s} + (1 - \alpha) \cdot \frac{1}{N} \cdot \mathbf{1}_N \qquad (2)$$
where $\mathbf{s}$ is the vector of inverse PageRank scores, $\alpha$ is a decay factor usually set to 0.85, $U$ is the inverse transition matrix of the Web graph, $N$ is the number of vertices and $\mathbf{1}_N$ is the $N$-dimensional vector of ones. During the process of seed selection, spam seeds are collected too. After the collection, both the ugly vertices and the pure good vertices can be extracted from the reputable vertices. The overall procedure is shown in Figure 2, and a sketch of the seed-selection and propagation steps is given below.

Figure 2. Trust Propagation Rank (TPRank) Algorithm
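Below is a minimal power-iteration sketch of Eq. (2), reusing the inverse transition matrix U from the earlier sketch, followed by one possible reading of the TPRank propagation loop. Since Figure 2 is not reproduced here, the loop merely combines the ingredients the text describes (Eq. (1) updates, zero rank for spam, 50 iterations); the initialization details, such as pinning the pure good seeds at full trust and giving ugly seeds none, are our assumptions, not the paper's exact procedure.

```python
# A minimal power-iteration sketch of Eq. (2); parameters follow the paper
# (alpha = 0.85, 50 iterations).
def inverse_pagerank(U, vertices, alpha=0.85, iters=50):
    n = len(vertices)
    s = {v: 1.0 / n for v in vertices}
    for _ in range(iters):
        nxt = {v: (1.0 - alpha) / n for v in vertices}  # (1 - alpha) * (1/N) * 1_N
        for a in vertices:
            for b, w in U.get(a, {}).items():
                nxt[a] += alpha * w * s[b]  # alpha * U * s
        s = nxt
    return s  # high-scoring vertices reach many pages: good seed candidates

# A speculative sketch of the TPRank propagation, composing the pieces
# above; the initialization is an assumption on our part.
def tprank(vertices, in_links, pure_good, ugly, spam, unevaluated, iters=50):
    trust = {v: 0.0 for v in vertices}
    for v in pure_good:
        trust[v] = 1.0                 # assumed: pure good seeds start fully trusted
    for _ in range(iters):
        nxt = {}
        for p in vertices:
            if p in spam:
                nxt[p] = 0.0           # spam vertices are punished with zero rank
            elif p in pure_good:
                nxt[p] = trust[p]      # assumed: seed trust is retained
            else:
                nxt[p] = trust_score(p, in_links, trust, pure_good,
                                     unevaluated, spam, ugly)  # Eq. (1)
        trust = nxt
    return trust
```

Sorting vertices by the inverse PageRank score and manually evaluating the top ones yields the reputable and spam seeds, mirroring the seed-selection step of TrustRank.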
E. Trust Propagation (TP) Spam Mass

In (Gyongyi et al. 2006), the authors proposed Spam Mass, a measure of the impact of link spamming on PageRank. Estimating Spam Mass helps identify pages that benefit significantly from link spamming. Spam Mass is built on top of PageRank and TrustRank, and its equation is

$$SM = \frac{PR - TR}{PR} \qquad (3)$$
where $SM$ stands for Spam Mass, $PR$ for PageRank and $TR$ for TrustRank. A vertex's Spam Mass is its PageRank score minus its TrustRank score, divided by its PageRank score. Note that both PageRank and TrustRank should be normalized before proceeding. In our research, Trust Propagation Rank (TPRank) is extended to Trust Propagation (TP) Spam Mass, whose equation is

$$TP\_SM = \frac{PR - TP}{PR} \qquad (4)$$
where $TP\_SM$ stands for Trust Propagation Spam Mass and $TP$ for the Trust Propagation Rank score. It has been shown that Spam Mass works more effectively than Anti-TrustRank (Qureshi 2011). Equations (3) and (4) form the basis for the detection of Web spam.
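Equations (3) and (4) differ only in which trust score is subtracted, so a single sketch covers both; the function below assumes normalized score dictionaries and skips vertices with zero PageRank.

```python
# A minimal sketch of Eqs. (3) and (4); scores are assumed normalized, and
# vertices with zero PageRank are skipped to avoid division by zero.
def spam_mass(pagerank, trust_scores):
    """Relative spam mass (PR - trust) / PR; pass TrustRank scores for
    Eq. (3) or TPRank scores for Eq. (4)."""
    return {v: (pr - trust_scores.get(v, 0.0)) / pr
            for v, pr in pagerank.items() if pr > 0}
```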
EXPERIMENTS

In this section, we discuss the datasets used for the experiments, the algorithms and evaluation approaches, and the experimental results and discussions.
A. Datasets

To evaluate the algorithms, experiments are done on two publicly available datasets, WEBSPAM-UK2006 (Castillo et al. 2006) and WEBSPAM-UK2007 (Yahoo! 2007), provided by the Laboratory of Web Algorithmics, Università degli Studi di Milano, with the support of the DELIS EU-FET research project. The datasets were crawled from the .uk domain in May 2006 and May 2007, respectively. The WEBSPAM-UK2006 base data consists of 77,741,046 Web pages in 11,402 hosts, while the WEBSPAM-UK2007 base data consists of 105,896,555 Web pages in 114,529 hosts. We consider only the host graphs, both because the computation time is significantly reduced and because, once a host is known to be spam, all pages under it can be assumed to be spam. Each host graph comes with two sets labeled spam or non-spam by a group of volunteers, SET1 for training and SET2 for testing. Since our algorithms require no training, we merge SET1 and SET2 for the evaluation. Note that for WEBSPAM-UK2007, we add some extra labeled hosts from WEBSPAM-UK2006. The distribution of labeled hosts used for evaluation is shown in figure 3.
Figure 3. The Distribution of the Datasets
B. Algorithms and Evaluation Approaches

Our proposed algorithms are compared with the baseline algorithms in the experimental results and discussions sub-section. The algorithms compared are:
TrustRank versus Trust Propagation Rank (TPRank)
Spam Mass versus Trust Propagation (TP) Spam Mass
Both TrustRank and TPRank are Web spam demotion algorithms, while Spam Mass and TP Spam Mass are Web spam detection algorithms. All algorithms are computed with 50 iterations and a decay factor of 0.85. For TrustRank, we use 50 reputable hosts as seeds in WEBSPAM-UK2006 and 100 reputable seeds in WEBSPAM-UK2007; we use more seeds for the latter dataset because it is bigger. During the evaluation of the reputable seeds on both datasets, 179 spam seeds were detected in WEBSPAM-UK2006, and its 50 reputable seeds divide into 30 pure good seeds and 20 ugly seeds; in WEBSPAM-UK2007, 21 spam seeds were detected, and its 100 reputable seeds divide into 35 pure good seeds and 65 ugly seeds. The spam, pure good and ugly seeds are all used by TPRank.

After the scores from the different algorithms are computed, the hosts are extracted based on the labeled set, sorted in descending order of the algorithm's score, and then distributed equally into 10 buckets, with the last bucket taking the remainder. In our experiments, Web spam demotion algorithms emphasize the identification of reputable hosts while Web spam detection algorithms emphasize the identification of spam hosts. The evaluation approaches for the experiments are:
Percentage of reputable or spam hosts for each bucket
Increment summation of reputable or spam hosts for all buckets
Average promotion level for reputable or spam hosts
Number of reputable or spam hosts being promoted
Propagation Coverage
The first evaluation is the percentage of reputable or spam hosts in each bucket; it measures how well the algorithms identify reputable or spam hosts. Seeing more reputable hosts under the Web spam demotion algorithms, and more spam hosts under the Web spam detection algorithms, indicates the effectiveness of the algorithms. The second evaluation is the incremental summation of reputable or spam hosts from the first to the last bucket, which shows how much the proposed algorithms improve across all buckets. Next is the average promotion level for reputable or spam hosts, used to track the movement of a reputable or spam host from one bucket to another. Let $P_O(S_i)$ be the bucket positions of the reputable or spam hosts under the TrustRank or Spam Mass algorithm, and $P_m(S_i)$ their bucket positions under the TPRank or TP Spam Mass algorithm. Let $S_i$ be the labeled reputable or spam hosts of the TrustRank or Spam Mass algorithm in the $i$th bucket. The average promotion at the $i$th bucket, $R_i$, is defined as:

$$R_i = \frac{P_O(S_i) - P_m(S_i)}{|S_i|} \qquad (5)$$
This metric tracks the improvement in each bucket over the baseline algorithms; its derived unit is called "bucket per level". We also report the number of reputable or spam hosts being promoted, which is correlated with the previous measurement. The last evaluation shows the propagation coverage of the algorithms, illustrating how far trust has reached among the vertices and the percentage of trust propagated to evaluated vertices. A sketch of the bucket assignment and of the promotion metric follows below.
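For concreteness, here is a minimal sketch of the 10-bucket assignment and of Eq. (5), assuming score dictionaries as in the earlier sketches; the helper names are illustrative.

```python
# A minimal sketch of the bucket assignment and the promotion metric of
# Eq. (5); the data structures are illustrative assumptions.
def bucketize(scores, n_buckets=10):
    """Sort hosts by score (descending) and split into equal buckets, the
    last bucket taking the remainder. Returns host -> bucket (1-based)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    size = max(1, len(ranked) // n_buckets)
    return {v: min(i // size, n_buckets - 1) + 1 for i, v in enumerate(ranked)}

def average_promotion(labeled, base_buckets, new_buckets, i):
    """R_i over the labeled hosts the baseline placed in bucket i; positive
    values mean the hosts moved to earlier (better) buckets."""
    s_i = [v for v in labeled if base_buckets[v] == i]
    if not s_i:
        return 0.0
    return sum(base_buckets[v] - new_buckets[v] for v in s_i) / len(s_i)
```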
C. Experimental Results & Discussions

To avoid confusion, we first show the experimental results and discussion for TrustRank versus TPRank. Figure 4 illustrates the percentage of reputable hosts on WEBSPAM-UK2006 and WEBSPAM-UK2007. Figure 5 illustrates the incremental summation of hosts from the first bucket to the last bucket on WEBSPAM-UK2006 and the reputable hosts gap on WEBSPAM-UK2007.
Figure 4. Percentage of reputable hosts on WEBSPAM-UK2006 and WEBSPAM-UK2007

As shown in figure 4, TPRank is able to detect more reputable hosts than TrustRank for the first six buckets in WEBSPAM-UK2006 and for the first four buckets in WEBSPAM-UK2007. It is important to demote spam hosts as early as possible so that they do not appear at the top of the results.
Figure 5. Summation of reputable hosts on WEBSPAM-UK2006 and on WEBSPAM-UK2007

As observed from figure 5, the 6th bucket in WEBSPAM-UK2006 has the biggest improvement, 10.623%, while for WEBSPAM-UK2007 the 4th bucket has the biggest improvement gap, with 35 more reputable hosts detected. The improvement on the WEBSPAM-UK2007 dataset is only slight because the number of labeled spam hosts is small, which makes the improvement in reputable hosts relatively hard to see.
Figure 6. Average reputable hosts promotion and the number of reputable hosts promoted on WEBSPAM-UK2006

Figures 6 and 7 illustrate the average promotion of reputable hosts and the number of reputable hosts promoted by TPRank over the TrustRank buckets on WEBSPAM-UK2006 and WEBSPAM-UK2007.
Figure 7. Average reputable hosts promotion and the number of reputable hosts promoted on WEBSPAM-UK2007

As observed from figure 6, the 9th bucket has the highest improvement on WEBSPAM-UK2006, with an average reputable host promotion of 4.173 bucket per level, promoting 370 reputable hosts. In figure 7, the highest average reputable host promotion, 1.764 bucket per level, occurs at the 5th bucket, and the bucket with the highest number of promoted reputable hosts is the 7th, promoting 351 reputable hosts.
Figure 8. Percentage of spam hosts on WEBSPAM-UK2006 and WEBSPAM-UK2007

Apart from the Web spam demotion algorithms, we now show the experimental results and discussion for the two Web spam detection algorithms, Spam Mass and TP Spam Mass. Figure 8 illustrates the percentage of spam hosts on WEBSPAM-UK2006 and WEBSPAM-UK2007. Figure 9 illustrates the summation of all spam hosts on WEBSPAM-UK2006 and WEBSPAM-UK2007.
Figure 9. Summation of spam hosts on WEBSPAM-UK2006 and WEBSPAM-UK2007

In figure 8, TP Spam Mass detects more spam hosts than Spam Mass for the first three buckets in WEBSPAM-UK2006; for WEBSPAM-UK2007 the picture is less clear because the spam seed set for that dataset is relatively small. However, as shown in figure 9 for WEBSPAM-UK2007, TP Spam Mass accumulates more spam hosts as the buckets progress, even though Spam Mass detects more spam in the first bucket. For WEBSPAM-UK2006 in figure 9, TP Spam Mass accumulates more spam hosts than Spam Mass in every bucket.
Figure 10. Average spam hosts promotion and the number of spam hosts promoted on WEBSPAM-UK2006

In figure 10, TP Spam Mass promotes spam hosts by as much as 5.01 bucket per level, with the 4th bucket promoting 254 spam hosts, an improvement of 43.216% in the detection of Web spam over the Spam Mass algorithm on WEBSPAM-UK2006.
Figure 11. Average spam hosts promotion and the number of spam hosts promoted on WEBSPAM-UK2007

For WEBSPAM-UK2007 in figure 11, TP Spam Mass is able to promote up to 3.84 bucket per level, with the last bucket promoting 25 spam hosts.

Table 2. Propagation Coverage

Datasets         Algorithms   Sn(S)   Sn(SR)   Sn(SS)   tR       tS
WEBSPAM-UK2006   TrustRank     8564     4223     1374   86.20%   13.80%
WEBSPAM-UK2006   TPRank       10183     5242     1553   98.02%    1.98%
WEBSPAM-UK2007   TrustRank    76376     6697      244   98.82%    1.18%
WEBSPAM-UK2007   TPRank       94337     7844      338   99.58%    0.42%
Table 2 illustrates the propagation coverage, denoted Sn, from the seed set S, the reputable seeds SR and the spam seeds SS, in terms of the numbers of vertices, reputable vertices and spam vertices reached; the percentages of trust propagated to reputable and spam vertices are also included. Sn(S) denotes the number of hosts covered from the seed set, while Sn(SR) and Sn(SS) denote the numbers of reputable and spam hosts reached, respectively. tR denotes the percentage of trust propagated to reputable hosts and tS the percentage propagated to spam hosts. Table 2 shows that, from the same number of seeds (50 for WEBSPAM-UK2006 and 100 for WEBSPAM-UK2007), TPRank propagates to more reputable vertices, and also to more spam vertices, than TrustRank. Even though TPRank reaches more spam vertices, the trust propagated to them is relatively smaller than with TrustRank: 1.98% versus 13.80% on WEBSPAM-UK2006 and 0.42% versus 1.18% on WEBSPAM-UK2007.

Throughout the experiments, 50 reputable seeds (30 pure good and 20 ugly) with 179 additionally evaluated spam seeds were used for WEBSPAM-UK2006, and 100 reputable seeds (35 pure good and 65 ugly) with 21 additionally evaluated spam seeds were used for WEBSPAM-UK2007. The results show that TPRank outperforms TrustRank by up to 10.623% in the demotion of Web spam, while TP Spam Mass outperforms Spam Mass by up to 43.216% in the detection of Web spam. The propagation coverage likewise shows significantly better results when TPRank, rather than TrustRank, propagates trust over the whole dataset. The results would be even more significant if more reputable seeds were evaluated.
CONCLUSION AND FUTURE WORK

Various link-based anti-Web spam techniques have been proposed in recent years. Our work proposes Trust Propagation Rank (TPRank) to demote Web spam and Trust Propagation (TP) Spam Mass to detect it. The proposed algorithms are evaluated on two large publicly available datasets, WEBSPAM-UK2006 and WEBSPAM-UK2007, and are shown to outperform the baseline algorithms. In future work, the trust and distrust models can be used together to combat Web spam, which may further increase the demotion and detection rates. In addition, machine learning methods can assist human experts in this task.
REFERENCES

Abernethy, J., Chapelle, O., and Castillo, C. "Web Spam Identification through Content and Hyperlinks." In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb '08), 41-44. ACM, Beijing, China, 2008.

Becchetti, L., Castillo, C., Donato, D., Baeza-Yates, R., and Leonardi, S. "Link Analysis for Web Spam Detection." ACM Trans. Web 2, no. 1 (2008): 1-42.

Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. "Using Rank Propagation and Probabilistic Counting for Link-based Spam Detection." In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD 2006). ACM Press, Philadelphia, Pennsylvania, USA, 2006.

Benczúr, A. A., Castillo, C., Erdélyi, M., Gyöngyi, Z., Masanes, J., and Matthews, M. 2010. ECML/PKDD 2010 Discovery Challenge Data Set. Crawled by the European Archive Foundation.

Brinkmeier, M. "PageRank Revisited." ACM Trans. Internet Technol. 6, no. 3 (2006): 282-301.

Castillo, C., Donato, D., Becchetti, L., Boldi, P., Santini, M., and Vigna, S. "A Reference Collection for Web Spam." SIGIR Forum 40, no. 2 (2006).

Fetterly, D., Manasse, M., and Najork, M. "Spam, Damn Spam, and Statistics: Using Statistical Analysis to Locate Spam Web Pages." In Proceedings of the 7th International Workshop on the Web and Databases (WebDB 2004), 1-6. ACM, Paris, France, 2004.

Gyongyi, Z., Berkhin, P., Garcia-Molina, H., and Pedersen, J. "Link Spam Detection Based on Mass Estimation." In Proceedings of the 32nd International Conference on Very Large Data Bases, 439-450. VLDB Endowment, Seoul, Korea, 2006.

Gyongyi, Z., and Garcia-Molina, H. "Web Spam Taxonomy." In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 39-47. Chiba, Japan, 2005.

Gyongyi, Z., Garcia-Molina, H., and Pedersen, J. "Combating Web Spam with TrustRank." In Proceedings of the Thirtieth International Conference on Very Large Data Bases, 576-587. VLDB Endowment, Toronto, Canada, 2004.

Krishnan, V., and Raj, R. "Web Spam Detection with Anti-TrustRank." In Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 37-40. Seattle, USA, 2006.

Leng, A. G. K., Patchmuthu, R. K., Singh, A. K., and Mohan, A. "Link Based Spam Algorithms in Adversarial Information Retrieval." Cybernetics and Systems: An International Journal 43, no. 6 (2012): 459-475.

Li, S., Niu, X., Li, P., and Wang, L. "Generating New Features Using Genetic Programming to Detect Link Spam." In 2011 International Conference on Intelligent Computation Technology and Automation (ICICTA), 135-138. Shenzhen, China, 2011.

Liang, C., Ru, L., and Zhu, X. "R-SpamRank: A Spam Detection Algorithm Based on Link Analysis." Journal of Computational Information Systems 3, no. 4 (2007): 1705-1712.

Nie, L., Wu, B., and Davison, B. D. "Winnowing Wheat from the Chaff: Propagating Trust to Sift Spam from the Web." In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 869-870. ACM, Amsterdam, The Netherlands, 2007.

Noi, L. D., Hagenbuchner, M., Scarselli, F., and Tsoi, A. C. "Web Spam Detection by Probability Mapping GraphSOMs and Graph Neural Networks." In Proceedings of the 20th International Conference on Artificial Neural Networks: Part II, 372-381. Springer-Verlag, Thessaloniki, Greece, 2010.

Qi, C., Song-Nian, Y., and Sisi, C. "Link Variable TrustRank for Fighting Web Spam." In Proceedings of the International Conference on Computer Science and Software Engineering, 1004-1007. Wuhan, China, 2008.

Qureshi, M. A. 2011. Improving the Quality of Web Spam Filtering by Using Seed Refinement. Department of Computer Science, Korea Advanced Institute of Science and Technology.

Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. "Computational Capabilities of Graph Neural Networks." IEEE Trans. Neural Netw. 20, no. 1 (2009a): 81-102.

Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. "The Graph Neural Network Model." IEEE Trans. Neural Netw. 20, no. 1 (2009b): 61-80.

Sobek, M. 2002. Pr0 - Google's PageRank 0 Penalty [cited 25 February 2012]. Available from http://pr.efactory.de/e-pr0.shtml.

Wang, D. Y., Savage, S., and Voelker, G. M. "Cloak and Dagger: Dynamics of Web Search Cloaking." In Proceedings of the 18th ACM Conference on Computer and Communications Security, 477-490. ACM, Chicago, Illinois, USA, 2011.

Wang, X., Tao, T., Sun, J.-T., Shakery, A., and Zhai, C. "DirichletRank: Solving the Zero-One Gap Problem of PageRank." ACM Trans. Inf. Syst. 26, no. 2 (2008): 1-29.

Wu, B., and Davison, B. D. "Cloaking and Redirection: A Preliminary Study." In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 39-47. Chiba, Japan, 2005a.

Wu, B., and Davison, B. D. "Identifying Link Farm Spam Pages." In Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, 820-829. ACM, Chiba, Japan, 2005b.

Wu, B., Goel, V., and Davison, B. D. "Propagating Trust and Distrust to Demote Web Spam." In World Wide Web (WWW2006) Workshop on Models of Trust for the Web (MTW), Edinburgh, Scotland, 2006a.

Wu, B., Goel, V., and Davison, B. D. "Topical TrustRank: Using Topicality to Combat Web Spam." In Proceedings of the 15th International Conference on World Wide Web, 63-72. ACM, Edinburgh, Scotland, 2006b.

Xiaofei, N., Shengen, L., Xuedong, N., Ning, Y., and Cuiling, Z. "Link Spam Detection Based on Genetic Programming." In 2010 Sixth International Conference on Natural Computation (ICNC), 3359-3363. 2010.

Yahoo! Web Spam Collections. 2007. Available from http://barcelona.research.yahoo.net/webspam/datasets/.

Yang, H., King, I., and Lyu, M. R. "DiffusionRank: A Possible Penicillin for Web Spamming." In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 431-438. ACM, Amsterdam, The Netherlands, 2007.

Zhang, X., Wang, Y., Mou, N., and Liang, W. "Propagating Both Trust and Distrust with Target Differentiation for Combating Web Spam." In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence (AAAI-11), 1292-1297. AAAI Press, San Francisco, California, 2011.

Zhang, Y., Jiang, Q., Zhang, L., and Zhu, Y. "Exploiting Bidirectional Links: Making Spamming Detection Easier." In Proceedings of the 18th ACM Conference on Information and Knowledge Management, 1839-1842. ACM, Hong Kong, China, 2009.
This is an accepted manuscript. The published version appeared online on 29 April 2014 at http://www.tandfonline.com/doi/abs/10.1080/01969722.2014.887938 or http://dx.doi.org/10.1080/01969722.2014.887938