
Web Spam Detection: New Approach with Hidden Markov Models

Ali Asghar Torabi¹, Kaveh Taghipour²,* and Shahram Khadivi¹

¹ Human Language Technology Lab, Department of Computer Engineering and IT, Amirkabir University of Technology
{a.torabi,khadivi}@aut.ac.ir
² Department of Computer Science, National University of Singapore, 13 Computing Drive, Singapore 117417
[email protected]

Abstract. Web spam is the result of a number of methods for deceiving search engine algorithms in order to obtain higher ranks in search results. Advanced spammers use keyword and link stuffing methods to create farms of spam pages. Most recent work in the web spam detection literature utilizes graph-based methods to enhance the accuracy of this task. This paper presents a probabilistic approach that uses content and link based features to detect web spam pages. Since we observe high connectivity between web spam pages, we adopt a method based on Hidden Markov Models that exploits the conditional dependency within a sequence of hosts and the spam/normal class distribution of each host. Experimental results show that the proposed method significantly improves the performance of the baseline classifier.

Keywords: web spam, link spam, hidden Markov models, ant colony optimization.

1 Introduction

Given the vast amount of information on the web, users have to rely on search engines to locate the web pages that are relevant to their interests and queries. The goal of search engines, as the main information retrieval machines for the web, is to rank pages higher when they are more important and more relevant to the user's query. Therefore, search engines need to distinguish normal web pages from spam pages in order to prevent users from being misled [1]. To find the desired content, search engines use textual similarity measures to determine the relevancy of a page to a query. To measure the importance of pages, there are several global, query-independent indicators like PageRank [2] that are often calculated from the web link structure [3]. As these two criteria are used by search engines to evaluate web pages, a new industry of Search Engine Optimization (SEO) has recently developed.

* This research was conducted while Taghipour was at Amirkabir University of Technology.


Malaga [4] grouped SEO methods into two categories: white hat SEO, which stays within the guidelines set by search engines, and web spam (also known as black hat SEO), which violates the rules and transgresses accepted norms. Web spamming means undeservedly boosting the rank of web pages without improving their true value. Web spamming methods cause crucial problems for search engines: they waste enormous resources on indexing illegitimate web pages, unduly decrease the quality of the retrieval process, and damage search engines [5]. According to [3], web spam techniques can be categorized into the following types:

• Content spam: spammers target textual similarity measures by filling the content of pages with popular words, so that the pages appear relevant to more popular user queries. In [6] the term "keyword stuffing" is used to refer to this method.

• Link spam: there is a general belief that pages with more incoming links are more popular and important than others. As mentioned, search engines use link-based measures like PageRank to assess the importance of a web page. Spamming methods that intend to influence these algorithms are called link spam. Spammers create many pages that link to a target page in order to increase its popularity.

Extensive research has been devoted to reducing the impact of web spam. Most of the proposed solutions, such as [6, 7], treat web spam detection as a classification problem. This research considers hosts as train/test instances, with features extracted from the content of pages within each host and the links among them. Previous experience shows that in web spam detection, instances are not independent and data labels are imbalanced [8]. In this paper we present a new approach that handles the imbalance of the data and models the dependency between hosts. To our knowledge, this is the first time that a Hidden Markov Model (HMM) has been used for this purpose. The proposed system starts by building a classifier based on Aggregating One-Dependence Estimators (AODE) [9]. The Hidden Markov Model lets us take the dependency between web hosts into account during prediction and thus boost the performance of AODE. A simple way to adapt an HMM to this task is to find the most frequent sequences of visited hosts; in the proposed system, the Ant Colony Optimization algorithm is used to generate the required sequences of hosts.

The paper is organized as follows. In Section 2, we provide an overview of previous work. In Section 3, the feature selection and classification methods are described. In Section 4, we propose a method that extracts the sequences of hosts needed to apply the HMM. Finally, we conclude by summarizing the key principles of our approach.

2 Related Work

Several automatic techniques for web spam detection have been presented in the literature. Fetterly et al. [10] demonstrated the statistical difference between machine-generated spam pages and normal web pages. They presented several features based on page content, linkage structure and page evolution. In their following paper [6], they


also proposed several content-based features and a decision tree for classifying spam and normal pages. Piskorski et al. [11] studied linguistic features and discovered several discriminative features that are publicly available to others. Moreover, Araujo et al. [12] offered a new approach rooted in a combination of link-based features and language-model-based ones. They observed the semantic relation between linked pages and found it useful for improving the performance of the classification task.

In addition to traditional learning models, many papers have used graph-based methods to boost the performance of web spam detection by considering the topological dependency between hosts. Link propagation, one of the most popular methods in graph-based problems, has been widely used in web spam detection. Becchetti et al. [13] performed a statistical analysis of the link structure of web pages in a large collection. Their experiments show that link-based metrics like TrustRank and Truncated PageRank can improve the performance of web spam classifiers. TrustRank starts with a seed set of good pages and then follows the link structure to propagate the rank through the related pages; the implementation of this method is described in [14]. Truncated PageRank is a version of PageRank that ignores the direct contribution of near neighbors according to a damping function. Experiments by Becchetti and others in [13, 15] show that Truncated PageRank is a discriminative feature. Castillo et al. [5] proposed stacked graphical learning for propagating labels across the web graph: in addition to content and link-based features, the average probability of spam for neighboring hosts is added to the feature vector of each host and is thus considered in the decision process. Link refinement, elimination and regularization are other methodologies that exploit the link structure to improve the performance of basic classifiers. Elimination of nepotistic links is one of the proposed methods to reduce the impact of link stuffing by removing certain links from the web graph [16]. Abernethy et al. [17] presented a graph-regularization-based algorithm, WITCH, that learns simultaneously from the graph structure and from content and link-based features. For additional studies on the above topics, the reader can refer to the survey on web spam detection by Spirin and Han [8], which covers many of the papers and systems proposed to date.

3 Classification and Feature Selection

We train and test the proposed method on WEBSPAM-UK2006 [18], a public web spam dataset. This collection contains 11402 hosts from the .uk domain. For each host, 263 features have been extracted from the links and the content of its pages; additional information about the feature types and their full list is available in [6, 13]. In this dataset, 7473 hosts were labeled by a group of volunteers into three categories: Spam, Normal and Borderline. Here, we use the first two categories to build a model that distinguishes spam hosts from normal ones.


In this research, several classifiers such as decision trees, neural networks and statistical classifiers were examined and compared against each other, using the F-measure as the criterion for comparing the efficiency of the different classification methods. The results showed that AODE was superior to the other competitive methods. AODE is a statistical classifier that achieves higher accuracy by averaging over all one-dependence naïve Bayes models: the core idea is to weaken the independence assumption of naïve Bayes, which lowers the error rate while keeping the training and test phases fast. In comparison with other methods like Bayesian networks, AODE has the advantage of not performing model selection, while its accuracy is comparable with non-parametric models like decision trees and neural networks [9]. We use the AODE implementation provided in Weka [19].

A subset of the two sorts of features, link-based and content-based, is used in this paper. Our approach is wrapper feature selection, which evaluates features by using a learning model [20] and chooses the most discriminative and relevant ones. After the feature selection process, only 22 of the 263 features remain: using cross-validation and a bidirectional search, an optimal feature subset was found that yields the best accuracy for the AODE classifier. Table 1 shows the performance of the AODE classifier for different feature selection models. In this paper, the "true positive" and "false positive" rates are, respectively, the rate of correctly detected spam hosts and the rate of normal hosts incorrectly detected as spam. Tables 1 and 3-5 report the results of 10-fold cross-validation.

Table 1. Comparison of feature selection methods

                      Correlation Attribute   Principal Component   Wrapper feature
                      Evaluation              Analysis              selection
True positive rate    74%                     77.2%                 81.2%
False positive rate   9%                      10.4%                 4.9%
F-Measure             0.72                    0.724                 0.825

The reported results show that wrapper feature selection improves the results by increasing the true positive rate and reducing the false positive rate; this led us to choose AODE and the new feature space to set up the proposed system. A minimal sketch of such a wrapper search is given below.
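The sketch below illustrates wrapper feature selection with 10-fold cross-validated F-measure. AODE is available in Weka (Java) but not in scikit-learn, so a Gaussian naïve Bayes classifier stands in as the wrapped model, and a greedy forward search stands in for the bidirectional search; both substitutions, and all names in the code, are illustrative assumptions rather than the authors' implementation.

```python
# A minimal sketch of wrapper feature selection with 10-fold cross-validation.
# GaussianNB is a stand-in for AODE, which scikit-learn does not provide.
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def wrapper_select(X, y, max_features=22):
    """Greedy forward search scored by cross-validated F-measure."""
    n_features = X.shape[1]
    selected, best_score = [], 0.0
    while len(selected) < max_features:
        best_candidate = None
        for f in range(n_features):
            if f in selected:
                continue
            subset = selected + [f]
            score = cross_val_score(GaussianNB(), X[:, subset], y,
                                    cv=10, scoring="f1").mean()
            if score > best_score:           # keep the feature that helps most
                best_score, best_candidate = score, f
        if best_candidate is None:           # no remaining feature improves F1
            break
        selected.append(best_candidate)
    return selected, best_score
```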

4 Smoothing

In this Section, we propose a method that detects web spam by using the topological dependency between hosts. We start by representing web spam detection as a graph-based problem: each host is a node in a graph G = (V, E) of web hosts V and links E, and for each pair of nodes u and v we have L(u, v), the number of links from host u to v. Most traditional classifiers presuppose that instances are independent and identically distributed, but in the web spam problem, samples are topologically dependent [21, 22].


Therefore, much of the latent information in the link structure between hosts is missed if we confine ourselves to the base classifier of the previous Section. The proposed system is based on the smoothness assumption of semi-supervised learning: nodes which are closer to each other are more likely to share a label [23]. For each pair of nodes in the web graph, we define a closeness factor:

$$c(u, v) = \log\bigl(1 + L(u, v)\bigr) \qquad (1)$$

On the other hand, Castillo et al. [5] showed in their experiments that normal nodes tend to be linked by very few spam nodes and mainly link to other normal nodes, while spam nodes are mainly linked to by spam nodes. Table 2 shows the results of our study on the WEBSPAM-UK2006 dataset: the probability of transition from a spam host to a normal host is 23%, which is much lower than the probability of transition to another spam host.

Table 2. Dependency of Spam and Normal classes

          Spam   Normal
Spam      77%    23%
Normal    13%    87%
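To make the graph representation concrete, the following minimal sketch stores the link counts L(u, v) in nested dictionaries and computes the closeness factor of equation (1); the host names are hypothetical placeholders.

```python
import math
from collections import defaultdict

# Host-level web graph stored as nested dictionaries of link counts:
# links[u][v] = L(u, v), the number of links from host u to host v.
links = defaultdict(lambda: defaultdict(int))
links["host-a.example.uk"]["host-b.example.uk"] = 12   # hypothetical hosts
links["host-b.example.uk"]["host-a.example.uk"] = 3

def closeness(u, v):
    # Closeness factor of equation (1): c(u, v) = log(1 + L(u, v)).
    return math.log(1 + links[u][v])
```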

Considering the conditional dependency of spam and normal hosts on each other, the topological dependency of hosts in the web graph, and the hidden category of each host, a Hidden Markov Model is a natural learning scheme for building a pattern of dependency between nodes and handling the imbalance between spam and normal labels. An HMM is a probabilistic method for modeling sequences of data [24]. However, to use an HMM we need sequences of connected hosts; this paper proposes the Ant Colony Optimization algorithm to extract sequences of related hosts according to the similarity measure (1). Fig. 1 presents the workflow of the proposed system.

Fig. 1. Web spam detection workflow. Train: the ACO generates host sequences from the host web graph, and Baum-Welch training on these sequences yields the HMM parameters. Test: the ACO generates sequences ending at the target host, and the forward algorithm computes the average probability of Spam and Normal over these sequences.

4.1 Ant Colony Optimization

In computer science, Ant Colony Optimization (ACO) refers to a general-purpose method for finding good solutions to an optimization problem. In ACO, artificial ants build solutions and exchange information by depositing pheromone on the ground in order to mark favorable paths through the problem space [25]. To use ACO, the problem space must be represented as a graph and three fundamental components must be defined:

• Edge selection: artificial ants move from vertex to vertex along the edges of the graph. A stochastic mechanism guides each ant k to choose an edge (i, j) at each step of its walk. This mechanism uses a probability distribution based on a heuristic function η and pheromone values τ, defined as:

$$p_{ij}^{k} = \begin{cases} \dfrac{\tau_{ij}\,\eta_{ij}}{\sum_{l \in N_i^k} \tau_{il}\,\eta_{il}} & \text{if } j \in N_i^k \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

and are "the pheromone value on edge , " and "heuristic In equation (2) function", respectively. represents a set of the neighboring hosts pointing to host that has not yet visited by ant . • Heuristic function: The heuristic function has been defined using the assumptions we mentioned at the beginning of this Section. Equation (3) illustrates heuristic function that is the same as similarity measure in equation (1). Therefore an ant that is in the web host chooses an edge , that has more input links than others. 1

,

(3)

• Pheromone update: according to [25], each artificial ant should update the pheromone on edge (i, j) after each step of its walk in order to communicate with the other ants. These pheromone updates incrementally mark out the best paths of connected hosts. This study combines the offline and local pheromone updates of [25] in one formula:

$$\tau_{ij}(t+1) = (1 - \rho)\,\tau_{ij}(t) + \frac{L(i, j)}{\sum_{l} L(i, l)} \qquad (4)$$

In equation (4), $\tau_{ij}(t)$ is the value of the pheromone on edge (i, j) in iteration t, L(i, j) is the number of links from i to j, and the real number 0 < ρ < 1 is a decay coefficient. According to this equation, the amount of pheromone on each edge decreases over time; a higher value of ρ gives other paths a greater chance of being selected by the edge selection mechanism in the following iterations, and as a result we obtain more distinct paths of connected hosts. A sketch of the whole ant walk follows.
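The sketch below puts the three components together into one ant walk over the host graph. Since equations (2)-(4) above are reconstructed from damaged text, the deposit term of the pheromone update and the direction convention of the heuristic should be read as assumptions; `in_links` maps each host to the hosts pointing to it, and `links` is the nested link-count dictionary from the earlier sketch.

```python
import math
import random
from collections import defaultdict

links = defaultdict(lambda: defaultdict(int))   # links[u][v] = L(u, v)

RHO = 0.1                        # decay coefficient rho (assumed value)
tau = defaultdict(lambda: 1.0)   # pheromone on each edge (i, j), initialised to 1

def eta(i, j):
    # Heuristic of equation (3). The ant walks backwards along in-links, so the
    # traversed edge carries L(j, i) links; the extracted text leaves the exact
    # direction convention ambiguous.
    return math.log(1 + links[j].get(i, 0))

def choose_next(i, visited, in_links):
    # Edge selection of equation (2): pick an unvisited in-neighbour j of i
    # with probability proportional to tau_ij * eta_ij.
    candidates = [j for j in in_links[i] if j not in visited]
    if not candidates:
        return None
    weights = [tau[(i, j)] * eta(i, j) for j in candidates]
    if sum(weights) == 0:
        return random.choice(candidates)
    return random.choices(candidates, weights=weights, k=1)[0]

def walk(start, length, in_links):
    # One ant walk of at most `length` hosts, depositing pheromone on the way.
    seq, visited, i = [start], {start}, start
    for _ in range(length - 1):
        j = choose_next(i, visited, in_links)
        if j is None:
            break
        # Pheromone update in the spirit of equation (4): decay plus a deposit
        # proportional to the relative link count (the deposit term is a
        # reconstruction, not necessarily the authors' exact formula).
        total = sum(links[j].values()) or 1
        tau[(i, j)] = (1 - RHO) * tau[(i, j)] + links[j].get(i, 0) / total
        seq.append(j)
        visited.add(j)
        i = j
    return seq
```

Repeating `walk` for many ants (the paper uses 100) and several iterations yields the sequences passed to Baum-Welch in training, and the sequences ending at a target host in testing.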

4.2 Hidden Markov Model

An HMM is a stochastic extension of the Markov model with hidden states: the states are not directly visible, but the probabilities of the states and of the transitions between them are given by state-dependent functions [24, 25]. We define two states, Spam and Normal. The visible outputs are the 22-dimensional feature vectors of Section 3, and the emission probability function is the AODE model presented there. Since AODE is a probabilistic model [9] that predicts posterior class probabilities, it is well suited to serve as the emission probability of the HMM.

Fig. 2. Sequences of hosts to host t

In the training phase, the Baum-Welch algorithm is run on the sequences generated by the ACO to estimate the transition matrix and the initial probabilities. All HMM parameters are recalculated in the maximization step of Baum-Welch except the emission probabilities, which have already been estimated by the AODE classifier. In the test phase, the proposed system predicts the label of a target host t: it first extracts, by ACO, sequences of hosts that are linked to t. Fig. 2 illustrates an example of host sequences of length three that are linked to the target web host. The forward algorithm, equation (5), is used to calculate the probability of the normal and spam classes according to each sequence:

$$P(Z_t \mid X_{1:t}) \propto P(X_t \mid Z_t) \sum_{Z_{t-1}} P(Z_t \mid Z_{t-1})\, P(Z_{t-1} \mid X_{1:t-1}) \qquad (5)$$


In equation (5), $P(X_t \mid Z_t = \text{spam})$ is the probability of observing feature vector $X_t$ in the state spam, the initial term gives the initial probabilities of spam and normal, and $P(Z_t \mid Z_{t-1})$ is the transition probability distribution. The four possible transitions are:

─ spam → spam
─ spam → normal
─ normal → spam
─ normal → normal

Finally, to predict the label of the target host, the averages of the probabilities $P(\text{spam} \mid X_{1:t})$ and $P(\text{normal} \mid X_{1:t})$ over all sequences are used; a minimal sketch of this computation is given below. Table 3 shows the performance of the new classification method.
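The sketch below implements the forward pass of equation (5) for one sequence and the averaging over sequences; `emission`, `trans` and `prior` are assumed to be supplied by the trained AODE model and the Baum-Welch estimates described above, and the function names are illustrative.

```python
STATES = ("spam", "normal")

def forward(seq_features, emission, trans, prior):
    """Posterior P(Z_t | X_{1:t}) for the last host of one sequence.

    seq_features : feature vectors X_1 .. X_t, one per host in the sequence
    emission     : emission(x, z) -> P(x | z); here, the AODE posteriors
    trans        : trans[z_prev][z] = P(Z_t = z | Z_{t-1} = z_prev)
    prior        : prior[z], the initial state probabilities
    """
    alpha = {z: prior[z] * emission(seq_features[0], z) for z in STATES}
    for x in seq_features[1:]:
        # Forward recursion of equation (5).
        alpha = {z: emission(x, z) * sum(trans[zp][z] * alpha[zp] for zp in STATES)
                 for z in STATES}
    total = sum(alpha.values()) or 1.0
    return {z: a / total for z, a in alpha.items()}   # normalised posterior

def predict_spam_probability(sequences, emission, trans, prior):
    # Average P(spam | X_{1:t}) over all ACO sequences ending at the target host.
    probs = [forward(s, emission, trans, prior)["spam"] for s in sequences]
    return sum(probs) / len(probs)
```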

Table 3. Smoothing by one HMM

                      AODE    HMM
True positive rate    81.2%   81.8%
False positive rate   4.9%    4.3%
F-Measure             0.825   0.836

4.3 Multiple HMMs

So far, we have used only the final output of the ACO algorithm, i.e., only the best sequences, to train a single HMM. In this Section we use the pheromone values themselves to better estimate the HMM parameters, treating the pheromone value of each edge as a measure of the conditional dependency between two hosts, and we introduce a technique for label smoothing using multiple HMMs. Fig. 3 shows the results of our experiments on the .uk 2006 dataset. In these experiments, we first run the ACO with 100 artificial ants and extract sequences of length 2; a discretization with 10 equal-depth bins is then performed on the pheromone values; afterwards, we train one Hidden Markov Model per bin, giving ten different HMMs. The prior probability of spam, P(Z = spam), and the transition probability P(Z₁ = spam | Z₂ = spam) for each bin are presented in Fig. 3. According to the reported parameters, the label of the destination point is conditionally more dependent on the source point when there is more pheromone on the edge between them; furthermore, the probability of spam has an inverse relation with the amount of pheromone.

The result of the above experiment is convincing enough to justify using different HMM components to model the relation between points with different dependency values (pheromones). For sequences of length two, implementing such a system with non-parametric models is straightforward, but for sequences of length three or more we must also take into account the pheromones on edges at depth two or higher, in addition to those on the first step of the sequences. Note that since we aim to use non-parametric models in the HMM, we need to discretize the edge values so as to reduce sparsity; a sketch of the equal-depth binning follows.
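A minimal sketch of equal-depth discretisation, under the assumption that the cut points are taken at the empirical quantiles of the pheromone values:

```python
import bisect

def equal_depth_boundaries(values, n_bins=10):
    # Equal-depth (equal-frequency) discretisation: cut points chosen so that
    # each bin receives roughly the same number of pheromone values.
    ordered = sorted(values)
    step = len(ordered) / n_bins
    return [ordered[int(i * step)] for i in range(1, n_bins)]

def bin_index(value, boundaries):
    # Map a pheromone value to its bin index in [0, n_bins - 1].
    return bisect.bisect_right(boundaries, value)
```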


Fig. 3. Dependency between HMM parameters and pheromone value. For each of the ten pheromone bins, the plot shows the transition probability P(X1→Spam | X2→Spam) and the probability of spam.

In this paper, two approaches were examined for sequences of length three. In the first, a weight is defined for each sequence of two edges by multiplying the pheromone values of the edges, and the binning is performed on these weights. In the second approach, the binning is applied twice: first on the edges connected to the target host, and then on the edges connected to the neighbors of the target host. The binning algorithm introduces ten bins in each run, so a 10 × 10 table of 100 HMM components is created, and each sequence is assigned to one of the HMM components according to the amount of pheromone on its edges. For example, if the first edge of a sequence belongs to bin 3 and the second edge belongs to bin 6, the system assigns the sequence to HMM number 18, in row 3 and column 6 of the pheromone table. The train and test phases are the same as before, except that $P(\text{spam} \mid X_{1:t})$ and $P(\text{normal} \mid X_{1:t})$ are computed by the appropriate HMM component; a sketch of this assignment follows.
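A minimal sketch of the second approach's assignment step, assuming row-major numbering of the 10 × 10 component table (the paper's exact numbering scheme may differ):

```python
import bisect

def bin_index(value, boundaries):
    # As in the binning sketch above.
    return bisect.bisect_right(boundaries, value)

def component_index(pher_first, pher_second, first_bounds, second_bounds, n_bins=10):
    # The bin of the edge into the target host selects the row, the bin of the
    # next edge out selects the column, and the pair addresses one of the
    # n_bins x n_bins HMM components.
    row = bin_index(pher_first, first_bounds)
    col = bin_index(pher_second, second_bounds)
    return row * n_bins + col
```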

Table 4 shows the results of the experiments.

Table 4. Comparison of proposed approaches

                      HMM     Multiple HMMs   Multiple HMMs of        Multiple HMMs of
                              of order 2      order 3, 1st approach   order 3, 2nd approach
True positive rate    81.8%   88.1%           87.6%                   91.7%
False positive rate   4.3%    6.5%            6.7%                    6.9%
F-Measure             0.836   0.843           0.838                   0.859

As Table 4 shows, the proposed system already achieves an improvement by using several HMM components on sequences of length 2. The performance drops when the first approach is used to model sequences of length 3, but rises again with the second approach. Overall, our experiments show that this method increases the detection rate of the baseline classifier by a considerable 10 percentage points, while the F-measure improves from 0.825 to 0.859. The next Section compares the results of this study with existing methods, to show to what extent the application of HMMs contributes to the improvement of web spam detection.

5 Conclusion

For many applications, such as web spam detection, the i.i.d. assumption fails to exploit the dependency patterns between data points. This study proposed a system that detects web spam using content and link-based features together with the dependency between points in the web host graph. To our knowledge, this is the first attempt to boost the performance of web spam detection using Hidden Markov Models. Table 5 compares the presented method with other systems in terms of F-measure. The experimental results show that the proposed method is effective and yields better performance than other works on the same feature set. Geng et al. [26] boosted the performance of the classification task using an under-sampling method and reached an F-measure of 0.759. Castillo et al. [5], one of the most significant studies on web spam detection, report an F-measure of 0.763 using stacked graphical learning. Benczúr et al. [27] report an F-measure of 0.738 following the same methodology as Castillo et al. [5]. To compare the performance of the proposed system with the results of the participants in the Web Spam Challenge [28], we also evaluated the proposed method on the test set provided by the organizers.

Table 5. Comparing performance of systems

                                      F-Measure
Web Spam Detection system     Test Set    Cross Validation
Our Proposed system           0.90        0.85
Castillo et al                NA          0.76
Geng et al                    0.87        0.75
Benczúr et al                 0.91        0.73
Filoche et al                 0.88        NA
Abou et al                    0.81        NA
Fetterly et al                0.79        NA
Cormack                       0.67        NA

One disadvantage of the proposed system is the number of HMMs needed in the second approach: for sequences of length 4, the second approach requires 1000 HMM models, which makes estimating the parameters of these HMMs time-consuming. In the near future, we plan to propose an HMM with parametric transition probabilities that can handle the weights of the edges directly. Moreover, we intend to employ new content-based features using language modeling techniques. Based on ongoing research on the topic, we strongly believe that better performance can be achieved with these new features.

References

1. Henzinger, M.R., Motwani, R., Silverstein, C.: Challenges in web search engines. In: ACM SIGIR Forum, pp. 11–22. ACM (2002)
2. Bianchini, M., Gori, M., Scarselli, F.: Inside PageRank. ACM Transactions on Internet Technology (TOIT) 5, 92–128 (2005)


3. Gyongyi, Z., Garcia-Molina, H.: Web spam taxonomy. In: First International Workshop on Adversarial Information Retrieval on the Web, AIRWeb 2005 (2005)
4. Malaga, R.A.: Search Engine Optimization—Black and White Hat Approaches. Advances in Computers 78, 1–39 (2010)
5. Castillo, C., Donato, D., Gionis, A., Murdock, V., Silvestri, F.: Know your neighbors: Web spam detection using the web topology. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 423–430. ACM (2007)
6. Ntoulas, A., Najork, M., Manasse, M., Fetterly, D.: Detecting spam web pages through content analysis. In: Proceedings of the 15th International Conference on World Wide Web, pp. 83–92. ACM (2006)
7. Mahmoudi, M., Yari, A., Khadivi, S.: Web spam detection based on discriminative content and link features. In: 2010 5th International Symposium on Telecommunications (IST), pp. 542–546. IEEE (2010)
8. Spirin, N., Han, J.: Survey on web spam detection: principles and algorithms. ACM SIGKDD Explorations Newsletter 13, 50–64 (2012)
9. Webb, G.I., Boughton, J.R., Wang, Z.: Not so naive Bayes: aggregating one-dependence estimators. Machine Learning 58, 5–24 (2005)
10. Fetterly, D., Manasse, M., Najork, M.: Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In: Proceedings of the 7th International Workshop on the Web and Databases: Colocated with ACM SIGMOD/PODS 2004, pp. 1–6. ACM (2004)
11. Piskorski, J., Sydow, M., Weiss, D.: Exploring linguistic features for Web spam detection: A preliminary study. In: Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web, pp. 25–28. ACM (2008)
12. Araujo, L., Martinez-Romo, J.: Web spam detection: new classification features based on qualified link analysis and language models. IEEE Transactions on Information Forensics and Security 5, 581–590 (2010)
13. Becchetti, L., Castillo, C., Donato, D., Leonardi, S., Baeza-Yates, R.: Link-based characterization and detection of web spam. In: 2nd Intl. Workshop on Adversarial Information Retrieval on the Web (AIRWeb), pp. 1–8 (2006)
14. Gyöngyi, Z., Garcia-Molina, H., Pedersen, J.: Combating web spam with TrustRank. In: Proceedings of the Thirtieth International Conference on Very Large Data Bases, vol. 30, pp. 576–587. VLDB Endowment (2004)
15. Becchetti, L., Castillo, C., Donato, D., Leonardi, S., Baeza-Yates, R.: Using rank propagation and probabilistic counting for link-based spam detection. In: Proc. of WebKDD (2006)
16. Davison, B.D.: Recognizing nepotistic links on the web. Artificial Intelligence for Web Search, 23–28 (2000)
17. Abernethy, J., Chapelle, O., Castillo, C.: Web spam identification through content and hyperlinks. In: Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web, pp. 41–44. ACM (2008)
18. Yahoo! Research: Web Spam Collections, http://barcelona.research.yahoo.net/webspam/datasets/ Crawled by the Laboratory of Web Algorithmics, University of Milan, http://law.dsi.unimi.it/ (retrieved August 8, 2012)
19. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. Newsl. 11, 10–18 (2009)


20. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artificial Intelligence 97, 273–324 (1997)
21. Menczer, F.: Mapping the semantics of Web text and links. IEEE Internet Computing 9, 27–36 (2005)
22. Chakrabarti, S., Joshi, M.M., Punera, K., Pennock, D.M.: The structure of broad topics on the web. In: Proceedings of the 11th International Conference on World Wide Web, pp. 251–262. ACM (2002)
23. Chapelle, O., Schölkopf, B., Zien, A.: Semi-supervised learning. MIT Press (2006)
24. Rabiner, L., Juang, B.: An introduction to hidden Markov models. IEEE ASSP Magazine 3, 4–16 (1986)
25. Dorigo, M., Birattari, M., Stutzle, T.: Ant colony optimization. IEEE Computational Intelligence Magazine 1, 28–39 (2006)
26. Geng, G.-G., Wang, C.-H., Li, Q.-D., Xu, L., Jin, X.-B.: Boosting the performance of web spam detection with ensemble under-sampling classification. In: Fourth International Conference on Fuzzy Systems and Knowledge Discovery, FSKD 2007, pp. 583–587. IEEE (2007)
27. Benczúr, A., Bíró, I., Csalogány, K., Sarlós, T.: Web spam detection via commercial intent analysis. In: Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web, pp. 89–92. ACM (2007)
28. Web Spam Challenge (2007), http://webspam.lip6.fr/