Towards Web Spam Filtering with Neural-Based Approaches

Renato Moraes Silva¹, Tiago A. Almeida², and Akebo Yamakami¹

¹ School of Electrical and Computer Engineering, University of Campinas – UNICAMP, 13083-852, Campinas, SP, Brazil
{renatoms,akebo}@dt.fee.unicamp.br
² Department of Computer Science, Federal University of São Carlos – UFSCar, 13565-905, Sorocaba, SP, Brazil
[email protected]
Abstract. The steady growth and popularization of the Web increases competition among websites and creates opportunities for profit in several segments. Thus, there is great interest in keeping a website well positioned in search results. The problem is that many websites use techniques to circumvent search engines, which degrades the search results and exposes users to dangerous content. Given this scenario, this paper presents a performance evaluation of different artificial neural network models for automatically classifying web spam. We conducted an empirical experiment using a well-known, large, and public web spam database. The results indicate that the evaluated approaches outperform state-of-the-art web spam filters.
1 Introduction

The volume of information on the Web is growing explosively. As a consequence, search engines have become important tools to help users find the desired information. The higher the relevance of a page, the greater the chance that the page appears in search results and is clicked. This, combined with the current competitive business environment, has given rise to several malicious methods that try to circumvent search engines by manipulating the relevance of web pages in order to increase the return on investment [1]. Such a technique is known as web spamming, which can be composed of content spam and link spam.

According to Araujo and Martinez-Romo [2], content spam is a technique that alters the logical view that a search engine has of the page contents, for instance, by inserting invisible popular keywords that have no connection with the actual content of the page. Link spam, on the other hand, consists of creating a link structure that increases the relevance of pages in search engines that rank the importance of a page based on the number of links pointing to it.

Web spam is undesirable because, in addition to degrading the search results, it can expose users to malicious content that installs malware on their computers and can steal sensitive information, such as passwords, financial information, or web-banking credentials. Recent estimates suggest that at least 1.3% of all search queries of the Google search engine contain results that link to malicious pages [3].

Given this scenario, there is a consensus that efficient techniques to automatically detect web spam are needed. Previous studies focus only on the analysis of the relation of web links [4, 5], the web page content [6, 7], or both [8–10]. In this paper, we evaluate the combination of different models of artificial neural networks with these three strategies to automatically detect samples of web spam. We have conducted an empirical experiment using a well-known, large, and public web spam database, and the reported results indicate that the evaluated approaches outperform currently established web spam filters.

This paper is organized as follows: Section 2 presents related work regarding web spam detection. Section 3 introduces the basic background of the evaluated artificial neural networks. The experiment protocol and main results are presented in Section 4. Finally, Section 5 offers the main conclusions and guidelines for future work.
2 Related Work

Castillo et al. [10] propose a web spam detection system that combines link-based features and content-based features. In addition, they use the web graph topology by exploiting the link structure among hosts, and they propose features that were used in several other relevant works and in important events, such as the Web Spam Challenge Track I and II.

Svore et al. [11] present a method for detecting web spam that uses content-based features and rank-time features. The experiments were performed using an SVM classifier with a linear kernel, with the rank-time features divided into query-independent and query-dependent ones. The results indicate that the former performs better than the latter.

Noi et al. [12] present a method based on a combination of a graph neural network model and probability mapping graph self-organizing maps. The two models are organized into a layered architecture, consisting of a mixture of unsupervised and supervised learning methods. The results indicate that the proposed approach was comparable with the established methods at that time.

Shengen et al. [1] propose to derive new features for web spam detection, using genetic programming, from existing link-based features and to use them as inputs to support vector machine and genetic programming classifiers. According to the authors, the classifiers that use the new features achieve better results than those that use the features provided in the original database.

Largillier and Peyronnet [13] consider that spammers use web pages with a specific dedicated structure around a given target page to increase its PageRank. The authors propose a technique for identifying web spam that deals with spam links by analyzing the frequency language associated with random walks amongst those dedicated structures. The results indicate that the proposed technique is efficient, since it was able to identify spam using a few simple patterns.
3 Artificial neural networks

An artificial neural network (ANN) is a parallel and distributed method made up of simple processing units called neurons, which has the computational capacity for learning and generalization. In this system, knowledge is acquired through a process called training or learning and is stored in the strength of the connections between neurons, called synaptic weights [14]. A basic ANN model has the following components: a set of synapses, an integrator, an activation function, and a bias. Different ANN models therefore arise depending on the choice of each component [14]. In the following, we briefly present each model we have evaluated in this work.
3.1 Multilayer perceptron neural network
A multilayer perceptron neural network (MLP) is a perceptron-type network that has a set of sensory units forming an input layer, one or more intermediate (hidden) layers, and an output layer of neurons [14]. By default, the MLP is a supervised learning method that uses the backpropagation algorithm, which can be summarized in two stages: forward and backward [15].

In the forward stage, the signal propagates through the network, layer by layer, as follows:

$$u_j^l(n) = \sum_{i=0}^{m_{l-1}} w_{ji}^l(n)\, y_i^{l-1}(n),$$

where $l = 0, 1, 2, \ldots, L$ are the indexes of the network layers, so $l = 0$ represents the input layer and $l = L$ represents the output layer. Here, $y_i^{l-1}(n)$ is the output of neuron $i$ in the previous layer $l-1$, $w_{ji}^l(n)$ is the synaptic weight of neuron $j$ in layer $l$ associated with that output, and $m_l$ is the number of neurons in layer $l$. For $i = 0$, $y_0^{l-1}(n) = +1$ and $w_{j0}^l(n)$ represents the bias applied to neuron $j$ in layer $l$ [14].

The output of neuron $j$ in layer $l$ is given by $y_j^l(n) = \varphi_j(u_j^l(n))$, where $\varphi_j$ is the activation function of neuron $j$. Then, the error can be calculated by $e_j^l(n) = y_j^l(n) - d(n)$, where $d(n)$ is the desired output for an input pattern $x(n)$.

In the backward stage, the derivation of the backpropagation algorithm is performed starting from the output layer, as follows:

$$\delta_j^L(n) = \varphi'_j(u_j^L(n))\, e_j^L(n),$$

where $\varphi'_j$ is the derivative of the activation function. For $l = L, L-1, \ldots, 2$, the following is calculated:

$$\delta_j^{l-1}(n) = \varphi'_j(u_j^{l-1}(n)) \sum_{i=1}^{m_l} w_{ij}^l(n)\, \delta_i^l(n), \quad j = 0, 1, \ldots, m_{l-1}.$$
Consult Haykin [14] and Bishop [15] for more information.
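As an illustration of the two stages, the following minimal sketch implements one forward pass and one gradient-descent update for a network with a single hidden layer of hyperbolic tangent neurons and one linear output neuron. The function names, the learning rate `lr`, and the convention of storing the bias as the first weight of each row are assumptions made for this example, not details taken from the experiments.

```python
import math


def forward(x, w_hidden, w_out):
    """Forward stage: tanh hidden layer, single linear output neuron.

    Each weight row stores the bias first (the i = 0 term, whose input is +1).
    """
    inputs = [1.0] + list(x)
    u_hidden = [sum(w * y for w, y in zip(row, inputs)) for row in w_hidden]
    y_hidden = [math.tanh(u) for u in u_hidden]
    y_out = sum(w * y for w, y in zip(w_out, [1.0] + y_hidden))
    return inputs, y_hidden, y_out


def backward_step(x, d, w_hidden, w_out, lr=0.05):
    """Backward stage followed by a plain gradient-descent weight update."""
    inputs, y_hidden, y_out = forward(x, w_hidden, w_out)
    e = y_out - d                      # error e = y - d
    delta_out = e                      # the derivative of a linear activation is 1
    # Hidden local gradients: phi'(u_j) * w_ij * delta_i, with tanh' = 1 - y^2.
    # w_out[j + 1] skips the bias weight stored at position 0.
    delta_hidden = [(1.0 - y * y) * w_out[j + 1] * delta_out
                    for j, y in enumerate(y_hidden)]
    new_w_out = [w - lr * delta_out * y
                 for w, y in zip(w_out, [1.0] + y_hidden)]
    new_w_hidden = [[w - lr * dj * y for w, y in zip(row, inputs)]
                    for row, dj in zip(w_hidden, delta_hidden)]
    return new_w_hidden, new_w_out, 0.5 * e * e
```

Calling `backward_step` repeatedly on the same pattern should drive the returned squared error towards zero, which is a quick sanity check for the sign conventions used above.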
Levenberg-Marquardt algorithm

The Levenberg-Marquardt algorithm is usually employed to optimize and accelerate the convergence of the backpropagation algorithm [15]. It is considered a second-order method because it uses information about the second derivative of the error function. Considering that the error function is given by the mean square error (MSE), the equation used by the Gauss-Newton method to update the network weights and to minimize the value of the MSE is

$$W_{i+1} = W_i - H^{-1} \nabla f(W).$$

The gradient $\nabla f(W)$ can be represented by $\nabla f(W) = J^T e$ and the Hessian matrix can be calculated by $\nabla^2 f(W) = J^T J + S$, where $J$ is the Jacobian matrix and $S = \sum_{i=1}^{n} e_i \nabla^2 e_i$. Since $S$ can be assumed to be small when compared to the product of the Jacobian matrices, the Hessian matrix can be approximated by $\nabla^2 f(W) \approx J^T J$. Therefore, the weight update in the Gauss-Newton method becomes

$$W_{i+1} = W_i - (J^T J)^{-1} J^T e.$$

One limitation of the Gauss-Newton method is that the simplified Hessian matrix may not be invertible. Thus, the Levenberg-Marquardt algorithm updates the weights by

$$W_{i+1} = W_i - (J^T J + \mu I)^{-1} J^T e,$$

where $I$ is the identity matrix and $\mu$ is a parameter that makes the approximated Hessian a positive definite matrix. More details can be found in Bishop [15] and Hagan and Menhaj [16].
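To make the update rule concrete, the sketch below applies it to a deliberately tiny problem: fitting the two parameters of a line $y = a x + b$, so that $J^T J$ is a 2x2 matrix that can be inverted by hand. The toy model and the function name `lm_step` are assumptions chosen for illustration; a neural network would use the Jacobian of its own error vector in the same formula.

```python
def lm_step(params, xs, ds, mu):
    """One update W <- W - (J^T J + mu*I)^(-1) J^T e for the model y = a*x + b."""
    a, b = params
    errors = [a * x + b - d for x, d in zip(xs, ds)]   # e = y - d
    # Each Jacobian row is [de/da, de/db] = [x, 1].
    a11 = sum(x * x for x in xs) + mu
    a12 = sum(xs)
    a22 = len(xs) + mu
    g1 = sum(x * e for x, e in zip(xs, errors))        # first entry of J^T e
    g2 = sum(errors)                                   # second entry of J^T e
    det = a11 * a22 - a12 * a12
    step_a = (a22 * g1 - a12 * g2) / det               # 2x2 solve by Cramer's rule
    step_b = (a11 * g2 - a12 * g1) / det
    return (a - step_a, b - step_b)
```

With `mu = 0` the step reduces to the Gauss-Newton update, which solves this linear toy problem in a single iteration; a positive `mu` keeps the matrix invertible at the cost of smaller steps.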
3.2 Kohonen’s self-organizing map

Kohonen’s self-organizing map (SOM) is based on unsupervised competitive learning. Its main purpose is to transform an input pattern of arbitrary dimension into a one-dimensional or two-dimensional map in a topologically ordered fashion [14, 17]. The training algorithm for a SOM can be summarized in two stages: competition and cooperation [14, 17].

In the competition stage, a random input pattern $x_j$ is chosen, the similarity between this pattern and all the neurons of the network is calculated by the Euclidean distance,

$$i_d = \arg\min_{\forall i} \|x_j - w_i\|, \quad i = 1, \ldots, k,$$

and the index of the neuron with the lowest distance is selected.

In the cooperation stage, the synaptic weights $w_{i_d}$ that connect the winner neuron to the input pattern $x_j$ are updated. The weights of the neurons neighboring the winner neuron are also updated by

$$w_i(t+1) = w_i(t) + \alpha(t)\, h(t)\, (x_j - w_i(t)),$$

where $t$ is the number of training iterations, $w_i(t+1)$ is the new weight vector, $w_i(t)$ is the current weight vector, $\alpha$ is the learning rate, $h(t)$ is the neighborhood function and $x_j$ is the input pattern.

The neighborhood function $h(t)$ determines the topological neighborhood around the winning neuron, defined by the neighborhood radius $\sigma$, and is equal to 1 for the winner neuron itself. The amplitude of this neighborhood function decreases monotonically as the lateral distance between the neighboring neuron and the winner neuron increases. There are several ways to calculate this neighborhood function, and one of the most common is the Gaussian function, defined by

$$h_{ji}(t) = \exp\left(-\frac{d_{ji}^2}{2\sigma^2(t)}\right),$$

where $d_{ji}$ is the lateral distance between the winner neuron $i$ and neuron $j$. The parameter $\sigma(t)$ defines the neighborhood radius and should be some monotonic function that decreases over time. So, the exponential decay function

$$\sigma(t) = \sigma_0 \exp\left(-\frac{t}{\tau}\right)$$

can be used, where $\sigma_0$ is the initial value of $\sigma$, $t$ is the current iteration number and $\tau$ is a time constant of the SOM, defined by $\tau = \frac{1000}{\log \sigma_0}$.

The competition and cooperation stages are carried out for all the input patterns. Then, the neighborhood radius $\sigma$ and the learning rate $\alpha$ are updated. The learning rate should also decrease with time and can be calculated by

$$\alpha(t) = \alpha_0 \exp\left(-\frac{t}{\tau}\right),$$

where $\alpha_0$ is the initial value of $\alpha$, $t$ is the current iteration number and $\tau$ is the time constant of the SOM, calculated as presented in the cooperation stage.
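The competition and cooperation stages can be sketched as follows for a one-dimensional map, where the lateral distance between two neurons is simply the difference of their positions in the list. The function names, the initial learning rate `alpha0`, the number of epochs and the random seed are assumptions made for this example.

```python
import math
import random


def winner(x, weights):
    """Competition stage: index of the neuron closest to x (Euclidean distance)."""
    return min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))


def train_som(patterns, weights, sigma0=4.0, alpha0=0.5, epochs=50, seed=0):
    """Train a one-dimensional SOM and return the updated weight vectors."""
    rng = random.Random(seed)
    tau = 1000.0 / math.log(sigma0)          # time constant; requires sigma0 > 1
    weights = [list(w) for w in weights]     # work on a copy
    for t in range(epochs):
        # Radius and learning rate are updated once all patterns were presented.
        sigma = sigma0 * math.exp(-t / tau)
        alpha = alpha0 * math.exp(-t / tau)
        for x in rng.sample(patterns, len(patterns)):
            i_d = winner(x, weights)
            for j, w in enumerate(weights):
                d = j - i_d                  # lateral distance on the 1-D map
                h = math.exp(-(d * d) / (2.0 * sigma * sigma))
                weights[j] = [wk + alpha * h * (xk - wk)
                              for wk, xk in zip(w, x)]
    return weights
```

Because the Gaussian neighborhood equals 1 at the winner, a map with a single neuron trained on a single pattern simply moves its weight vector towards that pattern.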
3.3 Learning vector quantization

Learning vector quantization (LVQ) is a supervised learning technique that aims to improve the quality of the classifier decision regions by adjusting the feature map using information about the classes [14]. According to Kohonen [17], the SOM can be used to initialize the feature map by defining the set of weight vectors $w_{ij}$. The next step is to assign labels to the neurons. This assignment can be made by majority vote; in other words, each neuron receives the class label for which it is most often activated.

After this initial step, the LVQ algorithm can be employed. Although the training process is similar to that of the SOM, it does not use neighborhood relations. Instead, the class of the winner neuron is compared with the class of the input vector $x_j$, and the winner is updated as follows:

$$w_{i_d}(t+1) = \begin{cases} w_{i_d}(t) + \alpha(t)\,(x_j - w_{i_d}(t)), & \text{equal class} \\ w_{i_d}(t) - \alpha(t)\,(x_j - w_{i_d}(t)), & \text{different class} \end{cases}$$

where $\alpha$ is the learning rate, $i_d$ is the index of the winner neuron and $t$ is the current iteration number.
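A minimal sketch of this update for a single input vector is shown below. The function name `lvq_update` and the use of string labels are assumptions for illustration; only the winner neuron is modified, in place.

```python
import math


def lvq_update(x, label, weights, neuron_labels, alpha):
    """Move the winner towards x if the classes match, away from x otherwise."""
    i_d = min(range(len(weights)), key=lambda i: math.dist(x, weights[i]))
    sign = 1.0 if neuron_labels[i_d] == label else -1.0
    weights[i_d] = [w + sign * alpha * (xk - w)
                    for w, xk in zip(weights[i_d], x)]
    return i_d
```

For example, with prototypes at `[0.0, 0.0]` (ham) and `[1.0, 1.0]` (spam), presenting the ham vector `[0.2, 0.0]` with `alpha = 0.5` pulls the first prototype halfway towards it, while presenting the same vector labeled as spam pushes that prototype away.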
3.4 Radial basis function neural network
A radial basis function neural network (RBF), in its most basic form, has three layers. The first one is the input layer, which has sensory units connecting the network to its environment. The second layer is hidden and composed of a set of neurons that use radial basis functions to group the input patterns into clusters. The third layer is the output one, which is linear and provides the network response to the activation pattern applied to the input layer [14].

The most common activation function for RBFs is the Gaussian, defined by

$$h(x) = \exp\left(-\frac{(x - c)^2}{r^2}\right),$$

where $x$ is the input vector, $c$ is the center point and $r$ is the width of the function.

The procedure for training an RBF is performed in two stages. In the first one, the parameters of the basis functions related to the hidden layer are determined through some method of unsupervised training, such as K-means. In the second training phase, the weights of the output layer are adjusted, which corresponds to solving a linear problem [15]. According to Bishop [15], considering an input vector $x = [x_1, x_2, \ldots, x_n]$, the network output is calculated by

$$y_k = \sum_{j=1}^{m} w_{kj} h_j,$$

where $w = [w_{k1}, w_{k2}, \ldots, w_{km}]$ are the weights and $h = [h_1, h_2, \ldots, h_m]$ are the radial basis functions, calculated by a radial basis activation function. After calculating the outputs, the weights should be updated. A formal solution to calculate the weights is given by $w = h^{\dagger} d$, where $h$ is the matrix of basis functions, $h^{\dagger}$ represents the pseudo-inverse of $h$ and $d$ is a vector with the desired responses [15]. Consult Haykin [14], Bishop [15] and Orr [18] for more information.
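The following sketch illustrates the second training stage for the smallest non-trivial case of two Gaussian basis functions with fixed centers. It solves the 2x2 normal equations $(h^T h) w = h^T d$, which gives the same weights as the pseudo-inverse solution when the columns of $h$ are linearly independent. The function names and the restriction to two centers are assumptions made to keep the example free of external libraries.

```python
import math


def gaussian(x, c, r):
    """Gaussian radial basis function with center c and width r."""
    return math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / r ** 2)


def rbf_output(x, centers, r, weights):
    """Linear output layer: weighted sum of the basis function activations."""
    return sum(w * gaussian(x, c, r) for w, c in zip(weights, centers))


def fit_two_weights(patterns, targets, centers, r):
    """Least-squares output weights for exactly two basis functions."""
    h = [[gaussian(x, c, r) for c in centers] for x in patterns]
    a11 = sum(row[0] * row[0] for row in h)
    a12 = sum(row[0] * row[1] for row in h)
    a22 = sum(row[1] * row[1] for row in h)
    b1 = sum(row[0] * d for row, d in zip(h, targets))
    b2 = sum(row[1] * d for row, d in zip(h, targets))
    det = a11 * a22 - a12 * a12              # non-zero for independent columns
    return [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]
```

When the number of patterns equals the number of basis functions, as in a two-pattern toy problem with one center on each pattern, the fitted network reproduces the desired responses exactly.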
4 Experiment and results

To give credibility to the results and to make the experiments reproducible, all the tests were performed with the public and well-known WEBSPAM-UK2006 collection³. It is composed of 77.9 million web pages hosted on 11,000 hosts in the UK domains. It is important to note that this corpus was used in the Web Spam Challenge⁴ I and II, which are the best-known competitions of web spam detection techniques.

In our experiment, we followed the same competition guidelines. In this way, we used three sets of 8,487 feature vectors employed to discriminate the hosts as spam or ham. Each set is composed of 6,509 hosts labeled as ham and 1,978 labeled as spam. The organizers provided three sets of features: the first one composed of 96 content-based features [10], the second one composed of 41 link-based features [19] and the third one composed of 138 transformed link-based features [10], which are simple combinations or logarithm operations of the link-based features.

4.1 Protocol
We evaluated the following well-known artificial neural network (ANN) algorithms to automatically detect web spam: multilayer perceptron (MLP) trained with the gradient descent (MLP-GD) and Levenberg-Marquardt (MLP-LM) methods, Kohonen’s self-organizing map (SOM) with learning vector quantization (LVQ) and radial basis function neural network (RBF).

³ Yahoo! Research: “Web Spam Collections”. Available at http://barcelona.research.yahoo.net/webspam/datasets/.
⁴ Web Spam Challenge: http://webspam.lip6.fr/
We have implemented all the MLPs with a single hidden layer and with one neuron in the output layer. In addition, we have employed a linear activation function for the neuron of the output layer and a hyperbolic tangent activation function for the neurons of the intermediate layer. We have initialized the weights and biases with random values in the interval $[-1, 1]$ and normalized the data to this interval by

$$x = 2 \cdot \frac{x - x_{\min}}{x_{\max} - x_{\min}} - 1,$$

where $x$ is the array with all the feature vectors and $x_{\min}$ and $x_{\max}$ are, respectively, the smallest and largest values in the array $x$. We have also performed this data normalization for the SOMs with LVQ and the RBFs.

Regarding the parameters, in all simulations we have employed the following stopping criteria: the number of iterations exceeds a threshold θ, the mean square error (MSE) of the training set falls below a threshold γ, or the MSE of the validation set increases (checked every 10 iterations). The parameters used for each ANN model were chosen empirically, by trial and error, and are presented in Table 1.

Table 1. Parameters of the neural networks.
Parameter  MLP-GD  MLP-LM  RBF  SOM + LVQ
θ  10,000  500  2,000
γ  0.001  0.001  0.01
Step learning α  0.005  0.001
Number of neurons in the hidden layer  100  50  10  120
Neighborhood function  -  One-dimensional
Initial neighborhood radius σ  4
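The normalization to the interval $[-1, 1]$ described above can be sketched as follows. The function name is an assumption, and the sketch takes the minimum and maximum over the whole array of feature vectors, as stated in the text; it also assumes that the array contains at least two distinct values, since otherwise the denominator would be zero.

```python
def normalize(vectors):
    """Rescale every value to [-1, 1] using the global minimum and maximum."""
    flat = [v for row in vectors for v in row]
    x_min, x_max = min(flat), max(flat)
    span = x_max - x_min                     # assumed to be non-zero
    return [[2.0 * (v - x_min) / span - 1.0 for v in row] for row in vectors]
```

For instance, `normalize([[0.0, 5.0], [10.0, 2.5]])` maps the smallest value to -1, the largest to +1, and the midpoint 5.0 to 0.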
Note that, as Table 1 presents, for the simulations using the RBFs we have not employed any stopping criteria because the training method is not iterative, as pointed out in Section 3.

To assess the performance of the algorithms, we used random sub-sampling validation, which is also known as Monte Carlo cross-validation [20]. Such a method provides more freedom to define the size of the training and testing subsets. Unlike the traditional k-fold cross-validation, random sub-sampling validation allows as many repetitions as desired, using any percentage of data for training and testing.

In this way, we divided each simulation into 10 tests and calculated the arithmetic mean and standard deviation of the following well-known measures: accuracy rate (Acc%), spam recall rate (Rcl%), specificity (Spc%), spam precision rate (Pcs%), and F-measure (FM) [21–24]. In each test, we randomly selected 80% of the samples of each class to be presented to the algorithms in the training stage and the remaining ones were separated for testing.

4.2 Results
In this section, we report the main results of our evaluation. Table 2 presents the performance achieved by each ANN using each set of feature vectors. Bold values indicate the highest score acquired by each ANN and values preceded by the symbol “*” indicate the highest score in each performance measure.

Table 2. Results achieved by each evaluated neural network for WEBSPAM-UK2006 dataset.
      Content        Links          Trans. Links   Content+Links
      Mean           Mean           Mean           Mean
MLP trained with the gradient descent method
Acc   86.2±1.2       86.4±1.3       88.3±0.9       *89.3±0.7
Rcl   57.0±4.6       61.6±3.1       75.2±2.2       74.2±3.2
Spc   95.0±0.4       93.9±0.9       92.1±1.4       94.0±0.9
Pcs   77.5±2.7       75.3±2.9       73.9±3.1       79.5±2.5
FM    0.656±0.039    0.677±0.026    0.745±0.014    0.767±0.016
MLP trained with the Levenberg-Marquardt method
Acc   88.6±1.4       88.1±1.6       80.0±0.8       92.1±0.6
Rcl   69.3±4.2       70.8±6.6       74.9±3.9       *81.7±2.0
Spc   94.2±1.2       93.0±0.7       93.1±0.4       95.3±0.2
Pcs   77.6±4.6       74.1±3.7       76.1±2.1       *84.4±1.9
FM    0.731±0.032    0.723±0.049    0.754±0.27     *0.830±0.008
RBF
Acc   79.7±0.6       76.7±0.4       81.7±0.9       76.8±0.3
Rcl   26.7±2.8       4.8±1.3        38.9±3.9       5.5±1.4
Spc   95.8±0.7       *98.6±0.3      94.8±0.06      98.4±0.4
Pcs   65.8±3.3       50.3±7.9       69.3±2.5       50.4±5.2
FM    0.379±0.030    0.087±0.024    0.497±3.5      0.098±0.024
SOM + LVQ
Acc   80.6±0.8       77.3±0.9       85.1±1.5       78.2±0.4
Rcl   29.2±2.6       15.4±3.6       62.7±5.6       17.5±1.3
Spc   96.3±0.6       96.0±1.2       91.9±0.5       96.6±0.5
Pcs   70.4±4.0       54.4±7.4       69.9±2.6       61.3±3.5
FM    0.412±0.030    0.238±0.046    0.660±0.042    0.272±0.018

According to the results, the MLP trained with the Levenberg-Marquardt method achieved the best performance. On the other hand, the SOM and the RBF achieved poor results. Note that, although the best set of results was obtained when all the features were used (content-based and link-based features), on average the transformed link-based features offered a more balanced classification if we take into account the performance of all the classifiers.

If we compare the performance achieved by the algorithms with each set of features, we can note that, in general, the best results were achieved by the MLPs using the combination of content-based and link-based features. However, for the RBF and the SOM, the results achieved with this set of features were much worse than those achieved with the transformed link-based features. On the other hand, if we compare only the results achieved by the ANNs using the content-based features with those using the link-based features, we can see that, in general, the networks performed better when content-based features were employed.

It is important to note that the results shown in Table 2 also indicate that, in general, the ANNs are more successful at identifying ham hosts than spam ones (the specificity rate is higher than the spam recall rate). Thus, we suspected that the low capacity for spam recognition was due to the fact that the data is unbalanced. The large difference between the number of ham and spam samples used for training the ANNs could bias the classifiers towards the class with the larger number of samples. So, we decided to use the same number of samples of the two classes in the training stage. In this way, in each of the ten tests of each simulation, 1,978 samples of each class were randomly selected to be exposed to the classifiers, and new simulations were performed keeping the proportion of 80% of the samples of each class for training and 20% for testing. Table 3 presents the classification results.

Table 3. Results achieved by each evaluated neural network for WEBSPAM-UK2006 dataset using balanced classes in the training stage.
      Content        Links          Trans. Links   Content+Links
      Mean           Mean           Mean           Mean
MLP trained with the gradient descent method
Acc   84.7±1.6       83.9±1.9       87.4±1.6       *89.1±1.4
Rcl   82.9±2.0       90.2±2.6       89.6±2.2       *92.6±2.4
Spc   86.4±2.4       77.9±3.0       85.2±1.1       85.6±2.0
Pcs   86.1±2.4       80.0±2.9       85.9±2.0       86.7±2.1
FM    0.845±0.015    0.848±0.019    0.877±0.019    *0.895±0.013
MLP trained with the Levenberg-Marquardt method
Acc   87.6±1.5       86.4±1.7       86.3±2.7       88.4±2.1
Rcl   86.5±2.0       92.1±3.8       87.5±4.0       91.9±3.5
Spc   88.8±1.9       81.1±2.6       85.2±3.6       85.0±3.3
Pcs   *89.1±2.4      82.3±2.6       85.8±3.3       85.6±2.7
FM    0.877±0.017    0.868±0.019    0.866±0.027    0.886±0.021
RBF
Acc   63.6±1.3       65.9±3.5       75.7±1.8       66.7±0.8
Rcl   45.6±1.9       79.5±2.6       75.2±4.3       72.3±2.3
Spc   81.6±1.5       52.2±8.7       76.1±5.6       61.1±3.3
Pcs   71.3±2.1       62.7±3.8       76.2±3.2       65.1±1.3
FM    0.556±0.016    0.700±0.019    0.755±0.018    0.684±0.007
SOM + LVQ
Acc   66.9±0.7       77.2±0.5       86.0±0.4       78.1±0.6
Rcl   59.7±5.9       11.6±1.9       64.7±2.9       16.6±1.6
Spc   74.2±5.5       *97.1±0.5      92.5±0.09      96.7±0.9
Pcs   70.1±2.9       54.9±4.8       72.3±1.8       61.4±6.3
FM    0.642±0.026    0.191±0.028    0.683±0.013    0.260±0.021
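The balanced sub-sampling and the five reported measures can be sketched as shown below. This is an illustration under stated assumptions rather than the code used in the experiments: spam is treated as the positive class, the measures follow the usual confusion-matrix definitions (with the F-measure taken as the harmonic mean of precision and recall), and the function names and the `seed` parameter are hypothetical.

```python
import random


def balanced_split(spam, ham, train_fraction=0.8, seed=0):
    """Draw the same number of samples per class, then split each class 80/20."""
    rng = random.Random(seed)
    n = min(len(spam), len(ham))             # size of the smaller class
    cut = int(train_fraction * n)
    train, test = [], []
    for label, items in ((1, spam), (0, ham)):
        chosen = rng.sample(items, n)        # random selection without replacement
        train += [(x, label) for x in chosen[:cut]]
        test += [(x, label) for x in chosen[cut:]]
    return train, test


def measures(tp, fp, tn, fn):
    """Accuracy, spam recall, specificity, spam precision and F-measure."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "Acc": (tp + tn) / (tp + fp + tn + fn),
        "Rcl": recall,
        "Spc": tn / (tn + fp),
        "Pcs": precision,
        "FM": 2.0 * precision * recall / (precision + recall),
    }
```

Repeating `balanced_split` with ten different seeds and averaging the output of `measures` over the ten test sets would mirror the random sub-sampling protocol described in Section 4.1.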
The results in Table 3 indicate that training the ANNs with the same number of samples in each class improved the performance of the classifiers in most cases. If we compare these results with the ones presented in Table 2, we can see that the F-measure was higher in almost all the simulations, except in the simulations with the SOM using the link-based features and the combination of the link-based features with the content-based features.

Note that, with this new configuration, the MLP trained with the gradient descent method achieved the best performance. Again, we observed that the MLPs are more efficient when the combination of content-based and link-based features is used. However, for the RBF and the SOM, the best results were again achieved when the transformed link-based features were employed.

To show that the evaluated ANNs are really competitive, in Table 4 we present a comparison between the best results achieved by the evaluated methods and the top-performing techniques available in the literature. To offer a fair evaluation, we have implemented all the compared approaches and tested them using the same dataset with unbalanced classes, features and protocol employed for our ANNs. We set exactly the same parameters as described in the papers or, otherwise, we kept the default values. In summary, we have implemented the bagging of decision trees [10, 6] and the boosting of decision trees [6] using the WEKA library, and the linear support vector machines (SVM) [11, 25–27] using the LIBSVM library. For the genetic programming [1], we just report the results available in the paper, since the authors adopted the same dataset and protocol we have used.

Table 4. Comparison between the results achieved by the evaluated neural networks and the top performance classifiers available in the literature.
Classifiers                                             Pcs    Rcl    FM
Best results available in the literature
Castillo et al. [10], Ntoulas et al. [6] - content      81.5   68.0   0.741
Svore et al. [11] - content                             55.5   86.5   0.677
Ntoulas et al. [6] - content                            78.2   68.2   0.728
Shengen et al. [1] - links                              69.8   76.3   0.726
Shengen et al. [1] - transformed links                  76.5   81.4   0.789
Best results achieved by the neural networks
MLP + Gradient - content+links                          79.5   74.2   0.767
MLP + Levenberg - content+links                         84.4   81.7   0.830
MLP + Levenberg - content+links (balanced classes)      85.6   91.9   0.886
MLP + Gradient - content+links (balanced classes)       86.7   92.6   0.895
The comparison indicates that the MLPs are very suitable for dealing with the problem. Note that both the MLP trained with the gradient descent method and the MLP trained with the Levenberg-Marquardt method achieved the highest performances. Their results are comparable with top-performance methods such as bagging [10, 6] and boosting [6]. Moreover, taking into account the spam precision and recall rates, the MLP trained with the gradient descent method using the complete set of features outperformed all the compared approaches.
5 Conclusions and future work

In this paper, we have presented a performance evaluation of different models of artificial neural networks used to automatically classify real samples of web spam using content-based features, link-based features, the combination of both and transformed link-based features.

The results indicate that, in general, the multilayer perceptron neural networks trained with the gradient descent and Levenberg-Marquardt methods are the best evaluated models, especially when the combination of the link-based and content-based features is employed. Both methods outperformed established techniques available in the literature, such as decision trees [10], SVM [11] and genetic programming [1].

Furthermore, since the data we used in our experiment is unbalanced, the results also indicate that the evaluated techniques generally perform better when trained with the same number of samples of each class. This is because the models tend to be biased towards the class with the largest number of samples.
Overall, we have also concluded that Kohonen’s self-organizing map and the radial basis function neural network were inferior to the multilayer perceptron neural networks in all the simulations, regardless of the chosen set of features. Currently, we are working on proposing new sets of features and new possible combinations in order to enhance the classifiers' predictions.
References

1. Shengen, L., Xiaofei, N., Peiqi, L., Lin, W.: Generating new features using genetic programming to detect link spam. In: Proc. of the ICICTA’11, Shenzhen, China (2011) 135–138
2. Araujo, L., Martinez-Romo, J.: Web spam detection: New classification features based on qualified link analysis and language models. IEEE Trans. on Inf. Forensics and Security 5(3) (2010) 581–590
3. Egele, M., Kolbitsch, C., Platzer, C.: Removing web spam links from search engine results. J. in Computer Virology 7 (2011) 51–62
4. Shen, G., Gao, B., Liu, T., Feng, G., Song, S., Li, H.: Detecting link spam using temporal information. In: Proc. of the 6th ICDM, Hong Kong, China (2006) 1049–1053
5. Gan, Q., Suel, T.: Improving web spam classifiers using link structure. In: Proc. of the 3rd AIRWeb, Banff, Alberta, Canada (2007) 17–20
6. Ntoulas, A., Najork, M., Manasse, M., Fetterly, D.: Detecting spam web pages through content analysis. In: Proc. of the WWW, Edinburgh, Scotland (2006) 83–92
7. Silva, R.M., Almeida, T.A., Yamakami, A.: Artificial Neural Networks for Content-based Web Spam Detection. In: Proc. of the 14th ICAI, Las Vegas, NV, USA (2012) 1–7
8. Bíró, I., Siklósi, D., Szabó, J., Benczúr, A.A.: Linked latent dirichlet allocation in web spam filtering. In: Proc. of the 5th AIRWeb, Madrid, Spain (2009) 37–40
9. Abernethy, J., Chapelle, O., Castillo, C.: Graph regularization methods for web spam detection. Machine Learning 81(2) (2010) 207–225
10. Castillo, C., Donato, D., Gionis, A.: Know your neighbors: Web spam detection using the web topology. In: Proc. of the 30th SIGIR, Amsterdam, The Netherlands (2007) 423–430
11. Svore, K.M., Wu, Q., Burges, C.J.: Improving web spam classification using rank-time features. In: Proc. of the 3rd AIRWeb, Banff, Alberta, Canada (2007) 9–16
12. Noi, L.D., Hagenbuchner, M., Scarselli, F., Tsoi, A.: Web spam detection by probability mapping graphsoms and graph neural networks. In: Proc. of the 20th ICANN, Thessaloniki, Greece (2010) 372–381
13. Largillier, T., Peyronnet, S.: Using patterns in the behavior of the random surfer to detect webspam beneficiaries. In: Proc. of the WISS’10, Berlin, Heidelberg (2011) 241–253
14. Haykin, S.: Neural Networks: A Comprehensive Foundation. 2nd edn. Prentice Hall, New York, NY, USA (1998)
15. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford Press, Oxford (1995)
16. Hagan, M.T., Menhaj, M.B.: Training feedforward networks with the Marquardt algorithm. IEEE Trans. on Neural Networks 5(6) (1994) 989–993
17. Kohonen, T.: The self-organizing map. In: Proc. of the IEEE. Volume 9. (1990) 1464–1480
18. Orr, M.J.L.: Introduction to radial basis function networks (1996)
19. Becchetti, L., Castillo, C., Donato, D., Leonardi, S., Baeza-Yates, R.: Using rank propagation and probabilistic counting for link-based spam detection. In: Proc. of the WebKDD’06, Philadelphia, USA (2006)
20. Shao, J.: Linear model selection by cross-validation. Journal of the American Statistical Association 88(422) (1993) 486–494
21. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. 2nd edn. Morgan Kaufmann, San Francisco, CA (2005)
22. Almeida, T., Almeida, J., Yamakami, A.: Spam Filtering: How the Dimensionality Reduction Affects the Accuracy of Naive Bayes Classifiers. Journal of Internet Services and Applications 1(3) (2011) 183–200
23. Almeida, T., Yamakami, A., Almeida, J.: Probabilistic Anti-Spam Filtering with Dimensionality Reduction. In: Proceedings of the 25th ACM Symposium On Applied Computing, Sierre, Switzerland (2010) 1804–1808
24. Almeida, T., Yamakami, A.: Compression-Based Spam Filter. Security and Communication Networks (2012) 1–9
25. Almeida, T.A., Yamakami, A.: Advances in Spam Filtering Techniques. In Elizondo, D., Solanas, A., Martinez-Balleste, A., eds.: Computational Intelligence for Privacy and Security. Volume 394 of Studies in Computational Intelligence. Springer (2012) 199–214
26. Almeida, T.A., Yamakami, A.: Facing the Spammers: A Very Effective Approach to Avoid Junk E-mails. Expert Systems with Applications 39(7) (2012) 6557–6561
27. Almeida, T., Yamakami, A.: Occam's Razor-Based Spam Filter. Journal of Internet Services and Applications (2012) 1–9