Artificial Neural Networks for Content-based Web Spam Detection

Renato M. Silva¹, Tiago A. Almeida², and Akebo Yamakami¹

¹ School of Electrical and Computer Engineering, University of Campinas – UNICAMP, Campinas, SP, Brazil
{renatoms,akebo}@dt.fee.unicamp.br
² Department of Computer Science, Federal University of São Carlos – UFSCar, Sorocaba, SP, Brazil
[email protected]
Abstract— Web spam has become a big problem in the lives of Internet users, causing personal harm and economic losses. Although some approaches have been proposed to automatically detect and avoid this problem, the speed with which spammers improve their techniques requires classifiers that are more generic, efficient, and highly adaptive. Despite the common sense in the literature that neural-based techniques have a high capacity for generalization and adaptation, as far as we know no work has explored such methods to avoid web spam. Given this scenario, and to fill this important gap, this paper presents a performance evaluation of different models of artificial neural networks used to automatically classify and filter real samples of web spam based on their content. The results indicate that some of the evaluated approaches have great potential, since they are suitable to deal with the problem and clearly outperform the state-of-the-art techniques.

Keywords: web spam; spam classifier; artificial neural network; pattern recognition
1. Introduction

The web has become an essential tool in the lives of its users. The steady growth in the volume of available information requires search engines that help users find the desired data and that present the results in an organized, quick, and efficient way. However, there are several malicious methods that try to circumvent search engines by manipulating the relevance of web pages. This deteriorates the search results, leaves users frustrated, and exposes them to inappropriate and insecure content. Such a technique is known as web spamming [1].

Web spam is composed of content spam and/or link spam. According to Gyongyi and Garcia-Molina [2], a very simple example of content spam is a web page with pornography and thousands of invisible keywords that have no connection with the pornographic content. Shen et al. [3] explain that link spam is a kind of web spamming that creates thousands of links to the web pages it intends to promote. Such a method increases the relevance of pages in search engines that rank the importance of a page by the number of links pointing to it.
The minor problems caused by web spamming are the nuisance of forged, undeserved, and unexpected answers that promote unwanted pages [3]. However, there are more serious problems that pose a significant threat to users, since web spam may carry malicious content that installs malware on victims' computers, enabling the theft of bank passwords and other personal information, degrading the performance of computers and networks, among others [4]. According to annual reports, the amount of web spam is increasing frightfully. Eiron et al. [5] classified 100 million web pages and found that, on average, 11 of the 20 best results were pornographic pages that achieved high relevance through the manipulation of links. According to Gyongyi and Garcia-Molina [2], more than 20% of the search results are web spam. Furthermore, the same study indicated that about 9% of the search results have at least one link from a spam page among the 10 best results, while 68% of all queries have some kind of spam among the four best results presented by the search engines [2].

Unlike the large number of available approaches to deal with email spam [6], [7], [8], [9], [10], [11], [12], [13], [14], there are few methods in the literature to automatically detect web spam. In general, all the techniques employ one of the following strategies:
• analyzing only the relation of web links [3], [15];
• analyzing only the content of the web pages [16], [17]; or
• extracting features from both content and links [18], [19], [20].

Among all the approaches presented in the literature, machine learning methods are the most used ones, such as ensemble selection [21], [22], clustering [20], [23], random forest [21], boosting [21], [22], support vector machines [1], [24], and decision trees [15], [20]. However, the main conclusions presented in the literature indicate that the high speed with which spammers improve their techniques requires that classifiers be more generic and adaptive. To the best of our knowledge, the artificial neural network, one of the most popular and successful techniques for pattern recognition, has not been evaluated for classifying web spam. To fill this important gap, this paper presents a performance evaluation of different models of artificial
neural networks used to automatically classify and filter real and public samples of web spam.

This paper is organized as follows: Section 2 introduces the background of the evaluated artificial neural networks. The experimental protocol and the main results are presented in Section 3. Finally, Section 4 offers the main conclusions and guidelines for future work.
2. Artificial neural network

An artificial neural network (ANN) is a parallel and distributed method made up of simple processing units, called neurons, which have the computational capacity of learning and generalization. In this system, knowledge is acquired through a process called training or learning, and it is stored in the strength of the connections between neurons, called synaptic weights [25]. A basic model of ANN has the following components: a set of synapses, an integrator, an activation function, and a bias. Hence, there are different models of ANN depending on the choice of each component [25]. In the following, we briefly present each model evaluated in this work.
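To make these components concrete, the following is a minimal sketch of a single neuron in Python; the hyperbolic tangent is just one possible choice of activation function, and all names are illustrative.

```python
import numpy as np

def neuron(x, w, b):
    """A basic neuron: synapses (w), integrator, bias (b), activation.

    x : input vector
    w : synaptic weights, one per input
    b : bias term
    """
    u = np.dot(w, x) + b   # integrator: weighted sum of the inputs plus bias
    return np.tanh(u)      # activation function (hyperbolic tangent)
```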
2.1 Multilayer perceptron neural network

A multilayer perceptron neural network (MLP) is a perceptron-type network that has a set of sensory units composed of an input layer, one or more intermediate (hidden) layers, and an output layer of neurons. This type of network is fully connected, in other words, all neurons in any layer are connected to all neurons of the previous layer [25]. By default, the MLP is a supervised learning method that uses the backpropagation algorithm, which can be summarized in two stages: forward and backward. In the forward stage, the signal propagates through the network, layer by layer, as follows:

$$u_j^l(n) = \sum_{i=0}^{m^{l-1}} w_{ji}^l(n)\, y_i^{l-1}(n),$$

where $l = 0, 1, 2, \ldots, L$ are the indexes of the network layers, so $l = 0$ represents the input layer and $l = L$ represents the output layer. Moreover, $y_i^{l-1}(n)$ is the output of neuron $i$ in the previous layer, $l-1$, $w_{ji}^l(n)$ is the synaptic weight of neuron $j$ in layer $l$, and $m^l$ corresponds to the number of neurons in layer $l$. For $i = 0$, $y_0^{l-1}(n) = +1$ and $w_{j0}^l(n)$ represents the bias applied to neuron $j$ in layer $l$ [25], [26]. The output of neuron $j$ in layer $l$ is given by

$$y_j^l(n) = \varphi_j(u_j^l(n)),$$

where $\varphi_j$ is the activation function of $j$. Then, the error can be calculated by

$$e_j^L(n) = y_j^L(n) - d(n),$$

where $d(n)$ is the desired output for an input pattern $x(n)$. In the backward stage, the derivation of the backpropagation algorithm is performed starting from the output layer, as follows:

$$\delta_j^L(n) = \varphi_j'(u_j^L(n))\, e_j^L(n),$$

where $\varphi_j'$ is the derivative of the activation function. Then, for $l = L, L-1, \ldots, 2$, the deltas of the previous layer are calculated by

$$\delta_j^{l-1}(n) = \varphi_j'(u_j^{l-1}(n)) \sum_{i=1}^{m^l} w_{ij}^l(n)\, \delta_i^l(n),$$

for $j = 0, 1, \ldots, m^{l-1} - 1$ [25], [26].

According to Haykin [25], one of the most common stopping criteria for MLPs and other types of ANNs is the number of training iterations. The algorithm can also be stopped when the network reaches a minimum error, which can be calculated by the mean square error:

$$MSE = \frac{1}{n}\, e^T e,$$

where $n$ is the number of input patterns and $e$ is an array that stores the relative error for all input patterns in the network.
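As an illustration of the two stages, here is a minimal sketch of one backpropagation step for an MLP with a single hidden layer, a hyperbolic tangent hidden activation, and a linear output neuron (the architecture adopted later in Section 3); the plain gradient descent update and all names are illustrative, not the implementation used in our experiments.

```python
import numpy as np

def backprop_step(x, d, W1, b1, W2, b2, alpha):
    """One forward/backward pass for a one-hidden-layer MLP.

    x     : input pattern
    d     : desired output
    W1,b1 : hidden-layer weights and biases
    W2,b2 : output-layer weights and biases
    alpha : learning rate
    """
    # forward stage: propagate the signal layer by layer
    u1 = W1 @ x + b1
    y1 = np.tanh(u1)                         # hidden activations
    y2 = W2 @ y1 + b2                        # linear output neuron
    # backward stage: start from the output-layer error
    e = y2 - d
    delta2 = e                               # derivative of a linear activation is 1
    delta1 = (1 - y1**2) * (W2.T @ delta2)   # tanh'(u) = 1 - tanh(u)^2
    # gradient descent weight updates
    W2 -= alpha * np.outer(delta2, y1)
    b2 -= alpha * delta2
    W1 -= alpha * np.outer(delta1, x)
    b1 -= alpha * delta1
    return W1, b1, W2, b2
```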
2.1.1 Levenberg-Marquardt algorithm

The Levenberg-Marquardt algorithm is usually employed to optimize and accelerate the convergence of the backpropagation algorithm [27]. It is considered a second-order method because it uses information about the second derivative of the error function. Considering that the error function is given by the MSE, the equation used by the Gauss-Newton method to update the network weights and to minimize the value of the MSE is

$$W_{i+1} = W_i - H^{-1} \nabla f(W).$$

The gradient $\nabla f(W)$ can be represented by $\nabla f(W) = J^T e$ and the Hessian matrix can be calculated by $\nabla^2 f(W) = J^T J + S$, where $J$ is the Jacobian matrix

$$J = \begin{bmatrix}
\frac{\partial e_1}{\partial x_1} & \frac{\partial e_1}{\partial x_2} & \cdots & \frac{\partial e_1}{\partial x_n} \\
\frac{\partial e_2}{\partial x_1} & \frac{\partial e_2}{\partial x_2} & \cdots & \frac{\partial e_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial e_n}{\partial x_1} & \frac{\partial e_n}{\partial x_2} & \cdots & \frac{\partial e_n}{\partial x_n}
\end{bmatrix},$$

$x_i$ is the $i$-th input pattern of the network, and

$$S = \sum_{i=1}^{n} e_i \nabla^2 e_i.$$

Since $S$ is small when compared to the product of the Jacobian matrices, the Hessian matrix can be approximated by $\nabla^2 f(W) \approx J^T J$. Therefore, the weight update in the Gauss-Newton method can be done by

$$W_{i+1} = W_i - (J^T J)^{-1} J^T e.$$

One limitation of the Gauss-Newton method is that the simplified Hessian matrix may not be invertible. Thus, the Levenberg-Marquardt algorithm updates the weights by

$$W_{i+1} = W_i - (J^T J + \mu I)^{-1} J^T e,$$

where $I$ is the identity matrix and $\mu$ is a parameter that makes the Hessian a positive definite matrix. More details about the Levenberg-Marquardt algorithm can be found in [26], [27], [28].
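To make the update rule concrete, the following is a minimal sketch of a single Levenberg-Marquardt iteration in Python/NumPy; it assumes the Jacobian J and the error vector e have already been computed, and the names are illustrative rather than the implementation used in our experiments.

```python
import numpy as np

def levenberg_marquardt_step(W, e, J, mu):
    """One Levenberg-Marquardt weight update.

    W  : current weight vector (flattened)
    e  : vector of per-pattern errors
    J  : Jacobian of e with respect to W
    mu : damping term; larger values push the step
         toward plain gradient descent
    """
    H = J.T @ J    # Gauss-Newton approximation of the Hessian
    g = J.T @ e    # gradient of the squared-error objective
    # solve (H + mu*I) step = g instead of inverting the matrix
    step = np.linalg.solve(H + mu * np.eye(W.size), g)
    return W - step
```

In practice, mu is usually adapted during training: decreased when a step reduces the MSE and increased otherwise, as discussed in [28].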
2.2 Kohonen's self-organizing map

The Kohonen self-organizing map (SOM) is based on unsupervised competitive learning. Its main purpose is to transform an input pattern of arbitrary dimension into a one-dimensional or two-dimensional map in a topologically ordered fashion [25], [29]. The training algorithm of a SOM can be summarized in two stages: competition and cooperation [25], [29]. In the competition stage, a random input pattern $x_j$ is chosen and the similarity between this pattern and all the neurons of the network is calculated by the Euclidean distance, defined by

$$id = \arg\min_{\forall i} \|x_j - w_i\|,$$

where $i = 1, \ldots, k$, and the index of the neuron with the lowest distance is selected. In the cooperation stage, the synaptic weight vector $w_{id}$ that connects the winner neuron to the input pattern is updated. The weights of the neurons neighboring the winner neuron are also updated by

$$w_i(t+1) = w_i(t) + \alpha(t)\, h(t)\, (x_i - w_i(t)),$$

where $t$ is the number of training iterations, $w_i(t+1)$ is the new weight vector, $w_i(t)$ is the current weight vector, $\alpha$ is the learning rate, $h(t)$ is the neighborhood function, and $x_i$ is the input pattern. The neighborhood function $h(t)$ is equal to 1 when the winner neuron is updated. This is because it determines
the topological neighborhood around the winning neuron, defined by the neighborhood radius $\sigma$. The amplitude of this neighborhood function monotonically decreases as the lateral distance between the neighboring neuron and the winner neuron increases. There are several ways to calculate this neighborhood function, and one of the most common is the Gaussian function, defined by

$$h_{ji}(t) = \exp\left(\frac{-d_{ji}^2}{2\sigma^2(t)}\right),$$

where $d_{ji}$ is the lateral distance between the winner neuron $i$ and neuron $j$. The parameter $\sigma(t)$ defines the neighborhood radius and should be some monotonic function that decreases over time, so the exponential decay function

$$\sigma(t) = \sigma_0 \exp\left(-\frac{t}{\tau}\right)$$

can be used, where $\sigma_0$ is the initial value of $\sigma$, $t$ is the current iteration number, and $\tau$ is a time constant of the SOM, defined by

$$\tau = \frac{1000}{\log \sigma_0}.$$
The competition and cooperation stages are carried out for all the input patterns. Then, the neighborhood radius $\sigma$ and the learning rate $\alpha$ are updated. The learning rate should also decrease with time and can be calculated by

$$\alpha(t) = \alpha_0 \exp\left(-\frac{t}{\tau}\right),$$

where $\alpha_0$ is the initial value of $\alpha$, $t$ is the current iteration number, and $\tau$ is a time constant of the SOM, which can be calculated as presented in the cooperation stage.
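To make the two stages concrete, here is a minimal sketch of one SOM training epoch for a one-dimensional map in Python/NumPy; variable names are illustrative, and the lateral distance is taken as the index distance on the map.

```python
import numpy as np

def som_epoch(X, W, t, alpha0, sigma0, tau):
    """One epoch of SOM training.

    X      : input patterns, shape (n_patterns, n_features)
    W      : weight vectors, shape (n_neurons, n_features)
    t      : current iteration number
    alpha0 : initial learning rate
    sigma0 : initial neighborhood radius
    tau    : time constant of the SOM
    """
    sigma = sigma0 * np.exp(-t / tau)   # shrinking neighborhood radius
    alpha = alpha0 * np.exp(-t / tau)   # decaying learning rate
    for x in np.random.permutation(X):
        # competition: winner is the closest neuron in Euclidean distance
        winner = np.argmin(np.linalg.norm(W - x, axis=1))
        # cooperation: Gaussian neighborhood over the lateral (index) distance
        d = np.abs(np.arange(len(W)) - winner)
        h = np.exp(-d**2 / (2 * sigma**2))
        W += alpha * h[:, None] * (x - W)
    return W
```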
2.3 Learning vector quantization

The learning vector quantization (LVQ) is a supervised learning technique that aims to improve the quality of the classifier decision regions by adjusting the feature map through the use of information about the classes [25]. According to Kohonen [29], the SOM can be used to initialize the feature map by defining the set of weight vectors $w_{ij}$. The next step is to assign labels to the neurons. This assignment can be made by majority vote, in other words, each neuron receives the label of the class for which it is most often activated. After this initial step, the LVQ algorithm can be employed. Although the training process is similar to that of the SOM, it does not use neighborhood relations and updates only the winner neuron. Therefore, it is checked whether the class label of the input vector $x$ is equal to the label of the winner neuron. If the labels are equal, this neuron is moved towards the input vector $x$ by

$$w_{id}(t+1) = w_{id}(t) + \alpha(t)(x_i - w_{id}(t)),$$

where $\alpha$ is the learning rate, $id$ is the index of the winner neuron, and $t$ is the current iteration number. However, if the label of the winner neuron is different from the label of the input pattern $x$, then it is moved away by

$$w_{id}(t+1) = w_{id}(t) - \alpha(t)(x_i - w_{id}(t)).$$
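A minimal sketch of this update rule in Python (LVQ-1 style); all names are illustrative.

```python
import numpy as np

def lvq_epoch(X, y, W, labels, alpha):
    """One epoch of LVQ training.

    X      : input patterns, shape (n_patterns, n_features)
    y      : class label of each pattern
    W      : prototype (neuron) weight vectors
    labels : class label assigned to each prototype
    alpha  : learning rate
    """
    for x, c in zip(X, y):
        winner = np.argmin(np.linalg.norm(W - x, axis=1))
        if labels[winner] == c:
            W[winner] += alpha * (x - W[winner])   # attract: same class
        else:
            W[winner] -= alpha * (x - W[winner])   # repel: different class
    return W
```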
2.4 Radial basis function neural network

A radial basis function neural network (RBF), in its most basic form, has three layers. The first one is the input layer, which has sensory units connecting the network to its environment. The second layer is hidden and composed of a set of neurons that use radial basis functions to group the input patterns in clusters. The third layer is the output one, which is linear and provides the network response to the activation applied to the input layer [25]. The activation functions consist of radial basis functions whose values increase or decrease in relation to the distance from a central point. The most common decreasing radial basis function is the Gaussian, defined by

$$h(x) = \exp\left(-\frac{(x-c)^2}{r^2}\right),$$

where $x$ is the input vector, $c$ is the center point, and $r$ is the width of the function. On the other hand, the increasing radial basis function is generally represented by a multiquadric function, defined by [30]

$$h(x) = \frac{\sqrt{r^2 + (x-c)^2}}{r}.$$

The procedure for training an RBF is performed in two stages. In the first one, the parameters of the basis functions related to the hidden layer are determined through some unsupervised training method, such as K-means. In the second training stage, the weights of the output layer are adjusted, which corresponds to solving a linear problem [27]. According to Bishop [27], considering an input vector $x = [x_1, x_2, \ldots, x_n]$, the network output is calculated by

$$y_k = \sum_{j=1}^{m} w_{kj} h_j,$$

where $w_k = [w_{k1}, w_{k2}, \ldots, w_{km}]$ are the weights and $h = [h_1, h_2, \ldots, h_m]$ are the values of the radial basis activation functions. After calculating the outputs, the weights should be updated. A formal solution to calculate the weights is given by

$$w = h^{\dagger} d,$$

where $h$ is the matrix of basis functions, $h^{\dagger}$ represents the pseudo-inverse of $h$, and $d$ is a vector with the desired responses [27]. Consult [25], [27], [30] for more information.
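To illustrate the two-stage procedure, here is a minimal sketch in Python assuming scikit-learn's K-means for the first stage; the names and the fixed width r are illustrative, not the configuration used in the experiments.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_rbf(X, d, n_hidden, r):
    """Two-stage RBF training: unsupervised centers, linear output weights.

    X        : input patterns, shape (n_patterns, n_features)
    d        : desired outputs, shape (n_patterns,)
    n_hidden : number of hidden (radial basis) neurons
    r        : width of the Gaussian basis functions
    """
    # stage 1: place the centers with K-means
    centers = KMeans(n_clusters=n_hidden, n_init=10).fit(X).cluster_centers_
    # Gaussian activation of every pattern for every center
    dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-dist2 / r**2)
    # stage 2: output weights via the pseudo-inverse of H
    w = np.linalg.pinv(H) @ d
    return centers, w

def rbf_predict(X, centers, w, r):
    dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-dist2 / r**2) @ w
```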
3. Experiments and results

To give credibility to the results and to make the experiment completely reproducible, all the tests were performed with the public and well-known WEBSPAM-UK2006 collection¹. It is composed of a set of 105,896,555 web pages hosted on 114,529 hosts within the .uk domain. It is important to note that this corpus was used in the Web Spam Challenge Tracks I and II², which are the best-known competitions of web spam detection techniques. In our experiments, we have followed the same competition guidelines. In this way, we have used the set of 8,487 feature vectors, provided by the organizers, employed to discriminate the hosts as spam or ham. Each feature vector is composed of 96 features proposed by Castillo et al. [20]: 24 features were extracted from the host home page, another 24 from the page with the highest PageRank, 24 from the mean of these values over all pages, and the remaining 24 from the variance over all web pages. Furthermore, each host is labeled as spam, not spam (ham), or undefined. The vectors related to hosts labeled as undefined were discarded.
3.1 Protocol

We evaluated the following well-known artificial neural network (ANN) algorithms to automatically detect web spam based on its content: multilayer perceptron (MLP) trained with the gradient descent and Levenberg-Marquardt methods, Kohonen's self-organizing map (SOM) with learning vector quantization (LVQ), and radial basis function neural network (RBF). We have implemented all the ANNs with a single hidden layer and with one neuron in the output layer. In addition, we have employed a linear activation function for the neuron of the output layer and a hyperbolic tangent activation function for the neurons of the intermediate layer. We have initialized the weights and biases with random values between +1 and −1 and normalized the data to this interval. The desired output for all networks is −1 (ham) or +1 (spam). So, for RBFs and MLPs, if the output is greater than or equal to 0, the host is considered spam; otherwise, it is considered ham. The same does not apply to SOMs, since the data receive the same label as the neuron that represents them. Regarding the parameters, in all simulations we have employed the following stopping criteria: the number of iterations being greater than a threshold θ, the mean square error (MSE) of the training set being smaller than a threshold γ, or the MSE of the validation set increasing (checked every 10 iterations).

¹ Yahoo! Research: "Web Spam Collections". Available at http://barcelona.research.yahoo.net/webspam/datasets/.
² Web Spam Challenge: http://webspam.lip6.fr/
The parameters used for each ANN model were empirically calibrated and are the following:
• Multilayer perceptron with gradient descent method:
  – θ = 10,000
  – γ = 0.001
  – Learning rate α = 0.005
  – Number of neurons in the hidden layer: 100
• Multilayer perceptron with Levenberg-Marquardt method:
  – θ = 500
  – γ = 0.001
  – Learning rate α = 0.001
  – Number of neurons in the hidden layer: 50
• Kohonen's self-organizing map with learning vector quantization:
  – Competition stage:
    ∗ One-dimensional neighborhood function with initial radius σ = 4
  – Cooperation stage:
    ∗ θ = 2,000
    ∗ Learning rate α = 0.01
    ∗ Number of neurons in the hidden layer: 120
• Radial basis function neural network:
  – Number of neurons in the hidden layer: 10

Note that, for the simulations using the RBFs, we have not employed any stopping criteria because the training method is not iterative, as pointed out in Section 2.4. To assess the algorithms' performance, we divided each simulation into 10 tests and calculated the arithmetic mean and standard deviation of the following well-known measures: accuracy rate (Acc%), spam recall rate (Rcl%), specificity (Spc%), spam precision rate (Pcs%), and F-measure (FM). In each test, we randomly selected 80% of the samples of each class to be presented to the algorithms in the training stage, and the remaining ones were reserved for testing; a sketch of this evaluation loop is shown below.
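The following is a minimal sketch of the evaluation loop in Python/NumPy; train_fn and predict_fn are illustrative placeholders for the model-specific training and prediction routines, not the actual implementations.

```python
import numpy as np

def evaluate(train_fn, predict_fn, X, y, n_tests=10, train_frac=0.8):
    """Repeat random stratified 80/20 splits and average the metrics.

    train_fn   : fits a model on (X_train, y_train) and returns it
    predict_fn : maps (model, X_test) to predicted labels in {-1, +1}
    y          : true labels, -1 for ham and +1 for spam
    """
    metrics = []
    for _ in range(n_tests):
        # stratified split: 80% of each class goes to training
        train_idx = []
        for c in (-1, 1):
            idx = np.flatnonzero(y == c)
            np.random.shuffle(idx)
            train_idx.extend(idx[:int(train_frac * len(idx))])
        test_idx = np.setdiff1d(np.arange(len(y)), train_idx)
        model = train_fn(X[train_idx], y[train_idx])
        pred = predict_fn(model, X[test_idx])
        true = y[test_idx]
        tp = np.sum((pred == 1) & (true == 1))
        tn = np.sum((pred == -1) & (true == -1))
        fp = np.sum((pred == 1) & (true == -1))
        fn = np.sum((pred == -1) & (true == 1))
        acc = (tp + tn) / len(true)
        rcl = tp / (tp + fn)              # spam recall
        spc = tn / (tn + fp)              # specificity
        pcs = tp / (tp + fp)              # spam precision
        fm = 2 * pcs * rcl / (pcs + rcl)  # F-measure
        metrics.append((acc, rcl, spc, pcs, fm))
    m = np.array(metrics)
    return m.mean(axis=0), m.std(axis=0)
```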
3.2 Results

In this section, we report the main results of our evaluation. Table 1 presents the performance achieved by each artificial neural network.

Table 1: Results achieved by each evaluated artificial neural network for the WEBSPAM-UK2006 dataset.
Classifier        Acc (%)     Rcl (%)     Spc (%)     Pcs (%)     FM
MLP + Gradient    86.2±1.2    57.0±4.6    95.0±0.4    77.5±2.7    0.656±0.039
MLP + Levenberg   88.6±1.4    69.3±4.2    94.2±1.2    77.6±4.6    0.731±0.032
RBF               79.7±0.6    26.7±2.8    95.8±0.7    65.8±3.3    0.379±0.030
SOM + LVQ         80.6±0.8    29.2±2.6    96.3±0.6    70.4±4.0    0.412±0.030

Values are the mean ± standard deviation over the 10 tests.
According to the results, it is clear that the MLP trained with the Levenberg-Marquardt method achieved the best performance. On the other hand, the Kohonen self-organizing map and the radial basis function neural network attained the worst performances, with a very low spam recall rate. Furthermore, in all results we noted the contrast between the high specificity rate and the not so high spam recall rate, which indicates that the classifiers are more successful at identifying ham hosts than spam ones.

Based on these first results, we considered the hypothesis that there are redundancies in the feature vectors that could be affecting the classifiers' performance. Hence, we employed principal component analysis (PCA) [25] to analyze and reduce the dimensionality of the feature space. Using PCA, we observed that only 23 dimensions are enough to represent about 99% of the information present in the original 96-dimensional feature vectors. According to the analysis, the most relevant information is in the first 21 dimensions. Therefore, it is possible to conclude that the attributes extracted from the host home page alone have enough information to characterize the host as spam or ham. After reducing the feature vectors from 96 to 23 dimensions, we performed a new simulation; the results are presented in Table 2, and a sketch of the reduction step is shown after the table.

Table 2: Results achieved by each evaluated artificial neural network for the WEBSPAM-UK2006 dataset after a step of dimensionality reduction using principal component analysis.
Classifier        Acc (%)     Rcl (%)     Spc (%)     Pcs (%)      FM
MLP + Gradient    81.0±0.6    38.8±2.4    95.1±0.6    71.1±2.8     0.502±0.024
MLP + Levenberg   85.5±1.5    55.5±3.4    94.8±1.2    77.0±4.0     0.644±0.027
RBF               76.9±0.4    2.5±1.3     99.5±0.3    58.6±19.8    0.047±0.024
SOM + LVQ         77.3±0.2    3.1±0.8     99.8±0.1    88.3±6.5     0.060±0.014

Values are the mean ± standard deviation over the 10 tests.
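The dimensionality-reduction step described above can be reproduced with any standard PCA implementation. A minimal sketch, assuming scikit-learn and a feature matrix X of shape (hosts, 96):

```python
from sklearn.decomposition import PCA

# keep as many principal components as needed to retain ~99% of the
# variance; on the 96-dimensional vectors this selected 23 components
pca = PCA(n_components=0.99)
X_reduced = pca.fit_transform(X)   # X: (n_hosts, 96) feature matrix, assumed given
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```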
Based on these results, we can see that, although PCA indicates that the new feature vectors represent more than 99% of the information of the original data, reducing the dimensionality of the feature space clearly hurt the performance of all evaluated techniques, mainly the Kohonen self-organizing map and the radial basis function neural network, which achieved spam recall rates lower than 5%. Therefore, it is conclusive that reducing the dimensionality of the feature space is not recommended in this scenario.

Analyzing the results achieved in all experiments, we also suspected that the difference between the high specificity rates and the low spam recall rates of all evaluated techniques could be explained by the large difference between the number of ham samples and the number of spam samples, which could be making the classifiers more specialized in identifying ham hosts. To test this hypothesis, we decided to present to the classifiers the same number of samples from each class during the training process. In this way, 1,978 samples of each class were randomly selected for training, as in the undersampling sketch below.
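A minimal sketch of the balancing step, assuming a label vector y with +1 for spam and −1 for ham; the undersampling is random, as described above.

```python
import numpy as np

def balance(X, y, n_per_class=1978, seed=None):
    """Randomly undersample so both classes have the same size."""
    rng = np.random.default_rng(seed)
    keep = []
    for c in (-1, 1):
        idx = np.flatnonzero(y == c)
        keep.extend(rng.choice(idx, size=n_per_class, replace=False))
    keep = np.array(keep)
    return X[keep], y[keep]
```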
With this new setup, we performed a new simulation, again using all the feature vectors. Table 3 presents the classification results.
Table 3: Results achieved by each evaluated artificial neural network for the WEBSPAM-UK2006 dataset using classes of equal size in the training stage.
Classifier        Acc (%)     Rcl (%)     Spc (%)     Pcs (%)     FM
MLP + Gradient    84.7±1.6    82.9±2.0    86.4±2.4    86.1±2.4    0.845±0.015
MLP + Levenberg   87.6±1.5    86.5±2.0    88.8±1.9    89.1±2.4    0.877±0.017
RBF               63.6±1.3    45.6±1.9    81.6±1.5    71.3±2.1    0.556±0.016
SOM + LVQ         66.9±0.7    59.7±5.9    74.2±5.5    70.1±2.9    0.642±0.026

Values are the mean ± standard deviation over the 10 tests.
These last results indicate that training the artificial neural networks with the same number of samples in each class clearly improved the performance of all classifiers. Note that, comparing the new results with the ones presented in Table 1, as expected, the cost of a lower specificity rate is worth paying considering the great improvement in the spam recall rate. This is also confirmed by the improvement in the F-measure of all evaluated techniques. It is important to note that the MLP with the Levenberg-Marquardt method achieved very good results and the best performance when compared with the other evaluated methods.

To show that the evaluated artificial neural networks are really competitive, in Table 4 we present a comparison between the best results achieved by the evaluated artificial neural networks and the top-performing techniques employed to classify web spam available in the literature.

Table 4: Comparison between the results achieved by the evaluated artificial neural networks and the top-performing classifiers available in the literature.

Classifiers                                Pcs     Rcl     FM
Top-performing methods available in the literature
Decision trees [20] - content              –       –       0.683
Decision trees [20] - content and links    –       –       0.763
Support Vector Machine [1] - content       60.0    36.4    –
Bagging [16] - content                     84.4    91.2    –
Boosting [16] - content                    86.2    91.1    –
Best results achieved by each artificial neural network
MLP + Gradient - content                   86.1    82.9    0.845
MLP + Levenberg - content                  89.1    86.5    0.877
RBF - content                              88.3    45.6    0.556
SOM + LVQ - content                        64.2    59.7    0.642
The comparison indicates that the MLP with the Levenberg-Marquardt method is very competitive and suitable to deal with the problem. Note that, although the Bagging algorithm presented by Ntoulas et al. [16] achieved a higher spam recall rate, its spam precision rate is lower and, consequently, its specificity rate is also lower than the ones achieved by the MLP. Furthermore, the results indicate that the MLP neural networks outperformed established methods such as the decision trees proposed by Castillo et al. [20] and the SVM presented by Svore et al. [1].
4. Conclusions and future work

In this paper, we have presented a performance evaluation of different models of artificial neural networks used to automatically classify real samples of web spam based on their content. The results indicate that the multilayer perceptron neural network trained with the Levenberg-Marquardt method is the best evaluated model. Such a method also outperformed established techniques available in the literature, such as decision trees [20] and support vector machines [1]. Furthermore, since the data are unbalanced, the results also indicate that all the evaluated techniques perform better when trained with the same number of samples of each class. This is because the models otherwise tend to converge to the benefit of the class with the largest number of representatives, which increases the rate of false positives or false negatives. We have also observed that the Kohonen self-organizing map and the radial basis function neural network were inferior to the multilayer perceptron neural networks in all simulations. So, we can conclude that such models are not recommended for classifying web spam. Finally, an analysis of the feature vectors originally proposed by Castillo et al. [20] indicates that they have many redundancies, since with principal component analysis it is possible to reduce them from 96 to 23 dimensions that represent about 99% of the information present in the original feature set. However, using the reduced feature vectors clearly hurts the classifiers' accuracy.

For future work, we intend to propose new features to enhance the classifiers' predictions and hence increase their ability to discriminate the hosts as spam or ham. In this way, we aim to use the relation of links present in the web pages.
References

[1] K. M. Svore, Q. Wu, and C. J. Burges, "Improving web spam classification using rank-time features," in Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb'07), Banff, Alberta, Canada, 2007, pp. 9–16.
[2] Z. Gyongyi and H. Garcia-Molina, "Spam: It's not just for inboxes anymore," Computer, vol. 38, no. 10, pp. 28–34, 2005.
[3] G. Shen, B. Gao, T. Liu, G. Feng, S. Song, and H. Li, "Detecting link spam using temporal information," in Proceedings of the 6th IEEE International Conference on Data Mining (ICDM'06), Hong Kong, China, 2006, pp. 1049–1053.
[4] M. Egele, C. Kolbitsch, and C. Platzer, "Removing web spam links from search engine results," Journal in Computer Virology, vol. 7, pp. 51–62, 2011.
[5] N. Eiron, K. S. McCurley, and J. A. Tomlin, "Ranking the web frontier," in Proceedings of the 13th International Conference on World Wide Web (WWW'04), New York, NY, USA, 2004, pp. 309–318.
[6] T. Almeida, A. Yamakami, and J. Almeida, "Evaluation of Approaches for Dimensionality Reduction Applied with Naive Bayes Anti-Spam Filters," in Proceedings of the 8th IEEE International Conference on Machine Learning and Applications, Miami, FL, USA, 2009, pp. 517–522.
[7] ——, "Filtering Spams using the Minimum Description Length Principle," in Proceedings of the 25th ACM Symposium On Applied Computing, Sierre, Switzerland, 2010, pp. 1856–1860.
[8] ——, "Probabilistic Anti-Spam Filtering with Dimensionality Reduction," in Proceedings of the 25th ACM Symposium On Applied Computing, Sierre, Switzerland, 2010, pp. 1804–1808.
[9] T. Almeida and A. Yamakami, "Content-Based Spam Filtering," in Proceedings of the 23rd IEEE International Joint Conference on Neural Networks, Barcelona, Spain, 2010, pp. 1–7.
[10] T. Almeida, J. Almeida, and A. Yamakami, "Spam Filtering: How the Dimensionality Reduction Affects the Accuracy of Naive Bayes Classifiers," Journal of Internet Services and Applications, vol. 1, no. 3, pp. 183–200, 2011.
[11] T. Almeida and A. Yamakami, "Redução de Dimensionalidade Aplicada na Classificação de Spams Usando Filtros Bayesianos," Revista Brasileira de Computação Aplicada, vol. 3, no. 1, pp. 16–29, 2011.
[12] T. Almeida, J. G. Hidalgo, and A. Yamakami, "Contributions to the Study of SMS Spam Filtering: New Collection and Results," in Proceedings of the 2011 ACM Symposium on Document Engineering, Mountain View, CA, USA, 2011, pp. 259–262.
[13] T. A. Almeida and A. Yamakami, "Facing the Spammers: A Very Effective Approach to Avoid Junk E-mails," Expert Systems with Applications, pp. 1–5, 2012.
[14] ——, "Advances in Spam Filtering Techniques," in Computational Intelligence for Privacy and Security, ser. Studies in Computational Intelligence, D. Elizondo, A. Solanas, and A. Martinez-Balleste, Eds. Springer, 2012, vol. 394, pp. 199–214.
[15] Q. Gan and T. Suel, "Improving web spam classifiers using link structure," in Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb'07), Banff, Alberta, Canada, 2007, pp. 17–20.
[16] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly, "Detecting spam web pages through content analysis," in Proceedings of the World Wide Web Conference (WWW'06), Edinburgh, Scotland, 2006, pp. 83–92.
[17] T. Urvoy, E. Chauveau, and P. Filoche, "Tracking web spam with html style similarities," ACM Transactions on the Web, vol. 2, no. 1, pp. 1–3, February 2008.
[18] I. Bíró, D. Siklósi, J. Szabó, and A. A. Benczúr, "Linked latent dirichlet allocation in web spam filtering," in Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb'09), Madrid, Spain, 2009, pp. 37–40.
[19] J. Abernethy, O. Chapelle, and C. Castillo, "Graph regularization methods for web spam detection," Machine Learning, vol. 81, no. 2, pp. 207–225, 2010.
[20] C. Castillo, D. Donato, and A. Gionis, "Know your neighbors: Web spam detection using the web topology," in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'07), Amsterdam, The Netherlands, 2007, pp. 423–430.
[21] M. Erdélyi, A. Garzó, and A. A. Benczúr, "Web spam classification: a few features worth more," in Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality (WebQuality'11), Hyderabad, India, 2011, pp. 27–34.
[22] G. Geng, C. Wang, Q. Li, L. Xu, and X. Jin, "Boosting the performance of web spam detection with ensemble under-sampling classification," in Proceedings of the 14th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD'07), Haikou, China, 2007, pp. 583–587.
[23] T. Largillier and S. Peyronnet, "Lightweight clustering methods for webspam demotion," in Proceedings of the 9th IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT'10), Toronto, Canada, 2010, pp. 98–104.
[24] Q. Ren, "Feature-fusion framework for spam filtering based on svm," in Proceedings of the 7th Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS'10), Redmond, Washington, USA, 2010, pp. 1–6.
[25] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. New York, NY, USA: Prentice Hall, 1998.
[26] H. Liu, "On the levenberg-marquardt training method for feed-forward neural networks," in Proceedings of the 6th International Conference on Natural Computation (ICNC'10), Yantai, China, 2010, pp. 456–460.
[27] C. M. Bishop, Neural Networks for Pattern Recognition, 1st ed. Oxford: Oxford Press, 1995.
[28] M. T. Hagan and M. B. Menhaj, "Training feedforward networks with the marquardt algorithm," IEEE Transactions on Neural Networks, vol. 5, no. 6, pp. 989–993, 1994.
[29] T. Kohonen, "The self-organizing map," Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.
[30] M. J. L. Orr, "Introduction to radial basis function networks," Technical Report, Centre for Cognitive Science, University of Edinburgh, 1996.