Additive Neural Networks for Predictive Data Mining,. PhD thesis, North-West ... viding Base SAS R and SAS R Enterprise MinerTM software used in computing ...
Spam Detection with Generalised Additive Neural Networks Johannes C. Goosen*, Jan V. Du Toit School of Computer, Statistical and Mathematical Sciences North-West University, Potchefstroom Campus, Private Bag X6001, Potchefstroom 2520 Tel: +27 18 2992548, Fax: +27 18 2992570 email: {20040946, Tiny.DuToit}@nwu.ac.za *Primary recipient of correspondence
A BSTRACT With the rapid growth of internet users and e-commerce in the past few years, spam has become an even bigger problem. Consequently, for Internet Service Providers, it has become more important to identify and filter spam efficiently in order to keep their client base satisfied. Generalized Additive Neural Networks (GANNs) have proven to be a good method for pattern recognition and can be used for detecting spam messages. In this paper, GANNs are discussed and applied to a spam dataset to determine their predictive accuracy. Compared to other techniques found in the literature, GANNs perform favourably and may be used to increase the accuracy by which spam are detected.
is expressed as the sum of individual unspecified univariate functions. Each univariate function can be interpreted as the effect of the corresponding input while keeping the other inputs constant. GAMs can be interpreted graphically with partial residual plots (Larsen and McCleary, 1972); (Ezekiel, 1924); (Berk and Booth, 1995) and allow more flexibility than linear models. In the basic structure of a GANN, each input has a separate Multilayer Perceptron (MLP) architecture with one hidden layer. The number of units in this hidden layer may differ for each input. Each input may also have a direct connection, called a skip layer. A GANN with a single hidden layer of h units for each input is defined as f j (x j ) = w0 j x j + w1 j tanh(w01 j + w11 j x j ) + ... + wh j tanh(w0h j + w1h j x j ), where the skip layer is represented by w0 j x j . The tanh activation function is used. An example of a GANN with three inputs is shown in Figure 1.
Keywords: Generalized Additive Model, GAM, Generalized Additive Neural Network, GANN, Neural Network, Spam, Spam detection.
b
Input Layer
Hidden Layer
Skip Layer
1
I NTRODUCTION
Spam is defined as a message that is unsolicited, has commercial content, is in the form of an email, sms or mms, and is transmitted in bulk (Leng, 2006). Studies have shown that a third of all email sent are spam. In 2004 alone, an estimated two trillion spam messages were sent (Grimes, Hough and Signorella, 2007). The huge quantity of spam messages brings forth a couple of problems that will increase as the number of spam e-mails grows. Among the problems are large amounts of money spent on lost productivity and the costs of downloading the messages. Spam is widely regarded as unwanted and it is clearly a problem that needs more attention. Internet Service Providers (ISPs) can benefit if they are able to distinguish the desired e-mails from the spam in a more accurate manner. Although there are many types of spam filtering systems, spammers still find ways to bypass them. In this paper a Generalized Additive Neural Network (GANN) is used to find patterns in spam messages that will help to detect them. In the next section GANNs are described in more detail and the AutoGANN system which automates the construction of GANNs are discussed. Section 3 will focus on preliminary experimentation and results using the AutoGANN system and the Spambase data set. Finally, section 4 will discuss future work. 2
G ENERALIZED A DDITIVE N EURAL N ETWORKS
A GANN (Potts, 1999) is the neural network implementation of a Generalized Additive Model (GAM) (Hastie and Tibshirani, 1986). A GAM is defined as g−1 0 (E(y)) = B0 + f 1 (x1) + f2 (x2) + ... + fk (xk ) where the expected target (on the link scale)
x1
b
x2
b
x3
b
b b b b b
b Output Layer
b
b
y
b
Figure 1: Example GANN Model
When GANNs are constructed interactively, human judgement is required to interpret the partial residual plots. This can be time consuming for a large number of input variables. Also, human judgement may be subjective which leads to suboptimal models being created. Du Toit (2006) created a system called AutoGANN which successfully automated the construction of GANNs. Objective model selection criteria or cross validation are utilized to search for the best GANN model in a GANN search space. The partial residual plots are used to provide insight into the models built and not as a primary method for model selection. With the AutoGANN system, no human interaction is needed while building the GANN models. In the next section, preliminary experimentation using the AutoGANN system is discussed.
3
P RELIMINARY E XPERIMENTATION AND R ESULTS
A preliminary investigation was performed to determine the accuracy of a GANN. For this purpose the Spambase (Hopkins, Reeber, Forman and Suermondt, 1999) data set was used. The data set consists of 4601 instances where 39.4% is spam. There are 57 attributes and a target that classifies the instances as spam or non-spam. Kiran and Atmosukarto (n.d.) tested 8 different algorithms on the data set for spam detection and their results are shown in Table 1. Algorithm Ensemble Decision Tree Adaboost Stacking SVM Bagging Decision Tree Neural Network (MLP) Naive Bayes
Accuracy (%) 96.40 95.00 93.80 93.40 92.80 92.58 90.80 89.57
Table 1: Results from 8 algorithms on the spambase dataset
The Ensemble Decision Tree obtained the best results with an accuracy of 96.4% and the Multilayer Perceptron achieved an accuracy of 90.8%. The Spambase data set was partitioned into a training (67%) and test (33%) data set. AutoGANN performed cross validation and variable selection simultaneously and the best model found had an accuracy of 95.8%. The system chose 41 variables where 26 had a linear relationship with the target and 15 had a no-linear relationship with the target. In the next, final section, the conclusions and future work will be discussed. 4
C ONCLUSIONS
AND
F UTURE W ORK
The preceding results suggest that a GANN may be used for detecting spam accurately. Kiran and Atmosukarto (n.d.) proposed that the Spambase data would not be sufficient for detecting spam and that a better dataset should be developed. More experiments need to be done on the Spambase data set and other spam data sets to determine the robustness and the accuracy of GANN models in dealing with real world data. R EFERENCES Berk, K. N. and Booth, D. E. (1995), ‘Seeing a curve in multiple regression’, Technometrics 37(4), 385–398. Du Toit, J. (2006), Automated Construction of Generalized Additive Neural Networks for Predictive Data Mining, PhD thesis, North-West University, Potchefstroom Campus, South Africa. Ezekiel, M. (1924), ‘A method for handling curvilinear correlation for any number of variables’, Journal of the American Statistical Association 19(148), 431–453. Grimes, G. A., Hough, M. and Signorella, M. (2007), ‘Email end users and spam: relations of gender and age group to attitudes and actions’, Computers in Human Behavior 23(1), 318–332. Hastie, T. and Tibshirani, R. (1986), ‘Generalized additive models’, Statistical Science 1(3), 297–318.
Hopkins, M., Reeber, E., Forman, G. and Suermondt, J. (1999), ‘Spambase’, http://archive.ics.uci.edu/ml/datasets/Spambase, Date of access: 1 June 2009. Kiran, R. and Atmosukarto, I. (n.d.), ‘Spam or not spam that is the question’, http://www.cs.washington.edu/homes/indria/research/ spamfilter ravi indri.pdf, , Date of access: 1 June 2009. Larsen, W. A. and McCleary, S. J. (1972), ‘The use of partial residual plots in regression analysis’, Technometrics 14(3), 781–790. Leng, T. (2006), ‘Singapore’s multi-pronged strategy against spam’, Computer Law & Security Report 22(5), 402–408. Potts, W. J. E. (1999), Generalized additive neural networks, in ‘Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining’, pp. 194–200. B IOGRAPHY Mr. J.C. Goosen recently obtained his Honours degree in Computer Science and is currently busy with his M.Sc degree in the field of Computer Science at the Potchefstroom Campus of the North-West University. He is a Telkom bursary holder and is part of the Telkom Centre of Excellence program. ACKNOWLEDGEMENTS Gratitude is expressed toward SAS Institute Inc. for proR and SAS R Enterprise MinerTM software viding Base SAS used in computing the results presented in this article. This work forms part of the research done at the North-West University within the TELKOM CoE research programme, funded by TELKOM, GRINTEK TELECOM and THRIP.