Comput. Sci. Appl. Volume 1, Number 2, 2014, pp. 132-137 Received: July 8, 2014; Published: August 25, 2014
Computer Science and Applications www.ethanpublishing.com
An Expert Spam Detection System Based on Extreme Learning Machine
Yılmaz Kaya1, Ömer Faruk Ertuğrul2 and Ramazan Tekin3
1. Department of Computer Engineering, Siirt University, Siirt 56100, Turkey
2. Department of Electrical and Electronics Engineering, Batman University, Batman 72060, Turkey
3. Department of Computer Engineering, Batman University, Batman 72060, Turkey
Corresponding author: Yılmaz Kaya (
[email protected])
Abstract: Electronic communication has become one of the most important communication tools with the spread of internet technologies, and this growth has heightened the need for a solution to the circulation of unsolicited bulk messages on the internet, referred to as spam. In this study, ELM (extreme learning machine), a training method for single hidden layer feed-forward artificial neural networks, was employed as a spam filter. The experimental results showed that the proposed method achieved higher performance in terms of detection speed and classification accuracy (91.655%) than other machine learning methods such as SVM (support vector machines), NB (naive Bayes), and MLP (multilayer perceptron).
Key words: Extreme learning machine, spam detection, machine learning, text categorization.
1. Introduction
Advancing internet technologies have increased the importance of e-mail communication. E-mail is used by millions of people and is still spreading; it has become central to activities such as trade, but also to spam and virus attacks. Guzella and Caminhas reported that spam, which makes up a major part of e-mail traffic and has become part of daily life, is a serious problem for both users and internet traffic [1]. It wastes the time of individual users, who must decide whether incoming e-mails are spam or legitimate, and it causes a serious bandwidth problem for communication companies. Technical and legal measures are used against this problem; nevertheless, spam continues to be sent to promote money-making schemes, weight loss, business opportunities, and friend finding. Informatics specialists have employed various techniques to detect or filter spam: some methods check the sender address of an incoming e-mail against a blacklist, while others scan the e-mail content for specific keywords. Although keyword-based methods are reported to perform satisfactorily, Wu showed that they are not sufficient [2]. As another solution, machine learning methods, which have been successful at filtering spam, have come into widespread use in recent years; examples include decision trees [3], rough sets [4], SVM (support vector machine) [5, 6], artificial immune systems [7, 8], artificial neural networks [2, 9, 10], and NB (naive Bayes) [11]. The main reason for the high accuracy of machine learning based methods is that they need no prior hypothesis: they learn from the training dataset and classify texts according to the content of the e-mails [12].
The aim of this study was to evaluate and validate the applicability of ELM (extreme learning machine) to spam detection, since the literature survey shows that the accuracy of spam detection depends on the machine learning method employed, and Huang et al. reported that ELM has high generalization capacity with a fast training stage [13]. In ELM, the input-layer weights and hidden-layer biases are determined randomly, and the output weights are calculated analytically. The dataset, taken from the UCI machine-learning repository and comprising 4,601 e-mails, is used for validation [14, 15]. A comparison of ELM with other machine learning methods, i.e., the ANN (artificial neural network), NB, SVM, and decision tree methods, showed that ELM achieved better results in terms of speed and classification performance.
The rest of the paper is organized as follows: the material and the extreme learning machine method are explained in Section 2, together with a brief description of the procedure of the proposed method; results and discussion are provided in Section 3; and Section 4 concludes the paper.
Table 1 The distribution of the words and characters in the data set.
1-make        15-addresses   29-lab          43-original
2-address     16-free        30-labs         44-project
3-all         17-business    31-telnet       45-re
4-3d          18-email       32-857          46-edu
5-our         19-you         33-data         47-table
6-over        20-credit      34-415          48-conference
7-remove      21-your        35-85           49-;
8-internet    22-font        36-technology   50-(
9-order       23-0           37-1999         51-[
10-mail       24-money       38-parts        52-!
11-receive    25-hp          39-pm           53-$
12-will       26-hpl         40-direct       54-#
13-people     27-george      41-cs
14-report     28-650         42-meeting
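As an illustration of how frequency features such as those in Table 1 could be computed from raw text, the following Python sketch counts a few of the listed words and characters in a message. The selection of words, the function name `frequency_features`, the tokenization rule, and the choice to express frequencies as percentages are assumptions made for this example; the paper does not specify them.

```python
import re

# Hypothetical subset of the words and characters listed in Table 1.
WORDS = ["free", "money", "remove", "your"]
CHARS = ["!", "$"]


def frequency_features(text):
    """Return word and character frequencies of `text` as percentages."""
    # Assumption: a word is any run of letters or digits, compared case-insensitively.
    tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]
    n_tokens = max(len(tokens), 1)  # guard against empty messages
    n_chars = max(len(text), 1)
    features = {}
    for word in WORDS:
        features["word_" + word] = 100.0 * tokens.count(word) / n_tokens
    for char in CHARS:
        features["char_" + char] = 100.0 * text.count(char) / n_chars
    return features


example = frequency_features("Free money! Claim your free money now!")
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

Each message is mapped to a fixed-length numeric vector in this way, which is the form a classifier such as ELM expects as input.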
2. Material and Method
2.1 Dataset
The dataset consists of 4,601 e-mails, of which 1,813 are spam and the rest are not; it was collected at Hewlett-Packard Labs and taken from the UCI machine-learning repository [14, 15]. Fifty-seven features were extracted from the e-mails. The first 48 features give the frequencies of particular words in the e-mails. The six features from 49 to 54 give the frequencies of the characters ‘;’, ‘(’, ‘[’, ‘!’, ‘$’ and ‘#’. These words and characters are listed in Table 1. The features from 55 to 57 give the total number of capital letters, the average length of uninterrupted runs of capital letters, and the length of the longest such run. A spam example is presented in Fig. 1.
Fig. 1 A spam example.
2.2 Extreme Learning Machine
In this section, ELM, a machine learning method developed by Huang et al. [13], is described in detail. ELM is a training method for single hidden layer feed-forward artificial neural networks (SLFNs) in which the input weights are assigned randomly and the output weights are obtained analytically. ELM uses activation functions such as sigmoid, sine, Gaussian and hard-limiting in the hidden layer, and linear functions in the output layer. Additionally, ELM can use non-differentiable or discontinuous activation functions in the hidden layer [16]. In conventional feed-forward neural networks, all parameters are typically tuned by gradient-based learning algorithms. In pursuit of better learning performance, however, such training takes so long that the learning speed is extremely slow, and it can easily become trapped in local optima. Although adding a momentum
parameter to the weight adjustment can lower the risk of the network being trapped in local optima, the time spent on the training process is not decreased [13, 16]. The research of Huang et al. indicates that the input and output weights and the bias values of a single hidden layer feed-forward neural network (SLFN) have an impact on the performance of the network [13]. ELM selects the input weights and hidden-layer biases randomly and determines the output weights analytically. Therefore, ELM learns much faster than conventional learning algorithms and performs well in some real-world applications [13]. The SLFN structure is illustrated in Fig. 2. According to Fig. 2, with X = (X1, X2, X3, …, XN) as the input and Y = (Y1, Y2, Y3, …, YP) as the output, the mathematical model with M hidden neurons can be defined as [17]:
$$\sum_{i=1}^{M} \beta_i \, g(W_i \cdot X_k + b_i) = O_k, \quad k = 1, 2, \ldots, N \tag{1}$$
where Wi = (Wi1, Wi2, …, Win) and βi = (βi1, βi2, …, βim) are the input and output weights, bi is the bias of the i-th hidden neuron, Ok is the output of the network, and g(.) denotes the activation function [18]. For a network trained on N samples, the aim is to approximate the targets with zero error, $\sum_{k=1}^{N} \lVert O_k - Y_k \rVert = 0$, or with minimum error, $\min \sum_{k=1}^{N} \lVert O_k - Y_k \rVert$. Therefore, Eq. (1) can be written as below [13]:
$$\sum_{i=1}^{M} \beta_i \, g(W_i \cdot X_k + b_i) = Y_k, \quad k = 1, 2, \ldots, N \tag{2}$$
Because g(Wi · Xk + bi) in the equation above gives the entries of the hidden-layer output matrix, Eq. (2) can be written compactly as [13]
$$H\beta = Y \tag{3}$$
where
$$H(W_1, \ldots, W_M;\ b_1, \ldots, b_M;\ X_1, \ldots, X_N) = \begin{bmatrix} g(W_1 \cdot X_1 + b_1) & \cdots & g(W_M \cdot X_1 + b_M) \\ \vdots & \ddots & \vdots \\ g(W_1 \cdot X_N + b_1) & \cdots & g(W_M \cdot X_N + b_M) \end{bmatrix}_{N \times M} \tag{4}$$
$$\beta = \begin{bmatrix} \beta_1^{T} \\ \vdots \\ \beta_M^{T} \end{bmatrix}_{M \times m}, \quad Y = \begin{bmatrix} Y_1^{T} \\ \vdots \\ Y_N^{T} \end{bmatrix}_{N \times m} \tag{5}$$
H denotes the hidden layer output matrix [17, 19]. In ELM, training the SLFN is the same as seeking the least-squares solution $\hat{\beta}$ of the linear system Hβ = Y, as in linear regression [16]:
$$\lVert H(W_1, \ldots, W_M;\ b_1, \ldots, b_M)\,\hat{\beta} - Y \rVert = \min_{\beta} \lVert H(W_1, \ldots, W_M;\ b_1, \ldots, b_M)\,\beta - Y \rVert \tag{6}$$
where $\hat{\beta} = H^{\dagger} Y$ is the smallest-norm least-squares solution of Hβ = Y, and $H^{\dagger}$ denotes the Moore-Penrose generalized inverse of the hidden layer output matrix H. The norm of $\hat{\beta}$ is the smallest among all the least-squares solutions of Hβ = Y [13]. ELM can therefore achieve the smallest training error, and it can also achieve better classification performance than the conventional gradient-based learning algorithms [16]. The ELM algorithm can be summarized in three phases as below [13, 16].
(1) Phase 1: the input weights Wi = (Wi1, Wi2, …, Win) and the hidden-layer biases bi are generated randomly;
(2) Phase 2: the hidden layer output matrix H is calculated;
(3) Phase 3: the output weights are calculated according to $\hat{\beta} = H^{\dagger} Y$, where Y is the target matrix.
2.3 Spam Detection Method
The procedure of the proposed spam detection method is shown in Fig. 3.
Fig. 2 The structure of a single hidden layer feed-forward artificial neural network.
Fig. 3 Diagram of the classification process.
The proposed method is organized into four blocks, as seen in Fig. 3. The dataset is collected in block 1. The features are extracted from the dataset in block 2. The features are classified by ELM in block 3. The last block presents the results of the classification.
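The classification block relies on the three ELM phases summarized in Section 2.2. The following NumPy sketch shows one way those phases could be written; it is not the authors' implementation. The function names, the uniform range for the random weights, the 0.5 decision threshold, and the synthetic two-class data (used in place of the UCI e-mails so that the example is self-contained) are all assumptions.

```python
import numpy as np


def sigmoid(z):
    """Sigmoid activation g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def elm_train(X, Y, n_hidden, rng):
    """Fit an ELM. X has shape (N, n), Y has shape (N, m)."""
    # Phase 1: input weights W and hidden biases b are drawn at random.
    # Assumption: uniform on [-1, 1]; the paper does not state the distribution.
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    # Phase 2: hidden-layer output matrix H, shape (N, n_hidden).
    H = sigmoid(X @ W + b)
    # Phase 3: output weights via the Moore-Penrose generalized inverse of H.
    beta = np.linalg.pinv(H) @ Y
    return W, b, beta


def elm_predict(X, W, b, beta):
    """Network output for each row of X."""
    return sigmoid(X @ W + b) @ beta


def make_toy_data(n_per_class, n_features, rng):
    """Two Gaussian blobs standing in for non-spam (0) and spam (1)."""
    X0 = rng.normal(-2.0, 1.0, size=(n_per_class, n_features))
    X1 = rng.normal(2.0, 1.0, size=(n_per_class, n_features))
    X = np.vstack([X0, X1])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    order = rng.permutation(len(y))
    return X[order], y[order]


rng = np.random.default_rng(0)
X, y = make_toy_data(200, 5, rng)
half = len(y) // 2  # 50-50% training-test partition
W, b, beta = elm_train(X[:half], y[:half, None], n_hidden=20, rng=rng)
predicted = (elm_predict(X[half:], W, b, beta)[:, 0] > 0.5).astype(float)
test_accuracy = float(np.mean(predicted == y[half:]))
print(f"toy test accuracy: {test_accuracy:.3f}")
```

No iterative weight updates occur anywhere in `elm_train`; the only fitted quantity is `beta`, obtained in a single linear-algebra step, which is why training is fast.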
3. Results and Discussion
Classification accuracies and process durations obtained with ELM, using 100 hidden-layer neurons with the sigmoid activation function, for various training-test ratios are shown in Tables 2 and 3, respectively. Table 2 shows that high classification rates were obtained with ELM for all training-test partitions. Among the six partitions, the highest mean classification accuracy, 91.87%, is obtained at the 60-40% training-test partition, and the lowest, 91.21%, at the 30-70% partition. Additionally, Table 3 shows that the classification process was completed in a very short time. Training and test accuracy rates as a function of the number of hidden-layer neurons for the 50-50% partition are presented in Figs. 4 and 5, and the corresponding training and classification durations are shown in Figs. 6 and 7. Classification accuracies obtained with respect to the number of hidden-layer neurons for the 50-50% training-test ratio are presented in Figs. 8 and 9.

Table 2 Classification accuracies for different train/test data ratios.
Train/test datasets   Training dataset (%)   Test dataset (%)   Mean accuracy (%)
30-70%                92.32                  90.10              91.210
40-60%                92.01                  90.50              91.255
50-50%                93.05                  90.26              91.655
60-40%                92.54                  91.20              91.870
70-30%                92.11                  91.23              91.670
80-20%                91.99                  91.41              91.700

Table 3 Duration for different train/test data ratios (s).
Train/test datasets   Training time   Test time
30-70%                0.0624          0.0156
40-60%                0.0780          0.0312
50-50%                0.0936          0.0156
60-40%                0.1092          0.0156
70-30%                0.1092          0.0156
80-20%                0.156           0.0001

Fig. 4 Training set accuracy rates (x-axis: Hidden Neuron; y-axis: Training Efficiency).
Fig. 5 Test set accuracy rates (x-axis: Hidden Neuron; y-axis: Testing Efficiency).
Fig. 6 Training duration of the training data set (x-axis: Hidden Neuron; y-axis: Training Time (second)).
Fig. 7 Classification durations of test data set (x-axis: Hidden Neuron; y-axis: Testing Time (second)).
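The "Mean accuracy" column of Table 2 appears to be the arithmetic mean of the training and test accuracies; the text does not define it, so this reading is an assumption. The short Python check below recomputes the column from the values transcribed from Table 2 and identifies the partitions with the highest and lowest mean accuracy.

```python
# (training accuracy, test accuracy, reported mean accuracy) in percent, from Table 2.
TABLE_2 = {
    "30-70%": (92.32, 90.10, 91.210),
    "40-60%": (92.01, 90.50, 91.255),
    "50-50%": (93.05, 90.26, 91.655),
    "60-40%": (92.54, 91.20, 91.870),
    "70-30%": (92.11, 91.23, 91.670),
    "80-20%": (91.99, 91.41, 91.700),
}

# Assumed definition: mean accuracy = (training accuracy + test accuracy) / 2.
recomputed = {split: (train + test) / 2 for split, (train, test, _) in TABLE_2.items()}
for split, (_, _, reported) in TABLE_2.items():
    print(f"{split}: reported {reported:.3f}, recomputed {recomputed[split]:.3f}")

best_split = max(recomputed, key=recomputed.get)
worst_split = min(recomputed, key=recomputed.get)
print("highest mean accuracy:", best_split)
print("lowest mean accuracy:", worst_split)
```

Under this reading the recomputed values match the reported column, and the 60-40% and 30-70% partitions come out as the best and worst cases, in agreement with the discussion of Table 2.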
Table 4 Classification accuracy rates in accordance with various activation functions.
Activation function   Training accuracy (%)   Test accuracy (%)   Training duration (s)   Test duration (s)
Sigmoid               91.69                   90.38               0.0905                  0.025
Sine                  69.78                   64.76               0.6978                  0.0140
Hardlim               88.63                   87.69               0.1076                  0.0296
Triangular Basis      68.61                   65.26               0.0998                  0.0125
Radial Basis          71.68                   68.8                0.1186                  0.0265

Table 5 Classification accuracy rates acquired by various machine learning methods.
Model   Accuracy (%)   Classification time (s)
MLP     89.96          190.14
SVM     80.65          6.07
NB      78.26          0.28
J48     90.82          2.46
PART    91.46          4.95
ELM     91.66          0.05

Fig. 8 Training accuracy variation with respect to hidden neurons.
Fig. 9 Testing accuracy variation with respect to hidden neurons.
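The activation functions compared in Table 4 can be written as plain Python functions. Only the sigmoid is given explicitly in the paper; the definitions of the hard-limit, triangular basis and radial basis functions below follow the conventions commonly used in MATLAB-style ELM code and are an assumption here.

```python
import math


def sigmoid(x):
    """Sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def sine(x):
    """Sine activation."""
    return math.sin(x)


def hardlim(x):
    # Assumed hard-limit: 1 for x >= 0, otherwise 0 (discontinuous at 0).
    return 1.0 if x >= 0 else 0.0


def tribas(x):
    # Assumed triangular basis: 1 - |x| on [-1, 1], otherwise 0.
    return max(1.0 - abs(x), 0.0)


def radbas(x):
    # Assumed radial basis: exp(-x^2).
    return math.exp(-x * x)


ACTIVATIONS = {
    "Sigmoid": sigmoid,
    "Sine": sine,
    "Hardlim": hardlim,
    "Triangular Basis": tribas,
    "Radial Basis": radbas,
}

for name, g in ACTIVATIONS.items():
    print(name, [round(g(x), 4) for x in (-2.0, 0.0, 0.5)])
```

The hard-limit function is not differentiable at zero, so it could not be used with gradient-based training, whereas ELM can use it because the hidden layer is never tuned.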
As can be seen in Figs. 8 and 9, the accuracy rates increase with the number of neurons in the hidden layer. The highest accuracy rate is obtained when the number of neurons is between 90 and 100. The accuracy results and process durations obtained with various hidden-layer activation functions for the 50-50% training-test partition are listed in Table 4. As Table 4 shows, the best classification results were obtained with the sigmoid activation function, $g(x) = 1/(1 + e^{-x})$. Additionally, both the classification accuracy and the speed of ELM were compared with those of other methods, namely MLP (multilayer perceptron) [20-22], SVM [6, 20], NB [22, 23], decision trees (J48) and decision rules (PART) [6, 20, 21], for the 50-50% training-test partition. The classification accuracies obtained with the various machine learning methods are presented in Table 5. As seen in Table 5, the highest performance in terms of both classification accuracy and classification time was obtained with ELM. These results agree with the findings in the literature, as Huang et al. demonstrated that ELM has high generalization capacity with an extremely fast training stage [13]. According to previous studies, with an 80-20% training-test ratio, the classification accuracy is 90.88% with an artificial neural network [22], 75.22% with a Bayesian method [22], and 91.23% with the rough set method [4]. Among the methods compared in this study, ELM gave the best spam detection results.
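To put the comparison in Table 5 into numbers, the Python snippet below computes, for each competing model, the accuracy gap and the ratio of classification times relative to ELM. The values are transcribed from Table 5; the ratios are indicative only, since the hardware and implementations behind the timings are not described.

```python
# (accuracy in %, classification time in s) transcribed from Table 5.
TABLE_5 = {
    "MLP": (89.96, 190.14),
    "SVM": (80.65, 6.07),
    "NB": (78.26, 0.28),
    "J48": (90.82, 2.46),
    "PART": (91.46, 4.95),
    "ELM": (91.66, 0.05),
}

elm_accuracy, elm_time = TABLE_5["ELM"]
for model, (accuracy, seconds) in TABLE_5.items():
    if model == "ELM":
        continue
    gap = elm_accuracy - accuracy  # percentage points in favour of ELM
    ratio = seconds / elm_time     # how many times longer than ELM
    print(f"{model}: accuracy gap {gap:.2f} points, time ratio {ratio:.1f}x")

most_accurate = max(TABLE_5, key=lambda m: TABLE_5[m][0])
fastest = min(TABLE_5, key=lambda m: TABLE_5[m][1])
print("most accurate:", most_accurate, "| fastest:", fastest)
```

The accuracy margin over PART and J48 is small, under one percentage point, while the timing differences are much larger.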
4. Conclusions
Electronic communication is one of the most used services of the internet. Yet its growth and spread have brought problems with them. One of the most important is the circulation of spam, that is, unsolicited bulk messages. In this research, spam was detected automatically by using the ELM method. The results obtained with ELM were better than those of earlier studies on the same dataset. Spam filtering is realized largely through content investigation, so a method that filters quickly and accurately is needed. In conclusion, ELM shows better performance in terms of both classification accuracy and speed compared to the other methods and the results in the literature.
References
[1] T.S. Guzella, W.M. Caminhas, A review of machine learning approaches to spam filtering, Expert Systems with Applications 36 (2009) 10206-10222.
[2] C.-H. Wu, Behavior-based spam detection using a hybrid method of rule-based techniques and neural networks, Expert Systems with Applications 36 (2009) 4321-4330.
[3] E. Crawford, J. Kay, E. McCreath, Automatic induction of rules for e-mail classification, in: Proceedings of the 6th Australasian Document Computing Symposium, Coffs Harbour, Australia, 2001, pp. 13-20.
[4] Y. Kaya, A. Yeşilova, R. Tekin, Filtering unwanted electronic messages (spam) by using rough set approach, in: Electronic-Electric and Computer Conference, Elazig, Turkey, October 5-7, 2011.
[5] H.B. Wang, Y. Yu, Z. Liu, SVM classifier incorporating feature selection using GA for spam detection, in: Proceedings of the 2005 International Conference on Embedded and Ubiquitous Computing, 2005, pp. 1147-1154.
[6] H. Drucker, S. Wu, V.N. Vapnik, Support vector machines for spam categorization, IEEE Transactions on Neural Networks 10 (5) (1999) 1048-1054.
[7] F. Wang, Z. You, L. Man, Immune-based peer-to-peer model for antispam, in: Proceeding of the International Conference on Intelligent Computing, 2006, pp. 660-671.
[8] X. Yue, A. Abraham, Z.X. Chi, Y.Y. Hao, H. Mo, Artificial immune system inspired behavior-based anti-spam filter, Soft Computing 11 (2007) 729-740.
[9] C.-H. Wu, C.-H. Tsai, Robust classification for spam filtering by back-propagation neural networks using behavior-based features, Applied Intelligence 31 (2008) 107-121.
[10] E.-S.M. El-Alfy, R.E. Abdel-Aal, Using GMDH-based networks for improved spam detection and email feature analysis, Applied Soft Computing 11 (2011) 477-488.
[11] M.N. Marsono, M.W. El-Kharashi, F. Gebali, Targeting spam control on middleboxes: Spam detection based on layer-3 e-mail content classification, Computer Networks 53 (2009) 835-848.
[12] C.-C. Lai, An empirical study of three machine learning methods for spam filtering, Knowledge-Based Systems 20 (2007) 249-254.
[13] G.B. Huang, Q.Y. Zhu, C.K. Siew, Extreme learning machine: Theory and applications, Neurocomputing 70 (1-3) (2006) 489-501.
[14] A. Frank, A. Asuncion, UCI Machine Learning Repository, School of Information and Computer Science, University of California, Irvine, CA, http://archive.ics.uci.edu/ml.
[15] M. Hopkins, E. Reeber, G. Forman, J. Suermondt, Spam email database from UCI machine learning repository, 2005, http://www.ics.uci.edu/~mlearn/MLRepository.html (accessed Dec. 22, 2011).
[16] Q. Yuan, W. Zhou, S. Li, D. Cai, Epileptic EEG classification based on extreme learning machine and nonlinear features, Epilepsy Research 96 (2011) 29-38.
[17] S. Suresh, S. Saraswathi, N. Sundararajan, Performance enhancement of extreme learning machine for multi-category sparse data classification problems, Engineering Applications of Artificial Intelligence 23 (2010) 1149-1157.
[18] H.-J. Rong, Y.-S. Ong, A.-H. Tan, Z. Zhu, A fast pruned-extreme learning machine for classification problem, Neurocomputing 72 (2008) 359-366.
[19] X.-G. Zhao, G. Wang, X. Bi, P. Gong, Y. Zhao, XML document classification based on ELM, Neurocomputing 74 (2011) 2444-2451.
[20] K.-C. Ying, S.-W. Lin, Z.-J. Lee, Y.-T. Lin, An ensemble approach applied to classify spam e-mails, Expert Systems with Applications 37 (2010) 2197-2201.
[21] M.-C. Su, H.-H. Lo, F.-H. Hsu, A neural tree and its application to spam e-mail detection, Expert Systems with Applications 37 (2010) 7976-7985.
[22] Y. Yang, S. Elfayoumy, Anti-spam filtering using neural networks and Bayesian classifiers, in: Proceedings of the 2007 IEEE International Symposium on Computational Intelligence in Robotics and Automation, Jacksonville, FL, USA, Jun. 2007, pp. 20-23.
[23] P. Graham, Better Bayesian filtering, in: Proceedings of the 2003 Spam Conference, Jan. 2003, http://www.paulgraham.com/better.html (accessed Dec. 28, 2011).