SPAM E-MAIL DETECTION BASED ON MACHINE LEARNING
Adem Tekerek*, Omer Faruk Bay+
* Gazi University, Department of Information Technology, Ankara, Turkey, [email protected]
+ Gazi University, Technology Faculty, Department of Electrical and Electronics Engineering, Ankara, Turkey, [email protected]
Abstract: E-mail has been one of the most important communication tools of recent decades, and its usage grows every year. In 2017, there were nearly 2.8 billion e-mail users worldwide, and more than 225 billion e-mails were sent and received per day. By the end of 2019, the number of users is expected to exceed 2.9 billion and the daily volume to exceed 246 billion e-mails. However, not all of these e-mails are legitimate. Unsolicited e-mails are called spam e-mails. As of March 2017, spam accounted for 56.8% of all e-mail traffic worldwide. Spam e-mails threaten many sectors, so they need to be detected and separated from legitimate e-mail. Because spam e-mails cause financial loss and annoyance for recipients, in this study we present a spam e-mail detection technique that classifies e-mails using Bayesian Classification (BC), Random Tree (RT), and Support Vector Machine (SVM). According to the classification results, RT showed the best performance.
Keywords: Spam E-mail, Bayesian Classification, Random Tree, Support Vector Machine
1. INTRODUCTION
The development of information technology has allowed communication systems to evolve into their present state. Internet applications are the main driver of this development, and the foremost Internet application for communication is e-mail. A letter sent by regular postal services takes days or weeks to reach its destination, while an e-mail takes less than a second. In addition, the delivery cost of an e-mail is very low compared to postal mail. The increased use of the Internet, rapid transmission, and cost advantages have raised e-mail usage to unprecedented levels. By 2017, there were about 2.8 billion e-mail users around the world, and more than 225 billion e-mails were sent and received per day. Because e-mail is one of the most important communication tools today, marketers use it as a commercial tool to promote their products on a massive scale. Such promotional e-mails are sent without the permission of the receivers. Spam e-mails sent without the receiver's permission are also used to shape public opinion and to spread propaganda, betting, or pornographic content. Hence, spam e-mails annoy recipients and waste time and resources. Figure 1 presents the e-mail communication scheme between the user (client) and the Internet.
Figure 1. E-mail communication scheme between user (client) and Internet. Components shown: users, Internet, firewall, mail exchange, and e-mail server.
The growing use of e-mail systems has also increased their malicious use. Malicious e-mails are described as spam e-mails. Spam e-mail recipients may suffer material loss and moral damage because spam e-mails can include
attacks and deception methods such as phishing. Because of the rapid development of Internet technology, people have shifted from physical forms of communication to virtual ones such as e-mail, which are faster and save energy. E-mail is the most popular method of sending messages and sharing files with other people instantly [1]. According to the Symantec report, in 2013 about 87 out of every 100 spam e-mails contained at least one Uniform Resource Locator (URL) hyperlink [2]. Some of these URLs link to malicious websites that lead to malware infections; therefore, accurate detection of malicious spam e-mails is urgently needed. Certain detection criteria allow an e-mail to be identified as spam or as normal e-mail. Various approaches have been proposed in the literature to analyze the malicious use of e-mail [3, 4]. Generally, machine learning based methods are used to detect spam e-mails.
Sahami et al. [5] proposed a spam e-mail classification method based on the Bayesian approach. They evaluated the content of the e-mail together with domain-specific properties and showed that this could increase accuracy. Malathi [6] presented a spam detection method based on text categorization, using supervised learning with a Bayesian neural network that applies a rule-based heuristic approach and statistical analysis tests to identify spam e-mail. Zhan Chuan et al. [7] proposed an anti-spam e-mail filter based on an improved Bayesian classification. They used vector weights to represent word frequency, adopted attribute selection based on word entropy, and derived the corresponding formula. They reported that the filter clearly improves overall performance. Androutsopoulos et al. [8] studied learning to filter spam e-mail. They investigated two machine learning algorithms in the context of anti-spam filtering by comparing a Bayesian classifier with a memory-based approach, and measured performance on a publicly available corpus. Both methods achieved accurate spam filtering and were compared against the keyword-based filters that were widely used for e-mail.
Marsono et al. [9] demonstrated that Bayes e-mail content classification can be adapted for layer-3 processing without the need for reassembly. They also presented suggestions on pre-detecting e-mail packets on spam control middleboxes to support timely spam detection at receiving e-mail servers. Marsono et al. [10] presented a hardware architecture of a Bayes inference engine for spam control using two-class e-mail classification. It can classify more than 117 million features per second given a stream of probabilities as inputs. That work can be extended to investigate proactive spam handling schemes on receiving e-mail servers and spam throttling on network gateways.
The rest of this manuscript consists of three sections. Section 2 explains the classification algorithms. Section 3 presents the proposed spam detection model. Section 4 gives the conclusion.
2. CLASSIFICATION ALGORITHMS
In this study, the BC, RT, and SVM classification algorithms are used. These algorithms are described below.
2.1. Bayesian Classification (BC)
The Bayesian Classification algorithm is a machine learning method used in text classification. It is a form of statistical inference based on probability and is used to assign examples to previously defined classes. BC is particularly suited to inputs of high dimensionality. Bayesian Classification is divided into two models according to how the terms in a text are weighted: the Multinomial Model and the Multivariate Bernoulli Model. The Multivariate Bernoulli Model is based on binary term weighting: a term is weighted as 1 if it is present in the text and as 0 if it is not. The Multinomial Model is based on term frequency: the number of times a term occurs in a text determines its weight, so terms that are more common in the text have higher weights [11]. BC uses a discriminant function to compute the conditional probabilities $P(C_i \mid X)$ [12]. Given the inputs, $P(C_i \mid X)$ denotes the probability that example $X$ belongs to class $C_i$:
$$P(C_i \mid X) = \frac{P(C_i)\, P(X \mid C_i)}{P(X)}$$
$P(C_i)$ is the probability of observing class $C_i$. $P(X \mid C_i)$ denotes the probability of observing the example given class $C_i$. $P(X)$ is the probability of the input, which is independent of the classes.
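The two term-weighting schemes and Bayes' rule can be illustrated with a minimal Python sketch. The function names, the toy vocabulary, and the prior and likelihood values are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from collections import Counter

def bernoulli_weights(tokens, vocabulary):
    """Binary weighting: 1 if the term occurs in the text, otherwise 0."""
    present = set(tokens)
    return {term: int(term in present) for term in vocabulary}

def multinomial_weights(tokens, vocabulary):
    """Frequency weighting: the number of times each term occurs in the text."""
    counts = Counter(tokens)
    return {term: counts[term] for term in vocabulary}

def posterior(priors, likelihoods):
    """Bayes' rule: P(C_i | X) = P(C_i) * P(X | C_i) / P(X).

    P(X) is obtained by summing P(C_i) * P(X | C_i) over all classes.
    """
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

tokens = ["free", "offer", "free", "meeting"]
vocabulary = ["free", "offer", "invoice"]
print(bernoulli_weights(tokens, vocabulary))    # {'free': 1, 'offer': 1, 'invoice': 0}
print(multinomial_weights(tokens, vocabulary))  # {'free': 2, 'offer': 1, 'invoice': 0}

# Hypothetical priors P(C_i) and likelihoods P(X | C_i), for illustration only.
result = posterior({"spam": 0.4, "ham": 0.6}, {"spam": 0.05, "ham": 0.01})
print(result)
```

With these illustrative numbers the spam posterior is 0.02 / 0.026, so the example would be assigned to the spam class.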
2.2. Random Tree (RT)
An RT is a tree or arborescence formed by a stochastic process. RT generation produces a variety of trees at random, and for small numbers of leaves it can generate all possible trees [13]. Random trees have several uses:
- in phylogeny programs that do not have the ability to examine all trees or clusters of random trees,
- to estimate the distributions of tree comparison measures,
- to produce all possible tree shapes,
- as a basis for statistical tests.
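As an illustration of how randomness can enter a tree classifier, the following Python sketch grows a decision tree that inspects only a random subset of `k` features at each node. This is an assumption about the general style of random tree classifiers, not a description of the implementation used in this study; the Gini criterion, the function names, and the toy data are all hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, labels, features):
    """Return (score, feature, threshold) of the best split among the given features."""
    best = None
    for f in features:
        # Dropping the largest value keeps both sides of every split non-empty.
        for threshold in sorted({row[f] for row in rows})[:-1]:
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    return best

def build_tree(rows, labels, k, rng):
    """Grow a tree that inspects only k randomly chosen features at each node."""
    if len(set(labels)) == 1:
        return labels[0]                          # pure leaf
    features = rng.sample(range(len(rows[0])), k)
    split = best_split(rows, labels, features)
    if split is None:                             # no usable split: majority-vote leaf
        return Counter(labels).most_common(1)[0][0]
    _, f, threshold = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > threshold]
    return (f, threshold,
            build_tree([r for r, _ in left], [y for _, y in left], k, rng),
            build_tree([r for r, _ in right], [y for _, y in right], k, rng))

def predict(tree, row):
    """Follow the splits from the root until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree

# Toy data (hypothetical): two numeric features, label 0 = ham, 1 = spam.
rows = [[0.0, 0.1], [0.2, 0.0], [0.9, 3.0], [0.8, 2.5]]
labels = [0, 0, 1, 1]
tree = build_tree(rows, labels, k=1, rng=random.Random(0))
print([predict(tree, r) for r in rows])
```

Stopping with a majority-vote leaf when the sampled features offer no split is a simplification made to keep the sketch short.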
Figure 2 presents the RT working scheme.
Figure 2. RT working scheme.
2.3. Support Vector Machine (SVM)
SVM is a supervised learning technique that finds a linear separating boundary ($w^T x + b = 0$) that classifies the samples correctly. For a two-class problem, it constructs the hyperplane that maximizes the margin between the hyperplane and the nearest data points, where $w$ is the weight vector applied to the feature vector $x$ [14]. Figure 3 presents the SVM decision boundary scheme.
Figure 3. SVM decision boundary scheme. The data of one class and the data of another class lie on opposite sides of the decision boundary $w^T x + b = 0$; the margin lies between $w^T x + b = 1$ and $w^T x + b = -1$, and the support vectors are marked.
SVM is based on the concept of decision planes that define decision boundaries. A decision plane separates sets of objects with different class memberships. The SVM modeling algorithm finds the hyperplane with the maximum margin between the two classes, which requires solving the following optimization problem:
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$
subject to
$$\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le b, \quad i = 1, 2, \ldots, n$$
where $\alpha_i$ is the weight of training sample $x_i$. If $\alpha_i > 0$, $x_i$ is called a support vector. In these constraints, $b$ is a regularization parameter (an upper bound on the weights, not the offset in the hyperplane equation) used to trade off training accuracy against model complexity so that a superior generalization capability can be achieved. $K$ is a kernel function, which measures the similarity between two samples. A popular choice is the radial basis function (RBF) kernel:
$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right), \quad \gamma > 0$$
After the weights are determined [15], a test sample $x$ is classified by the resulting decision function.
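The RBF kernel and the final classification step can be sketched in Python as follows. The decision rule itself is not shown in the text, so the sketch assumes the standard kernel SVM form, sign(Σ α_i y_i K(x_i, x) + offset). The name `offset` for the bias term, the function names, and all numeric values are hypothetical and are not values from the study.

```python
import math

def rbf_kernel(x_i, x_j, gamma):
    """K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), with gamma > 0."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)

def decision_value(x, support_vectors, alphas, labels, offset, gamma):
    """Assumed standard kernel SVM score: sum_i alpha_i * y_i * K(x_i, x) + offset."""
    return offset + sum(a * y * rbf_kernel(sv, x, gamma)
                        for sv, a, y in zip(support_vectors, alphas, labels))

def classify(x, support_vectors, alphas, labels, offset, gamma):
    """Predict +1 or -1 from the sign of the decision value."""
    value = decision_value(x, support_vectors, alphas, labels, offset, gamma)
    return 1 if value >= 0 else -1

# Hypothetical support vectors and weights, for illustration only.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
alphas = [1.0, 1.0]   # satisfies sum(alpha_i * y_i) = 0 with the labels below
labels = [-1, 1]

print(classify([1.8, 2.1], support_vectors, alphas, labels, offset=0.0, gamma=0.5))
print(classify([0.1, -0.2], support_vectors, alphas, labels, offset=0.0, gamma=0.5))
```

A test point close to the positive support vector receives a kernel value near 1 from it and near 0 from the distant negative one, which is why the sign of the weighted sum follows the nearer class in this toy setting.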
3. PROPOSED SPAM E-MAIL DETECTION MODEL
The proposed model consists of two steps: preprocessing and classification. The accuracy of the classification process depends on the quality of the mining performed and the accuracy of the dataset used. Preprocessing is an important step that should be done before the data mining process. It includes data cleaning, integration, and transformation. The following techniques were used to preprocess the data:
- transforming all characters of the text to uppercase or lowercase,
- dividing the text into tokens by using an operator; the division can be based on non-letter characters,
- removing tokens that are not necessary for classification.
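The three preprocessing steps above can be sketched in Python as follows. The study does not specify which tokens are considered unnecessary, so the stop-word list here is a hypothetical placeholder, and the sketch assumes that only ASCII letters count as letters.

```python
import re

# Hypothetical stop-word list; the study does not specify which tokens it removes.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "is"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it on non-letter characters, and drop unneeded tokens."""
    lowered = text.lower()                      # step 1: case transformation
    tokens = re.split(r"[^a-z]+", lowered)      # step 2: tokenize on non-letters
    return [t for t in tokens if t and t not in stop_words]  # step 3: filter tokens

print(preprocess("WIN a FREE prize!!! Reply to claim the offer."))
# ['win', 'free', 'prize', 'reply', 'claim', 'offer']
```

Splitting on runs of non-letter characters also discards digits and punctuation, which may or may not be desirable depending on the features that are extracted afterwards.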
Figure 4 shows the execution steps of the proposed e-mail classification.
Figure 4. Execution steps of the proposed e-mail classification. Preprocessing: input e-mail, preprocess e-mail. Dataset training and testing: train BC, RT, and SVM using the training dataset; test the accuracy of the BC, RT, and SVM models using the test dataset. Classification: classify the e-mail as spam or ham.
BC uses the words in an e-mail to calculate the probabilities of spam and ham and so determine whether the e-mail is spam. RT is a tree-based classification and prediction method built on the classification and regression tree methodology. SVM is a classical method that performs classification tasks by constructing hyperplanes in a multidimensional space that separate cases with different class labels. The algorithms are tested separately on the dataset, and the same training process is applied to each for a fair comparison.
In this study, a dataset of 4601 e-mails, which is available in the UCI machine learning repository and was produced at Hewlett-Packard laboratories [16], is used. The dataset has 57 features. The first 48 features give the frequencies of tokens obtained from the e-mails. The six features 49-54 give the frequencies of characters in the e-mail. Features 55-57 give the "total number of letters", the "average number of letters", and the "number of letters of the longest word" for the words written in capital letters. The 58th attribute indicates whether the e-mail is spam. The dataset content information is presented in Table 1.
Table 1. Dataset Information
| Dataset | Values |
| --- | --- |
| Ham E-mail | 2788 |
| Spam E-mail | 1813 |
| Total E-mail | 4601 |
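The column layout described above can be made concrete with a short Python sketch that splits one record into its feature groups. It assumes each record is a single comma-separated line of 58 numeric values with the class label last; the function name, the dictionary keys, and the sample row are hypothetical.

```python
def split_spambase_row(line):
    """Split one comma-separated record into the feature groups described in the text."""
    values = [float(v) for v in line.strip().split(",")]
    if len(values) != 58:
        raise ValueError(f"expected 58 columns, got {len(values)}")
    return {
        "word_freq": values[0:48],        # features 1-48: token frequencies
        "char_freq": values[48:54],       # features 49-54: character frequencies
        "capital_runs": values[54:57],    # features 55-57: capital-letter statistics
        "is_spam": int(values[57]) == 1,  # column 58: class label (assumed 1 = spam)
    }

# Synthetic row for illustration only; it is not a record from the dataset.
sample = ",".join(["0.1"] * 48 + ["0.2"] * 6 + ["3.5", "40", "120", "1"])
record = split_spambase_row(sample)
print(len(record["word_freq"]), len(record["char_freq"]),
      record["capital_runs"], record["is_spam"])
```

Keeping the three feature groups separate makes it easier to check that a loaded file matches the 48 + 6 + 3 layout before any classifier is trained.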
This study is based on rule-based detection. The detection rules are framed by analyzing e-mail header information, word (token) matching, and the body of the e-mail message. The rules required for detecting spam e-mail have been specified. Each rule performs a test on the e-mail dataset and has a score that contributes to the decision about whether an e-mail is spam. The score values obtained for each e-mail are used as the input values of the BC, RT, and SVM algorithms. If the result exceeds the threshold, the e-mail is marked as spam; otherwise it is classified as ham. The results obtained in the study are compared with the studies [17, 18, 19, 20] in Table 2. According to the comparison, the proposed model performs best with RT, which has the highest Correctly Classified Instances (CCI) result (99.9348%) and the lowest False Positive (FP) rate (0.002%).
Table 2. Algorithms CCI Comparison Results and False Positive Rates (CCI = Correctly Classified Instances, BC = Bayesian Classification, SVM = Support Vector Machine, RT = Random Tree, EDT = Ensemble Decision Tree, NB = Naive Bayes, FP = False Positive)
| Order | Study | Algorithm | CCI (%) | F-Measure | FP (%) |
| --- | --- | --- | --- | --- | --- |
| 1 | Sharma, S., and Arora, A. [17] | BC | 88.56 | - | 0.192 |
| | | RT | 91.54 | - | 0.202 |
| | | SVM | 93.21 | - | - |
| 2 | Kiamarzpour, F. et al. [18] | BC | - | 0.963 | 0.231 |
| | | SVM | - | 0.846 | 0.024 |
| | | Voting | - | 0.933 | 0.07 |
| 3 | SS, R. K., and Atmosukarto, I. [19] | BC | 89.57 | - | - |
| | | SVM | 93.40 | - | - |
| | | EDT | 96.40 | - | - |
| 4 | W.A. Awad and S.M. ELseuofi [20] | NB | 99.46 | - | - |
| | | SVM | 96.90 | - | - |
| 5 | Proposed Study | BC | 90.4369 | 0.923 | 0.158 |
| | | RT | 99.9348 | 0.999 | 0.002 |
| | | SVM | 90.7629 | 0.926 | 0.162 |
(-) Refers to fields that are not given in the referenced papers.
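The three measures reported in Table 2 can be computed from actual and predicted labels as in the following Python sketch. It treats spam (label 1) as the positive class and computes the F-measure and FP rate for that class only; the study does not state whether its reported values are per class or averaged over classes, so this is an assumption. The function name and the toy labels are hypothetical.

```python
def evaluate(actual, predicted, positive=1):
    """Compute CCI (%), false positive rate, and F-measure for the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "cci_percent": 100.0 * (tp + tn) / len(pairs),   # correctly classified instances
        "fp_rate": fp / (fp + tn) if fp + tn else 0.0,   # ham wrongly marked as spam
        "f_measure": f_measure,
    }

# Toy labels for illustration only (1 = spam, 0 = ham).
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = evaluate(actual, predicted)
print(metrics)
```

In the toy example, six of eight e-mails are classified correctly, so CCI is 75%, and one of five ham e-mails is flagged as spam, giving a false positive rate of 0.2.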
The plot of the RT classification results is presented in Figure 5, where zero (0) represents ham e-mails and one (1) represents spam e-mails.
Figure 5. RT classification results plot graph.
4. CONCLUSION
In this study, the aim was to detect spam e-mails using BC, RT, and SVM, which are machine learning methods. Although there are many e-mail spam filtering studies, e-mail spam filtering remains a challenging problem for researchers because spammers persist and keep adopting new techniques. In this study, we used a dataset produced at Hewlett-Packard laboratories. The performance of the proposed model was evaluated using the training set, and it was observed that the RT classifier outperforms the other classifiers and has a very low false positive rate compared to the other algorithms. E-mail spam filters using this approach can be deployed either at the e-mail server or on the e-mail client side to reduce the number of spam messages and the risk of productivity loss and of wasted bandwidth and storage.
5. ACKNOWLEDGEMENT
We would like to acknowledge the developers of WEKA (https://www.cs.waikato.ac.nz/ml/weka/) for providing a testbed for machine learning algorithms.
REFERENCES
[1]. Ozawa, Seiichi, et al. "An autonomous online malicious spam e-mail detection system using extended RBF network." Neural Networks (IJCNN), 2015 International Joint Conference on. IEEE, 2015.
[2]. Symantec Corporation, Internet Security Threat Report 2014, 19: 1-98, 2014.
[3]. J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, "Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs," Proc. of the 15th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 1245-1254, 2009.
[4]. M. Cova, C. Kruegel, and G. Vigna, "Detection and Analysis of Drive-by-Download Attacks and Malicious JavaScript Code," Proc. of the 19th Int. Conf. on World Wide Web (WWW '10), pp. 281-290, 2010.
[5]. M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A bayesian approach to filtering junk e-mail ... AAAI Tech. Rep. WS-98-05, pp. 55-62, 1998.
[6]. R. Malathi, "Email Spam Filter using Supervised Learning with Bayesian Neural Network", Computer Science, H.H. The Rajah's College, Pudukkottai-622 001, Tamil Nadu, India, Int J Engg Techsci, Vol 2(1), 89-100, 2011.
[7]. Zhan Chuan, LU Xian-liang, ZHOU Xu, HOU Meng-shu, "An Improved Bayesian with Application to Anti-Spam E-mail", Journal of Electronic Science and Technology of China, Vol. 3, No. 1, Mar. 2005.
[8]. I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C.D. Spyropoulos, and P. Stamatopoulos, "Learning to filter spam e-mail: A comparison of a naive bayesian and a memory-based approach", Proceedings of the Workshop on Machine Learning and Textual Information Access, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2000), pp. 1-13, 2000.
[9]. Muhammad N. Marsono, M. Watheq El-Kharashi, Fayez Gebali, "Targeting spam control on middleboxes: Spam detection based on layer-3 e-mail content classification", Elsevier Computer Networks, 2009.
[10]. M. N. Marsono, M. W. El-Kharashi, and F. Gebali, "Binary LNS-based naïve Bayes inference engine for spam control: Noise analysis and FPGA synthesis", IET Computers & Digital Techniques, 2008.
[11]. A. McCallum, K. Nigam, "A comparison of event models for Naïve bayes text classification", Proceedings of the Workshop on Learning for Text Categorization (AAAI'98), 41-48, 1998.
[12]. J. Gama, "A Linear-Bayes Classifier", IBERAMIA-SBIA 2000, LNAI 1952, pp. 269-279, 2000.
[13]. Lladó, A., "Decomposing almost complete graphs by random trees", Journal of Combinatorial Theory, Series A, 154, pp. 406-421, 2018.
[14]. G. Chechik, G. Heitz, "Max-margin Classification of Data with Absent Features", Journal of Machine Learning Research 9, 2008.
[15]. El-Sayed M. El-Alfy, Radwan E. Abdel-Aal, "Using GMDH-based networks for improved spam detection and email feature analysis", Applied Soft Computing, Volume 11, Issue 1, January 2011.
[16]. Internet: "Spambase Data Set", http://archive.ics.uci.edu/ml/datasets/Spambase, 26.02.2018.
[17]. Sharma, S., and Arora, A., "Adaptive approach for spam detection", International Journal of Computer Science Issues, 10(4), pp. 23-26, 2013.
[18]. Kiamarzpour, F., Dianat, R., and Sadeghzadeh, M., "Improving the methods of email classification based on words ontology", arXiv preprint arXiv:1310.5963, 2013.
[19]. SS, R. K., and Atmosukarto, I., "Spam or Not Spam–That is the question".
[20]. W.A. Awad and S.M. ELseuofi, "Machine Learning Methods for Spam E-mail Classification", International Journal of Computer Science & Information Technology (IJCSIT), 3(1), pp. 173-184, 2011.