Text Mining Approach to Detect Spam in Emails

Proceedings of The International Conference on Innovations in Intelligent Systems and Computing Technologies, Philippines 2016

Zahra Khan and Usman Qamar
Department of Computer Engineering, NUST, College of EME, Rawalpindi, Pakistan

ABSTRACT With advances in technology, most modern communication takes place through email. This has made communication faster and easier. One disadvantage of using email as a primary mode of communication is advertising: many companies use email for advertisement and keep sending messages that contain unwanted information, often referred to as spam. Although many approaches have been developed for identifying spam emails, none of them gives 100% accuracy in spam identification and screening. In this paper, a method for identifying and screening spam emails is proposed using the RapidMiner data mining tool. Pre-processing is done first, using different data mining pre-processing techniques. The major emphasis of the proposed approach is on pre-processing and its importance in text mining. After pre-processing, different classification algorithms are applied to the sample dataset. Furthermore, cross-validation is performed on the basis of different parameters. In the end, a model with the best classifier in combination with pre-processing techniques is identified based on accuracy, precision, recall, execution time and error rate. The proposed model is used to identify spam emails.

KEYWORDS A text mining approach to detect spam, preprocessing techniques for text mining, spam emails, how to deal with spam emails, Importance of preprocessing techniques, RapidMiner-Data mining.


ISBN: 978-1-941968-27-7 ©2016 SDIWC

1 INTRODUCTION
With the fast-growing use of the internet, electronic mail has become an effective, rapid and inexpensive means of communication. Using email, a user can transfer data anywhere in the world within seconds. The increase in email usage over the past few years has led to a corresponding increase in spam. Spam is usually defined as unwanted or unsolicited mail that is sent in bulk to different recipients. Since the early 1990s there has been a dramatic increase in spam emails. 81% of spam is sent through botnets, i.e. networks of virus-infected computers. Most of the time spammers gather email addresses from chat rooms, websites, customer sites and viruses that inspect end users' address books; the collected addresses are then sold to other spammers [1]. Spam messages are sent in bulk by many businesses because most of the cost of spam is borne by the receiver. These messages are a major threat to the information technology world and cause billions of dollars of loss in terms of throughput. During the past few years spam emails have come to be regarded as a serious security threat and are used for phishing of sensitive data. Spam emails also spread viruses among users [2]. For these reasons email classification has become a significant area of study, the prime aim of which is to automatically separate spam emails from non-spam emails. Automatic classification of spam emails is challenging because of unstructured data, meaningless text and the large number of text files.



A number of classification algorithms have been implemented to identify spam emails, but 100% correctness in predicting spam remains out of reach. Identifying the right algorithm in combination with pre-processing techniques is a challenging task because each algorithm has its own pros and cons. For this research the spam dataset is taken from the CSMining repository and analysed using the RapidMiner data mining tool. Pre-processing is done first to obtain structured data from the text files. A wordlist is then extracted from the structured data and used to train different algorithms. Finally, a model is proposed that uses a classification algorithm in combination with pre-processing techniques to differentiate spam emails from non-spam emails.

Section 2 presents a brief overview of the literature on spam classification. Section 3 describes the proposed approach and its modules. Section 4 comprises results and performance assessment. Section 5 gives the conclusion and Section 6 outlines future directions.

2 LITERATURE REVIEW
Spam emails are among the biggest threats on today's internet. They cause financial loss and frustration among email users. Of all the approaches that have been developed to deal with junk emails, filtering is one of the key ones. Spam emails, often referred to as junk emails or unsolicited emails, are sent to individuals who have never asked for them. The main purpose of spam filters is to keep users' inboxes free from spam. Many problems are associated with these emails: they consume space in the inbox, get mixed with important personal emails, use network bandwidth, and require time and energy to sort through [3].

Two significant approaches to classification were defined in paper [4]. The first is based on automatically defined rules; the most common example is a rule-based system. Rule-based systems are commonly used when classes are static and easily separable on the basis of some common features. The second is based on machine learning techniques. In paper [5] a criterion function is used to define clusters of spam messages. The k-nearest neighbor algorithm is used to define the criterion function, whose main purpose is to maximize the similarity between messages in a cluster. Symbiotic Data Mining (SDM) [6] is a data mining approach that uses Content Based Filtering (CBF) and Collaborative Filtering (CF). In order to improve personalized filtering, local filters are reused from diverse entities while privacy is maintained. Paper [7] assesses the effectiveness of spam filters based on Naïve Bayes and Neural Network classifiers. Accuracy and sensitivity metrics are used to evaluate the results. According to that paper, accurate results can be achieved using the feed-forward back-propagation network algorithm, where more accurate means higher accuracy and higher sensitivity. A mixed membership model based on assumptions at four different levels is used in a Bayesian approach to soft clustering and classification. In papers [8]-[11] automatic anti-spam filtering to remove advertisements becomes a key unit of junk filtering tools. The author of paper [12] uses separate distance measures for numerical and nominal data and then combines them into one. In another approach all numerical data is first converted into nominal data, which is then used to calculate a distance measure over all variables. In that paper data is taken from different application domains, and the complexity and scalability of different algorithms are measured by testing their performance on that data.
Paper [13] identifies spammers' social networks using spectral clustering based on high behavioral similarity. The data is taken from Project Honey Pot [14]. The conclusion of that paper states that “(1) phishing emails are either sent by spammers or no phishing emails at all, (2) phishing emails are either sent by the most communities of spammers or no phishing emails at all (3) numerous groups of spammers in groups show clear progressive activities by having comparable IP addresses”. Paper [15] shows that both stated methods are comparable in generalization performance and that both are resource efficient even when a large number of training sets is used. Bags of words created from different websites are used for spam classification.

3 PROPOSED APPROACH
The proposed approach consists of two major steps, i.e. pre-processing and classification. The design of the proposed framework is given in Figure 1, followed by details of each component.

Figure 1. Proposed Approach (diagram labels: Emails in the form of text files; Pre-processing / Cleansing; Test Dataset; Performance Evaluation)

3.1 Pre-processing
In the real world most data is incomplete and contains incorrect, missing and noisy values. The accuracy of the results depends on the quality of the mining, which in turn depends on the accuracy of the data used. Pre-processing is therefore a key task that needs to be done before any mining. Data pre-processing includes data cleansing, integration and transformation. The following pre-processing techniques have been used.

• Transform case is an operator that transforms all characters in the text files to either upper or lower case.
• Tokenize is an operator that splits the text into tokens. There are different ways to identify splitting points. Here the split is done on non-letter characters, which creates tokens that each consist of one single word.
• Filter token (by length) is an operator that filters the tokens created by tokenization on the basis of their length, i.e. the number of characters they contain.
• Stem (Snowball) is an operator that stems words by applying stemming algorithms written for the Snowball language. Different stemming algorithms can be chosen for different languages.
• Filter Stopwords (English) is an operator that removes stopwords from the document. It removes every token that is equal to a stopword from the built-in stopword list. This operator works properly only if each token represents a single English word. Therefore it was used in combination with Tokenize: the tokens produced by the Tokenize operator are the input to Filter Stopwords.
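The operator chain of Section 3.1 can be sketched outside RapidMiner in a few lines of Python. This is only an illustration under stated assumptions, not the RapidMiner implementation: the function name `preprocess`, the length limits `min_len` and `max_len`, and the tiny `STOPWORDS` set are all hypothetical, and the stemming step is an identity placeholder where a Snowball stemmer from a library of choice could be plugged in.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical, deliberately tiny stopword list (RapidMiner uses its own built-in English list).
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is",
                       "it", "for", "on", "this", "that", "with"})


def preprocess(text: str,
               min_len: int = 3,
               max_len: int = 25,
               stopwords: Iterable[str] = STOPWORDS,
               stem: Callable[[str], str] = lambda token: token) -> List[str]:
    """Apply the five steps in the order listed in Section 3.1."""
    lowered = text.lower()                                        # transform case
    tokens = re.findall(r"[a-z]+", lowered)                       # tokenize on non-letters (ASCII only)
    tokens = [t for t in tokens if min_len <= len(t) <= max_len]  # filter tokens by length
    tokens = [stem(t) for t in tokens]                            # stem (identity placeholder)
    stop = set(stopwords)
    return [t for t in tokens if t not in stop]                   # filter stopwords


print(preprocess("Win a FREE prize: reply to THIS e-mail!"))
# → ['win', 'free', 'prize', 'reply', 'mail']
```

The length and stopword values are assumptions for the example; the paper does not report the settings used in RapidMiner.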


3.2 Classifiers
After cleansing the data using the pre-processing techniques, three classification algorithms are applied to the training dataset one by one and their performance is compared.

• Bayes' theorem is the basis of the Naïve Bayes classifier [16]. It is best suited to input with high dimensionality. The Naïve Bayes classifier assumes that the presence (or absence) of a specific feature of a class (i.e. attribute) is unrelated to the presence (or absence) of any other feature. Its advantage is that it requires only a small amount of training data to estimate the means and variances of the variables necessary for classification.
• A decision tree is a tree-like graph or model. Data represented in this manner is easy to comprehend. The main purpose of this approach is to build a model that predicts the value of a target attribute (i.e. class or label) from several input attributes of the example data set. Recursive partitioning (repeatedly splitting on the values of attributes) is used to create a decision tree. The same splitting step is repeated in each recursion.
• In a Support Vector Machine (SVM), decision boundaries are defined using decision planes; this is the basic concept of SVM. A decision plane is one that separates objects having different class memberships. Input data is given to the standard SVM, which predicts for each input which of two possible classes it belongs to. This makes SVM a non-probabilistic binary linear classifier.

3.3 Performance Evaluation

Performance evaluation is done on the basis of the parameters given in Table 1.

Table 1. Parameters Detail

| Parameter | Detail | Formula |
|---|---|---|
| Error Rate | Percentage of the dataset wrongly classified by the method. | $\text{Number of incorrectly identified samples} / \text{Total number of samples in the class}$ |
| Accuracy | Percentage of the dataset correctly classified by the method. | $\text{Number of correctly identified samples} / \text{Total number of samples in the class}$ |
| Recall | Fraction of all positive examples that were correctly classified as positive. | $\mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$ |
| Precision | Fraction of the examples classified as positive that are truly positive. | $\mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$ |
| Execution Time | Total time taken by a classifier to execute. | |
| F Measure | A combination of the precision and the recall. | $2\,(\text{Precision} \times \text{Recall}) / (\text{Precision} + \text{Recall})$ |
| True Positive (TP) | Number of positive instances correctly classified. | |
| True Negative (TN) | Number of negative instances correctly rejected. | |
| False Positive (FP) | Number of negative instances incorrectly classified as positive. | |
| False Negative (FN) | Number of positive instances incorrectly rejected. | |
| False Positive Rate (FPR) | Expected proportion of negative instances that are classified as positive. | $\mathrm{FP} / (\mathrm{FP} + \mathrm{TN})$ |
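As a quick check of the formulas in Table 1, the sketch below computes each parameter from the four confusion counts. The function name `evaluate` is hypothetical, and accuracy and error rate are computed over all evaluated samples, which is an assumption about how "total number of samples" is meant. The input values are the Naive Bayes counts reported in Table 2 (TP 69, TN 98, FP 2, FN 31).

```python
from typing import Dict


def evaluate(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute the Table 1 parameters from confusion counts (values as fractions, not percent)."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total          # correctly identified / total
    precision = tp / (tp + fp)            # TP / (TP + FP)
    recall = tp / (tp + fn)               # TP / (TP + FN)
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,     # incorrectly identified / total
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
    }


# Naive Bayes counts from Table 2.
metrics = evaluate(tp=69, tn=98, fp=2, fn=31)
print({name: round(value, 4) for name, value in metrics.items()})
# → {'accuracy': 0.835, 'error_rate': 0.165, 'precision': 0.9718,
#    'recall': 0.69, 'f_measure': 0.807, 'false_positive_rate': 0.02}
```

The computed accuracy (83.50 %), recall (69.00 %), precision (97.18 %), F measure (80.70 %) and false positive rate (0.02) agree with the Naive Bayes column of Table 2.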

4 RESULTS AND PERFORMANCE ASSESSMENT

4.1 Spam Dataset
The spam dataset used in this research was taken from the Enron Spam Dataset, available at the CSMining website. The dataset consists of 4326 messages, of which 1351 are spam and 2975 are ham (non-spam). Spam and ham emails are provided as .eml files and are distinguished by the labels 0 and 1. A piece of Java code was used to separate spam from ham emails. Because all emails were in .eml format and the text mining techniques can only be applied to data in .txt format, the data was first converted from .eml to .txt. After conversion, pre-processing was done and the different classification algorithms were applied. Results of the classification algorithms on the basis of the selected parameters are given in Table 2.

Table 2. Results of classification algorithms

| Parameter | Naive Bayes | Decision Tree | Support Vector Machine |
|---|---|---|---|
| Error Rate | 16.05 % | 4.00 % | 9.00 % |
| Accuracy | 83.50 % | 96.00 % | 91.00 % |
| Recall | 69.00 % | 94.00 % | 100.00 % |
| Precision | 97.18 % | 98 % | 84.74 % |
| Execution Time | 7 sec | 5 sec | 6 sec |
| F Measure | 80.70 % | 96 % | 91.73 % |
| True Positive | 69 | - | - |
| True Negative | 98 | - | - |
| False Positive | 2 | 2 | 18 |
| False Negative | 31 | 6 | 0 |
| False Positive Rate | 0.02 | - | - |
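The .eml to .txt conversion described in Section 4.1 was done by the authors in Java, and that code is not given. As a rough illustration only, the same step could look as follows with Python's standard `email` package. The function names `eml_bytes_to_text` and `convert_folder`, the choice to keep only the subject line and the plain-text body, and the folder layout are all assumptions made for this sketch.

```python
from email import policy
from email.parser import BytesParser
from pathlib import Path


def eml_bytes_to_text(raw: bytes) -> str:
    """Return the subject line followed by the plain-text body of one raw message."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    subject = msg["subject"] or ""
    body = msg.get_body(preferencelist=("plain",))   # None if no text/plain part exists
    content = body.get_content() if body is not None else ""
    return f"{subject}\n{content}"


def convert_folder(src_dir: str, dst_dir: str) -> int:
    """Write one .txt file per .eml file found in src_dir; return the number converted."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for eml_path in sorted(Path(src_dir).glob("*.eml")):
        text = eml_bytes_to_text(eml_path.read_bytes())
        (dst / (eml_path.stem + ".txt")).write_text(text, encoding="utf-8")
        count += 1
    return count
```

Real spam corpora often contain malformed headers or unknown character sets, so a production version would need error handling around the parsing step. That is omitted here to keep the sketch short.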


Figure 2. Comparison on the basis of Parameters



From the graph in Figure 2 and the results of the different classification algorithms in Table 2, the framework with the Decision Tree classifier in combination with the above-mentioned pre-processing techniques comes out as the best framework for the classification of spam emails. The confusion matrix for the Decision Tree classifier is given in Table 3.

Table 3. Confusion Matrix for Decision Tree

| | True SPAM | True HAM |
|---|---|---|
| Predicted SPAM | 94 | 2 |
| Predicted HAM | - | - |
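The recursive partitioning behind the decision tree classifier of Section 3.2 rests on one repeated step: choosing the attribute that best separates the classes. The sketch below shows that single step for word-presence attributes, using information gain as the splitting criterion. Information gain is an assumption here, since the paper does not state which criterion was configured in RapidMiner, and the names `entropy` and `best_split_token` as well as the four toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import List, Optional, Set, Tuple


def entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_split_token(docs: List[Tuple[Set[str], str]]) -> Tuple[Optional[str], float]:
    """Return the token whose presence/absence split gives the highest information gain."""
    labels = [label for _, label in docs]
    base = entropy(labels)
    vocab = set().union(*(tokens for tokens, _ in docs))
    best_token, best_gain = None, 0.0
    for token in sorted(vocab):                       # sorted for a deterministic tie-break
        present = [label for tokens, label in docs if token in tokens]
        absent = [label for tokens, label in docs if token not in tokens]
        if not present or not absent:
            continue                                  # token does not split the data
        remainder = (len(present) * entropy(present)
                     + len(absent) * entropy(absent)) / len(docs)
        gain = base - remainder
        if gain > best_gain:
            best_token, best_gain = token, gain
    return best_token, best_gain


toy_docs = [({"free", "prize"}, "spam"), ({"free", "offer"}, "spam"),
            ({"meeting", "monday"}, "ham"), ({"report", "monday"}, "ham")]
print(best_split_token(toy_docs))
# → ('free', 1.0)
```

A full tree would apply the same function again to each of the two resulting subsets until the subsets are pure or a stopping rule is met.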



Table 4. Comparison with Other Approaches

| Approach | Data Set used | Accuracy |
|---|---|---|
| Classification Model based on Rough Set | Enron Dataset | - |
| Spam Email Classification with Artificial Neural network | Enron Dataset | - |
| Spam Classification Using Machine Learning Technique | Enron Dataset | - |
| Comparative Study on Email Spam Classifier Using Feature Reduction Technique | Enron Dataset | - |
| Our Proposed Framework | Enron Dataset | 96.00 % |

Accuracies of 92.00 %, 94.00 % and 95.00 % are listed for the earlier approaches.

5 CONCLUSION
During the last few decades email classification has received a great deal of attention because it helps to identify spam mails and threats. A lot of work is therefore being done in this domain to find the best classifier for email classification. From the results obtained in combination with the pre-processing techniques, it is clear that the framework with the Decision Tree performs better than the other classifiers. After extensive pre-processing, the decision tree was applied and proved to be 96.00 % effective in classifying spam emails.

6 FUTURE WORK
The proposed model can be enhanced in the following ways:

• Achieve higher accuracy by using classifiers in combination.
• Develop a technique that can catch sentimental phrases and train the methodology on such spam.
• Multilingual spam email classification.

REFERENCES
[1] Prince, M.B., Dahl, B.M., Holloway, L., Keller, A.M. and Langheinrich, E., 2005, July. Understanding How Spammers Steal Your E-Mail Address: An Analysis of the First Six Months of Data from Project Honey Pot. In CEAS.

[2] Alguliev, R.M., Aliguliyev, R.M. and Nazirova, S.A., 2011. Classification of textual e-mail spam using data mining techniques. Applied Computational Intelligence and Soft Computing, 2011, p.10.

[3] Bratko, A., Filipič, B., Cormack, G.V., Lynam, T.R. and Zupan, B., 2006. Spam filtering using statistical data compression models. The Journal of Machine Learning Research, 7, pp.2673-2698.

[4] Apté, C., Damerau, F. and Weiss, S.M., 1994. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems (TOIS), 12(3), pp.233-251.

[5] Sharma, A. and Rastogi, V., 2014. Spam Filtering using K mean Clustering with Local Feature Selection Classifier. International Journal of Computer Applications, 108(10).




[6] Cortez, P., Lopes, C., Sousa, P., Rocha, M. and Rio, M., 2009, September. Symbiotic data mining for personalized spam filtering. In Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology-Volume 01 (pp. 149-156). IEEE Computer Society.

[7] Sharma, A.K., Prajapat, S.K. and Aslam, M., 2014, April. A Comparative Study Between Naive Bayes and Neural Network (MLP) Classifier for Spam Email Detection. In IJCA Proceedings on National Seminar on Recent Advances in Wireless Networks and Communications (No. 2, pp. 12-16). Foundation of Computer Science (FCS).

[8] Chandrasekaran, M., Narayanan, K. and Upadhyaya, S., 2006, June. Phishing email detection based on structural properties. In NYS Cyber Security Conference (pp. 1-7).

[9] Nazirova, S., 2010. Mechanism of classification of text spam messages collected in spam pattern bases. In Proceedings of 3rd International Conference on Problems of Cybernetics and Informatics (Vol. 2, pp. 206-209).

[10] Cohen, W.W. and Singer, Y., 1999. Context-sensitive learning methods for text categorization. ACM Transactions on Information Systems (TOIS), 17(2), pp.141-173.

[11] Khorsi, A., 2007. An overview of content-based spam filtering techniques. Informatica, 31(3).

[12] Li, X. and Ye, N., 2006. A supervised clustering and classification algorithm for mining data with mixed variables. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 36(2), pp.396-406.

[13] Li, X. and Ye, N., 2006. A supervised clustering and classification algorithm for mining data with mixed variables. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 36(2), pp.396-406.

[14] Spitzner, L., 2003. The honeynet project: Trapping the hackers. IEEE Security & Privacy, (2), pp.15-23.

[15] Kiritchenko, S. and Matwin, S., 2011, November. Email classification with co-training. In Proceedings of the 2011 Conference of the Center for Advanced Studies on Collaborative Research (pp. 301-312). IBM Corp.

[16] Androutsopoulos, I., Koutsias, J., Chandrinos, K.V., Paliouras, G. and Spyropoulos, C.D., 2000. An evaluation of naive bayesian anti-spam filtering. arXiv preprint cs/0006013.