Interaction between Feature Subset Search Techniques and Machine Learning Classifiers for Detecting Unsolicited Emails

Shrawan Kumar Trivedi
Information Systems, Indian Institute of Management, Prabandh Shikhar, Rau, Indore – 453 556, India
[email protected]

Shubhamoy Dey
Information Systems, Indian Institute of Management, Prabandh Shikhar, Rau, Indore – 453 556, India
[email protected]
ABSTRACT Separating spam from a large collection of email files remains a challenging research problem. A good classifier is judged not only by its accuracy but also by how quickly it detects false alarms while using a small number of features. To address these challenges, this research examines the effect of features selected by four feature subset search methods (Genetic, Greedy Stepwise, Best First and Rank Search) on popular Machine Learning Classifiers: Bayesian, Naive Bayes, Support Vector Machine, Genetic Algorithm, J48 and Random Forest. Tests were performed on three publicly available spam email datasets: “Enron”, “SpamAssassin” and “LingSpam”. The results show that Greedy Stepwise Search is a good method for feature subset selection in spam email detection. Among the Machine Learning Classifiers, Support Vector Machine was found to be the best classifier in terms of both accuracy and False Positive rate, although the results of Random Forest were very close to the best. The Genetic classifier was identified as a weak classifier.
Categories and Subject Descriptors I.5.2 [Pattern Recognition]: Design Methodology—classifier design and evaluation, feature evaluation and selection; I.2.7 [Natural Language Processing] – Text analysis
General Terms Algorithms, Performance, Experimentation, Application
Key words Email spam classification, Feature selection, Evolutionary algorithms, False Positive Rate, Classification Accuracy.
1. INTRODUCTION
In today’s automated world, email is a necessary and useful tool that enables rapid and inexpensive communication. It has become a popular medium and can be seen as an essential part of life [1]. On the other hand, Spam (also known as unsolicited bulk email) has turned into a challenge because its volume is increasing day by day. One study estimates that 70% of business emails are Spam. This rapid growth causes serious problems, such as users’ mailboxes filling up unnecessarily and burying important emails, consumption of storage space and bandwidth, and the time needed to segregate the Spam [2]. Nowadays, Spam classification is challenging because of the complexity that spammers introduce into the features of Spam. Complexity here means modifications made to spam words that make a feature difficult to recognise. Some attacks, such as Tokenisation (i.e. splitting or modifying a feature, such as ‘free’ written as f r 3 3) and Obfuscation (hiding a feature by adding HTML or other codes, such as ‘free’ coded as frexe or FR3E), change the information carried by a feature [3, 4].
Various Machine Learning Classifiers have been tried to tackle these problems, and some have demonstrated their strength in Spam classification. In particular, Support Vector Machines (SVM), Probabilistic Classifiers (Bayesian and Naive Bayes), Decision Tree Classifiers (J48 and Random Forest) and Evolutionary Classifiers (Genetic classifiers) have proven their efficacy in this area. SVM [5, 3] uses the concept of “Statistical Learning Theory” proposed by Vapnik [6]. Probabilistic classifiers such as Naive Bayes [7, 8] and the Bayesian Classifier [9, 10, 11], which are based on Bayes’ theorem, are also popular. Evolutionary Classifiers [12], which are based on the principles of evolution, have been intensely researched. Decision Tree Classifiers [13] use a multistage approach that breaks a complex decision into a combination of simpler decisions and in this way arrive at an optimum decision.
In this study, the classifiers mentioned above have been tested on three well-known publicly available datasets (Enron, SpamAssassin and LingSpam) to evaluate their efficacy when used in conjunction with four different feature subset selection techniques: Genetic search, Greedy Stepwise search, Best First search and Rank search. A comparative analysis of the performance (in terms of accuracy and false positive rate) of the various combinations of feature subset selection techniques and classifiers is presented.
The rest of this paper is structured as follows: Section 2 summarizes related work, Section 3 describes the methodology used in this research, Section 4 describes the experimental setup and evaluation, Section 5 presents the comparative analysis and Section 6 concludes the paper.
2. RELATED WORK
Text classification is currently generating substantial interest, and various classifiers and feature selection methods have been tested and reported in the literature. This study focuses on Spam email classification as an application of text classification, an area in which a large body of research is available. This study uses four feature subset search methods (Genetic, Greedy Stepwise, Best First and Rank search) to obtain a small number of the most informative features from three publicly available datasets: Enron [3, 11], SpamAssassin [14, 10] and LingSpam. Six popular machine learning classifiers have been tested for comparative performance evaluation.
The Bayesian classifier, introduced by Lewis in 1998 [15], has already proven its strength in the literature. This technique uses the content and domain knowledge of the email files and classifies Spam from the whole set accordingly. An extended Naive Bayes approach was tried in the series of studies by Androutsopoulos et al. in 2000 [7]. These studies evaluate the performance of classifiers with different numbers of features and data sizes, and their results strongly support the worth of the proposed probabilistic classifier. A recent study by Trivedi and Dey (2013) [11] used boosting algorithms to improve the performance of probabilistic classifiers. It reports that probabilistic classifiers work effectively with boosting even when the number of features is small.
The Support Vector Machine has been tested many times in the literature and has a respected place. A study by Drucker et al. [16] compares the performance of SVM with various machine learning classifiers. The results favoured SVM and boosted decision trees in terms of accuracy and speed; however, the training time of SVM was less than that of boosted decision trees.
Decision tree classifiers have also captured a prominent place in the literature. Rios and Zha [4] tested a Random Forest (RF) classifier on time-indexed data that includes text and metadata features. In that study, RF was comparable with SVM in terms of false positive rate.
Rule-based classifiers such as the Genetic Classifier [12, 13, 17, 19] continue to be studied by researchers because of their interesting rules, such as Selection rules (searching for the fittest individual) and Reproduction rules (combining and altering individuals to obtain new individuals).
3. METHODOLOGY
3.1 Machine Learning Classifiers
3.1.1 Genetic Algorithm based Classifier:
This algorithm uses a learning approach based on the principles of natural selection, first introduced by Holland [20]. A Genetic algorithm starts with a population of individuals of constant size that samples the search space. Each individual of the population is evaluated for its fitness. New individuals are then produced by selecting the best-performing individuals, who produce “offspring” [21]. The offspring retain the characteristics of their parents and yield a population with improved fitness. New individuals are generated by two Genetic operators: “Crossover” and “Mutation”. The Crossover operator randomly selects a point in two parent gene structures and develops two new individuals by exchanging the parts of the parents that follow that point. This operator therefore forms two new individuals with potentially improved fitness by combining two old individuals. The Mutation operator, on the other hand, creates a new individual by arbitrarily altering some component of an old individual. It acts as a population perturbation operator that introduces potentially new information into the population, and it helps to stave off any stagnation that can arise during the search process.
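The two operators described above can be illustrated with a minimal sketch. It assumes individuals are encoded as fixed-length lists of bits, which the paper does not specify; the function names and the mutation rate are illustrative only.

```python
import random

def single_point_crossover(parent_a, parent_b, rng):
    """Exchange the tails of two equal-length parents at a random cut point."""
    assert len(parent_a) == len(parent_b) and len(parent_a) >= 2
    cut = rng.randint(1, len(parent_a) - 1)  # cut strictly inside the string
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b

def mutate(individual, rate, rng):
    """Flip each bit independently with probability `rate` (assumed encoding)."""
    return [1 - gene if rng.random() < rate else gene for gene in individual]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
a, b = [0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]
child_a, child_b = single_point_crossover(a, b, rng)
mutant = mutate(child_a, rate=0.1, rng=rng)
```

Because the two example parents are complementary, every position of `child_a` and `child_b` still holds one 0 and one 1 between them, which makes the exchange of tails easy to see.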
3.1.2 Probabilistic Classifiers:
This idea was proposed by Lewis in 1998 [13], who introduced the term $P(c_i \mid d_j)$, defined as the probability that a document represented by a vector $d_j = (w_{1j}, w_{2j}, \ldots, w_{nj})$ of terms falls within a certain category $c_i$. This probability is calculated by Bayes’ theorem:

$$P(c_i \mid d_j) = \frac{P(c_i)\, P(d_j \mid c_i)}{P(d_j)} \qquad (1)$$

where $P(d_j)$ is the probability that an arbitrarily selected document is represented by the document vector $d_j$, and $P(c_i)$ is the probability that an arbitrarily selected document belongs to a particular class $c_i$. This classification method is usually known as “Bayesian Classification”. The Bayesian Classifier is a popular technique, but it has been shown to have limitations when the data vector $d_j$ is high-dimensional. This limitation is tackled by assuming that any two arbitrarily selected components (tokens) of the document vector $d_j$ are independent of each other. This assumption is formalized by the equation

$$P(d_j \mid c_i) = \prod_{l=1}^{n} P(w_{lj} \mid c_i) \qquad (2)$$

This assumption is used in the classifier named “Naive Bayes”, which is quite popular in the area of text mining.

3.1.3 Support Vector Machine (SVM):
Support Vector Machine (SVM) is a popular category of Machine Learning Classifiers. It takes its inspiration from Statistical Learning Theory and the Structural Risk Minimization principle [6]. Owing to its strength in dealing with high-dimensional data through the use of kernel functions, it is one of the most widely accepted classifiers in this area. The basic concept of SVM is to separate the classes (i.e. positive and negative) by a maximum-margin hyperplane. Consider a training sample $X = \{x_i, y_i\}$, where $x_i \in R^n$ and $y_i \in \{+1, -1\}$ is the class of the $i$-th training sample. In this research, +1 denotes SPAM (unsolicited emails) and −1 denotes HAM (legitimate emails). The final output of the classifier is determined by

$$y = w \cdot x - b \qquad (3)$$

where $y$ is the final output of the classifier, $w$ is the normal vector, with components corresponding to those of the feature vector $x$, and $b$ is the bias parameter determined by the training procedure. The following optimization problem is used to maximize the separation between the classes:

$$\text{minimize} \quad \frac{1}{2}\lVert w \rVert^2 \qquad (4)$$

$$\text{subject to} \quad y_i (w \cdot x_i - b) \ge 1, \ \forall i \qquad (5)$$

3.1.4 Decision Tree (J48):
J48 is based on the C4.5 algorithm, which is also known as a simple Decision Tree Classifier. It is an open-source Java implementation of C4.5, which uses the concept of entropy to generate a decision tree from the training data. At each node, C4.5 chooses the best feature from the feature subset. The selection is made using the normalised information gain (difference in entropy). A few base cases are required to understand the C4.5 algorithm:
A. If all samples in the list belong to the same class, it creates a leaf node in the decision tree that picks that class.
B. If no feature provides any information gain, it creates a decision node higher up the tree using the expected value of the class.
C. If an instance of a previously unseen class is encountered, it again creates a decision node higher up the tree using the expected value of the class.
Algorithm for C4.5:
1. Check for the above base cases.
2. For every feature $x^i$, compute the normalised information gain obtained by splitting on $x^i$.
3. Let $x_b^i$ be the feature with the highest normalised information gain, and create a decision node that splits on $x_b^i$.
4. Repeat the above on the sub-lists generated by splitting on $x_b^i$.
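Step 2 of the C4.5 procedure can be sketched as follows. The sketch computes the normalised information gain (gain ratio) of each feature on a tiny hand-made sample; the data, the function names and the use of discrete feature values are illustrative assumptions, not the J48 implementation used in the paper.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(values).values())

def gain_ratio(feature_values, labels):
    """Information gain of a split, normalised by the split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the class labels after splitting on the feature.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(feature_values)  # entropy of the partition itself
    return info_gain / split_info if split_info > 0 else 0.0

def best_feature(rows, labels):
    """Return the index of the feature with the highest gain ratio, plus all scores."""
    n_features = len(rows[0])
    scores = [gain_ratio([row[i] for row in rows], labels)
              for i in range(n_features)]
    return max(range(n_features), key=scores.__getitem__), scores

# Hypothetical binary term features for four emails.
rows = [[1, 0], [1, 1], [0, 0], [0, 1]]
labels = ["spam", "spam", "ham", "ham"]
best_index, scores = best_feature(rows, labels)
```

In this toy sample the first feature separates the two classes perfectly while the second carries no information, so the first feature would be chosen for the decision node.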
3.1.5 Random Forest (RF):
A Random Forest classifier is based on the ensemble-of-classifiers mechanism. This method combines the individual decisions of various classifiers to obtain improved classification results. Accordingly, RF combines the decisions of various decision trees to obtain the best classification output. The diversified member trees can be generated by two procedures: the first is bagging, which modifies the data samples, and the other is choosing random feature inputs from the feature subset.
Algorithm for RF
Given: $n_T$ - number of training examples, $x^i$ - number of all features, $x^e$ - number of features selected for ensembles, $m^i$ - number of all ensemble members.
Creation of Random Forest (RF) for $m^i$ trees:
1 For each of the $m^i$ iterations, do:
2 Bagging: Produce a sample of size $n_T$ with replacement from the training data.
3 Random Feature Selection: Grow a decision tree without pruning. At every step, choose the best feature by considering only $x^e$ randomly selected features and computing the Gini index.
Classification:
4 Apply the test set to each of the $m^i$ decision trees, starting from the root node. Allocate each instance to a specific class according to the leaf node it reaches. Combine the individual decisions of the members by voting to produce the final classification result.

3.2 Feature Subset Search Methods
These methods use the feedback concept and a machine learning algorithm to select the best feature subset. A number of studies in the literature report searching the feature space for the best subset of features to use with machine learning algorithms [19, 11]. In this research, Genetic search, Greedy Stepwise search, Best First search and Rank search are considered.

3.2.1 Genetic Search:
Genetic search is based on the Darwinian theory of survival of the fittest. This method simulates the evolutionary processes occurring in nature with the help of three fundamental genetic operators (Selection, Crossover and Mutation) applied to chromosomes that represent the features. The Selection operator selects the fittest individuals from the current population for reproduction. Reproduction uses two operators, Crossover and Mutation, on the parent genes to generate novel solutions.
Basic steps during GA feature search:
1 Produce an arbitrary population of n chromosomes (feature subsets).
2 Evaluate the fitness of each chromosome.
3 Iterate until the required number N of chromosomes is obtained:
I. Selection: pick two chromosomes.
II. Crossover: combine the properties of the parent chromosomes to generate offspring.
III. Mutation: mutate the offspring with a predefined mutation probability.
IV. Fitness: calculate the fitness of the mutated offspring.
V. Update: place the mutated offspring in the population.
VI. Evaluation: if the fitness is satisfactory:
a. Keep this offspring.
b. Produce a new population of chromosomes and calculate new offspring.
4 Return: the N chromosomes found (feature wrapper set).
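The GA feature search steps can be sketched as a small steady-state loop. This is only an illustration under stated assumptions: chromosomes are bit masks over the features, one offspring is produced per iteration, and `toy_fitness` is a hypothetical stand-in for the classifier-based fitness used in the paper.

```python
import random

def genetic_feature_search(n_features, fitness, pop_size=20, generations=30,
                           mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    # Step 1: random population of bit-mask chromosomes (1 = feature kept).
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Step 2: rank chromosomes by fitness, best first.
        population.sort(key=fitness, reverse=True)
        # Selection: pick two distinct parents from the fitter half.
        parent_a, parent_b = rng.sample(population[:pop_size // 2], 2)
        # Crossover: single cut point, tail taken from the second parent.
        cut = rng.randint(1, n_features - 1)
        child = parent_a[:cut] + parent_b[cut:]
        # Mutation: flip each bit with a predefined probability.
        child = [1 - g if rng.random() < mutation_rate else g for g in child]
        # Fitness and update: the child replaces the weakest chromosome if fitter.
        if fitness(child) > fitness(population[-1]):
            population[-1] = child
    return max(population, key=fitness)

def toy_fitness(mask):
    """Hypothetical fitness: reward features 0 and 3, penalise subset size."""
    return mask[0] + mask[3] - 0.1 * sum(mask)

best_mask = genetic_feature_search(n_features=6, fitness=toy_fitness)
```

Replacing only the weakest chromosome keeps the best fitness in the population from ever decreasing, which is one simple way to realise the update step; other replacement schemes are equally possible.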
3.2.2 Greedy Stepwise Search:
This method works as an iterative process in which the features are evaluated at each step. The single most informative feature is then selected and added to the model. Evaluation is performed with the help of stepwise regression. Selection can be done by three different processes: Forward selection (adding valuable features), Backward selection (removing the worst features) and Mixed selection (forward and backward simultaneously). Some criterion is used to terminate the feature selection process, such as a P-value measure that indicates whether all selected features have been added to the model, or whether none of the remaining features adds value. Let $f_s$ be the feature set carried through the search process and $f_e$ a feature under evaluation with respect to its fitness. The best feature wrapper set $f_b^*$ is then:

$$f_b^* = \arg\max_{f_e \notin f_s} \operatorname{fit}(f_s \cup \{f_e\}) \qquad (6)$$
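Forward selection driven by Equation (6) can be sketched as below. The fitness function here is a hypothetical additive score with a size penalty, standing in for the stepwise-regression evaluation described above; the feature names and weights are invented for illustration.

```python
def greedy_forward_selection(all_features, fit, max_features=None):
    """Repeatedly add the single feature that most improves fit(selected)."""
    selected = []
    best_score = fit(selected)
    remaining = list(all_features)
    while remaining and (max_features is None or len(selected) < max_features):
        # Equation (6): choose f_e outside f_s that maximises fit(f_s + {f_e}).
        candidate = max(remaining, key=lambda f: fit(selected + [f]))
        candidate_score = fit(selected + [candidate])
        if candidate_score <= best_score:
            break  # no remaining feature adds value, so stop
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score

# Hypothetical per-term usefulness and a small penalty per selected feature.
weights = {"free": 0.4, "viagra": 0.3, "meeting": 0.05, "the": 0.0}

def toy_fit(subset):
    return sum(weights[f] for f in subset) - 0.1 * len(subset)

chosen, score = greedy_forward_selection(weights, toy_fit)
```

With these invented weights the search adds "free", then "viagra", and stops because neither remaining term raises the score.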
3.2.3 Best First Search:
This search method begins with an empty set of features and then adds the best features to expand the set. The best subset (the one with the highest evaluation) is selected and extended by adding the single best feature. If the extended subset does not improve the results, the search moves to the next best subset and continues from there. This process repeats until the search is terminated, after which the best subset found is returned.

3.2.4 Rank Search:
This search method uses a feature evaluator (such as a correlation-based feature evaluator) to select the most informative features. Once a feature evaluator algorithm has been specified, a forward selection method is used to rank the features.
4. EXPERIMENTS AND EVALUATION
4.1 Data Sets:
This study uses three datasets taken from three different sources. The main analysis is done with the “Enron email” dataset; the “SpamAssassin” and “LingSpam” datasets are then used to validate the results obtained from the first dataset. The datasets are described below.

4.1.1 Enron email dataset:
Of the six existing versions of the Enron email dataset, versions 5 and 6 were selected, and 3000 legitimate (Ham) and 3000 unsolicited (Spam) files were drawn from them by random sampling. These versions were chosen because of the complexities embedded in their Spam files.

4.1.2 SpamAssassin:
The second dataset is “SpamAssassin”. It contains older as well as recent unsolicited emails (Spam) obtained from non-spam-trap sources. From the whole Spam set, 2350 Spam email files were taken for this research. The dataset also contains easy (simple to identify) and hard (with embedded complexities) legitimate (Ham) files. To keep the same proportion, easy and hard emails were mixed to obtain 2350 Ham email files.
4.1.3 LingSpam Dataset:
The third dataset of this study is taken from the LingSpam corpus, which comes in four different versions of the email files: bare (lemmatiser disabled, stop-list disabled), lemm (lemmatiser enabled, stop-list disabled), lemm_stop (lemmatiser enabled, stop-list enabled) and stop (lemmatiser disabled, stop-list enabled). The dataset used here includes 478 spam (unsolicited) email files and an equal number of 478 ham (legitimate) email files taken from all versions. The legitimate emails were generated by randomly downloading digests from the archives, separating their messages and removing text added by the list’s server. In the spam files, attachments, HTML tags and duplicate spam messages were not included.

4.2 Classification processes description
4.2.1 Pre-Processing:
An email file can be represented as a feature vector whose component $a_{ki}$ is the weight of word $i$ in document $k$ [22]. The email data files described above go through a feature extraction process that obtains the hidden information (usually the words) and generates a Term-Document Matrix (TDM). This is a binary matrix in which 1 indicates the presence of a word in the corresponding document and 0 its absence. The matrix is high-dimensional and sparse because a large number of documents take part in the classification process. This problem is handled by the “Dimensionality reduction” process.

4.2.2 Dimensionality reduction:
This process is carried out before classification and refers to techniques that generate new features as combinations of the original features in order to reduce the dimension of the original dataset. Dimensions can be reduced with “Feature selection” or “Feature extraction”, with “Stop word” elimination (removing terms that carry no information, such as pronouns, prepositions and conjunctions) [22] and with “Lemmatisation” (grouping terms that carry the same information, such as ‘Combine’, ‘Combined’ and ‘Combining’).

4.2.3 Feature Extraction process:
In this step the Spam and Ham files are used to extract and build the associated feature dictionary. This is done by the String-to-Word-Vector conversion process, which also incorporates stop word removal and lemmatisation. The resulting large, sparse matrix is further evaluated by feature selection and search techniques to obtain a minimum number of the most informative features.

4.2.4 Feature selection:
The feature selection technique is employed after stop word removal and lemmatisation. It helps to find the most informative terms in the complete set of terms. For the evaluation of classifiers, representing documents with a few good features (i.e. terms) has been shown to be effective. In this study, Genetic, Greedy Stepwise, Best First and Rank feature search techniques were used. Depending on the dimensionality of the original dataset, different numbers of good features were selected with these techniques for each of the three datasets and then used for testing the classifiers.

4.2.5 Classifiers:
This study used JAVA and MATLAB environments on the Windows 7 operating system for testing the classifiers. Six classifiers (a Genetic Algorithm based Classifier, Bayesian, Naive Bayes, Support Vector Machine (SVM), J48 and Random Forest) were tested on the most informative features selected by the different feature subset selection methods from the three datasets mentioned above.

4.2.6 Spam Classification:
The selected features are then used in the final classification process. The Ham and Spam files are split at random; in this research, 66% of the files are used for training and 34% for testing. The chosen classifier is trained on the training files with the selected features and then evaluated on the test files to produce the classification results.
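The binary Term-Document Matrix of Section 4.2.1 and the random 66%/34% split of Section 4.2.6 can be sketched as follows. The sketch tokenises on whitespace only and omits stop word removal and lemmatisation; the example documents, function names and seed are assumptions made for illustration.

```python
import random

def build_binary_tdm(documents):
    """Return (vocabulary, matrix); matrix[k][i] is 1 if word i occurs in document k."""
    tokenised = [set(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*tokenised))
    matrix = [[1 if word in tokens else 0 for word in vocabulary]
              for tokens in tokenised]
    return vocabulary, matrix

def split_train_test(rows, labels, train_fraction=0.66, seed=0):
    """Shuffle the file indices and cut them into training and test parts."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = round(train_fraction * len(indices))
    train, test = indices[:cut], indices[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])

# Hypothetical miniature corpus.
documents = ["free money now", "meeting at noon", "free meeting"]
labels = ["spam", "ham", "spam"]
vocabulary, matrix = build_binary_tdm(documents)
train_rows, train_labels, test_rows, test_labels = split_train_test(matrix, labels)
```

Even in this three-document example most matrix entries are 0, which hints at why the full matrix over thousands of emails is sparse and benefits from dimensionality reduction.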
4.2.7 Evaluation:
This study employs a number of performance measures for evaluation and analysis. A simple measure for testing classifiers is the classification Accuracy, defined as the percentage of correctly classified emails. The weakness of this measure is that it fails to distinguish between false positives and false negatives. For a more accurate assessment, the False Positive rate is calculated separately. The F-value, defined as the harmonic mean of Precision (the fraction of emails assigned to a class that truly belong to it) and Recall (the fraction of emails of a class that are correctly assigned to it), is another measure used for evaluation and analysis in this study.

Table 3. Accuracy and F-value (in percentage) of classifiers tested on SpamAssassin dataset

| Search method | Measure | Genetic | Bayesian | NB | SVM | J48 | RF |
|---|---|---|---|---|---|---|---|
| Genetic Search | Acc | 95.2 | 91.9 | 91.2 | 96.2 | 95.7 | 96.5 |
| Genetic Search | F-Value | 95.2 | 91.9 | 91.2 | 96.2 | 95.8 | 96.5 |
| Greedy Search | Acc | 96.4 | 97.1 | 96.6 | 97.8 | 97.9 | 98.4 |
| Greedy Search | F-Value | 96.4 | 97.1 | 96.7 | 97.8 | 97.9 | 98.4 |
| Best First Search | Acc | 95 | 92.8 | 93.1 | 97.9 | 96.3 | 98.2 |
| Best First Search | F-Value | 95.1 | 92.8 | 93.2 | 97.9 | 96.4 | 98.2 |
| Rank Search | Acc | 95.4 | 92.2 | 94.4 | 97.5 | 96.4 | 97.6 |
| Rank Search | F-Value | 95.5 | 92.2 | 94.5 | 97.5 | 96.4 | 97.6 |
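Table 4. Accuracy and F-value (in percentage) of classifiers tested on LingSpam dataset

| Search method | Measure | Genetic | Bayesian | NB | SVM | J48 | RF |
|---|---|---|---|---|---|---|---|
| Genetic Search | Acc | 89.5 | 89.5 | 89.7 | 92 | 89.4 | 90.1 |
| Genetic Search | F-Value | 89.5 | 89.5 | 89.8 | 92.1 | 89.5 | 90.1 |
| Greedy Search | Acc | 93.2 | 97.7 | 97.8 | 97.8 | 95.7 | 96.5 |
| Greedy Search | F-Value | 93.3 | 97.8 | 97.8 | 97.8 | 95.7 | 96.6 |
| Best First Search | Acc | 93.1 | 97.8 | 97.1 | 97.5 | 96.5 | 96.6 |
| Best First Search | F-Value | 93.1 | 97.8 | 97.2 | 97.5 | 96.6 | 96.6 |
| Rank Search | Acc | 93.2 | 96.2 | 96 | 96.9 | 94.2 | 95.1 |
| Rank Search | F-Value | 93.2 | 96.3 | 96.1 | 97 | 94.2 | 95.1 |

Table 1. Performance Instruments

| Instrument | Related formula |
|---|---|
| Accuracy | $\text{Accuracy} = \dfrac{N_{Ham \to c} + N_{Spam \to c}}{N_{Ham \to c} + N_{Ham \to m} + N_{Spam \to c} + N_{Spam \to m}}$ |
| F-Value | $F_{value}^{H,S} = \dfrac{2 \cdot \text{Precision}^{H,S} \cdot \text{Recall}^{H,S}}{\text{Precision}^{H,S} + \text{Recall}^{H,S}}$ |
| False Positive Rate | $FP_{rate} = \dfrac{N_{Ham \to m}}{N_{Ham \to m} + N_{Ham \to c}}$ |

Table 1 shows the formulae of the performance measures, where $N_{Ham \to c}$ denotes the total number of correctly classified Ham emails, $N_{Ham \to m}$ denotes the number of misclassified Ham emails, $N_{Spam \to c}$ is the number of correctly classified Spam emails and $N_{Spam \to m}$ denotes the total number of misclassified Spam emails.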
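The measures in Table 1 can be computed directly from the four counts. The sketch below uses hypothetical counts and computes Precision and Recall for the Spam class only, which is an assumption: the H,S superscript in Table 1 suggests the paper evaluates the F-value for both classes.

```python
def evaluate(ham_correct, ham_missed, spam_correct, spam_missed):
    """Return (accuracy, f_value, fp_rate) from the four classification counts."""
    total = ham_correct + ham_missed + spam_correct + spam_missed
    accuracy = (ham_correct + spam_correct) / total
    # False positive rate: share of Ham emails wrongly labelled as Spam.
    fp_rate = ham_missed / (ham_missed + ham_correct)
    # Precision and recall for the Spam class (assumed choice of class).
    precision = spam_correct / (spam_correct + ham_missed)
    recall = spam_correct / (spam_correct + spam_missed)
    f_value = 2 * precision * recall / (precision + recall)
    return accuracy, f_value, fp_rate

# Hypothetical test set: 100 Ham and 100 Spam emails.
accuracy, f_value, fp_rate = evaluate(ham_correct=95, ham_missed=5,
                                      spam_correct=90, spam_missed=10)
```

For these counts the accuracy is 185/200 = 0.925 and the False Positive rate is 5/100 = 0.05, which shows how a classifier can look accurate overall while still losing a noticeable share of legitimate mail.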
5. COMPARATIVE ANALYSIS
This section presents the comparative analysis of the Machine Learning Classifiers, which were tested with different numbers of the most informative features. Percentage Accuracy, F-Value and False Positive Rate were the measures used for analysis. For clarity, the analysis is presented in three segments. The first segment deals with the analysis of the Machine Learning Classifiers, the second analyses the feature selection methods, and the last uses the False Positive rates to evaluate the accuracy of classification from a different perspective.

Table 2. Accuracy and F-value (in percentage) of classifiers tested on Enron dataset

| Search method | Measure | Genetic | Bayesian | NB | SVM | J48 | RF |
|---|---|---|---|---|---|---|---|
| Genetic Search | Acc | 80.4 | 85.6 | 84.8 | 87.1 | 86 | 86.6 |
| Genetic Search | F-Value | 80.4 | 85.6 | 84.8 | 87.1 | 86.1 | 86.6 |
| Greedy Search | Acc | 87.6 | 93 | 94 | 94.2 | 92.1 | 93.8 |
| Greedy Search | F-Value | 87.5 | 93.1 | 93.9 | 94.3 | 92.2 | 93.9 |
| Best First Search | Acc | 80.7 | 92.1 | 91.2 | 94.1 | 92.6 | 94 |
| Best First Search | F-Value | 80.7 | 92.2 | 91.2 | 94.2 | 92.6 | 94.1 |
| Rank Search | Acc | 80.8 | 92 | 91.4 | 93.8 | 91.4 | 93.7 |
| Rank Search | F-Value | 80.8 | 92.1 | 91.5 | 93.8 | 91.4 | 93.8 |
5.1 Analysis of Machine Learning Classifiers:
The results of the classifiers tested on the Enron dataset are shown in Table 2 and Figure 1, which demonstrate that the Support Vector Machine is the most accurate of the tested classifiers, with accuracy between 87.1% and 94.2%. The Random Forest classifier (accuracy 86.6% to 94.1%) is the second best, with results close to those of SVM. The Genetic Classifier is the worst in terms of accuracy, which varies between 80.4% and 87.6%. The Bayesian and Naive Bayes classifiers were the third and fourth best respectively, with accuracy between 85.6% and 93.1% for Bayesian and between 84.8% and 93.9% for Naive Bayes.
Testing the same classifiers on the SpamAssassin dataset confirmed the results obtained from the Enron dataset. The results of the experiments on the SpamAssassin dataset are shown in Table 3 and Figure 2. Here again, SVM and Random Forest prove to be excellent classifiers, with accuracy of 96.2% to 97.8% for SVM and 96.5% to 98.4% for Random Forest. The Genetic algorithm (accuracy 95% to 96.4%) is again the weakest of all.
The same test on LingSpam also validated the above results. The LingSpam results (Table 4 and Figure 3) show that SVM (accuracy 92% to 97.8%) is the best classifier of all, whereas the results of Random Forest (accuracy 90.1% to 96.6%) are close to the best. The Genetic classifier again shows poor performance.
Figure 1. Accuracy and F-value for Enron Dataset (bar charts of Accuracy (in %) and F-value (in %) for each Machine Learning Classifier; series: Genetic Search, Greedy Search, Best First Search, Rank Search)

5.2 Analysis of Feature Selection Methods:
As discussed in the preceding sections, four feature subset search methods (Genetic, Greedy Stepwise, Best First and Rank search) were used to obtain the most informative feature subsets. Initially, 48 best features out of 1500 created features for the Enron dataset, 35 best features out of 1414 features for the SpamAssassin dataset and 50 best features out of 1658 features for the LingSpam dataset were selected for testing the classifiers. The results presented in Tables 2, 3 and 4 and Figures 1, 2 and 3 demonstrate that the Greedy Stepwise search method performed best on all three datasets, with accuracy between 87.6% and 95.2% for the Enron dataset, 96.4% and 97.8% for the SpamAssassin dataset and 93.2% and 97.8% for the LingSpam dataset. The Best First search method was the second best, with accuracy between 80.7% and 94.1% for the Enron dataset, 92.8% and 98.2% for the SpamAssassin dataset and 93.1% and 97.8% for the LingSpam dataset. However, the features selected by Genetic search gave poorer results: 80.4% to 87.1% for the Enron dataset, 91.2% to 96.2% for the SpamAssassin dataset and 89.5% to 90.1% for the LingSpam dataset.

Figure 2. Accuracy and F-value for SpamAssassin Dataset (bar charts of Accuracy (in %) and F-value (in %) for each Machine Learning Classifier; series: Genetic Search, Greedy Search, Best First Search, Rank Search)

Figure 3. Accuracy and F-value for LingSpam Dataset (bar charts of Accuracy (in %) and F-value (in %) for each Machine Learning Classifier; series: Genetic Search, Greedy Search, Best First Search, Rank Search)

5.3 Analysis with False Positive Rate
Although some machine learning classifiers show good overall classification accuracy, the possibility of misclassifying legitimate emails may still be high. Legitimate emails are considered important, and if they are misclassified as Spam the consequences may be serious. This problem can be addressed by considering the False Positive rate (FP rate), which takes into account how many legitimate emails are misclassified. From Table 5 and Figure 4, it is clear that SVM and the Bayesian Classifier perform better in terms of the FP rate. For these classifiers the FP rate is low on all three datasets (1.8% to 7.3% for the Bayesian classifier and 2.6% to 7.3% for SVM on the Enron dataset, 0.1% to 3.6% for Bayesian and 1% to 2.1% for SVM on the SpamAssassin dataset, and 0% to 18.1% for Bayesian and 1% to 7.3% for SVM on the LingSpam dataset). These results cover the Genetic, Greedy Stepwise, Best First and Rank search methods and indicate that using the Greedy Stepwise search method for feature selection leads to a lower FP rate.
Considering accuracy and False Positive rate together, the SVM classifier with the Greedy subset search method is identified as an excellent combination.

Table 5. False Positive Rate (in percentage) of the Classifiers

| Search method | Dataset | Genetic | Bayesian | NB | SVM | J48 | RF |
|---|---|---|---|---|---|---|---|
| Genetic Search | Enron | 22.6 | 7.3 | 10.6 | 7.3 | 8.8 | 8.6 |
| Genetic Search | SpamAssassin | 3.5 | 0.1 | 0.4 | 2.1 | 3.6 | 3.1 |
| Genetic Search | LingSpam | 13.1 | 18.1 | 16.9 | 0.6 | 1.3 | 6.3 |
| Greedy Search | Enron | 22.5 | 1.8 | 4.4 | 2.6 | 2.7 | 2.7 |
| Greedy Search | SpamAssassin | 2.4 | 3.6 | 4.5 | 1 | 2.2 | 1.7 |
| Greedy Search | LingSpam | 10 | 0 | 0 | 1.9 | 3.8 | 3.8 |
| Best First Search | Enron | 36.9 | 2.5 | 3.7 | 3.5 | 3.7 | 3.6 |
| Best First Search | SpamAssassin | 4.5 | 0.7 | 1.1 | 1.9 | 3.2 | 1.5 |
| Best First Search | LingSpam | 11.9 | 0 | 0.6 | 2.5 | 3.1 | 1.9 |
| Rank Search | Enron | 25.7 | 1.1 | 2.2 | 2.8 | 4.6 | 3.3 |
| Rank Search | SpamAssassin | 4.1 | 0.1 | 0.6 | 2.4 | 3 | 2.2 |
| Rank Search | LingSpam | 12.5 | 0 | 0.6 | 1.9 | 3.1 | 5 |

Figure 4. False Positive Rate of Classifiers (False Positive Rate (in %) for each Machine Learning Classifier; one series per dataset and search method)
6. CONCLUSION
Achieving good classification accuracy with a minimum number of features has always been one of the major research objectives in text classification. This study presents a comparative analysis of four feature subset search methods (Genetic, Greedy Stepwise, Best First and Rank search) and their interactions with several Machine Learning Classifiers, in order to find the best pair of feature subset selector and Machine Learning Classifier for classification performance. The purpose of the study has been achieved. The results lead to the following conclusions: first, among the Machine Learning Classifiers examined, SVM showed the best classification accuracy and also the lowest False Positive rate, with Random Forest second best; second, Greedy Stepwise Search was found to be the best feature subset selector; third, the Greedy Stepwise subset selector with the SVM classifier was found to be an excellent pair for classification. In the future, the same study can be replicated in other applications and tested on different datasets. Other Machine Learning Classifiers and feature subset evaluation techniques can also be examined to find suitable interactions between them for achieving the best results in various text classification applications.
7.
REFERENCES
[1] Whittaker, S., Bellotti, V., & Moody, P. (2005). Introduction to this special issue on revisiting and reinventing e-mail. Human–Computer Interaction, 20(1-2), 1-9.
[2] C. C. Lai, "An empirical study of three machine learning methods for spam filtering," Knowledge-Based Systems, Volume 20, Issue 3, pp. 249-254, April 2007.
[3] Trivedi, S. K., & Dey, S. (2013). Effect of Various Kernels and Feature Selection Methods on SVM Performance for Detecting Email Spams. International Journal of Computer Applications, 66(21).
[4] J. Goodman, G.V. Cormack, and D. Heckerman, "Spam and the ongoing battle for the inbox," Communications of the ACM, vol. 50, issue 2, pp. 24-33, February 2007.
[5] M. Woitaszek, M. Shaaban, and R. Czernikowski, "Identifying Junk Electronic Mail in Microsoft Outlook with a Support Vector Machine," Proceedings, 2003 Symposium on Applications and the Internet, pp. 166-169, 27-31 Jan. 2003.
[6] V.N. Vapnik, "An Overview of Statistical Learning Theory," IEEE Trans. on Neural Networks, Vol. 10, No. 5, pp. 988-998, 1999.
[7] Androutsopoulos, I., Koutsias, J., Chandrinos, K.V., Paliouras, G., and Spyropoulos, C.D. (2000a). An Evaluation of Naive Bayesian Anti-Spam Filtering. Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning, Barcelona, Spain, pages 9–17.
[8] Metsis, V., Androutsopoulos, I., and Paliouras, G. (2006). Spam Filtering with Naive Bayes–Which Naive Bayes? Third Conference on Email and Anti-Spam (CEAS), pages 125–134.
[9] Chen, J. & Chen, Z. (2008). Extended Bayesian information criterion for model selection with large model space. Biometrika, 94, 759-771.
[10] W.A. Awad and S.M. ELseuofi, "Machine Learning Methods for Spam Classification," International Journal of Computer Science & Information Technology (IJCSIT), Vol. 3, No. 1, pp. 173-184, Feb 2011.
[11] Trivedi, S. K., & Dey, S. (2013). Interplay between Probabilistic Classifiers and Boosting Algorithms for Detecting Complex Unsolicited Emails. Journal of Advances in Computer Networks, 1(2).
[12] Trivedi, S. K., & Dey, S. (2013, October). Effect of feature selection methods on machine learning classifiers for detecting email spams. In Proceedings of the 2013 Research in Adaptive and Convergent Systems (pp. 35-40). ACM.
[13] Trivedi, S. K., & Dey, S. (2013). An Enhanced Genetic Programming Approach for Detecting Unsolicited Emails. In Proc. 2013 IEEE 16th International Conference on Computational Science and Engineering, Sydney, Australia. IEEE Computer Society. DOI 10.1109/CSE.2013.171.
[14] D. Sculley and G. M. Wachman, "Relaxed Online SVMs for Spam Filtering," SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 415-422, ISBN 978-1-59593-597-7, July 2007.
[15] David D. Lewis. 1998. Naive (Bayes) at forty: The independence assumption in information retrieval. Proceedings of ECML-98, 10th European Conference on Machine Learning, pages 4-15.
[16] Drucker, H., Wu, D., & Vapnik, V. N. (1999). Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10(5), 1048–1054.
[17] Liu, Bo, Bob McKay, and Hussein A. Abbass. "Improving genetic classifiers with a boosting algorithm." In Evolutionary Computation, 2003. CEC'03. The 2003 Congress on, vol. 4, pp. 2596-2602. IEEE, 2003.
[18] Jiang Hua Li and Wang Ping (2009). The e-mail filtering system based on improved genetic algorithm. Proceedings of the 2009 International Workshop on Information Security and Application (IWISA 2009), ISBN 978-952-5726-06-0.
[19] Xu, Z., Weinberger, K., & Chapelle, O. (2012). The greedy miser: Learning under test-time budgets. arXiv preprint arXiv:1206.6451.
[20] Holland, J. H., "Adaptation in Natural and Artificial Systems," University of Michigan Press, Ann Arbor, MI, 1975.
[21] Haleh Vafaie and Ibrahim F. Imam, 1994. Feature Selection Methods: Genetic Algorithms vs. Greedy-like Search. Proceedings of the 3rd International Fuzzy Systems and Intelligent Control Conference.
[22] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features, Proceedings of ECML '98