A Hybrid Gini PSO-SVM Feature Selection Based on Taguchi Method: An Evaluation on Email Filtering

Noormadinah Allias
Universiti Kuala Lumpur, Department of MIIT, 50250 Kuala Lumpur, Malaysia
+60125594249
[email protected]

Megat Norulazmi Megat Mohamed Noor
Universiti Kuala Lumpur, Department of MIIT, 50250 Kuala Lumpur, Malaysia
+60194574644
[email protected]
ABSTRACT
The flooding of spam email into email servers is an arms-race issue, and filtering spam from email messages remains an ongoing research problem. Among the methods proposed, those based on machine learning algorithms have achieved the most success in spam filtering. Unfortunately, in machine learning the high dimensionality of the feature space after preprocessing is a major hurdle for the classifier, and an excessive number of features can also degrade the classification results. In this paper we therefore propose a two-stage feature selection method based on the Taguchi method to reduce the high dimensionality of the feature space and obtain good classification results for spam filtering. First, we apply Gini Index feature selection to reduce the number of terms; then we apply the Taguchi method to assist Gini Index and PSO-SVM in selecting the best combination of parameter settings. The method is trained and tested on the Ling-Spam dataset, and its performance is compared with traditional feature selection methods and with recent work by other researchers. The results show that the proposed method produces good precision and recall with the lowest number of features.
Categories and Subject Descriptors D.3.3 [Programming Languages]: Machine Learning, Natural Language Processing
General Terms Algorithms, Performance
Keywords Feature selection, Spam filter, Particle Swarm Optimization, Taguchi method, Orthogonal array
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. IMCOM (ICUIMC)’14, January 9–11, 2014, Siem Reap, Cambodia. Copyright 2014 ACM 978-1-4503-2644-5 …$15.00.
Mohd. Nazri Ismail National Defence University of Malaysia Faculty of Defence Science and Technology Kem Perdana Sungai Besi 57000 Kuala Lumpur +60132617009
[email protected]
1. INTRODUCTION
Nowadays, the internet has changed the way we communicate with each other. Even though users are separated by great distances, information can be exchanged within a few seconds. With internet-based communication, including phone conversations, short message service (SMS), teleconferencing, video calls, and email, communication is faster than with the traditional methods we had before, benefiting users in terms of both time and cost even when they are thousands of miles apart. However, the convenience that email offers is often misused by irresponsible people and companies, who send advertisements as a way to gain profit. They constantly send large volumes of advertisements without users' consent; this type of unwanted advertising is known as spam. Spam email has become a serious and growing issue: it not only consumes users' time in removing the messages, but also causes problems such as filling mailbox space, wasting network bandwidth, and even leading to financial losses when important emails are unintentionally deleted. According to a report published by McAfee in March 2009, the cost of lost productivity is approximately $0.50 per user per day, based on a user spending 30 seconds to deal with two spam messages each day while the spam filter operates at 95 percent accuracy; the productivity loss per employee per year due to spam is therefore approximately $182.50 [1]. A recent study by the IT security company Kaspersky Lab found that more than 70% of emails sent in Q2 2013 were spam, an increase of more than 4% over the Q1 total [2]. Due to the growing share of spam in email traffic, many methods have been proposed to filter spam email. Among them, methods based on machine learning algorithms have achieved the most success.
However, in machine learning, filtering spam email is treated as a text categorization problem, and the high dimensionality of the feature space extracted from email messages is a major hurdle: thousands of distinct words can be extracted from the emails, far exceeding the number of emails themselves. To reduce the dimensionality of the feature space, methods based on feature selection and feature extraction have been introduced.
Based on the literature, no prior work has applied the Taguchi method to feature selection to assist Gini Index and PSO-SVM in finding the best parameter setting in the spam domain. Our method is evaluated on a public email dataset and compared with work by previous researchers and with traditional feature selection methods, namely Gini Index and Information Gain (IG). The remainder of this paper is organized as follows: Section 2 presents feature selection strategies. The methodology is described in Section 3. Section 4 presents the performance measures used for comparing the achieved results. Section 5 describes the experimental results. Finally, Section 6 discusses conclusions and future work.
2. FEATURE SELECTION STRATEGIES
A variety of feature selection methods have been developed to tackle the issue of high dimensionality. The major challenge of feature selection is to extract a set of features, as small as possible, that accurately classifies the learning examples. Gini Index and Information Gain (IG) are among the well-known filter-type feature selection methods used as benchmarks in the spam domain. Gini Index is an improved version of the measure originally used to find the best split of attributes in decision trees, and its computation is generally simpler than that of other methods [3]. Information Gain (IG) is an information-theoretic measure, often used in machine learning, that is also widely applied in spam filtering; it measures the amount of information a feature offers to the classification system. The larger the Information Gain (IG) value, the more significant the feature [4]. Besides Gini Index and Information Gain (IG), evolutionary feature selection methods such as genetic algorithms and particle swarm optimization have also been proposed. Wang, Liu, et al. [5] proposed a fuzzy adaptive multi-population genetic algorithm (FAMGA) to reduce the high dimensionality of features and automatically find the best feature subset for classifying spam email. FAMGA consists of multiple subpopulations, each of which runs independently. The method was tested on the Ling-Spam email dataset, and the experiments showed that it improves the performance of spam filtering and outperforms other feature selection methods. Hao et al. [6] proposed a fuzzy adaptive particle swarm optimization (FAPSO) approach to find an optimal feature subset.
Their method is divided into three stages, namely core feature subset selection, feature subset selection, and spam filtering. The Ling-Spam email dataset was used in the experiments. The numerical results and statistical analysis showed that the approach is capable of finding an optimal feature subset in a large, high-dimensional noisy dataset. In addition, NFSS performed significantly better than the other methods in terms of prediction accuracy with a smaller subset of features. The next section discusses the methodology used in this experiment.
3. METHODOLOGY
All of the experiments were carried out using RapidMiner 5.3.008 on an AMD A6-3420M APU with Radeon HD graphics (1.5 GHz) and 8 GB of RAM. Standard preprocessing is performed on the raw data, consisting of tokenization, case transformation, stop-word removal, and stemming. We used the Ling-Spam [7] email corpus to evaluate the proposed method, with 10-fold stratified cross-validation. The Ling-Spam corpus consists of 2412 legitimate messages and 481 spam messages from the Linguist list.

[Figure 1 is a flowchart in the original: the training and testing datasets each pass through preprocessing (tokenization, case transformation, stop-word removal, stemming, term weighting), then feature ranking and dimension reduction with Gini Index (keeping the top k features), then feature optimization via the Taguchi method, and finally classification by PSO-SVM and Naive Bayes.]

Figure 1. The structure of the proposed feature selection method.

3.1 Pre-processing
Tokenization is used to extract the words in the message body. After tokenization, all words are transformed into lowercase. Then, unnecessary words that occur in many messages, e.g. "is", "a", "the", are eliminated by stop-word removal. The Porter algorithm is used as the stemming method to reduce words to their root form, e.g. "cooking" and "cooked" to "cook". Finally, every document is converted into a vector space model using TF-IDF weighting. After this final stage of preprocessing, feature ranking and dimension reduction are performed in the next stage.

3.2 Feature ranking and dimension reduction
Gini Index is used to score the attributes according to how important they are and to rank them in decreasing order. The attributes with the highest weighting scores are selected, varying from 25 to 125 features. The selected features become the input to PSO-SVM and are mapped to parameter settings using an Orthogonal Array (OA) obtained from the Taguchi method.
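To make the ranking step concrete, the sketch below scores terms with a Gini-Index-style measure and keeps the top k. The scoring formula, Gini(w) = Σc P(w|c)² · P(c|w)², is a common text-classification variant assumed here for illustration; the experiments in this paper use RapidMiner's built-in operators, not this code.

```python
from collections import Counter, defaultdict

def gini_index_scores(docs, labels):
    """Score each term with a Gini-Index-style measure over document
    frequencies: Gini(w) = sum_c P(w|c)^2 * P(c|w)^2. Higher scores
    mean the term is concentrated in one class."""
    classes = set(labels)
    docs_per_class = Counter(labels)           # N_c: documents per class
    term_class_df = defaultdict(Counter)       # term -> class -> doc frequency
    for doc, y in zip(docs, labels):
        for term in set(doc):                  # count each term once per document
            term_class_df[term][y] += 1
    scores = {}
    for term, per_class in term_class_df.items():
        total_df = sum(per_class.values())     # documents containing the term
        s = 0.0
        for c in classes:
            p_w_given_c = per_class[c] / docs_per_class[c]
            p_c_given_w = per_class[c] / total_df
            s += (p_w_given_c ** 2) * (p_c_given_w ** 2)
        scores[term] = s
    return scores

def top_k_features(docs, labels, k):
    """Rank terms in decreasing score order and keep the top k."""
    scores = gini_index_scores(docs, labels)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

On a toy corpus, a term that appears only in spam documents scores 1.0, while a term split across classes scores lower, so the top-k list favors class-discriminating terms.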
3.3 Feature Optimization
No prior work has taken advantage of the filter and wrapper types of feature selection together in the spam domain, except the work in [8], which implemented such a combination on a handwritten Chinese character dataset and several other typical datasets. In our method, we hybridize the filter and wrapper types of feature selection and combine them with the Taguchi method.
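A minimal sketch of the wrapper idea follows, assuming a generic binary PSO with a sigmoid transfer function; the `fitness` callback stands in for the SVM accuracy used in the paper, and all constants and names here are illustrative, not the paper's settings.

```python
import math
import random

def binary_pso(fitness, n_feat, n_particles=12, iters=40,
               w=0.7, c1=1.5, c2=1.5, seed=1):
    """Binary-PSO wrapper sketch: each particle is a 0/1 mask over the
    top-k filtered features; `fitness(mask)` returns the score to maximize
    (in the paper, classifier accuracy)."""
    rng = random.Random(seed)
    swarm = [[rng.randint(0, 1) for _ in range(n_feat)] for _ in range(n_particles)]
    vel = [[0.0] * n_feat for _ in range(n_particles)]
    pbest = [m[:] for m in swarm]              # personal best masks
    pfit = [fitness(m) for m in swarm]
    g = max(range(n_particles), key=lambda i: pfit[i])
    gbest, gfit = pbest[g][:], pfit[g]         # global best mask and score
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(n_feat):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - swarm[i][d])
                             + c2 * r2 * (gbest[d] - swarm[i][d]))
                # sigmoid turns the real-valued velocity into a bit probability
                swarm[i][d] = 1 if rng.random() < 1 / (1 + math.exp(-vel[i][d])) else 0
            fit = fitness(swarm[i])
            if fit > pfit[i]:
                pfit[i], pbest[i] = fit, swarm[i][:]
                if fit > gfit:
                    gfit, gbest = fit, swarm[i][:]
    return gbest, gfit
```

For example, a fitness that rewards two informative features and lightly penalizes extras drives the swarm toward a small discriminative subset.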
3.3.1 Particle Swarm Optimization (PSO)
Particle Swarm Optimization was proposed by [9] in 1995. It is a global optimization technique inspired by the social interaction of animals such as birds, fish, and even human beings. It is based on the observation that social sharing of information may provide an evolutionary advantage to the individuals of a population, enhancing their capability to solve complex problems. Instead of using evolutionary operators to manipulate the individuals, as in other evolutionary computation algorithms, each individual in PSO flies through the search space with a velocity that is dynamically adjusted according to its own flying experience and that of its companions. Each individual is treated as a volume-less particle (a point) in the D-dimensional space. The ith particle is represented as Xi = (xi1, xi2, ..., xiD). The best previous position (the position giving the best fitness value) of the ith particle is recorded and represented as Pi = (pi1, pi2, ..., piD). The index of the best particle among all the particles in the population is denoted by g. The velocity (rate of position change) of particle i is represented as Vi = (vi1, vi2, ..., viD). The particles are manipulated according to the following equations:

vid = vid + c1 * rand() * (pid - xid) + c2 * Rand() * (pgd - xid)    (1)

xid = xid + vid    (2)

where c1 and c2 are two positive constants, and rand() and Rand() are two random functions in the range [0, 1]. In PSO, a parameter called the inertia weight is introduced to balance global and local search: a large inertia weight facilitates global search, while a small inertia weight facilitates local search [9]. In this research, PSO is integrated with SVM for better optimization.
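Equations (1) and (2), together with the inertia weight, can be sketched as a minimal continuous PSO minimizer; the parameter values below are illustrative choices, not the settings used in the paper.

```python
import random

def pso_minimize(f, dim, n_particles=20, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize f over R^dim with a plain PSO following Eqs. (1)-(2),
    with the inertia weight w multiplying the previous velocity."""
    rng = random.Random(seed)
    X = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    V = [[0.0] * dim for _ in range(n_particles)]
    P = [x[:] for x in X]                      # personal best positions
    pbest = [f(x) for x in X]                  # personal best values
    g = min(range(n_particles), key=lambda i: pbest[i])
    G, gbest = P[g][:], pbest[g]               # global best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Eq. (1): velocity update with inertia weight
                V[i][d] = (w * V[i][d]
                           + c1 * r1 * (P[i][d] - X[i][d])
                           + c2 * r2 * (G[d] - X[i][d]))
                X[i][d] += V[i][d]             # Eq. (2): position update
            val = f(X[i])
            if val < pbest[i]:
                pbest[i], P[i] = val, X[i][:]
                if val < gbest:
                    gbest, G = val, X[i][:]
    return G, gbest
```

On the sphere function f(x) = Σ x², the swarm converges toward the origin, illustrating how the personal-best and global-best terms pull particles together.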
3.3.2 Support Vector Machine (SVM)
The Support Vector Machine (SVM) is a popular technique for data categorization in the machine learning community. SVM has proven very effective in text categorization because it can handle high-dimensional data by using kernels [10].
3.4 Implementation of the Taguchi method
The Taguchi method was first introduced by Dr. G. Taguchi in 1985. It is a robust design approach that uses many ideas from statistical experimental design to evaluate and implement improvements in products, processes, and equipment [11]. In the Taguchi method, an Orthogonal Array (OA) is used to reduce the number of experiments while still obtaining good experimental results. Selecting the best combination of the top-k attributes and the PSO-SVM parameter settings is a challenging task; thus, drawing on the parameter settings from [12, 13], the Taguchi method is applied to assist PSO-SVM in selecting the best combination of the value of k from Gini Index, the population size, the number of iterations, and the inertia weight.
To implement the Taguchi method, we need to identify the number of control factors and levels. After this identification, all factors and levels are entered into the Minitab statistical software to produce an appropriate Orthogonal Array (OA), which provides the mapping of k values to PSO-SVM parameter settings. Through this method, we only need to run 25 experiments instead of 625.

Table 1. An orthogonal array mapping between attributes and PSO parameter settings

Level   k     Population   Iteration   Inertia
L1      25    1            10          0.9
L2      25    3            20          0.8
L3      25    5            30          0.7
L4      25    7            40          0.6
L5      25    9            50          0.5
L6      50    1            20          0.7
L7      50    3            30          0.6
L8      50    5            40          0.5
L9      50    7            50          0.9
L10     50    9            10          0.8
L11     75    1            30          0.5
L12     75    3            40          0.9
L13     75    5            50          0.8
L14     75    7            10          0.7
L15     75    9            20          0.6
L16     100   1            40          0.8
L17     100   3            50          0.7
L18     100   5            10          0.6
L19     100   7            20          0.5
L20     100   9            30          0.9
L21     125   1            50          0.6
L22     125   3            10          0.5
L23     125   5            20          0.9
L24     125   7            30          0.8
L25     125   9            40          0.7
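For illustration, an L25(5^4) array like Table 1 can be constructed and verified without Minitab. The sketch below uses one standard construction for five-level factors; it is not necessarily the exact array Minitab produced, so the row-to-setting mapping is an assumption for demonstration only.

```python
from itertools import product

# Factor levels taken from Table 1.
LEVELS = {
    "k":          [25, 50, 75, 100, 125],
    "population": [1, 3, 5, 7, 9],
    "iteration":  [10, 20, 30, 40, 50],
    "inertia":    [0.9, 0.8, 0.7, 0.6, 0.5],
}

def l25_oa():
    """One standard L25(5^4) construction for prime level count p = 5:
    index rows by (a, b) and take columns a, b, (a+b) % 5, (a+2b) % 5."""
    return [(a, b, (a + b) % 5, (a + 2 * b) % 5)
            for a, b in product(range(5), repeat=2)]

def is_orthogonal(rows, n_levels=5):
    """Strength-2 check: every ordered level pair occurs equally often
    in every pair of columns."""
    n_cols = len(rows[0])
    expected = len(rows) // n_levels ** 2
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            pairs = [(r[i], r[j]) for r in rows]
            if any(pairs.count(p) != expected
                   for p in product(range(n_levels), repeat=2)):
                return False
    return True

def experiments():
    """Map OA level indices onto the actual factor values."""
    names = list(LEVELS)
    return [{n: LEVELS[n][row[c]] for c, n in enumerate(names)}
            for row in l25_oa()]
```

With four factors at five levels each, a full factorial design would need 5^4 = 625 runs; the orthogonal array covers every pairwise level combination exactly once in only 25 runs.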
4. PERFORMANCE EVALUATION
Precision and recall are usually employed to evaluate the accuracy of spam filtering results.
4.1 Precision
Precision measures how many of the messages classified as spam are truly spam, and thus reflects the amount of legitimate email mistakenly classified as spam. The higher the spam precision, the fewer legitimate emails are mistakenly filtered [14]. It is defined as follows:

P = TP / (TP + FP)    (3)

4.2 Recall
Recall measures the percentage of spam that can be filtered out by an algorithm or model. High spam recall ensures that the filter protects users from spam effectively [14]:

R = TP / (TP + FN)    (4)
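In code, the precision and recall of Equations (3) and (4) reduce to (TP, FP, and FN being true positives, false positives, and false negatives):

```python
def precision_recall(tp, fp, fn):
    """Precision (Eq. 3) and recall (Eq. 4) from raw counts.
    Empty denominators are treated as 0.0 to avoid division by zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For instance, a filter that flags 90 of 100 spam messages and never misclassifies legitimate mail has precision 1.0 and recall 0.9.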
where TP, FP, and FN represent the numbers of true positives, false positives, and false negatives, respectively.

5. EXPERIMENTAL RESULTS
Figure 2 and Figure 3 show comparative results for the proposed method.

Figure 2. Precision and recall rate (%) for FAMGA, NFSS, and Hybrid Gini PSO-SVM with Taguchi.

As shown in Figure 2, our proposed method achieved a higher precision rate than FAMGA and NFSS: our precision is 100.00%, while FAMGA and NFSS achieved only 98.32% and 97.83%, respectively. From this observation, the best parameter setting for Gini Index and PSO-SVM plays an important role in the higher precision result. This shows that in dimensionality reduction, the number of features is not the only factor that influences the classification result; other factors such as population size, inertia, and number of iterations also play important roles and need to be considered. In terms of recall, the proposed method achieved 91.36%, lower than FAMGA and NFSS at 97.88% and 97.15%, but it still gives a competitive result.

Figure 3. Precision and recall rate (%) for IG, Gini Index, and Hybrid Gini PSO-SVM with Taguchi.

Figure 3 shows a comparison with the traditional feature selection methods Information Gain (IG) and Gini Index. As shown, IG and Gini Index achieved a high precision of 100%, as did our Hybrid Gini PSO-SVM with Taguchi; however, our proposed method gives this very competitive result while using only a small number of features. The recall results show that our proposed method is higher than IG and Gini Index, at 91.36%, 82.73%, and 82.27%, respectively.
Table 2. Performance of FAMGA, NFSS, and Hybrid Gini PSO-SVM with Taguchi vs. number of features on the LingSpam corpus

Methods                             No. of features   Precision (%)
FAMGA (2010)                        454               98.32
NFSS (2011)                         483               97.83
Hybrid Gini PSO-SVM with Taguchi    100               100
Table 2 shows the performance of our Hybrid Gini PSO-SVM with Taguchi, FAMGA, and NFSS. Using only 100 features, our precision rate is better than those of FAMGA and NFSS; moreover, our proposed method selected significantly fewer features than the other two methods.

Table 3. Results of IG, Gini Index, and Hybrid Gini PSO-SVM with Taguchi vs. number of features on the LingSpam corpus

Methods                             No. of features   Precision (%)
Information Gain (IG)               125               100
Gini Index                          125               100
Hybrid Gini PSO-SVM with Taguchi    100               100
As Table 3 shows, the Information Gain (IG) and Gini Index feature selection methods give their best results only when the number of features reaches 125, whereas our proposed method needs only 100. This shows that although many feature selection methods exist, IG and Gini Index remain among the most effective for feature reduction, and our method is a good competitor to both of these algorithms.
6. CONCLUSION AND FUTURE WORK
In this study, we proposed a method that combines filter- and wrapper-based feature selection with the Taguchi method. The filter used in this experiment is Gini Index, while PSO-SVM serves as the wrapper. Gini Index is used for feature ranking, and the top k attributes are selected as input to PSO-SVM. The Taguchi method is employed to map the values of k, population size, number of iterations, and inertia weight using an orthogonal array, which helps search for the best combination of parameter settings. We compared the performance of our proposed method in terms of precision and recall with well-known feature selection algorithms and with the methods proposed by [5] and [6] as benchmarks. The results showed that our proposed method achieves higher precision than [5] and [6] with the lowest number of attributes, and better recall than IG and Gini Index. A future extension of this work is to test the proposed method on different datasets and domains to measure its robustness. Another interesting extension of this study would be to combine the Taguchi method with a feature extraction algorithm.
7. ACKNOWLEDGMENTS
The authors are grateful to the reviewers, whose thoughtful comments improved the quality of this paper.
8. REFERENCES
[1] Almeida, T. A., Yamakami, A., et al. 2010. Probabilistic anti-spam filtering with dimensionality reduction. In Proceedings of the 2010 ACM Symposium on Applied Computing. ACM, Sierre, Switzerland.
[2] Gudkova, D. Spam in Q2 2013. Available from: http://www.securelist.com/en/analysis/204792297/Spam_in_Q2_2013.
[3] Uysal, A. K. and Gunal, S. 2012. A novel probabilistic feature selection method for text classification. Knowledge-Based Systems.
[4] Yang, J., Liu, Y., et al. 2011. A new feature selection algorithm based on binomial hypothesis testing for spam filtering. Knowledge-Based Systems 24(6): 904-914.
[5] Wang, G., Liu, Y.-n., et al. 2010. A new fuzzy adaptive multi-population genetic algorithm based spam filtering method. In Information Engineering and Computer Science (ICIECS), 2010 2nd International Conference on. IEEE.
[6] Hao, W., et al. 2011. A novel spam filtering framework based on fuzzy adaptive particle swarm optimization. In Intelligent Computation Technology and Automation (ICICTA), 2011 International Conference on.
[7] Androutsopoulos, I., Koutsias, J., et al. 2000. An evaluation of naive Bayesian anti-spam filtering. arXiv preprint cs/0006013.
[8] Zhang, L.-X., Wang, J.-X., et al. 2003. A novel hybrid feature selection algorithm: using ReliefF estimation for GA-wrapper search. In Machine Learning and Cybernetics, 2003 International Conference on. IEEE.
[9] Kennedy, J. and Eberhart, R. 1995. Particle swarm optimization. In Neural Networks, Proceedings, IEEE International Conference on. IEEE.
[10] Meng, J., Lin, H., et al. 2011. A two-stage feature selection method for text categorization. Computers & Mathematics with Applications 62(7): 2793-2800.
[11] Wei-Chih, H. and Yu, T.-Y. 2009. E-mail spam filtering using support vector machines with selection of kernel function parameters. In Innovative Computing, Information and Control (ICICIC), 2009 Fourth International Conference on. IEEE.
[12] Shi, Y. and Eberhart, R. C. 1999. Empirical study of particle swarm optimization. In Evolutionary Computation, CEC 99, Proceedings of the 1999 Congress on. IEEE.
[13] Unler, A. and Murat, A. 2010. A discrete particle swarm optimization method for feature selection in binary classification problems. European Journal of Operational Research 206(3): 528-539.
[14] Zhu, Y. and Tan, Y. 2011. A local-concentration-based feature extraction approach for spam filtering. IEEE Transactions on Information Forensics and Security 6(2): 486-497.