AN ANTI-SPAM SYSTEM USING ARTIFICIAL NEURAL NETWORKS AND GENETIC ALGORITHMS
ABDUELBASET M. GOWEDER*, TARIK RASHED**, ALI S. ELBEKAIE***, and HUSIEN A. ALHAMMI****
* The High Institute of Surman for Comprehensive Professions, Surman, Libya
[email protected]
** The High Institute of Zahra for Comprehensive Professions, Zahra, Libya
[email protected]
*** The High Institute of Computer Technology, Tripoli, Libya
[email protected]
**** The High Institute of Zawia for Comprehensive Professions, Zawia, Libya
[email protected]
Abstract. Nowadays, e-mail is becoming one of the fastest and most economical forms of communication. Thus, e-mail is prone to misuse. One such misuse is the posting of unsolicited, unwanted e-mails known as spam or junk e-mails. This paper presents and discusses an implementation of an anti-spam filtering system which uses a Multi-Layer Perceptron (MLP) as a classifier and a Genetic Algorithm (GA) as a training algorithm. Standard genetic operators and advanced GA techniques are used to train the MLP. The implemented filtering system has achieved an accuracy of about 94% in detecting spam e-mails and 89% in detecting legitimate e-mails.
Keywords: Artificial Neural Networks, Genetic Algorithms, Spam Emails, Legitimate Emails, Arabic Spam, Text Classification.
1 INTRODUCTION
Spam is becoming an increasingly large problem. Many Internet Service Providers (ISPs) receive over a billion spam messages per day. Many of these e-mails are filtered before they reach end users. Content-based filtering is a key technological approach to e-mail filtering. Spam e-mail contents usually contain common words called features. The frequency of occurrence of these features inside an e-mail gives an indication of whether the e-mail is spam or legitimate [1, 11, 26, 28]. Spam filtering is a highly sensitive application of the text classification (TC) task. Because spam e-mails contain a great deal of noise and redundant data intended to bypass filtering systems, a pre-processing of e-mails is required in order to separate the contents of e-mails from HTML tags (the e-mail structure) and to decide which e-mail information to use. The e-mail information is organized as a set of fields, for example: From, To, Cc, Subject, and Body. In addition, we should handle the cases where some words appear in different forms (e.g., CLICK, C*L*I*C*K, N-O-W, now!). In other languages such as Arabic, some words also occur in different forms (e.g., Altehk (ألتحق, "Join"), Altehk! (ألتحق!, "Join!"), and Edkat* (إضغط*, "Click*")). For Arabic spam e-mails, one of the challenges we encountered in the feature reduction and selection phases is that some Arabic letters have many orthographical forms, such as Alan (ألان, "now"), Elan (إلان, "now"), and Alan (الان, "now"). In addition, some Arabic e-mails include English words, which need to be considered when designing and implementing an Arabic spam filtering system. This paper is organized as follows: Section 2 gives a theoretical background for the research and a review of
relevant recent work. Section 3 provides a description of Genetic Algorithms. Section 4 describes the Multi-Layer Feed-Forward Artificial Neural Network. Section 5 discusses the experimental work and the results of the experiments conducted, and includes an analysis of these results. Section 6 presents the conclusions drawn by the researchers.
2 BACKGROUND AND LITERATURE REVIEW
The success of statistical-probabilistic algorithms and machine learning algorithms in text categorization (TC) has led researchers to explore applying these algorithms to anti-spam filtering [9, 10, 18]. Various techniques to extract features from e-mail have been proposed and implemented. Payne and Edwards [20] used features consisting of words in the From and Subject fields. Segal et al. [23] developed the MailCat system; they used the information in the To, Cc, Subject, From, and Body fields. Jason and Rennie [12] developed the ifile system and used the words found in the From, Subject, and Body fields. Graham [11] extracted e-mail features from all fields in the header and body of e-mails. In this paper, we use the e-mail features found in the From, Subject, and Body fields. There are three common and intuitive representations found in text categorization: Term Frequency (TF), the Term Frequency-Inverse Document Frequency (TF-IDF) weight representation, and semantic approaches. Jason and Rennie [12] and Boone [3] used the TF representation in the ifile filtering system. Segal et al. [23] used the TF-IDF weighting scheme to develop the MailCat text classifier. Boone [3] showed that the TF-IDF weighting scheme captures the idea that the
subject words will occur frequently in documents on a given topic. Liao et al. [15] compared the TF and TF-IDF feature representations and concluded that the TF-IDF representation is better than the TF representation. Scott and Matwin [24] discussed a semantic-approach representation in text classification; their approach focused on word meanings by clustering words which have the same meaning together. The TF-IDF representation has a greater advantage over semantic approaches and TF, because TF-IDF reflects the degree of information represented by feature occurrences in e-mails. Feature reduction is often applied to reduce the number of features extracted from e-mails. Almost all feature-reduction techniques include stop-word removal. Normalizing some Arabic alphabet letters is a very useful and necessary reduction step, which converts Arabic letters such as Alef (ا) with hamza (ء) above or below, or with Madda (~) above, into the Arabic letter Alef (ا), and Waw (و) with hamza (ء) or Madda (~) above into the Arabic letter Waw (و). Feature selection approaches are usually employed to reduce the size of the feature set and to select a subset of the original features. The Chi-square test is used as a selection method [15, 25, 30]. Boone [3] and Salton [25] used TF-IDF as a feature selection and weighting scheme, and found that the TF-IDF scheme is useful for reducing the feature set size. Joachims [13] used information gain to select a subset of features. Liao et al. [15] showed that TF-IDF has similar performance to the information gain and Chi-square test methods. The TF-IDF feature selection method is proposed to select the most discriminative features while eliminating irrelevant ones among arbitrarily constructed e-mail feature sets. Several algorithms have been developed to classify and filter e-mails. The RIPPER algorithm [4] employs rules to filter e-mails. Drucker et al. [8] proposed an SVM algorithm for spam categorization. Jason [12] and Rennie [14] demonstrated that the SVM is costly to train and requires significant time to classify. Sahami et al. [22] proposed a Bayesian junk e-mail filter using the bag-of-words representation and the Naïve Bayes algorithm. Graham [11] described a simple implementation of the Naïve Bayes algorithm. Chuan et al. [7] proposed a Learning Vector Quantization (LVQ) neural-network-based anti-spam e-mail approach. Özgür et al. [17] proposed an anti-spam filtering method based on ANNs and Bayesian networks for agglutinative languages in general and for Turkish in particular. Clark et al. [5] used the bag-of-words representation and an ANN for an automated spam filtering system. Previous research has shown that ANNs can achieve very accurate results, sometimes more accurate than those of other TC classifiers [27]. Some researchers have used GAs as an alternative approach for training ANNs [16]. Branke [2] discussed how the genetic algorithm can be used to assist in designing and training neural networks. Riley [21] described a method of utilizing genetic algorithms to train fixed-architecture feed-forward and recurrent neural networks. Yao and Liu [29] reviewed the different combinations of ANNs and GAs, and used a GA to evolve ANN connection weights, architectures, learning rules, and input features. Prados [19] reported that a GA-based training algorithm is more useful for training ANNs, especially when a simple ANN topology is used.
3 A GENETIC ALGORITHM
A GA is used in the system proposed by this paper for training the MLP. GA-based training benefits from the GA's ability to process, in parallel, the interactions between the information (genes) of a number of different chromosomes in a population pool of candidate solutions, which leads to the creation of several new chromosomes. In this paper, a GA chromosome for the MLP is encoded as a vector of weights (w1, w2, ..., wn), where n is the number of MLP connections and each gene is a real-valued number in the interval [-10, 10]. There are two genetic operators. The first is the uniform crossover operator, whose occurrence is based on the crossover probability (Cp): crossover occurs if a random number generated in [0, 1] is greater than or equal to Cp. The second genetic operator is mutation, which simply changes a gene's value by adding a uniformly generated random value to it. Mutation occurs with probability one for a chromosome that has not been crossed, and with probability (1 - Cp) for a chromosome that has been crossed. The mutation value is computed according to the following equations:

Uvalue = random_value[0,1] × (Min_bound − Max_bound) + Max_bound   (3.1)

Where:
Min_bound = Min_b × random_value[0,1] × Generation_Rate   (3.2)
Max_bound = Max_b × random_value[0,1] × Generation_Rate   (3.3)
Generation_Rate = log10(Max_Gen − Cur_Gen) / log10(Max_Gen)   (3.4)

Min_b: lower interval value = -3; Max_b: upper interval value = +3; Max_Gen: maximum number of generations; Cur_Gen: current generation.
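As a concrete illustration, the following Python sketch implements the two operators as described above. The function and variable names are ours, Min_b = -3 and Max_b = +3 are taken from the text, and the reading of equations (3.1)-(3.4) is our best reconstruction of the garbled source:

```python
import math
import random

MIN_B, MAX_B = -3.0, 3.0  # lower/upper interval values from the text

def generation_rate(cur_gen, max_gen):
    # Eq. (3.4): the rate decays toward 0 as cur_gen approaches max_gen,
    # so mutation perturbations shrink in later generations.
    # (max(..., 1) guards against log10(0) at the final generation.)
    return math.log10(max(max_gen - cur_gen, 1)) / math.log10(max_gen)

def mutate(chromosome, cur_gen, max_gen):
    # Eqs. (3.1)-(3.3): add a uniformly drawn value within decaying bounds.
    rate = generation_rate(cur_gen, max_gen)
    for i in range(len(chromosome)):
        min_bound = MIN_B * random.random() * rate
        max_bound = MAX_B * random.random() * rate
        u_value = random.random() * (min_bound - max_bound) + max_bound
        chromosome[i] += u_value
    return chromosome

def uniform_crossover(parent_a, parent_b, cp=0.7):
    # Per the text, crossover fires when a uniform draw in [0, 1] is >= Cp.
    child_a, child_b = parent_a[:], parent_b[:]
    crossed = random.random() >= cp
    if crossed:
        for i in range(len(child_a)):
            if random.random() < 0.5:  # uniform crossover: swap genes
                child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b, crossed
```

The returned `crossed` flag lets the caller apply mutation with probability one to un-crossed chromosomes and with probability (1 - Cp) to crossed ones, as stated above.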
3.1 A FITNESS FUNCTION
The fitness of a chromosome c is the absolute sum of the differences between the desired and actual outputs over all training data. It is computed by the following equation:

Fitness(c) = Σ (i = 1 to N) |desired_output(i) − actual_output(i)|   (3.5)

Where:
N: is the number of e-mails in the training data.
desired_output(i): indicates the class of e-mail i, which is either spam (represented by the value 0.1) or legitimate (represented by the value 0.9).
actual_output(i): is the output value produced by chromosome c for e-mail i.
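A minimal sketch of this fitness computation, assuming a helper callable `forward` (a hypothetical name; a concrete forward-pass sketch appears in Section 4) that maps a weight vector and a feature vector to an output in (0, 1):

```python
def fitness(chromosome, training_data, forward):
    # Eq. (3.5): sum of absolute differences between desired and actual
    # outputs; desired_output is 0.1 for spam and 0.9 for legitimate e-mails.
    # `forward` is any callable mapping (weights, features) -> output in (0, 1).
    return sum(abs(desired - forward(chromosome, features))
               for features, desired in training_data)
```

Under this encoding the fitness is an error measure, so training seeks to minimize it (the stopping criterion in Section 4 is a fitness of 0.05 or less).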
3.2 ELITISM STRATEGY AND A RANK-BASED SELECTION
A rank-based selection is needed to make a few copies of a set of the best chromosomes. Equation 3.6 is used to calculate the number of copies of each chromosome depending on its position in a ranked list:

Copies = (q − ((Chr_order − 1) × p)) × Chrs   (3.6)

Where:
Chr_order: is the rank of the chromosome in the population pool list.
Chrs: is the number of chromosomes in the population pool list.
q = 2 / Chrs.
p = q / (Chrs − 1).
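The scheme can be read as a linear ranking that hands roughly two copies to the best chromosome and none to the worst. A sketch (our function name), assuming fitness is the error measure of Section 3.1 so that lower is better:

```python
def rank_based_selection(population, fitnesses):
    # Sort chromosomes best-first (lower fitness = lower error = better).
    ranked = [c for _, c in sorted(zip(fitnesses, population),
                                   key=lambda pair: pair[0])]
    chrs = len(ranked)
    q = 2.0 / chrs
    p = q / (chrs - 1)
    next_pool = []
    for order, chrom in enumerate(ranked, start=1):
        copies = round((q - (order - 1) * p) * chrs)  # Eq. (3.6)
        next_pool.extend(chrom[:] for _ in range(copies))
    return next_pool
```

Summing Eq. (3.6) over all ranks gives exactly Chrs copies, so the pool size is preserved (up to rounding).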
4 AN ARTIFICIAL NEURAL NETWORK (ANN)
The ANN used in our system is the key component that performs the filtering operation. The MLP architecture is a fully connected feed-forward network whose number of inputs depends on the number of selected features. Each input corresponds to a single feature, which is converted to a TF-IDF weight and organized into a TF-IDF feature vector with a class label of spam (0.1) or legitimate (0.9). The MLP has a single output. Training is done by constructing one target output for legitimate or spam e-mails and training with the appropriate output value for the input data. By observation, a threshold value of 0.6 was chosen: an output value less than 0.6 is thresholded to 0.1, otherwise it is thresholded to 0.9. The MLP is trained using one and two hidden layers. A number of hidden neurons and one output neuron with a sigmoid activation function are used. The English and Arabic data sets are tested with different numbers (5, 10, 15, 20, and 30) of hidden neurons. The MLP is trained using the GA described in Section 3. The training procedure starts with 20 chromosomes; other experiments are conducted with different numbers of chromosomes (e.g., 40 and 60). Initial chromosome gene values are real numbers in the interval [-10, 10]. The training procedure is repeated many times with different e-mail training data, over several generations, until one of the following conditions is met:
1. The maximum number of generations (set to 50,000) is reached.
2. The fitness value (the MLP error) is less than or equal to 0.05.
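For illustration, a sketch of the forward pass and the 0.6 thresholding rule (names are ours; bias terms are omitted because the paper does not mention them):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp_forward(weights, features, layers):
    # Fully connected feed-forward pass; `layers` lists the neuron counts,
    # e.g. [n_features, 30, 1]; `weights` is the flat GA chromosome.
    idx, activations = 0, list(features)
    for n_in, n_out in zip(layers, layers[1:]):
        nxt = []
        for _ in range(n_out):
            s = sum(w * a for w, a in zip(weights[idx:idx + n_in], activations))
            idx += n_in
            nxt.append(sigmoid(s))
        activations = nxt
    return activations[0]

def classify(output, threshold=0.6):
    # Outputs below 0.6 are thresholded to 0.1 (spam); otherwise 0.9 (legitimate).
    return 0.1 if output < threshold else 0.9
```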
5 THE EXPERIMENTAL WORK
In this section, we first present the data sets used to conduct our evaluation experiments. Next, the pre-processing of our data and the implementation of our system are described. Then, the evaluation measures used to assess our system are given. Finally, a set of experiments is presented, followed by the results and their discussion.
5.1 THE DATA SETS
Three different e-mail data sets are used to conduct our experiments. These data sets were collected from different sources [6, 31]. Table 5.1 shows the three data sets.

Table 5.1: The Data Sets (corpora).
Corpus Name         No. of Spam E-mails   No. of Legitimate E-mails   Total
SpamAssassin        630                   370                         1000
TREC 2005           630                   370                         1000
The Arabic Corpus   56                    16                          72
5.2 TRAINING AND TEST DATA
Each data set was split equally into two sets (50% for training and 50% for testing). Table 5.2 shows the training and test data for each corpus (data set).

Table 5.2: Training and Test Data.
             SpamAssassin Corpus     TREC Corpus             The Arabic Corpus
             Training set  Test set  Training set  Test set  Training set  Test set
Spam         326           289       365           373       28            28
Legitimate   174           211       135           127       8             8
Total        500           500       500           500       36            36
5.3 DATA PRE-PROCESSING
Data pre-processing is an analysis of the textual data and an extraction of information from e-mails. The general procedure for data pre-processing can be described by the following steps (a code sketch of steps (ii) and (iii) follows the list): (i) Deletion: remove irrelevant elements of e-mails, and select the segments suitable for processing (e.g., the Subject and Body fields). (ii) Normalization: for Arabic e-mails, convert Arabic letters which have the same shape, such as Alef (ا) with hamza (ء) above or below, or with Madda (~) above, into the Arabic letter Alef (ا), and Waw (و) with hamza (ء) or Madda (~) above into the Arabic letter Waw (و).
(iii) Tokenization: divide the message into semantically coherent segments (e.g., words and other character strings). (iv) Representation: convert the e-mail message into a vector of values, where each value represents an e-mail feature. (v) Selection: delete the least predictive e-mail features using the TF-IDF weighting scheme; the features with the highest TF-IDF values are selected to represent the set of training features.
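A minimal sketch of the normalization and tokenization steps, assuming Unicode input (the mapping covers the Alef and Waw variants named above; names are ours):

```python
import re

# Step (ii): collapse the Alef variants (hamza above/below, madda above)
# and the Waw-with-hamza variant to their bare forms.
LETTER_MAP = {
    "\u0623": "\u0627",  # Alef with hamza above -> Alef
    "\u0625": "\u0627",  # Alef with hamza below -> Alef
    "\u0622": "\u0627",  # Alef with madda above -> Alef
    "\u0624": "\u0648",  # Waw with hamza above  -> Waw
}

def normalize(text):
    for variant, base in LETTER_MAP.items():
        text = text.replace(variant, base)
    return text

def tokenize(text):
    # Step (iii): split into word-like segments; \w matches Arabic letters
    # as well as Latin ones in Python 3.
    return re.findall(r"\w+", normalize(text))
```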
5.4 IMPLEMENTATION
We have implemented an anti-spam system that runs under the Windows XP platform. The code is written in Visual Basic .NET. The system was built from scratch without using any ANN or GA libraries. The system has three main modules: (1) a features extraction and reduction module; (2) a features weighting and selection module; (3) a classifier module, which consists of an MLP classifier and a GA.

5.4.1 THE FEATURES EXTRACTION AND REDUCTION MODULE
This module is concerned with features extraction and reduction. It first tokenizes each e-mail in the training data set. Then, a bag-of-words is created for each e-mail data set. No stemming was applied. Next, words that appear three times or fewer in each corpus were discarded. Finally, words that are 20 characters in length or longer were removed. As a result, the initial number of unique features is reduced from about 4108 to 981 for the Arabic corpus. For the SpamAssassin corpus, the initial number of features is reduced from 22000 to 3200, while for the TREC corpus, the features are reduced from 29000 to 4000.
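A sketch of the reduction rules in this module (the count and length bounds come from the text; the function name is ours):

```python
from collections import Counter

def build_vocabulary(tokenized_emails, min_count=4, max_length=19):
    # Keep words appearing at least four times (three or fewer are discarded)
    # and shorter than 20 characters, as described above. No stemming.
    counts = Counter(token for email in tokenized_emails for token in email)
    return {word for word, count in counts.items()
            if count >= min_count and len(word) <= max_length}
```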
5.4.2 THE FEATURES WEIGHTING AND SELECTION MODULE
Feature selection using the TF-IDF scheme is carried out after the construction of the bag-of-words. The selection of the best features is done by sorting the features by their TF-IDF values in descending order. We then decide how many features to include. The experiments are conducted using different numbers of selected features.
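The paper does not spell out its exact TF-IDF variant, so the sketch below uses the standard tf × log(N/df) weighting and ranks features by their best score, which matches the descending-order selection described above; all names are ours:

```python
import math
from collections import Counter

def select_features(tokenized_emails, vocabulary, n_features):
    # Document frequency over the training corpus.
    n_docs = len(tokenized_emails)
    df = Counter()
    for email in tokenized_emails:
        df.update(set(email) & vocabulary)
    idf = {word: math.log(n_docs / count) for word, count in df.items()}

    # Rank words by their highest TF-IDF value in any e-mail, descending.
    best = Counter()
    for email in tokenized_emails:
        tf = Counter(word for word in email if word in vocabulary)
        for word, count in tf.items():
            best[word] = max(best[word], count * idf[word])
    selected = [word for word, _ in best.most_common(n_features)]
    return selected, idf

def to_vector(email_tokens, selected, idf):
    # Represent an e-mail as its TF-IDF vector over the selected features.
    tf = Counter(email_tokens)
    return [tf[word] * idf.get(word, 0.0) for word in selected]
```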
5.4.3 THE CLASSIFIER MODULE
The MLP architecture is a fully connected feed-forward network whose inputs depend on the number of selected features. Each input is converted to a TF-IDF weight and organized into a TF-IDF feature vector with a class label of spam (0.1) or legitimate (0.9). Two matrices are used to calculate the outputs of every layer. The first matrix holds the MLP inputs organized as vectors, where each vector consists of a set of TF-IDF values. The second matrix contains a set of chromosomes, which represent the weights associated with every MLP input.
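One way to realize this matrix organization is to reshape each flat chromosome into per-layer weight matrices before the forward pass; a sketch (names ours):

```python
def decode_chromosome(chromosome, layers):
    # Split the flat weight vector into one (n_out x n_in) matrix per layer,
    # so each layer's outputs become matrix-vector products.
    matrices, idx = [], 0
    for n_in, n_out in zip(layers, layers[1:]):
        matrix = [chromosome[idx + row * n_in: idx + (row + 1) * n_in]
                  for row in range(n_out)]
        idx += n_in * n_out
        matrices.append(matrix)
    return matrices
```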
5.5 EVALUATION MEASURES
The performance of spam filtering techniques is determined by two well-known measures used in text classification, precision and recall [5, 15], which can be computed as follows:

Spam Precision (SP) = NSS / (NSS + NLS)   (5.1)
Legitimate Precision (LP) = NLL / (NLL + NSL)   (5.2)
Spam Recall (SR) = NSS / (NSS + NSL)   (5.3)
Legitimate Recall (LR) = NLL / (NLL + NLS)   (5.4)

Where:
NSS = the number of spam messages correctly classified as spam.
NSL = the number of spam messages incorrectly classified as legitimate.
NLL = the number of legitimate messages correctly classified as legitimate.
NLS = the number of legitimate messages incorrectly classified as spam.
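These four measures translate directly into code; a sketch:

```python
def evaluation_measures(n_ss, n_sl, n_ll, n_ls):
    # Eqs. (5.1)-(5.4) from the four counts defined above.
    return {
        "SP": n_ss / (n_ss + n_ls),  # spam precision
        "LP": n_ll / (n_ll + n_sl),  # legitimate precision
        "SR": n_ss / (n_ss + n_sl),  # spam recall
        "LR": n_ll / (n_ll + n_ls),  # legitimate recall
    }
```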
5.6 EXPERIMENTS
The purpose of these experiments is to evaluate the performance of the MLP in spam filtering and the efficiency of the GA in training the MLP. A series of tests is first performed on a small problem (XOR) to discover the GA parameters (e.g., mutation probability, crossover probability, and population size) that give the best performance of the MLP. The best GA parameters obtained are then used to train our MLP classifier.
5.6.1 THE XOR PROBLEM
The XOR problem was the first problem solved using the MLP trained by the GA. This problem has become a standard example used by many researchers to explain the training process. Table 5.3 shows the mutation probability (Mp), crossover probability (Cp), and population size (Ps) for each experiment. Table 5.3 clearly shows that experiment 2 recorded the minimum time to train the MLP with the GA on the XOR problem.

Table 5.3: The GA Parameters for the XOR Problem.

Experiment name   Mp    Cp    Ps   Training Time in Seconds (s)
Experiment 1      0.3   0.7   10   ≅10s
Experiment 2      0.3   0.7   20   ≅5s
Experiment 3      0.3   0.7   40   ≅15s
Experiment 4      0.3   0.7   60   >20s
Experiment 5      0.5   0.7   10   ≅15s
Experiment 6      0.5   0.7   20   ≅15s
Experiment 7      0.5   0.7   40   >20s
Experiment 8      0.5   0.7   60   ≅40s
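For reference, a sketch of the XOR training set and of scoring one candidate chromosome with the pieces introduced earlier (the 0.1/0.9 target encoding for XOR is our assumption, mirroring the e-mail class encoding; `mlp_forward` is the Section 4 sketch):

```python
import random

# Inputs and desired outputs for XOR; 0.9 encodes "true", 0.1 "false"
# (an assumption; the paper does not state its XOR target encoding).
XOR_DATA = [([0.0, 0.0], 0.1),
            ([0.0, 1.0], 0.9),
            ([1.0, 0.0], 0.9),
            ([1.0, 1.0], 0.1)]

# Score one random chromosome for a 2-2-1 network using the fitness
# function of Section 3.1 and the forward pass of Section 4.
layers = [2, 2, 1]
n_weights = sum(a * b for a, b in zip(layers, layers[1:]))
chromosome = [random.uniform(-10, 10) for _ in range(n_weights)]
error = sum(abs(d - mlp_forward(chromosome, x, layers)) for x, d in XOR_DATA)
```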
5.6.2 THE MLP AND GA CLASSIFIER
A series of experiments was conducted to train our MLP using the GA parameters obtained from experiment 2, described in the previous section. These experiments are intended to train our MLP with the GA on three different data sets. Although the training process completes, there are some cases where combinations of the MLP parameters led to failure, owing to low rates on the SR, SP, LR, and LP evaluation measures. Other combinations of the MLP parameters were abandoned and their training processes terminated because the training time exceeded 60 hours while the MLP errors improved only slightly. One of the terminated training processes was the experiment that used 250 input features with 30 neurons in the first hidden layer and 15 neurons in the second. The following sections present the results of the experiments conducted on the three data sets.

5.6.3 THE SPAMASSASSIN DATA SET RESULTS
In this experiment, we trained the MLP using the GA on the SpamAssassin data set. This section presents the results obtained using 150 and 200 different features. Tables 5.4 and 5.5 show the SR, SP, LR, and LP values using 150 and 200 input features, respectively. It can be observed from Table 5.4 that the best results are obtained using the MLP consisting of one hidden layer with 30 neurons. These results were achieved through a gradual improvement of the initial error rate, which was 123 in the first generation; it took about 29147 generations to reach an error value of 0.049. Table 5.5 likewise shows that the best results are obtained using the MLP consisting of one hidden layer with 30 neurons. These results were achieved through a gradual improvement of the initial error rate, which was 100 in the first generation; it took about 32654 generations to reach an error value of 0.0432.

Table 5.4: The Results of the SpamAssassin Data Set (150 input features). [The table reported SR%, SP%, LR%, and LP% for each combination of hidden-layer sizes (Layer 1: 10, 20, or 30 neurons; Layer 2: 0, 10, 15, or 30 neurons); the individual cell values are not recoverable from the source.]

Table 5.5: The Results of the SpamAssassin Data Set (200 input features). [Same layout as Table 5.4; cell values not recoverable from the source.]

5.6.4 THE TREC DATA SET RESULTS
In this experiment, the MLP was trained using the GA on the TREC data set. The results obtained using 150 and 200 different features are given in Tables 5.6 and 5.7, which show the SR, SP, LR, and LP values using 150 and 200 input features, respectively. In general, the results show low rates, because the TREC corpus contains a large number of spam e-mails that are highly similar to legitimate e-mails (hard spam). It can be observed from Table 5.6 that the best results are obtained using the MLP consisting of two hidden layers, with 30 neurons in the first layer and 10 neurons in the second. These results were achieved through a gradual improvement of the initial error rate, which was 124 in the first generation; it took about 34765 generations to reach an error value of 0.0498. Table 5.7 likewise shows that the best results are obtained using the MLP consisting of two hidden layers, with 30 neurons in the first layer and 10 neurons in the second. These results were achieved through a gradual improvement of the initial error rate, which was 112 in the first generation; it took about 34565 generations to reach an error value of 0.046.
Table 5.6: The Results of the TREC Data Set (150 input features). [The table reported SR%, SP%, LR%, and LP% for each combination of hidden-layer sizes (Layer 1: 10, 20, or 30 neurons; Layer 2: 0, 10, 15, or 30 neurons); the individual cell values are not recoverable from the source.]

Table 5.7: The Results of the TREC Data Set (200 input features). [Same layout as Table 5.6; cell values not recoverable from the source.]

5.6.5 THE ARABIC DATA SET RESULTS
In this experiment, we trained the MLP using the GA on the Arabic data set. This section presents the results obtained using 50 and 90 different features. Tables 5.8 and 5.9 show the SR, SP, LR, and LP values using 50 and 90 input features, respectively. It can be observed from Table 5.8 that the best results are obtained using the MLP consisting of one hidden layer with 15 neurons. These results were achieved through a gradual improvement of the initial error rate, which was 8 in the first generation; it took about 3546 generations to reach an error value of 0.048. Table 5.9 likewise shows that the best results are obtained using the MLP consisting of one hidden layer with 15 neurons. These results were achieved through a gradual improvement of the initial error rate, which was 7 in the first generation; it took about 4546 generations to reach an error value of 0.039.

Table 5.8: The Results of the Arabic Data Set (50 input features). [Same layout as Table 5.6; cell values not recoverable from the source.]

Table 5.9: The Results of the Arabic Data Set (90 input features). [Same layout as Table 5.6; cell values not recoverable from the source.]
5.7 THE OVERALL PERFORMANCE
The results of our experiments indicate that our implemented MLP classifier trained with the GA performed significantly well. The overall accuracy rates are about 94% for detecting spam e-mails and about 89% for detecting legitimate e-mails.
5.8 DISCUSSION OF THE RESULTS
An analysis of the results and a deep understanding of the experiments produced the following remarks: (1) The best number of input features for English e-mails was 150, which generated the best results compared to 200 input features. For Arabic e-mails, 90 input features were found to be the best. This implies that the success rates are clearly influenced by the number of input features. (2) Words in legitimate e-mails are as important as words in spam e-mails for the filtering process. By observation, most misclassifications were e-mails containing only one or two words, or Arabic e-mails in which Arabic is mixed with English words. (3) A wise setting of the number of hidden layers and the number of neurons can significantly decrease the MLP error rates. (4) The initial parameters used during the development of the GA were Mp = 0.3, Cp = 0.7, and Ps = 20, with the maximum number of generations set to 50,000. These settings were suitable for the e-mail filtering domain. Increasing the population size gives good chromosomes less chance of appearing in the next generation under the rank-based selection. The GA works better with many inputs (spam filtering) than with few inputs (the XOR problem).

6 CONCLUSION
An anti-spam filtering system was proposed which uses a multi-layer artificial neural network trained by a genetic algorithm. The results clearly show that the Subject and Body fields can contain enough information to classify e-mails as spam or legitimate. The results have also shown that an MLP with 15-30 neurons in the first hidden layer is sufficient to filter both easy spam and easy legitimate e-mails. The MLP architecture used to develop our system is good for filtering e-mails, if we do not take into account the long time needed to train the MLP. We have also investigated the effects of several GA parameters. The parameters found to be the most significant to the performance of the classifier are the size of the population pool, the crossover and mutation probabilities, and the mutation method. It is important to remember that e-mail filtering is a highly sensitive application of the text classification problem: the classifier must be able to handle many input features with low false-positive and low false-negative rates.

ACKNOWLEDGEMENT
We would like to express our gratitude to the Libyan General Secretariat for Human Resources and Training for supporting this work.

REFERENCES
[1] Bruening, P., "Technological Responses to the Problem of Spam: Preserving Free Speech and Open Internet Values", First Conference on E-mail and Anti-Spam, 2004.
[2] Branke, J., "Evolutionary Algorithms for Neural Network Design and Training", Proceedings of the 1st Nordic Workshop on Genetic Algorithms and its Applications, Finland, 1995.
[3] Boone, G., "Concept Features in Re:Agent, an Intelligent E-mail Agent", Second International Conference on Autonomous Agents, 1998.
[4] Cohen, W., "Learning Rules that Classify E-mail", AAAI Spring Symposium on Machine Learning in Information Access, California, 1996.
[5] Clark, J., et al., "A Neural Network Based Approach to Automated E-mail Classification", IEEE/WIC International Conference on Web Intelligence, 2003.
[6] Cormack, G., Lynam, T., "Spam Corpus Creation for TREC", Second Conference on E-mail and Anti-Spam, 2005.
[7] Chuan, Z., et al., "A LVQ-based Neural Network Anti-spam E-mail Approach", Proceedings of the 5th International Conference, Singapore, 2004.
[8] Drucker, H., et al., "Support Vector Machines for Spam Categorization", IEEE Transactions on Neural Networks, 1999.
[9] Flavio, D., et al., "Spam Filter Analysis", University of Nijmegen, the Netherlands, 2003.
[10] Goodman, J., "Spam: Technologies and Policies", Microsoft Research, 2003.
[11] Graham, P., "A Plan for Spam", MIT Conference on Spam, 2003.
[12] Jason, D., Rennie, M., "ifile: An Application of Machine Learning to E-mail Filtering", Text Mining Workshop, Boston, U.S.A., 2000.
[13] Joachims, T., "Text Categorization with Support Vector Machines: Learning with Many Relevant Features", Proceedings of ECML-98, 10th European Conference on Machine Learning, 1997.
[14] Kolcz, A., Alspector, J., "SVM-based Filtering of E-mail Spam with Content-specific Misclassification Costs", Proceedings of the Workshop on Text Mining, IEEE International Conference on Data Mining, San Jose, California, 2001.
[15] Liao, C., Alpha, S., Dixon, P., "Feature Preparation in Text Categorization", Oracle Corporation, 2004.
[16] Montana, D., Davis, L., "Training Feed-forward Neural Networks Using Genetic Algorithms", Proceedings of the 11th International Joint Conference on Artificial Intelligence, 1989.
[17] Özgür, L., et al., "Adaptive Turkish Anti-spam Filtering", Twelfth Turkish Symposium on Artificial Intelligence and Neural Networks (TAINN), 2003.
[18] Oda, T., White, T., "Increasing the Accuracy of a Spam-detecting Artificial Immune System", Congress on Evolutionary Computation Proceedings, Canberra, Australia, 2003.
[19] Prados, D., "Training Multilayered Neural Networks by Replacing the Least Fit Hidden Neurons", Proceedings of IEEE SOUTHEASTCON 2002, 2002.
[20] Payne, T., Edwards, P., "Interface Agents that Learn: An Investigation of Learning Issues in a Mail Agent Interface", Applied Artificial Intelligence, 1997.
[21] Riley, J., "An Evolutionary Approach to Training Feed-Forward and Recurrent Neural Networks", Master's thesis, Department of Computer Science, Royal Melbourne Institute of Technology, Australia, 2002.
[22] Sahami, M., et al., "A Bayesian Approach to Filtering Junk E-mail", Learning for Text Categorization, AAAI Technical Report, U.S.A., 1998.
[23] Segal, R., et al., "MailCat: An Intelligent Assistant for Organizing E-mail", Proceedings of the Third International Conference on Autonomous Agents, 1999.
[24] Scott, S., Matwin, S., "Feature Engineering for Text Classification", Proceedings of ICML-99, 16th International Conference on Machine Learning, 1999.
[25] Salton, G., Buckley, C., "Term Weighting Approaches in Automatic Text Retrieval", Information Processing and Management, Vol. 24, No. 5, p. 513, 1988.
[26] Fürnkranz, J., "A Study Using n-gram Features for Text Categorization", Austrian Research Institute, 1998.
[27] Vinther, M., "Intelligent Junk Mail Detection Using Neural Networks", http://www.logicnet.dk/reports/JunkDetection/JunkDetection.pdf, 2002.
[28] William, S., et al., "A Unified Model of Spam Filtration", MIT Spam Conference, Cambridge, 2005.
[29] Yao, X., Liu, Y., "A New Evolutionary System for Evolving Artificial Neural Networks", IEEE Transactions on Neural Networks, 1997.
[30] Yang, Y., Pedersen, J., "A Comparative Study on Feature Selection in Text Categorization", Proceedings of ICML-97, 14th International Conference on Machine Learning, U.S.A., 1997.
[31] http://www.spamassassin.org/publiccorpus/