IJICIS, Vol.9, No.2 JULY 2009
ARTIFICIAL IMMUNE SYSTEM FOR SPAM FILTERING
M. H. Haggag
I. E. Fattoh
Computer Science Department, Faculty of Computers & Information, Helwan University, Helwan - Egypt
[email protected]
Computer Science Department, Faculty of Information Technology, Misr University for Science & Technology, 6 October, Egypt.
[email protected]
Abstract: Spam has made the mail system unreliable: legitimate mail can be falsely caught by a spam filter before it is delivered to the recipient, and, conversely, spam can still reach the recipient's mailbox. The Artificial Immune System (AIS) model is inspired by the natural immune system. In this paper an AIS model is proposed for spam filtering. The body of each e-mail message is used in both the training and testing processes. The words that constitute the e-mail are weighted and used in evaluating the affinity between an antibody and an antigen. A learning phase rewards the cells that correctly recognize spam e-mails. Negative selection, rather than clonal selection, is used in the training phase, which improves testing performance because fewer detectors are generated. Compared with other approaches, the proposed system is trained on a far smaller share of each corpus, which also improves performance. The system was tested against the publicly available PU corpora of spam and non-spam. The results compare favorably with those of other techniques.
Keywords: Artificial Immune System, Negative Selection, Clonal Selection, Spam Filtering.
1. Introduction
The natural immune system defends the body against harmful diseases and infections. It is capable of recognizing and eliminating any foreign cell or molecule. To achieve its tasks, the immune system has evolved complex pattern recognition and response mechanisms following various differential pathways. The core job of the natural immune system is to distinguish between self and harmful non-self elements. The harmful non-self elements of particular interest are the pathogens. Analogously, an Artificial Immune System (AIS) can distinguish between self, i.e. legitimate e-mail (non-spam), and non-self, i.e. illegitimate e-mail (spam). The main difference between the natural immune system and an artificial immune filtering system is that the self elements of the natural immune system are constant throughout life, while those of the artificial filtering system change over time.
The body identifies foreign material through two inter-related systems: the innate immune system and the adaptive immune system [1, 2]. With the innate immune system, the body is born with the ability to recognize certain microbes and immediately destroy them. The adaptive immune system uses generated antigen receptors that are clonally distributed on the two types of lymphocytes, B cells and T cells, which mature in the bone marrow and the thymus, respectively.
The immune system has a set of global features that enable it to perform recognition. One of these features is clonal selection, which describes the basic features of an immune response to an antigenic stimulus. When a B-cell receptor recognizes a non-self antigen with a certain affinity, it is selected to proliferate and produce new antibodies. Proliferation in the case of immune cells is an asexual process, in which the cells divide themselves. In addition to proliferating and differentiating, the activated B cells with high affinities are selected to become memory cells with long life spans [3].
Another feature is negative selection, which performs pattern recognition by storing information about the complement set (non-self) of the patterns to be recognized (self).
Haggag & Fattoh: Artificial Immune System For Spam Filtering
From an information-processing perspective, the immune system is a parallel and distributed adaptive system with a partially decentralized control mechanism. It uses feature extraction, learning, storage memory, and associative retrieval to solve recognition and classification tasks. In particular, it learns to recognize relevant patterns, remembers patterns that have been seen previously, and is able to construct pattern detectors efficiently. The overall behavior is an emergent property of many local interactions. These information-processing capabilities make the immune system a useful source of ideas in the field of computation [4].
The problem of spam filtering appeared several years ago as part of the broader information filtering problem. The word spam is used generally to denote a variety of junk, especially unwanted junk e-mail [5]. Many techniques have been applied to this problem, such as gray listing, Bayesian filtering, white and black lists, and real-time black lists. This paper presents a new mechanism for controlling spam using an Artificial Immune System (AIS) based on the negative and clonal selection mechanisms. The remainder of the paper is organized as follows. Section 2 reviews previous work in the immune system and spam filtering literature. The proposed model is presented in section 3. Experimental results and system evaluation are discussed in section 4.
2. Literature Review of the Previous Work
Many publications have addressed the spam filtering problem and proposed different approaches to it. Zhao and Zhang [6] presented a rough set based model that classifies e-mails into three categories (spam, non-spam and suspicious) rather than two (spam and non-spam). Compared with popular classification methods such as Naive Bayes, the model reduced the error ratio of non-spam being misclassified as spam.
Wu [7] proposed an anti-spam system that uses visual clues, including embedded-text information. The experimental results showed that the filter is both effective and efficient, and that it can add filtering power to existing text-based anti-spam filters. Bratko et al. [8] investigated an approach to spam filtering based on adaptive statistical data compression models. These models were employed as probabilistic text classifiers based on character-level or binary sequences. The results showed that compression models are very robust to the type of noise introduced into the text by the obfuscation tactics typically used by spammers. They concluded that updating compression models adaptively to the target document is beneficial for classification. Hsiao et al. [9] proposed a classification approach that improves the effectiveness of spam filtering under skewed class distributions. The approach first clusters concepts into several sub-concepts and then extracts an equally sized set of keywords for each sub-concept. Its performance is comparable to that of KNN, and it is faster than KNN when the problem of skewed classes is not critical. When the problem of skewed classes is critical, it outperforms KNN in both speed and recall. Oda and White [10] proposed an artificial immune system model to protect users from spam e-mail. The resulting system classifies messages with accuracy similar to that of other spam filters, although it uses fewer detectors. With the default threshold it correctly classified 99% of non-spam and 70% of spam. Secker et al. [11] introduced another artificial immune system for e-mail classification (AISEC). They used the e-mail header (subject and sender) in training and testing. AISEC used the
Hamming method to calculate the affinity between an antibody (AB) and an antigen (AG). Clonal selection is used in its training phase to increase the initial number of detectors. AISEC achieved a classification accuracy of 89.09%. Sarafijanovic and Le Boudec [12] proposed an artificial immune system for collaborative spam filtering. The system used two techniques. The first is local processing of the signatures created from the e-mails before deciding whether, and which of, the generated signatures will be exchanged with other collaborating anti-spam systems. The second is a representation of the e-mail content based on sampling text strings of a predefined length at random positions within the e-mails and applying a custom similarity hashing to these strings. The results showed promising detection under modest collaboration. Yue et al. [13] proposed a behavior-based collaborative algorithm for e-mail servers using an artificial immune system. The approach was shown to be reliable, efficient and scalable, and it could be used in conjunction with other filtering systems to minimize errors.
3. Proposed Artificial Immune System for Spam Filtering (AISSF)
In the proposed spam filtering model, both spam and non-spam e-mails are considered in the training process. The following sections explain the behavior of the proposed model. Before the algorithm is described, the immune system considerations that contribute to the proposed model are highlighted.
Detectors representation: The main component of the spam immune system is the detector; detectors are called antibodies (AB). Antibodies are needed to validate the contents of an e-mail message. An antibody can be described by an attribute vector X = {x1, x2, ..., xn}. To convert an e-mail message to an antibody, the local weight of each word in the e-mail is first computed as the number of occurrences of the word divided by the total number of words in the e-mail. The 30 words with the highest weights are then selected. These weights are then normalized over the selected words that constitute the antibody, as given by the following formula:
$$\text{Generalized\_Weight} = \frac{\text{Actual\_Weight} - \text{Min\_Weight}}{\text{Max\_Weight} - \text{Min\_Weight}} \tag{1}$$
where Actual_Weight is the local weight of the word, Min_Weight is the minimum local weight among the top 30 words selected for that e-mail, and Max_Weight is the maximum local weight among them. The generalized weight maps each word's weight onto a normalized scale defined by the top 30 words representing the e-mail.
Weighted affinity: The affinity mechanism in [11] is based on the count of words shared by the two e-mails being compared. The proposed affinity evaluation is instead based on the local weights of the words shared by the two e-mails. The affinity is then checked against a threshold to determine the class of the e-mail. The weighted affinity method is explained in a later section.
Dynamic thresholding: The proposed model uses no static threshold value. Instead, it uses a threshold function that depends directly on the number of words shared by the two e-mails being compared.
$$\text{Threshold} = \frac{1}{\text{no. of matched words}} \tag{2}$$
As stated in equation 2, the threshold is inversely proportional to the number of matched words: as the number of matched words between two e-mails increases, the threshold between them decreases, and vice versa. This dynamic calculation allows the threshold to adapt to the degree of similarity between the e-mails being checked.
Gene library: All words extracted from the training elements and memory cells are kept in the gene library. This library is used to perform mutation: a word from the library replaces a word in a cell's feature vector, as described in the algorithm.
Antibody caching: The words with the highest weights in the gene library are cached in an antibody buffering layer. Testing with cached antibodies considerably reduced the processing time of the system, although the accuracy of the cache is far below normal if it is used standalone, as the experimental results show.
Detectors sorting: The detector set is sorted after the system has been tested a number of times. The sorting is based on the lifespan of each detector and speeds up the filtering process. Results showed that processing time was roughly halved.
User feedback: E-mails classified as spam are not deleted. Instead, the system stores them in a temporary folder. If the user deletes an e-mail from this temporary folder, the system is taken to have classified it correctly, and the cell that recognized it is rewarded by being allowed to reproduce. If the user does not delete the e-mail, the classification is assumed to be wrong, and the artificial cells that recognized the antigen may be deleted.
Cell killing process: The clonal selection process increases the population size. The system therefore gives every new B-cell a finite lifespan when it is created. A B-cell can lengthen its lifespan by continually recognizing new spam e-mails and can preserve it by not recognizing any non-spam e-mail. If a new B-cell falsely recognizes a non-spam e-mail, it is deleted. Memory cells may also die, but because these cells have proved their worth they are not deleted directly. When a new memory cell is added to the memory cell set, the lifespan of every memory cell that recognizes the new cell is reduced. If the lifespan of a memory cell reaches zero, it is removed from the system. This prevents the system from keeping several memory cells that cover the same area when one is sufficient.
AISSF algorithm: In this section, the proposed algorithm for spam filtering, which we call AISSF, is described in detail. AISSF is a content-based filtering technique. All the aspects described in the section above contribute to the model and jointly affect the system performance. The first step is to train the detectors. As mentioned previously, the system uses both spam and non-spam e-mails in the training process. Once training is complete, each antibody in the resulting set represents an example of a known spam e-mail. The cached antibody layer is constructed from the highest weighted words in the gene library. A new incoming e-mail that needs to be classified is called an antigen (AG). To test an AG, it is first preprocessed in the same way as a detector, as described earlier. The antigen is then presented to all antibodies and the affinity is calculated as described below. If the affinity is higher than the threshold value, the e-mail is classified as spam; otherwise it is classified as non-spam and allowed to pass to the user's inbox.
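The preprocessing step that turns an e-mail into a detector or antigen (local weights, top 30 words, then the normalization of equation 1) can be sketched in Python as follows. This is only an illustrative sketch: the function name `email_to_cell`, the lowercase whitespace tokenizer, the alphabetical tie-breaking, and the handling of the case where all selected weights are equal are assumptions that the paper does not specify.

```python
from collections import Counter

TOP_K = 30  # the paper keeps the 30 highest-weighted words per e-mail


def email_to_cell(body, top_k=TOP_K):
    """Return {word: generalized weight} for the top_k words of an e-mail body."""
    words = body.lower().split()  # assumption: naive whitespace tokenizer
    if not words:
        return {}
    total = len(words)
    # Local weight: occurrences of a word relative to the e-mail length.
    local = {w: c / total for w, c in Counter(words).items()}
    # Keep the top_k words; breaking ties alphabetically is an assumption.
    top = sorted(local.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    lo = min(weight for _, weight in top)
    hi = max(weight for _, weight in top)
    if hi == lo:
        # Equation 1 is undefined when all weights are equal; 1.0 is an assumption.
        return {w: 1.0 for w, _ in top}
    # Equation 1: min-max normalization over the selected words.
    return {w: (weight - lo) / (hi - lo) for w, weight in top}
```

For instance, `email_to_cell("buy buy buy cheap cheap pills")` gives the most frequent word a generalized weight of 1 and the least frequent word a weight of 0, with the remaining word in between.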
Depending on the user's feedback, several cases are possible. If the antigen is classified as non-spam and the user moves it to the spam folder, the classification was false; the system then adds this antigen to the antibodies in the training set to prevent similar e-mails from appearing in the inbox again. If the antigen is classified as spam, it is moved to a spam folder. If the user deletes the antigen from the spam folder, this confirms that it is a spam e-mail and that the classification was true. All antibodies that recognized the AG are rewarded by increasing their lifespan. If the user instead moves the AG to the inbox, the classification was false, and the antibodies that recognized this antigen are removed from the training set to prevent further false classifications. The antibody that recognized the antigen with the highest affinity is rewarded by being moved to the set of memory cells, if it is not already there. It is further selected for reproduction by the clonal selection process. The reproduction process is combined with a cell death process that deletes cells whose lifespan is low, near zero, which keeps the process adaptive and dynamic.
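The feedback handling for an e-mail that the system flagged as spam can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Cell` class, the `REWARD` increment, and the rule that cells are purged once their lifespan is at or below zero are hypothetical, since the paper gives no numeric values for rewards or for the death threshold. Cloning and mutation are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    words: dict             # word -> generalized weight
    lifespan: int
    is_memory: bool = False


REWARD = 1  # hypothetical lifespan increment; the paper gives no value


def apply_feedback(cells, recognizers, confirmed_spam):
    """Update the detector population after the user reacts to a spam verdict.

    cells          -- current list of Cell objects
    recognizers    -- (cell, affinity) pairs for the cells that flagged the e-mail
    confirmed_spam -- True if the user deleted the e-mail from the spam folder
    """
    if confirmed_spam:
        for cell, _ in recognizers:
            cell.lifespan += REWARD          # reward every recognizing cell
        if recognizers:
            best, _ = max(recognizers, key=lambda pair: pair[1])
            best.is_memory = True            # promote the highest-affinity cell
        return cells
    # The user moved the e-mail back to the inbox: drop the cells that misfired.
    misfired = {id(cell) for cell, _ in recognizers}
    return [c for c in cells if id(c) not in misfired]


def purge_dead(cells):
    """Cell death: keep only cells whose lifespan is still positive (assumed rule)."""
    return [c for c in cells if c.lifespan > 0]
```

Returning a new list on a false alarm, rather than mutating in place, keeps the caller's view of the population explicit; either choice would fit the description in the text.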
Figure 1: AISSF processing cycle
Another adaptive learning aspect is that each time a new AG is tested, the classification results update the training set, either by adding new memory cell antibodies or by removing harmful
unused antibodies, if any exist. Training set elements keep information about the antibodies' lifespans, which is used throughout the operation of the system. Figure 1 shows the behavioral flow of the proposed model. The algorithm consists of two main phases, training and testing. Training uses the negative selection method and results in a training set of spam e-mails only. Testing has two consecutive tasks: classifying the new e-mail and updating the system libraries according to the user's feedback. The high-level system modules are expressed in the following pseudo code.
AISSF algorithm
    Train (spam emails, non-spam emails)
    Convert the new email to antigen (ag)
    Test (ag)
    If the ag is non-spam then
        Pass the ag to the user inbox
        If the user moved the ag to spam folder then
            Add the ag to the antibodies
    Else
        Move the ag to spam folder
        If the user removed the ag from spam folder then
            Reward the antibodies that recognized the ag
        If the user moved the ag to user inbox then
            Delete the antibodies that recognized the ag falsely
Method train (spam emails, non-spam emails)
    Convert each training element to antibody (ab) format
    Choose random set of training elements to be memory cells (mc)
    Convert each self element to antigen (ag) format
    Generate gene library
    Set the initial lifespan value of each ab
    Set the initial lifespan value of each mc
    AB_SET = Negative selection (training elements, self elements)
Method affinity (Ab1, Ab2)
    Aff = 0
    Foreach (word in Ab1)
        Foreach (word in Ab2)
            If (Ab1.word = Ab2.word)
                Aff = Aff + (Ab1.word.weight * Ab2.word.weight)
    Return Aff / No. of Matched Words
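A runnable Python rendering of the affinity method might look like the sketch below. It assumes each cell is stored as a dictionary mapping words to generalized weights, so that the nested loops of the pseudo code reduce to an intersection of keys. Returning 0 when no words match is an assumption; the pseudo code would divide by zero in that case.

```python
def weighted_affinity(ab1, ab2):
    """Weighted affinity of two cells, each given as a {word: weight} dict."""
    matched = ab1.keys() & ab2.keys()      # words present in both cells
    if not matched:
        return 0.0                         # assumption: no shared words, no affinity
    total = sum(ab1[w] * ab2[w] for w in matched)
    return total / len(matched)            # average weight product over matches
```

Because the generalized weights lie between 0 and 1, each product and therefore the averaged affinity also lies between 0 and 1.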
Method test (ag)
    Foreach (ab in (new B-Cell AND MC))
        If (affinity (ab, ag) > threshold)
            Return ag as spam
    Return ag as non-spam
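The test method, combined with the dynamic threshold of equation 2, can be sketched as follows. The function names and the dictionary representation of cells are assumptions made for illustration, and treating the threshold as infinite when no words match (so that such a detector never fires) is also an assumption.

```python
def affinity(ab, ag):
    """Average product of weights over the words shared by ab and ag."""
    matched = ab.keys() & ag.keys()
    if not matched:
        return 0.0
    return sum(ab[w] * ag[w] for w in matched) / len(matched)


def dynamic_threshold(ab, ag):
    """Equation 2: the threshold is 1 / (number of matched words)."""
    n = len(ab.keys() & ag.keys())
    return 1.0 / n if n else float("inf")   # assumption: no match never fires


def classify(ag, detectors):
    """Return 'spam' as soon as any detector recognizes the antigen."""
    for ab in detectors:                    # B-cells and memory cells alike
        if affinity(ab, ag) > dynamic_threshold(ab, ag):
            return "spam"
    return "non-spam"
```

One consequence of equation 2 is visible here: with a single shared word the threshold is 1, which an affinity of at most 1 cannot exceed, so a detector needs at least two matched words to flag an e-mail.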
Training
Training generates the gene library, which contains all the words found in the antibodies. The gene library is used in the clonal selection and antibody caching processes. The training elements classified as spam e-mails (non-self) are chosen, and a random number of them are designated as memory cells. An initial lifespan is then set for each of them. A random set of non-spam e-mails (self) is selected as well. The negative selection method is applied to both sets. Any non-self (spam)
element that recognizes any self cell is deleted from the training elements. The remaining elements are used as antibodies in the testing phase.
Weighted Affinity
Affinity, a value ranging from 0 to 1, measures the similarity between an AB and an AG. If the affinity is greater than a threshold, the antibody recognizes the antigen. Consider two cells AB1 = {(Xi, Wi), (Xi+1, Wi+1), ..., (Xn, Wn)} and AB2 = {(Xj, Wj), (Xj+1, Wj+1), ..., (Xn, Wn)}, where Xi and Xj represent the words and Wi and Wj represent the weights of these words. If, for example, Xi matches Xj, then their normalized weights Wi and Wj are used to calculate the weighted affinity as shown in formula 3. The weighted affinity is based on the words matched between the two cells, so the double sum runs only over the pairs (i, j) for which Xi matches Xj:
$$\text{Weighted\_Affinity} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} W_i \cdot W_j}{\text{no. of matched words}} \tag{3}$$
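With the weighted affinity defined, the negative selection step used in training can be sketched as below: every spam-derived candidate that recognizes any self (non-spam) element is dropped. This is an illustrative sketch; the helper `recognizes` combines formula 3 with the dynamic threshold of equation 2, and the dictionary representation of cells is an assumption.

```python
def recognizes(ab, cell):
    """True if ab's weighted affinity with cell exceeds 1 / (matched words)."""
    matched = ab.keys() & cell.keys()
    if not matched:
        return False
    aff = sum(ab[w] * cell[w] for w in matched) / len(matched)
    return aff > 1.0 / len(matched)


def negative_selection(spam_cells, self_cells):
    """Keep only the spam-derived cells that recognize no self cell."""
    return [
        ab for ab in spam_cells
        if not any(recognizes(ab, s) for s in self_cells)
    ]
```

The surviving list plays the role of AB_SET in the train method; candidates that overlap strongly with legitimate mail never become detectors, which is what keeps the detector count low.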
Testing
Once training is complete, the system can be tested against unclassified e-mails. In the testing process, the AG to be tested is first preprocessed like an AB, as described earlier. The AG is then checked against both the memory cells and the antibodies in the training set. The check evaluates the affinity between the AG and each memory cell or AB. If a memory cell or an AB recognizes the AG, it is classified as spam; otherwise, it is classified as non-spam. The update process depends on the user's feedback on the correctness of the classification. The system responds either to new e-mails to be tested or to user feedback, and the necessary action is invoked accordingly.
Clonal selection
The clonal selection module simulates the clonal selection function of the natural immune system. The algorithm computes the affinity between an AB and an AG. It then generates a number of clones proportional to the affinity, and the number of mutations applied is inversely proportional to the affinity. Clonal selection is used in the update process after the testing phase, when an AB or memory cell recognizes an AG as spam and the user has confirmed this recognition. CLONALG [6] is the algorithm used for clonal selection.
Negative selection
The negative selection behavior is very similar to that of the natural immune system. Its importance is that any AB in the AB set that recognizes any cell in the self set is removed, which prevents antibodies from recognizing self cells (non-spam) and classifying them as spam.
4. Results and Discussion
The public corpora used in training and testing are called PU [14]. The PU collection contains four corpora named PU1, PU2, PU3, and PUA, which hold the private mailboxes of four different users in encrypted form. Attachments, HTML tags and mail headers other than the subject have been removed from the messages. In evaluating the classification accuracy of the proposed model, the four scenarios shown in figure 2 have been considered.
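(1) NS→NS: Non-spam classified as non-spam
(2) NS→S: Non-spam classified as spam
(3) S→NS: Spam classified as non-spam
(4) S→S: Spam classified as spam
Figure 2: Classification scenarios for spam filtering problem
Cases 1 and 4 represent optimal responses, while cases 2 and 3 are misclassifications. The common evaluation measures, precision, recall, and accuracy, are defined as follows [9], where each term denotes the number of e-mails in the corresponding scenario:
$$\text{Spam Precision} = \frac{n_{S \to S}}{n_{NS \to S} + n_{S \to S}} \tag{4}$$
$$\text{Spam Recall} = \frac{n_{S \to S}}{n_{S \to NS} + n_{S \to S}} \tag{5}$$
$$\text{Accuracy} = \frac{n_{NS \to NS} + n_{S \to S}}{n_{NS \to NS} + n_{NS \to S} + n_{S \to NS} + n_{S \to S}} \tag{6}$$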
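The three measures can be computed directly from the counts of the four scenarios, as in the short Python sketch below. The function name, the argument names, and the choice to return 0 for an empty denominator are assumptions made for illustration.

```python
def spam_metrics(ns_ns, ns_s, s_ns, s_s):
    """Return (precision, recall, accuracy) from the four scenario counts.

    ns_ns -- non-spam classified as non-spam
    ns_s  -- non-spam classified as spam
    s_ns  -- spam classified as non-spam
    s_s   -- spam classified as spam
    """
    def ratio(num, den):
        return num / den if den else 0.0   # assumption: 0 when undefined

    precision = ratio(s_s, ns_s + s_s)                        # equation 4
    recall = ratio(s_s, s_ns + s_s)                           # equation 5
    accuracy = ratio(ns_ns + s_s, ns_ns + ns_s + s_ns + s_s)  # equation 6
    return precision, recall, accuracy
```

With hypothetical counts of 90, 10, 20 and 80 for the four scenarios, precision is 80/90, recall is 80/100 and accuracy is 170/200.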
Here, precision is the percentage of predicted spam that is actually spam, recall is the percentage of true spam that is predicted correctly, and accuracy is the percentage of all e-mails that the system classifies correctly. In addition, two important measures must be considered when evaluating a spam filtering system: false positives and false negatives. The false positive rate is the percentage of non-spam wrongly identified as spam. The false negative rate is the percentage of spam wrongly identified as non-spam [10]. A spam filter should have as low a false positive rate as possible.
4.1 Experimental Results
The system was trained with 15% of the spam messages and 10% of the non-spam messages of the four corpora. The remaining messages were used for testing. The system parameters were as follows: testing and updating threshold = 1 / (no. of matched words); clonal constant = 5; mutation constant = 5; AB initial lifespan = 100; MC initial lifespan = 25. Table 1 lists the system classification accuracy for the four testing corpora. Two experiments were conducted:
1. Separate testing of each corpus on its own, directly after training.
2. Cumulative testing of the corpora (PU1 and PU2), (PU1, PU2 and PU3) and (PU1, PU2, PU3 and PUA) in sequence after training.
The results are satisfactory even for small testing sets, which indicates that the proposed model is reliable even in the early stages, when little learning has taken place.
Antibody caching
Results showed that increasing the number of cached words increases the number of recognized AGs. A cache of 30 words correctly classified 42 e-mails, 50 words correctly detected 57 e-mails, 60 words detected 71 e-mails, and 100 words detected 109 e-mails. Table 2 lists the processing times with and without antibody caching. Processing time for spam classification is clearly lower with antibody caching than without it.
Table 1: AISSF accuracy results
Corpora Name          Precision%   Recall%   Accuracy%   False Positive%   False Negative%
PU1                   99.02        97.12     98.33       1.25              0.42
PU2                   95.76        89.68     97.13       2.07              0.80
PU3                   97.66        97.53     98.06       1.00              0.94
PUA                   98.25        96.14     97.33       1.85              0.82
PU1, PU2              95.38        98.09     97.79       1.58              0.63
PU1, PU2, PU3         96.34        97.63     97.68       1.42              0.91
PU1, PU2, PU3, PUA    96.32        97.37     97.49       1.47              1.04
Table 2: AB caching time measures
S→S (true negative)   Time with AB caching (msec)   Time without AB caching (msec)
42 of 2818            3516                          5188
57 of 2818            6500                          7375
71 of 2818            7141                          9718
109 of 2818           15469                         16125
Training set sorting
Performance is further improved by sorting only half of the elements of the training set in descending order of antibody lifespan. This improves the false positive, recall, and accuracy results, because the reduced number of detectors lowers the chance of misclassifying non-spam e-mail. Precision and false negative values are affected negatively but remain satisfactory. Table 3 lists the accuracy results obtained when half of the training elements are sorted, together with the classification times with and without sorting. Sorting the training set elements reduces classification time by a factor of roughly 1.5 to 2.4, depending on the corpus.
Table 3: training set sorting time measures
Corpus   Precision   Recall   Accuracy   F.P     F.N     Classification time with sorting (sec)   Classification time without sorting (sec)
PU1      98.53       98.05    98.54      0.836   0.627   26.718                                   59.563
PU2      94.07       92.5     97.45      1.43    1.115   11.093                                   16.281
PU3      97.17       98.35    98.2       0.658   1.144   84.421                                   205.000
PUA      97.17       96.48    96.50      1.646   1.852   49.843                                   73.156
Dynamic thresholding
The use of a dynamic threshold in the proposed model has greatly enhanced the classification accuracy. In this experiment, four thresholds were compared: two static values (T3 = 0.2 and T4 = 0.3) and two dynamic thresholds calculated as follows:
T1 = 1 / (no. of matched words)
T2 = 1 / (sum of weights of AB recognized words)
Table 4: accuracy of static and dynamic thresholding
Measure          T1 %    T2 %    T3 = 0.2 %   T4 = 0.3 %
False positive   1.54    5.66    19.22        10.48
False negative   0.75    21.00   10.01        20.37
Precision        95.12   70.21   87.27        72.09
Recall           97.67   84.17   72.81        74.1
Accuracy         97.71   86.75   77.41        76.77
The system was evaluated for the four threshold options. Table 4 shows the average accuracy measures over the four corpora. Threshold T1 gave the best results for all measures: false positive, false negative, precision, recall and accuracy. The static thresholds were worse than T2 in false positive, recall, and accuracy.
4.2 AISSF vs. Naïve Bayes, Flexible Bayes, SVM, LogitBoost, and HOVOLD
In this section the AISSF results are compared with those of Naïve Bayes, Flexible Bayes, Support Vector Machine (SVM), LogitBoost [15], and HOVOLD [16]. The same PU corpora were used for all systems. 90% of each corpus is used for training and 10% for testing. Table 5 shows the precision (P), recall (R), and accuracy (ACC) results for AISSF and the other five systems. The following can be observed from the results:
AISSF shows the best results in all measures on PU3, which has the highest number of e-mails; this suggests that AISSF accuracy increases as the number of e-mails increases.
AISSF has the better result in the majority of the individual measures.
AISSF has the highest average precision and accuracy; its average recall (95.12%) is second to that of HOVOLD (96.47%).
Figure 3 shows the precision, recall, and accuracy for the four corpora using AISSF relative to the others.
AISSF vs. AISEC and Naïve Bayesian
AISSF has also been compared with AISEC [11], an immune system for e-mail classification that classifies e-mails as interesting or uninteresting. AISEC uses the header of the e-mail (sender and subject), whereas AISSF uses the body of the message in both training and testing. According to the AISEC results reported in [11], Naïve Bayes is 14.27% better than AISEC in precision, AISEC is 16.48% better than Naïve Bayes in recall, and AISEC is 1.17% better than Naïve Bayes in accuracy.
AISSF, on the other hand, is 5.6% better than Naïve Bayes in precision, 3.18% in recall, and 3.41% in accuracy. From these evaluations, it can be concluded that AISSF is better than Naïve Bayes in all three measures and better than AISEC in precision and accuracy.
Table 5: Accuracy results vs. Naïve Bayes, Flexible Bayes, SVM, LogitBoost, and HOVOLD
Corpora   Measure   AISSF%   Naïve Bayes%   Flexible Bayes%   SVM%    LogitBoost%   HOVOLD%
PU1       P         99.02    89.58          96.92             93.96   95.22         95.35
PU1       R         97.12    99.38          97.08             95.63   93.13         98.12
PU1       ACC       98.33    94.59          97.34             95.32   94.86         97.06
PU2       P         95.76    80.77          90.57             88.71   89.46         87.00
PU2       R         89.68    90             79.29             79.29   77.86         97.14
PU2       ACC       97.13    93.66          94.22             93.66   93.66         96.2
PU3       P         97.66    93.59          95.78             96.48   94.31         96.02
PU3       R         97.53    94.84          90.55             94.67   92.42         96.92
PU3       ACC       98.06    94.79          94.04             96.08   94.14         96.83
PUA       P         98.25    95.11          96.75             92.83   89.62         97.91
PUA       R         96.14    94.04          91.58             93.33   90.88         93.68
PUA       ACC       97.33    94.47          94.21             92.89   89.82         95.79
Average   P         97.67    89.76          95.005            93      92.15         94.07
Average   R         95.12    94.57          89.63             90.73   88.57         96.47
Average   ACC       97.71    94.38          94.95             94.49   93.12         96.47
Figure 3: AISSF vs. other systems
5. Conclusions
This paper has introduced an approach to spam filtering inspired by the natural immune system. Although it is trained with both spam and non-spam elements, the artificial model drops the cells that wrongly recognize new antigens during the testing process. Classification is based on the body text of e-mails rather than the header, which yields more accurate classification results. The model contributes a detector representation, a weighted affinity evaluation, and a dynamic thresholding function. The system accuracy increases as the number of tested elements increases. Compared with other algorithms, AISSF yields good results even with a small number of training elements. Using negative selection in training avoids misclassifying non-spam e-mails and decreases the initial number of detectors, which in turn enhances system performance. Performance is further enhanced by antibody caching and by sorting the detector set according to lifespan. The system has been evaluated using public corpora of spam and non-spam e-mails, and the results were encouraging.
References:
1. de Castro L. N., and Von Zuben F. J., "Artificial Immune Systems: Part I – Basic Theory and Applications". RT – DCA 01/99, December 1999.
2. de Castro L. N., "Immune Cognition, Micro-evolution, and a Personal Account on Immune Engineering". S.E.E.D. Journal (Semiotics, Evolution, Energy, and Development), University of Toronto, 3(3), 2003.
3. Zhang L., Zhong Y., and Li P., "Applications of Artificial Immune Systems in Remote Sensing Image Classification". Proceedings of 20th Congress of International Society for Photogrammetry and Remote Sensing, pp. 397-401, 2004.
4. Dasgupta D., "Advances in Artificial Immune Systems". IEEE Computational Intelligence Magazine, November 2006.
5. Oda T., and White T., "Spam Detection Using Artificial Immune System". Master's thesis, Carleton University, January 2005.
6. Zhao W., and Zhang Z., "An Email Classification Model Based on Rough Set Theory". Proceedings of the 2005 International Conference on Active Media Technology (AMT 2005), pp. 403-408, 2005.
7. Wu C., "Embedded-text detection and its application to anti-spam filtering". Master's thesis, University of California, 2005.
8. Bratko A., Cormack G., Filipič B., Lynam T., and Zupan B., "Spam Filtering Using Statistical Data Compression Models". Journal of Machine Learning Research 7 (2006), pp. 2673-2698.
9. Hsiao W., Chang T., and Hu G., "A Cluster-Based Approach to Filtering Spam under Skewed Class Distributions". IEEE, 2007.
10. Oda T., and White T., "Increasing the Accuracy of a Spam-Detecting Artificial Immune System". The 2003 Congress on Evolutionary Computation, 1:390–396, December 2003.
11. Secker A., Freitas A., and Timmis J., "AISEC: an Artificial Immune System for E-mail Classification". IEEE, Vol. 1, pp. 131-138, 2003.
12. Sarafijanovic S., and Le Boudec J., "Artificial Immune System for Collaborative Spam Filtering". Technical Report LCA-REPORT-2007-008, EPFL, September 2007.
13. Yue X., Abraham A., Chi Z., Hao Y., and Mo H., "Artificial Immune System Inspired Behavior Based Anti-Spam Filter". Soft Computing - A Fusion of Foundations, Methodologies and Applications, Springer-Verlag, 2007.
14. PU public corpora hosted in http://www.iit.demokritos.gr/skel/i-config/
15. Androutsopoulos I., Paliouras G., and Michelakis E., "Learning to Filter Unsolicited Commercial E-Mail". NCSR "Demokritos" Technical Report, No. 2004/2, March 2004.
16. Hovold J., "Naive Bayes Spam Filtering Using Word-Position-Based Attributes". In 2nd Conference on Email and Anti-Spam, Stanford, CA, 2005.