Advances in Spam Filtering Techniques

Tiago A. Almeida and Akebo Yamakami

Abstract Nowadays e-mail spam is not a novelty, but it is still an important and growing problem with a big economic impact on society. Fortunately, there are different approaches able to automatically detect and remove most of those messages, and the best-known ones are based on machine learning techniques, such as Naïve Bayes classifiers and Support Vector Machines. However, there are several different models of Naïve Bayes filters, something the spam literature does not always acknowledge. In this work, we present and compare seven different versions of Naïve Bayes classifiers, the well-known linear Support Vector Machine, and a new method based on the Minimum Description Length principle. Furthermore, we have conducted an empirical experiment on six public and real non-encoded datasets. The results indicate that the proposed filter is fast to construct, incrementally updateable, and clearly outperforms the state-of-the-art spam filters.

1 Introduction

E-mail is one of the most popular, fastest and cheapest means of communication. It has become a part of everyday life for millions of people, changing the way we work and collaborate. The downside of such a success is the constantly growing volume of e-mail spam we receive. The term spam is generally used to denote an unsolicited commercial e-mail. Spam messages are annoying to most users because they clutter their mailboxes. The problem can also be quantified in economic terms, since many hours are wasted every day by workers. It is not just the time they waste reading the spam but also the time they spend removing those messages.

Tiago A. Almeida and Akebo Yamakami
School of Electrical and Computer Engineering, University of Campinas – UNICAMP, 13081-970, Campinas, São Paulo, Brazil, e-mail: {tiago,akebo}@dt.fee.unicamp.br




The amount of spam is frightfully increasing. The average number of spam messages sent per day increased from 2.4 billion in 2002 (see http://www.spamlaws.com/spam-stats.html) to 300 billion in 2010 (see www.ciscosystems.cd/en/US/prod/collateral/cisco_2009_asr.pdf), representing more than 90% of all incoming e-mail. On a worldwide basis, the total cost of dealing with spam was estimated to rise from US$ 20.5 billion in 2003 to US$ 198 billion in 2009.

Many methods have been proposed to automatically classify messages as spam or legitimate. Among all techniques, machine learning algorithms have achieved the most success [8]. These methods include approaches that are considered top performers in text categorization, like Support Vector Machines and Naïve Bayes classifiers. A relatively recent method for inductive inference which is still rarely employed in text categorization tasks is the Minimum Description Length principle. It states that the best explanation, given a limited set of observed data, is the one that yields the greatest compression of the data [7, 11, 22].

In this work, we present a spam filtering approach based on the Minimum Description Length principle and compare its performance with seven different models of Naïve Bayes classifiers and the linear Support Vector Machine. We carry out an evaluation with the practical purpose of filtering e-mail spam, in order to review and compare the current top-performing spam filters. We have conducted an empirical experiment using six well-known, large, and public databases, and the reported results indicate that our approach outperforms currently established spam filters.

Separate pieces of this work were presented at IEEE ICMLA 2009 [2], ACM SAC 2010 [3, 4] and IEEE IJCNN 2010 [1]. Here, we have connected all these ideas in a consistent way, offered many more details about each study, and significantly extended the performance evaluation.

The remainder of this chapter is organized as follows. Section 2 presents the basic concepts regarding the main spam filtering techniques. In Section 3, we describe a new approach based on the Minimum Description Length principle. Section 4 presents details of the Naïve Bayes algorithms applied in the spam filtering domain. The linear Support Vector Machine classifier is described in Section 5. Experimental results are shown in Section 6. Finally, Section 7 offers conclusions and directions for future work.

2 Basic concepts

In general, the machine learning algorithms applied to spam filtering can be summarized as follows. Given a set of messages M = {m_1, m_2, ..., m_j, ..., m_{|M|}} and a category set C = {spam (c_s), legitimate (c_l)}, where m_j is the j-th mail in M and C is the set of possible labels, the task of automated spam filtering consists in building a Boolean categorization function Ω(m_j, c_i) : M × C → {True, False}.




When Ω(m_j, c_i) is True, it indicates that message m_j belongs to category c_i; otherwise, m_j does not belong to c_i.

In the setting of spam filtering there exist only two category labels: spam and legitimate (also called ham). Each message m_j ∈ M can only be assigned to one of them, but not to both. Therefore, we can use a simplified categorization function Ω_spam(m_j) : M → {True, False}. Hence, a message is classified as spam when Ω_spam(m_j) is True, and as legitimate otherwise.

The application of supervised machine learning algorithms to spam filtering consists of two stages:

1. Training. A set of labeled messages (M) must be provided as training data, which are first transformed into a representation that can be understood by the learning algorithms. The most commonly used representation for spam filtering is the vector space model, in which each document m_j ∈ M is transformed into a real vector x_j ∈ R^{|Φ|}, where Φ is the vocabulary (feature set) and the coordinates of x_j represent the weight of each feature in Φ (illustrated in the sketch after this list). Then, we can run a learning algorithm over the training data to create a classifier Ω_spam(x_j) → {True, False}.
2. Classification. The classifier Ω_spam(x_j) is applied to the vector representation of a message x to produce a prediction of whether x is spam or not.
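For concreteness, the sketch below illustrates the representation step with binary (presence/absence) features, the weighting used by most of the filters discussed in this chapter. It is our own minimal illustration, with a naive whitespace tokenizer; Section 3.1 discusses tokenization proper.

```python
from typing import Dict, List

def build_vocabulary(messages: List[str]) -> Dict[str, int]:
    """Map each distinct term in the training messages to a vector coordinate."""
    vocab: Dict[str, int] = {}
    for text in messages:
        for term in set(text.split()):
            vocab.setdefault(term, len(vocab))
    return vocab

def to_binary_vector(text: str, vocab: Dict[str, int]) -> List[int]:
    """Represent a message as a binary vector over the vocabulary (term present or not)."""
    x = [0] * len(vocab)
    for term in set(text.split()):
        if term in vocab:
            x[vocab[term]] = 1
    return x
```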

3 Spam filter based on a data compression model

The Minimum Description Length (MDL) principle is a formalization of Occam's Razor in which the best hypothesis for a given set of data is the one that yields the most compact representation. The traditional MDL principle states that the preferred model is the one resulting in the shortest description of the model plus the data given this model. In other words, the model that best compresses the data is selected. This model selection criterion naturally balances the complexity of the model and the degree to which the model fits the data. The principle was first introduced by Rissanen [22] and has become an important concept in information theory.

The main idea behind the MDL principle can be presented as follows. Let Z be a finite or countable set and let P be a probability distribution on Z. Then there exists a prefix code C for Z such that for all z ∈ Z, L_C(z) = ⌈−log_2 P(z)⌉. C is called the code corresponding to P. Similarly, let C′ be a prefix code for Z. Then there exists a (possibly defective) probability distribution P′ such that for all z ∈ Z, L_{C′}(z) = −log_2 P′(z). P′ is called the probability distribution corresponding to C′. Thus, a large probability according to P means a small code length according to the code corresponding to P, and vice versa [7, 11, 22].

The goal of statistical inference may be cast as trying to find regularity in the data, and regularity may be identified with the ability to compress. MDL combines these two insights by viewing learning as data compression: it tells us that, for a given set of hypotheses H and data set D, we should try to find the hypothesis or combination of hypotheses in H that compresses D most [7, 11, 22].
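As a small worked instance of this correspondence between probabilities and code lengths:

```latex
% An event with probability 1/8 receives a 3-bit codeword, and conversely
% a 3-bit codeword corresponds to a probability of 2^{-3} = 1/8.
\[
P(z) = \tfrac{1}{8} \;\Rightarrow\; L_C(z) = \big\lceil -\log_2 \tfrac{1}{8} \big\rceil = 3 \text{ bits},
\qquad
L_{C'}(z) = 3 \;\Rightarrow\; P'(z) = 2^{-3} = \tfrac{1}{8}.
\]
```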



In essence, compression algorithms can be applied to text categorization by building one compression model from the training documents of each class and using these models to evaluate the target document. In this way, given a set of pre-classified training messages M, the task is to assign a target e-mail m with an unknown label to one of the classes c ∈ {spam, ham}. The method measures the increase of the description length of the data set caused by the addition of the target document, and then chooses the class for which this increase is minimal.

Assuming that each class (model) c is the sequence of terms extracted from the messages inserted into the training set, each term (token) t of m has a code length L_t based on the sequence of terms present in the training messages of c. The length of m when assigned to class c corresponds to the sum of the code lengths associated with each of its terms, L_m = ∑_{i=1}^{|m|} L_{t_i}, where L_{t_i} = ⌈−log_2 P_{t_i}⌉ and P is a probability distribution related to the terms of the class. Let n_c(t_i) be the number of times t_i appears in messages of class c; then the probability that any term belongs to c is given by the smoothed maximum likelihood estimate

\[
P_{t_i} = \frac{n_c(t_i) + \frac{1}{|\Phi|}}{n_c + 1},
\]

where n_c corresponds to the sum of n_c(t_i) over all terms that appear in messages belonging to c, and |Φ| is the vocabulary size. In this work, we assume that |Φ| = 2^{32}; that is, each term in uncompressed form is a symbol of 32 bits. This estimate reserves a "portion" of the probability mass for words the classifier has never seen before.

Basically, the MDL spam filter classifies a message by the following steps (a minimal implementation is sketched after this list):

1. Tokenization: the classifier extracts all terms of the new message m = {t_1, ..., t_{|m|}};
2. Computation of the increase of the description length when m is assigned to each class c ∈ {spam, ham}:

\[
L_m(\mathrm{spam}) = \sum_{i=1}^{|m|} \left\lceil -\log_2 \frac{n_{\mathrm{spam}}(t_i) + \frac{1}{|\Phi|}}{n_{\mathrm{spam}} + 1} \right\rceil,
\qquad
L_m(\mathrm{ham}) = \sum_{i=1}^{|m|} \left\lceil -\log_2 \frac{n_{\mathrm{ham}}(t_i) + \frac{1}{|\Phi|}}{n_{\mathrm{ham}} + 1} \right\rceil;
\]

3. Decision: if L_m(spam) < L_m(ham), then m is classified as spam; otherwise, m is labeled as ham, so the class whose model compresses the message best wins;
4. Training.

In the following, we offer more details about steps 1 and 4.
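The sketch below condenses steps 1-4 into working code. It is our own minimal illustration of the scheme described above, not the authors' implementation; the class models are plain term-frequency tables updated incrementally, exactly as the description-length estimate requires.

```python
import math
from collections import Counter

VOCAB_SIZE = 2 ** 32  # |Phi|: each uncompressed term is a 32-bit symbol

class MDLModel:
    """Term frequencies of one class (spam or ham), updated incrementally."""

    def __init__(self):
        self.counts = Counter()  # n_c(t_i)
        self.total = 0           # n_c

    def description_length(self, terms):
        """Increase of the description length when `terms` joins this class."""
        return sum(
            math.ceil(-math.log2((self.counts[t] + 1.0 / VOCAB_SIZE) / (self.total + 1)))
            for t in terms
        )

    def train(self, terms):
        self.counts.update(terms)
        self.total += len(terms)

def classify(terms, spam: MDLModel, ham: MDLModel) -> str:
    # Choose the class whose model yields the smaller length increase,
    # i.e. the model that compresses the message best.
    return "spam" if spam.description_length(terms) < ham.description_length(terms) else "ham"
```

Note that an unseen term costs exactly 32 bits when the model is empty, matching the assumption that an uncompressed term is a 32-bit symbol.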



3.1 Preprocessing and tokenization

We did not perform language-specific preprocessing techniques such as word stemming, stop-word removal, or case folding, since other works have found that such techniques tend to hurt spam-filtering accuracy [8, 21, 31].

Tokenization is the first stage in the classification pipeline; it involves breaking the text stream into terms ("words"), usually by means of a regular expression. We consider in this work that terms start with a printable character, followed by any number of alphanumeric characters, excluding dots, commas, and colons from the middle of the pattern. With this pattern, domain names and mail addresses are split at dots, so the classifier can recognize a domain even if subdomains vary [27]. As proposed by Drucker et al. [9] and Metsis et al. [21], we do not consider the number of times a term appears in each message: each term is counted only once per message in which it appears.
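Since the chapter does not give the exact regular expression, the following is one plausible reading of the pattern just described:

```python
import re

# A printable, non-whitespace ASCII first character followed by alphanumerics;
# dots, commas and colons cannot occur inside a term, so domain names and
# mail addresses split at dots.
TOKEN_RE = re.compile(r"[!-~][A-Za-z0-9]*")

def tokenize(text: str) -> set:
    """Extract the set of terms: each term counts only once per message.
    Tokens made purely of punctuation are discarded."""
    return {t for t in TOKEN_RE.findall(text) if any(ch.isalnum() for ch in t)}
```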

3.2 Training method

Spam filters generally build their prediction models by learning from examples. A basic training method is to start with an empty model, classify each new sample, and train on it in the right class if the classification is wrong. This is known as train on error (TOE). An improvement to this method is to train also when the classification is right but the score is near the decision boundary, that is, train on or near error (TONE). This method is also called thick-threshold training [27].

The advantage of TONE over TOE is that it accelerates the learning process by exposing the filter to additional hard-to-classify samples in the same training period. Therefore, we employ TONE as the training method of the proposed MDL anti-spam filter.

A good point of the MDL classifier is that we can start with an empty training set and, according to the user feedback, the classifier builds the models for each class. Moreover, it is not necessary to keep the messages used for training, since the models are incrementally built from the term frequencies.
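Building on the MDLModel sketch from the previous section, a TONE update could look as follows. The chapter does not specify how "near the boundary" is measured, so the relative-margin test below is our assumption:

```python
def tone_update(terms, true_label, spam, ham, margin=0.1):
    """Train-on-or-near-error: retrain on the true class when the prediction
    is wrong, or right but too close to the decision boundary."""
    l_spam = spam.description_length(terms)
    l_ham = ham.description_length(terms)
    predicted = "spam" if l_spam < l_ham else "ham"
    near_boundary = abs(l_spam - l_ham) <= margin * max(l_spam, l_ham, 1)
    if predicted != true_label or near_boundary:
        (spam if true_label == "spam" else ham).train(terms)
```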

4 Naïve Bayes spam filters

Probabilistic classifiers were historically the first proposed filters. From Bayes' theorem and the theorem of total probability, the probability that a message with vector x = ⟨x_1, ..., x_n⟩ belongs to a category c_i ∈ {c_s, c_l} is:

\[
P(c_i|x) = \frac{P(c_i) \, P(x|c_i)}{P(x)}.
\]



Since the denominator does not depend on the category, the Naïve Bayes (NB) filter classifies each message in the category that maximizes P(c_i) P(x|c_i). In the spam filtering domain this is equivalent to classifying a message as spam (c_s) whenever

\[
\frac{P(c_s) \, P(x|c_s)}{P(c_s) \, P(x|c_s) + P(c_l) \, P(x|c_l)} > T,
\]

with T = 0.5. By varying T, we can opt for more true negatives (legitimate messages correctly classified) at the expense of fewer true positives (spam messages correctly classified), or vice versa.

The a priori probabilities P(c_i) can be estimated as the frequency of training documents in M belonging to category c_i, whereas P(x|c_i) is practically impossible to estimate directly, because we would need some messages in M identical to the one we want to classify. However, the NB classifier makes the simple assumption that the terms in a message are conditionally independent and that the order in which they appear is irrelevant. The probabilities P(x|c_i) are estimated differently in each NB model. Despite the fact that its independence assumption is usually oversimplistic, several studies have found the NB classifier to be surprisingly effective in the spam filtering task [5, 18].

NB classifiers are the ones most often employed in proprietary and open-source systems proposed for spam filtering [18, 21, 28]. However, there are different models of Naïve Bayes filters, something the spam literature does not always acknowledge. In the following, we describe seven different models of NB spam filter available in the literature.

4.1 Basic Naïve Bayes

We call Basic NB the first NB spam filter, proposed by Sahami et al. [23]. Let Φ = {t_1, ..., t_n} be the set of terms; each message m is represented as a binary vector x = ⟨x_1, ..., x_n⟩, where each x_k indicates whether or not t_k occurs in m. The probabilities P(x|c_i) are calculated by:

\[
P(x|c_i) = \prod_{k=1}^{n} P(t_k|c_i),
\]

and the criterion for classifying a message as spam is:

\[
\frac{P(c_s) \prod_{k=1}^{n} P(t_k|c_s)}{\sum_{c_i \in \{c_s, c_l\}} P(c_i) \prod_{k=1}^{n} P(t_k|c_i)} > T.
\]

Here, the probabilities P(t_k|c_i) are estimated by:

\[
P(t_k|c_i) = \frac{|M_{t_k,c_i}|}{|M_{c_i}|},
\]

where |M_{t_k,c_i}| is the number of training messages of category c_i that contain the term t_k, and |M_{c_i}| is the total number of training messages that belong to category c_i.
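A minimal sketch of training and classification for this variant, working in log space to avoid underflow. It is our illustration, interpreting the product as ranging over the terms present in the message; note that the unsmoothed estimate can produce zero probabilities, handled here as negative infinity.

```python
import math

def basic_nb_train(messages):
    """messages: iterable of (terms_set, label) with label in {'spam', 'ham'}.
    Returns per-class document counts, as needed by the estimates above.
    Assumes both classes occur at least once in the training data."""
    docs = {"spam": 0, "ham": 0}
    term_docs = {"spam": {}, "ham": {}}
    for terms, label in messages:
        docs[label] += 1
        for t in terms:
            term_docs[label][t] = term_docs[label].get(t, 0) + 1
    return docs, term_docs

def basic_nb_is_spam(terms, docs, term_docs):
    """Compare P(c_s) * prod_k P(t_k|c_s) against the ham counterpart in log
    space, which is equivalent to the criterion above with T = 0.5."""
    total = docs["spam"] + docs["ham"]
    score = {}
    for c in ("spam", "ham"):
        s = math.log(docs[c] / total)
        for t in terms:
            p = term_docs[c].get(t, 0) / docs[c]
            s += math.log(p) if p > 0 else float("-inf")
        score[c] = s
    return score["spam"] > score["ham"]
```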

4.2 Multinomial term frequency Naïve Bayes

The multinomial term frequency NB (MN TF NB) represents each message as a bag of terms m = {t_1, ..., t_n}, recording how many times each t_k appears in m. In this sense, m can be represented by a vector x = ⟨x_1, ..., x_n⟩, where each x_k corresponds to the number of occurrences of t_k in m. Moreover, each message m of category c_i can be interpreted as the result of picking |m| terms independently from Φ with replacement, with probability P(t_k|c_i) for each t_k [20]. Hence, P(x|c_i) is the multinomial distribution:

\[
P(x|c_i) = P(|m|) \, |m|! \prod_{k=1}^{n} \frac{P(t_k|c_i)^{x_k}}{x_k!}.
\]

Thus, the criterion for classifying a message as spam becomes:

\[
\frac{P(c_s) \prod_{k=1}^{n} P(t_k|c_s)^{x_k}}{\sum_{c_i \in \{c_s, c_l\}} P(c_i) \prod_{k=1}^{n} P(t_k|c_i)^{x_k}} > T,
\]

and the probabilities P(t_k|c_i) are estimated with a Laplacian prior:

\[
P(t_k|c_i) = \frac{1 + N_{t_k,c_i}}{n + N_{c_i}},
\]

where N_{t_k,c_i} is the number of occurrences of term t_k in the training messages of category c_i, and N_{c_i} = ∑_{k=1}^{n} N_{t_k,c_i}.
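A sketch of the corresponding score computation (our illustration). Because the factors P(|m|) · |m|! / ∏ x_k! are identical for the two categories, they cancel in the criterion above and can be omitted:

```python
import math
from collections import Counter

def mn_tf_log_score(term_counts: Counter, c, N, N_sum, prior, vocab_size):
    """Unnormalized log-score of category c under the multinomial TF model.
    N[c][t] = occurrences of t in training messages of c; N_sum[c] = their sum."""
    s = math.log(prior[c])
    for t, x in term_counts.items():
        p = (1 + N[c].get(t, 0)) / (vocab_size + N_sum[c])  # Laplacian prior
        s += x * math.log(p)
    return s
```

Comparing the spam score against the ham score reproduces the criterion with T = 0.5.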

4.3 Multinomial Boolean Naïve Bayes

The multinomial Boolean NB (MN Boolean NB) is similar to the MN TF NB, including the estimates of P(t_k|c_i), except that each attribute x_k is Boolean. Note that these approaches do not take into account the absence of terms (x_k = 0) from the messages. Schneider [24] demonstrated that MN Boolean NB may perform better than MN TF NB. This is because the multinomial NB with term-frequency attributes is equivalent to an NB version with attributes modeled as following Poisson distributions in each category, assuming that the message length is independent of the category. Therefore, the multinomial NB may achieve better performance with Boolean attributes if the term frequencies do not actually follow Poisson distributions.



4.4 Multivariate Bernoulli Naïve Bayes

Let Φ = {t_1, ..., t_n} be the set of terms. The multivariate Bernoulli NB (MV Bernoulli NB) represents each message m by registering both the presence and the absence of each term. Therefore, m can be represented as a binary vector x = ⟨x_1, ..., x_n⟩, where each x_k indicates whether or not t_k occurs in m. Moreover, each message m of category c_i is seen as the result of n Bernoulli trials, where at each trial we decide whether or not t_k will appear in m. The probability of a positive outcome at trial k is P(t_k|c_i). Then, the probabilities P(x|c_i) are computed by:

\[
P(x|c_i) = \prod_{k=1}^{n} P(t_k|c_i)^{x_k} \, (1 - P(t_k|c_i))^{(1-x_k)}.
\]

The criterion for classifying a message as spam becomes:

\[
\frac{P(c_s) \prod_{k=1}^{n} P(t_k|c_s)^{x_k} (1 - P(t_k|c_s))^{(1-x_k)}}{\sum_{c_i \in \{c_s, c_l\}} P(c_i) \prod_{k=1}^{n} P(t_k|c_i)^{x_k} (1 - P(t_k|c_i))^{(1-x_k)}} > T,
\]

and the probabilities P(t_k|c_i) are estimated with a Laplacian prior:

\[
P(t_k|c_i) = \frac{1 + |M_{t_k,c_i}|}{2 + |M_{c_i}|},
\]

where |M_{t_k,c_i}| is the number of training messages of category c_i that contain the term t_k, and |M_{c_i}| is the total number of training messages of category c_i. For further theoretical details, consult Metsis et al. [21] and Losada and Azzopardi [17].
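The distinguishing feature of this model, compared with the multinomial ones, is that every vocabulary term contributes to the likelihood, whether present or not. A sketch of the likelihood computation (our illustration):

```python
import math

def bernoulli_log_likelihood(present, vocab, p):
    """log P(x|c) under the MV Bernoulli model. `present` is the set of terms
    occurring in the message; p[t] is the Laplace-smoothed estimate
    (1 + |M_t,c|) / (2 + |M_c|). Absent terms contribute (1 - P(t|c))."""
    s = 0.0
    for t in vocab:
        s += math.log(p[t]) if t in present else math.log(1.0 - p[t])
    return s
```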

4.5 Boolean Naïve Bayes

We denote as Boolean NB the classifier similar to the MV Bernoulli NB, with the difference that it does not take into account the absence of terms. Hence, the probabilities P(x|c_i) are estimated only by:

\[
P(x|c_i) = \prod_{k=1}^{n} P(t_k|c_i),
\]

and the criterion for classifying a message as spam becomes:

\[
\frac{P(c_s) \prod_{k=1}^{n} P(t_k|c_s)}{\sum_{c_i \in \{c_s, c_l\}} P(c_i) \prod_{k=1}^{n} P(t_k|c_i)} > T,
\]

where the probabilities P(t_k|c_i) are estimated in the same way as in the MV Bernoulli NB.



4.6 Multivariate Gauss Naïve Bayes

The multivariate Gauss NB (MV Gauss NB) uses real-valued attributes, assuming that each attribute follows a Gaussian distribution g(x_k; μ_{k,c_i}, σ_{k,c_i}) for each category c_i, where the μ_{k,c_i} and σ_{k,c_i} of each distribution are estimated from the training set M. The probabilities P(x|c_i) are calculated by

\[
P(x|c_i) = \prod_{k=1}^{n} g(x_k; \mu_{k,c_i}, \sigma_{k,c_i}),
\]

and the criterion for classifying a message as spam becomes:

\[
\frac{P(c_s) \prod_{k=1}^{n} g(x_k; \mu_{k,c_s}, \sigma_{k,c_s})}{\sum_{c_i \in \{c_s, c_l\}} P(c_i) \prod_{k=1}^{n} g(x_k; \mu_{k,c_i}, \sigma_{k,c_i})} > T.
\]

4.7 Flexible Bayes

Flexible Bayes (FB) works similarly to MV Gauss NB. However, instead of using a single normal distribution for each attribute X_k per category c_i, FB represents the densities P(x_k|c_i) as the average of L_{k,c_i} normal distributions with different values for μ_{k,c_i} but the same σ_{c_i}:

\[
P(x_k|c_i) = \frac{1}{L_{k,c_i}} \sum_{l=1}^{L_{k,c_i}} g(x_k; \mu_{k,c_i,l}, \sigma_{c_i}),
\]

where L_{k,c_i} is the number of distinct values that attribute X_k takes in the training set M of category c_i. Each of these values is used as the μ_{k,c_i,l} of one normal distribution of the category c_i. However, all distributions of a category c_i are taken to have the same standard deviation, σ_{c_i} = 1/√|M_{c_i}|.

The distributions of each category become narrower as more training messages of that category are accumulated. By averaging several normal distributions, FB can approximate the true distributions of real-valued attributes more closely than the MV Gauss NB when the assumption that attributes follow a normal distribution is violated. For further details, consult John and Langley [13] and Androutsopoulos et al. [5].

Table 1 summarizes all NB spam filters presented in this section.




Table 1 Naïve Bayes spam filters. The last two columns give the computational complexity on training and on classification.

NB Classifier    P(x|c_i)                                                                    Training    Classification
Basic NB         ∏_{k=1}^n P(t_k|c_i)                                                        O(n·|M|)    O(n)
MN TF NB         ∏_{k=1}^n P(t_k|c_i)^{x_k}                                                  O(n·|M|)    O(n)
MN Boolean NB    ∏_{k=1}^n P(t_k|c_i)^{x_k}                                                  O(n·|M|)    O(n)
MV Bernoulli NB  ∏_{k=1}^n P(t_k|c_i)^{x_k} (1−P(t_k|c_i))^{(1−x_k)}                         O(n·|M|)    O(n)
Boolean NB       ∏_{k=1}^n P(t_k|c_i)                                                        O(n·|M|)    O(n)
MV Gauss NB      ∏_{k=1}^n g(x_k; μ_{k,c_i}, σ_{k,c_i})                                      O(n·|M|)    O(n)
Flexible Bayes   ∏_{k=1}^n (1/L_{k,c_i}) ∑_{l=1}^{L_{k,c_i}} g(x_k; μ_{k,c_i,l}, σ_{c_i})    O(n·|M|)    O(n·|M|)

Note: the computational complexities are according to Metsis et al. [21]. At classification time, the complexity of FB is O(n·|M|) because it needs to sum the L_{k,c_i} distributions.

5 Support Vector Machines

The support vector machine (SVM) is one of the most successful techniques used in text classification [8, 10]. In this method a data point is viewed as a p-dimensional vector, and the approach aims to separate such points with a (p − 1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes. Therefore, the SVM chooses the hyperplane such that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, and the linear classifier it defines is known as a maximum-margin classifier (Figure 1) [29].

Fig. 1 Maximum-margin hyperplane and margins for a SVM trained with samples from two classes.

SVMs belong to a family of generalized linear classifiers. A special property is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum-margin classifiers. For further details about the implementation of SVMs in the spam filtering domain, consult Cormack [8], Drucker et al. [9], Hidalgo [12], Kolcz and Alspector [14], Sculley and Wachman [25], Sculley et al. [26], and Liu and Cui [16].
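A hedged sketch of a linear-SVM spam filter follows; the library and the toy data are our choices, since the chapter does not prescribe an implementation. Boolean features mirror the per-message term presence used elsewhere in this chapter.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train_texts = ["cheap meds now", "meeting at noon"]  # illustrative toy data
train_labels = ["spam", "ham"]

vectorizer = CountVectorizer(binary=True)  # presence/absence features
X = vectorizer.fit_transform(train_texts)

clf = LinearSVC(C=1.0)  # C controls the margin/error trade-off
clf.fit(X, train_labels)

print(clf.predict(vectorizer.transform(["cheap meeting meds"])))
```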

6 Experimental results

We carried out this study on the six well-known, large, real, and public Enron datasets (available at http://www.iit.demokritos.gr/skel/i-config/). The corpora are composed of legitimate messages extracted from the mailboxes of six former employees of the Enron Corporation. For further details about the dataset statistics and composition, refer to Metsis et al. [21].

Tables 2, 3, 4, 5, 6, and 7 present the performance achieved by each classifier for each Enron dataset. In order to provide a fair evaluation, we consider the most important measures to be the Matthews correlation coefficient (MCC) [1-4] and the weighted accuracy rate (Accw%) [5] achieved by each filter. Additionally, we present other well-known measures, such as spam recall (Sre%), legitimate recall (Lre%), spam precision (Spr%), legitimate precision (Lpr%), and total cost ratio (TCR) [5].

It is important to note that the TCR offers an indication of the improvement provided by the filter. A greater TCR indicates better performance, and for TCR < 1, not using the filter is better. The MCC, in turn, returns a real value between −1 and +1. A coefficient equal to +1 indicates a perfect prediction; 0, an average random prediction; and −1, an inverse prediction. It can be calculated using the following equation [6, 19]:

\[
\mathrm{MCC} = \frac{|TP| \cdot |TN| - |FP| \cdot |FN|}{\sqrt{(|TP|+|FP|)(|TP|+|FN|)(|TN|+|FP|)(|TN|+|FN|)}},
\]

where |TP|, |FP|, |TN|, and |FN| correspond to the number of true positives, false positives, true negatives, and false negatives, respectively.
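A direct transcription of this formula (our sketch; returning 0 for a degenerate confusion matrix is the usual convention):

```python
import math

def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient over a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return ((tp * tn) - (fp * fn)) / denom if denom else 0.0
```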

Table 2 Enron 1 – Results achieved by each filter

Measures  Basic   Bool    MN TF   MN Bool  MV Bern  MV Gauss  Flex Bayes  SVM     MDL
Sre(%)    96.00   82.00   72.00   78.67    87.33    91.33     82.67       83.33   92.00
Spr(%)    51.61   75.00   61.71   87.41    86.18    85.09     62.00       87.41   92.62
Lre(%)    63.32   88.86   81.79   95.38    94.29    93.48     79.35       95.11   97.01
Lpr(%)    97.49   92.37   87.76   91.64    94.81    96.36     91.82       93.33   96.75
Accw(%)   72.78   86.87   78.96   90.54    92.28    92.86     80.31       91.70   95.56
TCR       1.064   2.206   1.376   3.061    3.750    4.054     1.471       3.488   6.552
MCC       0.540   0.691   0.516   0.765    0.813    0.831     0.578       0.796   0.892




Table 3 Enron 2 – Results achieved by each filter

Measures  Basic   Bool    MN TF   MN Bool  MV Bern  MV Gauss  Flex Bayes  SVM     MDL
Sre(%)    95.33   75.33   65.33   62.67    68.67    80.00     74.00       90.67   91.33
Spr(%)    81.25   96.58   81.67   94.95    98.10    97.57     98.23       90.67   99.28
Lre(%)    92.45   99.08   94.97   98.86    99.54    99.31     99.54       96.80   99.77
Lpr(%)    98.30   92.13   88.87   88.52    90.25    93.53     91.77       96.80   97.10
Accw(%)   93.19   93.02   87.39   89.61    91.65    94.38     93.02       95.23   97.61
TCR       3.750   3.659   2.027   2.459    3.061    4.545     3.659       5.357   10.714
MCC       0.836   0.812   0.652   0.717    0.776    0.850     0.814       0.875   0.937

Table 4 Enron 3 – Results achieved by each filter

Measures  Basic   Bool    MN TF   MN Bool  MV Bern  MV Gauss  Flex Bayes  SVM     MDL
Sre(%)    99.33   57.33   100.00  52.67    52.00    57.33     62.00       91.33   90.00
Spr(%)    99.33   100.00  84.75   89.77    96.30    100.00    100.00      96.48   100.00
Lre(%)    99.75   100.00  93.28   97.76    99.25    100.00    100.00      98.76   100.00
Lpr(%)    99.75   86.27   100.00  84.70    84.71    86.27     87.58       96.83   96.40
Accw(%)   99.64   88.41   95.11   85.51    86.41    88.41     89.67       96.74   97.28
TCR       75.000  2.344   5.556   1.875    2.000    2.344     2.632       8.333   10.000
MCC       0.991   0.703   0.889   0.613    0.644    0.703     0.737       0.917   0.931

Table 5 Enron 4 – Results achieved by each filter

Measures  Basic   Bool    MN TF   MN Bool  MV Bern  MV Gauss  Flex Bayes  SVM     MDL
Sre(%)    98.00   93.78   98.22   94.44    94.89    94.67     96.89       98.89   97.11
Spr(%)    100.00  100.00  100.00  100.00   100.00   100.00    100.00      100.00  100.00
Lre(%)    100.00  100.00  100.00  100.00   100.00   100.00    100.00      100.00  100.00
Lpr(%)    94.34   84.27   94.94   85.71    86.71    86.21     91.46       96.77   92.02
Accw(%)   98.50   95.33   98.67   95.83    96.17    96.00     97.67       99.17   97.83
TCR       50.000  16.071  56.250  18.000   19.565   18.750    32.143      90.000  34.615
MCC       0.962   0.889   0.966   0.900    0.907    0.903     0.941       0.978   0.945

Table 6 Enron 5 – Results achieved by each filter

Measures  Basic   Bool    MN TF   MN Bool  MV Bern  MV Gauss  Flex Bayes  SVM     MDL
Sre(%)    87.23   88.86   98.10   86.68    88.86    89.67     94.29       89.40   99.73
Spr(%)    100.00  100.00  92.56   96.37    98.79    98.80     100.00      99.70   98.39
Lre(%)    100.00  100.00  80.67   92.00    97.33    97.33     100.00      99.33   96.00
Lpr(%)    76.14   78.53   94.53   73.80    78.07    79.35     87.72       79.26   99.31
Accw(%)   90.93   92.08   93.05   88.22    91.31    91.89     95.95       92.28   98.65
TCR       7.830   8.976   10.222  6.033    8.178    8.762     17.524      9.200   52.571
MCC       0.815   0.835   0.828   0.743    0.814    0.825     0.909       0.837   0.967



Table 7 Enron 6 – Results achieved by each filter

Measures  Basic   Bool    MN TF   MN Bool  MV Bern  MV Gauss  Flex Bayes  SVM     MDL
Sre(%)    66.89   76.67   96.22   92.00    89.78    86.00     92.89       89.78   98.67
Spr(%)    99.67   99.42   92.32   94.95    98.30    98.98     97.21       95.28   95.48
Lre(%)    99.33   98.67   76.00   85.33    95.33    97.33     92.00       86.67   86.00
Lpr(%)    50.00   58.50   87.02   78.05    75.66    69.86     81.18       73.86   95.56
Accw(%)   75.00   82.17   91.17   90.33    91.17    88.33     92.67       90.05   95.50
TCR       3.000   4.206   8.491   7.759    8.491    6.716     10.227      6.818   16.667
MCC       0.574   0.661   0.757   0.751    0.793    0.757     0.816       0.727   0.878

Regarding the achieved results, the MDL spam filter outperformed the other classifiers for the majority of the e-mail collections used in our empirical evaluation. It is important to realize that in some situations the MDL filter performs much better than the SVM and NB classifiers. For instance, for Enron 1 (Table 2), MDL achieved a spam recall rate of 92.00% while SVM attained 83.33%, even though MDL also presented better legitimate recall. This means that for Enron 1 MDL was able to recognize more than 8 percentage points more spam than SVM, representing an improvement of 10.40%. In a real situation, this difference would be extremely important. The same pattern can be found for Enron 2 (Table 3), Enron 5 (Table 6), and Enron 6 (Table 7). The two methods, MDL and SVM, achieved similar performance with no statistically significant difference only for Enron 3 (Table 4) and Enron 4 (Table 5).

The results indicate that the data compression model is more effective at distinguishing spam from legitimate messages. It attained an accuracy rate higher than 95% and high precision × recall rates for all datasets, indicating that the MDL classifier makes few mistakes. We also verify that the MDL classifier achieved an MCC score higher than 0.87 for all tested corpora, which indicates that the proposed filter comes close to a perfect prediction (MCC = 1.000) and is much better than not using a filter at all (MCC = 0.000).

Among the evaluated NB classifiers, the results indicate that all of them achieved similar performance with no statistically significant difference. However, they achieved lower results than MDL and the linear SVM, which attained accuracy rates higher than 90% for all Enron datasets. Moreover, in agreement with the results found by Schneider [24], in our experiments the NB filters that use real and integer attributes did not achieve better results than the Boolean ones. Metsis et al. [21] showed, however, that Flexible Bayes is less sensitive to the threshold T, which indicates that it is able to attain a high spam recall even when a high legitimate recall is required.



7 Conclusions

In this chapter, we have presented a new spam filtering approach based on the Minimum Description Length principle. We have also compared its performance with the linear Support Vector Machine and seven different models of Naïve Bayes classifiers, something the spam literature does not always acknowledge. We have conducted an empirical experiment using six well-known, large, and public databases, and the reported results indicate that the proposed classifier outperforms currently established spam filters. It is important to emphasize that the MDL spam filter achieved the best average performance for all analyzed databases, presenting an accuracy rate higher than 95% for all e-mail datasets.

We are currently conducting more experiments using larger datasets, such as the TREC05, TREC06, and TREC07 corpora [8], in order to reinforce the validation. We also intend to compare the approaches with other commercial and open-source spam filters, such as Bogofilter, SpamAssassin, and OSBF-Lua, among others.

Future work should take into consideration that spam filtering is a coevolutionary problem: while the filter tries to evolve its prediction capacity, spammers try to evolve their messages in order to overreach the classifiers. Hence, an efficient approach should have an effective way to adjust its rules in order to detect changes in spam features. Collaborative filters [15] could be used to assist the classifier by accelerating the adaptation of the rules and increasing the classifier's performance. Moreover, spammers generally insert a large amount of noise in spam messages in order to make the probability estimation more difficult. Thus, the filters should have a flexible way to compare the terms in the classification task. Approaches based on fuzzy logic [30] could be employed to make the comparison and selection of terms more flexible.

Acknowledgements The authors would like to thank J. Almeida for his very constructive suggestions and the Brazilian Coordination for the Improvement of Higher Level Personnel (Capes) for financial support.

References

[1] Almeida, T. and Yamakami, A. (2010). Content-Based Spam Filtering. In Proceedings of the 23rd IEEE International Joint Conference on Neural Networks, pages 1–7, Barcelona, Spain.
[2] Almeida, T., Yamakami, A., and Almeida, J. (2009). Evaluation of Approaches for Dimensionality Reduction Applied with Naive Bayes Anti-Spam Filters. In Proceedings of the 8th IEEE International Conference on Machine Learning and Applications, pages 517–522, Miami, FL, USA.
[3] Almeida, T., Yamakami, A., and Almeida, J. (2010a). Filtering Spams using the Minimum Description Length Principle. In Proceedings of the 25th ACM Symposium On Applied Computing, pages 1856–1860, Sierre, Switzerland.



[4] Almeida, T., Yamakami, A., and Almeida, J. (2010b). Probabilistic Anti-Spam Filtering with Dimensionality Reduction. In Proceedings of the 25th ACM Symposium On Applied Computing, pages 1804–1808, Sierre, Switzerland.
[5] Androutsopoulos, I., Paliouras, G., and Michelakis, E. (2004). Learning to Filter Unsolicited Commercial E-Mail. Technical Report 2004/2, National Centre for Scientific Research "Demokritos", Athens, Greece.
[6] Baldi, P., Brunak, S., Chauvin, Y., Andersen, C., and Nielsen, H. (2000). Assessing the Accuracy of Prediction Algorithms for Classification: An Overview. Bioinformatics, 16(5), 412–424.
[7] Barron, A., Rissanen, J., and Yu, B. (1998). The Minimum Description Length Principle in Coding and Modeling. IEEE Transactions on Information Theory, 44(6), 2743–2760.
[8] Cormack, G. (2008). Email Spam Filtering: A Systematic Review. Foundations and Trends in Information Retrieval, 1(4), 335–455.
[9] Drucker, H., Wu, D., and Vapnik, V. (1999). Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks, 10(5), 1048–1054.
[10] Forman, G., Scholz, M., and Rajaram, S. (2009). Feature Shaping for Linear SVM Classifiers. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 299–308, Paris, France.
[11] Grünwald, P. (2005). A Tutorial Introduction to the Minimum Description Length Principle. In P. Grünwald, I. Myung, and M. Pitt, editors, Advances in Minimum Description Length: Theory and Applications, pages 3–81. MIT Press.
[12] Hidalgo, J. (2002). Evaluating Cost-Sensitive Unsolicited Bulk Email Categorization. In Proceedings of the 17th ACM Symposium on Applied Computing, pages 615–620, Madrid, Spain.
[13] John, G. and Langley, P. (1995). Estimating Continuous Distributions in Bayesian Classifiers. In Proceedings of the 11th International Conference on Uncertainty in Artificial Intelligence, pages 338–345, Montreal, Canada.
[14] Kolcz, A. and Alspector, J. (2001). SVM-based Filtering of E-mail Spam with Content-Specific Misclassification Costs. In Proceedings of the 1st International Conference on Data Mining, pages 1–14, San Jose, CA, USA.
[15] Lemire, D. (2005). Scale and Translation Invariant Collaborative Filtering Systems. Information Retrieval, 8(1), 129–150.
[16] Liu, S. and Cui, K. (2009). Applications of Support Vector Machine Based on Boolean Kernel to Spam Filtering. Modern Applied Science, 3(10), 27–31.
[17] Losada, D. and Azzopardi, L. (2008). Assessing Multivariate Bernoulli Models for Information Retrieval. ACM Transactions on Information Systems, 26(3), 1–46.
[18] Marsono, M., El-Kharashi, N., and Gebali, F. (2009). Targeting Spam Control on Middleboxes: Spam Detection Based on Layer-3 E-mail Content Classification. Computer Networks, 53(6), 835–848.
[19] Matthews, B. (1975). Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme. Biochimica et Biophysica Acta, 405(2), 442–451.



[20] McCallum, A. and Nigam, K. (1998). A Comparison of Event Models for Naive Bayes Text Classification. In Proceedings of the 15th AAAI Workshop on Learning for Text Categorization, pages 41–48, Menlo Park, CA, USA.
[21] Metsis, V., Androutsopoulos, I., and Paliouras, G. (2006). Spam Filtering with Naive Bayes – Which Naive Bayes? In Proceedings of the 3rd International Conference on Email and Anti-Spam, pages 1–5, Mountain View, CA, USA.
[22] Rissanen, J. (1978). Modeling by Shortest Data Description. Automatica, 14, 465–471.
[23] Sahami, M., Dumais, S., Heckerman, D., and Horvitz, E. (1998). A Bayesian Approach to Filtering Junk E-mail. In Proceedings of the 15th National Conference on Artificial Intelligence, pages 55–62, Madison, WI, USA.
[24] Schneider, K. (2004). On Word Frequency Information and Negative Evidence in Naive Bayes Text Classification. In Proceedings of the 4th International Conference on Advances in Natural Language Processing, pages 474–485, Alicante, Spain.
[25] Sculley, D. and Wachman, G. (2007). Relaxed Online SVMs for Spam Filtering. In Proceedings of the 30th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 415–422, Amsterdam, The Netherlands.
[26] Sculley, D., Wachman, G., and Brodley, C. (2006). Spam Filtering using Inexact String Matching in Explicit Feature Space with On-Line Linear Classifiers. In Proceedings of the 15th Text REtrieval Conference, pages 1–10, Gaithersburg, MD, USA.
[27] Siefkes, C., Assis, F., Chhabra, S., and Yerazunis, W. (2004). Combining Winnow and Orthogonal Sparse Bigrams for Incremental Spam Filtering. In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 410–421, Pisa, Italy.
[28] Song, Y., Kolcz, A., and Giles, C. (2009). Better Naive Bayes Classification for High-precision Spam Detection. Software – Practice and Experience, 39(11), 1003–1024.
[29] Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, USA.
[30] Zadeh, L. (1965). Fuzzy Sets. Information and Control, 8(3), 338–353.
[31] Zhang, L., Zhu, J., and Yao, T. (2004). An Evaluation of Statistical Spam Filtering Techniques. ACM Transactions on Asian Language Information Processing, 3(4), 243–269.