Compression-based Spam Filter

SECURITY AND COMMUNICATION NETWORKS Security Comm. Networks 2012; 00:1–11 DOI: 10.1002/sec

RESEARCH ARTICLE

Tiago A. Almeida1∗ and Akebo Yamakami2

1 Department of Computer Science, Federal University of São Carlos – UFSCar, 18052–780, Sorocaba, SP – Brazil
2 School of Electrical and Computer Engineering, University of Campinas – UNICAMP, 13083–970, Campinas, SP – Brazil

ABSTRACT

Nowadays e-mail spam is not a novelty, but it is still an important problem with a high impact on the economy. Spam filtering poses a special problem in text categorization, in which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. In this paper, we present a novel approach to spam filtering based on a compression-based model. We have conducted an empirical experiment on eight public and real non-encoded datasets. The results indicate that the proposed filter is fast to construct, incrementally updateable, and clearly outperforms established spam classifiers. Copyright © 2012 John Wiley & Sons, Ltd.

KEYWORDS
Compression-based model; spam filter; text categorization; knowledge-based system; machine learning

∗ Correspondence
E-mail: [email protected]

Received . . .

1. INTRODUCTION

The term spam is used to denote unsolicited commercial e-mail. According to recent reports, the amount of spam is still increasing: the average number of spam messages sent per day grew from 2.4 billion in 2002∗ to 300 billion in 2010†. The cost in terms of lost productivity in the United States has reached US$ 21.58 billion annually, while the worldwide productivity loss is estimated at US$ 50 billion‡. On a worldwide basis, the information technology cost of dealing with junk e-mail was estimated to rise from US$ 20.5 billion in 2003 to US$ 198 billion in 2010.

In recent years, many methods have been proposed for automatic spam filtering, such as white and blacklists, challenge-response systems, rule-based approaches, methods that take into account the sender's domain, clustering, and others. However, machine learning algorithms have achieved more success [1]. These methods include approaches that are considered top performers in text categorization, such as Rocchio [2, 3], Boosting [4, 5], Naïve Bayes classifiers [6, 7, 8, 9, 10, 11], and Support Vector Machines [12, 13, 14, 15, 16]. The latter two currently appear to be the best spam filters available in the literature [1, 17]. For more details about spam filters, refer to the surveys [18, 19, 1, 20, 21, 22].

A relatively recent approach to inductive inference which is rarely employed in text categorization is the Minimum Description Length (MDL) principle. It states that the best explanation, given a limited set of observed data, is the one that yields the greatest compression of the data [23, 24, 25].

∗ See http://www.spamlaws.com/spam-stats.html
† See www.ciscosystems.cd/en/US/prod/collateral/vpndevc/cisco 2009 asr.pdf
‡ See http://www.rockresearch.com/news 020305.php

Copyright © 2012 John Wiley & Sons, Ltd. Prepared using secauth.cls [Version: 2010/06/28 v2.00]

In this paper, we present a spam classifier based on the Minimum Description Length principle and compare its performance with seven different models of the well-known Naïve Bayes classifiers and the linear Support Vector Machine. We have conducted an empirical experiment using eight well-known, large, and public databases, and the reported results indicate that our approach outperforms currently established spam filters.

Fragments of this work were previously presented at ACM SAC 2010 [26, 10]. Here, we have connected all the ideas in a consistent way, offered many more details about each study, and significantly extended the performance evaluation. To be clear: first, we present more details about the proposed method, its main features, and how it works. Second, we compare the new approach with several established classifiers instead of only two, as in the mentioned papers. Third, and most important, we evaluate the new classifier using public and larger e-mail corpora, such as TREC06 and CEAS08.

The remainder of this paper is organized as follows: Section 2 presents the general concepts regarding spam classifiers. In Section 3, we offer details about the proposed compression-based approach. The experiments and results are shown in Section 4. Finally, Section 5 offers conclusions and outlines future work.

2. GENERAL CONCEPTS

In general, the machine learning algorithms applied to spam filtering can be summarized as follows. Given a set of messages M = {m1, m2, . . . , mj, . . . , m|M|} and a category set C = {spam (cs), legitimate (cl)}, where mj is the jth e-mail in M and C is the possible label set, the task of automated spam filtering consists in building a Boolean categorization function Ψ(mj, ci) : M × C → {true, false}. When Ψ(mj, ci) is true, it indicates that message mj belongs to category ci; otherwise, mj does not belong to ci.

In the setting of spam filtering there are only two category labels: spam and legitimate (also called ham). Each message mj ∈ M can only be assigned to one of them, but not to both. Therefore, we can use a simplified categorization function Ψspam(mj) : M → {true, false}. Hence, a message is classified as spam when Ψspam(mj) is true, and as legitimate otherwise.

The application of supervised machine learning algorithms for spam filtering consists of two stages:

1. Training. A set of labeled messages M must be provided as training data, which are first transformed into a representation that can be understood by the learning algorithms. The most commonly used representation for spam filtering is the vector space model, in which each document mj ∈ M is transformed into a real vector ~xj ∈ ℜ^|Φ|, where Φ is the vocabulary (feature set) and the coordinates of ~xj represent the weight of each feature in Φ. Then, we can run a learning algorithm over the training data to create a classifier Ψspam(~xj) → {true, false}.

2. Classification. The classifier Ψspam(~xj) is applied to the vector representation of a message ~x to produce a prediction of whether ~x is spam or not.

2.1. Message representation

Each message m is composed of a set of tokens m = {t1, . . . , t|m|}, where each token tk may be a term (or word) ("viagra", for example), a set of terms ("to be removed"), or a single character ("$"). Thus, we can represent each e-mail by a vector ~x = ⟨x1, . . . , x|m|⟩, where x1, . . . , x|m| are the values of the attributes X1, . . . , X|m| associated with the tokens t1, . . . , t|m|. In the simplest case, each token represents a single term and all attributes are Boolean: Xi = 1 if the message contains ti, and Xi = 0 otherwise.

Alternatively, attributes may be integers computed from token frequencies (TF), representing how many times each token appears in the message. A third alternative is to associate each attribute Xi with a normalized TF, xi = n(ti)/|m|, where n(ti) is the number of occurrences of the token represented by Xi in m, and |m| is the length of m measured in token occurrences. Normalized TF takes into account the token repetition versus the size of the message [27].

3. COMPRESSION-BASED SPAM FILTER

Compression-based techniques seem to be a promising alternative for categorization tasks. Teahan and Harper [28] and Frank et al. [29] performed extensive experiments to evaluate the performance of different approaches to text categorization on the standard Reuters-21578 collection, comparing compression-based algorithms, such as prediction by partial matching (PPM), with Naïve Bayes classifiers and the support vector machine. According to the reported results, compression-based models perform better than word-based Naïve Bayes techniques and approach the performance of the linear support vector machine.

A relatively recent method for inductive inference which is still rarely used in text categorization is the Minimum Description Length (MDL) principle. It states that the best explanation, given a limited set of observed data, is the one that yields the greatest compression of the data [23, 24, 25].

The purpose of statistical modeling is to discover regularities in observed data. The success in finding such regularities can be measured by the length with which the data can be described. This is the rationale behind the MDL principle [23]. The fundamental idea is that any regularity in a given set of data can be used to compress the data. According to the traditional MDL principle, the preferred model results in the shortest description of the model and of the data, given this model. In other words, the model that best compresses the data is selected. This model selection criterion naturally balances the complexity of the model and the degree to which this model fits the data. The principle was first introduced by Rissanen [23] and has become an important concept in information theory.

Let Z be a finite or countable set and let P be a probability distribution on Z. Then there exists a prefix code C for Z such that for all z ∈ Z, LC(z) = ⌈−log2 P(z)⌉. C is called the code corresponding to P. Similarly, let C′ be a prefix code for Z. Then there exists a (possibly defective) probability distribution P′ such that for all z ∈ Z, −log2 P′(z) = LC′(z). P′ is called the probability distribution corresponding to C′. Thus, a large probability according to P means a small code length according to the code corresponding to P, and vice versa [23, 24, 25].

The goal of statistical inference may be cast as trying to find regularity in the data, and regularity may be identified with the ability to compress. MDL combines these two insights by viewing learning as data compression: it tells us that, for a given set of hypotheses H and data set D, we should try to find the hypothesis or combination of hypotheses in H that compresses D most [23, 24, 25].

This idea can be applied to all sorts of inductive inference problems, but it turns out to be most fruitful in problems of model selection and, more generally, in dealing with overfitting [25]. An important property of MDL methods is that they automatically and inherently protect against overfitting, and they can be used to estimate both the parameters and the structure of a model. In contrast, to avoid overfitting when estimating the structure of a model, traditional methods such as maximum likelihood must be modified and extended with additional, typically ad hoc, principles [25].

Consider the following example. Suppose we flip a coin 1,000 times and observe the numbers of heads and tails. We consider two model classes: the first consists of a code that represents each outcome with a 0 for heads or a 1 for tails. This code represents the hypothesis that the coin is fair, and its code length is always exactly 1,000 bits. The second model class consists of all codes that are efficient for a coin with some specific bias, representing the hypothesis that the coin is not fair. Say that we observe 510 heads and 490 tails. Then the code length according to the best code in the second model class is shorter than 1,000 bits, so a naive statistical method might put forward this second hypothesis as a better explanation for the data. However, in an MDL approach we would have to construct a single code based on the hypothesis; we cannot just use the best one. A simple way to do this is a two-part code, in which we first specify which element of the model class has the best performance, and then specify the data using that code. We will need quite a few bits to specify which code to use; thus, the total code length based on the second model class would be larger than 1,000 bits. Hence, if we follow an MDL approach, the conclusion has to be that there is not enough evidence in support of the hypothesis that the coin is biased, even though the best element of the second model class provides a better fit to the data. Consult Grünwald [25] for further details.

In essence, compression algorithms can be applied to text categorization by building one compression model from the training documents of each class and using these models to evaluate the target document.

3.1. The proposed spam filter

Given a set of classified training messages M, the task is to assign a target message m with an unknown label to one of the classes c ∈ {spam, ham}.
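A quick calculation makes the coin example above concrete. This is a sketch under one assumption of ours: the (1/2)·log2 n bits charged for describing the fitted bias is a common MDL convention, not a figure taken from the paper.

```python
from math import log2

def data_bits(heads, tails, p_heads):
    # Ideal code length (in bits) of the observed flips under a Bernoulli model.
    return heads * -log2(p_heads) + tails * -log2(1.0 - p_heads)

heads, tails = 510, 490

# Model 1: fair coin -> every flip costs exactly 1 bit.
fair = data_bits(heads, tails, 0.5)           # exactly 1000.0 bits

# Model 2: best biased model, fitted to the data by maximum likelihood.
best_biased = data_bits(heads, tails, 0.51)   # slightly under 1000 bits

# A two-part code must also say WHICH biased model was used. Even a frugal
# (1/2) * log2(n) parameter description (~5 bits) outweighs the tiny saving,
# so MDL keeps the fair-coin hypothesis.
model_bits = 0.5 * log2(1000)
print(fair, best_biased + model_bits)
```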

First, the method measures the increase in the description length of the data set that results from adding the target document to each class. Then, it chooses the class for which the description length increase is minimal.

In this work, we consider each class (model) c as a sequence of tokens extracted from the messages inserted into the training set. Each token t from m has a code length Lt based on the sequence of tokens present in the training messages of c. The length of m when assigned to class c corresponds to the sum of the code lengths associated with each token of m, Lm = Σ_{i=1}^{|m|} Lti. We calculate Lti = ⌈−log2 Pti⌉, where P is a probability distribution related to the tokens of the class. Let nc(ti) be the number of times that ti appears in messages of class c; then the probability that any token belongs to c is given by the smoothed maximum likelihood estimate

Pti = (nc(ti) + 1/|∆|) / (nc + 1),

where nc corresponds to the sum of nc(ti) over all tokens which appear in messages belonging to c, and |∆| is the vocabulary size. In this work, we set |∆| = 2^32; that is, each token in uncompressed form is a 32-bit symbol. This estimate reserves a "portion" of the probability mass for terms which the classifier has never seen before.

The proposed spam filter classifies a message using the following steps:

1. Tokenization: the classifier extracts all tokens of the message m = {t1, . . . , t|m|}.

2. The method calculates the increase in description length when m is assigned to each class c ∈ {spam, ham} by the following equations:

Lm(spam) = Σ_{i=1}^{|m|} ⌈−log2((nspam(ti) + 1/|∆|) / (nspam + 1))⌉

Lm(ham) = Σ_{i=1}^{|m|} ⌈−log2((nham(ti) + 1/|∆|) / (nham + 1))⌉

3. If Lm(spam) < Lm(ham), then m is classified as spam; otherwise, m is labeled as ham.

4. If necessary, the training method is called.

In the following, we offer more information regarding steps 1 and 4.

3.2. Preprocessing

Tokenization involves breaking the text stream into tokens, usually by means of a regular expression. In this work, we consider that tokens start with a printable character, followed by any number of alphanumeric characters, excluding dots, commas, and colons from the middle of the pattern. With this pattern, domain names and mail addresses are split at dots, so the classifier can recognize a domain even if subdomains vary [30]. As proposed by Drucker et al. [12] and Metsis et al. [31], we do not consider the number of times a token appears in each message; each token is counted only once per message in which it appears.

It is important to observe that we did not perform language-specific preprocessing techniques, such as word stemming, stop word removal, or case folding. However, we do employ an e-mail-specific preprocessing step before the classification stage: Jaakko Hyvatti's normalizemime§. This tool converts the character set to UTF-8, decodes Base64, Quoted-Printable, and URL encoding, and adds warning tokens in case of encoding errors. It also appends a copy of HTML/XML message bodies with most tags removed, decodes HTML entities, and limits the size of attached binary files.

3.3. Training stage

The training stage is basically responsible for updating and storing the number of times each token appears in the messages of each class. Therefore, for each message m = {t1, . . . , t|m|} to be trained, the MDL spam filter performs the following simple steps for each token ti of m:

1. Search for ti in the training database;

2. If ti is found, update the number of messages of the class of m in which ti has appeared; otherwise, insert ti into the database.

An advantage of the MDL classifier is that we can start with an empty training set and, according to the user feedback, the classifier builds the models for each class.

§ Available at http://hyvatti.iki.fi/jaakko/spam
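The classification and training steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tokenizer regex is simplified relative to the pattern described in Section 3.2, and the class structure is our own.

```python
import re
from math import log2, ceil
from collections import defaultdict

DELTA = 2 ** 32  # vocabulary size |Δ|: each unseen token is a 32-bit symbol

class MDLFilter:
    def __init__(self):
        # Per-class token occurrence counts n_c(t_i) and their totals n_c.
        self.counts = {"spam": defaultdict(int), "ham": defaultdict(int)}
        self.totals = {"spam": 0, "ham": 0}

    def tokenize(self, text):
        # Simplified version of the paper's pattern: runs of word characters.
        # Each token is counted only once per message (a set, not a list).
        return set(re.findall(r"[^\W_]+", text.lower()))

    def description_length(self, tokens, c):
        # L_m(c) = sum_i ceil(-log2((n_c(t_i) + 1/|Δ|) / (n_c + 1)))
        n_c = self.totals[c]
        return sum(
            ceil(-log2((self.counts[c].get(t, 0) + 1.0 / DELTA) / (n_c + 1)))
            for t in tokens
        )

    def classify(self, text):
        # Assign the class whose description length increases the least.
        tokens = self.tokenize(text)
        return ("spam"
                if self.description_length(tokens, "spam")
                < self.description_length(tokens, "ham")
                else "ham")

    def train(self, text, label):
        # Incremental update: the original messages need not be kept.
        for t in self.tokenize(text):
            self.counts[label][t] += 1
            self.totals[label] += 1

f = MDLFilter()
f.train("cheap viagra online now", "spam")
f.train("meeting agenda attached", "ham")
print(f.classify("viagra now"))  # spam
```

Because the models are just token counts, updates are cheap, which matches the incremental, train-on-error style of deployment described in Section 3.3.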

Moreover, it is not necessary to keep the messages used for training, since the models are incrementally built from the token frequencies. As the tokens present in the training set are kept in lexicographical order, the computational complexity of training each message is of the order of O(|m| · log n), where |m| is the number of tokens present in the message and n is the number of tokens in the training set. Therefore, besides being incrementally updateable, the proposed approach is also very fast to construct, especially when compared with other established methods. Note that, for training, the Naïve Bayes classifier has a computational complexity equivalent to O(|m| · n) [31, 11], and the linear SVM, O(|m| · n²) [32].

A basic training approach is to start with an empty model, classify each new sample, and train it with the right class if the classification is wrong. This is known as train on error (TOE). An improvement to this method is to train also when the classification is right but the score is near the boundary – that is, train on near error (TONE) [30]. The advantage of TONE over TOE is that it accelerates the learning process by exposing the filter to additional hard-to-classify samples in the same training period. Therefore, we employ TONE as the training method used by the proposed spam filter.

4. EXPERIMENTS AND RESULTS

First, we evaluated the proposed approach using the six well-known, large, real, and public Enron datasets¶. The corpora are composed of legitimate messages extracted from the mailboxes of six former employees of the Enron Corporation. For further details about the dataset statistics, refer to Metsis et al. [31].

As pointed out by Almeida et al. [11], to provide a fair evaluation we consider as the most important measures the Matthews correlation coefficient (MCC) [33] and the weighted accuracy rate (Accw%) [27] achieved by each filter.

MCC provides a balanced evaluation of the prediction, especially if the two classes are of different sizes [11, 34, 35]. It is given by the following equation:

MCC = ((A · B) − (C · D)) / √((A + C) · (A + D) · (B + C) · (B + D))

Here, A, B, C, and D correspond to the number of true positive, true negative, false positive, and false negative samples, respectively. MCC returns a value inside a predefined range, which provides more information about the classifier's performance: a real value between −1 and +1. A coefficient equal to +1 indicates a perfect prediction; 0, an average random prediction; and −1, an inverse prediction.

Additionally, we present other well-known measures, such as spam recall (Sre%), legitimate recall (Lre%), spam precision (Spr%), and legitimate precision (Lpr%).

The results achieved by the MDL spam filter are compared with the ones attained by methods considered the current top performers in spam filtering: seven different models of Naïve Bayes (NB) classifiers – Basic NB [6], Multinomial term frequency NB (MN TF NB) [36], Multinomial Boolean NB (MN Bool NB) [37], Multivariate Bernoulli NB (MV Bern NB) [38], Boolean NB (Bool NB) [31], Multivariate Gauss NB (Gauss NB) [31], and Flexible Bayes (Flex Bayes) [39] – and the linear Support Vector Machine (SVM) with Boolean attributes [12, 13, 14, 15]. Table I summarizes all compared Naïve Bayes spam filters; for further information, consult Almeida et al. [9, 10, 11]. A comprehensive set of results, including all tables and figures, is available at http://www.dt.fee.unicamp.br/~tiago/research/spam/spam.htm.

Tables II, III, IV, V, VI, and VII present the performance achieved by each classifier for each Enron dataset. We highlight the highest score in bold. According to Cormack [1], filters should be judged along four dimensions: autonomy, immediacy, spam identification, and non-spam identification. However, it is not obvious how to measure any of these dimensions separately, nor how to combine such measurements into a single one for the purpose of comparing filters.

The MDL spam filter outperformed the compared methods for the majority of the e-mail datasets used in our empirical evaluation.

¶ The Enron datasets are available at http://www.iit.demokritos.gr/skel/i-config/.

Table I. Naïve Bayes spam filters.

NB Classifier     P(~x|ci)                                                        Complexity: Training   Classification
Basic NB          ∏_{k=1}^{|m|} P(tk|ci)                                          O(|m|·n)               O(|m|)
MN TF NB          ∏_{k=1}^{|m|} P(tk|ci)^{xk}                                     O(|m|·n)               O(|m|)
MN Boolean NB     ∏_{k=1}^{|m|} P(tk|ci)^{xk}                                     O(|m|·n)               O(|m|)
MV Bernoulli NB   ∏_{k=1}^{|m|} P(tk|ci)^{xk} · (1 − P(tk|ci))^{(1−xk)}           O(|m|·n)               O(|m|)
Boolean NB        ∏_{k=1}^{|m|} P(tk|ci)                                          O(|m|·n)               O(|m|)
MV Gauss NB       ∏_{k=1}^{|m|} g(xk; µk,ci, σk,ci)                               O(|m|·n)               O(|m|)
Flexible Bayes    ∏_{k=1}^{|m|} (1/Lk,ci) Σ_{l=1}^{Lk,ci} g(xk; µk,ci,l, σci)     O(|m|·n)               O(|m|·n)

Table II. Enron 1 – Results achieved by each filter.

Classifiers    Sre(%)   Spr(%)   Lre(%)   Lpr(%)   Accw(%)   MCC
Basic NB       91.33    85.09    93.48    96.36    92.86     0.831
MN TF NB       82.00    73.21    87.77    92.29    86.10     0.676
MN Bool NB     82.67    60.19    77.72    91.67    79.15     0.560
MV Bern NB     72.67    60.56    80.71    87.87    78.38     0.508
Bool NB        96.00    52.55    64.67    97.54    73.75     0.551
Gauss NB       85.33    89.51    95.92    94.13    92.86     0.824
Flex Bayes     86.67    88.44    95.38    94.61    92.86     0.825
SVM            83.33    87.41    95.11    93.33    91.70     0.796
MDL            92.00    92.62    97.01    96.75    95.56     0.892

Table III. Enron 2 – Results achieved by each filter.

Classifiers    Sre(%)   Spr(%)   Lre(%)   Lpr(%)   Accw(%)   MCC
Basic NB       80.00    97.56    99.31    93.53    94.38     0.850
MN TF NB       75.33    96.58    99.08    92.13    93.02     0.812
MN Bool NB     76.00    98.28    99.54    92.36    93.53     0.827
MV Bern NB     66.00    81.82    94.97    89.06    87.56     0.657
Bool NB        95.33    81.25    92.45    98.30    93.19     0.836
Gauss NB       75.33    98.26    99.54    92.16    93.36     0.823
Flex Bayes     64.00    97.96    99.54    88.96    90.46     0.743
SVM            90.67    90.67    96.80    96.80    95.23     0.875
MDL            91.33    99.28    99.77    97.10    97.31     0.937

In some situations, the MDL filter performs much better than the SVM and NB classifiers. For instance, for Enron 1 (Table II), MDL achieved a spam recall rate of 92.00%, while the SVM attained 83.33%.

Table IV. Enron 3 – Results achieved by each filter.

Classifiers    Sre(%)   Spr(%)   Lre(%)   Lpr(%)   Accw(%)   MCC
Basic NB       58.00    100.00   100.00   86.45    88.59     0.708
MN TF NB       62.00    100.00   100.00   87.58    89.67     0.737
MN Bool NB     60.00    100.00   100.00   87.01    89.13     0.723
MV Bern NB     100.00   85.23    93.53    100.00   95.29     0.893
Bool NB        95.33    87.73    95.02    98.20    95.11     0.881
Gauss NB       55.33    97.65    99.50    85.65    87.50     0.676
Flex Bayes     52.67    97.53    99.50    86.78    72.83     0.656
SVM            91.33    96.48    98.76    96.83    96.74     0.917
MDL            90.00    100.00   100.00   96.40    97.28     0.931

Table V. Enron 4 – Results achieved by each filter.

Classifiers    Sre(%)   Spr(%)   Lre(%)   Lpr(%)   Accw(%)   MCC
Basic NB       95.33    100.00   100.00   87.72    96.50     0.914
MN TF NB       94.00    100.00   100.00   84.75    95.50     0.893
MN Bool NB     97.11    100.00   100.00   92.02    97.83     0.945
MV Bern NB     98.22    100.00   100.00   94.94    98.67     0.966
Bool NB        98.00    100.00   100.00   94.34    98.50     0.962
Gauss NB       94.22    100.00   100.00   85.23    95.67     0.896
Flex Bayes     95.78    100.00   100.00   88.76    96.83     0.922
SVM            98.89    100.00   100.00   96.77    99.17     0.978
MDL            97.11    100.00   100.00   92.02    97.83     0.945

Table VI. Enron 5 – Results achieved by each filter.

Classifiers    Sre(%)   Spr(%)   Lre(%)   Lpr(%)   Accw(%)   MCC
Basic NB       90.22    98.81    97.33    80.22    92.28     0.832
MN TF NB       88.59    100.00   100.00   78.12    91.89     0.832
MN Bool NB     95.11    100.00   100.00   89.29    96.53     0.922
MV Bern NB     98.10    91.86    78.67    94.40    92.47     0.814
Bool NB        85.87    100.00   100.00   74.26    89.96     0.799
Gauss NB       88.59    99.39    98.67    77.89    91.51     0.821
Flex Bayes     91.58    98.54    96.67    82.39    93.05     0.845
SVM            89.40    99.70    99.33    79.26    92.28     0.837
MDL            99.73    98.39    96.00    99.31    98.65     0.967

This means that, for Enron 1, MDL recognized over 8 percentage points more spam than the SVM, an improvement of 10.40%. Similar results can be found for Enron 2 (Table III), Enron 5 (Table VI), and Enron 6 (Table VII). The two methods achieved similar performance, with no statistically significant difference, only for Enron 3 (Table IV) and Enron 4 (Table V).

The results show that the data compression model is more efficient at distinguishing spam from ham than the other compared spam filters. It achieved an impressive average accuracy rate higher than 97%, and high precision and recall rates for all datasets, indicating that the MDL spam filter makes few mistakes. We also verify that the MDL classifier achieved an average MCC score higher than 0.925 over all tested e-mail collections, which clearly indicates that the proposed filter comes close to a perfect prediction (MCC = 1.000) and is much better than not using a filter at all (MCC = 0.000).

Table VII. Enron 6 – Results achieved by each filter.

Classifiers    Sre(%)   Spr(%)   Lre(%)   Lpr(%)   Accw(%)   MCC
Basic NB       87.78    99.00    97.33    72.64    90.17     0.781
MN TF NB       75.78    99.42    98.67    57.59    81.50     0.651
MN Bool NB     93.33    97.45    92.67    82.25    93.17     0.828
MV Bern NB     96.00    92.31    76.00    86.36    91.00     0.753
Bool NB        66.67    99.67    99.33    49.83    74.83     0.572
Gauss NB       89.56    98.05    94.67    75.13    90.83     0.785
Flex Bayes     94.22    97.03    91.33    84.05    93.50     0.833
SVM            89.78    95.28    86.67    73.86    89.00     0.727
MDL            98.67    95.48    86.00    95.56    95.50     0.878

Among the evaluated NB classifiers, the results indicate that all of them achieved similar performance, with no statistically significant difference. However, they achieved lower results than MDL and SVM, which attained accuracy rates higher than 90% for most of the Enron spam datasets.

To reinforce the validation, we performed another experiment to evaluate the MDL spam filter on more realistic and larger datasets: TREC06 and CEAS08. The TREC Spam Track and the CEAS Spam Challenge are the largest and most realistic laboratory evaluations to date. The main task at TREC 2006 – the immediate feedback task – simulates the on-line deployment of a spam filter with idealized user feedback [1]. On the other hand, the 2008 CEAS Spam Challenge evaluated the filters using a live simulation task with delayed partial feedback, in which feedback is delayed and given only for messages addressed to a subset of the recipients. The main goal is to simulate a real mailbox in order to explore the impact of imperfect or limited user feedback.

To offer a fair evaluation and reproducible results, we followed exactly the same guidelines used in the challenges, including the same measures: spam caught (SC%), blocked ham (BH%), LAM%, and (1 − AUC)% [1].

The logistic average misclassification rate (LAM%) is a single quality measure based only on the filter's binary classifications. Given the false positive rate (fpr) and the false negative rate (fnr), LAM% is defined as

LAM% = logit⁻¹((logit(fpr) + logit(fnr)) / 2),

where

logit(p) = log(p / (100 − p)).

This measure imposes no a priori relative importance on ham or spam misclassification, and it rewards equally a fixed-factor improvement in the odds of either [1]. The lower the LAM%, the better the performance.

The area under the ROC curve (AUC) provides an estimate of the effectiveness of a soft classifier over all threshold filtering settings. AUC also has a probabilistic interpretation: it is the probability that the classifier will award a random spam message a higher score than a random ham message. The (1 − AUC)% is often used instead of AUC because, in spam filtering, the AUC value is very close to 1 [1].

Again, we compared the performance achieved by the MDL spam filter against:

• Naïve Bayes classifier: the open-source and well-known Bogofilter (version 0.94.0, default parameters, available at http://www.bogofilter.org);
• SVM classifier: the relaxed linear support vector machine [40].

Table VIII presents the results achieved by the MDL spam filter and the established spam filters on the immediate feedback task for the TREC06 data set. Table IX presents the results achieved by the filters for the CEAS08 data set on the live simulation task.
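The LAM% measure defined above can be sketched as follows, with rates expressed as percentages to match the logit definition in the text:

```python
from math import log, exp

def logit(p):
    # p is a percentage in the open interval (0, 100).
    return log(p / (100.0 - p))

def inv_logit(x):
    # Inverse of logit, back onto the percentage scale.
    return 100.0 / (1.0 + exp(-x))

def lam(fpr, fnr):
    # Logistic average misclassification percentage (lower is better):
    # averages fpr and fnr on the log-odds scale.
    return inv_logit((logit(fpr) + logit(fnr)) / 2.0)

print(round(lam(1.0, 1.0), 2))  # 1.0 — equal rates are returned unchanged
```

Averaging on the log-odds scale is what makes a fixed-factor improvement in either error's odds count equally, as the text notes.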

T.A. Almeida & A. Yamakami

Compression-based Spam Filter

live simulation. The results by each e-mail collection are

ordered by decreasing performance in the (1 − AUC)% statistic.

Table VIII. TREC 2006 – Results achieved by each filter.

  Filters    (1 − AUC)%    SC%      BH%     LAM%
  MDL        0.053         96.92    0.19    0.52
  SVM        0.060         99.51    0.71    0.62
  NB         0.084         91.77    0.00    0.00

Table IX. CEAS 2008 – Results achieved by each filter.

  Filters    (1 − AUC)%    SC%      BH%     LAM%
  MDL        0.0129        95.79    0.08    0.21
  NB         0.0157        89.04    0.00    0.00
  SVM        0.0221        98.96    0.51    0.49

The MDL spam filter achieved very good performance on all tested e-mail data sets, regardless of the task. Given that (1 − AUC)% is the standard measure used in the spam challenges, we can conclude that the proposed approach outperforms the top-performing spam filters. Note that, although MDL did not achieve the best SC% rates, it attained very low BH% and LAM% rates for all tested data sets, which indicates that the proposed filter rarely misclassifies a legitimate message. This is a remarkable feature, since in spam filtering false positives are considered much worse than false negatives. Although Bogofilter (the NB classifier) attained the best BH% rates for all the e-mail data sets used, it achieved the worst SC% rates.

5. CONCLUSIONS AND FURTHER WORK

In this paper, we have presented a spam filtering approach based on a data compression model that has proved to be fast to construct and incrementally updateable. We have also compared its performance with the well-known linear Support Vector Machine and different models of Naïve Bayes classifiers, something the spam literature does not always acknowledge. To evaluate the proposed method, we have conducted empirical experiments using real, large and public datasets, and the results indicate that the proposed classifier outperforms currently established spam filters. We are currently conducting further experiments to compare our approach with other established commercial and open-source spam filters, such as SpamAssassin, ProcMail, OSBF-Lua and others.

ACKNOWLEDGMENT

This work is partially supported by the Brazilian funding agencies CNPq, CAPES and FAPESP.

REFERENCES

1. Cormack G. Email Spam Filtering: A Systematic Review. Foundations and Trends in Information Retrieval 2008; 1(4):335–455.
2. Joachims T. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Proceedings of the 14th International Conference on Machine Learning, Nashville, TN, USA, 1997; 143–151.
3. Schapire R, Singer Y, Singhal A. Boosting and Rocchio Applied to Text Filtering. Proceedings of the 21st Annual International Conference on Information Retrieval, Melbourne, Australia, 1998; 215–223.
4. Carreras X, Marquez L. Boosting Trees for Anti-Spam Email Filtering. Proceedings of the 4th International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, Bulgaria, 2001; 58–64.
5. Reddy C, Park JH. Multi-Resolution Boosting for Classification and Regression Problems. Knowledge and Information Systems 2010.
6. Sahami M, Dumais S, Heckerman D, Horvitz E. A Bayesian Approach to Filtering Junk E-mail. Proceedings of the 15th National Conference on Artificial Intelligence, Madison, WI, USA, 1998; 55–62.
7. Androutsopoulos I, Koutsias J, Chandrinos K, Paliouras G, Spyropoulos C. An Evaluation of Naive Bayesian Anti-Spam Filtering. Proceedings of the

11th European Conference on Machine Learning, Barcelona, Spain, 2000; 9–17.
8. Song Y, Kolcz A, Giles C. Better Naive Bayes Classification for High-precision Spam Detection. Software – Practice and Experience 2009; 39(11):1003–1024.
9. Almeida T, Yamakami A, Almeida J. Evaluation of Approaches for Dimensionality Reduction Applied with Naive Bayes Anti-Spam Filters. Proceedings of the 8th IEEE International Conference on Machine Learning and Applications, Miami, FL, USA, 2009; 517–522.
10. Almeida T, Yamakami A, Almeida J. Probabilistic Anti-Spam Filtering with Dimensionality Reduction. Proceedings of the 25th ACM Symposium On Applied Computing, Sierre, Switzerland, 2010; 1804–1808.
11. Almeida T, Almeida J, Yamakami A. Spam Filtering: How the Dimensionality Reduction Affects the Accuracy of Naive Bayes Classifiers. Journal of Internet Services and Applications 2011; 1(3):183–200.
12. Drucker H, Wu D, Vapnik V. Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks September 1999; 10(5):1048–1054.
13. Kolcz A, Alspector J. SVM-based Filtering of E-mail Spam with Content-Specific Misclassification Costs. Proceedings of the 1st International Conference on Data Mining, San Jose, CA, USA, 2001; 1–14.
14. Hidalgo J. Evaluating Cost-Sensitive Unsolicited Bulk Email Categorization. Proceedings of the 17th ACM Symposium on Applied Computing, Madrid, Spain, 2002; 615–620.
15. Forman G, Scholz M, Rajaram S. Feature Shaping for Linear SVM Classifiers. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, 2009; 299–308.
16. Almeida T, Yamakami A. Content-Based Spam Filtering. Proceedings of the 23rd IEEE International Joint Conference on Neural Networks, Barcelona, Spain, 2010; 1–7.
17. Almeida TA, Yamakami A. Facing the Spammers: A Very Effective Approach to Avoid Junk E-mails. Expert Systems with Applications 2012; 1–5.
18. Carpinter J, Hunt R. Tightening the Net: A Review of Current and Next Generation Spam Filtering Tools. Computers and Security 2006; 25(8):566–578.
19. Blanzieri E, Bryl A. A Survey of Learning-based Techniques of Email Spam Filtering. Artificial Intelligence Review 2008; 29(1):335–455.
20. Guzella T, Caminhas W. A Review of Machine Learning Approaches to Spam Filtering. Expert Systems with Applications September 2009; 36(7):10206–10222.
21. Caruana G, Li M. A Survey of Emerging Approaches to Spam Filtering. ACM Computing Surveys 2012; 44(2):1–27.
22. Almeida TA, Yamakami A. Advances in Spam Filtering Techniques. Computational Intelligence for Privacy and Security, Studies in Computational Intelligence, vol. 394, Elizondo D, Solanas A, Martinez-Balleste A (eds.). Springer, 2012; 199–214.
23. Rissanen J. Modeling by Shortest Data Description. Automatica 1978; 14:465–471.
24. Barron A, Rissanen J, Yu B. The Minimum Description Length Principle in Coding and Modeling. IEEE Transactions on Information Theory 1998; 44(6):2743–2760.
25. Grünwald P. A Tutorial Introduction to the Minimum Description Length Principle. Advances in Minimum Description Length: Theory and Applications, Grünwald P, Myung I, Pitt M (eds.). MIT Press, 2005; 3–81.
26. Almeida T, Yamakami A, Almeida J. Filtering Spams using the Minimum Description Length Principle. Proceedings of the 25th ACM Symposium On Applied Computing, Sierre, Switzerland, 2010; 1856–1860.
27. Androutsopoulos I, Paliouras G, Michelakis E. Learning to Filter Unsolicited Commercial E-Mail. Technical Report 2004/2, National Centre for Scientific Research "Demokritos", Athens, Greece, March 2004.
28. Teahan W, Harper D. Using Compression-Based Language Models for Text Categorization. Proceedings of the 2001 Workshop on Language Modeling and Information Retrieval, Pittsburgh, PA, USA, 2001; 1–5.
29. Frank E, Chui C, Witten I. Text Categorization Using Compression Models. Proceedings of the 10th Data Compression Conference, Snowbird, UT, USA, 2000; 555–565.

30. Siefkes C, Assis F, Chhabra S, Yerazunis W. Combining Winnow and Orthogonal Sparse Bigrams for Incremental Spam Filtering. Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, Pisa, Italy, 2004; 410–421.
31. Metsis V, Androutsopoulos I, Paliouras G. Spam Filtering with Naive Bayes – Which Naive Bayes? Proceedings of the 3rd International Conference on Email and Anti-Spam, Mountain View, CA, USA, 2006; 1–5.
32. Bordes A, Ertekin S, Weston J, Bottou L. Fast Kernel Classifiers with Online and Active Learning. Journal of Machine Learning Research 2005; 6:1579–1619.
33. Matthews B. Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme. Biochimica et Biophysica Acta October 1975; 405(2):442–451.
34. Almeida T, Hidalgo JG, Yamakami A. Contributions to the Study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering, Mountain View, CA, USA, 2011; 259–262.
35. Baldi P, Brunak S, Chauvin Y, Andersen C, Nielsen H. Assessing the Accuracy of Prediction Algorithms for Classification: An Overview. Bioinformatics May 2000; 16(5):412–424.
36. McCallum A, Nigam K. A Comparison of Event Models for Naive Bayes Text Classification. Proceedings of the 15th AAAI Workshop on Learning for Text Categorization, Menlo Park, CA, USA, 1998; 41–48.
37. Schneider K. On Word Frequency Information and Negative Evidence in Naive Bayes Text Classification. Proceedings of the 4th International Conference on Advances in Natural Language Processing, Alicante, Spain, 2004; 474–485.
38. Losada D, Azzopardi L. Assessing Multivariate Bernoulli Models for Information Retrieval. ACM Transactions on Information Systems June 2008; 26(3):1–46.
39. John G, Langley P. Estimating Continuous Distributions in Bayesian Classifiers. Proceedings of the 11th International Conference on Uncertainty in Artificial Intelligence, Montreal, Canada, 1995; 338–345.
40. Sculley D, Wachman G, Brodley C. Spam Filtering using Inexact String Matching in Explicit Feature Space with On-Line Linear Classifiers. Proceedings of the 15th Text REtrieval Conference, Gaithersburg, MD, USA, 2006; 1–10.