SECURITY AND COMMUNICATION NETWORKS Security Comm. Networks 2012; 00:1–11 DOI: 10.1002/sec
RESEARCH ARTICLE
Compression-based Spam Filter Tiago A. Almeida1∗ and Akebo Yamakami2 1
˜ Carlos – UFSCar, 18052–780, Sorocaba, SP – Brazil Department of Computer Science, Federal University of Sao
2
School of Electrical and Computer Engineering, University of Campinas – UNICAMP, 13083–970, Campinas, SP – Brazil
ABSTRACT Nowadays e-mail spam is not a novelty, but it is still an important problem with a high impact on the economy. Spam filtering poses a special problem in text categorization, in which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. In this paper, we present a novel approach to spam filtering based on a compression-based model. We have conducted an empirical experiment on eight public and real non-encoded datasets. The results indicate that the proposed filter is fast to construct, incrementally updateable and clearly outperforms c 2012 John Wiley & Sons, Ltd. established spam classifiers. Copyright KEYWORDS Compression-based model; spam filter; text categorization; knowledge-based system; machine learning ∗
Correspondence
E-mail:
[email protected] Received . . .
1. INTRODUCTION
methods which take into account the sender’s domain, clustering, and others. However, machine learning algo-
The term spam is used to denote an unsolicited commercial
rithms have been achieved more success [1]. These meth-
e-mail. According to updated reports, the amount of spam
ods include approaches that are considered top-performers
is still increasing. The average of spam sent per day
in text categorization, like Rocchio [2, 3], Boosting [4, 5],
increased from 2.4 billion in 2002∗ to 300 billion in
Na¨ıve Bayes classifiers [6, 7, 8, 9, 10, 11] and Support
†
2010 . The cost in terms of lost productivity in the United
Vector Machines [12, 13, 14, 15, 16]. The two latter
States has reached US$ 21.58 billion annually, while the
currently appear to be the best spam filters available in the
‡
worldwide productivity is estimated to be US$ 50 billion .
literature [1, 17]. For more details about spam filters, refer
On a worldwide basis, the information technology cost of
to the surveys [18, 19, 1, 20, 21, 22].
dealing with junk email was estimated to rise from US$ 20.5 billion in 2003, to US$ 198 billion in 2010.
A relatively recent approach for inductive inference which is rarely employed in text categorization is the
In recent years, many methods have been proposed to
Minimum Description Length principle. It states that the
automatic spam filtering, such as: white and blacklists,
best explanation, given a limited set of observed data,
challenge and response systems, rule-based approaches,
is the one that yields the greatest compression of the data [23, 24, 25].
∗
See http://www.spamlaws.com/spam-stats.html † See www.ciscosystems.cd/en/US/prod/collateral/vpndevc/ cisco 2009 asr.pdf ‡ See http://www.rockresearch.com/news 020305.php
c 2012 John Wiley & Sons, Ltd. Copyright Prepared using secauth.cls [Version: 2010/06/28 v2.00]
In this paper, we present a spam classifier based on the Minimum Description Length principle and compare its performance with seven different models of
1
Compression-based Spam Filter
T.A. Almeida & A. Yamakami
the well-known Na¨ıve Bayes classifiers and the linear
1. Training. A set of labeled messages (M) must
Support Vector Machine. We have conducted an empirical
be provided as training data, which are first
experiment using eight well-known, large, and public
transformed into a representation that can be
databases and the reported results states that our approach
understood by the learning algorithms. The most
outperforms currently established spam filters.
commonly used representation for spam filtering is
Fragments of this work were previously presented at
the vector space model, in which each document
ACM SAC 2010 [26, 10]. Here, we have connected all
mj ∈ M is transformed into a real vector ~xj ∈
ideas in a very consistent way and offered a lot more
ℜ|Φ| , where Φ is the vocabulary (feature set),
details about each study and significantly extended the
and the coordinates of ~xj represent the weight of
performance evaluation. To be clear: first, we present more
each feature in Φ. Then, we can run a learning
details about the proposed method, its main features and
algorithm over the training data to create a classifier
how it works. Second, we compare the new approach
Ψspam (x~j ) → {true, false}.
with several established classifiers instead of only two, as
2. Classification. The classifier Ψspam (x~j ) is applied to
presented in the mentioned papers. Third, and the most
the vector representation of a message ~x to produce
important, we evaluate the new classifier using public and
a prediction whether ~x is spam or not.
larger email corpora, such as: TREC06 and CEAS08. The remainder of this paper is organized as follows:
2.1. Message representation
Section 2 presents the general concepts regarding spam classifiers. In Section 3, we offer details about the
Each message m is composed by a set of tokens
proposed compression-based approach. The experiments
m = {t1 , . . . , t|m| }, where each token tk may be a
and results are showed in Section 4. Finally, Section 5
term (or word) (“viagra”, for example), a set of
offers conclusions and outlines for future work.
terms (“to be removed”), or a single character (“$”). Thus, we can represent each e-mail by a vector ~x = hx1 , . . . , x|m| i, where x1 , . . . , x|m| are values of the attributes X1 , . . . , X|m| associated with the tokens
2. GENERAL CONCEPTS
t1 , . . . , t|m| . In the simplest case, each token represents a single term and all attributes are Boolean: Xi = 1, if the
In general, the machine learning algorithms applied to spam filtering can be summarized as follows. Given
a
set
of
{m1 , m2 , . . . , mj , . . . , m|M| }
and (cl )},
(cs ),
legitimate
Alternatively, attributes may be an integer computed M=
messages
C = {spam
a
category where
message contains ti or Xi = 0, otherwise.
mj
by token frequencies (TF) representing how many times
set
each token appears in the message. A third alternative is
is
to associate each attribute Xi to a normalized TF, xi =
the j th e-mail in M and C is the possible label
n(ti ) , |m|
where n(ti ) is the number of occurrences of the
set, the task of automated spam filtering consists
token represented by Xi in m, and |m| is the length of
in
function
m measured in token occurrences. Normalized TF takes
Ψ(mj , ci ) : M × C → {true, false}. When Ψ(mj , ci )
building
a
Boolean
categorization
into account the token repetition versus the size of message
is true, it indicates message mj belongs to category ci ;
[27].
otherwise, mj does not belong to ci . In the setting of spam filtering there are only two category labels: spam and legitimate (also called ham). Each message mj ∈ M can only be assigned to one of them, but not to both. Therefore, we can use
3. COMPRESSION-BASED SPAM FILTER
a simplified categorization function Ψspam (mj ) : M → {true, false}. Hence, a message is classified as spam when Ψspam (mj ) is true, and legitimate otherwise. The application of supervised machine learning algorithms for spam filtering consists of two stages: 2
Compression-based techniques seem to be a promising alternative to categorization tasks. Teahan and Harper [28] and Frank et al. [29] performed extensive experiments to evaluate the performance of different approaches for text c 2012 John Wiley & Sons, Ltd. Security Comm. Networks 2012; 00:1–11 DOI: 10.1002/sec Prepared using secauth.cls
T.A. Almeida & A. Yamakami
Compression-based Spam Filter
categorization on the standard Reuters-21578 collection
This idea can be applied to all sorts of inductive
and compared compression-based algorithms, such as
inference problems, but it turns out to be most fruitful
prediction by partial matching (PPM) with Na¨ıve Bayes
in problems of model selection and, more generally,
classifiers and support vector machine. According to the
dealing with overfitting [25]. An important property of
found results, compression-based models perform better
MDL methods is that they provide automatically and
than word-based Na¨ıve Bayes techniques and approach the
inherently protection against overfitting and can be used
performance of linear support vector machine.
to estimate both the parameters and the structure of a
A relatively recent method for inductive inference
model. In contrast, to avoid overfitting when estimating the
which is still rarely used in text categorization is the
structure of a model, traditional methods such as maximum
Minimum Description Length (MDL) principle. It states
likelihood must be modified and extended with additional,
that the best explanation, given a limited set of observed
typically ad hoc principles [25].
data, is the one that yields the greatest compression of the data [23, 24, 25].
Consider the following example. Suppose we flip a coin 1,000 times and we observe the numbers of heads and
The purpose of statistical modeling is to discover
tails. We consider two model classes: the first consists of
regularities in observed data. The success in finding such
a code that represents each outcome with a 0 for heads
regularities can be measured by the length with which the
or a 1 for tails. This code represents the hypothesis that
data can be described. This is the rationale behind the MDL
the coin is fair. The code length according to this code is
principle [23]. The fundamental idea is that any regularity
always exactly 1,000 bits. The second model class consists
in a given set of data can be used to compress the data.
of all codes that are efficient for a coin with some specific
According to the traditional MDL principle, the favorite
bias, representing the hypothesis that the coin is not fair.
model results in the shortest description of the model and
Say that we observe 510 heads and 490 tails. Then the
the data, given this model. In other words, the model that
code length according to the best code in the second
best compresses the data is selected. This model selection
model class is shorter than 1,000 bits. For this reason
criterion naturally balances the complexity of the model
a naive statistical method might put forward this second
and the degree to which this model fits the data. This
hypothesis as a better explanation for the data. However,
principle was first introduced by Rissanen [23] and it
in an MDL approach we would have to construct a single
becomes an important concept in information theory.
code based on the hypothesis, we can not just use the best
Let Z be a finite or countable set and let P be
one. A simple way to do it would be to use a two-part code,
a probability distribution on Z. Then there exists a
in which we first specify which element of the model class
prefix code C for Z such that for all z ∈ Z, LC (z) =
has the best performance, and then we specify the data
⌈−log2 P (z)⌉. C is called the code corresponding to P .
using that code. We will need quite a lot of bits to specify
Similarly, let C be a prefix code for Z. Then there
which code to use; thus the total codelength based on the
exists a (possibly defective) probability distribution P such
second model class would be larger than 1,000 bits. Thus if
′
′
that for all z ∈ Z, −log2 P (z) = LC ′ (z). P is called
we follow an MDL approach the conclusion has to be that
the probability distribution corresponding to C ′ . Thus,
there is not enough evidence in support of the hypothesis
large probability according to P means small code length
that the coin is biased, even though the best element of the
according to the code corresponding to P and vice versa
second model class provides better fit to the data. Consult
[23, 24, 25].
Grunwald [25] for further details.
The goal of statistical inference may be cast as trying
In essence, compression algorithms can be applied to
to find regularity in the data. Regularity may be identified
text categorization by building one compression model
with ability to compress. MDL combines these two
from the training documents of each class and using these
insights by viewing learning as data compression: it tells
models to evaluate the target document.
us that, for a given set of hypotheses H and data set D, we should try to find the hypothesis or combination of hypotheses in H that compresses D most [23, 24, 25].
3.1. The proposed spam filter Given a set of classified training messages M, the task is to assign a target message m with an unknown label to one of
c 2012 John Wiley & Sons, Ltd. Security Comm. Networks 2012; 00:1–11 DOI: 10.1002/sec Prepared using secauth.cls
3
Compression-based Spam Filter
T.A. Almeida & A. Yamakami
the classes c ∈ {spam, ham}. First, the method measures the increase of the description length of the data set as a
In the following, we offer more information regarding steps 1 and 4.
result of the addition of the target document. Finally, it chooses the class for which the description length increase is minimal. We consider in this work, each class (model) c as a sequence of tokens extracted from the messages and inserted into the training set. Each token t from m has a code length Lt based on the sequence of tokens presented in the messages of the training set of c. The length of m when assigned to the class c corresponds to the sum of all code lengths associated with each token of m, Lm = P|m| i=1 Lti . We calculate Lti = ⌈−log2 Pti ⌉, where P is a probability distribution related with the tokens of class. Let
nc (ti ) the number of times that ti appears in messages of class c, then the probability that any token belongs to c is given by the maximum likelihood estimation: Pt i =
nc (ti ) +
3.2. Preprocessing Tokenization involves breaking the text stream into tokens, usually by means of a regular expression. We consider in this work that tokens start with a printable character; followed by any number of alphanumeric characters, excluding dots, commas and colons from the middle of the pattern. With this pattern, domain names and mail addresses will be split at dots, so the classifier can recognize a domain even if subdomains vary [30]. As proposed by Drucker et al. [12] and Metsis et al. [31], we do not consider the number of times a tokens appears in each message. So, each tokens is computed only once per message it appears. It is important to observe that we did not perform
1 |∆|
language-specific preprocessing techniques, such as: word stemming, stop word removal, or case folding. However,
nc + 1
where nc corresponds to the sum of nc (ti ) for all tokens
we use an e-mail-specific preprocessing before the
which appear in messages that belongs to c and |∆| is the
classification stage. We employ the Jaakko Hyvattis
vocabulary size. In this work, we set |∆| = 232 , that is,
normalizemime§ . This algorithm converts the character set
each token in an uncompress mode is a symbol with 32
to UTF-8, decoding Base64, Quoted-Printable and URL
bits. This estimation reserves a “portion” of probability to
encoding and adding warn tokens in case of encoding
terms which the classifier has never seen before.
errors. It also appends a copy of HTML/XML message
The proposed spam filter classifies a message using the
bodies with most tags removed, decodes HTML entities and limits the size of attached binary files.
following steps:
3.3. Training stage 1. Tokenization: the classifier extracts all tokens of the message m = {t1 , . . . , t|m| };
The training stage is basically responsible to update and store the number of times each token appears in the
2. The method calculates the increase of the description length when m is assigned to each class c ∈ {spam, ham} by the following equations:
messages of each class. Therefore, for each message m = {t1 , . . . , t|m| } to be trained, the MDL spam filter performs the following simple steps: For each token ti of m do:
Lm (spam) =
& |m| X
− log2
i=1
nspam (ti ) +
1 |∆|
nspam + 1
!'
1. Search for ti in the training database; 2. If ti is found then update the number of messages on the class of m that ti has appeared, otherwise insert ti in the database.
Lm (ham) =
|m| X i=1
&
− log2
nham (ti ) +
1 |∆|
nham + 1
!'
3. If Lm (spam) < Lm (ham), then m is classified as
An advantage of the MDL classifier is that we can start with an empty training set and according to the user feedback the classifier builds the models for each class.
spam; otherwise, m is labeled as ham. 4. If necessary, the training method is called. 4
§
Available at http://hyvatti.iki.fi/jaakko/spam
c 2012 John Wiley & Sons, Ltd. Security Comm. Networks 2012; 00:1–11 DOI: 10.1002/sec Prepared using secauth.cls
T.A. Almeida & A. Yamakami
Compression-based Spam Filter
Moreover, it is not necessary to keep the messages used
As pointed out by Almeida et al. [11], to provide a fair
for training since the models are incrementally building
evaluation, we consider as the most important measures
by the token frequencies. As the tokens presented in the
the Matthews correlation coefficient (M CC) [33] and the
training set are kept in a lexicographical order, so the
weighted accuracy rate (Accw %) [27] achieved by each
computational complexity to train each message is of the
filter.
order of O(|m|.log n), where |m| is the number of tokens
M CC provides a balanced evaluation of the prediction,
presented in the message and n is the amount of tokens in
especially if the two classes are of different sizes [11, 34,
the training set. Therefore, besides the proposed approach
35]. It is given by the following equation.
is incrementally updateable, it is also very fast to construct, especially when compared with other established methods. Note that, for training, the Na¨ıve-Bayes classifier has a
M CC = p
computational complexity equivalent to O(|m|.n) [31, 11] and the linear SVM, O(|m|.n2 ) [32].
(A.B) − (C.D) (A + C).(A + D).(B + C).(B + D)
A basic training approach is to start with an empty
Here, A, B, C, and D correspond to the amount of true
model, classify each new sample and train it in the right
positive, true negative, false positive, and false negative
class if the classification is wrong. This is known as train
samples, respectively.
on error (TOE). An improvement to this method is to train
M CC returns a value inside a predefined range,
also when the classification is right, but the score is near
which provides more information about the classifiers
the boundary – that is, train on near error (TONE) [30].
performance. It returns a real value between −1 and +1.
The advantage of TONE over TOE is that it accelerates
A coefficient equals to +1 indicates a perfect prediction;
the learning process by exposing the filter to additional
0, an average random prediction; and −1, an inverse
hard-to-classify samples in the same training period.
prediction.
Therefore, we employ the TONE as training method used by the proposed spam filter.
Additionally, we present other well-known measures, such as: spam recall (Sre%), legitimate recall (Lre%), spam precision (Spr%), and legitimate precision (Lpr%). The results achieved by the MDL spam filter are compared with the ones attained by methods considered
4. EXPERIMENTS AND RESULTS
the actual top-performers in spam filtering: seven First, we evaluated the proposed approach using the six ¶
different models of Na¨ıve Bayes (NB) classifiers (Basic
well-known, large, real and public Enron datasets . The
NB [6], Multinomial term frequency NB [MN TF
corpora are composed by legitimate messages extracted
NB] [36], Multinomial Boolean NB [MN Bool NB] [37],
from the mailboxes of six former employees of the Enron
Multivariate Bernoulli NB [MV Bern NB] [38], Boolean
Corporation. For further details about the dataset statistics,
NB [Bool NB] [31], Multivariate Gauss NB [Gauss
refer to Metsis et al. [31]. Tables II, III, IV, V, VI, and VII present the performance achieved by each classifier for each Enron
NB] [31], and Flexible Bayes [Flex Bayes] [39]) and linear Support Vector Machine (SVM) with Boolean attributes [12, 13, 14, 15]. Table I summarizes all compared Na¨ıve Bayes spam
dataset. We highlight the highest score in bold. According to Cormack [1], the filters should be judged along four dimensions: autonomy, immediacy,
filters. For further information, consult Almeida et al. [9, 10, 11]. A
spam identification, and non-spam identification. However,
comprehensive tables
and
set
of
figures,
results, is
including
it is not obvious how to measure any of these dimensions
all
available
separately, nor how to combine these measurements into a
http://www.dt.fee.unicamp.br/˜tiago/
single one for the purpose of comparing filters.
research/spam/spam.htm.
at
The MDL spam filter outperformed the compared methods for the majority e-mail datasets used in our ¶ The Enron datasets are available at http://www.iit.demokritos.gr/ skel/i-config/.
c 2012 John Wiley & Sons, Ltd. Security Comm. Networks 2012; 00:1–11 DOI: 10.1002/sec Prepared using secauth.cls
empirical evaluation. Note that, in some situations the
5
Compression-based Spam Filter
T.A. Almeida & A. Yamakami
Table I. Na¨ıve Bayes spam filters.
Q|m|
P (tk |ci )
O(|m|.n)
O(|m|)
P (tk |ci )xk
O(|m|.n)
O(|m|)
P (tk |ci )xk
O(|m|.n)
O(|m|)
P (tk |ci )xk .(1 − P (tk |ci ))(1−xk )
O(|m|.n)
O(|m|)
P (tk |ci )
O(|m|.n)
O(|m|)
g(xk ; µk,ci , σk,ci )
O(|m|.n)
O(|m|)
O(|m|.n)
O(|m|.n)
Basic NB
k=1
Q|m|
MN TF NB
k=1
Q|m|
MN Boolean NB MV Bernoulli NB Boolean NB
k=1
Q|m|
k=1
Q|m|
k=1
MV Gauss NB Flexible Bayes
Complexity on Training Classification
P (~x|ci )
NB Classifier
Q|m|
k=1
Q|m|
1 k=1 Lk,c
i
PLk,ci l=1
g(xk ; µk,ci ,l , σci )
Table II. Enron 1 – Results achieved by each filter.
Classifiers Basic NB MN TF NB MN Bool NB MV Bern NB Bool NB Gauss NB Flex Bayes SVM MDL
Sre(%)
Spr(%)
Lre(%)
Lpr(%)
Accw (%)
M CC
91.33 82.00 82.67 72.67 96.00 85.33 86.67 83.33 92.00
85.09 73.21 60.19 60.56 52.55 89.51 88.44 87.41 92.62
93.48 87.77 77.72 80.71 64.67 95.92 95.38 95.11 97.01
96.36 92.29 91.67 87.87 97.54 94.13 94.61 93.33 96.75
92.86 86.10 79.15 78.38 73.75 92.86 92.86 91.70 95.56
0.831 0.676 0.560 0.508 0.551 0.824 0.825 0.796 0.892
Table III. Enron 2 – Results achieved by each filter.
Classifiers Basic NB MN TF NB MN Bool NB MV Bern NB Bool NB Gauss NB Flex Bayes SVM MDL
Sre(%)
Spr(%)
Lre(%)
Lpr(%)
Accw (%)
M CC
80.00 75.33 76.00 66.00 95.33 75.33 64.00 90.67 91.33
97.56 96.58 98.28 81.82 81.25 98.26 97.96 90.67 99.28
99.31 99.08 99.54 94.97 92.45 99.54 99.54 96.80 99.77
93.53 92.13 92.36 89.06 98.30 92.16 88.96 96.80 97.10
94.38 93.02 93.53 87.56 93.19 93.36 90.46 95.23 97.31
0.850 0.812 0.827 0.657 0.836 0.823 0.743 0.875 0.937
MDL performs much better than SVM and NB classifiers.
spam recall rate equal to 92% while SVM attained
For instance, for Enron 1 (Table II), MDL achieved
83.33%. It means that for Enron 1 MDL was able to
6
c 2012 John Wiley & Sons, Ltd. Security Comm. Networks 2012; 00:1–11 DOI: 10.1002/sec Prepared using secauth.cls
T.A. Almeida & A. Yamakami
Compression-based Spam Filter
Table IV. Enron 3 – Results achieved by each filter.
Classifiers
Sre(%)
Spr(%)
Lre(%)
Lpr(%)
Accw (%)
M CC
Basic NB MN TF NB MN Bool NB MV Bern NB Bool NB Gauss NB Flex Bayes SVM MDL
58.00 62.00 60.00 100.00 95.33 55.33 52.67 91.33 90.00
100.00 100.00 100.00 85.23 87.73 97.65 97.53 96.48 100.00
100.00 100.00 100.00 93.53 95.02 99.50 99.50 98.76 100.00
86.45 87.58 87.01 100.00 98.20 85.65 86.78 96.83 96.40
88.59 89.67 89.13 95.29 95.11 87.50 72.83 96.74 97.28
0.708 0.737 0.723 0.893 0.881 0.676 0.656 0.917 0.931
Table V. Enron 4 – Results achieved by each filter.
Classifiers Basic NB MN TF NB MN Bool NB MV Bern NB Bool NB Gauss NB Flex Bayes SVM MDL
Sre(%)
Spr(%)
Lre(%)
Lpr(%)
Accw (%)
M CC
95.33 94.00 97.11 98.22 98.00 94.22 95.78 98.89 97.11
100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
87.72 84.75 92.02 94.94 94.34 85.23 88.76 96.77 92.02
96.50 95.50 97.83 98.67 98.50 95.67 96.83 99.17 97.83
0.914 0.893 0.945 0.966 0.962 0.896 0.922 0.978 0.945
Table VI. Enron 5 – Results achieved by each filter.
Classifiers Basic NB MN TF NB MN Bool NB MV Bern NB Bool NB Gauss NB Flex Bayes SVM MDL
Sre(%)
Spr(%)
Lre(%)
Lpr(%)
Accw (%)
M CC
90.22 88.59 95.11 98.10 85.87 88.59 91.58 89.40 99.73
98.81 100.00 100.00 91.86 100.00 99.39 98.54 99.70 98.39
97.33 100.00 100.00 78.67 100.00 98.67 96.67 99.33 96.00
80.22 78.12 89.29 94.40 74.26 77.89 82.39 79.26 99.31
92.28 91.89 96.53 92.47 89.96 91.51 93.05 92.28 98.65
0.832 0.832 0.922 0.814 0.799 0.821 0.845 0.837 0.967
recognize over 8% more of spams than SVM, representing
The results show that the data compression model is
an improvement of 10.40%. The same result can be
more efficient to distinguish messages as spams or hams
found for Enron 2 (Table III), Enron 5 (Table VI), and
than other compared spam filters. It achieved an impressive
Enron 6 (Table VII). Both methods, MDL and SVM,
average accuracy rate higher than 97% and high precision
achieved similar performance with no significant statistical
× recall rates for all datasets indicating that the MDL spam
difference only for Enron 3 (Table IV) and Enron 4
filter makes few mistakes. We also verify that the MDL
(Table V).
classifier achieved an average M CC score higher than
c 2012 John Wiley & Sons, Ltd. Security Comm. Networks 2012; 00:1–11 DOI: 10.1002/sec Prepared using secauth.cls
7
Compression-based Spam Filter
T.A. Almeida & A. Yamakami
Table VII. Enron 6 – Results achieved by each filter.
Classifiers Basic NB MN TF NB MN Bool NB MV Bern NB Bool NB Gauss NB Flex Bayes SVM MDL
Sre(%)
Spr(%)
Lre(%)
Lpr(%)
Accw (%)
M CC
87.78 75.78 93.33 96.00 66.67 89.56 94.22 89.78 98.67
99.00 99.42 97.45 92.31 99.67 98.05 97.03 95.28 95.48
97.33 98.67 92.67 76.00 99.33 94.67 91.33 86.67 86.00
72.64 57.59 82.25 86.36 49.83 75.13 84.05 73.86 95.56
90.17 81.50 93.17 91.00 74.83 90.83 93.50 89.00 95.50
0.781 0.651 0.828 0.753 0.572 0.785 0.833 0.727 0.878
0.925 for all tested e-mail collections. It clearly indicates that the proposed filter almost accomplished a perfect prediction (M CC = 1.000) and it is much better than not
LAM % = logit−1
using a filter (M CC = 0.000). Among the evaluated NB classifiers, the results indicate
! logit(f pr) + logit(f nr) , 2
where
that all of them achieved similar performance with no ! p . logit(p) = log 100 − p
significant statistical difference. However, they achieved lower results than MDL and SVM, which attained accuracy rate higher than 90% for the most of Enron spam datasets. To reinforce the validation we performed an other experiment to evaluate the MDL spam filter on more realistic and larger datasets: TREC06 and CEAS08. The TREC Spam Track and CEAS Spam Challenge are the largest and most realistic laboratory evaluations to date. The main task at TREC 2006 – the immediate feedback task – simulates the on-line deployment of a spam filter with idealized user feedback [1]. On the other hand, the 2008 CEAS Spam Challenge evaluated the filters using a live simulation task which offers delayed partial feedback in which feedback is delayed and given only for messages addressed to a subset of the recipients. The main goal is to simulate a real mailbox in order to explore the impact of imperfect or limited user feedback. To offer a fair evaluation and reproducible results, we followed exactly the same guidelines used in the challenges including the same measures: spam caught (%), blocked ham (%), LAM% and (1-AUC)% [1]. The Logistic Average Misclassification (LAM %) is a
This measure imposes no a priori relative importance on ham or spam misclassification, and rewards equally a fixed-factor improvement in the odds of either [1]. The lower the LAM %, the better the performance. The area under the ROC curve (AU C) provides an estimate of the effectiveness of a soft classifier over all threshold filtering settings. AU C also has a probabilistic interpretation: it is the probability that the classifier will award a random spam message a higher score than a random ham message. The (1 − AU C)% is often used instead of AU C due to the fact that in spam filtering the AU C value is very close to 1 [1]. Again, we compared the performance achieved by the MDL spam filter against: • Na¨ıve Bayes classifier: the open-source and well-known 0.94.0,
Bogofilter
default
classifier
parameters
–
(version
available
at
http://www.bogofilter.org); • SVM classifier: relaxed linear support vector machine [40].
single quality measure, based only on the filter’s binary
Table VIII present the results achieved by MDL spam
classifications. Let the false positive rate (f pr) and false
filter and established spam filters on immediate feedback
negative rate (f nr), LAM % is given by
task for the TREC06 data set. Table IX present the results achieved by the filters for CEAS08 data set on
8
c 2012 John Wiley & Sons, Ltd. Security Comm. Networks 2012; 00:1–11 DOI: 10.1002/sec Prepared using secauth.cls
T.A. Almeida & A. Yamakami
Compression-based Spam Filter
live simulation. The results by each e-mail collection are
To evaluate the proposed method we have conducted
ordered by decreasing performance in the (1 − AU C)%
empirical experiments using real, large and public datasets
statistic.
and the results indicate that the proposed classifier
Table VIII. TREC 2006 – Results achieved by each filter.
outperforms currently established spam filters. Actually, we are conducting more experiments to compare our approach with other established commercial
Filters
(1 − AU C)%
SC%
BH%
LAM %
MDL SVM NB
0.053 0.060 0.084
96.92 99.51 91.77
0.19 0.71 0.00
0.52 0.62 0.00
and open-source spam filters, such as: SpamAssassin, ProcMail, OSBF-Lua and others.
ACKNOWLEDGMENT Table IX. CEAS 2008 – Results achieved by each filter.
This work is partially supported by the Brazilian funding agencies CNPq, CAPES and FAPESP.
Filters
(1 − AU C)%
SC%
BH%
LAM %
MDL NB SVM
0.0129 0.0157 0.0221
95.79 89.04 98.96
0.08 0.00 0.51
0.21 0.00 0.49
REFERENCES 1. Cormack G. Email Spam Filtering: A Systematic Review. Foundations and Trends in Information
The MDL spam filter achieved very good performance
Retrieval 2008; 1(4):335–455.
for all tested e-mail data sets independent on the task. If
2. Joachims T. A Probabilistic Analysis of the Rocchio
we take into account that the (1 − AU C)% is the standard
Algorithm with TFIDF for Text Categorization.
measure used in the spam challenges, we can conclude that
Proceedings of 14th International Conference on
the proposed approach outperforms the top-performance
Machine Learning, Nashville, TN, USA, 1997; 143–
spam filters. Note that, although the MDL has not achieved
151.
the best SC% rates, it attained very low BH% and
3. Schapire R, Singer Y, Singhal A. Boosting and
LAM % rates for all tested data sets. It indicates that the
Rocchio Applied to Text Filtering. Proceedings of the
proposed filter rarely misclassify a legitimate message. It is
21st Annual International Conference on Information
a remarkable feature since in spam filtering false positives
Retrieval, Melbourne, Australia, 1998; 215–223.
are considered much worse than false negatives. Despite
4. Carreras X, Marquez L. Boosting Trees for Anti-
the fact that Bogofilter (NB classifier) attained the best
Spam Email Filtering. Proceedings of the 4th Inter-
BH% rates for all used e-mail data sets, it achieved the
national Conference on Recent Advances in Natural
worst SC% rates.
Language Processing, Tzigov Chark, Bulgaria, 2001; 58–64. 5. Reddy C, Park JH. Multi-Resolution Boosting for
5. CONCLUSIONS AND FURTHER WORK
Classification and Regression Problems. Knowledge and Information Systems 2010; . 6. Sahami M, Dumais S, Hecherman D, Horvitz E.
In this paper, we have presented a spam filtering approach
A Bayesian Approach to Filtering Junk E-mail.
based on data compression model that have proved to be
Proceedings of the 15th National Conference on
fast to construct and incrementally updateable. We have
Artificial Intelligence, Madison, WI, USA, 1998; 55–
also compared its performance with the well-known linear
62.
Support Vector Machine and different models of Na¨ıve
7. Androutsopoulos I, Koutsias J, Chandrinos K,
Bayes classifiers, something the spam literature does not
Paliouras G, Spyropoulos C. An Evalutation of Naive
always acknowledge.
Bayesian Anti-Spam Filtering. Proceedings of the
c 2012 John Wiley & Sons, Ltd. Security Comm. Networks 2012; 00:1–11 DOI: 10.1002/sec Prepared using secauth.cls
9
Compression-based Spam Filter
11st European Conference on Machine Learning, Barcelona, Spain, 2000; 9–17. 8. Song Y, Kolcz A, Gilez C. Better Naive Bayes Classification for High-precision Spam Detection. Software – Practice and Experience 2009; 39(11):1003–1024. 9. Almeida T, Yamakami A, Almeida J. Evaluation of Approaches for Dimensionality Reduction Applied with Naive Bayes Anti-Spam Filters. Proceedings of the 8th IEEE International Conference on Machine
T.A. Almeida & A. Yamakami
Computers and Security 2006; 25(8):566–578. 19. Blanzieri E, Bryl A. A Survey of Learning-based Techniques of Email Spam Filtering. Artificial Intelligence Review 2008; 29(1):335–455. 20. Guzella T, Caminhas W. A Review of Machine Learning Approaches to Spam Filtering. Expert Systems
with
Applications
September
2009;
36(7):10 206–10 222. 21. Caruana G, Li M. A Survey of Emerging Approaches
Learning and Applications, Miami, FL, USA, 2009;
to Spam Filtering. ACM Computing Surveys 2012;
517–522.
44(2):1–27.
10. Almeida T, Yamakami A, Almeida J. Probabilistic
22. Almeida TA, Yamakami A. Advances in Spam
Anti-Spam Filtering with Dimensionality Reduction.
Filtering Techniques. Computational Intelligence for
Proceedings of the 25th ACM Symposium On Applied
Privacy and Security, Studies in Computational
Computing, Sierre, Switzerland, 2010; 1804–1808.
Intelligence, vol. 394, Elizondo D, Solanas A,
11. Almeida T, Almeida J, Yamakami A. Spam Filtering:
Martinez-Balleste A (eds.). Springer, 2012; 199–214.
How the Dimensionality Reduction Affects the
23. Rissanen J. Modeling by Shortest Data Description.
Accuracy of Naive Bayes Classifiers. Journal of Internet Services and Applications 2011; 1(3):183– 200. 12. Drucker H, Wu D, Vapnik V. Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks September 1999; 10(5):1048–1054.
Automatica 1978; 14:465–471. 24. Barron A, Rissanen J, Yu B. The Minimum Description Length Principle in Coding and Modeling. IEEE Transactions on Information Theory 1998; 44(6):2743–2760. 25. Gr¨unwald P. A Tutorial Introduction to the Minimum Description Length Principle. Advances in Min-
13. Kolcz A, Alspector J. SVM-based Filtering of E-mail
imum Description Length: Theory and Applications,
Spam with Content-Specific Misclassification Costs.
Gr¨unwald P, Myung I, Pitt M (eds.). MIT Press, 2005;
Proceedings of the 1st International Conference on Data Mining, San Jose, CA, USA, 2001; 1–14.
3–81. 26. Almeida T, Yamakami A, Almeida J. Filtering Spams
14. Hidalgo J. Evaluating Cost-Sensitive Unsolicited
using the Minimum Description Length Principle.
Bulk Email Categorization. Proceedings of the 17th
Proceedings of the 25th ACM Symposium On Applied
ACM Symposium on Applied Computing, Madrid,
Computing, Sierre, Switzerland, 2010; 1856–1860.
Spain, 2002; 615–620. 15. Forman G, Scholz M, Rajaram S. Feature Shaping
27. Androutsopoulos I, Paliouras G, Michelakis E. Learning to Filter Unsolicited Commercial E-
for Linear SVM Classifiers. Proceedings of the
Mail. Technical Report 2004/2, National Centre for
15th ACM SIGKDD International Conference on
Scientific Research “Demokritos”, Athens, Greece
Knowledge Discovery and Data Mining, Paris, France, 2009; 299–308.
March 2004. 28. Teahan W, Harper D. Using Compression-Based Lan-
16. Almeida T, Yamakami A. Content-Based Spam
guage Models for Text Categorization. Proceedings
Filtering. Proceedings of the 23rd IEEE International
of the 2001 Workshop on Language Modeling and
Joint Conference on Neural Networks, Barcelona,
Information Retrieval, Pittsburgh, PA, USA, 2001; 1–
Spain, 2010; 1–7.
5.
17. Almeida TA, Yamakami A. Facing the Spammers:
29. Frank E, Chui C, Witten I. Text Categorization Using
A Very Effective Approach to Avoid Junk E-mails.
Compression Models. Proceedings of the 10th Data
Expert Systems with Applications 2012; :1–5.
Compression Conference, Snowbird, UT, USA, 2000;
18. Carpinter J, Hunt R. Tightening the Net: A Review of
555–565.
Current and Next Generation Spam Filtering Tools.
10
c 2012 John Wiley & Sons, Ltd. Security Comm. Networks 2012; 00:1–11 DOI: 10.1002/sec Prepared using secauth.cls
T.A. Almeida & A. Yamakami
Compression-based Spam Filter
30. Siefkes C, Assis F, Chhabra S, Yerazunis W. Combining Winnow and Orthogonal Sparse Bigrams
for Classification: An Overview. Bioinformatics May 2000; 16(5):412–424.
for Incremental Spam Filtering. Proceedings of the
36. McCallum A, Nigam K. A Comparison of Event
8th European Conference on Principles and Practice
Models for Naive Bayes Text Classication. Proceed-
of Knowledge Discovery in Databases, Pisa, Italy,
ings of the 15th AAAI Workshop on Learning for Text Categorization, Menlo Park, CA, USA, 1998; 41–48.
2004; 410–421. 31. Metsis V, Androutsopoulos I, Paliouras G. Spam
37. Schneider K. On Word Frequency Information
Filtering with Naive Bayes - Which Naive Bayes?
and Negative Evidence in Naive Bayes Text
Proceedings of the 3rd International Conference on
Classification. Proceedings of the 4th International
Email and Anti-Spam, Mountain View, CA, USA,
Conference on Advances in Natural Language Processing, Alicante, Spain, 2004; 474–485.
2006; 1–5. 32. Bordes A, Ertekin S, Weston J, Bottou L. Fast Kernel
38. Losada D, Azzopardi L. Assessing Multivariate
Classifiers with Online and Active Learning. Journal
Bernoulli Models for Information Retrieval. ACM
of Machine Learning Research 2005; 6:1579–1619.
Transactions on Information Systems June 2008;
33. Matthews B. Comparison of the Predicted and
26(3):1–46.
Phage
39. John G, Langley P. Estimating Continuous Distribu-
Lysozyme. Biochimica et Biophysica Acta October
tions in Bayesian Classifiers. Proceedings of the 11st
1975; 405(2):442–451.
International Conference on Uncertainty in Artificial
Observed
Secondary
Structure
of
T4
34. Almeida T, Hidalgo JG, Yamakami A. Contributions
Intelligence, Montreal, Canada, 1995; 338–345.
to the Study of SMS Spam Filtering: New
40. Sculley D, Wachman G, Brodley C. Spam Filtering
Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering, Mountain
using Inexact String Matching in Explicit Feature Space with On-Line Linear Classifiers. Proceedings
View, CA, USA, 2011; 259–262.
of the 15th Text REtrieval Conference, Gaithersburg,
35. Baldi P, Brunak S, Chauvin Y, Andersen C, Nielsen
MD, USA, 2006; 1–10.
H. Assessing the Accuracy of Prediction Algorithms
c 2012 John Wiley & Sons, Ltd. Security Comm. Networks 2012; 00:1–11 DOI: 10.1002/sec Prepared using secauth.cls
11