Spam Filtering Using Compression Models Andrej Bratko and Bogdan Filipiˇc {andrej.bratko,bogdan.filipic}@ijs.si Department of Intelligent Systems Joˇzef Stefan Institute Jamova 39, SI-1000 Ljubljana, Slovenia
Technical Report IJS-DP 9227 December 6, 2005
Abstract

Spam filtering poses a special problem in text categorization, whose defining characteristic is that filters face an active adversary that constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. This paper summarizes our experiments for the TREC 2005 spam track, in which we consider the use of adaptive statistical data compression models for the spam filtering task. The nature of these models allows them to be employed as Bayesian text classifiers based on character sequences. Since messages are modeled as sequences of characters, tokenization and other error-prone preprocessing steps are omitted altogether, resulting in a method that is very robust. The models are also fast to construct and incrementally updateable. We present experimental results indicating that compression models perform well in comparison to established spam filters. We also show that the method is extremely robust to noise, which should make such filters difficult to defeat.
1 Introduction
Electronic mail is arguably the "killer app." of the internet. It is used daily by millions of people to communicate around the globe and is a mission-critical application for many businesses. Over the last decade, unsolicited bulk email has become a major problem for email users. An overwhelming amount of spam is flowing into users' mailboxes daily and the trend is on the rise. In 2004, an estimated 62% of all email was attributed to spam, according to the anti-spam outfit Brightmail [1]. Not only is spam frustrating for most email users, it is straining the IT infrastructure of organizations and costing businesses billions of dollars in lost productivity.

[1] http://brightmail.com/, 2004-03-12
Many different approaches for fighting spam have been proposed, ranging from various sender authentication protocols to charging senders indiscriminately, in money or computational resources (Goodman, 2003). A promising approach is the use of content-based filters, capable of discerning between spam and legitimate email messages automatically. Machine learning methods are particularly attractive for this task, since they are capable of adapting to the evolving characteristics of spam, and an abundance of data is usually available for training such models (Sahami et al., 1998; Androutsopoulos et al., 2000; Graham, 2002). Nevertheless, spam filtering poses a special problem for automated text categorization, whose defining characteristic is that filters face an active adversary that constantly attempts to evade filtering. Unlike most text categorization tasks, the cost of misclassification is heavily skewed: labeling a legitimate email as spam, usually referred to as a false positive, carries a much greater penalty than vice versa. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms.

In this article, we examine the use of adaptive data compression models for spam filtering. Specifically, we employ the Prediction by Partial Matching (PPM) (Cleary and Witten, 1984) and Context Tree Weighting (CTW) (Willems et al., 1995) algorithms. Classification is done by first building two compression models from the training corpus, one for spam and one for legitimate messages. The compression rate achieved using these two models on a new message determines the classification outcome. Two variants of the method with different theoretical underpinnings are evaluated.
The first variant works in a typical Bayesian manner: It estimates the probability of a document using compression models derived from the training data, and assigns the class based on the model that deems the document most probable. The second variant selects the class for which the addition of the new document produces the minimal increase in complexity of the class description. The methods were evaluated on a number of real-world email collections in the framework of the TREC 2005 spam track. The results of this evaluation show promise in the use of statistical compression algorithms for spam filtering. We also report on the results of additional experiments, in which we compare different variants of the compression approach, and its tolerance to noise in the data.
2 Related Work
Most relevant to our work is that of Frank et al. (2000), who propose the use of adaptive statistical compression models for automated text categorization. They experiment with using the PPM algorithm as a Bayesian text classifier. They find that the compression-based method is inferior to an SVM and roughly on par with Naive Bayes on the classical Reuters-21578 dataset. The same method was successfully applied to a number of text categorization problems, such as authorship attribution, dialect identification and genre classification by Teahan (2000), who claims that the compression-based method outperforms other methods in most cases. Peng et al. (2004) propose augmenting a word-based Naive Bayes classifier with statistical language models to account for word dependencies. They also consider training the language model on characters instead of words, incidentally reproducing the method of Frank et al. (2000). In experiments on a number of categorization tasks, they find that the character-based method usually outperforms the word-based approach and always outperforms a simple Naive Bayes model. They also claim that the character-level n-gram model outperforms an SVM classifier on most datasets.

In most spam filtering work, text is modeled with the widely used bag-of-words (BOW) representation, even though it is widely accepted that tokenization is a vulnerability of keyword-based spam filters. Some filters use patterns of characters instead of word tokens (Rigoutsos and Huynh, 2004; Yerazunis, 2003). However, their techniques are different from the statistical data compression models that are evaluated here. While the idea of using statistical compression algorithms for text categorization is not new, we are not aware of any study that considers such methods for spam filtering. We demonstrate that the approach is well suited to the spam filtering problem. In particular, we show that compression models are extremely robust to noise and detect useful punctuation and other patterns in the text that are difficult to model using a bag-of-words representation. We also propose a modification of the original method with a strong theoretical justification, which is confirmed by the results of our experimental evaluation.
3 Statistical Data Compression
Probability plays a central role in data compression: knowing the exact probability distribution governing an information source allows us to construct optimal or near-optimal codes for messages produced by the source. Statistical data compression algorithms exploit this relationship by building a statistical model of the information source, which can be used to estimate the probability of each possible message that can be emitted by the source. This model is coupled with an encoder that uses the probability estimates to construct the final binary representation.
3.1 Finite Memory Markov Sources
For our purposes, the encoding problem is irrelevant. We therefore focus on the source modeling task. In particular, we are interested in modeling text-generating sources. Each message d produced by such a source is naturally represented as a sequence $s_1^n = s_1 \ldots s_n \in \Sigma^*$ of symbols over the source alphabet $\Sigma$. Such sequences can be arbitrarily long and complex. Although a symbol is usually interpreted as a single character, other schemes are possible, such as word-level models. To make the inference problem tractable, sources are usually modeled as stationary and ergodic Markov sources with limited memory k. Each symbol in a message $d \in \Sigma^*$ is therefore assumed to be independent of all but the preceding k symbols (the symbol's context):

$$P(d) = \prod_{i=1}^{|d|} P(s_i \mid s_1^{i-1}) \approx \prod_{i=1}^{|d|} P(s_i \mid s_{i-k}^{i-1}) \qquad (1)$$
The number of context symbols k is also referred to as the order of the model. Higher order models are more realistic and can therefore better model the characteristics of the source. However, since the number of possible contexts increases exponentially with context length, accurate parameter estimates are hard to obtain. For example, an order-k model requires $|\Sigma|^k(|\Sigma|-1)$ parameters. Compression algorithms typically tackle this problem by blending models of different orders, much like the smoothing methods used for modeling natural language. We describe two such algorithms that were used in our experiments later in this section.
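To make the exponential growth concrete, the parameter count above can be computed directly. The following sketch uses an illustrative 27-symbol alphabet (letters plus space); the alphabet size and orders are choices made for the example, not values from the paper:

```python
def markov_param_count(alphabet_size: int, order: int) -> int:
    # An order-k model has |Sigma|^k contexts, and each context holds a
    # next-symbol distribution with |Sigma| - 1 free parameters.
    return alphabet_size ** order * (alphabet_size - 1)

# Even for a small 27-symbol alphabet, the count explodes with the order:
for k in range(1, 5):
    print(k, markov_param_count(27, k))
```

For realistic character alphabets of 256 symbols the counts are larger still, which is why direct estimation of high-order models is hopeless and order blending is needed.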
3.2 Static vs. Adaptive Methods
Static data compression methods require two passes over the data. The first pass is used to train the model. The encoded message then contains the trained model, followed by the data which is encoded using this model. The decoder first reads the model and uses it to decode the remaining data.
Adaptive methods do not include the model in the encoded message. Rather, they start encoding the message using an empty model, for example, a uniform distribution over all symbols. The model is incrementally updated after each symbol. The decoder repeats this process, building its own version of the model as the message is decoded. Adaptive methods require only a single pass over the data. The redundancy incurred because adaptive methods start with an empty, uninformed model can be compared to the cost of separately encoding the model in static methods.
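A minimal illustration of this adaptive behavior, using an order-0 model with a Laplace (uniform) prior as the "empty" starting model; the alphabet size and input text are arbitrary choices for the sketch:

```python
import math
from collections import Counter

def adaptive_code_lengths(text: str, alphabet_size: int = 256) -> list:
    """Per-symbol code lengths (in bits) under an adaptive order-0 model
    that starts from a uniform distribution and updates after each symbol."""
    counts: Counter = Counter()
    total = 0
    lengths = []
    for ch in text:
        p = (counts[ch] + 1) / (total + alphabet_size)
        lengths.append(-math.log2(p))
        counts[ch] += 1   # the decoder performs the exact same update
        total += 1
    return lengths

lengths = adaptive_code_lengths("aaaaaaaaaa")
# The first 'a' costs a full 8 bits; later occurrences cost ever less.
# This early overhead is the adaptive analogue of the explicitly
# transmitted model in static methods.
```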
3.3 Prediction by Partial Matching
The Prediction by Partial Matching (PPM) algorithm (Cleary and Witten, 1984) is an adaptive data compression method that has set the standard for lossless text compression since its introduction over two decades ago (Cleary and Teahan, 1997). It uses a special escape mechanism for smoothing. An order-k PPM model works as follows: The source alphabet is extended with a special escape symbol ǫ. When predicting the next symbol in a sequence, the longest context found in the preceding text is used (up to length k). If the symbol has appeared in this context in the preceding text, its relative frequency within the context is used for prediction. Otherwise, the escape symbol is encoded and the algorithm reverts to a lower-order model. The algorithm therefore reserves some probability mass for the escape symbol, depending on the statistics of the current context. This probability mass is effectively distributed according to a lower-order model. The procedure is applied recursively, ultimately terminating in a default model of order −1, which always predicts a uniform distribution among all possible symbols. Many versions of the PPM algorithm exist, differing mainly in the way the escape probability is estimated. In our implementation, we used escape method D (Howard, 1993) in combination with the exclusion principle (Sayood, 2000).
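The escape-and-fall-back mechanism can be sketched as follows. This is a deliberately simplified predictor: the escape probability here is proportional to the number of distinct symbols seen in a context (roughly escape method C), whereas the paper's implementation uses method D with the exclusion principle, which this sketch omits:

```python
from collections import defaultdict

class SimplePPM:
    """Simplified PPM-style predictor (method-C-like escapes, no exclusions)."""

    def __init__(self, order: int, alphabet_size: int = 256):
        self.order = order
        self.alphabet_size = alphabet_size
        self.counts = {}  # (order, context) -> {symbol: count}

    def update(self, history: str, symbol: str) -> None:
        # Record the symbol under every context length from 0 up to the order.
        for k in range(self.order + 1):
            ctx = history[len(history) - k:] if k else ""
            self.counts.setdefault((k, ctx), defaultdict(int))[symbol] += 1

    def prob(self, history: str, symbol: str) -> float:
        escape_mass = 1.0
        for k in range(min(self.order, len(history)), -1, -1):
            table = self.counts.get((k, history[len(history) - k:] if k else ""))
            if not table:
                continue  # context never seen: fall through to a lower order
            total = sum(table.values())
            distinct = len(table)
            if symbol in table:
                return escape_mass * table[symbol] / (total + distinct)
            # Escape: pass the reserved mass down to the next lower order.
            escape_mass *= distinct / (total + distinct)
        return escape_mass / self.alphabet_size  # order -1: uniform model

model = SimplePPM(order=2)
text = "abababab"
for i, ch in enumerate(text):
    model.update(text[:i], ch)
# After training on "abababab", 'a' is highly probable after context "ab",
# while unseen symbols receive only the escaped residual mass.
```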
3.4 Context Tree Weighting
The Context Tree Weighting (CTW) algorithm (Willems et al., 1995) uses a Bayesian mixture of all tree source models up to a certain order to predict the next symbol in a sequence. Each such model can be interpreted as a set of contexts with one set of parameters (i.e. next-symbol probabilities) for each context. These contexts form a tree. The CTW algorithm blends all models that are subtrees of the full order-k context tree (see Figure 1). When predicting the next symbol, the models are weighted according to their Bayesian posterior, which is estimated from the data that has been encoded so far. Although the mixture includes more than $2^{|\Sigma|^{k-1}}$ different models, the algorithm's complexity is linear in the sequence length. This is achieved with a clever decomposition of the posteriors, exploiting the property that many of these models share the same parameters. The CTW algorithm has favorable theoretical properties, i.e. a theoretical upper bound on non-asymptotic redundancy can be proven for this algorithm. However, the algorithm is not widely used in practice, possibly due to its complex implementation, and the observation that it yields only minor gains in compression compared to PPM (or none at all). The original CTW algorithm was designed for compression of binary sources. We used a version of the algorithm that is adapted for multi-alphabet sources, using the escape mechanism described in (Tjalkens et al., 1993).
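A compact sketch of binary CTW with the Krichevsky-Trofimov (KT) estimator may help make the weighting concrete. This is the textbook binary algorithm, not the multi-alphabet variant used in the paper, and the all-zero initial past is an assumption of the sketch:

```python
import math

class CTWNode:
    def __init__(self):
        self.a = 0      # count of zeros observed at this node
        self.b = 0      # count of ones
        self.pe = 1.0   # KT estimate of this node's data
        self.pw = 1.0   # weighted (mixture) probability
        self.child = {}

def ctw_update(root: CTWNode, context: str, bit: int, depth: int) -> None:
    """Process one bit: update KT estimates along the context path, then
    recompute weighted probabilities from the leaf back to the root."""
    path = [root]
    node = root
    for i in range(depth):
        node = node.child.setdefault(context[-(i + 1)], CTWNode())
        path.append(node)
    for node in path:
        count = node.a if bit == 0 else node.b
        node.pe *= (count + 0.5) / (node.a + node.b + 1)  # KT sequential update
        if bit == 0:
            node.a += 1
        else:
            node.b += 1
    for i, node in enumerate(reversed(path)):
        if i == 0:
            node.pw = node.pe  # leaf node
        else:
            prod = 1.0
            for c in ("0", "1"):
                prod *= node.child[c].pw if c in node.child else 1.0
            node.pw = 0.5 * node.pe + 0.5 * prod  # mix: stop here vs. split

# Compress a strongly biased sequence with an order-3 binary CTW model:
depth, past = 3, "000"
root = CTWNode()
bits = "0" * 50
for i, b in enumerate(bits):
    ctw_update(root, (past + bits[:i])[-depth:], int(b), depth)
code_length = -math.log2(root.pw)  # far fewer than 50 bits
```

The root's weighted probability is the block probability of the whole sequence under the mixture, so its negative log is the ideal code length.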
4 Text Categorization using Compression Models
Compression models can be applied to text categorization by building one compression model for each class and using these models to evaluate the target document. The model that deems the document most probable (i.e. the model that achieves the highest compression rate) determines the classification outcome. In the following subsections, we describe two approaches to classification, both of which use compression models in a very similar manner, but with different theoretical motivation. We first describe the Minimum Cross-Entropy (MCE) criterion for classification (Frank et al., 2000; Teahan, 2000). This method chooses the class which yields the model that best describes the test document, i.e. encodes it most efficiently. We then introduce the Minimum Information Gain (MIG) classification criterion, which selects the class for which the addition of the test document produces the minimal increase in the length of the description of the class.

[Figure 1: All tree models considered by an order-2 CTW algorithm with a binary alphabet. An example input sequence (010110...) is given to show the context that each of the models uses for prediction of the last character in this sequence.]
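As a rough illustration of the overall scheme, a general-purpose compressor can stand in for the adaptive models used in this paper. The sketch below substitutes zlib for PPM/CTW (an assumption made purely for the example; it is not the authors' method): one "model" is built per class from its training text, and the class whose description grows least when the document is appended wins:

```python
import zlib

def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, 9))

def classify(doc: str, corpora: dict) -> str:
    """corpora maps a class label to its concatenated training text."""
    def description_increase(corpus: str) -> int:
        # How many extra bytes does appending the document cost?
        return (compressed_size((corpus + " " + doc).encode())
                - compressed_size(corpus.encode()))
    return min(corpora, key=lambda c: description_increase(corpora[c]))

# Toy training corpora, invented for the example:
corpora = {
    "ham": "meeting schedule project report budget review minutes " * 20,
    "spam": "buy cheap pills now free offer click winner prize " * 20,
}
```

A document that shares long character sequences with a class's training text compresses cheaply against that class, which is exactly the intuition behind both criteria described next.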
4.1 Classification by Minimum Cross-entropy
The entropy H(X) of a source X measures the average amount of information conveyed by a symbol of the source alphabet. It also determines a lower bound on asymptotic lossless compression of messages generated by the source (Shannon and Weaver, 1949):

$$H(X) = \lim_{n \to \infty} -\frac{1}{n} \sum_{s_1^n} P(s_1^n) \log_2 P(s_1^n) \qquad (2)$$

If the true probability distribution P governing the source were known, an average message could be encoded using no less than H(X) bits per symbol. However, the true distribution over all possible messages is typically unknown, so the exact entropy is hard to determine. A compression model M approximates this distribution. Although there is no guarantee that the model is perfect, it will hopefully match the true distribution reasonably well. The cross-entropy between an information source X and a model M is defined as follows:

$$H(X, M) = \lim_{n \to \infty} -\frac{1}{n} \sum_{s_1^n} P(s_1^n) \log_2 P_M(s_1^n) \qquad (3)$$

The cross-entropy H(X, M) determines the average number of bits per symbol required to encode messages produced by source X, when using model M for compression. Since the best possible model achieves a compression rate equal to the entropy, the following relation always holds:

$$H(X, M) \geq H(X) \qquad (4)$$
The exact cross-entropy is again hard to compute, since it would require knowing the exact probability $P(s_1^n)$ of each possible message. It can, however, be approximated by applying the model M to sufficiently long sequences of symbols, with the expectation that these sequences are representative samples of all possible sequences generated by the source:

$$H(X, M) \approx -\frac{1}{n} \log_2 P_M(s_1^n), \quad n \text{ very large.} \qquad (5)$$

For a given document d, the document cross-entropy can be defined as the average number of bits per symbol required to encode the document using the model M:

$$H(X, M, d) = -\frac{1}{n} \log_2 P_M(d) = -\frac{1}{n} \sum_{i=1}^{|d|} \log_2 P_M(s_i \mid s_1^{i-1}) \qquad (6)$$
We can expect that a model that achieves a low cross-entropy on a given document approximates the information source that actually generated the document well. We can use this measure for classification by building one model from the training data for each class. The model that achieves the lowest cross-entropy determines the classification outcome. After some arithmetic manipulation, it can be shown that the minimum cross-entropy criterion is equal to the Bayesian maximum a-posteriori class selection rule, if the class priors are ignored:

$$\begin{aligned}
c(d) &= \arg\max_C \left[ P(C \mid d) \right] && (7) \\
&= \arg\max_C \left[ \frac{P(C) P(d \mid C)}{P(d)} \right] && (8) \\
&= \arg\max_C \left[ P(d \mid C) \right] && (9) \\
&= \arg\max_C \left[ P_{M_C}(d) \right] && (10) \\
&= \arg\max_C \left[ \log_2 P_{M_C}(d) \right] && (11) \\
&= \arg\min_C \left[ -\frac{1}{n} \log_2 P_{M_C}(d) \right] && (12)
\end{aligned}$$
In the above equation, $P_{M_C}(d)$ and $P(d \mid C)$ are equivalent. The notation is used to emphasize that a statistical compression model $M_C$, which was built from the training data for class C, is used to estimate the conditional probability of the document given the class, $P(d \mid C)$.
4.2 Classification by Minimum Information Gain
The Minimum Description Length (MDL) principle (Rissanen, 1978) favors models that yield compact representations of the data. Specifically, it states that the best model results in the shortest description of the model and the data, given this model. In other words, the model that best compresses the data is preferred. This model selection criterion naturally balances the complexity of the model and the degree to which this model fits the data. This is precisely the tradeoff that is sought by static data compression methods, which were discussed in Section 3.2. Adaptive methods have the same goal: to produce the shortest possible representation of the data. The complexity of the model is hidden in the fact that the model adapts its parameters as the data is being encoded, resulting in some redundancy due to the use of uninformed parameters. This penalty is greater with complex models that require more data to train, but diminishes as the model adapts after a sufficient amount of data is "seen". Since the training data in a learning problem is not sufficient to completely specify the class (otherwise, the learning problem would be trivial), we expect each test
instance to convey some new information about the class. On the other hand, we expect some of the data contained in the instance to be redundant with respect to the training data for its class. It is this redundant information that allows classification. This line of thought leads us to an alternative classification criterion: to choose the class for which the amount of new information is minimal. In other words, the minimum information gain criterion selects the class for which the addition of the test document produces a minimal increase in the length of the description of the class:

$$c(d) = \arg\min_C \left[ I(C \cup \{d\}) - I(C) \right] \qquad (13)$$
In the above equation, I(C) denotes the amount of information contained in the set of documents C, where C is one of the target classes. Each class is defined as the set of all training documents belonging to this class. Adaptive models are particularly suitable for this type of classification, since they can be used to estimate the information gain without having to rebuild the model from scratch. The only difference in implementation compared to the minimum cross-entropy approach is that the models are allowed to adapt as a message is being classified. The class selection rule is then:

$$c(d) = \arg\min_C \left[ -\log_2 P^{*}_{M_C}(d) \right] = \arg\max_C \left[ P^{*}_{M_C}(d) \right] \qquad (14, 15)$$
where $P^{*}_{M_C}(d)$ denotes the probability assigned to the sequence of symbols found in document d, as estimated with a compression model $M_C$ built on the training documents for class C. The asterisk symbol '*' is used to indicate that the compression model is allowed to adapt as the sequence is being evaluated.
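The implementation difference between the two criteria is tiny. With a Laplace-smoothed unigram character model standing in for PPM (a simplification made for illustration), they differ only in whether the counts are frozen while the test document is scored:

```python
import math
from collections import Counter

def document_bits(doc: str, train_text: str,
                  alphabet_size: int = 256, adapt: bool = False) -> float:
    """-log2 P(doc) under a Laplace-smoothed unigram character model.
    adapt=False scores with frozen counts (MCE-style); adapt=True lets
    the model keep learning while scoring (MIG-style)."""
    counts = Counter(train_text)
    total = len(train_text)
    bits = 0.0
    for ch in doc:
        p = (counts[ch] + 1) / (total + alphabet_size)
        bits += -math.log2(p)
        if adapt:          # MIG: the model adapts during classification
            counts[ch] += 1
            total += 1
    return bits

# A repeated, previously unseen symbol is heavily penalized by the
# static model but quickly discounted by the adaptive one:
static_bits = document_bits("zzzz", "abcabc", adapt=False)
adaptive_bits = document_bits("zzzz", "abcabc", adapt=True)
```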
4.3 A Comparison of the Classification Criteria
This section attempts to further clarify the differences between the MCE and MIG classification criteria. Both methods model a class as an information source, and consider the training data for each class a sample of the type of data generated by the source. They differ in the way classification is performed. The MCE criterion assumes that the test document d was generated by some unknown information source. The document is considered a sample of the type of data generated by the unknown source. Classification is based on the distance between each class and the source that generated the document. This distance is defined as the document cross-entropy, which can serve as an estimate of the cross-entropy between the unknown source and each of the class information sources. The MIG criterion tests, for each class C, the hypothesis that d ∈ C, by adding the document to the training data of the class and estimating how much this addition increases the length of the class description. The class for which the addition of the document produces the smallest increase in the description length is selected. This can also be interpreted as selecting the class for which the addition of the document results in a minimal change in the definition of the class, or, equivalently, the class to which the document adds the least new information.

We believe that the MIG criterion offers some advantages over the MCE method. In particular, it discounts redundant evidence favoring a particular class that is found in the document being classified. This behavior is depicted in Figure 2, which shows the difference between the methods. Consider a hypothetical binary classification problem, in which the task is to classify a document that contains six words (excluding stopwords). For purposes of clarity, each word is considered an independent "piece of evidence" on which the classification is based. Three pieces of evidence favor the positive class and three favor the negative class, all with equal weight. However, two of the pieces of evidence that indicate the positive class are redundant - they contain the same information. In this scenario, the MCE criterion will assign an equal probability to each class. The MIG criterion will, on the other hand, select the negative class. This is because, while the first occurrence of the word "casino" is surprising under the hypothesis d ∈ −, the second occurrence is less surprising, since it is redundant. This is a direct consequence of allowing the model to adapt as the document is being processed.

Document: "The casino department develops software for the casino business."

Word          P(word|+)   P(word|−)
casino        1/500       1/1000
department    1/1000      1/500
develops      1/1000      1/500
software      1/1000      1/500
business      1/500       1/1000

Figure 2: An example document containing six "pieces of evidence" - occurrences of words that the classifier can use to predict the class. The probability of each individual word under each of the classes is given in the table.
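The example can be checked numerically. The word probabilities come from the table; the absolute corpus size (one million tokens per class) and the simple count-increment update are assumptions made purely to simulate adaptation:

```python
import math

N = 1_000_000  # assumed tokens per class; only the probability ratios matter
counts = {
    "+": {"casino": 2000, "department": 1000, "develops": 1000,
          "software": 1000, "business": 2000},
    "-": {"casino": 1000, "department": 2000, "develops": 2000,
          "software": 2000, "business": 1000},
}
doc = ["casino", "department", "develops", "software", "casino", "business"]

def log2_prob(cls: str, adapt: bool) -> float:
    c, n, logp = dict(counts[cls]), N, 0.0
    for w in doc:
        logp += math.log2(c[w] / n)
        if adapt:          # MIG: the model adapts as the doc is scored
            c[w] += 1
            n += 1
    return logp

# Static scoring (MCE): the six pieces of evidence cancel exactly.
# Adaptive scoring (MIG): the second "casino" is discounted, and the
# negative class comes out ahead.
```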
5 The Spam Filtering Task at TREC 2005
The methods described in the previous sections were evaluated in the framework of the Text REtrieval Conference (TREC) spam track. The purpose of TREC is to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies. In 2005, TREC introduced a new track on spam filtering. The goal of the spam track is to provide a standard evaluation of current and proposed spam filtering approaches. A framework and methodology for evaluation were developed as part of the TREC 2005 spam track, as well as a network of evaluation corpora (both public and private). A large public corpus was also prepared, containing almost 100,000 email messages received by employees of the Enron corporation, which were originally made public by the Federal Energy Regulatory Commission during its investigation.
5.1 Problem Description
For the spam filtering task, a number of email corpora containing legitimate (ham) and spam email were constructed automatically or labeled by hand. To make the evaluation realistic, messages were ordered by time of arrival. Evaluation is performed by presenting each message in sequence to the classifier, which has to label the message as spam or ham (i.e. legitimate mail), as well as provide some measure of confidence in the prediction - the “spamminess score”. After each classification, the filter is informed of the true label of the message, which allows the filter to update its model accordingly. This behavior simulates a typical setting in personal email filtering, which is usually based on user feedback, with the additional assumption that the user promptly corrects the classifier after every misclassification. Each participating group was allowed to submit up to four different filters for evaluation.
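The online evaluation protocol can be expressed as a simple loop. The filter interface below (the `classify`/`train` methods and the toy token-counting filter) is hypothetical, invented for this sketch; the actual TREC jig communicates with filters through a command-line interface:

```python
def run_online_evaluation(messages, spam_filter):
    """messages: (text, true_label) pairs, ordered by time of arrival."""
    results = []
    for text, true_label in messages:
        label, score = spam_filter.classify(text)  # label + spamminess score
        results.append((label, score, true_label))
        spam_filter.train(text, true_label)  # immediate user feedback
    return results

class ToyFilter:
    """Hypothetical stand-in filter: scores by per-class token counts."""
    def __init__(self):
        self.counts = {"spam": {}, "ham": {}}
    def classify(self, text):
        score = sum(self.counts["spam"].get(t, 0) - self.counts["ham"].get(t, 0)
                    for t in text.split())
        return ("spam" if score > 0 else "ham", score)
    def train(self, text, label):
        for t in text.split():
            self.counts[label][t] = self.counts[label].get(t, 0) + 1

f = ToyFilter()
results = run_online_evaluation(
    [("buy pills", "spam"), ("lunch meeting", "ham"), ("buy pills now", "spam")], f)
```

Note how the third message benefits from the feedback on the first one, which is exactly the incremental-learning setting the compression models are designed for.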
8
Table 1: Filters submitted by IJS for evaluation in the TREC 2005 spam track.

Label         Comp. algorithm   Order   Class. criterion
ijsSPAM1pm8   PPM               8       MIG
ijsSPAM2pm6   PPM               6       MIG
ijsSPAM3pm4   PPM               4       MIG
ijsSPAM4cw8   CTW               8       MIG

5.2 System Submitted by IJS
The filters that were submitted for evaluation in the TREC '05 spam track are listed in Table 1. Three of the filters were based on the PPM compression algorithm and the fourth on the CTW algorithm; only one CTW-based filter was submitted since, in our initial tests, the PPM algorithm generally outperformed CTW. The three PPM-based filters differed only in the order of the model. All four submitted filters use the MIG criterion for classification, i.e. the compression models were allowed to adapt during classification. This decision was again based on the results of initial experiments, in which the MIG criterion generally outperformed the MCE criterion. We primarily wanted to study the effect of varying the PPM model order. In hindsight, we probably should have omitted the CTW-based method altogether and submitted at least one filter using the MCE classification criterion. We would very much wish to see the difference in performance of the two classification criteria confirmed in the results of experiments on the private corpora.

The system submitted for the TREC spam track was designed as a client-server application. The server maintains the compression models required for classification in main memory and updates the models incrementally with each new training example. The filter performs very limited message preprocessing, which primarily includes MIME decoding and stripping messages of all non-textual message parts. All sequences of whitespace characters (tabs, spaces and newline characters) were replaced by a single space. The reason for this step was to undo line breaks added to messages by email clients. The newline characters that delimit header fields and separate the header from the body were preserved. No other preprocessing was done.
6 Experiments and Results
In this section, we report on the results of our experimental evaluation. We compare the performance of different compression models and classification criteria. We also compare compression models with a selection of established spam filters. Finally, we report on experiments that evaluate the robustness of compression models, which may prove to be the greatest advantage of using such methods for spam filtering.
6.1 Datasets
We report experiments on five email datasets. The SpamAssassin (sa) dataset [2] contains legitimate and spam email messages from the SpamAssassin developer mailing list. The TREC 2005 public (t05p) dataset [3] contains almost 100,000 messages received by employees of the Enron corporation. The dataset was prepared specifically

[2] http://spamassassin.org/publiccorpus/
[3] http://trec.nist.gov/data.html
for the evaluation in the TREC 2005 spam track. The messages were labeled semi-automatically with an iterative procedure described by Cormack and Lynam (2005). The other three datasets contain hand-labeled personal email. As these datasets are not publicly available due to privacy concerns, the evaluation of filters was done by the organizers of the TREC 2005 spam track. The basic statistics for all five datasets are given in Table 2.

Table 2: Basic statistics for the evaluation datasets.

Dataset   Messages   Ham      Spam
sa        6033       1884     4149
t05p      92189      39399    52790
mrx       49086      9038     40048
sb        7006       6231     775
tm        170201     150685   19516
6.2 Evaluation Criteria
A number of evaluation criteria proposed by Cormack and Lynam (2004) were used for the official evaluation of the TREC 2005 spam track. In this paper, we only report on a selection of these measures:

• HMR: Ham Misclassification Rate, the fraction of ham messages labeled as spam.

• SMR: Spam Misclassification Rate, the fraction of spam messages labeled as ham.

• 1-AUC: Area above the Receiver Operating Characteristic (ROC) curve.

All of the measures are computed on the raw filter results. These results include the classification decision for each message and an associated spamminess score, which specifies the degree to which the filter believes the message to be spam. We believe the 1-AUC statistic is the most comprehensive measure, since it captures the behavior of the filter at different filtering thresholds. Increasing the filtering threshold will reduce the HMR at the expense of increasing the SMR, and vice versa. Most users are willing to accept only a very small HMR (i.e. few false positives), but expect the SMR to remain low enough to justify the utility of the filter. The exact tradeoff will generally depend on the user and the type of mail he receives. The HMR and SMR statistics are interesting for practical purposes, since they are easy to interpret and give a good idea of the kind of performance one can expect from a spam filter. They are, however, less suitable for comparing filters, since one of these measures can always be improved at the expense of the other. For this reason, we focus on the 1-AUC statistic when comparing different filters.
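For reference, the area above the ROC curve can be computed directly from raw spamminess scores by pairwise ranking. This is a simplification of the official evaluation, which also reports ROC curves and confidence intervals:

```python
def one_minus_auc(spam_scores, ham_scores):
    """Fraction of (spam, ham) pairs ranked incorrectly; ties count half.
    0.0 means perfect separation, 1.0 means perfectly inverted scores."""
    wins = 0.0
    for s in spam_scores:
        for h in ham_scores:
            if s > h:
                wins += 1.0
            elif s == h:
                wins += 0.5
    return 1.0 - wins / (len(spam_scores) * len(ham_scores))
```

Because the statistic depends only on the ranking of scores, it is invariant to the choice of filtering threshold, which is what makes it suitable for comparing filters.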
6.3 A Comparison Between Compression Algorithms
A comparison of the performance of the PPM and CTW algorithms is given in Table 3. In all tests, an order-8 compression model was used. Classification was performed by minimum information gain, i.e. the models were allowed to adapt to the message being classified. Although the performance of the algorithms is similar, the PPM algorithm usually outperforms CTW in the 1-AUC statistic, with the exception of the mrx dataset. Since the CTW algorithm is more complex and slower than PPM, without yielding any noticeable improvement over it, we focus on the PPM algorithm in later experiments.
10
Table 3: Evaluation of CTW and PPM algorithms for spam filtering. The best results in the 1-AUC statistic are in bold.

             HMR (%)          SMR (%)           1-AUC (%)
Dataset      CTW     PPM      CTW     PPM       CTW      PPM
sa           1.42    1.21     0.74    0.90      0.161    0.132
trec05p      0.37    0.25     0.91    0.93      0.025    0.021
mrx          1.44    1.54     0.54    0.39      0.063    0.069
sb           0.30    0.35     12.77   11.35     0.422    0.372
tm           0.43    0.36     3.34    3.31      0.167    0.155
Table 4: Comparison of order-4, order-6 and order-8 PPM models. The best results in the 1-AUC statistic are in bold.

             HMR (%)                      SMR (%)                       1-AUC (%)
Dataset      order-4  order-6  order-8    order-4  order-6  order-8     order-4  order-6  order-8
sa           1.57     1.21     1.21       0.95     0.85     0.90        0.188    0.148    0.132
trec05p      0.26     0.23     0.25       0.97     0.95     0.93        0.022    0.019    0.021
mrx          1.33     1.52     1.54       0.33     0.34     0.39        0.050    0.069    0.069
sb           0.59     0.16     0.35       11.10    11.74    11.35       0.475    0.285    0.372
tm           0.70     0.36     0.36       3.87     3.43     3.31        0.181    0.135    0.155

6.4 Varying the Model Order
The effect of varying the order of the compression model on classification performance is shown in Table 4. An adaptive PPM model is used in these tests. No particular model is uniformly best; however, it can be seen that an order-6 model usually produces good results. In future research, it would be interesting to develop automatic methods for selecting the best model order or blending models of different orders. On the other hand, varying the model order usually does not have a drastic effect on performance. In data compression, it is known that increasing the order of PPM beyond 5 characters or so usually results in a gradual decrease of compression performance (Teahan, 2000), so the results presented here are not unexpected. In follow-up experiments, we used an order-6 PPM model to evaluate compression models against other spam filters.
6.5 Cross-Entropy vs. Information Gain
In initial experiments on the SpamAssassin corpus, we found that messages often contain many occurrences of longer character sequences that seem indicative of ham/spam to the filter. To limit the effect of such redundant sequences, we decided to allow the models to adapt during classification. The results of these experiments, as well as follow-up experiments on the TREC public corpus, are given in Table 5. The increase in performance, especially in the 1-AUC statistic, prompted our decision to submit only adaptive versions of the algorithms for evaluation. A later analysis revealed that allowing the model to adapt during classification in fact estimates a different classification criterion, as discussed in Section 4. The ROC curves for both classification criteria are depicted in Figure 3. It is interesting to note that the gain in the 1-AUC statistic can mostly be attributed to the fact that classification by MIG performs better at the extreme ends of the curve; performance is comparable when HMR and SMR are balanced. This indicates that the MIG criterion makes fewer gross mistakes and that the scores produced by the method are better calibrated. Performance at the extreme ends of the curve is especially important when the cost of misclassification is unbalanced, as is the case in spam filtering.

Table 5: Comparison of classification by minimum cross-entropy (MCE) and minimum information gain (MIG), using an order-6 PPM model. The best results in the 1-AUC statistic are marked with an asterisk.

                 HMR (%)          SMR (%)          1-AUC (%)
Dataset        MCE     MIG      MCE     MIG       MCE      MIG
sa            1.47    1.21     0.74    0.85      0.350   *0.148
trec05p       0.35    0.23     0.71    0.95      0.038   *0.019

Table 6: Comparison of an order-6 adaptive PPM model and a number of established public-domain spam filters on the SpamAssassin corpus. The best results in the 1-AUC statistic are marked with an asterisk.

Filter          HMR (%)    SMR (%)    1-AUC (%)
BogoFilter         0.19      23.93       0.185
CRM114             1.81       4.03       1.143
DBACL              1.28       1.59       0.261
SpamAssassin       0.96       5.09       0.243
SpamBayes          0.17      30.24       1.437
SpamProbe          0.80       3.13       0.282
PPM-6              1.21       0.85      *0.148
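The difference between the two criteria can be sketched in code. Here a toy add-one-smoothed context model stands in for PPM (the escape mechanism is omitted), and the hypothetical `adapt` flag switches between static scoring (MCE) and scoring while the model updates on the message itself (MIG):

```python
import math
from collections import defaultdict

def make_model(text, order=2):
    """Train a simple add-one-smoothed order-k character model
    (a toy stand-in for the paper's PPM models)."""
    counts = defaultdict(dict)
    for i, ch in enumerate(text):
        ctx = text[max(0, i - order):i]
        counts[ctx][ch] = counts[ctx].get(ch, 0) + 1
    return dict(counts)

def score(model, text, order=2, alphabet=256, adapt=False):
    """Code length of `text` in bits under `model`.  With adapt=False this
    is the minimum cross-entropy (MCE) criterion; with adapt=True the model
    keeps learning from the message as it is scored, which corresponds to
    the minimum information gain (MIG) criterion."""
    counts = {c: dict(d) for c, d in model.items()} if adapt else model
    bits = 0.0
    for i, ch in enumerate(text):
        ctx = text[max(0, i - order):i]
        dist = counts.get(ctx, {})
        n = sum(dist.values())
        bits += -math.log2((dist.get(ch, 0) + 1) / (n + alphabet))
        if adapt:  # update counts with the message's own statistics
            counts.setdefault(ctx, {})[ch] = dist.get(ch, 0) + 1
    return bits

spam = make_model("cheap pills, buy now! " * 20)
ham = make_model("see you at the meeting tomorrow. " * 20)

msg = "buy cheap pills pills pills now"
mce = score(spam, msg) - score(ham, msg)                          # < 0 => spam
mig = score(spam, msg, adapt=True) - score(ham, msg, adapt=True)  # < 0 => spam
```

Under the adaptive score, repeated sequences in the message ("pills pills pills") become cheap after their first occurrence, so redundant evidence counts less.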
6.6 Comparison to Established Spam Filters
Table 6 compares the performance of an order-6 adaptive PPM model with a number of established spam filters, all of which are in the public domain: BogoFilter (http://www.bogofilter.org/), CRM114 (http://crm114.sourceforge.net/), DBACL (http://dbacl.sourceforge.net/), SpamAssassin (http://spamassassin.apache.org/), SpamBayes (http://spambayes.sourceforge.net/) and SpamProbe (http://spamprobe.sourceforge.net/). We ran this comparison on the SpamAssassin corpus only, since the corpus is small enough to make such an evaluation feasible, given the limited computational resources we had at our disposal. The TREC 2005 spam track will yield a more comprehensive comparison of various filtering techniques, once the official results are available. Figure 4 shows the ROC curves of the filters on the SpamAssassin corpus. This comparison shows promise in using compression models for spam filtering, with the adaptive PPM model outperforming all other filters in the 1-AUC statistic. We look forward to the results of the official TREC evaluation in the hope that these results will be confirmed on other corpora.
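For reference, the 1-AUC statistic used throughout the tables is the area above the ROC curve, which can be computed directly from filter scores as the fraction of misranked ham/spam message pairs, without constructing the curve explicitly. The sketch below shows the underlying rank statistic; the official TREC evaluation toolkit computes it differently in detail:

```python
def one_minus_auc(spam_scores, ham_scores):
    """Area above the ROC curve (the 1-AUC statistic), computed as the
    Mann-Whitney misranking rate.  Higher scores mean 'more likely spam'."""
    pairs = len(spam_scores) * len(ham_scores)
    # count ham/spam pairs ranked incorrectly (ties count as half an error)
    bad = sum((s < h) + 0.5 * (s == h)
              for s in spam_scores for h in ham_scores)
    return bad / pairs

# toy example: one ham message outscores one of three spam messages
print(one_minus_auc([0.9, 0.8, 0.4], [0.5, 0.1, 0.2]))
```

The values reported in the tables are this quantity expressed as a percentage.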
Figure 3: ROC curves of adaptive (classification by MIG) and static (classification by MCE) PPM order-6 models on the SpamAssassin (left) and Enron (right) corpora.
Figure 4: ROC curves of various open source spam filters and an adaptive PPM-6 model on the SpamAssassin corpus.
Table 7: Effect of noise on classification performance on the SpamAssassin dataset. An order-6 adaptive PPM model and BogoFilter are compared. The best results in the 1-AUC statistic are marked with an asterisk.

                            HMR (%)          SMR (%)          1-AUC (%)
Dataset             Noise  Bogo   PPM      Bogo    PPM       Bogo      PPM
Headers preserved    0%    0.19   1.23    24.63   0.90      0.202   *0.151
                     5%    0.27   1.40    36.62   0.90      0.388   *0.144
                    10%    0.24   1.64    45.54   0.80      0.571   *0.162
                    20%    0.34   2.05    56.69   1.27      0.959   *0.256
Headers stripped     0%    0.19   1.08    29.14   2.12      0.367   *0.190
                     5%    0.39   1.40    47.29   1.91      0.974   *0.207
                    10%    0.53   2.17    57.38   2.49      2.010   *0.298
                    20%    1.21   3.52    67.57   3.50      5.785   *0.660

6.7 Robustness of Compression-based Filtering
We believe that one of the greatest advantages of statistical data compression models over standard text categorization algorithms is that they require no preprocessing of the data. Classification is not based on word tokens or manually constructed features, which is in itself appealing from an implementation standpoint. For spam filtering, this property is especially welcome, since preprocessing steps are error-prone and are often exploited by spammers in order to evade filtering. A typical strategy is to distort words with common spelling mistakes or character substitutions, which may confuse an automatic filter. We believe that compression models are much more robust to such tactics than methods that require tokenization. To support this claim, we conducted an experiment in which all messages in the SpamAssassin dataset were distorted by substituting characters in the original text with random alphanumeric characters and punctuation. The positions at which these substitutions occurred were also chosen randomly. By varying the probability of distorting each character, we evaluated the effect of noise on classification performance. For this experiment, all messages were first decoded and stripped of non-textual attachments. Noise was added to the subject headers and the message body. An order-6 adaptive PPM model and BogoFilter, the best-performing open source filter from our previous experiments, were evaluated on the resulting dataset. We then stripped the messages of all headers except the subject header and re-ran the experiment. The results in Table 7 show that compression models are very robust to noise. Even after 20% of all characters are distorted, rendering messages practically illegible, the PPM model retains respectable performance.
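The distortion procedure can be sketched as follows (the exact character pool and random number generation used in our experiments may differ):

```python
import random
import string

def add_noise(text, p, seed=None):
    """Independently replace each character, with probability p, by a
    random alphanumeric or punctuation character."""
    rng = random.Random(seed)  # seeded for reproducibility
    pool = string.ascii_letters + string.digits + string.punctuation
    return "".join(rng.choice(pool) if rng.random() < p else ch
                   for ch in text)

print(add_noise("Congratulations, you have won a free prize!", 0.2, seed=42))
```

At p = 0.2 roughly one character in five is replaced, which already makes most words unrecognizable to a tokenizer while leaving plenty of intact character sequences for a compression model to exploit.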
The advantage of compression models on noisy data is particularly pronounced in the second experiment, where the classifier must rely solely on the (distorted) textual contents of the messages. It is interesting to note that the performance of PPM increases slightly at the 5% noise level if headers are kept intact. We have no definitive explanation for this phenomenon. We suspect that introducing noise in the text implicitly increases the influence of the non-textual message headers, and that this effect is beneficial.
7 Conclusion
In comparison to other machine learning methods, compression models offer a number of advantages that are especially relevant for spam filtering. Unlike classical machine learning methods, they treat text simply as a sequence of characters, so tokenization, stemming and other tedious and error-prone preprocessing steps are omitted altogether. It is precisely these preprocessing steps that are often exploited by spammers. Moreover, characteristic sequences of punctuation and other special characters, which are generally thought to be useful in spam filtering, are naturally included in the model. The algorithms are also efficient, with training and classification times linear in the amount of data. The models are incrementally updateable, which is important for practical spam filters that often support online learning based on user feedback. The experimental results presented here confirm that compression models perform well on the spam filtering task. We also show that allowing the model to adapt during classification is beneficial, especially when the cost of misclassification is uneven, as is the case in spam filtering. While the effect on classification accuracy is not large, the AUC statistic is substantially improved in all tests, indicating that the errors produced by the information gain method are less severe. Finally, our experiments show that compression models are very robust to noise, which should make them difficult for spammers to defeat.
Acknowledgment The work presented in this report was supported by the Ministry of Higher Education, Science and Technology of the Republic of Slovenia.
References

Androutsopoulos, I., J. Koutsias, K. V. Chandrinos, and C. D. Spyropoulos (2000). An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In Proceedings of SIGIR-2000, pp. 160–167.

Cleary, J. G. and W. J. Teahan (1997). Unbounded length contexts for PPM. The Computer Journal 40(2/3), 67–75.

Cleary, J. G. and I. H. Witten (1984, April). Data compression using adaptive coding and partial string matching. IEEE Transactions on Communications COM-32(4), 396–402.

Cormack, G. and T. Lynam (2004). A study of supervised spam detection applied to eight months of personal email. Technical report, School of Computer Science, University of Waterloo.

Cormack, G. and T. Lynam (2005). Spam corpus creation for TREC. In Proceedings of the Second Conference on Email and Anti-Spam, CEAS 2005, Stanford University, Palo Alto, CA.

Frank, E., C. Chui, and I. H. Witten (2000). Text categorization using compression models. In Proceedings of DCC-00, IEEE Data Compression Conference, Snowbird, US, pp. 200–209. IEEE Computer Society Press, Los Alamitos, US.

Goodman, J. (2003, November). Spam: Technologies and policies. Technical report, Microsoft Research, Redmond, US.
Graham, P. (2002). A plan for spam (http://www.paulgraham.com/spam.html).

Howard, P. (1993). The Design and Analysis of Efficient Lossless Data Compression Systems. Ph.D. thesis, Brown University, Providence, Rhode Island.

Peng, F., D. Schuurmans, and S. Wang (2004). Augmenting naive Bayes classifiers with statistical language models. Information Retrieval 7(3-4), 317–345.

Rigoutsos, I. and T. Huynh (2004, July). Chung-Kwei: A pattern-discovery-based system for the automatic identification of unsolicited e-mail messages (spam). In Proceedings of the First Conference on Email and Anti-Spam.

Rissanen, J. (1978). Modeling by shortest data description. Automatica 14, 465–471.

Sahami, M., S. Dumais, D. Heckerman, and E. Horvitz (1998, July). A Bayesian approach to filtering junk e-mail. In AAAI'98 Workshop on Learning for Text Categorization, Madison, Wisconsin, US.

Sayood, K. (2000). Introduction to Data Compression, Chapter 6.2.4. Elsevier.

Shannon, C. E. and W. Weaver (1949). The Mathematical Theory of Communication. Urbana, Illinois: University of Illinois Press.

Teahan, W. J. (2000). Text classification and segmentation using minimum cross-entropy. In Proceedings of RIAO-00, 6th International Conference "Recherche d'Information Assistée par Ordinateur", Paris, FR.

Tjalkens, T. J., Y. M. Shtarkov, and F. M. J. Willems (1993). Context tree weighting: Multi-alphabet sources. In Proceedings of the 14th Symposium on Information Theory in the Benelux, pp. 128–135.

Willems, F. M. J., Y. M. Shtarkov, and T. J. Tjalkens (1995). The context-tree weighting method: Basic properties. IEEE Transactions on Information Theory, 653–664.