Bayesian Spam Filtering Based on Co-weighting Multi-estimations

SHRESTHA Raju, LIN Ya-ping, CHEN Zhi-ping
(College of Computer and Communication, Hunan University, Changsha 410082, China)

Abstract: Statistical filters based on Bayesian classification are by far the most popular and widely used type of spam email filter because of their efficient training, quick classification, easy extensibility and adaptive learning. However, several ways of computing the probability estimations have been proposed and used. In this paper we examine those estimations and propose a new, more effective estimation based on co-weighting multi-estimations. The approach also takes into consideration the different parts of emails and defines effective tokenizer rules. The new approach is compared with earlier well-known approaches, and the experimental results on three public corpora show improvement, stability, robustness and consistency in spam filtering with the proposed estimation.

Keywords: spam filter; email classification; Bayesian technique; co-weighting
1. Introduction
Email is the fastest and most economical form of communication, so it is commonly used to send “unsolicited commercial email”, also known as junk or spam email. Spam not only wastes time, but also wastes bandwidth and server space, and some of its content is even harmful to some recipients. As the volume of such junk mail has grown enormously in the past few years, the need to stop it has been recognized; many anti-spam filtering techniques have been proposed and some of them are already in use.
Sahami[1] first employed the Naive Bayes algorithm to classify messages as spam or legitimate, using the mutual information measure as feature selector. In a series of papers, Androutsopoulos[2~4] extended the Naive Bayes filter by investigating the effect of different numbers of features and training-set sizes on the filter's performance. The accuracy of the Naive Bayes filter was shown to greatly outperform a typical keyword-based filter[3]. Paul Graham[5, 6] defined various tokenization rules, treating tokens in different parts of emails such as To, From, Subject and Return-path separately, and computed token probabilities and the combined spam probability based on Bayes rule, but in a different way. Gary Robinson[7] suggested enhancements to Graham's approach by proposing a Bayesian approach to handling rare words, and in [8] further suggested using Fisher's inverse chi-square function for combining probability estimations.
All previous algorithms use either token counts (token frequencies) or spam and legitimate email counts (document frequencies) in probability computations. In this paper we present a new approach which, on top of all those evolutions and enhancements, takes into account both document and token frequencies in probability computations and appropriately co-weights them to give resultant probability estimates that lead to improved, more stable and more consistent results in spam filtering. Furthermore, it differentiates tokens occurring in different email areas by proper tagging. The results are compared with three popular variants of Bayesian filtering algorithms.
The rest of the paper is organized as follows. In Section 2, we briefly review popular evolutions and variants of statistical spam filtering algorithms. In Section 3, we present the main idea of this paper along with the preprocessing and feature extraction techniques used. Section 4 describes the experiments and analysis done with our approach and compares the different algorithms; it also introduces the corpora and performance measures used. Finally, Section 5 presents the conclusion of the paper and possible future work.
2. Statistical Bayesian Filtering Algorithms
Here we introduce three popular statistical Bayesian filtering algorithms, including the original Naive Bayes.
2.1 Naive Bayes (NB) Algorithm
The Naive Bayes algorithm is a simplified version of Bayes theorem with the assumption of feature independence. It computes the probability of a Class ∈ {Spam, Legt} given an Email as:
$$P(\mathrm{Class} \mid \mathrm{Email}) \propto P(\mathrm{Class})\, P(\mathrm{Email} \mid \mathrm{Class}) = P(\mathrm{Class}) \prod_i P(t_i \mid \mathrm{Class}) \qquad (2.1.1)$$
where P(Class) is the prior probability of the Class and P(Email | Class) is the conditional probability of the Email given the Class, calculated as $\prod_i P(t_i \mid \mathrm{Class})$, the product of the conditional probabilities of all valid tokens $t_i$ corresponding to the Class. The conditional probability of a token t given a Class, i.e. P(t | Class), is calculated using the formula[9]:
$$P(t \mid \mathrm{Class}) = \frac{1 + \mathrm{Count}(t, \mathrm{Class})}{\mathrm{Count}(\mathrm{AllTokens}, \mathrm{Class}) + |V|} \qquad (2.1.2)$$
where Count(t, Class) is the number of occurrences of the token t in the Class, Count(AllTokens, Class) is the total number of occurrences of all tokens in the Class and |V| is the size of the vocabulary, the set of distinct words built from all spam and legitimate emails during training. A new unknown token (i.e. one that never occurred during training) is either ignored or assigned a constant probability such as 1/Count(AllTokens, Class)*|V|. The filter then classifies an email as Spam or Legitimate according to whether P(Spam|Email) is greater than P(Legt|Email) or not. Although feature independence is a poor assumption, many implementations have shown that the algorithm is fairly robust and powerful in filtering spam and outperforms many other knowledge-based filters[4, 10].
2.2 Paul Graham's (PG) Algorithm
Paul Graham[5] calculated the probability estimates of a given email being spam or legitimate, given that a token appears in it, using the formulas:
$$p(t) = P(\mathrm{Spam} \mid t) = \frac{\mathit{tbad}/\mathit{nbad}}{\mathit{tbad}/\mathit{nbad} + 2 \cdot \mathit{tgood}/\mathit{ngood}} \qquad (2.2.1)$$
and
$$P(\mathrm{Legt} \mid t) = 1 - P(\mathrm{Spam} \mid t) \qquad (2.2.2)$$
where tbad and tgood are the number of times the token t occurred in all the spam and legitimate emails respectively, and nbad and ngood are the number of spam and legitimate emails respectively. tgood is multiplied by 2 to bias the estimate towards legitimate. Token probability estimates for spam and legitimate are calculated for all the tokens present in the email, and the combined probability for spam is obtained using a Bayesian approach, which assumes that these probabilities are independent of each other, using the formula:
$$P(\mathrm{Spam} \mid \mathrm{Email}) = \frac{\prod_i P(\mathrm{Spam} \mid t_i)}{\prod_i P(\mathrm{Spam} \mid t_i) + \prod_i P(\mathrm{Legt} \mid t_i)} \qquad (2.2.3)$$
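As a rough illustration, equations (2.2.1) to (2.2.3) can be sketched in Python as follows. The function names and the example counts are hypothetical and chosen only for illustration; the sketch assumes every token was seen during training and does not cover any selection of tokens.

```python
import math


def graham_token_prob(tbad, tgood, nbad, ngood):
    """Token spam probability p(t) as in (2.2.1).

    Assumes the token was seen at least once in training, so that the
    denominator is non-zero.
    """
    b = tbad / nbad          # occurrences in spam per spam email
    g = 2 * tgood / ngood    # legitimate side, doubled to bias towards legitimate
    if b + g == 0:
        raise ValueError("token never seen in training")
    return b / (b + g)


def graham_combine(token_probs):
    """Combined spam probability of an email as in (2.2.3)."""
    spam = math.prod(token_probs)
    legt = math.prod(1.0 - p for p in token_probs)   # P(Legt | t) from (2.2.2)
    return spam / (spam + legt)


# Hypothetical counts: 100 spam and 100 legitimate training emails.
p_offer = graham_token_prob(tbad=40, tgood=2, nbad=100, ngood=100)
p_hello = graham_token_prob(tbad=10, tgood=5, nbad=100, ngood=100)
print(p_offer, p_hello, graham_combine([p_offer, p_hello]))
```

A token that is equally frequent per email in both classes does not come out neutral: because of the factor 2, it needs twice the per-email frequency in spam to reach 0.5, as the second example token does.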
Paul used only the 15 most interesting tokens, measured by how far their spam probability is from a neutral 0.5, in computing the combined probability for spam. He ignored tokens whose total number of occurrences was less than 5. Moreover, Paul assigned a probability of 0.4 (obtained by trial and error) to an unknown token and 0.99 to tokens that occurred in one class but not in the other. A test email is then classified as spam if the combined probability is more than a defined threshold value of 0.9. In his tests, Paul reported promising results: fewer than 5 missed spams per 1000, with 0 false positives.
2.3 Gary Robinson's (GR) Algorithm
Gary[7] pointed out several drawbacks of Paul's algorithm and suggested improvements. He argued that Paul's assignment of 0.4 to an unknown token and 0.99 to tokens that occurred in one class but not in the other, without any treatment of low- and high-data situations, is inconsistent and unsmooth. He proposed a consistent and smooth way of dealing with rare words, using the following formula, obtained from a Bayesian approach, to compute the token probability guesstimate, termed the degree of belief:
$$f(t) = \frac{s \cdot x + n \cdot p(t)}{s + n} \qquad (2.3.1)$$
where p(t) is Paul's probability estimation (2.2.1) for the token t, s is the strength we want to give to our background information, x is our assumed probability for an unknown token and n is the number of emails we have received that contain the token t. The values of s and x are obtained through testing to optimize performance, with reasonable starting points of 1 and 0.5 for s and x respectively. In [8], Gary further suggested using Fisher's inverse chi-square function to compute combined probabilities using the formulas:
$$H = C^{-1}\left(-2 \ln \prod_i f(t_i),\; 2n\right) \qquad (2.3.2)$$
and
$$S = C^{-1}\left(-2 \ln \prod_i \bigl(1 - f(t_i)\bigr),\; 2n\right) \qquad (2.3.3)$$
H and S are combined probabilities that allow rejecting the null hypotheses and assuming instead the alternative hypotheses that the email is a ham and a spam respectively. $C^{-1}()$ is Fisher's inverse chi-square function, used with 2n degrees of freedom, where n here is the number of tokens being combined (not the email count n of (2.3.1)). The combined indicator of spamminess or hamminess for the email as a whole is then obtained using the formula:
$$I = \frac{1 + H - S}{2} \qquad (2.3.4)$$
The email is classified as spam if the indicator value I is above some threshold value, otherwise as legitimate. Emails with I values near 0.5 can also be classified as uncertain. Gary's modifications have been found to improve the performance of the filter significantly.
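The degree of belief (2.3.1) and the combination (2.3.2) to (2.3.4) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the sketch assumes that $C^{-1}(\chi, 2n)$ denotes the upper-tail probability of a chi-square distribution with 2n degrees of freedom, for which a closed-form series exists when the degrees of freedom are even. Logarithms are summed instead of multiplying the f values, to avoid underflow of the product.

```python
import math


def degree_of_belief(p, n, s=1.0, x=0.5):
    """Robinson's f(t) as in (2.3.1); s and x default to the starting points 1 and 0.5."""
    return (s * x + n * p) / (s + n)


def chi2_survival(chi, df):
    """P(X >= chi) for a chi-square variable X with an even number df of degrees of freedom."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    m = chi / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, df // 2):
        term *= m / i        # builds exp(-m) * m**i / i! incrementally
        total += term
    return min(total, 1.0)


def fisher_indicator(fs):
    """Combined indicator I as in (2.3.4) for token estimates fs, each strictly in (0, 1)."""
    n = len(fs)
    h = chi2_survival(-2.0 * sum(math.log(f) for f in fs), 2 * n)        # (2.3.2)
    s = chi2_survival(-2.0 * sum(math.log(1.0 - f) for f in fs), 2 * n)  # (2.3.3)
    return (1.0 + h - s) / 2.0


print(degree_of_belief(p=0.99, n=0))          # no data: falls back to x
print(fisher_indicator([0.99, 0.98, 0.95]))   # strongly spammy tokens
print(fisher_indicator([0.5, 0.5, 0.5]))      # neutral tokens
```

With no observations (n = 0) the estimate equals the assumed probability x, and it approaches p(t) as n grows, which is the smooth treatment of rare words described above.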
3. Bayesian Filtering Based on Co-weighting Multi-estimations
In this section, we present our new approach of co-weighting multi-estimations of probabilities, along with the preprocessing and feature extraction techniques.
3.1 Preprocessing
Due to the prevalence of headers, HTML and binary attachments in modern emails, pre-processing of email messages is required to allow effective feature extraction. We use the following steps:
The whole email structure is divided into 4 areas: (1.) Normal header, comprising “From”, “Reply to”, “Return-path”, “To” and “Cc”; (2.) Subject header; (3.) Body; and (4.) HTML tags, comprising 3 tags , and .
All other headers and HTML tags (except those mentioned above) are ignored and hence removed from the email text.
All binary attachments, if any, are ignored and removed.
The remaining text is then used for feature extraction.
3.2 Feature extraction and Tokenization
Our approach considers tokens as the sole features for spam filtering. We define the tokenizer rules as follows: All terms consisting of alphanumeric characters, dash (-), underscore (_), apostrophe (’), exclamation mark (!), asterisk (*) and currency signs ($£€¥) are considered valid tokens, and tokens are case-sensitive. IP addresses, domain names and money values (numbers separated by commas and/or with currency symbols) are considered valid tokens. Pure numbers are ignored. A domain name is broken into sub-terms (e.g. www.hnu.net is broken into www.hnu.net, www.hnu, hnu.net, www, hnu and net) and the sub-terms are also considered valid tokens. Contents within , and html tags are tokenized as usual, except that tag names and attribute names are ignored. One of the spammers' newest tricks, non-HTML text interspersed with HTML tags, like “You can bium here!”, is handled by recovering the text as “You can buy valium here” and tokenizing it normally. Tokens occurring in the four email areas are appropriately tagged with area marks. Stemming and stop lists are not used, as performance is better without them.
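The domain-name rule and the area tagging can be illustrated with a short Python sketch. The function names are hypothetical, and the `area:token` tag format is an assumption made for illustration, since the form of the area marks is not specified here.

```python
def domain_subterms(domain):
    """Return the domain and all contiguous runs of its dot-separated labels."""
    labels = domain.split(".")
    count = len(labels)
    subterms = []
    for length in range(count, 0, -1):           # longest runs first
        for start in range(count - length + 1):
            subterms.append(".".join(labels[start:start + length]))
    return subterms


def tag_token(area, token):
    """Prefix a token with its email area (hypothetical 'area:token' format)."""
    return f"{area}:{token}"


print(domain_subterms("www.hnu.net"))
print([tag_token("subject", t) for t in ("FREE!", "$100")])
```

For www.hnu.net the first call yields the six sub-terms listed in the rules above. Tagging means that the same word is counted separately when it occurs, for example, in the subject header and in the body.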
3.3 Main Idea and Algorithm Description
The main idea of our approach is that both document and token frequencies play a vital role in determining whether a random email is spam or not, and hence both need to be considered in the filtering system. So rather than computing probability estimations based merely on either document or token frequencies, we compute probabilities based on both document and token frequencies, as well as on the average number of token occurrences in spam and legitimate emails. The combined, integrated estimate is then obtained by co-weighting them with normalized weight values.
Let nos(t) and nol(t) be the number of occurrences of the token t in spam and legitimate emails respectively, Ts and Tl the total number of occurrences of all tokens in spam and legitimate emails, ns(t) and nl(t) the number of spam and legitimate emails containing the token, and Ns and Nl the number of spam and legitimate emails respectively. Then for each token t in the given email, three individual probability estimates are calculated as follows:
(1.) Estimation based on document frequencies:
$$p_1(t) = \frac{\mathit{ns}(t)/\mathit{Ns}}{\mathit{ns}(t)/\mathit{Ns} + \mathrm{BIAS\_FACTOR} \cdot \mathit{nl}(t)/\mathit{Nl}} \qquad (3.3.1)$$
(2.) Estimation based on token frequencies:
$$p_2(t) = \frac{\mathit{nos}(t)/\mathit{Ts}}{\mathit{nos}(t)/\mathit{Ts} + \mathrm{BIAS\_FACTOR} \cdot \mathit{nol}(t)/\mathit{Tl}} \qquad (3.3.2)$$
(3.) Estimation based on the average number of occurrences of a token in spam and legitimate emails:
$$p_3(t) = \frac{\mathit{nos}(t)/\mathit{Ns}}{\mathit{nos}(t)/\mathit{Ns} + \mathrm{BIAS\_FACTOR} \cdot \mathit{nol}(t)/\mathit{Nl}} \qquad (3.3.3)$$
As suggested by Paul, BIAS_FACTOR is used to bias slightly towards legitimate emails. These three probability estimations correspond to Paul's probability estimation (2.2.1). Next, for each of these, the degree-of-belief estimations $f_1(t)$, $f_2(t)$ and $f_3(t)$ are computed using Gary's formula (2.3.1), replacing p(t) with $p_1(t)$, $p_2(t)$ and $p_3(t)$ respectively, and using n = ns(t) + nl(t) for $f_1(t)$ and n = nos(t) + nol(t) for $f_2(t)$ and $f_3(t)$. As in Gary's algorithm, s and x are the belief factor and the assumed probability for an unknown token, whose values are determined while tuning the filter for optimal performance. The combined probability estimation is then calculated by co-weighting the three individual estimations with the CoOperative Training (COT) approach:
$$f(t) = \omega_1 f_1(t) + \omega_2 f_2(t) + \omega_3 f_3(t), \quad \text{where } \omega_1 + \omega_2 + \omega_3 = 1 \qquad (3.3.4)$$
The normalized weight values indicate the importance of the three individual estimations. Since f(t) reflects the effect of both document and token frequencies as well as the average token occurrences in spam and legitimate emails, it gives a more realistic and better probability estimation.
Now, rather than considering a fixed number of interesting tokens as in Paul's algorithm, which is unrealistic and unreasonable, we consider as interesting all tokens whose f(t) value lies at least a certain offset PROB_OFFSET above or below the neutral 0.5. If this gives fewer than a certain fixed number MIN_INTTOKENS of tokens, the range is extended to obtain that number of interesting tokens. The f(t) values of those interesting tokens are then used to obtain the combined indicator I of spamminess and hamminess using formula (2.3.4), whereby H and S are calculated using Fisher's inverse chi-square functions (2.3.2) and (2.3.3) respectively.
Finally, the test email is classified as spam if I is greater than a certain threshold value, SPAM_THRESHOLD, and otherwise as legitimate.
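The per-token computation (3.3.1) to (3.3.4) and the selection of interesting tokens can be sketched as follows. This is an illustrative reading of the description above, not the authors' Java implementation: the class and function names are hypothetical, the default parameter values are those reported in Section 4.3, and two details are assumptions, namely that an estimate with a zero denominator is set to 0 (it is then multiplied by n = 0 in the degree of belief) and that "extending the range" means taking the MIN_INTTOKENS tokens farthest from 0.5. The final combination via (2.3.2) to (2.3.4) is omitted.

```python
from dataclasses import dataclass


@dataclass
class CorpusTotals:
    Ns: int   # number of spam emails
    Nl: int   # number of legitimate emails
    Ts: int   # total token occurrences in spam emails
    Tl: int   # total token occurrences in legitimate emails


@dataclass
class TokenCounts:
    ns: int = 0    # spam emails containing the token
    nl: int = 0    # legitimate emails containing the token
    nos: int = 0   # occurrences of the token in spam emails
    nol: int = 0   # occurrences of the token in legitimate emails


def biased_ratio(spam_rate, legt_rate, bias):
    """Common form of (3.3.1)-(3.3.3); returns 0.0 for a never-seen token (assumption)."""
    denom = spam_rate + bias * legt_rate
    return spam_rate / denom if denom > 0 else 0.0


def degree_of_belief(p, n, s, x):
    """Formula (2.3.1)."""
    return (s * x + n * p) / (s + n)


def coweighted_estimate(c, tot, weights=(0.05, 0.70, 0.25), bias=1.30, s=0.65, x=0.35):
    """Combined estimate f(t) of (3.3.4) for one token."""
    p1 = biased_ratio(c.ns / tot.Ns, c.nl / tot.Nl, bias)      # (3.3.1)
    p2 = biased_ratio(c.nos / tot.Ts, c.nol / tot.Tl, bias)    # (3.3.2)
    p3 = biased_ratio(c.nos / tot.Ns, c.nol / tot.Nl, bias)    # (3.3.3)
    f1 = degree_of_belief(p1, c.ns + c.nl, s, x)
    f2 = degree_of_belief(p2, c.nos + c.nol, s, x)
    f3 = degree_of_belief(p3, c.nos + c.nol, s, x)
    w1, w2, w3 = weights
    return w1 * f1 + w2 * f2 + w3 * f3


def interesting(fs, offset=0.27, min_tokens=15):
    """Keep f values at least `offset` away from 0.5, or the `min_tokens` farthest ones."""
    ranked = sorted(fs, key=lambda f: abs(f - 0.5), reverse=True)
    chosen = [f for f in ranked if abs(f - 0.5) >= offset]
    if len(chosen) < min_tokens:
        chosen = ranked[:min_tokens]
    return chosen


totals = CorpusTotals(Ns=100, Nl=100, Ts=1000, Tl=1000)      # hypothetical corpus
print(coweighted_estimate(TokenCounts(ns=10, nos=20), totals))
print(coweighted_estimate(TokenCounts(), totals))            # unknown token
```

For an unknown token all three degrees of belief reduce to x, so the co-weighted estimate is x as well, whatever the weights.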
4. Experiments and Analysis
First we introduce the corpora with which we performed the experiments, and the performance measures used for evaluating and comparing the algorithms.
4.1 Corpora Collection
In this paper three publicly available corpora have been used, to allow comparisons with other published work and to ensure that performance is tested across a varied range of email. They are:
(1.) Ling Spam corpus, which was made available by Ion Androutsopoulos[4] and has been used in a considerable number of publications. It is composed of 481 spam and 2,412 legitimate emails. All messages were taken from a linguist mailing list.
(2.) Spam Assassin corpus, used to optimize the open source SpamAssassin filter. It contains 1,897 spam (spam & spam-2) and 4,150 legitimate (easy ham, easy ham-2 & hard ham) emails. Constructed from the maintainers' personal email, it is probably the most representative publicly available corpus of a typical user's mail.
(3.) Annexia/Xpert corpus, a synthesis of 10,025 spam emails from the Annexia spam archives and 22,813 legitimate emails from the X-Free project's Xpert mailing list. We randomly picked 7,500 spam and an equal number of legitimate emails from this corpus.
From each corpus, training and test datasets are prepared by randomly picking two-thirds of the total corpus data as the training dataset and the remaining one-third as the test dataset.
4.2 Performance Measures
Let $N_{\mathrm{Spam}}$ and $N_{\mathrm{Legt}}$ be the total number of spam and legitimate email messages to be classified by the filter respectively, and $N_{X \to Y}$ the number of messages belonging to class X that the filter classified as belonging to class Y (X, Y ∈ {Spam, Legt}). Then the six performance measures used in this paper are calculated as shown below, in three categories:
(1.) Weighted Accuracy (WAcc) and Weighted Error (WErr) rates: These measure the percentages of correctly and incorrectly classified messages. Assuming that Legt→Spam is λ times more costly than Spam→Legt (we have used λ = 99 in our experiments and analysis), the weighted accuracy and error rates are calculated as follows:
$$\mathrm{WAcc} = \frac{N_{\mathrm{Spam} \to \mathrm{Spam}} + \lambda N_{\mathrm{Legt} \to \mathrm{Legt}}}{N_{\mathrm{Spam}} + \lambda N_{\mathrm{Legt}}} \qquad (4.2.1)$$
$$\mathrm{WErr} = 1 - \mathrm{WAcc} = \frac{N_{\mathrm{Spam} \to \mathrm{Legt}} + \lambda N_{\mathrm{Legt} \to \mathrm{Spam}}}{N_{\mathrm{Spam}} + \lambda N_{\mathrm{Legt}}} \qquad (4.2.2)$$
(2.) Spam Recall (SR) and Spam Precision (SP) rates: Spam recall measures the percentage of spam messages that the filter manages to block (intuitively, its effectiveness), while spam precision measures the degree to which the blocked messages are indeed spam (the filter's safety). They are calculated using the formulas:
$$\mathrm{SR} = \frac{N_{\mathrm{Spam} \to \mathrm{Spam}}}{N_{\mathrm{Spam}}} \qquad (4.2.3)$$
$$\mathrm{SP} = \frac{N_{\mathrm{Spam} \to \mathrm{Spam}}}{N_{\mathrm{Spam} \to \mathrm{Spam}} + N_{\mathrm{Legt} \to \mathrm{Spam}}} \qquad (4.2.4)$$
(3.) False Positive Rate (FPR) and False Negative Rate (FNR): The false positive rate measures the percentage of legitimate messages incorrectly marked as spam, and the false negative rate measures the percentage of spam messages incorrectly marked as legitimate. These measures are calculated as follows:
$$\mathrm{FPR} = \frac{N_{\mathrm{Legt} \to \mathrm{Spam}}}{N_{\mathrm{Legt}}} \qquad (4.2.5)$$
$$\mathrm{FNR} = \frac{N_{\mathrm{Spam} \to \mathrm{Legt}}}{N_{\mathrm{Spam}}} \qquad (4.2.6)$$
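The six measures (4.2.1) to (4.2.6) follow directly from the four confusion counts, as the sketch below shows. The function name and the example counts are hypothetical: the counts describe an assumed test set of 160 spam and 804 legitimate messages in which 28 spam messages are missed and no legitimate message is blocked.

```python
def filter_measures(spam_to_spam, spam_to_legt, legt_to_legt, legt_to_spam, lam=99):
    """Return WAcc, WErr, SR, SP, FPR and FNR for the given confusion counts."""
    n_spam = spam_to_spam + spam_to_legt
    n_legt = legt_to_legt + legt_to_spam
    weighted_total = n_spam + lam * n_legt
    blocked = spam_to_spam + legt_to_spam        # everything classified as spam
    return {
        "WAcc": (spam_to_spam + lam * legt_to_legt) / weighted_total,   # (4.2.1)
        "WErr": (spam_to_legt + lam * legt_to_spam) / weighted_total,   # (4.2.2)
        "SR": spam_to_spam / n_spam,                                    # (4.2.3)
        # SP is undefined when nothing is blocked; NaN marks that case.
        "SP": spam_to_spam / blocked if blocked else float("nan"),      # (4.2.4)
        "FPR": legt_to_spam / n_legt,                                   # (4.2.5)
        "FNR": spam_to_legt / n_spam,                                   # (4.2.6)
    }


for name, value in filter_measures(132, 28, 804, 0).items():
    print(f"{name}: {value:.5f}")
```

With λ = 99 each legitimate message weighs as much as 99 spam messages, so WAcc stays close to 1 even when a noticeable share of spam is missed, while a single false positive costs as much as 99 false negatives.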
4.3 Experiments and Analysis
We have performed experiments on the filter application we developed in Java during our research work. All experiments are carried out five times for all three datasets by randomly picking training and test datasets as described in Sect. 4.1, and the average results are reported. Our experiments consist of two parts:
(1.) First, we train and test all three algorithms (Naive Bayes, Paul Graham's and Gary Robinson's) with all three corpus datasets, using the same preprocessing and tokenizer rules as described in Sects. 3.1 and 3.2. For PG's algorithm, we use the same parameters used by Paul. With GR's algorithm we use the same parameter values for SPAM_THRESHOLD, MIN_INTTOKENS, PROB_OFFSET, BIAS_FACTOR, s and x as obtained by exhaustive tests and adjustments in the second experiment. The test results obtained from the experiment are given in Table 1 below, and an accuracy-based comparative chart is shown in Fig.1.

Table 1. Test Results for NB, PG and GR Algorithms

Algorithm              Measure  Ling Spam  Spam Assassin  Annexia/Xpert  Average
Naive Bayes (NB)       WAcc     0.99863    0.99051        0.99918        0.99610
                       WErr     0.00137    0.00949        0.00082        0.00390
                       SR       0.93750    0.96990        0.99680        0.96807
                       SP       0.99340    0.97920        0.99920        0.99060
                       FPR      0.00120    0.00940        0.00080        0.00380
                       FNR      0.06250    0.03010        0.00320        0.03193
Paul Graham's (PG)     WAcc     0.99964    0.99382        0.99907        0.99751
                       WErr     0.00036    0.00618        0.00093        0.00249
                       SR       0.81880    0.90820        0.98640        0.90447
                       SP       1.00000    0.98630        0.99920        0.99517
                       FPR      0.00000    0.00580        0.00080        0.00220
                       FNR      0.18120    0.09180        0.01360        0.09553
Gary Robinson's (GR)   WAcc     0.99976    0.99584        0.99966        0.99842
                       WErr     0.00024    0.00416        0.00034        0.00158
                       SR       0.88120    0.87820        0.96640        0.90860
                       SP       1.00000    0.99110        1.00000        0.99703
                       FPR      0.00000    0.00360        0.00000        0.00120
                       FNR      0.11880    0.12180        0.03360        0.09140

[Figure: accuracy (y-axis) per dataset (x-axis: Ling Spam, Spam Assassin, Annexia/Xpert) for the Naive Bayes, Paul Graham's and Gary Robinson's algorithms]
Fig.1 Classification Accuracies with NB, PG and GR Algorithms

From the experiment we have observed that Paul's algorithm gives better results than Naive Bayes and that Gary Robinson's algorithm further improves on Paul's, which justifies basing our approach on Gary Robinson's modification.
(2.) In the second part of the experiment, performances are measured with the three separate individual probability estimations described in Section 3.3 and with our CoOperative Training (COT) approach of combined, integrated co-weighted multi-estimations. An exhaustive search of parameter combinations was also carried out to identify the lowest false positive rates that could be obtained for each corpus, and we observed that it is possible to achieve near zero false positives if the classifier is well tuned to the data, but at the cost of reduced accuracy. Moreover, varying the parameter values changes the performance widely, and sometimes differently for different corpora, because the type and nature of the contents of the three corpora are quite different. So we searched for a compromise parameter combination that gives reduced false positives, high accuracy and stable, consistent results for all three corpora datasets, and we found that the values 0.9, 15, 0.27, 1.30, 0.65 and 0.35 for SPAM_THRESHOLD, MIN_INTTOKENS, PROB_OFFSET, BIAS_FACTOR, s and x respectively give such a result. These values also result in consistent performance with all three individual estimations and so are used here. Next, we aimed to find the best weighting modulus for the multi-estimations, tested the filter with all possible combinations and analyzed the results. From this exhaustive test and analysis, we arrived at the weighting modulus (ω1=0.05, ω2=0.7, ω3=0.25), which gives consistent, stable and optimal performance. The choice of the weights for the three computations reflects their importance or effect in the combined estimation, and the values represent well their individual behaviors as observed in the results of the experiment. The results of the experiment are given in Table 2 and shown graphically in Fig.2.

Table 2. Test Results for Co-weighting Multi-estimations and Individual Estimations

Estimation                        Measure             Ling Spam  Spam Assassin  Annexia/Xpert  Average
                                  No. of Spams        160        632            2500
                                  No. of Legitimates  804        1383           2500
Estimation Based on               WAcc                0.99964    0.99368        0.99954        0.99762
Document Frequencies              WErr                0.00036    0.00632        0.00046        0.00238
                                  SR                  0.81880    0.87820        0.95440        0.88380
                                  SP                  1.00000    0.98580        1.00000        0.99527
                                  FPR                 0.00000    0.00580        0.00000        0.00193
                                  FNR                 0.18120    0.12180        0.04560        0.11620
Estimation Based on               WAcc                0.99955    0.99855        0.99956        0.99922
Token Frequencies                 WErr                0.00045    0.00145        0.00044        0.00078
                                  SR                  0.77500    0.84020        0.95640        0.85720
                                  SP                  1.00000    0.99810        1.00000        0.99937
                                  FPR                 0.00000    0.00070        0.00000        0.00023
                                  FNR                 0.22500    0.15980        0.04360        0.14280
Estimation Based on               WAcc                0.99976    0.99584        0.99966        0.99842
Avg. Token Occurrences            WErr                0.00024    0.00416        0.00034        0.00158
                                  SR                  0.88120    0.87820        0.96640        0.90860
                                  SP                  1.00000    0.99110        1.00000        0.99703
                                  FPR                 0.00000    0.00360        0.00000        0.00120
                                  FNR                 0.11880    0.12180        0.03360        0.09140
Estimation with Co-weighting      WAcc                0.99965    0.99863        0.99960        0.99929
Multi-estimations                 WErr                0.00035    0.00137        0.00040        0.00071
                                  SR                  0.82500    0.85760        0.95960        0.88073
                                  SP                  1.00000    0.99820        1.00000        0.99940
                                  FPR                 0.00000    0.00070        0.00000        0.00023
                                  FNR                 0.17500    0.14240        0.04040        0.11927

[Figure: accuracy (y-axis) per dataset (x-axis: Ling Spam, Spam Assassin, Annexia/Xpert) for the estimations based on document frequencies, token frequencies and average token occurrences, and for the co-weighted multi-estimations]
Fig.2 Classification Accuracies with Co-weighting Multi-estimations and Individual Estimations
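The search over weighting moduli described above can be sketched as an enumeration of normalized weight triples. The grid step of 0.05 and the selection criterion (highest mean accuracy across corpora, with a smaller spread preferred on ties) are assumptions made for illustration; the granularity and the exact criterion of the search are not stated here. `evaluate` is a hypothetical caller-supplied function that trains and tests the filter for a given weight triple and returns one accuracy per corpus.

```python
from statistics import mean


def weight_grid(steps=20):
    """All (w1, w2, w3) with w_i = k_i / steps, k_i >= 0 and w1 + w2 + w3 = 1."""
    return [(i / steps, j / steps, (steps - i - j) / steps)
            for i in range(steps + 1)
            for j in range(steps + 1 - i)]


def best_weights(evaluate, steps=20):
    """Pick the grid point with the highest mean accuracy, preferring a smaller spread on ties."""
    def score(w):
        accs = evaluate(w)
        return (mean(accs), -(max(accs) - min(accs)))
    return max(weight_grid(steps), key=score)


def toy_evaluate(w):
    """Stand-in for a real train/test run: peaks at (0.05, 0.7, 0.25) by construction."""
    acc = 1.0 - abs(w[0] - 0.05) - abs(w[1] - 0.7) - abs(w[2] - 0.25)
    return [acc, acc, acc]


print(len(weight_grid()))
print(best_weights(toy_evaluate))
```

Enumerating integer numerators and dividing once keeps every triple exactly on the grid, which avoids the drift that repeated floating-point addition of the step would introduce.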
The performance of the filter based on document frequencies is average with the Ling Spam and Annexia/Xpert corpora but drops with Spam Assassin; the filter based on token frequencies performs relatively consistently with all three corpora, however with relatively lower accuracy. The wide fluctuations in performance for Spam Assassin result from a few false positives, which are due to the inability of the individual estimations to consistently handle the hard ham emails (containing unusual HTML markup, colored text and spammish-sounding phrases) in the corpus data. The new approach handles all cases consistently, combining the positive effects of the three individual estimations. This results in high accuracy, high spam precision and low false positives, on average, on all three datasets. As the filter is tuned for high accuracy and minimal false positives, it yields slightly lower spam recall and a slightly higher false negative rate, which is acceptable since false positives are generally considered far worse than false negatives. More importantly, the new approach of integrated co-weighted multi-estimations exhibits more stable and consistent behavior with all three corpora.
5. Conclusion
In this paper we have presented a new approach to statistical Bayesian filtering based on co-weighting multi-estimations. The new algorithm correlates three probability estimations based on token and document frequencies by co-weighting them linearly, and sets the normalized weighting coefficients by analyzing the experiments on individual and combined estimations. Experimental results on all three public corpus datasets showed that the new algorithm performs better in terms of stability and consistency with heterogeneous datasets than the single individual estimations, and also gives improved results on average. We have used our own definition of preprocessing and tokenizer rules. As the filter performance depends largely on these as well, future developments may include integrating our approach with phrase-based and/or other lexical analyzers and with rich feature extraction methods, from which even better performance can be expected. Since the performance varies widely with a number of parameters that still have scope for fine tuning, further study of the approach can be continued.
References
[1] Sahami M, Dumais S, Heckerman D, and Horvitz E. A Bayesian Approach to Filtering Junk E-mail. In: Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin, 1998.
[2] Androutsopoulos I, Koutsias J, Chandrinos K V, Paliouras G, and Spyropoulos C D. An Evaluation of Naive Bayesian Anti-Spam Filtering. In: Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning, May 2000.
[3] Androutsopoulos I, Koutsias J, Chandrinos K V, Paliouras G, and Spyropoulos C D. An Experimental Comparison of Naive Bayesian and Keyword-based Anti-Spam Filtering with Personal E-mail Messages. In: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000.
[4] Androutsopoulos I, Koutsias J, Chandrinos K V, Paliouras G, and Spyropoulos C D. Learning to Filter Spam E-mail: A Comparison of a Naive Bayesian and a Memory-based Approach. In: H. Zaragoza, P. Gallinari, and M. Rajman, eds. Proceedings of the Workshop on Machine Learning and Textual Information Access, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2000.
[5] Graham P. A Plan for Spam, 2002. http://www.paulgraham.com/spam.html.
[6] Graham P. Better Bayesian Filtering. In: Proceedings of the First Annual Spam Conference. MIT, 2003. http://www.paulgraham.com/better.html.
[7] Robinson G. Spam Detection, 2003. http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html.
[8] Robinson G. A Statistical Approach to the Spam Problem, Linux Journal, March 2003, Issue-107. http://www.linuxjournal.com/article.php?sid=6467.
[9] Mitchell T M. Machine Learning. McGraw-Hill, 1997. 177–184.
[10] Provost J. Naive-Bayes vs. Rule-Learning in Classification of Email, Department of Computer Sciences, The University of Texas at Austin, 1999.