Binary LNS-based Naïve Bayes Hardware Classifier for Spam Control
Muhammad N. Marsono
M. Watheq El-Kharashi
Fayez Gebali
Dept. of Electrical and Computer Engineering, University of Victoria, Victoria BC, V8W 3P6 Canada E-mail: {mmarsono, watheq, fayez}@ece.uvic.ca
Abstract— We propose a hardware architecture for a naïve Bayes classifier in the context of e-mail classification for spam control. Our proposal presents a word-serial naïve Bayes classifier architecture that utilizes the Logarithmic Number System (LNS) to reduce the computational complexity. We present the hardware architecture for non-iterative binary LNS recoding using a look-up table approach. Our design was synthesized targeting an Altera Stratix CPLD device. The synthesized classifier was functionally verified with a MATLAB implementation. Our binary LNS naïve Bayes classifier exhibits high e-mail classification throughput, limited by the pre-classification throughput.
I. INTRODUCTION

Spam, or unsolicited e-mail, constitutes approximately two-thirds of the e-mail traffic over the Internet [1] and 70% of the total corporate e-mail traffic [2]. Moreover, Internet mass-mailing worm outbreaks have generated exponential e-mail traffic growth, which could disable e-mail infrastructures worldwide [3]. Spam traffic degrades network services and wastes networking resources. Spam costs US companies approximately $20 billion per year in lost productivity and up to $2 billion for anti-spam solutions [4].

Gateway-level anti-spam solutions have been proposed to mitigate the effects of spam (e.g., [5]). These are software-based solutions running on general-purpose processor-based (GPP-based) custom hardware and optimized operating systems. The computing power of GPP-based solutions is unable to cope with the three-fold increase of network load [6]. Specialized hardware architectures with improved processing power are needed to support the network throughput growth and more quality-defined e-mail services.

The naïve Bayes statistical technique has been used in most anti-spam solutions (e.g., [7], [8], [9]). E-mail features are extracted from the e-mail's header, message, and/or attachment. Naïve Bayes assumes that all classification features are independent of each other, temporally and spatially. Floating-point approaches require complex hardware, while fixed-point approaches need a large number of bits. The logarithmic number system (LNS) offers an attractive solution to speed up the naïve Bayes computation [10]. LNS and its dynamic range enable (a) reduction of the naïve Bayes computational complexity; and (b) better control of the hardware classification noise.

Hardware architectures for a general-purpose, iterative binary logarithmic arithmetic unit and a converter have been proposed [11], [12]. Several other iterative techniques for evaluating elementary functions, such as HCORDIC [13] and ERL [14], have been proposed to evaluate the natural logarithm. The binary logarithm, unlike the natural logarithm, can be mapped directly into look-up tables (LUTs) [12] because of the binary number representation. Efficient LUT mapping enables non-iterative binary LNS recoding.

This paper proposes a binary LNS naïve Bayes implementation, optimized for spam classification at the algorithmic level. The specialized naïve Bayes classifier can be used to augment e-mail servers (for e-mail classification speed-up) and to realize our general research goal of upstream e-mail processing. This paper's contributions can be summarized as: (a) modifying the naïve Bayes technique to reduce its arithmetic complexity using the binary LNS approach; (b) presenting a hardware architecture for non-iterative binary LNS recoding for floating-point and fixed-point input; (c) studying the speed-area characteristics of an Altera Stratix [15] CPLD implementation of our naïve Bayes classifier; and (d) verifying the classifier's functionality using numerical simulation in MATLAB.

This paper is organized as follows. Section II describes the naïve Bayes e-mail classification model and the binary LNS optimization. We present the hardware architecture of our binary LNS naïve Bayes e-mail classifier in Section III. We discuss the experimental results in Section IV. We conclude with a few general observations and directions for future work in Section V.

II. NAÏVE BAYES E-MAIL CLASSIFICATION USING BINARY LNS APPROACH

An e-mail can be viewed as a text document that consists of header, message, and attachment fields. Each field is made of a set of features that can be used for classification purposes. Throughout this paper, we assume two classes for classifying an e-mail: $c_0$ (spam) and $c_1$ (legitimate).

The naïve Bayes technique is a supervised-learning statistical technique. A naïve Bayes classifier requires a learning phase to generate a generative model V. Classifying an unknown e-mail (represented by a vector of features x) is performed according to Bayes' theorem to predict the class that is most likely to have generated the unknown e-mail. The naïve assumption implies feature independence and ordering independence: the occurrence of a feature $x_i$ is independent of all other features in x.
A. Multivariate Bernoulli Generative Model

The multivariate Bernoulli model is used for anti-spam e-mail classification (e.g., [7], [8], [9]). Naïve Bayes learning generates V from the learning data set D. The learning data set D of size |D| is a collection of pre-classified e-mails $(d_1, d_2, \ldots, d_{|D|})$. The generative model V of size |V| is a set of tuples $\{v_t, P(v_t|c_0), P(v_t|c_1)\}$. For any class $c_j$ where $j \in \{0, 1\}$, the likelihood probability of any feature $v_t$ occurring in class $c_j$ e-mails is given by:

$$P(v_t|c_j) = \frac{1 + \sum_{i=1}^{|D|} B_{it}\, P(c_j|d_i)}{2 + \sum_{i=1}^{|D|} P(c_j|d_i)} \tag{1}$$

$B_{it} \in \{0, 1\}$ indicates whether feature $v_t$ occurs at least once in an e-mail $d_i$. $P(c_j|d_i) \in \{0, 1\}$ is a binary value that indicates whether an e-mail $d_i$ belongs to class $c_j$. The 1 in the numerator and the 2 in the denominator of Equation (1) prevent $P(v_t|c_j)$ from equaling zero or unity. The a priori probability $P(c_j)$ represents the probability of class $c_j$ e-mails occurring in D. It is given by:

$$P(c_j) = \frac{\sum_{i=1}^{|D|} P(c_j|d_i)}{|D|} \tag{2}$$
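As an illustration of Equations (1) and (2), the learning phase can be sketched in a few lines of Python. This is a minimal sketch, not the learning tool used in the paper: the function name `train_bernoulli`, the representation of each e-mail as a set of features, and the four-message toy data set are all assumptions made for the example.

```python
from fractions import Fraction

def train_bernoulli(emails, labels, vocabulary):
    """Estimate priors (Eq. 2) and per-feature likelihoods (Eq. 1).

    emails: list of feature sets, one per pre-classified e-mail d_i
    labels: list of class indices, 0 (spam) or 1 (legitimate)
    vocabulary: iterable of features v_t to include in the model V
    """
    n_docs = len(emails)
    class_counts = [labels.count(0), labels.count(1)]
    # Eq. (2): fraction of e-mails in D that belong to each class.
    priors = [Fraction(class_counts[j], n_docs) for j in (0, 1)]
    likelihoods = {}
    for v in vocabulary:
        per_class = []
        for j in (0, 1):
            # Number of class-j e-mails that contain v at least once.
            hits = sum(1 for d, c in zip(emails, labels) if c == j and v in d)
            # Eq. (1): the +1 / +2 terms keep the estimate away from 0 and 1.
            per_class.append(Fraction(1 + hits, 2 + class_counts[j]))
        likelihoods[v] = tuple(per_class)
    return priors, likelihoods

# Hypothetical toy data set: two spam and two legitimate e-mails.
emails = [{"free", "offer"}, {"free", "win"}, {"meeting", "notes"}, {"free", "meeting"}]
labels = [0, 0, 1, 1]
priors, likelihoods = train_bernoulli(emails, labels, ["free", "meeting"])
```

Exact fractions are used here only so that the toy numbers are easy to check by hand; a hardware-oriented flow would quantize these probabilities to the fixed- or floating-point formats discussed in Section III.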
Larger |D| gives more precise $P(v_t|c_j)$ and $P(c_j)$ estimations in Equations (1) and (2).

B. Naïve Bayes Classification

A vector x represents m features $(x_1, x_2, \cdots, x_m)$ that could be words, phrases, or strings extracted from an e-mail. The a posteriori probability of the e-mail being spam or legitimate can be calculated using Bayes' theorem as:

$$P(c_0|x) = \frac{P(c_0)\,P(x|c_0)}{P(x)} \tag{3}$$

$$P(c_1|x) = 1 - P(c_0|x) \tag{4}$$
$P(x)$ is the probability of x occurring in D and $P(x|c_0)$ is the likelihood probability of x occurring in class $c_0$. By using the naïve assumption, the likelihood probability $P(x|c_0)$ is given by:

$$P(x|c_0) = \prod_{i=1}^{m} P(x_i|c_0) \tag{5}$$
For each feature, $P(x_i|c_j)$ can be obtained from V. $P(x)$ can be calculated as:

$$P(x) = \sum_{k=0}^{1} P(c_k) \prod_{i=1}^{m} P(x_i|c_k) \tag{6}$$
Substituting (5) and (6) into (3), the a posteriori probability for a class $c_j$ given x is given by:

$$P(c_j|x) = \frac{P(c_j) \prod_{i=1}^{m} P(x_i|c_j)}{\sum_{k=0}^{1} P(c_k) \prod_{i=1}^{m} P(x_i|c_k)} \tag{7}$$

Dividing $P(c_0|x)$ by $P(c_1|x)$ and taking the binary logarithm (instead of the natural logarithm as in [10]) gives:

$$\begin{aligned}
y = \lg\frac{P(c_0|x)}{P(c_1|x)} &= \lg\frac{P(c_0) \prod_{i=1}^{m} P(x_i|c_0)}{P(c_1) \prod_{i=1}^{m} P(x_i|c_1)} \\
&= \sum_{i=1}^{m} \left[\lg\left(P(x_i|c_0)\right) - \lg\left(P(x_i|c_1)\right)\right] + \lg\left(P(c_0)\right) - \lg\left(P(c_1)\right)
\end{aligned} \tag{8}$$
Solving for $P(c_0|x)$ from (8), with $P(c_1|x) = 1 - P(c_0|x)$ from (4), gives:

$$P(c_0|x) = \frac{1}{1 + 2^{-y}} \tag{9}$$

Hence, the multiplication and division operations in Equation (7) are reduced to repeated additions in Equation (8) and a sigmoid-like evaluation in Equation (9).
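The equivalence between the product form (7) and the log-domain form (8)-(9) can be checked numerically. The sketch below uses floating-point arithmetic rather than the finite-precision hardware format, and the function names, the dictionary layout of the model, and the probability values are illustrative assumptions.

```python
import math

def log_odds(features, priors, likelihoods):
    """Eq. (8): accumulate binary-log likelihood differences plus the prior term."""
    y = math.log2(priors[0]) - math.log2(priors[1])
    for x in features:
        p0, p1 = likelihoods[x]          # P(x_i|c_0), P(x_i|c_1) looked up in V
        y += math.log2(p0) - math.log2(p1)
    return y

def spam_probability(y):
    """Eq. (9): sigmoid-like mapping from y back to P(c_0|x)."""
    return 1.0 / (1.0 + 2.0 ** (-y))

def posterior_direct(features, priors, likelihoods):
    """Eq. (7) for class c_0, evaluated with multiplications and a division."""
    joint = []
    for j in (0, 1):
        prod = priors[j]
        for x in features:
            prod *= likelihoods[x][j]
        joint.append(prod)
    return joint[0] / (joint[0] + joint[1])

# Hypothetical model with two features.
model_priors = (0.5, 0.5)
model_likelihoods = {"free": (0.75, 0.5), "meeting": (0.25, 0.75)}
y = log_odds(["free", "meeting"], model_priors, model_likelihoods)
p_log = spam_probability(y)
p_direct = posterior_direct(["free", "meeting"], model_priors, model_likelihoods)
```

For these values the likelihood ratio is (0.75/0.5)(0.25/0.75) = 0.5, so y is -1 and both routes give a spam probability of 1/3.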
III. HARDWARE ARCHITECTURE OF NAÏVE BAYES CLASSIFIER

Fig. 1 shows the naïve Bayes e-mail classifier (in thick boxes) proposed in this paper. The output of the pre-classifier becomes the input of the classifier. Assume that $p_{ij}$ represents the probability terms in (8), which are given by:

$$p_{ij} = \begin{cases} P(c_j), & \text{when } i = 0; \\ P(x_i|c_j), & \text{otherwise.} \end{cases} \tag{10}$$

[Figure: block diagram with blocks labeled E-mail streams, Pre-Classifier, Classifier (LG, ACC, SIGM), Control Unit, V, and External Interface; signals $p_{ij}$, $y$, and $P(c_0|x)$.]

Fig. 1. E-mail classification data path with two separate Pre-classifier and Classifier blocks.
The classifier consists of three main design blocks: (a) the LG block, which converts fixed- or floating-point data into fixed-point binary LNS format; (b) the ACC block, which performs the accumulation operations in Equation (8); and (c) the SIGM block, which performs the sigmoid evaluation function in Equation (9).

Assume that $p_{ij}$ in (10) is a normalized floating-point number. $p_{ij}$ is given by:

$$p_{ij} = (-1)^{s} \times 1.p_f \times 2^{p_e} \tag{11}$$

where $p_e$ is the $e$-bit biased exponent and $p_f$ is the $f$-bit normalized significand. Because $p_{ij}$ represents a probability, the 1-bit sign $s$ is always equal to zero. The binary logarithm of $p_{ij}$ follows from (11):

$$\lg(p_{ij}) = \lg(1.p_f) + p_e \tag{12}$$

where $\lg(1.p_f)$ is bounded by:

$$0 \leq \lg(1.p_f) < 1 \tag{13}$$
Thus, binary LNS recoding is achieved by the addition of $p_e$ to the binary LNS value of $(1.p_f)$. The exponent bias constant will be canceled out when $\lg(p_{i1})$ is subtracted from $\lg(p_{i0})$ according to Equation (8). In fixed-point format, $p_{ij}$ can be expressed as:

$$p_{ij} = 0.p_{fp} \tag{14}$$
where $p_{fp}$ is a $q$-bit number. Fixed-point $p_{ij}$ is first converted to a normalized floating-point form (similar to (11)) by the normalizer block.

Fig. 2 shows the LG block design. A demultiplexer routes floating-point or fixed-point $p_{ij}$ based on the select signal issued by the control unit. For fixed-point $p_{ij}$, a normalizer block normalizes the input into floating-point format. Multiplexers are used to select $p_f$ and $p_e$ from either the floating-point or the fixed-point input. $\lg(1.p_f)$ is evaluated from $p_f$ (with implicit 1) using a LUT of size $\lambda_1 \times \lambda_2$ and added to $p_e$ to form a two's complement binary LNS value with an $e$-bit integer and an $f$-bit fraction.

[Figure: LG block with inputs $p_{ij}$ and select, a Normalizer on the fixed-point path, a LUT mapping $p_f$ to $\lg(1.p_f)$, the exponent $p_e$, and output $\lg(p_{ij})$.]

Fig. 2. LG block design with separate floating-point and fixed-point data paths.
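The behavior of the LG block described above can be modeled in software. The following is a behavioral sketch under stated assumptions, not the paper's VHDL: the fraction width `F = 8`, the function names, the use of an unbiased exponent (the paper notes that the bias cancels in Equation (8)), and the rounding of LUT entries are all choices made for illustration.

```python
import math

F = 8  # fraction bits of the LNS value and LUT index width (illustrative choice)

# LG_LUT[k] holds lg(1 + k / 2**F) as an F-bit fraction, for k = 0 .. 2**F - 1.
LG_LUT = [min(round(math.log2(1 + k / 2**F) * 2**F), 2**F - 1) for k in range(2**F)]

def lg_from_float(p):
    """Recode a floating-point probability 0 < p <= 1 (Eqs. 11 and 12).

    Returns lg(p) as an integer scaled by 2**F.
    """
    mant, exp = math.frexp(p)          # p = mant * 2**exp with 0.5 <= mant < 1
    p_e = exp - 1                      # exponent of the 1.p_f form (unbiased here)
    p_f = int((2 * mant - 1) * 2**F)   # top F fraction bits of 1.p_f, truncated
    return p_e * 2**F + LG_LUT[p_f]    # single table look-up plus one addition

def lg_from_fixed(n, q):
    """Recode a q-bit fixed-point probability 0.p_fp = n / 2**q, 1 <= n < 2**q (Eq. 14).

    The normalizer step locates the leading one, then the float path is reused.
    """
    lead = n.bit_length() - 1          # bit position of the leading one
    p_e = lead - q                     # n / 2**q = 1.frac * 2**(lead - q)
    frac = n - (1 << lead)             # bits below the leading one
    # Align the fraction bits to the F-bit LUT index.
    p_f = frac << (F - lead) if lead <= F else frac >> (lead - F)
    return p_e * 2**F + LG_LUT[p_f]
```

For example, 0.75 is 1.5 × 2^-1, so both paths look up the entry for a fraction of 0.5 and add an exponent of -1; the 8-bit fixed-point code 192 (that is, 192/256) takes the same route after normalization. No iteration is involved, which is the point of the LUT-based recoding.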
A. ACC Block

Fig. 3 shows the ACC and SIGM blocks. The ACC block receives the binary logarithms of $p_{ij}$ from the LG block and produces $y$. To ensure that overflow or underflow conditions do not occur, the accumulator's width $\lambda_3$ must satisfy:

$$\lambda_3 \geq \lceil \lg(m) \rceil + (e + \lambda_2) \tag{15}$$

where $\lambda_2$ is equal to $(q - 1)$ for fixed-point, or $f$ for floating-point.

[Figure: ACC block of width $\lambda_3$ with inputs $\lg(p_{i0})$ and $\lg(p_{i1})$ and output $y$, followed by the SIGM block with output $P(c_0|x)$.]

Fig. 3. ACC and SIGM block designs.

B. SIGM Block: Estimating Spam Probability

The SIGM block in Fig. 3 is a LUT of size $2^{\lambda_4} \times \lambda_5$ used to evaluate Equation (9), which gives the sigmoid-like response shown in Fig. 4. $P(c_0|x)$ is 0 when $y < y_{min}$ and 1 when $y > y_{max}$.

[Figure: plot of $P(c_0|x)$ (vertical axis, 0 to 1) against $y$ (horizontal axis, -15 to 15), with $y_{min}$ and $y_{max}$ marked.]

Fig. 4. $P(c_0|x)$ approaches 0 when $y < y_{min}$ and 1 when $y > y_{max}$.

The width $\lambda_5$ determines the number of words in the LUT. The upper and lower bounds of $y$, $y_{max}$ and $y_{min}$, can be derived when:

$$1 - \left(1 + 2^{-y_{max}}\right)^{-1} = 2^{-(\lambda_5 + 1)} \tag{16}$$

$$\left(1 + 2^{-y_{min}}\right)^{-1} = 2^{-(\lambda_5 + 1)} \tag{17}$$

Solving Equations (16) and (17) gives:

$$y_{max} < (\lambda_5 + 1) \tag{18}$$

$$y_{min} > -(\lambda_5 + 1) \tag{19}$$

From the equations above, $\lambda_4$ is given as:

$$\lambda_4 = \lceil \lg(\lambda_5 + 1) \rceil + 1 + \lambda_5 \tag{20}$$

where $(\lceil \lg(\lambda_5 + 1) \rceil + 1)$ bits are used for the signed integer part and $\lambda_5$ bits for the fraction.

IV. EXPERIMENTAL RESULTS

We synthesized the naïve Bayes classifier in Fig. 1 targeting Altera Stratix CPLD devices. We ran simulations of the classifier in MATLAB. We compared the classification results from both implementations to verify the correctness of the hardware implementation.

A. Design Settings

For both the synthesized and the simulated implementations, we used the design settings shown in Table I. The Input Data Type indicates fixed-point inputs (FP) or floating-point inputs (FLP). For Design 4, we used an LG block without fixed-point support in order to evaluate a floating-point-only implementation. The next two rows show the classifier's parameters chosen in our experiment.
TABLE I
DESIGN SETTINGS' PARAMETERS.

Design             1          2            3            4
Input Data Type    FP         FP           FP           FLP
q or f (bits)      8          12           16           12
LG's LUT size      2^7 × 7    2^11 × 11    2^12 × 12    2^12 × 12
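The width formulas of Section III, Equations (15) and (20), can be evaluated for concrete parameters, and the SIGM look-up table contents can be generated from Equation (9). The sketch below is an assumption-laden illustration: the helper names are hypothetical, the values of m and e passed to `acc_width` are made up because the paper does not list them, and the rounding and saturation of the LUT outputs are one possible quantization choice rather than the paper's.

```python
def ceil_lg(n):
    """Ceiling of the binary logarithm of a positive integer, in integer arithmetic."""
    return (n - 1).bit_length()

def acc_width(m, e, lam2):
    """Minimum accumulator width lambda_3 from Eq. (15)."""
    return ceil_lg(m) + e + lam2

def sigm_address_width(lam5):
    """SIGM LUT address width lambda_4 from Eq. (20)."""
    return ceil_lg(lam5 + 1) + 1 + lam5

def build_sigm_lut(lam5):
    """Tabulate Eq. (9) for every two's complement address y with lam5 fraction bits."""
    lam4 = sigm_address_width(lam5)
    lut = []
    for addr in range(2**lam4):
        # Interpret the address as a signed two's complement number.
        raw = addr - 2**lam4 if addr >= 2**(lam4 - 1) else addr
        y = raw / 2**lam5
        p = 1.0 / (1.0 + 2.0 ** (-y))
        # Quantize to lam5 bits, saturating at the largest code (assumed policy).
        lut.append(min(round(p * 2**lam5), 2**lam5 - 1))
    return lut

lam4 = sigm_address_width(8)        # 8-bit SIGM output
sigm_lut = build_sigm_lut(8)
example_acc = acc_width(16, 4, 7)   # hypothetical m = 16 terms, e = 4, lambda_2 = 7
```

With an 8-bit output, Equation (20) gives a 13-bit address (4 + 1 + 8), which agrees with the 2^13 × 8-bit SIGM LUT and the y range of [-2^4, 2^4 - 2^-8] used in the experiments.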
For fixed-point $p_{ij}$, we fixed $q$ to 8, 12, and 16 bits. For floating-point $p_{ij}$, a significand $p_f$ of 12 bits is used. We set the size of LG's LUT to $2^{\lambda_2} \times \lambda_2$, bounded to $2^{12} \times 12$. We chose the data path width $\lambda_3$ in Fig. 3 to be twice the width of $\lambda_2$ for synthesis. The SIGM's LUT size is fixed to $2^{13} \times 8$ bits to evaluate $y$ in the range $[-2^4, (2^4 - 2^{-8})]$.

B. Synthesis

We coded the classifier block using RTL-style VHDL with std_ulogic data types. Using the Altera Quartus II tool, we synthesized our classifier targeting the Altera Stratix EP1S10F780C5ES device. The classifier's control unit was designed as a hardwired finite state machine. The contents of the LUTs in ACC and SIGM were obtained using MATLAB.

Table II shows the CPLD resource usage. The usage of memory bits increases with the size of the LUTs. The number of logic elements and I/O pins used increases with the data path widths. For 8-bit fixed-point input, our classifier achieves more than 156 MHz, decreasing to 117.26 MHz for 16-bit fixed-point input. The floating-point input achieves 142.23 MHz due to the simplified LG block design. The word-serial architecture and non-iterative binary LNS recoding enable one feature classification per clock cycle. Hence, the clock frequency translates directly into the number of features classified per second.

TABLE II
SYNTHESIS RESULTS TARGETING ALTERA STRATIX EP1S10F780C5ES.

Design                    1         2         3          4
Area
  Logic Elements (LE)     146       278       313        144
  Memory Bits (bits)      38,656    81,920    135,168    135,168
  I/O Pins                43        59        75         77
Timing
  fmax (MHz)              156.18    128.78    117.26     142.23
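Because one feature is processed per clock cycle, an e-mail rate follows from dividing the clock frequency by the number of cycles spent on each e-mail. The sketch below makes that relation explicit. The cycle counts of 3840, 1280, and 16 per e-mail are not stated in the paper; they are inferred here because they reproduce the entries of Table III to within one e-mail per second from the fmax values of Table II, and they should be read as an assumption. The dictionary and function names are likewise hypothetical.

```python
# fmax values from Table II, in Hz, keyed by design number.
FMAX_HZ = {1: 156_180_000, 2: 128_780_000, 3: 117_260_000, 4: 142_230_000}

# Assumed clock cycles per e-mail for each feature extraction method.
# These counts are inferred from Table III, not given in the text.
CYCLES_PER_EMAIL = {"1-gram": 3840, "3-gram": 1280, "15 features": 16}

def emails_per_second(fmax_hz, cycles_per_email):
    """Throughput of a word-serial classifier that consumes one feature per cycle."""
    return fmax_hz / cycles_per_email

throughput = {
    design: {method: emails_per_second(fmax, cycles)
             for method, cycles in CYCLES_PER_EMAIL.items()}
    for design, fmax in FMAX_HZ.items()
}
```

Under this reading, 16 cycles for the 15-feature method would be consistent with fifteen feature terms plus the prior term (i = 0) in (10), although the paper does not say so explicitly.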
C. E-mail Processing Throughput

To obtain the e-mail processing throughput, we make the following assumptions: (a) an average e-mail size of 30 kB [17]; and (b) unlimited pre-classifier throughput (Fig. 1). The throughputs of all design settings are given in Table III for three feature extraction methods: (a) 1-gram classification [18]; (b) non-overlapping 3-gram; and (c) 15 features per e-mail as used in [9]. The 1-gram method classifies an e-mail on a per-character basis, while the 3-gram method segments the e-mail into non-overlapping three-character segments.

TABLE III
E-MAIL PROCESSING THROUGHPUTS (E-MAILS PER SECOND)

Design                1            2            3            4
1-gram                40,671       33,536       30,536       37,039
3-gram                122,016      100,609      91,609       111,117
15 features/e-mail    9,761,250    8,048,750    7,328,750    8,889,375

D. Data Sets

Verifying the classifier requires the generative model V and a test set x. We used a data set D from SpamAssassin's public corpus [16], which consists of 4361 and 2357 predefined legitimate and spam e-mails, respectively. Annoyance Filter [9] is used to generate V and x.

E. Design Verification

Using the design settings in Table I, we wrote a naïve Bayes classifier in MATLAB with the same finite precision as in the synthesized classifier. We used the test set x on both classifier implementations and compared the results. The results from the synthesized classifier conformed with the results from the MATLAB classifier.

V. CONCLUSION

We presented a word-serial hardware architecture of a naïve Bayes classifier for two-class e-mail classification for spam control. We showed that by using binary LNS recoding, the computational complexity of naïve Bayes operations can be reduced. The binary LNS also enables simple non-iterative recoding for either floating-point or fixed-point numbers. Synthesized for an Altera Stratix CPLD device, our hardware classifier is capable of classifying more than 37 thousand, 91 thousand, and 7 million e-mails per second for 1-gram, non-overlapping 3-gram, and fifteen features per e-mail, respectively. A high-throughput e-mail classifier will pave the way for fast e-mail classification and filtering. The hardware classifier can be used as an off-load engine that eases the classification processing in e-mail systems. The noise analysis to determine the robustness of the hardware classifier and the investigation of in-line e-mail classification are reserved for future work.

ACKNOWLEDGMENTS

The first author is funded by Malaysian Government scholarship JPA-UTM JPA(L)A-3238549. He is attached to the Faculty of Electrical Engineering, Universiti Teknologi Malaysia.

REFERENCES
[1] J. Goodman, D. Heckerman, and R. Rounthwaite, "Stopping spam," Scientific American, pp. 42–49, April 2005.
[2] Symantec Webcast. (2005) [Online]. Available: http://www.symantec.com/resellers/softchoice
[3] C. Wong, S. Bielski, J. M. McCune, and C. Wang, "A study of mass-mailing worms," in WORM '04: Proceedings of the 2004 ACM Workshop on Rapid Malcode, Washington DC, USA, 2004, pp. 1–10.
[4] J. Lyman, "Spam costs $20 billion each year in lost productivity," LinuxInsider Website, December 2005. [Online]. Available: http://www.linuxinsider.com/story/32478.html
[5] Espion, January 2006. [Online]. Available: http://www.espionintl.com/enterprise.php
[6] P. Lekkas, Network Processors: Architectures, Protocols, and Platforms. McGraw Hill, New York, 2003.
[7] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian Approach to Filtering Junk E-Mail," in Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin: AAAI Technical Report WS-98-05, 1998.
[8] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos, "An evaluation of naive Bayesian anti-spam filtering," in Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning, Barcelona, Spain, pp. 9–17, May 2000.
[9] J. Walker, "Annoyance Filter: Adaptive Bayesian junk mail filter," November 2005. [Online]. Available: http://www.fourmilab.ch/annoyance-filter/
[10] C. Elkan, "Boosting and Naive Bayesian Learning," Technical Report CS97-557, UCSD, September 1997. [Online]. Available: http://www-cse.ucsd.edu/users/elkan/papers/bnb.ps
[11] R. Matousek, M. Tichý, Z. Pohl, J. Kadlec, C. Softley, and N. Coleman, "Logarithmic number system and floating-point arithmetics on FPGA," in Proceedings of the International Conference on Field-Programmable Logic and its Applications, Montpellier, France, pp. 627–636, Sept 2002.
[12] Y. Wan and C.-L. Wey, "Efficient algorithms for binary logarithmic conversion and addition," IEE Proceedings of Computers and Digital Techniques, vol. 146, no. 3, pp. 168–172, May 1999.
[13] F. Elguibaly, N.-T. Sui, and A. Rayhan, "HCORDIC: A high-radix adaptive CORDIC algorithm," Canadian Journal of Electrical and Computer Engineering, vol. 25, no. 3, pp. 149–154, Oct 2000.
[14] F. Gebali and M. W. El-Kharashi, "ERL: An algorithm for fast evaluation of exponential, reciprocal, and logarithmic functions," in 2004 International Conference on Electrical, Electronic and Computer Engineering, Cairo, Egypt, pp. 269–272, Sept 2004.
[15] Altera Stratix Device Handbook. (2005) [Online]. Available: http://www.altera.com/literature/hb/stx/stratix handbook.pdf
[16] SpamAssassin Public Corpus. (2005) [Online]. Available: http://spamassassin.apache.org/publiccorpus/
[17] L. H. Gomes, C. Cazita, J. M. Almeida, V. Almeida, and J. Wagner Meira, "Characterizing a Spam Traffic," in Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (ICM), Taormina, Sicily, Italy, pp. 356–369, October 2004.
[18] C. O'Brien and C. Vogel, "Spam Filters: Bayes vs. Chi-squared; Letters vs. Words," in Proceedings of the 1st International Symposium on Information and Communication Technologies (ISICT), Dublin, Ireland, pp. 291–296, September 2003.