Hybrid Switch on Artificial Neural Networks and Naïve Bayes

ISSN 2315-5027; Volume 3, pp. 1-8; November, 2015

Online Journal of Physical and Environmental Science Research ©2015 Online Research Journals
Full Length Research
Available Online at http://www.onlineresearchjournals.org/JPESR

Hybrid Switch on Artificial Neural Networks and Naïve Bayes

Clopas Kwenda
Accounting and Information System Department, Great Zimbabwe University.
E-mail: [email protected]; Tel.: +263773357572.

Received 24 October, 2015

Accepted 13 November, 2015

E-mails have, to a very large extent, enabled millions of people around the globe to interact with each other by exchanging information in the form of text, videos, audio and pictures. In due course this blessing turned out to be a curse in disguise, as mischievous people take advantage of e-mail by sending millions of spam messages to everyone who has an e-mail account. Efforts have been made to design filters that reduce the seepage of spam into the mail inbox, but to date there is no filter that is 100% effective at blocking spam. Filters that have been developed employ machine learning algorithms such as support vector machines (SVM), the naïve Bayes algorithm and artificial neural networks. The goal of this paper was to design a hybrid switch between naïve Bayes and artificial neural networks to reduce spam seepage. The general idea of the model is that in the first stage classification is done using an ANN at 300 iterations; when the ANN fails to determine the category of the message, the Bayesian algorithm proceeds as a second stage to categorise the message. The ANN is arguably better at classification than at training, so training is done using naïve Bayes. The results showed that the hybrid approach is effective at blocking spam, achieving an accuracy of 94% on the Ling Spam data set. A suggestion was made to employ the OCR technique in filters so that videos and pictures can be treated as tokens, enabling filters to categorise pornographic videos and pictures.

Key words: Spam, artificial neural network, naïve Bayes algorithm, optical character recognition, support vector machines.

INTRODUCTION

The introduction of ICTs (information communication technologies) in the modern world has vastly improved the communication system. This is evidenced by the different tools now available for communication.
Almost everyone with access to the internet is subscribed to an e-mail account, which he/she uses to communicate with anyone all over the world in a matter of seconds. Electronic mail, referred to as e-mail, is a method of exchanging digital messages from an author to one or more recipients [1]. But despite the benefits of e-mail for communication, its introduction to the world has brought spam. The term spam refers to unsolicited, unwanted, annoying and inappropriate bulk e-mail [2]. Due to its negative impact on e-mail users, spam is often referred to as Unsolicited Bulk Email (UBE), Excessive Multi-Posting (EMP), Unsolicited Commercial Email (UCE), Unsolicited Automated

Email (UAE), bulk mail or just junk mail [2]. Spammers harvest e-mail addresses by a variety of means: from mailing lists, from posts to Usenet carrying your e-mail address, from web pages, from various web and paper forms, via an ident daemon, from a web crawler, from IRC and chat rooms, from finger daemons, from AOL profile files, from daemon contact points, by guessing and cleaning, from white and yellow pages, by having access to the same computer, from the previous owner of the e-mail address, through social engineering attacks, from address books and e-mails on other people's computers, by buying lists from others, and by hacking into sites [3]. The problem with spam is that it makes up 30% to 60% of mail traffic and is on the rise [2]. It can slow mail traffic, and spam received and stored in a mailbox can cause mailbox shutdown problems. E-mail users waste much of their precious time managing and deleting these unwanted junk messages.

Background

Naïve Bayes

The naïve Bayes classifier derives its background from Bayes' theorem on conditional probability. For instance, P(X|Y) is the conditional probability of X occurring given that Y has already occurred. A given document falls into one of two categories: it is either spam or ham (a legitimate message). To determine the category Cj of a document d, naïve Bayes calculates the probability as follows:

P(Cj|d) = P(Cj) P(d|Cj) / P(d)

where P(Cj|d) is the posterior, P(d|Cj) is the likelihood, P(d) is the evidence and P(Cj) is the prior.

Naïve Bayes works with a very strong assumption of feature (token) independence: the occurrence of one feature is not affected by the occurrence of another feature. Features, or tokens, are the words of interest within a document d. Supposing that a document d has n features F = f1, ..., fn:

P(Cj|d) = P(Cj) · ∏ i=1..n P(fi|Cj) / P(f1, ..., fn)

The prior P(Cj) of a class (ham or spam) is calculated as follows:

P(Cj) = Aj / A

where Aj is the number of training documents belonging to category Cj and A is the total number of training documents.

Naïve Bayes calculates the posterior of each class, and the document is assigned to a class depending on the Bayesian probability value obtained for that document. The process of Bayesian filtering is straightforward. In the first stage, corpora of ham (legitimate) and spam messages are given as input to a Bayesian learning algorithm. The messages are tokenized using regular expressions to produce a set of features, or tokens [4]. The frequency of each token is then recorded in the database. The tokens of interest are those whose value is close to 0 (highly indicative of spam) or close to 1 (highly indicative of ham). A dimensionality vector reduction technique is then used to remove stop words and noisy tokens (i.e. tokens that are equally likely to occur in a spam message or a ham message). When the process of training is completed, the clean tokens in the database are used by the algorithm to determine the category of a message.

Artificial Neural Networks

An artificial neural network (ANN) is a mathematical model (see Figure 1) that tries to simulate the structure and functionality of biological neural networks [5]. The model of an artificial neuron has three generic steps:

1) Multiplication
2) Summation
3) Activation

At the entrance of the model, each input x1 is multiplied by an individual weight w1 to get x1w1; this process is repeated for every input supplied. In the middle of the model, the weighted inputs are summed together. Finally, at the exit of the model, the sum of the weighted inputs is passed through an activation function.

The neuron is the building block of artificial neural networks. A neuron is an information-processing unit, and neurons are connected by weighted links that pass signals from one neuron to another. The neuron computes the weighted sum of the input signals and compares the result with a threshold value θ [6]. If the net input is less than the threshold θ, the neuron output is -1; if the net input is greater than or equal to the threshold, the neuron becomes activated and its output attains the value +1 [6].

In 1958, Frank Rosenblatt came up with a simple ANN training algorithm called the perceptron (Figure 2), which consists of a single neuron with adjustable synaptic weights and a hard limiter. In the perceptron, the weighted sum of the inputs is supplied to the hard limiter, which produces an output of +1 if the sum is positive and -1 if it is negative. The aim of the perceptron is to classify inputs x1, x2, ..., xn into one of two classes, spam or ham.

How the Perceptron Works

Weights are adjusted to reduce the gap between the actual and the desired outputs of the perceptron. The initial weights are randomly picked from the range [-0.5, 0.5] and then updated to achieve the desired result. At each iteration p (i.e. training example), the actual output Y(p) is compared with the desired output Yd(p) to calculate the error:

e(p) = Yd(p) - Y(p)

The weights are adjusted accordingly to reduce the error.

Perceptron Learning Rule

wi(p + 1) = wi(p) + α · xi(p) · e(p)

where α is the learning rate, with the condition 0 ≤ α ≤ 1.

Sabri et al. [7] developed a continuous learning approach for spam detection using an ANN (CLA_ANN) along the following lines:

begin
  for each token in the message do
    if token exists in the input layer then
      ...
      if ... > threshold then
        desired_output ← 0.9999 (spam)
      else
        desired_output ← 0.1111 (ham)
      end if
    end if
  end for
  score ← total_weight / number_of_tokens_matched  {determine score using weighted sum}
  if score > threshold then
    message is spam
    error_rate ← absolute(desired_output - score)
    correction ← error_rate * learning_rate
    token_weight ← token_weight + correction, if the token does not exist in the innate input layer (this is to prevent continuous learning)
  else
    message is not spam
    error_rate ← absolute(desired_output - score)
    correction ← error_rate * learning_rate
    token_weight ← token_weight - correction
  end if
end [7]
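For illustration, the perceptron training procedure above can be sketched in Python. This is a minimal sketch, not the paper's implementation: the hard limiter outputs ±1, and the threshold θ and learning rate α values are illustrative assumptions.

```python
import random

def train_perceptron(samples, alpha=0.1, iterations=300, theta=0.2):
    """Train a single perceptron with the learning rule
    w_i(p+1) = w_i(p) + alpha * x_i(p) * e(p)."""
    n = len(samples[0][0])
    # initial weights picked at random from the range [-0.5, 0.5]
    w = [random.uniform(-0.5, 0.5) for _ in range(n)]
    for _ in range(iterations):
        for x, y_desired in samples:        # y_desired: +1 (spam) or -1 (ham)
            net = sum(xi * wi for xi, wi in zip(x, w)) - theta
            y = 1 if net >= 0 else -1       # hard limiter
            e = y_desired - y               # error e(p) = Yd(p) - Y(p)
            w = [wi + alpha * xi * e for wi, xi in zip(w, x)]
    return w

def classify(x, w, theta=0.2):
    """Hard-limiter output for a feature vector x under weights w."""
    net = sum(xi * wi for xi, wi in zip(x, w)) - theta
    return 1 if net >= 0 else -1
```

On linearly separable token-feature vectors this update converges to a separating weight vector (the perceptron convergence theorem), which is why a fixed iteration budget such as 300 suffices in practice.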

Combining Naïve Bayes and ANN for Spam Detection

A framework for combining naïve Bayes and an ANN is proposed. The general idea is to use the ANN first for message classification. If the overall value computed by the ANN cannot clearly determine the category of the message, control is handed to naïve Bayes. Naïve Bayes is used for the training stage and for producing data-set tokens, using white space as the delimiter. Neural networks are not used for training because they have a long training time and require a large number of parameters that are best determined empirically [8]; hence the ANN is used for classification only. Neural networks have also been criticized for their poor interpretability, since it is difficult to interpret the symbolic meaning behind the learned weights [9].

Training

Corpora of messages (i.e. collections of ham and spam messages) are taken as input to a supervised learning algorithm. The algorithm uses white space as a delimiter to produce a set of tokens (also called features), a process known as tokenisation. The frequency count of each token relative to the overall tokens is kept in the database. This process is repeated with as many corpora of messages as possible.

Testing

A message (the message whose category we want to determine) is first tokenized to produce a set of tokens. With the help of the tokens in the database, noisy tokens are eliminated (i.e. tokens whose values do not clearly determine the category of the message; this includes words such as "is", "or", etc.). This is the decision matrix process, and it produces apt tokens (Figure 3). The artificial neural network does the classification first, using the step activation function at 300 iterations, the optimum number of iterations reached by Kufandirimbwa and Gotora [10]. The function is as follows:

Y(p) = step [ Σ i=1..n xi(p) · wi(p) - θ ]
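The two-stage switch just described amounts to a short piece of control flow. A minimal sketch, where `ann_score` and `naive_bayes_classify` are hypothetical stand-ins for the two trained classifiers and the ambiguity band around 0.5 is an assumed parameter:

```python
def hybrid_classify(message, ann_score, naive_bayes_classify, band=0.05):
    """Stage 1: the ANN scores the message; a step-style output near 0.5
    means the ANN cannot clearly determine the category.
    Stage 2: in that case, control is handed to naive Bayes."""
    y = ann_score(message)          # assumed to return a value in [0, 1]
    if abs(y - 0.5) <= band:        # ambiguous: switch to naive Bayes
        return naive_bayes_classify(message)
    return "spam" if y > 0.5 else "ham"
```

For example, `hybrid_classify(msg, ann, nb)` returns the ANN's verdict whenever its score is decisive and the Bayesian verdict otherwise.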

If the output from the ANN cannot clearly determine the category of the message, then naïve Bayes takes over. The general assumption here is that what cannot be solved by the ANN, naïve Bayes (see Figure 4) will solve.

Proposed Algorithm

LEARN_NAIVE_BAYES_TEXT

1) Collect all tokens that occur in the training data (TD):
B := all distinct tokens in TD // for determining the probability of unseen items
2) Calculate the required P(Cj) and P(wk|Cj) probability


Figure 3. Dimensionality vector technique. Source: Kufandirimbwa et al. [4].

Figure 4. Proposed hybrid algorithm.




terms For each target value cj in C do Docsj: = subset of TD for which the target value is cj P(cj) := Lj; = a single document created by concatenating all members of docsj n:= total number of tokens in Lj (counting duplicate tokens multiple times) 3) For each token wk in B nk : = number of times token xi occurs in Lj P(xi|Cj) := // the Jeffery’s Perks Law

Classification

4) positions := all token positions in the message that contain tokens found in B
5) Calculate Y(p):

Y(p) = step [ Σ i=1..n xi(p) · wi(p) - θ ]

If Y(p) ≈ 0.5, return CNB, where

CNB := argmax cj ∈ C [ log P(cj) + Σ i ∈ positions log P(wi|cj) ]

MATERIALS AND METHODS

Benchmark Corpora

A hybrid switch on naïve Bayes and artificial neural networks was designed to classify messages as either spam or ham. The experiment was conducted on the Ling Spam public corpus [11], which contains 2412 ham messages and 481 spam messages. The hold-out method was employed to split the data into two parts: half of the messages were used for testing and the other half for training. The messages for testing from Ling Spam were further divided into three groups, as shown in Table 1.

Table 1. Three sets of messages to test.

Messages  Ham  Spam
M1        402  80
M2        402  80
M3        402  80

Performance Measurement

The following variables were used to measure performance: precision, recall and weighted accuracy (Table 2). Legitimate precision is the proportion of messages classified as genuine that are indeed genuine, whereas legitimate recall is the proportion of correctly classified genuine messages out of all genuine messages [12]. The following counts are defined:

nSS: the number of spam messages accurately classified as spam.
nSL: the number of spam messages inaccurately classified as legitimate.
nLL: the number of legitimate messages accurately classified as legitimate.
nLS: the number of legitimate messages inaccurately classified as spam.

The accuracy of a classifier is the number of correctly classified samples divided by the total number of test samples [8,12]:

accuracy = (number of correctly classified messages) / (total number of all messages) = (nLL + nSS) / (nSS + nLL + nSL + nLS)

Legitimate Precision (LP) = (number of messages classified as legitimate that are legitimate) / (total number of messages classified as legitimate) = nLL / (nLL + nSL)

Legitimate Recall (LR) = (number of legitimate messages classified as legitimate) / (total number of legitimate messages) = nLL / (nLL + nLS)

Spam Precision (SP) = (number of messages classified as spam that are spam) / (total number of messages classified as spam) = nSS / (nSS + nLS)

Spam Recall (SR) = (number of spam messages classified as spam) / (total number of spam messages) = nSS / (nSS + nSL)
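These definitions translate directly into code. A small sketch, using the M1 corpus counts reported in Table 2 as the worked example:

```python
def performance(n_ll, n_sl, n_ls, n_ss):
    """Precision, recall and accuracy from the four counts defined above."""
    return {
        "legitimate_precision": n_ll / (n_ll + n_sl),   # LP
        "legitimate_recall":    n_ll / (n_ll + n_ls),   # LR
        "spam_precision":       n_ss / (n_ss + n_ls),   # SP
        "spam_recall":          n_ss / (n_ss + n_sl),   # SR
        "accuracy": (n_ll + n_ss) / (n_ll + n_ss + n_sl + n_ls),
    }

# Corpus M1 from Table 2: nLL = 382, nSL = 10, nLS = 20, nSS = 70
m1 = performance(382, 10, 20, 70)
```

This reproduces the SP, SR, LP and LR values reported for M1 in Table 2 (0.778, 0.875, 0.974 and 0.950 respectively).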

RESULTS AND DISCUSSION

The results in Figure 5 show that the accuracy of the hybrid approach is 0.94, approximately 94%, which is indicative of a good result. The accuracy of the hybrid approach is better than that of the studies by Pantel and Lin [13] (92%) and Kufandirimbwa et al. [4] (93%), which were based purely on the Bayesian technique. However, Kingstrom [14] and Anderson [15], who obtained accuracies of 99.21% and 99%, had better results than the hybrid approach. The reason could be their choice of probability estimator for the resulting tokens and their choice of data sets for the experiment. Ham precision and ham recall of the hybrid model have



Table 2. Results shown by three messages.

Corpus  nL   nS  nLL  nSL  nLS  nSS  SP     SR     LP     LR     Accuracy
M1      402  80  382  10   20   70   0.778  0.875  0.974  0.950  0.979
M2      402  80  388  13   14   67   0.827  0.838  0.967  0.965  0.988
M3      402  80  381  9    21   71   0.772  0.888  0.977  0.948  0.979

Figure 5. Results of variables for performance measurement: spam precision 0.972, spam recall 0.867, ham precision 0.972, ham recall 0.954, accuracy 0.94.

superb results, with values of 97% and 95% respectively. The value of the former variable (ham precision) indicates that the hybrid approach is strong at reducing S→L misclassification (spam misclassified as ham) among messages classified as legitimate, and the value of the latter variable (ham recall) indicates that it is strong at reducing L→S misclassification (ham misclassified as spam) of the total legitimate messages, which is generally the worse error, as one may lose important information. The ham precision and recall results are better than those of Ndumiyana et al. [12], who obtained 94% ham precision and 91% ham recall using a purely ANN approach, and Kufandirimbwa et al. [4], who obtained 92% ham precision and 86% ham recall using a purely Bayesian approach. The hybrid approach is also stronger in spam precision than in spam recall, with values of 97% and 87% respectively. This means the model is good at classifying spam messages as spam rather than misclassifying spam as ham out of the total number of spam messages. These results are better than those of Kufandirimbwa and Gotora [10], who obtained 94% spam precision and 77% spam recall at the 300 iterations they recommend for their ANN approach.

Conclusion

A hybrid switch model between naïve Bayes and an ANN (based on the single perceptron) was designed with the intention of reducing the spam that seeps through to the mail inbox. Based on the Ling Spam data set, the model produced better results, achieving an accuracy of 94%, which is better than models based solely on naïve Bayes or artificial neural networks.

Recommendation

Filters that exist at the current moment concentrate mostly on text information. The author suggests designing a filter that employs the OCR (optical character recognition) technique to produce tokens from videos and pictures, so as to block pornographic videos and pictures from seeping through to the mail inbox.

References

[1] Wikipedia. Email definition. Retrieved 8 August, 2015 from https://en.wikipedia.org/wiki/E-mail.

[2] Ibrahim A. Spam Filtering Using Open Source Software. Mara


University of Technology. 2006; P. 1.

[3] Raz U. How do spammers harvest e-mail addresses? Retrieved 8 August, 2015 from www.private.org.il/haervest.html.

[4] Kufandirimbwa O, Benny MK, Clopas K. Bayesian Technique Using Regular Expressions as a Way of Message Tokenization. Online J Phys Environ Sci Res, 2012; 1(3): 38-44.

[5] Krenker A, Bešter J, Kos A. Introduction to the Artificial Neural Networks. In: Suzuki K (Ed.), Artificial Neural Networks - Methodological Advances and Biomedical Applications. InTech, ISBN: 978-953-307-243-2, 2011.

[6] Włodzisław Duch, Norbert Jankowski. Survey of Neural Transfer Functions. Department of Computer Methods, Nicolaus Copernicus University, Torun, Poland, 1999; pp 4-5.

[7] Sabri AT, Mohammads AH, Al-Shargabi B, Hamdeh MA. Developing New Continuous Learning Approach for Spam Detection using Artificial Neural Network (CLA_ANN). Eur J Sci Res, 2010; 42(3): 525-535.

[8] Islam S, Khaled SM, Farhan K, Rahman A, Rahman J. Modelling Spammer Behaviour: Naïve Bayes vs. Artificial Neural Networks. International Conference on Information and Multimedia Technology, 2009.

[9] Han J, Kamber M. Data Mining: Concepts and Techniques. Academic Press, ISBN 81-7867-023-2, 2001.

[10] Kufandirimbwa O, Gotora R. Spam Detection Using Artificial Neural Network. Online J Phys Environ Sci Res, 2012; 1(2): 22-29.

[11] Ling Spam Public Corpus. Retrieved 15 July, 2015 from http://www.aueb.gr/users/ion/.

[12] Ndumiyana D, Magomelo M, Sakala L. Spam Detection Using a Neural Network Classifier. Online J Phys Environ Sci Res, 2013; 2(2): 28-37.

[13] Pantel P, Lin D. SpamCop: A Spam Classification and Organization Program. In Proceedings of the AAAI Workshop on Learning for Text Categorization, 1998; pp 95-98.

[14] Kingstrom J. Improving Naïve Bayesian Spam Filtering. Mid Sweden University, 2005; pp 16-17.

[15] Anderson D. Statistical Spam Filtering. EECS 595, 2006; P. 4.