User History Based Mail Filtering Process - IEEE Xplore

2013 International Symposium on Computational and Business Intelligence

User History based Mail Filtering Process

Ravi Kumar, Mayank Punetha, Hitesh Soni, Mahua Bhattacharya
ABV-Indian Institute of Information Technology and Management, Gwalior, India

Abstract—Email spam is a potential threat to an email user. In this paper we deal with spam that slips through filters into the client's mailbox (false negatives). We use the client's history to solve this problem. A particular message can be important for one client and unimportant for another. Apart from spam that has already been filtered, a user generally decides which type of message is spam by flagging it. Our approach reduces this effort: the client need not see all mails and manually flag them as spam, because the filtering system and the algorithms used in it separate out the junk, so that the client is left with only those mails that are useful to him or her. We propose using the part-of-speech tagging module of natural language processing together with the other algorithms discussed in this paper. This approach not only saves the client's time but also acts as a good mail filter.

Keywords-Enron Dataset; POS; Spam

I. INTRODUCTION

Spam means flooding the internet with many unwanted copies of a message. Most spam is sent for commercial purposes, and the most common form is email spam. An e-mail has two parts: the message and the header. The header consists of fields such as From, Date, Message-ID, In-Reply-To, To, Subject, Bcc, Cc, Content-Type, Precedence, References, Sender, Archived-At, Received, Return-Path, Authentication, Received-SPF, Auto-Submitted, VBR-Info and many others. The header therefore contains a lot of information that can help in deciding whether a mail is spam or not. We have used both the header and the message part of an e-mail to predict whether it is spam.

There are various ways of filtering spam, such as black listing, real-time black hole listing, white listing, grey listing, content-based filters, heuristic filters, Bayesian filters, challenge/response systems, collaborative filters, DNS lookup systems and many more. The filter assigns a score to every mail that the server receives. A threshold is set on this score, according to which a mail is considered to be spam; a higher score means the mail looks more like spam. Our approach also works by assigning a score and setting a threshold, but the strategy and techniques are different. We have performed end-user-based filtering. Our research aims to apply efficient and innovative algorithms for detecting spam and also for filtering out mails that are not important to a user, so as to save the user's time and avoid giving personal information to spammers. The focus of our research is to make the user's inbox free of spam and of unimportant junk.

The paper first describes the related work that we reviewed to survey past research in this area. This is followed by the proposed methodology, which contains the proposed solutions to the existing problems, the novel techniques, and an explanation of the algorithms used to implement the methodology, along with the research results. Finally, we summarize the work in the conclusions.

978-0-7695-5066-4/13 $26.00 © 2013 IEEE DOI 10.1109/ISCBI.2013.37

II. BASIC TERMS AND CONCEPTS

This section briefly describes the basic terms and concepts used in our research. For testing we have used the Enron dataset, which was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5 million messages. This data was made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation [5].

Natural language processing (NLP) is an area of artificial intelligence, computer science and linguistics concerned with the interactions between computers and human (natural) languages. In particular, it is the process of a computer extracting meaningful information from natural language input and/or producing natural language output. We have used part-of-speech (POS) tagging to perform our research. Given a sentence, part-of-speech tagging determines the part of speech of each word in the sentence. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun (e.g. “the book on the shelf”) or a verb (e.g. “to book a ticket to Gwalior”). Some languages have more such ambiguity than others. Natural languages with little inflectional morphology (the structure of words), such as English, are particularly prone to it. Hindi is prone to such ambiguity as well, since variation in articulation is not readily conveyed through the units of its orthography. We have used a Java POS tagger module for our research [4].

III. RELATED WORK

Figure 1: UserId Database.

Previous work on spam detection and mail filtering has focused on the detection of spam such as link spam, cloaking and content spam, on web-topology-based filtering, on link spam detection based on mass estimation and rank propagation, on probabilistic counting for link-based spam detection, and on email filtering based on the sender, sentence type, existence of a time expression, and subject of the mail contents. Web spam can degrade the quality of search engine results.

Pedram Hayati and Vidyasagar Potdar focused on the evaluation of spam detection and prevention frameworks for email and image spam. They analyzed the existing work in these two spam domains in order to understand the problem better [1]. Ronald Nussbaum, Abdol-Hossein Esfahanian and Pang-Ning Tan proposed two new methods of performing email prioritization on the basis of email history. They used header information to effectively prioritize incoming emails [2]. The part-of-speech tagger was originally written by Kristina Toutanova. Dan Klein, Christopher Manning and Yoram Singer worked with Kristina Toutanova and proposed feature-rich part-of-speech tagging with a cyclic dependency network [4].

Clotilde Lopes, Paulo Cortez, Pedro Sousa, Miguel Rocha and Miguel Rio suggested a method called symbiotic filtering, which aggregates distinct local filters from various users and combines them into one standard filter in order to separate the mails and improve performance relative to some predefined filters. This technique is completely end-user based and works on the mail server side [9]. Kobkiat Saraubon and Benchaphon Limthanmaphon proposed a novel approach of scanning the mail header. It works on a predefined database of analyzed mails from a test inbox and is up to 95% accurate in identifying the mails. This technique is used to categorize mails, including mail headers and image spam [10]. El-Sayed M. El-Alfy and Radwan E. Abdel-Aal proposed a GMDH (group method of data handling) based inductive learning approach that filters mails by automatically identifying spam and legitimate mails. It was shown to be 91.7% accurate in classifying mails [8].

Other than mail content, URLs are also associated with spam. A novel approach proposed by Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson and Dawn Song creates a real-time system that examines the URLs present in the content of a mail to detect whether it is spam or not [6]. Chi-Yao Tseng and Ming-Syan Chen designed a complete spam detection system, MailNET, based on an incremental support vector machine, which allows the proposed method to be updated whenever changes take place in the network [7].

IV. PROPOSED METHODOLOGY

Messages in a user's inbox fall into two categories: read (important mails) and unread (new or unimportant mails). Read mails are those that the user has already seen or is interested in; unread mails are those that have just arrived in the user's inbox or are unimportant to the user. First we build a database of the number of mails sent by different users along with their user ids, as can be seen in Fig. 1. We then store all the contents of the read mails, word by word, along with each word's part-of-speech type and word count, in a database (DATABASE1), as can be seen in Fig. 2. We do the part-of-speech tagging using our algorithm and the opennlp-tool.jar file [3]. For example, part-of-speech tagging of “Tiger is a good animal” gives:

Tiger_NNP is_VBZ a_DT good_JJ animal_NN

The POS tagging algorithm categorizes these words into their respective parts of speech, such as nouns, common nouns, pronouns, proper nouns, prepositions, verbs, adverbs and adjectives, and increases their count in DATABASE1 if the same word and type repeat. For each new incoming (unread) mail we do its POS tagging and store the result in another database (DATABASE2), as can be seen in Fig. 4. DATABASE1 is then compared with DATABASE2, after which the contents of DATABASE2 are deleted to make room for the contents of the next new mail. The comparison involves several steps that decide whether a mail is spam or not. These steps are described below.

Figure 2: DATABASE1.

We fetch the word count for a particular word from DATABASE1 (Count_B) and the word count for the same word from DATABASE2 (Count_S). We then calculate X as in (1), where num_mail is the number of read mails. The X values for all the words of a new mail are accumulated in P (initially P = 0), as in (2). Finally we calculate VALUE as in (3), where n is the total number of messages of a particular user given in Fig. 1.

$$X = \frac{|Count_B - Count_S|}{num\_mail} \tag{1}$$

$$P = \sum X \tag{2}$$

$$VALUE = \frac{P}{n} \tag{3}$$

Figure 3: Graph of Value versus Mail Number showing the fluctuation of the Value and the window of spam, where Mail Number is the mail sequence number.
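The scoring steps described above can be illustrated with a short Python sketch. It is not the implementation used in this work: the function names (parse_tagged, mail_value, classify) are hypothetical, the two databases are modeled as in-memory counters keyed by (word, tag), and a word of the new mail that is absent from DATABASE1 is assumed to have Count_B = 0. The window bounds are the ones this section reports for the Enron data; what to do with a VALUE exactly on a bound or above the upper bound is not stated in the paper, so the sketch labels those cases "unspecified".

```python
from collections import Counter

# Window bounds reported in the paper for the Enron data.
LOW, HIGH = 0.196203, 23.8846


def parse_tagged(tagged: str) -> Counter:
    """Count (word, tag) pairs in text formatted like 'Tiger_NNP is_VBZ'."""
    counts = Counter()
    for token in tagged.split():
        word, _, tag = token.rpartition("_")
        if word and tag:  # skip tokens that carry no tag
            counts[(word, tag)] += 1
    return counts


def mail_value(database1: Counter, database2: Counter,
               num_mail: int, n: int) -> float:
    """VALUE = P / n, where P sums X = |Count_B - Count_S| / num_mail."""
    p = 0.0
    for key, count_s in database2.items():
        count_b = database1.get(key, 0)  # assumption: unseen word counts as 0
        p += abs(count_b - count_s) / num_mail
    return p / n


def classify(value: float, low: float = LOW, high: float = HIGH) -> str:
    """Apply the threshold window to a VALUE."""
    if low < value < high:
        return "spam"
    if value < low:
        return "undetermined"
    return "unspecified"  # on a bound or above the window: not covered by the paper


# Toy history of two read mails and one new mail from a sender with n = 4.
database1 = parse_tagged("Tiger_NNP is_VBZ a_DT good_JJ animal_NN Tiger_NNP is_VBZ")
database2 = parse_tagged("Tiger_NNP is_VBZ big_JJ")
value = mail_value(database1, database2, num_mail=2, n=4)
print(value, classify(value))  # 0.375 spam
```

In the toy example each of the three words of the new mail contributes X = 0.5, so P = 1.5 and VALUE = 1.5 / 4 = 0.375, which falls inside the window.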

To decide whether a mail is spam or not we set a threshold window. The window was chosen on the basis of our experimental observations on the Enron dataset. If a new mail arrives whose word content is totally different from the user's history, we cannot say whether it is spam or not; for such mails we get a VALUE less than 0.196203. If VALUE is higher than 0.196203 and less than 23.8846, we can say with confidence that the mail is spam, as can be seen in Fig. 3.

V. RESULTS

The proposed methodology was implemented and tested on the Enron dataset. Testing was performed on 2.54 GB of inbox mails, and the success rate of detecting spam was 87.7%. On a sample of 1187 mails, 354 mails were identified as spam. Our algorithms were thus able to detect as spam 29.823% of the mails that were present in the user's inbox (false negatives).

VI. CONCLUSION

Spam and unimportant mails are becoming a threat to data, mails and confidential information, so better filtering techniques are needed. The methodology formulated for this research has shown a significant level of correctness and accuracy in our tests. The purpose of this research has been to show how the inbox can be made spam free for a user. This methodology can be implemented at a larger scale for further tests, in which all other spam detection techniques would have to be combined for better results. User-history-based mail filtering is a wide area for future research, and much new work can be carried out in this field.

Figure 4: DATABASE2.

REFERENCES

[1] Pedram Hayati and Vidyasagar Potdar, “Evaluation of spam detection and prevention frameworks for email and image spam - a state of art,” ACM iiWAS2008 Conf., November 24–26, 2008, Linz, Austria, pp. 520-527, doi 10.1145/1497308.1497402.
[2] Ronald Nussbaum, Abdol-Hossein Esfahanian, and Pang-Ning Tan, “History based Email Prioritization,” Advances in Social Network Analysis and Mining, IEEE Digital Object Identifier Conf., 2009, pp. 364-365, doi 10.1109/ASONAM.2009.44.
[3] The Stanford Natural Language Processing Group, “Stanford Log-linear Part-Of-Speech Tagger”. Retrieved from http://nlp.stanford.edu/software/tagger.shtml
[4] Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer, “Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network,” Proc. of HLT-NAACL 2003, pp. 252-259, doi 10.3115/1073445.1073478.
[5] CALO project, “Enron Email Dataset”. (2009, August 21). Retrieved from http://www.cs.cmu.edu/~enron/
[6] Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson and Dawn Song, “Design and Evaluation of a Real-Time URL Spam Filtering Service,” Proc. of IEEE Symposium on Security and Privacy, 2011, pp. 447-462, doi 10.1109/SP.2011.25.
[7] Chi-Yao Tseng and Ming-Syan Chen, “Incremental SVM Model for Spam Detection on Dynamic Email Social Networks,” 2009 International Conference on Computational Science and Engineering, IEEE, 29-31 Aug. 2009, pp. 128-135, doi 10.1109/CSE.2009.260.
[8] El-Sayed M. El-Alfy and Radwan E. Abdel-Aal, “Using GMDH-based networks for improved spam detection and email feature analysis,” Applied Soft Computing, vol. 11, issue 1, January 2011, pp. 447-488, doi 10.1016/j.asoc.2009.12.007.
[9] Clotilde Lopes, Paulo Cortez, Pedro Sousa, Miguel Rocha and Miguel Rio, “Symbiotic filtering for spam email detection,” Expert Systems with Applications, vol. 38, issue 8, August 2011, pp. 9365-9372, doi 10.1016/j.eswa.2011.01.174.
[10] Kobkiat Saraubon and Benchaphon Limthanmaphon, “Fast Effective Botnet Spam Detection,” Proc. of ICCIT 2009, pp. 1066-1070, doi 10.1109/ICCIT.2009.128.