Recent Researches in Applied Computer and Applied Computational Science

Spam Filtering using Signed and Trust Reputation Management

G. POONKUZHALI1, K. THIAGARAJAN2, P. SUDHAKAR3, R. KISHORE KUMAR4, K. SARUKESI5

1,4 Department of Computer Science and Engineering, Rajalakshmi Engineering College, Affiliated to Anna University, Chennai, Tamil Nadu
2 Department of Science and Humanities, KCG College of Technology, Affiliated to Anna University, Chennai, Tamil Nadu
3 Vernalissystems Pvt Ltd, Chennai 600116
5 Hindustan Institute of Technology and Science, Chennai, Tamil Nadu
INDIA
1 [email protected], [email protected], 3 [email protected], 4 [email protected], 5 [email protected]

Abstract: - In the e-mail revolution, most e-mail messages contain spam, which clogs up the inbox and is quite obnoxious. Managing a mailbox has therefore become a big task in the fast-moving e-world. In particular, when the user is linked with social networks, the user's inbox is filled with several kinds of spam e-mails, which leads to many problems. Detection of spam e-mails has become an important issue in the e-world. In this paper, a mathematical approach based on signed and trust reputation management is developed to restrict spam e-mails through the user's attitude towards a particular e-mail and the content relevancy of the e-mail. The results obtained by this approach are similar to the results obtained through the ID3 classifier.

Key-Words: - attitude, e-mail, rating, relevant, spam, trust.

1 Introduction
Due to the intensive use of the internet for sending messages, unsolicited commercial messages known as spam also creep into the inbox. These messages are harmful and can contain offensive content. Because of the low cost involved in sending e-mails, companies and individuals send bulk messages in the form of spam [5]. Content-based classification analyzes the contents and packets of an e-mail using Bayesian networks [11, 12] or pattern matching [13]. In the past, spam contained known strings or patterns that were of no use to the user. Unfortunately, the majority of e-mail clients now render Hypertext Markup Language (HTML) based e-mails, giving spammers many opportunities to fool the filters. Content-based filters require never-ending tuning and adjustment in order to keep up with the spammers' latest tricks. Another approach is DomainKeys Identified Mail (DKIM) [14], which associates a "responsible identity" with each e-mail, allowing the receiver to confirm the sender and origin of the e-mail. Unfortunately, this system does not prevent a bot from using the identities stored on a hijacked computer and sending e-mail through the domain's relays. It does, however, make it easier to identify the source of the e-mail. Adoption, as in many other cases, may prove to be the biggest hurdle for DKIM.

The work in this paper is directed towards handling such e-mail messages. We use a sign-based approach for classifying e-mails and directing them to the spam folder or the inbox of the user. This paper proposes five steps for managing e-mail: 1. Attitude analysis, 2. Pre-processing, 3. Relevancy analysis, 4. Post-processing, 5. Final decision making.

In the attitude analysis step, the filter compares the e-mail address of the sender with the contents of the receiver's address book and analyses the subject; based on this analysis, + and - signs are assigned to the e-mail. In the pre-processing phase, the content of the e-mail is checked. The pre-processing stage consists of stripping all HTML tags and stemming, followed by stop-word removal, in which words that add no meaning to the sentence are removed from the content. The final stage of pre-processing is feature extraction, in which the remaining content of the e-mail is tokenized. In the third phase of spam detection, relevancy analysis, the remaining content of the e-mail is compared with the contents of the positive dictionary and of the negative dictionary, and positive and negative counts are established from the comparison. In the post-processing phase, the positive and negative counts are compared. If the positive count is greater than the negative count, the remaining tokens of the e-mail are entered into the positive dictionary; otherwise they are entered into the negative dictionary. In the final decision stage, the e-mail is moved to the inbox or to spam based on the ratings.

ISBN: 978-960-474-281-3

Downsides of the existing system
1. The spam guards used in e-mail today are generalized according to the e-mail service provider.
2. An application-centric approach to spam guarding is followed rather than a user-centric one.
3. The customizable options of a spam guard are limited.
4. Classification of e-mails into normal and spam is given preference over management of e-mails as chosen by the user.

2 Architectural Design
Fig. 1 shows the architectural design of the proposed spam management system. When an e-mail arrives at the proposed system, it passes through four phases for spam detection: i) attitude analysis, ii) pre-processing of the received e-mail, iii) relevancy analysis, and iv) post-processing to make the final decision that separates normal from spam e-mail. In attitude analysis, the user's interest in the e-mail is assessed based on whether the sender's e-mail address is available in the user's address book. In the pre-processing phase, the e-mail contents are processed. In the relevancy analysis phase, the e-mail contents are analysed for relevancy; a domain dictionary of words is used in this phase. Based on the outcome of the previous phases, the final decision is made. The decision-making process recommends whether the received e-mail is to be placed in the inbox or in spam.

[Flowchart elements: Incoming e-mail; Receive in input buffer folder; Attitude Analysis Phase; Pre-processing Phase; Relevancy Analysis Phase, with Positive Dictionary and Negative Dictionary; Decision Making; outcomes Move to Inbox, Hold, Move to Spam]

Fig. 1 Architectural Design

2.1 Attitude Analysis Phase
Attitude analysis is performed on each incoming e-mail received by the user, based on two attributes: the sender's e-mail id and the subject. The action is determined by whether the sender's e-mail id exists and whether the subject of the e-mail is trusted; a weighted score of 0.25 is given to each attribute.

Table 1: Rating based on attitude analysis (ref. [3])

E-mail id    Subject        Rating   Action
Exist        Trusted        ++       A
Exist        Not Trusted    +-       B
Not Exist    Trusted        -+       C
Not Exist    Not Trusted    --       D
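The attitude rating of Table 1 can be illustrated with a minimal Python sketch. The names `attitude_rating`, `address_book` and `trusted_subjects` are hypothetical, and since the paper does not state how a subject is judged to be trusted, exact matching against a user-maintained set is assumed here.

```python
from dataclasses import dataclass

# Weight per attitude attribute, as stated in Section 2.1.
ATTRIBUTE_WEIGHT = 0.25

# Action labels A-D from Table 1, keyed by the two-character sign rating.
ACTIONS = {"++": "A", "+-": "B", "-+": "C", "--": "D"}


@dataclass(frozen=True)
class AttitudeRating:
    signs: str    # first sign: e-mail id, second sign: subject
    action: str   # Table 1 action label
    score: float  # 0.25 for each positive attribute


def attitude_rating(sender, subject, address_book, trusted_subjects):
    """Rate an e-mail by sender and subject.

    Assumption: `address_book` and `trusted_subjects` are sets of
    lower-case strings, and a subject is trusted only on an exact match.
    """
    id_exists = sender.strip().lower() in address_book
    subject_trusted = subject.strip().lower() in trusted_subjects
    signs = ("+" if id_exists else "-") + ("+" if subject_trusted else "-")
    score = ATTRIBUTE_WEIGHT * (id_exists + subject_trusted)
    return AttitudeRating(signs, ACTIONS[signs], score)


# Hypothetical user data for illustration only.
book = {"alice@example.com"}
trusted = {"project meeting"}
print(attitude_rating("alice@example.com", "Win a prize", book, trusted))
```

Keeping the sign pair and the numeric score together makes it easy to check a rating against both Table 1 and the weights used later in the final decision.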


2.2 Pre-processing
The content of the e-mail is extracted and pre-processed before proceeding to the next phase. The pre-processing phase transforms the extracted e-mail content into a structured form. In the case of text e-mails, stop-word removal and stemming of the words are carried out.

2.3 Relevancy Analysis Phase
This phase verifies the content (body) of the e-mail to confirm its relevancy to the context preferred by the user. Each word in the pre-processed content is compared with the positive dictionary to examine the relevancy of the content. If the words in the content of the e-mail are present in the positive dictionary, the process proceeds to the next stage. Otherwise, the words are compared with the negative dictionary to check whether the content contains any spam-prone words. If more than 50% of the word content is relevant, a + sign (positive rating) is assigned with a weighted value of 0.5; otherwise, a - sign (negative rating) is assigned with a weighted value of 0 for irrelevant content.

Table 2. Decision made based on Attitude and Relevancy rating

Attitude   Relevancy   Weighted value   Decision made
++         +           1.0              Move to inbox
+-         +           0.75             Move to inbox
++         -           0.50             Hold
+-         -           0.25             Move to spam
-+         +           0.75             Move to inbox
--         +           0.50             Hold
-+         -           0.25             Move to spam
--         -           0.0              Move to spam
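The pre-processing and relevancy checks of Sections 2.2 and 2.3 can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list is a tiny sample, `naive_stem` is a crude suffix stripper standing in for a real stemmer, and "more than 50% relevant" is read as the fraction of remaining tokens found in the positive dictionary. All function and variable names are hypothetical.

```python
import re

# Tiny illustrative stop-word list; a real filter would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "for", "you", "your"}


def naive_stem(word):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(body):
    """Remove HTML tags, tokenize, drop stop words and stem what remains."""
    text = re.sub(r"<[^>]+>", " ", body)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def relevancy_rating(tokens, positive_dict, negative_dict):
    """Return (sign, weight, positive_count, negative_count) for a token list."""
    positive = sum(1 for t in tokens if t in positive_dict)
    # Only words missing from the positive dictionary are checked
    # against the negative dictionary, as described in Section 2.3.
    negative = sum(1 for t in tokens if t not in positive_dict and t in negative_dict)
    relevant = bool(tokens) and positive / len(tokens) > 0.5
    if relevant:
        return ("+", 0.5, positive, negative)
    return ("-", 0.0, positive, negative)


# Hypothetical dictionaries holding stemmed root words.
positive_words = {"meet", "agenda", "budget"}
negative_words = {"prize", "free"}
tokens = preprocess("<p>The meeting agenda and budget are attached</p>")
print(tokens, relevancy_rating(tokens, positive_words, negative_words))
```

An empty token list is rated negative here, which is a design choice of this sketch rather than something the paper specifies.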

2.4 Post Processing and Final decision making
The final decision is made based on the weighted score of the attributes of both the attitude analysis phase and the relevancy analysis phase. The attitude analysis phase carries a total weightage of 0.5, that is, 0.25 when the e-mail id exists and 0.25 when the subject is trusted. Similarly, the relevancy analysis phase carries a weightage of 0.5 for relevant content. If the weighted value is greater than 0.5, the e-mail is moved to the inbox and the pre-processed root words that do not already exist there are added to the positive dictionary. If the weighted value is less than 0.5, the e-mail is moved to spam and the pre-processed root words that do not already exist there are added to the negative dictionary. If the weighted value is equal to 0.5, the e-mail is held. The number of normal e-mails classified as spam, and the reverse, will be significantly reduced since there are two levels of validation of an e-mail in the system. Also, the user can classify spam and ham e-mail according to personal interest in a particular e-mail rather than relying on a generalized spam filter.

3 Experimental Results Verification
The ID3 algorithm is used to verify the decisions made based on attitude analysis and relevancy analysis. The results imply that most spam e-mails contain irrelevant content and a subject that is not trusted. The result produced by the proposed algorithm is the same as the result obtained by the ID3 algorithm.
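The weighted decision rule of Section 2.4 is simple enough to sketch directly. The function names below are hypothetical, and the dictionaries are modelled as plain Python sets, which is an assumption of this sketch.

```python
def final_decision(id_exists, subject_trusted, content_relevant):
    """Combine the weights: 0.25 + 0.25 for attitude, 0.5 for relevancy."""
    score = 0.25 * id_exists + 0.25 * subject_trusted + 0.5 * content_relevant
    if score > 0.5:
        return score, "inbox"
    if score < 0.5:
        return score, "spam"
    return score, "hold"


def update_dictionaries(decision, root_words, positive_dict, negative_dict):
    """Add root words to the dictionary that matches the decision.

    Sets ignore words that are already present, which mirrors the rule
    that only words not already in the dictionary are added.
    """
    if decision == "inbox":
        positive_dict.update(root_words)
    elif decision == "spam":
        negative_dict.update(root_words)
    # Held e-mails leave both dictionaries untouched.


# Print every combination of the three attributes with its score and decision.
for id_exists in (True, False):
    for subject_trusted in (True, False):
        for content_relevant in (True, False):
            print(id_exists, subject_trusted, content_relevant,
                  final_decision(id_exists, subject_trusted, content_relevant))
```

Because the three weights are exact binary fractions, the comparison with 0.5 is not affected by floating-point rounding in this sketch.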

Table 3: Dataset for classifying SPAM e-mail

Email ID     Subject        Content      Result
Exist        Trusted        Relevant     Inbox
Exist        Not Trusted    Relevant     Inbox
Exist        Trusted        Irrelevant   Hold
Exist        Not Trusted    Irrelevant   Spam
Not Exist    Trusted        Relevant     Inbox
Not Exist    Not Trusted    Relevant     Hold
Not Exist    Trusted        Irrelevant   Spam
Not Exist    Not Trusted    Irrelevant   Spam
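To illustrate how ID3 would treat the dataset of Table 3, the sketch below computes the information gain of each attribute at the root of the tree. The paper does not describe its ID3 setup, so this is only the standard entropy-based root split computed on the eight rows above; the names `DATASET`, `entropy` and `information_gain` are hypothetical.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ("Email ID", "Subject", "Content")

# Rows of Table 3: (Email ID, Subject, Content, Result).
DATASET = [
    ("Exist", "Trusted", "Relevant", "Inbox"),
    ("Exist", "Not Trusted", "Relevant", "Inbox"),
    ("Exist", "Trusted", "Irrelevant", "Hold"),
    ("Exist", "Not Trusted", "Irrelevant", "Spam"),
    ("Not Exist", "Trusted", "Relevant", "Inbox"),
    ("Not Exist", "Not Trusted", "Relevant", "Hold"),
    ("Not Exist", "Trusted", "Irrelevant", "Spam"),
    ("Not Exist", "Not Trusted", "Irrelevant", "Spam"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of the results minus the weighted entropy after splitting on a column."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[column] for row in rows}:
        subset = [row[-1] for row in rows if row[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(DATASET, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("Attribute with the highest gain:", best)
```

On these eight rows the gain for Content is about 0.75 bits, against about 0.061 bits for each of Email ID and Subject, so a tree built this way would split on Content first. This is consistent with content relevancy carrying half of the total weight in Section 2.4.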

Fig. 3. Decision Tree

4 Conclusion
This paper proposes a signed approach for e-mail classification. Two approaches, attitude analysis and relevancy analysis, are used to classify an e-mail. The user can also tag an e-mail as spam, which is taken into account in the pre-processing step. The paper thus proposes a self-learning process for the classification of e-mail.

Acknowledgment
The authors would like to thank Dr. Ponnammal Natarajan, former Director (Research), Anna University, Chennai, India, and currently Advisor (Research and Development), Rajalakshmi Engineering College, for her cognitive ideas and dynamic discussions with respect to the paper's contribution.

References:
[1] Bogdan Hoanca, How Good Are our Weapons in the Spam Wars?, Vol. 25, No. 1, Spring 2006.
[2] H. Brett Watson, Beyond Identity: Addressing Problems that Persist in an Electronic Email System with Reliable Sender Identification, Second International Conference on Email and Anti-Spam, IEEE & IACR, 2005.
[3] K. Thiagarajan, A. Raghunathan, Ponnamal Natarajan, G. Poonkuzhali and Prashant Ranjan, Weighted Graph Approach for Trust Reputation Managements, International Conference on Intelligent Systems and Technologies, published in Proc. of World Academy of Science and Technology, Vol. 56, pp. 830-836, 2009.
[4] Ryota Matsumoto, Du Zhang and Meiliu Lu, Some Empirical Results on Spam Detection Methods, IEEE Trans on Spam Deduction, April 2004.
[5] Spam and Social Technical Gap, IEEE Computer, Vol. 37, Oct. 2004.
[6] Zoltán Gyöngyi and Hector Garcia-Molina, Web Spam Taxonomy, Proc. First International Workshop on Adversarial Information Retrieval on the Web (at the 14th International World Wide Web Conference), 2005.
[7] Dennis Fetterly, Mark Manasse and Marc Najork, Spam, Damn Spam, and Statistics, Proc. of the Seventh International Workshop on the Web and Databases (WebDB 2004), Paris, France, 2004.
[8] S. Kamvar, M. Schlosser and H. Garcia-Molina, The EigenTrust Algorithm for Reputation Management in P2P Networks, Proc. of the Twelfth International World Wide Web Conference, 2003.
[9] BadRank. Online at: http://pr.efactory.de/epr0.shtml
[10] E. Sit and R. Morris, Security Considerations for Peer-to-Peer Distributed Hash Tables, International Workshop on Peer-to-Peer Systems, Vol. 2429 of Lecture Notes in Computer Science, 2002.
[11] P. Graham, A Plan for Spam, reprinted in Paul Graham, Hackers and Painters: Big Ideas from the Computer Age, O'Reilly, 2004.
[12] M. Sahami, S. Dumais, D. Heckerman and E. Horvitz, A Bayesian Approach to Filtering Junk E-mail, Workshop on Learning for Text Categorization, AAAI, 1998.
[13] T. Showalter, RFC 3028 - Sieve: A Mail Filtering Language, http://tools.ietf.org/html/rfc3028, 2001.
[14] E. Allman, J. Callas, M. Delany and M. Libbey, DomainKeys Identified Mail (DKIM) Signatures, http://www.ietf.org/internet-drafts/draft-ietf-dkim-base-10.txt, 2007.

G. Poonkuzhali received the B.E. degree in Computer Science and Engineering from the University of Madras, Chennai, India, in 1998, and the M.E. degree in Computer Science and Engineering from Sathyabama University, Chennai, India, in 2005. She is currently pursuing a Ph.D. programme in the Department of Information and Communication Engineering at Anna University, Chennai, India. She has presented and published 10 research papers in international conferences and journals and has authored 5 books. She is a life member of ISTE (Indian Society for Technical Education), IAENG (International Association of Engineers), and CSI (Computer Society of India).

R. Kishore Kumar is currently an undergraduate student at Rajalakshmi Engineering College. He has presented 5 papers in conferences and published 4 research papers in international journals and 3 papers in national journals. One of his papers was selected as the Best Paper. He is also a member of the Computer Society of India.

Dr. K. Sarukesi has had a distinguished career spanning nearly 40 years. He has vast teaching experience in various universities in India and abroad. He was awarded a Commonwealth Scholarship by the Association of Commonwealth Universities, London, for doing his Ph.D. in the UK. He completed his Ph.D. at the University of Warwick, U.K., in 1982. His area of specialization is Technological Information System. He has worked as an expert in various foreign universities and has executed a number of consultancy projects. He has been honored and awarded commendations for his work in the field of information technology by the Government of Tamil Nadu. He has published more than 80 research papers in international and national conferences and journals.

K. Thiagarajan is working as Senior Lecturer in the Department of Mathematics at KCG College of Technology, Chennai, India. He has 14 years of teaching experience. He has attended and presented research articles at 33 national and international conferences and has published one paper in a national journal and 26 papers in international journals. He is currently working on web mining through automata and set theory. His areas of specialization are coloring of graphs and DNA computing.

P. Sudhakar received the Bachelor of Engineering degree in Computer Science from Anna University, Chennai, India, in 2006 and the Master of Engineering degree in Computer Science from Anna University, Chennai, India, in 2008. He started his career as a Junior Software Programmer at Vernalis Systems Pvt Ltd, Chennai, India, in 2008 and was elevated to Associate software. He has also presented various papers in national-level conferences and published his research work in international journals.
