AWERProcedia Information Technology & Computer Science 00 (2013) 000-000
3rd World Conference on Innovation and Computer Sciences 2013
Comparing Ensemble Classifiers: Forensic Analysis of Electronic Mails

Ekin Ekinci a *, Hidayet Takçı b

a Kocaeli University, Department of Computer Engineering, Kocaeli 41380, Turkey
b Cumhuriyet University, Department of Computer Engineering, Sivas 58140, Turkey
Abstract

As a result of its powerful features and the ongoing improvement of the internet, electronic mail has become one of today's most important communication tools. Electronic mail, which offers many conveniences to its users, is also an attractive medium for criminals. Since malicious electronic mail whose actual owner is unknown plays a part in computer crimes, authorship identification has become necessary for determining its actual owner. This study describes an application for the authorship identification of electronic mails in terms of forensic science; the application can improve security and can find the actual owner of an electronic mail based on the message body. Bagging and AdaBoostM1, ensemble classifiers in machine learning with good generalization performance, processed 49 textual features extracted from a dataset containing 250 electronic mails from 5 writers. According to F-measure, Bagging provided better generalization performance than AdaBoostM1.

Keywords: forensic science, authorship identification, emails, ensemble classifiers

Selection and/or peer review under responsibility of Prof. Dr. Dogan Ibrahim.
©2013 Academic World Education & Research Center. All rights reserved.
1. Introduction

Crime is a social nuisance that affects society economically, psychologically and in terms of living standards, and it is an act considered dangerous by the majority of society [1]. Crime, generally classified into traffic violations, theft, fraud, smuggling, arson, drug offenses, violent crime and cybercrime, benefits from forensic analysis for its solution [2]. Forensic analysis, as a process within forensic science, involves gathering and preserving evidence and transferring it to the relevant departments. The first known practice of forensic science was carried out by the Chinese in the 700s, who used fingerprints to identify the owners of documents and sculptures [3]. Forensic science has branches such as ballistics, criminology and computer forensics, and Hancı [4] (p 60) described it as "using information about medicine, sciences and social science for justice". Since the 1980s there have been rapid advances in information technology; hence computers and their applications have become very important for individuals and institutions, especially with internet usage.
* Ekin Ekinci, Kocaeli University, Department of Computer Engineering, Kocaeli 41380, Turkey.
E-mail address: [email protected] / Tel.: +90-262-303-3562
As a result of these advances, computer crimes have emerged as a kind of virtual crime. Parker [5] (p 25) defined computer crime as "any violations of criminal law that involve knowledge of computer technology for their perpetration, investigation or prosecution". Computer forensics, which consists of obtaining, identifying and analysing evidence, emerged for the purpose of solving computer crimes. Even though the internet enables information sharing and interaction between computer users through its worldwide broadcasting capacity regardless of distance, it also plays an important role in the commission of computer crimes [6].

Email, which provides written communication over computers, has become a very important communication tool thanks to the powerful features and ongoing improvement of the internet. Email, which is economical and practical, is used in education, industry, health and commerce, so it is also an attractive medium for criminals. Nowadays legitimate email users face problems such as junk mail, threatening mail, fraud mail, viruses, unauthorised conveying of important information, pirated software and works, and banned propaganda. In addition, using a legitimate user's email account and changing the mail body while a message is being sent are examples of email misuse [7]. Because antivirus programs, firewalls, password protection and retroactive IP searching are not enough to prevent the situations mentioned above, using authorship identification in computer forensics has become necessary.

Authorship identification, as a procedure in computer forensics, uses classification techniques from machine learning to process textual features extracted from authors' texts with the aim of drawing conclusions about these authors [8]. Textual features extracted from authors' texts to identify their author include functional words, errors, word and sentence counts and so on. Machine learning, which uses available data to make inferences about other data, is used for processing textual features in authorship identification [9]. The first known study on authorship identification was made in 1887 by Mendenhall, who used character-length counts as textual features [10]. This work was followed by the statistical studies of Zipf (1932) and Yule (1939 and 1944) [11]. For the authorship identification of the Federalist Papers in 1963, Mosteller and Wallace used the frequencies of functional words as textual features and Bayes' theorem as the classification technique [12]. This study has been accepted as a milestone in authorship identification. As the use of electronic texts (email, newsgroup and forum messages and so on) has become widespread through internet media, applying authorship identification to these texts has become necessary. In 2000, Oliver De Vel processed 38 textual features extracted from emails with a Support Vector Machine (SVM) for 5 writers [13]. For Arabic and English forum messages, SVM and Decision Tree were used and an average accuracy of 97% was achieved [8]. In 2012, 43 textual features extracted from Turkish newsgroup messages were processed with SVM, Artificial Neural Network (ANN) and Decision Tree, and the Decision Tree method was observed to be the most successful, with an average F-measure of 83% on the available dataset [14].
Currently, single classifiers such as SVM and ANN are used in authorship identification, but these single classifiers are considered weak, so they do not always yield high accuracy on the available dataset. For this reason, ensemble classifiers, which take a single base classifier, apply it to different versions of the dataset to generate multiple classification models and combine these models into a strong classification model, are preferred for their high accuracy and good generalization performance [15]. In this study a sample application was developed which can improve the security of emails and can find the actual owner of an email, rather than its apparent owner, based on the message body. A dataset of 250 messages from 5 writers was used with the aim of identifying the authors of these electronic mails in terms of forensic science. The 49 textual features extracted from the dataset for identifying the author were processed with the Bagging and AdaBoostM1 algorithms, both of which take J48, Naive Bayes, Multilayer Perceptron (MLP) and SMO as base classifiers. Based on resampling and reweighting of the training set respectively, Bagging and AdaBoostM1, as ensemble classifiers in machine learning, have better generalization performance than a single classifier. The results were
evaluated according to F-measure using 10-fold cross-validation. Considering all results, Bagging provided better generalization performance than AdaBoostM1.

2. Ensemble Classifiers

Motivated by the observation that "using a single classifier is not enough to represent the whole problem space", ensemble classifiers generate different versions of the training set by resampling or reweighting and apply a single base classifier (SVM, ANN and so on) to these versions in order to generate multiple diverse classification models [16]. This single base classifier is weak; in fact, ensemble classifiers are intended to produce a strong classifier from this weak classifier. In ensemble classifiers, if the base classifier makes a mistake on one version of the training set, this mistake does not affect the whole system and can be compensated for by the other classification models. With these properties, ensemble classifiers provide higher accuracy, better generalization performance and more robustness than single classifiers. For this study the Bagging and AdaBoostM1 algorithms were preferred as ensemble classifiers.

2.1. Bagging

Bagging was first introduced by Breiman in 1996 to improve accuracy by resampling the training set [17]. In Bagging, resampling is performed by selecting sets of instances from the original training set; these selected instances constitute new training sets called bootstrap samples. Bagging is easy to implement, provides high accuracy and good generalization performance, and reduces variance and bias. Using multiple training sets instead of a single one prevents results that depend on a particular training set [18]; with this feature, Bagging provides variance reduction. After the new training sets are constituted, a weak base classifier is applied to them in parallel, and the generated classification models are combined by majority voting [19]. In the end a final classifier model, which provides bias reduction, is obtained.

2.2. AdaBoostM1

The AdaBoostM1 algorithm, which can handle multiclass problems, is a successor of the AdaBoost algorithm introduced in 1997, which provides binary classification. Based on reweighting the training data, AdaBoostM1 generates a set of classification models by using a base classifier. The algorithm is iterated a certain number of times: the weights of misclassified instances are increased while those of correctly classified instances are decreased, and an instance's weight shows how important that instance is. At first all instances have the same weight, and after every iteration these weights are changed according to the rule above. In AdaBoostM1 the classification models are generated serially, one per iteration [19]. In the end, the generated classification models are combined by weighted majority voting and the final classifier model is obtained.
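To make the bootstrap-and-vote idea behind Bagging concrete, the following minimal Python sketch trains copies of a weak base classifier on bootstrap samples and combines them by majority voting. It is an illustration only, not the implementation used in this study; the dummy data, the scikit-learn DecisionTreeClassifier base learner and the helper names bagging_fit and bagging_predict are assumptions of the sketch, and class labels are assumed to be small non-negative integers.

```python
# Conceptual sketch of Bagging: train copies of a weak base classifier on
# bootstrap samples of the training set, then combine them by majority vote.
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(base, X, y, n_models=10, seed=0):
    """Fit n_models copies of `base`, each on a bootstrap sample of (X, y)."""
    rng = np.random.RandomState(seed)
    n = len(y)
    models = []
    for _ in range(n_models):
        idx = rng.randint(0, n, size=n)                 # sample with replacement
        models.append(clone(base).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the individual models' predictions."""
    votes = np.array([m.predict(X) for m in models])    # shape: (n_models, n_instances)
    return np.array([np.bincount(col).argmax() for col in votes.T])

# Dummy stand-ins for the 49-feature representation of the messages (assumption).
rng = np.random.RandomState(0)
X_train = rng.rand(200, 49)
y_train = rng.randint(0, 5, size=200)                   # 5 candidate authors, labels 0..4

models = bagging_fit(DecisionTreeClassifier(max_depth=3, random_state=0), X_train, y_train)
print(bagging_predict(models, X_train[:5]))

# AdaBoostM1 differs in that its models are built serially on *reweighted*
# training data and are combined by weighted (not plain) majority voting.
```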
3. Email Dataset and Textual Features

Emails are personal, so forming a dataset from emails is very hard in terms of privacy and security. For this reason a dataset that is public and shares the same characteristics as emails is needed, and newsgroup messages, which have these features, have started to be used as datasets in such studies [13, 14]. The newsgroup is a public internet medium and provides an exchange of ideas within specific groups. A newsgroup message consists of two main parts, a header and a body: the header, which is a structured field, contains the subject, from, to and other information about the message, while the body is the unstructured part of the message. All of this shows that a newsgroup message is the same as an email except for privacy, so newsgroup messages were preferred as the dataset in this study.

The dataset, constituted from newsgroup messages, was obtained from www.newskolik.net. It contains 250 messages from 5 writers on different topics. The obtained messages were then preprocessed and textual features were extracted from them. In this study the message body is used for authorship identification, so the preprocessing step was applied as follows: the structured header fields (from, to and other information about the message) were eliminated; then excerpts, links, email addresses and phone numbers were removed from the message body, after which textual feature extraction started.
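As an illustration of this cleaning step, the short Python sketch below strips quoted excerpts, links, email addresses and phone numbers from a message body. The regular expressions and the clean_body helper are assumptions of the sketch, not the exact preprocessing used in the study.

```python
# Illustrative body-cleaning step: drop quoted excerpts and mask links,
# e-mail addresses and phone numbers before feature extraction.
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE_RE = re.compile(r"\+?\d[\d\s()\-]{7,}\d")

def clean_body(body: str) -> str:
    # Lines starting with ">" are treated as quoted excerpts and removed.
    lines = [ln for ln in body.splitlines() if not ln.lstrip().startswith(">")]
    text = "\n".join(lines)
    for pattern in (URL_RE, EMAIL_RE, PHONE_RE):
        text = pattern.sub(" ", text)
    return text

print(clean_body("> quoted excerpt\nBana www.example.com adresinden veya +90 262 303 3562 numarasından ulaşın."))
```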
Textual features, which give important clues about the author, reflect the writing habits of authors, and each author has a "textual fingerprint" [20]. As shown in Table 1, 49 textual features were extracted from the dataset in this study.

Table 1. Textual Features

Feature No   Feature                      Feature No   Feature
1            Character Count              14           Average Sentence Length
2            Letter Count                 15           Salutation
3            Digit Count                  16           Signature
4            Punctuation Count            17           Inverted Sentence
5            Capital Letter Count         18           Reduplication
6            Lower Case Letter Count      19           Vocabulary Richness
7            Facial Expressions           20           Hapax Legomena
8-10         Errors                       21           Hapax Dislegomena
11           Word Count                   22-41        Functional Words
12           Sentence Count               42           Abbreviation
13           Average Word Length          43-49        Types of Words
Primarily, the first seven features were extracted from the messages, because their extraction does not require an extra preprocessing step. Messages in electronic media frequently include facial expressions (emoticons), so facial expressions were used as textual features in this study. Syntax errors, punctuation errors and formatting errors (e.g. all-caps words) are inevitable in messages and give important clues about their writers; these errors were extracted from the messages manually. Then features 11 to 14 were extracted. Salutations and signatures are writing habits and give important information about writers, as do inverted sentences and reduplication. After these, an additional preprocessing step was required for features 19 to 49, and Zemberek, an open-source natural language processing library for Turkish, was used. This library provides spell checking, determination of word type and so on [21]; in this study it was used for finding the roots and stems of words. After applying it, vocabulary richness, which is the ratio of the number of different words to all words in the message, hapax legomena, which is the count of words that occur only once in the message, and hapax dislegomena, which is the count of words that occur exactly twice in the message, were calculated. Functional words, such as pronouns, adverbs and particles, are essential for constituting texts; in this study 20 functional words that occur 75 or more times in the dataset were selected. Abbreviations are used quite a lot in messages, so they were also used as textual features. Lastly, the types of words (adjectives, adverbs, pronouns and so on) were extracted from the dataset. Once the textual features had been extracted, classification algorithms were applied to them.
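To make a few of these lexical features concrete, the sketch below computes word count, average word length, vocabulary richness, hapax legomena and hapax dislegomena from a list of word stems. It is a minimal illustration assuming the message body has already been cleaned and reduced to stems (e.g. with Zemberek); the lexical_features helper and the sample input are hypothetical.

```python
# Illustrative computation of a few lexical features from Table 1.
# `stems` is assumed to be the list of word stems of one cleaned message body.
from collections import Counter

def lexical_features(stems):
    n = len(stems)
    counts = Counter(stems)
    return {
        "word_count": n,
        "avg_word_length": sum(len(w) for w in stems) / n if n else 0.0,
        "vocabulary_richness": len(counts) / n if n else 0.0,            # distinct words / all words
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),     # words occurring once
        "hapax_dislegomena": sum(1 for c in counts.values() if c == 2),  # words occurring twice
    }

print(lexical_features(["adli", "analiz", "posta", "adli", "yazar"]))
```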
4. Application

For the purpose of processing the extracted textual features, the Bagging and AdaBoostM1 algorithms were used. Bagging and AdaBoostM1 achieve high accuracy and have better generalization performance than a single classifier, so they were preferred for this study. Both take J48 decision tree, Naive Bayes, MLP and SMO as base classifiers. J48, one of the most widely used machine learning algorithms, builds a tree-structured model and works by divide and conquer [22]. Naive Bayes presents a probability-based technique for solving the classification problem and is very fast in authorship identification [23]. MLP provides high accuracy for solving nonlinear classification problems and has good generalization capacity [24]. SMO, which is easy to implement and can classify high-dimensional data very quickly, is used in real-time problems [25]. Bagging and AdaBoostM1 were applied to the dataset using J48 decision tree, Naive Bayes, MLP and SMO as base classifiers. These algorithms were applied with 10-fold cross-validation and classification models were obtained. For the performance evaluation of these models, the confusion matrix shown in Table 2 was used.

Table 2. Confusion Matrix

                              PREDICTED CLASS
                              Class=1        Class=0
ACTUAL CLASS     Class=1      a (TP)         b (FN)
                 Class=0      c (FP)         d (TN)
To evaluate the classification models' performance, accuracy, precision, recall and F-measure were calculated. Accuracy is the ratio of correctly classified instances (a+d) to all instances (a+b+c+d). Precision (p) is the ratio of the instances classified as positive that are actually positive (a) to all instances classified as positive (a+c). Recall (r) is the ratio of the instances classified as positive that are actually positive (a) to all instances that are actually positive (a+b). F-measure is the harmonic mean of precision and recall (2*r*p/(r+p)).
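These measures follow directly from the confusion matrix counts in Table 2; a small illustrative helper is shown below (the prf function name and the sample counts are assumptions of the sketch).

```python
# Accuracy, precision, recall and F-measure from confusion matrix counts,
# with a = TP, b = FN, c = FP, d = TN as in Table 2.
def prf(a, b, c, d):
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + c) if (a + c) else 0.0
    recall = a / (a + b) if (a + b) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return accuracy, precision, recall, f_measure

print(prf(a=40, b=10, c=5, d=195))  # hypothetical counts for one author class
```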
The obtained results are shown in Tables 3, 4 and 5. Considering all results, Bagging provided better generalization performance than AdaBoostM1 according to F-measure.

Table 3. Performance Measures for Base Classifiers

Performance Measures    J48      Naive Bayes    MLP      SMO
Accuracy                0.784    0.788          0.824    0.812
Precision               0.786    0.790          0.825    0.817
Recall                  0.784    0.788          0.824    0.812
F-Measure               0.784    0.788          0.824    0.814
Table 4. Performance Measures for Bagging

Performance Measures    Bagging J48    Bagging Naive Bayes    Bagging MLP    Bagging SMO
Accuracy                0.812          0.772                  0.840          0.860
Precision               0.815          0.773                  0.841          0.863
Recall                  0.812          0.772                  0.840          0.860
F-Measure               0.813          0.772                  0.840          0.861
Table 5. Performance Measures for AdaBoostM1

Performance Measures    AdaBoostM1 J48    AdaBoostM1 Naive Bayes    AdaBoostM1 MLP    AdaBoostM1 SMO
Accuracy                0.848             0.776                     0.824             0.812
Precision               0.851             0.776                     0.825             0.817
Recall                  0.848             0.776                     0.824             0.812
F-Measure               0.848             0.775                     0.824             0.812
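The classifier names (J48, SMO, AdaBoostM1) suggest a Weka-style setup; purely as an illustration of the evaluation protocol, not the implementation used in the study, the sketch below runs Bagging over scikit-learn analogues of the four base classifiers (DecisionTreeClassifier for J48, GaussianNB for Naive Bayes, MLPClassifier for MLP, a linear SVC for SMO) with 10-fold cross-validation and macro-averaged F-measure, plus a boosted decision stump as an AdaBoostM1-like counterpart. The dummy data, the analogue choices and the macro averaging are assumptions of this sketch, and scikit-learn 1.2 or newer is assumed for the estimator parameter name.

```python
# Illustrative re-creation of the evaluation protocol with scikit-learn analogues.
import numpy as np
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(250, 49)                 # placeholder for the 250 x 49 feature matrix
y = rng.randint(0, 5, size=250)       # placeholder labels for the 5 authors

bases = {
    "J48-like tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "MLP": MLPClassifier(max_iter=1000, random_state=0),
    "SMO-like SVM": SVC(kernel="linear"),
}

# Bagging over each base classifier, scored by 10-fold cross-validated macro F-measure.
for name, base in bases.items():
    model = BaggingClassifier(estimator=base, n_estimators=10, random_state=0)
    scores = cross_val_score(model, X, y, cv=10, scoring="f1_macro")
    print(f"Bagging + {name}: mean F-measure = {scores.mean():.3f}")

# Boosting needs a base learner that accepts instance weights; a decision stump is used here.
boost = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1, random_state=0),
                           n_estimators=10, random_state=0)
scores = cross_val_score(boost, X, y, cv=10, scoring="f1_macro")
print(f"AdaBoost + decision stump: mean F-measure = {scores.mean():.3f}")
```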
5. Conclusion

For the purpose of identifying the authors of emails in terms of forensic science, an application was developed which can improve the security of emails and can find the actual owner of an email, rather than its apparent owner, based on the message body. Bagging and AdaBoostM1, which are ensemble classifiers in machine learning, were applied to process the textual features extracted from the Turkish newsgroup dataset; classification performance was evaluated according to F-measure, and Bagging achieved better generalization performance than AdaBoostM1. Differences between classifier performances arise from the dataset, the applied preprocessing step, the extracted textual features and the parameters of the algorithms. The most important conclusion of this study is that authorship identification is a very powerful technique for determining criminals and can be useful for scientists working in forensic science.

Acknowledgements

I would like to thank Associate Professor Doctor Yusuf Sinan Akgül for his support and guidance for this study.
References

[1] Dönmezer, S. Criminology [Kriminoloji], 7th edition, Filiz Kitabevi, İstanbul, 1984, p 60.
[2] Chen, H., Chung, W., Xu, J. J., Wang, G., Qin, Y. & Chau, M. Computer, Crime Data Mining: A General Framework and Some Examples, 2004, 37 (4), pp 50-56.
[3] Rudin, N. & Inman, K. Principles and Practice of Forensic Science: The Profession of Forensic Science. CRC Press, Boca Raton, 2001, pp 329-341.
[4] Hancı, H. İ. Adli Bilişim Çalıştayı Sunumu, Adli Bilişim Bilimi ve Diğer Bilimlerle Olan İlişkisi, İzmir, 2005.
[5] Parker, D. B. Computer Crime Criminal Justice Resource Manual, 2nd edition. David Assoc, Washington D.C., 1989, p 25.
[6] Leiner, B. M., Cerf, V. G., Clark, D. D., Kahn, R. E., Kleinrock, L., Lynch, D. C., Postel, J., Roberts, L. G. & Wolff, S. S. Communications of the ACM, The Past and Future History of the Internet, 1997, 40 (2), pp 102-108.
[7] Özmutlu, H. C. & Özmutlu, S. 1. Polis Bilişim Sempozyumu, Bilgisayar Ağları Aracılığı ile Gerçekleştirilebilecek Suçlar ve Yaşanan Sorunlar, Ankara, 2003, pp 84-87.
[8] Abbasi, A. & Chen, H. IEEE Intelligent Systems, Applying Authorship Analysis to Extremist-Group Web Forum Messages, 2005, 20 (5), pp 67-75.
[9] Kumar, D. & Bhardwaj, D. International Journal of Computer Science Issues (IJCSI), Rise of Data Mining: Current and Future Application Areas, 2011, 8 (5), pp 256-260.
[10] Mendenhall, T. C. American Association for the Advancement of Science, Characteristic Curves of Composition, 1887, 9 (214), pp 237-246.
[11] Stamatatos, E. Journal of the American Society for Information Science and Technology, A Survey of Modern Authorship Attribution Methods, 2009, 60 (3), pp 538-556.
[12] Mosteller, F. & Wallace, D. L. Journal of the American Statistical Association, Inference in an Authorship Problem, 1963, 58 (302), pp 275-309.
[13] De Vel, O. ACM International Conference on Knowledge Discovery and Data Mining (KDD'2000), Mining E-Mail Authorship, Boston, 2000.
[14] Ekinci, E. & Takçı, H. 20. Sinyal İşleme ve İletişim Uygulamaları Kurultayı, Using Authorship Analysis Techniques in Forensic Analysis of Electronic Mails, Muğla, 2012.
[15] Liu, K. H. & Huang, D. S. Computers in Biology and Medicine, Cancer Classification Using Rotation Forest, 2008, 38 (5), pp 601-610.
[16] Dasarathy, B. V. & Sheela, B. V. Proceedings of the IEEE, A Composite Classifier System Design: Concepts and Methodology, 1979, 67 (5), pp 708-713.
[17] Liang, G., Zhu, X. & Zhang, C. Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, An Empirical Study of Bagging Predictors for Different Learning Algorithms, 2011.
[18] Wang, G., Hao, J., Ma, J. & Jiang, H. Expert Systems with Applications, A Comparative Assessment of Ensemble Learning for Credit Scoring, 2011, 38 (1), pp 223-230.
[19] Polikar, R. IEEE Circuits and Systems Magazine, Ensemble Based Systems in Decision Making, 2006, 6 (3), pp 21-45.
[20] Baayen, H., Halteren, H. V., Neijt, A. & Tweedie, F. JADT 2002, An Experiment in Authorship Attribution, 1, 2002, pp 69-75.
[21] Akın, M. D. & Akın, A. A. Elektrik Mühendisliği, Türkçe Dilleri için Açık Kaynaklı Doğal Dil İşleme Kütüphanesi: Zemberek, 2007, 431, pp 38-44.
[22] Alpaydın, E. Artificial Learning [Yapay Öğrenme], Boğaziçi Üniversitesi Yayınevi, İstanbul, 2011, p 153.
[23] Han, J., Kamber, M. & Pei, J. Data Mining: Concepts and Techniques, 3rd edition. Morgan Kaufmann, USA, 2012, p 350.
[24] Öztemel, E. Artificial Neural Networks [Yapay Sinir Ağları], 2nd edition, Papatya Yayıncılık Eğitim, İstanbul, 2006, p 75.
[25] Keerthi, S. S., Shevade, S. K., Bhattacharyya, C. & Murthy, K. R. K. Neural Computation, Improvements to Platt's SMO Algorithm for SVM Classifier Design, 2001, 13 (3), pp 637-649.