
International Conference on Advances in Engineering and Technology (ICAET'2014) March 29-30, 2014 Singapore

Authorship Attribution of online messages using Stylometry: An Exploratory Study
Smita Nirkhi, Dr. R.V. Dharaskar, and Dr. V.M. Thakare

by author A1. We have a text t2 that is certainly by author A2, and we have a disputed text t3 that we know was written by either A1 or A2. A stylometric analysis then takes the three texts and (ideally) tells us whether t3 was written by author A1 or by author A2.

Online text raises two issues:
1. Online documents are generally short and informally written, so their vocabulary is less stable. Their structure or composition style also often differs from that of ordinary text documents.
2. Because cyberspace is used at a global level, multilingual text becomes a new challenge for authorship attribution.

The aforementioned authorship attribution problem can be addressed with stylometry, which quantifies the writing style of an author. It is a rapidly growing interdisciplinary research area that combines stylistics, statistics, and computer science. Stylometry is defined as the statistical analysis of literary style; it offers a means of capturing the often elusive character of an author's style by quantifying some of its features. The theory underlying stylometric studies is that authors have an unconscious as well as a conscious aspect to their style. The basic assumption is that every author's style has certain features that the author cannot consciously manipulate; these features are therefore considered a reliable data source for a stylometric study. The two primary applications of stylometry are attribution studies and chronological problems. Variation in style can be caused by differences of genre or content, and similarity by literary processes such as imitation. By measuring and counting stylistic traits, we hope to discover the 'characteristics' of a particular author.

This paper reviews the context, origins, and historical development of stylometry. It discusses stylometry methods, followed by stylometry using R and design issues for authorship attribution. Sections IV and V discuss the proposed approach and its experimental evaluation. Stylometry methods are mainly categorized as writer invariant methods, neural network methods, and genetic algorithm methods. These are explained as follows.

Abstract—Authorship attribution identifies the author of an anonymous text given text samples from a few candidate authors, assuming that the anonymous text was written by one of them. Authorship attribution of online messages can be useful for mining the writing patterns of authors, which in turn helps the forensic investigation process. It is essentially a classification problem: a set of documents of known authorship is used for training, and the aim is to automatically determine the author of an anonymous text. Unlike in other classification tasks, identifying the features that characterize an author is difficult because there are many different ways of speaking and writing. In recent years, practical applications of authorship attribution have grown in areas such as intelligence (linking intercepted messages to each other and to known terrorists), criminal law (identifying writers of threatening mail), civil law (copyright), and computer security (tracking authors of computer virus source code). Stylometry is one of several approaches to the authorship attribution problem. It uses statistical methods to analyze a text and determine its authorship. This paper investigates the use of stylometry for authorship attribution of online messages.

Keywords—Authorship Attribution, KNN, Stylometry, SVM

I. INTRODUCTION

E-MAILS, blogs, chat rooms, and newsgroups are used extensively for online communication and have become integrated into our everyday lives. Unfortunately, these online communication media are being misused for many illegitimate activities. The anonymous nature of these channels makes them an ideal means of communication for criminal groups and extremist organizations. Authorship analysis of online documents (such as e-mails, VoIP segments, and instant messages) for prosecuting terrorists, pedophiles, and scammers in a court of law has received great attention in different studies [1][2][3]. The authorship analysis problem is divided into authorship attribution, characterization (profiling), and verification (similarity detection). This paper focuses on the authorship attribution problem and the stylometric approach to it. Stylometric analysis uses statistical methods to attempt to ascribe authorship to disputed texts. Let’s consider a simple example: We have a text t1 that is certainly

A. Writer Invariant
A writer invariant is a property of a text that is similar across all texts by the same author. In the writer invariant method, the frequency of the function words used by the writer is analyzed first. The text is then divided into chunks of words, and each chunk is analyzed to find the frequency of those function words in it. This generates an identifier of n numbers for each chunk, one number per function word.
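The chunking step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five-word function-word list, the chunk size, and the names `tokenize` and `chunk_profiles` are hypothetical and not taken from the paper. The sketch also stops before the PCA projection.

```python
from collections import Counter
import re

# Hypothetical function-word list; a real study would use a much longer one.
FUNCTION_WORDS = ["the", "of", "and", "to", "in"]


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def chunk_profiles(text, chunk_size, function_words=FUNCTION_WORDS):
    """Return one n-number identifier per full chunk of `chunk_size` words.

    Each identifier holds the relative frequency of every function word
    in that chunk, so n == len(function_words). A trailing partial chunk
    is ignored.
    """
    words = tokenize(text)
    profiles = []
    for start in range(0, len(words) - chunk_size + 1, chunk_size):
        chunk = words[start:start + chunk_size]
        counts = Counter(chunk)
        profiles.append([counts[w] / chunk_size for w in function_words])
    return profiles


sample = "the cat sat in the hat and the dog ran to the end of the road"
profiles = chunk_profiles(sample, chunk_size=8)
```

Each row of `profiles` is one point in n-dimensional space (here n = 5). Relative rather than absolute counts are used so that chunks of different sizes stay comparable, which is a design choice of this sketch.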

Smita Nirkhi is with the Ramdeobaba College of Engineering & Management, Nagpur, India, and is a Research Scholar at GHRCE, Nagpur (Email: [email protected]). Dr. R.V. Dharaskar is Director, MPGI, Nanded, India. Dr. V.M. Thakare is Head of the Depart, Amravati, India. http://dx.doi.org/10.15242/IIE.E0314119



Each chunk of text is then represented as a point in n-dimensional space. This n-dimensional space is flattened into a plane using principal components analysis (PCA). The result is a display of points that corresponds to an author's style. If two documents are plotted on the same plane, the resulting pattern may show whether both documents were written by the same author or by different authors.

B. Neural Networks
Neural networks can help to analyze the authorship of texts. Backpropagation can be used to train a classifier on training data of known authorship. The trained network can then generalize to new items that were not part of the training set. According to the surveyed literature, a neural network may reach 70% accuracy in determining the authorship of literary texts. One problem with this method of analysis is that the network can become biased toward certain authors depending on its training set.

C. Genetic Algorithms
Another technique used in stylometry is the genetic algorithm, which is a rule-based method. For example, a rule may be based on the frequency of a specific word used by an author. The program takes text as input and generates rules to determine authorship. Each rule is tested and given a fitness score. The rules with the lowest scores are thrown out. The remaining rules are given small changes and a few new rules are generated, and the process repeats until the rules correctly attribute the texts.
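The genetic algorithm loop described above can be sketched as follows. This is a toy illustration, not the method of any cited study. It assumes two candidate authors labelled A1 and A2, and a rule of the form (word, threshold). The vocabulary, the population size, and every function name are hypothetical. Unlike the description above, the sketch keeps the surviving rules unchanged and adds mutated copies of them, so that the best rule found so far is never lost.

```python
import random

# Hypothetical candidate words that a rule may test.
VOCAB = ["the", "and", "of", "to"]


def word_rate(text, word):
    """Relative frequency of `word` among the whitespace-separated tokens."""
    words = text.lower().split()
    return words.count(word) / len(words)


def predict(rule, text):
    """Apply a (word, threshold) rule: 'A1' if the word is frequent enough, else 'A2'."""
    word, threshold = rule
    return "A1" if word_rate(text, word) >= threshold else "A2"


def fitness(rule, labelled_texts):
    """Share of labelled texts that the rule attributes correctly."""
    hits = sum(predict(rule, text) == author for text, author in labelled_texts)
    return hits / len(labelled_texts)


def random_rule(rng):
    """Draw a new rule with a random word and a random threshold."""
    return (rng.choice(VOCAB), rng.uniform(0.0, 0.5))


def mutate(rule, rng):
    """Give a rule a small change by nudging its threshold."""
    word, threshold = rule
    return (word, min(1.0, max(0.0, threshold + rng.uniform(-0.05, 0.05))))


def evolve(labelled_texts, pop_size=20, generations=50, n_fresh=2, seed=0):
    """Evolve rules until one attributes every text correctly or time runs out."""
    rng = random.Random(seed)

    def score(rule):
        return fitness(rule, labelled_texts)

    population = [random_rule(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        if score(population[0]) == 1.0:
            break  # the best rule attributes every text correctly
        survivors = population[: pop_size // 2]  # lowest-scoring rules are dropped
        mutated = [mutate(rng.choice(survivors), rng)
                   for _ in range(pop_size - len(survivors) - n_fresh)]
        fresh = [random_rule(rng) for _ in range(n_fresh)]
        population = survivors + mutated + fresh
    return max(population, key=score)


# Toy labelled texts, invented for illustration only.
texts = [
    ("the cat and the dog saw the bird", "A1"),
    ("the man took the bus to the city", "A1"),
    ("cats and dogs and birds ran to town", "A2"),
    ("we went to town and then to work", "A2"),
]
best_rule = evolve(texts)
```

Calling `fitness(best_rule, texts)` reports how many of the toy texts the evolved rule attributes correctly. A realistic system would evolve richer rules over many features and would score them on held-out texts.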

similar way to word length frequency as a way of comparing texts. Not surprisingly, in English we are likely to find the highest values at ‘e’ and ‘t’. Once again, it is worth considering why letter frequencies occur the way they do.

Word Based Stylometry Analysis
Writeprint is a term proposed by some forensic linguistics researchers to denote a set of distinguishing stylometric characteristics of a written text (writer invariants), such as "vocabulary richness, length of sentence, use of function words, layout of paragraphs, and key words", which allow one to identify its author (if it was written by a single person).

III. DESIGN ISSUES
Stylometry provides ways to measure features such as sentence length, vocabulary richness, and word frequencies, and it has many practical applications in authorship attribution research. These applications usually rest on the principles that every individual has a characteristic writing style that can help to detect the true author of an anonymous text, that there exist stylistic fingerprints that can betray the plagiarist, and that the oldest authorship disputes (St. Paul’s epistles or Shakespeare’s plays) can be settled with more or less sophisticated statistical methods. Statistical approaches have offered impressive precision in identifying texts written by several authors based on a single example of each author’s writing. But patterns of stylometric similarity and difference also raise other research questions: comparisons between different books by the same author; between books by different authors; between authors differing in chronology or gender; and between translations of the same author or group of authors. These comparisons help, in turn, to find new ways of looking at works that seem to have been studied from all possible perspectives. In this digital era, texts are available in electronic form, so stylometry can readily be applied to them for experimentation.

Two critical research issues influence the performance of authorship analysis:
1. Feature selection: finding effective discriminators.
2. Analytical techniques: the approach used to discriminate texts by author based on the selected features.
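As an illustration of candidate discriminators, the three simple style markers discussed in this paper (sentence length, word length, and letter frequency) can be computed as in the sketch below. The paper works in R; Python is used here purely for illustration. The function names are hypothetical, the sentence splitter is deliberately naive, and the text is assumed to be non-empty.

```python
from collections import Counter
import re


def mean_sentence_length(text):
    """Mean number of words per sentence, splitting naively on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_counts = [len(s.split()) for s in sentences]
    return sum(word_counts) / len(word_counts)


def word_length_distribution(text):
    """Map each word length to the share of words that have that length."""
    words = re.findall(r"[A-Za-z]+", text)
    counts = Counter(len(w) for w in words)
    return {length: n / len(words) for length, n in sorted(counts.items())}


def letter_frequencies(text):
    """Relative frequency of each letter a-z that occurs in the text."""
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    counts = Counter(letters)
    return {letter: n / len(letters) for letter, n in counts.items()}


# Invented example message.
message = "See you at ten. Bring the notes!"
sentence_length = mean_sentence_length(message)
length_profile = word_length_distribution(message)
letter_profile = letter_frequencies(message)
```

Concatenating these values into one vector per message gives a feature representation that a classifier can consume. Which markers actually discriminate between authors is an empirical question for the feature selection step.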

II. STYLOMETRY USING R
R provides a platform for computing sentence length, word length, and letter frequencies, and for writeprint identification, using appropriate packages. These are useful style markers for analyzing authorship.

Sentence Length
This measure is very simple to describe: it is the mean number of words per sentence used by the author. As a one-dimensional measure, it has all the pitfalls that have been described above, plus some more. First, old texts can be unreliable: Shakespearean texts, for example, were edited, and very often the punctuation was changed by editors and compositors. Modern works, particularly very recent ones, may have been influenced by word processors. Furthermore, sentence length should never be used on transcriptions of spoken text, because people don’t speak full stops.

Word Length
Again, this is a very simple measure that needs little explanation: count all the words, find out how long they are, and produce a distribution. Inevitably, word length profiles will be dominated by a few frequently occurring words. Word length frequency is a rather coarse-grained way of looking at the frequency of words within texts.

Letter Frequency
Letter frequency involves counting the frequencies with which letters of the alphabet appear in a text and using this in a

IV. PROJECTED APPROACHES
From the existing classification methods, SVM and KNN were selected for this study. To evaluate classification performance, the KNN and SVM classifiers were implemented and evaluated.

Support vector machines (SVMs) are supervised learning methods used for classification. SVM was chosen because it is effective in high-dimensional spaces, including cases where the number of dimensions is greater than the number of samples. It uses a subset of the training points (called support vectors) in the decision function, so it is also memory efficient. It is versatile, because different kernel functions can be specified for the decision function: common kernels are provided, and custom kernels can also be specified. An SVM finds the optimal hyperplane for linearly separable patterns. It extends to patterns that are not linearly separable by using a kernel function to transform the original data and map it into a new space.

In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method for classification and regression. It predicts object values or class memberships based on the k closest training examples in the feature space. k-NN is a type of instance-based learning, where the function is only approximated locally and all computation is deferred until classification. The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
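The majority vote described above can be sketched as follows. This is a minimal illustration, not the implementation evaluated in this paper. The Euclidean distance, the two-dimensional feature vectors, and the names `euclidean` and `knn_predict` are assumptions made for the example.

```python
from collections import Counter
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Assign `query` to the author most common among its k nearest neighbours.

    `train` is a list of (feature_vector, author) pairs.
    """
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(author for _, author in neighbours)
    return votes.most_common(1)[0][0]


# Invented stylometric feature vectors for two authors.
train = [
    ([0.40, 0.10], "A1"),
    ([0.38, 0.12], "A1"),
    ([0.36, 0.09], "A1"),
    ([0.10, 0.35], "A2"),
    ([0.12, 0.40], "A2"),
    ([0.15, 0.33], "A2"),
]
predicted = knn_predict(train, [0.35, 0.15], k=3)
```

All the work happens at prediction time, which reflects the deferred computation mentioned above. An odd k avoids tied votes when there are only two candidate authors.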

This paper attempts to recognize the authorship of unknown texts based on writing style using stylometry. Both SVM and KNN yield significantly better results than random guessing. We achieved 80% average accuracy on the test set using KNN and 92% average accuracy on the test set using SVM. The experimental results indicate that authorship analysis approaches can be used for online messages in cybercrime investigation to address the identity-tracing problem. Stylometric features are effective discriminators for online documents, and the SVM technique achieved high performance. This approach may also be applicable in a multilingual context. The present study thus indicates that stylometry provides a way to build classifiers that require fewer input variables than traditional statistics. The demonstrated efficiency of the automatic classifiers marks them as exciting tools for the future in stylometry’s continuing evolution. We consider that the combination of stylometry and AI will result in a useful discipline with many practical applications.

REFERENCES

V. EXPERIMENTAL EVALUATION
Experiments were carried out on the Reuter_50_50 dataset of news texts. The two classification techniques, SVM and KNN, were implemented and evaluated. The performance indicators considered are (i) average accuracy, (ii) average precision, and (iii) average recall. Table I shows the values of these measures for both classification techniques, SVM and KNN. Data from 5 authors in the dataset were used for the experiments. Equations (1), (2), and (3) are used to calculate accuracy, precision, and recall, respectively.

[1] A. Abbasi and H. Chen, “Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace”, ACM Transactions on Information Systems, 26(2), (2008). http://dx.doi.org/10.1145/1344411.1344413
[2] O. de Vel, “Mining e-mail authorship”, in Proc. Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining (KDD), (2000).
[3] M. Koppel, S. Argamon, and A.R. Shimoni, “Automatically categorizing written texts by author gender”, Literary and Linguistic Computing, 17(4), pp. 401–412, (2002). http://dx.doi.org/10.1093/llc/17.4.401
[4] Ramyaa, Congzhou He, and Khaled Rasheed, “Using Machine Learning Techniques for Stylometry”, Artificial Intelligence Center
[5] The University of Georgia, Athens, GA 30602 USA
[6] Estival 2008] [Abbasi et. al. 2008] [Koppel et. al. 2003] [De Vel et. al. 2001].
[7] Li, J., Chen, H., & Huang, Z. “A Framework for Authorship Identification of Online Messages: Writing-Style Features and classification Technique”, Journal of the American Society for Information Science, 57(3), 378–393. doi:10.1002/asi, (2006).
[8] Abbasi, A., & Chen, H. “Visualizing Authorship for Identification”, English, 60–71, (2006).
[9] Stańczyk, U., & Cyran, K. A. “Machine learning approach to authorship attribution of literary texts”, Journal of Applied Mathematics, 1(4), 151–158, (2007).
[10] Pavelec, D., Justino, E., & Oliveira, L. S. “Author Identification using Stylometric Features”, Inteligencia Artificial, 11(36), 59–65, (2007). http://dx.doi.org/10.4114/ia.v11i36.892
[11] Stamatatos, E. “Author identification: Using text sampling to handle the class imbalance problem”, English, 44, 790–799, (2008). http://dx.doi.org/10.1016/j.ipm.2007.05.012
[12] Iqbal, F., Hadjidj, R., Fung, B. C. M., & Debbabi, M. “A novel approach of mining write-prints for authorship attribution in e-mail forensics”, Information Systems, 5, 42–51, (2008). http://dx.doi.org/10.1016/j.diin.2008.05.001
[13] Iqbal, F., Binsalleeh, H., Fung, B. C. M., & Debbabi, M. “Mining writeprints from anonymous e-mails for forensic investigation”, Digital Investigation, 1–9, (2010). http://dx.doi.org/10.1016/j.diin.2010.03.003
[14] Mikros, G. K., & Perifanos, K. “Authorship identification in large email collections: Experiments using features that belong to different linguistic levels”, (2011).

Table I
Classification Results (Dataset: Reuter_50_50 Data Set)

Measure              KNN     SVM
Average Accuracy     80%     92%
Average Precision    80%     90%
Average Recall       75%     75%

\[ \text{Accuracy} = \frac{\text{Number of messages whose author was correctly identified}}{\text{Total number of messages}} \tag{1} \]

\[ \text{Precision} = \frac{\text{Number of messages correctly assigned to the author}}{\text{Total number of messages assigned to the author}} \tag{2} \]

\[ \text{Recall} = \frac{\text{Number of messages correctly assigned to the author}}{\text{Total number of messages written by the author}} \tag{3} \]
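Equations (1) to (3) can be computed directly from lists of true and predicted authors, as in the sketch below. The function names and the toy labels are hypothetical. Averaging the per-author precision and recall with equal weight (a macro average) is an assumption, since the paper does not state how its averages were formed.

```python
def accuracy(true_authors, predicted_authors):
    """Equation (1): share of messages whose author was correctly identified."""
    correct = sum(t == p for t, p in zip(true_authors, predicted_authors))
    return correct / len(true_authors)


def precision(true_authors, predicted_authors, author):
    """Equation (2): correct assignments among messages assigned to `author`."""
    assigned = [t for t, p in zip(true_authors, predicted_authors) if p == author]
    if not assigned:
        return 0.0  # no message was assigned to this author
    return sum(t == author for t in assigned) / len(assigned)


def recall(true_authors, predicted_authors, author):
    """Equation (3): correct assignments among messages written by `author`."""
    written = [p for t, p in zip(true_authors, predicted_authors) if t == author]
    if not written:
        return 0.0  # this author wrote none of the messages
    return sum(p == author for p in written) / len(written)


def macro_average(metric, true_authors, predicted_authors):
    """Average a per-author metric over all authors, weighting each equally."""
    authors = sorted(set(true_authors))
    scores = [metric(true_authors, predicted_authors, a) for a in authors]
    return sum(scores) / len(scores)


# Invented labels for five messages by two authors.
true_labels = ["A1", "A1", "A1", "A2", "A2"]
predicted_labels = ["A1", "A1", "A2", "A2", "A1"]

overall_accuracy = accuracy(true_labels, predicted_labels)
average_precision = macro_average(precision, true_labels, predicted_labels)
average_recall = macro_average(recall, true_labels, predicted_labels)
```

On this toy data three of five messages are attributed correctly. Precision and recall are computed per author before averaging, so that an author with few messages counts as much as one with many.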

VI. CONCLUSION



[15] Tanguy, L., Sajous, F., Calderone, B., & Hathout, N. “Authorship attribution: using rich linguistic features when training data is scarce”, (2012).
