Author Identification using Compression and Machine Learning Approaches
A. F. Ibrahim, ECE Department
Abstract—In this report we compare two different paradigms for author identification. The first paradigm is based on compression algorithms, where the entire process of defining features, extracting them, and training a classifier is avoided. The second paradigm follows the classical pattern recognition framework, where proposed linguistic features are used to train both Support Vector Machine (SVM) and Naive Bayes classifiers. Comprehensive experiments performed on a dataset composed of five writers show that compression algorithms may achieve better performance. Advantages and drawbacks of both paradigms are also discussed.
Index Terms—Author identification, Compression, Support Vector Machines, Naïve Bayes.

I. INTRODUCTION
Author identification is the task of identifying the author of a given text. It can therefore be formulated as a typical classification problem, which depends on discriminant features to represent the style of an author. The literature shows a long history of investigation into author identification [1]. Frank, Chui, and Witten [2] advocated avoiding the explicit definition of features, describing the classes as a whole instead. They used modern lossless data compression algorithms because of their ability to construct accurate statistical models with low or acceptable computational requirements. In this assignment, we compare both strategies for author identification. First we present the background on the compression algorithms used and introduce the PPM (Prediction by Partial Matching) algorithm [3], [4], which is considered one of the best modern general-purpose compression algorithms. Despite demanding much more computing resources than dictionary-based techniques, PPM typically yields substantially better compression ratios, and with modern digital technology its memory usage and processing time are acceptable. We also show how PPM can be used for pattern classification. Then, we present a set of linguistic features of the English language used to train two classical machine learning classifiers: Support Vector Machine (SVM) and Naive Bayes [5]. Comprehensive results on a dataset composed of book chapters written in English by five different authors show that compression algorithms may perform slightly better than the classical pattern recognition framework. Finally, we discuss some advantages and drawbacks of each paradigm.

The next sections are organized as follows. Section II introduces the basics of compression algorithms and how they can be used for classification. Section III discusses the concept of author attributes and presents the author attributes that have been used in this work. Section IV describes the adopted machine learning approaches. Section V explains the dataset that is used in the experiments. Section VI reports the experiments, and Section VII concludes this work.
II. COMPRESSION ALGORITHM
A. The PPM Algorithm

Even though the source model P is generally unknown, it is possible to construct a coding scheme based on some implicit or explicit probabilistic model Q that approximates P. The better Q approximates P, the smaller the coding rate achieved by the coding scheme. In order to achieve low coding rates, modern lossless compressors rely on the construction of sophisticated models that closely follow the true source model. Statistical compressors, such as PPM, encode messages according to an estimated statistical model for the source. For stationary sources, the PPM algorithm learns a progressively better model during encoding. Many experimental results show that the superiority of the compression performance of PPM, in comparison with other asymptotically optimal compressors, results mainly from its ability to construct a good model for the source in the very early stages of the compression process [3], [4]. In other words, PPM constructs ("learns") an efficient model for the message to be compressed faster than its competitors.

The PPM algorithm is based on context modeling and prediction. PPM starts with a "complete ignorance model" and adaptively updates this model as the symbols in the uncompressed stream are coded. Based on the whole sequence of symbols already coded, the model estimates probability distributions for the next symbols, conditioned on a sequence of k previous symbols in the stream. The number of symbols in the context, k, determines the order of the model. The next symbol, x, is coded by arithmetic coding, with the probability of x conditioned on its context. If x has not previously occurred in that specific context, no estimate for its probability is available. In this case, a special symbol ("escape") is coded, and PPM-C tries to code x in a reduced context, with k-1 antecedent symbols. This process is repeated until a match is found, or until the symbol is coded using the independence-equiprobability model.

Experiments show that the compression performance of PPM increases as the maximum context size increases up to a certain point, after which performance starts to degrade. This behavior can be explained by the phenomenon of context dilution and the increased emission of escape symbols. The context size at which compression performance is optimal depends on the message to be compressed, but typical values usually lie in the interval 4 to 6.

We have used WinRAR 3.60 beta 3 as the compression software tool in our experiments. This tool is a commercial (free trial) Windows GUI and command line archiver by Eugene Roshal, released May 8, 2006. It produces rar and zip archives and decompresses many other formats; it also encrypts and performs other functions. The best compression mode, which we applied in our experiments, uses the PPM compression model with optimizations for text [3].
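To make the escape mechanism concrete, the following is a minimal sketch of PPM-C-style probability estimation in Python. The paper itself relies on WinRAR's PPM implementation; this simplified version is our own illustration, omits the exclusion principle and arithmetic coding, and only accumulates the ideal code length in bits.

import math
from collections import defaultdict, Counter

class SimplePPM:
    """Simplified PPM-C model: counts follow-up symbols per context,
    escapes to shorter contexts on a miss, no exclusion principle."""

    def __init__(self, max_order=3, alphabet_size=256):
        self.max_order = max_order
        self.alphabet_size = alphabet_size
        # contexts[k] maps a k-character context to a Counter of next symbols
        self.contexts = [defaultdict(Counter) for _ in range(max_order + 1)]

    def _code_length(self, history, symbol, update=True):
        """Ideal code length (bits) of `symbol` given the preceding text."""
        bits = 0.0
        for k in range(min(self.max_order, len(history)), -1, -1):
            ctx = history[len(history) - k:]
            counts = self.contexts[k][ctx]
            total = sum(counts.values())
            distinct = len(counts)
            if counts[symbol] > 0:
                # PPM-C: P(sym | ctx) = count / (total + distinct)
                bits += -math.log2(counts[symbol] / (total + distinct))
                break
            if distinct > 0:
                # symbol unseen in this context: code an escape,
                # P(escape | ctx) = distinct / (total + distinct)
                bits += -math.log2(distinct / (total + distinct))
        else:
            # order -1: uniform (independence-equiprobability) model
            bits += math.log2(self.alphabet_size)
        if update:
            for k in range(min(self.max_order, len(history)) + 1):
                ctx = history[len(history) - k:]
                self.contexts[k][ctx][symbol] += 1
        return bits

    def train(self, text):
        """Adaptive pass: the model is updated as symbols are coded."""
        for i, ch in enumerate(text):
            self._code_length(text[:i], ch, update=True)

    def coding_rate(self, text):
        """Bits per symbol under the now static (frozen) model."""
        bits = sum(self._code_length(text[:i], ch, update=False)
                   for i, ch in enumerate(text))
        return bits / max(len(text), 1)

Training on an author's samples and then calling coding_rate with update=False mirrors the static-model classification used later in the experiments.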
III. AUTHOR ATTRIBUTES

The main idea behind authorship attribution is measuring textual features by which we can distinguish between texts written by different authors. In the typical authorship attribution problem, a text of unknown authorship is assigned to one candidate author, given a set of candidate authors for whom text samples of undisputed authorship are available. From a machine learning point of view, this can be viewed as a multi-class, single-label text categorization task [1].

A. Linguistic Features

Previous studies on authorship attribution proposed taxonomies of features to quantify writing style, the so-called style markers, under different labels and criteria [6]. Current research on text representation features for stylistic purposes focuses mainly on the computational requirements for measuring them. Lexical and character features consider a text as a mere sequence of word tokens or characters, respectively [1], [7]. Table 1 shows the set of selected features that forms the feature vector used in the work presented in this report. For the purposes of this assignment, we consider a sentence to be a sequence of characters that (1) is terminated by (but does not include) one of the characters ! ? . or the end of the file, (2) excludes whitespace on either end, and (3) is not empty.

IV. GENERAL PURPOSE CLASSIFICATION

Traditional authorship attribution approaches adopted function words, POS tags, and rewrite rules as the feature set used to build a classification model [1]. Even though they achieved good accuracy, more meaningful feature sets could still improve the performance. In this section we present, in Table 1, a set of basic attributes which are complex enough to hold a large variance of authorship information.
TABLE 1
USED AUTHOR ATTRIBUTES

Attribute                  Description
Adverbs Rate               frequency of adverbs over total number of words
Adjectives Rate            frequency of adjectives over total number of words
Noun Rate                  frequency of nouns over total number of words
Lexical Diversity          number of unique words used in a text divided by the total number of words
Content Diversity          number of content words used in a text divided by the total number of words
Pronoun Rate               frequency of pronouns over total number of words
Words Rate Per Sentence    average number of words per sentence
Word Length Rate           total length of words over total number of words
Modifier Rate              (total number of adjectives + total number of adverbs) over total number of words
Average Word Length        total number of characters over total number of words
Sensor Words Rate          frequency of sensor words over total number of words
Descriptive Words Rate     frequency of descriptive words over total number of words
Transition Words Rate      frequency of transition words over total number of words
Average Sentence Length    total number of words over total number of sentences
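As an illustration, a minimal Python sketch of how a few of the Table 1 attributes could be computed follows. It implements the sentence definition given above and assumes the word-class lists (adverbs, adjectives, and so on) are loaded separately, since the paper collects them from the Merriam-Webster dictionary [9]; the function names and tokenization details are our own choices, not the paper's.

import re

def split_sentences(text):
    """Sentences: terminated by ! ? . or end of file, trimmed, non-empty."""
    parts = re.split(r"[!?.]", text)
    return [p.strip() for p in parts if p.strip()]

def tokenize(text):
    """Simple word tokenizer (one reasonable choice among many)."""
    return re.findall(r"[A-Za-z']+", text.lower())

def extract_features(text, adverbs=frozenset(), adjectives=frozenset()):
    """Compute a subset of the Table 1 attributes; the word-class
    sets are assumed to come from an external dictionary."""
    words = tokenize(text)
    sentences = split_sentences(text)
    n_words = max(len(words), 1)
    n_sents = max(len(sentences), 1)
    n_adv = sum(w in adverbs for w in words)
    n_adj = sum(w in adjectives for w in words)
    return {
        "adverbs_rate": n_adv / n_words,
        "adjectives_rate": n_adj / n_words,
        "lexical_diversity": len(set(words)) / n_words,
        "modifier_rate": (n_adv + n_adj) / n_words,
        "average_word_length": sum(len(w) for w in words) / n_words,
        "average_sentence_length": len(words) / n_sents,
    }

The remaining Table 1 rates follow the same pattern once the corresponding word lists (nouns, pronouns, sensor, descriptive, and transition words) are available.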
V. DATASET

Unfortunately, there is no standard dataset for evaluating authorship attribution methods. Therefore, we had to assemble our own corpus, which was gathered from the Web. We collected book chapters available on the Internet for five authors with different profiles. All samples were taken from The Online Books Page at the University of Pennsylvania, edited by John Mark Ockerbloom [8]. Table 2 summarizes some statistics about our corpus. We have chosen a different number of chapters for each author; the training set contains 15 chapters, and the test set contains 5 different texts for each of the five authors. The average length of the samples is 456 words. Appendix A shows an example article from our dataset.

TABLE 2
CORPUS STATISTICS

Author                 # of samples   Size of vocabulary
Emile Gaboriau (C1)    9              35610
Eleanor Abbott (C2)    9              40685
Jean Fabre (C3)        6              26198
Mary Wade (C4)         8              9763
Emile Faguet (C5)      6              9709
The data used plays no role in the design, tuning, or training of the proposed methods. All analysis and handling of the data was conducted in the Matlab software package.

VI. EXPERIMENTS
In this section we report the experiments performed using both paradigms described in this report. In the first experiment, the chapters of each author were randomly grouped into training and testing sets (between two and three chapters in each set). In the learning stage, the number N of classes (in this case N = 5 authors) is defined, and a training set Ti of text samples of fixed size known to belong to class Ci, i = {1, 2, . . . , N}, is selected.

In the compression strategy, feature extraction is done implicitly by the PPM algorithm, using WinRAR with the best compression option. It sequentially compresses the samples in Ti, and the resulting model Mi is kept as a model for the texts in Ci. During classification, the models generated in the training stage are used but not updated during the encoding process. Classification is done as follows: a text sample x from an unknown author is coded by the PPM algorithm with static model Mi, and the corresponding coding rate ri is registered. Then, the sample x is assigned to Ci if ri < rj, j = {1, 2, . . . , N}, j ≠ i. The rationale is that if x is a sample from class Ci, the model Mi probably best describes its structure, thus yielding the smallest coding rate. The average performance of this strategy on the testing set (5 texts per author, 25 in total) was around 87.75%.

Regarding the experiments based on author attributes, the same formalism was applied. Two machine learning algorithms were used: Support Vector Machine (SVM) and Naïve Bayes (NB) classifiers. Both classifiers were used to model the N classes (authors). NB was used because it is a classical text classifier and less expensive than other classifiers. When using SVMs, there are two basic approaches to solving an N-class problem: pairwise and one-against-others. In this work the second strategy was tried. A linear kernel was employed and its parameters were defined. The Matlab version of the SVM was used in this work. The feature vector used to train both classifiers is composed of 14 components. Lists of adverbs, adjectives, nouns, and pronouns were collected from the online Merriam-Webster dictionary [9]. The success rate of the developed system is measured on a test set consisting of 25 texts (5 texts per author). In the first method we used all 14 features, weighted equally.

Table 3 reports the performance of both strategies. It can be seen that compression algorithms may achieve a slight improvement in performance over the classical pattern recognition framework.

TABLE 3
ACCURACY OF COMPRESSION AND THE USED CLASSIFIERS (AUTHOR ATTRIBUTES)

COMPRESSION    Accuracy
RAR            89.5%
ZIP            86%

STYLOMETRY     Accuracy
SVM            88.56%
Naive Bayes    72%
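The compression-based decision rule described above can be sketched in a few lines of Python. Since WinRAR's PPM coder is not directly scriptable here, the sketch below approximates a static class model Mi with the standard concatenation trick, using bz2 as a stand-in compressor: the coding cost of x under Mi is estimated as the size of compressing (training text + x) minus the size of compressing the training text alone. This illustrates the decision rule only, not the paper's exact pipeline.

import bz2

def extra_bits(train_text, sample):
    """Approximate coding cost of `sample` given a class corpus:
    len(C(train + sample)) - len(C(train)), in bits."""
    base = len(bz2.compress(train_text.encode("utf-8")))
    joint = len(bz2.compress((train_text + sample).encode("utf-8")))
    return 8 * (joint - base)

def classify(sample, class_corpora):
    """Assign `sample` to the class whose corpus yields the
    smallest incremental coding rate (bits per character)."""
    rates = {label: extra_bits(text, sample) / max(len(sample), 1)
             for label, text in class_corpora.items()}
    return min(rates, key=rates.get)

# Hypothetical usage: one concatenated training string per author.
# corpora = {"C1": gaboriau_train, ..., "C5": faguet_train}
# print(classify(unknown_chapter, corpora))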
TABLE 4
RESULTS FOR USING SVM & NB CLASSIFIERS

         Precision       Recall          F-Score
Author   SVM     NB      SVM     NB      SVM     NB
C1       0.84    0.67    1.0     1.0     0.86    0.8
C2       1.0     0.75    0.67    0.5     0.8     0.6
C3       1.0     0.4     0.5     0.5     0.67    0.44
C4       1.0     1.0     0.8     0.8     0.89    0.89
C5       1.0     1.0     1.0     0.75    1.0     0.86
If we compare both paradigms on the basis of accuracy alone, we find that the compression approach performs better than the classical classification approaches. However, accuracy alone is not enough to distinguish between the two paradigms; a comparison on other measures such as precision, recall, and F-score is more informative. Tables 4 and 5 show the precision, recall, and F-score for both paradigms. Looking at recall in Table 4, we observe that classes C1 and C5 have better results than those in Table 5. Comparison using F-score yields the same ranking as that obtained using recall.
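A per-class precision/recall/F-score evaluation of the stylometric classifiers, as in Table 4, could be reproduced along the following lines with scikit-learn (the paper used the Matlab SVM implementation; the feature matrix X of 14-component vectors and label vector y are assumed to come from the extraction step sketched earlier):

from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

def evaluate(X_train, y_train, X_test, y_test):
    """Train a linear SVM and a Naive Bayes classifier on the
    14-dimensional stylometric vectors and report per-class
    precision, recall, and F-score."""
    # LinearSVC uses a one-vs-rest scheme by default, matching the
    # one-against-others strategy described in the text.
    svm = LinearSVC().fit(X_train, y_train)
    # GaussianNB is one reasonable choice for continuous rate features.
    nb = GaussianNB().fit(X_train, y_train)
    for name, clf in [("SVM", svm), ("NB", nb)]:
        print(name)
        print(classification_report(y_test, clf.predict(X_test)))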
TABLE 5
RESULTS FOR USING COMPRESSION ALGORITHMS

         Precision       Recall           F-Score
Author   Zip     Rar     Zip      Rar     Zip      Rar
C1       0.8     0.86    0.84     0.84    0.82     0.845
C2       0.8     0.84    0.85     0.84    0.85     0.84
C3       0.86    0.83    0.846    0.86    0.863    0.85
C4       0.88    0.9     0.864    0.862   0.867    0.857
C5       0.87    0.83    0.867    0.85    0.851    0.818
Due to the small size of the corpus, a second experiment was performed in which cross-validation was adopted for computing classification rates. The dataset was randomly divided into different sets, with random training and testing sizes for each round. In the first cross-validation round, the first set of each author was used for training, and the remaining sets were used for classification. Table 6 reports the performance of both strategies over five different partitions of the database.

TABLE 6
COMPARISON USING CROSS VALIDATION

          COMPRESSION       STYLOMETRY
          RAR %    ZIP %    SVM      Naive Bayes
Round 1   88.5     84       88.56    72
Round 2   81.2     80.9     84       74.8
Round 3   80.9     81.01    83       71
Round 4   84.3     79.3     82.8     79.1
Round 5   84.9     82.3     86.7     78.5
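The round-based evaluation could be scripted, for instance, with scikit-learn's ShuffleSplit, which mirrors the random training/testing partitions described above; the split sizes and random seed here are illustrative assumptions, not the paper's settings.

import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.svm import LinearSVC

def cross_validate(X, y, rounds=5, test_size=0.4, seed=0):
    """Report per-round accuracy over random train/test partitions.
    X and y are NumPy arrays of feature vectors and author labels."""
    splitter = ShuffleSplit(n_splits=rounds, test_size=test_size,
                            random_state=seed)
    for i, (tr, te) in enumerate(splitter.split(X), start=1):
        clf = LinearSVC().fit(X[tr], y[tr])
        print(f"Round {i}: {100 * clf.score(X[te], y[te]):.1f}%")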
From the experimental point of view, both strategies have advantages and drawbacks. Compared to the traditional classification scheme used by the author-attribute-based classifiers, the compression algorithms have some advantages: i) no definition of features, ii) no feature extraction, and iii) no traditional learning phase. It is worth remarking that, in spite of their apparent black-box character, compression algorithms like PPM are based on robust probabilistic frameworks. However, given that the size of the texts and the number of samples per author cannot be augmented, the performance of PPM cannot be further improved; in other words, this is as good as it gets. On the other hand, the traditional way of defining and extracting features to train a machine learning model offers more room for improvement, since it is always possible to explore new features, select the relevant and uncorrelated ones through feature selection, and try new classification algorithms.
VII. CONCLUSION

This work discusses two different paradigms for author identification. The first is based on the well-known PPM compression algorithm, applied through the WinRAR tool. In this case, feature extraction is done in an implicit fashion: each author in the dataset is modeled by compressing some samples of his or her chapters. During recognition, those models are used to compress a given questioned sample, which is assigned to the class that produces the lowest coding rate. The second paradigm relies on the traditional pattern recognition framework, which involves steps such as definition of features, feature extraction, and classifier training. In this work we have used linguistic features to train two classifiers. Results using the same testing protocol show that both strategies produce close results but with different confusions. This suggests that the two strategies are complementary and can be combined to build a more reliable identification system. Besides, we believe that PPM is a useful reference when designing new features for author identification: it is fair to expect that a discriminant feature set should achieve at least the same level of performance as PPM. Future work includes using larger datasets with more authors and longer samples, which will enable the research to assess the impact of bigger datasets on PPM. Moreover, different linguistic features need to be investigated, as well as strategies to combine these two paradigms.

APPENDIX A

The following is an example from our corpus. It is part of chapter one by Emile Gaboriau, author number one.

"The sumptuous interior of the Trigault mansion was on a par with its external magnificence. Even the entrance bespoke the lavish millionaire, eager to conquer difficulties, jealous of achieving the impossible, and never haggling when his fancies were concerned. The spacious hall, paved with costly mosaics, had been transformed into a conservatory full of flowers, which were renewed every morning. Rare plants climbed the walls up gilded trellis work, or hung from the ceiling in vases of rare old china, while from among the depths of verdure peered forth exquisite statues, the work of sculptors of renown. On a rustic bench sat a couple of tall footmen, as bright in their gorgeous liveries as gold coins fresh from the mint; still, despite their splendor, they were stretching and yawning to such a degree, that it seemed as if they would ultimately dislocate their jaws and arms. "Tell me," inquired the servant who was escorting Pascal, "can any one speak to the baron?""
REFERENCES
[1] Stamatatos, E., "A survey of modern authorship attribution methods", JASIST, 2009.
[2] Frank, E., Chui, C., and Witten, I. H., "Text categorization using compression models", in Data Compression Conference, 2000.
[3] Moffat, A., "Implementing the PPM data compression scheme", IEEE Transactions on Communications, 1990.
[4] Mahoney, M., "Large text compression benchmark", http://www.cs.fit.edu/~mmahoney/compression/text.html
[5] Fukumoto, F., and Suzuki, Y., "Manipulating large corpora for text classification", in Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (EMNLP '02), 2002.
[6] Gamon, M., "Linguistic correlates of style: authorship classification with deep linguistic analysis features", in Proceedings of the 20th International Conference on Computational Linguistics (COLING '04), 2004.
[7] Moschitti, A., and Basili, R., "Complex linguistic features for text classification: A comprehensive study", Advances in Information Retrieval, Lecture Notes in Computer Science, 2004.
[8] The Online Books Page, http://onlinebooks.library.upenn.edu
[9] Merriam-Webster dictionary, http://www.merriam-webster.com/