Text Categorization for Authorship Verification

Moshe Koppel, Jonathan Schler, Dror Mughaz
Dept. of Computer Science, Bar-Ilan University, Ramat-Gan, Israel

Abstract. One common version of the authorship attribution problem is that of authorship verification: we need to determine whether a given author, for whom we have a corpus of writing samples, is also the author of a given anonymous text. The set of alternate candidates is not limited to a given finite closed set. In this paper we show how standard text categorization methods can be adapted to solve the authorship verification problem given a sufficiently large anonymous text. The main trick is a type of robustness test in which the best discriminators between two corpora are iteratively eliminated: if the texts remain distinguishable as the discriminators are eliminated, they are very likely by different authors.
1. Authorship verification

The same text categorization methods used for classifying texts by topic [1] can be used for authorship attribution [2-4]. Given sufficiently large samples of two or more authors' work, we can in principle learn models to distinguish them. Typically, the feature sets used for such attribution tasks involve topic-neutral attributes [5-7], such as function words or part-of-speech n-grams, rather than the content words usually used for topic classification.

One common version of the authorship attribution problem is that of authorship verification: we need to determine whether a given author (for whom we have a corpus of writing samples) is also the author of a given anonymous text. The set of alternate candidates is not limited to a given finite closed set. Thus, while there is no shortage of negative examples, we can never be sure that these negative examples fully represent the space of all alternative authors. In this paper we show how standard text categorization methods can be adapted to solve the authorship verification problem given a sufficiently large anonymous text. In particular, we show that a judicious combination of standard classification methods can be used to unmask authors writing pseudonymously or anonymously, even when the authors' known writings differ chronologically or thematically from the unattributed texts.
2. A Motivating Example

Let us begin by considering a real-world example. We are given two 19th-century collections of Hebrew-Aramaic responsa (letters written in response to legal queries). The first, RP (Rav Pe'alim), includes 509 documents authored by an Iraqi rabbinic scholar known as Ben Ish Chai. The second, TL (Torah Lishmah), includes 524 documents that Ben Ish Chai claims to have found in an archive. There is ample historical reason to believe that he in fact authored the manuscript but did not wish to take credit for it for personal reasons. We are given four more collections of responsa written by four other authors working in the same area during the same period. While these far from exhaust the range of possible authors, they collectively constitute a reasonable starting point. To make it interesting, we will refer to Ben Ish Chai, the author of RP, as the "suspect", to the author of TL as the "mystery man", and to the other four as "impostors". The object is to determine whether the suspect is "guilty": is he in fact the mystery man?

2.1 Pairwise authorship attribution

We will combine a number of pairwise authorship attribution experiments in order to solve the verification problem. Let's begin by reviewing the four stages of pairwise authorship attribution:

1. Cleanse the texts. Since the texts of the responsa may have undergone some editing, we must make sure to ignore possible effects of differences resulting from variant editing practices. Thus, we eliminate all orthographic variation: we expand all abbreviations and unify variant spellings of the same word.

2. Select an appropriate feature set. We select a list of lexical features as follows: the 200 most frequent words in the corpus are selected, and all those deemed content words are eliminated manually. We are left with 130 features. Strictly speaking, these are not all function words but rather words that are typical of the legal genre generally, without being correlated with any particular sub-genre. Thus, in the responsa context, a word like "question" is allowed, although in other contexts it would not be considered a function word.

3. Represent the documents as vectors. Each individual responsum is represented as a vector of dimension 130, each entry of which is the relative frequency (normalized by document length) of the respective feature.

4. Learn models. Using each responsum as a labeled example, we use a variant of Exponential Gradient [8,9] as our learner to distinguish pairs of collections. (A code sketch of steps 3 and 4 appears below.)
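As an illustration, the following is a minimal sketch of steps 3 and 4 in Python. The helper names (load_feature_words, load_tokenized) and collection names are hypothetical placeholders, and SGDClassifier merely stands in for the Exponential Gradient learner used in the paper.

    from collections import Counter
    from sklearn.linear_model import SGDClassifier

    def doc_to_vector(tokens, features):
        # Relative frequency of each feature word, normalized by document length.
        counts = Counter(tokens)
        n = max(len(tokens), 1)
        return [counts[f] / n for f in features]

    features = load_feature_words()     # the ~130 genre-neutral words (placeholder)
    rp_docs = [doc_to_vector(d, features) for d in load_tokenized("RP")]
    imp_docs = [doc_to_vector(d, features) for d in load_tokenized("Impostor1")]

    X = rp_docs + imp_docs
    y = [0] * len(rp_docs) + [1] * len(imp_docs)   # 0 = suspect (RP), 1 = impostor

    # A linear online learner; the paper's actual learner is Exponential Gradient.
    clf = SGDClassifier().fit(X, y)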
2.2 From attribution to verification

We begin by checking whether we are able to distinguish between the suspect's work, RP, and that of each of the four impostors, using standard text categorization techniques. Five-fold cross-validation accuracy results for RP against each impostor are shown in Table 1. Note that we obtain accuracy of greater than 95% in each case. How can we parlay this information into a determination regarding the question of whether the author of TL is the author of RP? Let's begin with a test that might settle the matter.

              Imp. 1    Imp. 2    Imp. 3    Imp. 4
    Accuracy  98.1%     98.8%     98.6%     97.4%

Table 1 – Cross-validation accuracy of RP vs. impostors
Test 1: The lineup

Given that we can distinguish between the suspect's work, RP, and that of each of the impostors, it is now a simple matter to check whether the mystery man more closely resembles the suspect than any one of the impostors. That is, we feed each of the mystery man's texts into the model distinguishing between the suspect and an impostor and check whether a significant number of them are more similar to the impostor. If so, it would be highly unlikely that the suspect actually authored these documents. In our case, however, this is not what happens. In fact, as shown in Table 2, for each impostor, the vast majority of the mystery man's documents are classified as more similar to the suspect than to the impostor. Does this prove that the suspect is in fact our mystery man? It certainly does not. It simply shows that the suspect is a more likely candidate to be the mystery man than are these four impostors. But it may very well be that neither any of the impostors nor the suspect is the mystery man and that the true author of TL is still at large. Thus, for our case, this test is indecisive. (A code sketch of this test follows Table 2.)
                        Imp. 1    Imp. 2    Imp. 3    Imp. 4
    % classified as RP  95.4%     92.2%     87.6%     96.2%

Table 2 – Percentage of TL docs classified as RP vs. impostors
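The lineup test amounts to running each TL document through a trained suspect-vs-impostor model and counting how often the suspect is chosen. A minimal sketch, reusing the hypothetical model and vectors from the earlier sketch (class 0 = suspect, class 1 = impostor):

    def lineup_fraction(clf_suspect_vs_impostor, mystery_vectors):
        # Fraction of the mystery man's documents attributed to the suspect.
        preds = clf_suspect_vs_impostor.predict(mystery_vectors)
        return sum(1 for p in preds if p == 0) / len(preds)

    # A low fraction for some impostor (i.e., many documents classified as that
    # impostor) would argue against the suspect being the mystery man.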
Test 2: Composite sketch

The next test is to determine whether the suspect and the mystery man resemble each other. All we need to do is to see whether there is a model that successfully distinguishes between documents in RP and documents in TL, using cross-validation accuracy as a measure. If no such model can be found, this would constitute strong evidence that the author of RP and the author of TL are one and the same, particularly if there are successful models for distinguishing between the mystery man and each of the impostors. Thus, we begin by learning models to distinguish between the mystery man's work, TL, and that of each of the impostors. Cross-validation accuracy results are shown in Table 3. It turns out that the mystery man and the suspect are, in fact, easily distinguishable, with cross-validation accuracy of 98.5%. Does this mean that the author of RP is not the author of TL?
              RP       Imp. 1    Imp. 2    Imp. 3    Imp. 4
    Accuracy  98.5%    99.0%     99.0%     97.3%     94.9%

Table 3 – Cross-validation accuracy for TL vs. others
A cursory glance at the texts indicates that the results do not necessarily support such a conclusion. The author of TL simply used certain stock phrases, which do not appear in RP, in a very consistent manner. The author of RP could easily have written TL and used such a ruse to deliberately mask his identity. Thus, this test is also indecisive.

Test 2a: Unmasking

Our objective is to determine whether two corpora are distinguishable on the basis of many differences, probably indicating different authors, or on the basis of only a few features. Differences regarding only a few features might not indicate different authors; a small number of differences are to be expected even in different works by a single author, as a result of varying content, purposes, genres, chronology, or even deliberate subterfuge by the author. We wish, then, to test the depth of difference between TL and RP, as well as, for comparison, between TL and each of the impostors. We already have effective models to distinguish TL from each other author. For each such model, we eliminate the five highest-weighted features in each direction and then learn a new model. We iterate this procedure ten times. The results (shown in Figure 1) could not be more glaring. For TL versus each author other than RP, we are able to distinguish with gradually degrading effectiveness as the best features are dropped. But for TL versus RP, the effectiveness of the models drops right off a shelf. This indicates that just a few features, possibly deliberately inserted as a ruse or possibly a function of slightly differing purposes assigned to the works, distinguish between the works. Once those are eliminated, the works become indistinguishable, a phenomenon which does not occur when we compare TL to each of the other collections. Thus we conclude that the author of RP probably was also the author of TL.
Figure 1 – Cross-validation accuracy of TL vs. others as the best features are iteratively dropped. The x-axis represents the number of iterations (1-11); the y-axis shows accuracy (50-100%). The dotted line is TL vs. RP.
3. The Algorithm

We wish to glean the lessons learned from the above example to formulate a precise algorithm for authorship verification. The skeleton of the algorithm is as follows:

Goal: Given corpora S, M, I1,...,In, determine whether the author of S is also the author of M.

1. For each j=1,...,n, run cross-validation experiments for S against Ij. If the results are not significant in each case, return FAIL.
2. Otherwise, for each j=1,...,n, use a single learned model for S against Ij to classify each document in M as either S or Ij. If, for some j, "many" documents in M are classified as Ij, return NO.
3. Otherwise, for each j=1,...,n, run an "unmasking" test (defined below) for M against Ij. In addition, run an unmasking test for M against S. If the results are "significantly worse" for M against S than for M against each Ij, return YES.

The unmasking test referred to in Step 3 works as follows. Given two corpora, A and B:

1. Run a cross-validation experiment for A against B.
2. Using a single model for A against B, eliminate the k highest-weighted features. (A linear separator, such as Winnow or a linear SVM, automatically provides such weights.)
3. Go to step 1.

Clearly, the terms "many" and "significantly worse" are not well defined. In the remainder of this paper, we will run a number of systematic experiments to determine how to define some of these terms more precisely. We will focus specifically on the parameters necessary for better specifying the unmasking algorithm. (A code sketch of the unmasking test, using the settings of the next section, follows that section.)

4. Experiments

For each experiment in this section, we choose works in a single genre (broadly speaking) by two or three authors who lived in the same country during the same period. We choose multiple works by at least one of these authors. (See Table 4 for the full list of works used in each experiment.) For each pair of works, we run an unmasking test using a linear SVM [10] as our learner. The object is to find consistent differences between the curves produced by unmasking tests on two corpora by the same author and those produced by unmasking tests on two corpora by different authors. For our purposes, a corpus is simply a book divided into chunks. In these experiments, each book was divided into at least 50 chunks containing not less than 300 words. The initial feature set consisted of the 100 most frequent words in the corpus; at each stage of unmasking, we eliminate five features. In Figures 2-5, we show the results of these unmasking tests. In each, we divide the results between (a) comparisons of two books by a single author and (b) comparisons of books by two different authors in the same group. Results are reported in terms of classification accuracy.
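As a concrete illustration, the following is a minimal sketch of the unmasking test using the settings described above (linear SVM, 5-fold cross-validation, five features eliminated in each direction per round). The function is generic; X and y denote the chunk vectors and labels for a pair of corpora, and all names are placeholders rather than the authors' code.

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    def unmask(X, y, n_iterations=10, k=5):
        # Returns the cross-validation accuracy curve obtained as the most
        # strongly weighted features are iteratively removed.
        X = np.asarray(X, dtype=float)
        active = list(range(X.shape[1]))            # feature indices still in play
        curve = []
        for _ in range(n_iterations):
            Xa = X[:, active]
            curve.append(cross_val_score(LinearSVC(), Xa, y, cv=5).mean())
            weights = LinearSVC().fit(Xa, y).coef_[0]
            order = np.argsort(weights)             # ascending by weight
            drop = set(order[:k]) | set(order[-k:]) # k strongest in each direction
            active = [f for i, f in enumerate(active) if i not in drop]
        return curve

    # A curve that degrades only gradually suggests different authors; one that
    # collapses after a few iterations suggests a single author. Each such curve
    # corresponds to one line in Figures 2-5.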
    Group     Author             Book Name
    Group A   Hawthorne          Doctor Grimshawe's Secret; House of Seven Gables
              Melville           Confidence Man; Moby Dick
              Cooper             Last of the Mohicans; The Spy; Water Witch
    Group B   Thoreau            Walden; A Week on Concord
              Emerson            Conduct of Life; English Traits
    Group C   Shaw               Pygmalion; Misalliance; Getting Married
              Wilde              An Ideal Husband; A Woman of No Importance
    Group D   Bronte, Anne       Agnes Grey; The Tenant of Wildfell Hall
              Bronte, Charlotte  The Professor; Jane Eyre
              Bronte, Emily      Wuthering Heights

Table 4 – List of authors and books tested
Note that as in the case of the responsa considered earlier, in many cases two works by the same author are initially easily distinguished from each other. But in each experiment, there is some level of elimination at which pairs of works by different authors are more easily distinguished from each other than pairs of works by the same author.
An example of a general rule suggested by these examples is the following: if a pair of books can be distinguished with cross-validation accuracy above 84% both initially and after four rounds of feature elimination, then those books are by two different authors; otherwise, they are by the same author. This rule works for every example we considered except for some of the works by different Bronte sisters, which were difficult to distinguish after a few rounds of feature elimination. Preliminary experiments indicate that even these few exceptions can be eliminated if the initial feature set is expanded from 100 to 250 words.
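This rule can be encoded directly against the accuracy curve produced by an unmasking run (for instance, the unmask() sketch given earlier); the function name and indexing are illustrative only.

    def different_authors(curve, threshold=0.84):
        # curve[0] is the initial accuracy; curve[4] is the accuracy after four
        # rounds of feature elimination. Declare "different authors" only if
        # both exceed the threshold.
        return curve[0] > threshold and curve[4] > threshold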
Figures 2a, 2b – Unmasking curves for Group A: (a) identical authors, (b) non-identical authors. Y-axis: cross-validation accuracy (0.5-1.0); x-axis: feature-elimination iteration.
Figures 3a, 3b – Unmasking curves for Group B: (a) identical authors, (b) non-identical authors. Y-axis: cross-validation accuracy (0.5-1.0); x-axis: feature-elimination iteration.
Figures 4a, 4b – Unmasking curves for Group C: (a) identical authors, (b) non-identical authors. Y-axis: cross-validation accuracy (0.5-1.0); x-axis: feature-elimination iteration.
Figures 5a, 5b – Unmasking curves for Group D: (a) identical authors, (b) non-identical authors. Y-axis: cross-validation accuracy (0.5-1.0); x-axis: feature-elimination iteration.
5. Conclusions

We have suggested a new approach to exploiting text categorization methods for solving the authorship verification problem. Our approach is limited to cases where the work to be classified is of sufficient length that it can be chunked into a significant number of reasonably long samples. (All the works we considered were at least 25,000 words long.) The heart of the approach is the "unmasking" process, which is a type of robustness test: if texts remain distinguishable as the best discriminators are eliminated, they are very likely by different authors. We have thus far assembled anecdotal evidence in support of the effectiveness of the technique. It remains for ongoing work to refine and test the hypothesis on a large out-of-sample test set.
6. Bibliography

[1] Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys 34(1), pp. 1-45.
[2] Mosteller, F. and Wallace, D. L. (1964). Inference and Disputed Authorship: The Federalist. Reading, Mass.: Addison-Wesley.
[3] Matthews, R. and Merriam, T. (1993). Neural computation in stylometry: An application to the works of Shakespeare and Fletcher. Literary and Linguistic Computing 8(4), pp. 203-209.
[4] Holmes, D. (1998). The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing 13(3), pp. 111-117.
[5] Argamon, S., Koppel, M. and Avneri, G. (1998). Style-based text categorization: What newspaper am I reading? In Proc. of the AAAI Workshop on Learning for Text Categorization, pp. 1-4.
[6] Stamatatos, E., Fakotakis, N. and Kokkinakis, G. (2001). Computer-based authorship attribution without lexical measures. Computers and the Humanities 35, pp. 193-214.
[7] de Vel, O., Anderson, A., Corney, M. and Mohay, G. M. (2001). Mining e-mail content for author identification forensics. SIGMOD Record 30(4), pp. 55-64.
[8] Kivinen, J. and Warmuth, M. (1997). Additive versus exponentiated gradient updates for linear prediction. Information and Computation 132(1), pp. 1-64.
[9] Dagan, I., Karov, Y. and Roth, D. (1997). Mistake-driven learning in text categorization. In EMNLP-97: 2nd Conf. on Empirical Methods in Natural Language Processing, pp. 55-63.
[10] Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proc. 10th European Conference on Machine Learning (ECML-98), pp. 137-142.