MORE ALIKE THAN NOT: AN ANALYSIS OF WORD FREQUENCIES IN FOUR GENERAL-PURPOSE TEXT CORPORA
TERRY COPECK, KEN BARKER, SYLVAIN DELISLE AND STAN SZPAKOWICZ
School of Information Technology and Engineering, University of Ottawa, Ottawa, Ontario, Canada K1N 6N5
Département de mathématiques et d’informatique, Université du Québec à Trois-Rivières, Trois-Rivières, Québec G9A 5H7
{terry, kbarker, szpak}@site.uottawa.ca
[email protected]
We compare word frequency lists derived from four general-purpose written English corpora: BNC, Brown, LOB and WSJ. Statistically significant correlation exists among the ranks of common vocabulary words appearing in more than one list, despite marked differences between the underlying corpora. The correlation may be sufficient to postulate a corpus-independent list for common words. Proper names and specific tokens such as numbers show much less correlation and should be separated from common words if the similarity among common words is to remain unobscured. Our result has a bearing on word-sense disambiguation, text categorization and text summarization.
Key words: corpus linguistics, word frequency, frequency lists, comparative frequency rank analysis, corpus-independent word frequency lists
1. INTRODUCTION
This work arose from our need for a word frequency list. We decided to create one by averaging the rankings in public domain frequency lists based on several large, widely known corpora representative of current written English. Four such lists can be found on the Internet: the British National Corpus (BNC), the Standard Corpus of Present-Day Edited American English (Brown), the Lancaster/Oslo-Bergen Corpus (LOB), and various collections of Wall Street Journal articles (WSJ). The four differ significantly in formulation, so each was first brought to a common format as far as practical.
Words in the ‘normalized’ lists were then classified into four categories of our own devising. Common words appear in our project’s dictionary of English words; proper words begin with a capital letter or have been tagged as proper nouns. Special words include numbers or punctuation other than hyphens; other words are those not otherwise classifiable.
The lists were then merged by category. Rankings were averaged and, for words appearing in more than one list, the standard deviation was computed to measure variation around this average. The category lists were then re-sorted on the average rank to put them back in order, with words of equal rank ordered alphabetically. Here is the entry that resulted for ‘that’, the 10th common word.
WORD   AVG    STDEV   bnc   brown   lob   wsj
that   9.25   2.28    8     7       9     13
TABLE 1. A Typical Entry in a Class List
‘That’ appears in all four corpora. 9.25 is the average of its rankings, which range from 7th most frequent word in Brown to 13th in WSJ. The standard deviation of the four values is 2.28.
What can be said of the categories used in the study? The English word-stock of half a million common words (OED 1997) may seem large but is dwarfed by the number of proper names for people, places, brands and so on. Frequency lists also include numbers. While ‘three’ or ‘one-half’ are arguably vocabulary words, most numbers more closely resemble proper names in uniqueness and specificity: ‘11283’, ‘0.4859’. The cardinality of numbers is also more akin to that of proper names.
In contrast, other and special are not real categories. Each could be reclassified as common or proper. Other exists only because lists of the size considered here required automatic processing and it was infeasible to develop a comprehensive mechanical classification scheme. Special words could also be reclassified. 7-bit ASCII is convenient but requires that accents be encoded. ‘Resumé’ and ‘Osnabrück’, coded as ‘resum&eacute;’ and ‘osnabr&uuml;ck’ in BNC, could be assigned respectively to common and proper, but manual adjudication would be needed to apply a methodology consistently. Sufficient resources not being available, both pseudo-classes were retained in the study.
How large are the two pseudo-classes? The four corpora contain 185,403 unique words. 29,533 are special. Of these, 45% include digits, another 15% incorporate the ampersand that signals BNC encoding of a special character, and 2% are phrases composed of two or more common words (‘out_of’, ‘a_lot’). The remaining special words defy easy classification. The 17,364 other words appear to be divided fairly equally between common and proper. Common words in this class are often inflected forms of roots found in the dictionary, such as ‘gendered’ and ‘gendering’ from ‘gender’. Classification of proper names is more arbitrary, depending on whether the word appears in a corpus that preserves word capitalization. Thus ‘French’ is proper while ‘british’ is other.
As noted, the set of proper names is theoretically much larger than the set of common words. Uniting sets of different sizes produces one on the scale of the larger. With each pseudo-category composed of both common and proper words in some measure, other and special are potentially as large as proper. Common is therefore the smallest class.
These relationships make it possible to anticipate outcomes before looking at the data. Repeatedly selecting elements from a small set will produce more duplicates than selecting from a large one. We therefore expect common vocabulary words to predominate towards the top of a frequency list. When the four classes of words are graphed (FIGURE 1) it becomes clear that this expectation is met: vocabulary words tend to head all four lists. Further, common words tend to be ranked appreciably closer to one another than do words in the other classes. Any comparison of frequency lists which fails to distinguish between common words and proper names will therefore obscure the degree to which the lists agree on words from the one class, the vocabulary, that all language users share to compose utterances.
2. THE LITERATURE
Attention has been paid to the frequency of words in English for many years. Early frequency lists such as the Teacher’s Word Book (Thorndike & Lorge 1944) and the American Heritage Word Frequency Book (Carroll et al. 1971) were printed in books. The movement to digital storage after 1960 allowed researchers other than the compilers to access data readily, and lists from corpora of that era remain in use: Brown (Francis & Kucera 1982) and LOB (Hofland & Johansson 1982). Although a corpus may be contrasted with others (LOB was intended to be a British counterpart to the American Brown), studies generally consider a single corpus. Another almost universal practice is to place all words on a par when compiling a frequency list.
A 1995 CORPORA discussion (www.hd.uib.no/corpora/1995-4/) offers justification for the former approach. Respondents to a request for a general word frequency list cautioned that, while such lists may have practical application (Piepenbrock 1995), they are “almost useless if taken from a domain different than the one under consideration ... [and] this is much less universal than people would think” (Dunning 1995, Clear 1995). Participants went on to acknowledge that “almost all - 96% - of the top 3,000 words in one large general-purpose corpus are also in the top 3,000 in another” (Kilgarriff 1995), a finding supported in our research.
Adam Kilgarriff has actively investigated word frequency. His survey of statistical techniques characterizes the corpora to which they apply (Kilgarriff 1996a). Other papers describe the application of BNC frequency information to augment dictionary entries (Kilgarriff 1996b) and to measure corpus homogeneity (Kilgarriff & Salkie 1996).
Lexicographers stand on the other side of the uniqueness issue. Learners’ dictionaries (LDOCE 1978, LET 1981) and core vocabularies (LDOCE 1978, COBUILD 1995) rely somewhat on generalized frequency findings.
3. THE DATA
BNC: The British National Corpus is a collection of 89,740,556 words in 3,209 documents assembled at Oxford University Press in the early 1990s. The list used in this study came from ftp.itri.bron.ac.uk/pub/bnc. Written.num.o5 contains 204,700 words occurring 5 or more times in the corpus. Entries have the form 561041 that cjt 3175: frequency, word, part-of-speech, number of documents in which the word occurs.
Brown: Compiled by Francis and Kucera at Brown University in 1961, it is composed of 500 2,000-word texts. A frequency list produced as part of the MRC Psycholinguistic Database contains all 43,300 words in the corpus. Kf.wds is found at www.cis.rl.ac.uk/proj/psych/other.html. A typical entry is 25 abandoned: frequency, word.
LOB: The Lancaster/Oslo-Bergen corpus was created in 1961 at the Norwegian Computing Center for the Humanities as a British English counterpart to Brown. It also contains 500 2,000-word texts. Lob.freq is available at gopher.sil.org/11/gopher_root/linguistics/info/; its 8,106 entries appear in the corpus 10 or more times. A typical entry is 11391 that: frequency, word.
WSJ: A number of corpora have been constructed using Wall Street Journal articles. Produced through the ACL’s Digital Corpus Initiative, ours includes 39,000,000 words from articles between 1987 and 1989. Its 5,000 most frequent words, appearing 104 times or more, are in kwc.freqs at gopher.sil.org/11/gopher_root/linguistics/info/. A typical entry is 28457 with: frequency, word.
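The entry formats listed above can be read with a few lines of code. The following Python sketch is illustrative only and is not the tooling used in the study: the names Entry, parse_entry and consolidate are our own, and the second BNC line in the usage example (‘100 that dt0 90’) is invented to show how several part-of-speech entries for one word might be merged in the way Section 4.1 describes (part-of-speech keys joined with semicolons, frequencies summed, largest document count kept).

```python
from typing import Dict, Iterable, NamedTuple, Optional


class Entry(NamedTuple):
    frequency: int
    word: str
    pos: Optional[str] = None        # part-of-speech code (BNC list only)
    documents: Optional[int] = None  # document count (BNC list only)


def parse_entry(line: str) -> Entry:
    """Parse 'frequency word' or 'frequency word pos documents'."""
    fields = line.split()
    if len(fields) == 2:
        return Entry(int(fields[0]), fields[1])
    if len(fields) == 4:
        return Entry(int(fields[0]), fields[1], fields[2], int(fields[3]))
    raise ValueError(f"unrecognized entry: {line!r}")


def consolidate(entries: Iterable[Entry]) -> Dict[str, Entry]:
    """Merge four-field (BNC-style) entries that share a word.

    Assumes every entry carries a part-of-speech code and a document count.
    """
    merged: Dict[str, Entry] = {}
    for entry in entries:
        seen = merged.get(entry.word)
        if seen is None:
            merged[entry.word] = entry
        else:
            merged[entry.word] = Entry(
                seen.frequency + entry.frequency,       # sum the frequencies
                entry.word,
                f"{seen.pos};{entry.pos}",              # join the POS keys
                max(seen.documents, entry.documents),   # keep the largest count
            )
    return merged


if __name__ == "__main__":
    print(parse_entry("11391 that"))  # LOB-style entry
    # The second line below is hypothetical, for illustration only.
    bnc_lines = ["561041 that cjt 3175", "100 that dt0 90"]
    print(consolidate(parse_entry(line) for line in bnc_lines)["that"])
```

Keeping the parser tolerant of both field counts lets one routine serve all four lists; anything else is rejected rather than guessed at.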
4. METHODOLOGY
4.1 Normalization
Corpus compilers must decide what constitutes a word, how to deal with accents, whether to retain capitals and so on. These decisions are reflected in a frequency list as much as in the corpus itself. The lists used in our study were created independently, so it is not surprising that they differ on many of these conventions. Our first task was to translate all four lists to a single representation, a time-consuming operation in which solving one inconsistency sometimes revealed another. For instance, lists that retain capitalization include common words in first-capitalized form when they are part of a proper name, e.g. ‘The’ and ‘Company’ in ‘The Bombay Company’.
The BNC list records each word together with its part of speech, so one word can have two or more entries. Our biggest task was to consolidate its 205,000 entries into 130,000 unique word entries which concatenate each word’s POS keys with semicolons, sum its frequency counts, and take the largest document count among its entries as the entry's document count.
Lowercase proper nouns cannot automatically be distinguished from common words not in the dictionary. Brown and BNC are all lowercase, but in each case there was a remedy. The BNC POS codes include np0 for proper nouns. Although consolidation complicated matters, this code was used to classify any BNC word as proper if it had a proper reading. The 59,000-word lexicon included with the Xerox tagger preserves capitalization. It is derived from Brown and can be used to identify proper nouns in that list.
Automated processing produces lists only as good as the input data. Probably ‘centimeteranddecimeter’ in the Brown list is a tokenization error and ‘broodinf’ is ‘brooding’ misspelled. Because our resources were limited and we judged such flaws too infrequent to invalidate our overall conclusions, these errors were ignored.
4.2 Classification
The dictionary used to identify vocabulary words in the study contains 370,000 words with roots and parts of speech. It is based on the 311,587-entry version of the Collins English Dictionary included on the ACL’s first CDROM, supplemented with entries from the Moby part-of-speech dictionary (Ward 1996). First-capitalized entries have been removed and no entry is accented.
Preference in classification follows the order common, special, proper, other to maximize recognition of common words. If 'ulster' is found in the dictionary it is classified common (garment) rather than proper (place name) regardless of capitalization, while no capitalized word will ever be judged other.
TABLE 2 summarizes the results of normalizing and classifying the four frequency lists, listing classes by repeated entries. Thus 2,957 common words were found in all four lists; 383 special ones in two lists.
4.3 Analysis and Conclusions
Assignment of words from each corpus to the four classes was first tabulated (TABLE 3) and charted in rank order (FIGURE 1) to establish that the corpora are generally consistent. Attention then turned to words appearing in more than one corpus. The rank order distribution of common repeated words is quite different from that of repeated proper names (FIGURE 2). Such words’ rank averages and standard deviations also differ markedly (FIGURE 3). For selected classes, ranks in each list were correlated with the average rank in the other lists (TABLE 4). While there are anomalies, coefficients are in general statistically significant and consistent with other findings, suggesting that frequency list research should distinguish between proper names and common vocabulary words.
5. DISCUSSION
TABLE 3 shows how many words in each class appear in one or more corpora; FIGURE 1 charts this data over the range of each list. Common words are most frequent by far at the head of a list, diminishing as rank order increases until they make up considerably less than half of all words in BNC, the largest list. Although lists vary greatly in size, they resemble one another when equal extents from the beginning are compared.

LISTS    COMMON   PROPER   OTHER    SPECIAL  TOTAL
4        2,957    96       16                3,069
3        4,455    389      55       43       4,942
2        21,454   3,062    1,088    383      25,987
1        31,200   31,526   14,959   28,638   106,323
TOTAL    99,305   39,201   17,364   29,533   185,403
TABLE 2. Counts of Repeated Words by Class

CORPUS   Original  Normalized  COMMON   PROPER   OTHER    SPECIAL
BNC      204,74?   129,60?     56,90?   31,45?   13,58?   27,66?
Brown    43,30?    43,29?      31,67?   6,730    3,657    1,232
LOB      8,106     7,950       7,026    583      81       260
WSJ      5,000     4,550       3,701    428      43       378
TOTAL    261,15?   185,403     99,305   39,201   17,364   29,533
TABLE 3. Entries in the Four Corpora, Total and by Class
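The class counts tabulated above result from the classification of Section 4.2, which prefers common, then special, then proper, then other. A minimal Python sketch of that preference order follows. It is an illustration under stated assumptions, not the program used in the study: the function name classify, the tagged_proper flag (standing in for a BNC np0 tag or a capitalized form in the Xerox lexicon) and the three-word toy dictionary are all hypothetical.

```python
import string
from typing import Set


def classify(word: str, dictionary: Set[str], tagged_proper: bool = False) -> str:
    """Assign a word to one class, trying common, special, proper, other in turn."""
    # Common: found in the dictionary regardless of capitalization.
    if word.lower() in dictionary:
        return "common"
    # Special: contains a digit or any punctuation other than a hyphen.
    if any(ch.isdigit() or (ch in string.punctuation and ch != "-") for ch in word):
        return "special"
    # Proper: capitalized, or tagged as a proper noun by an external source.
    if tagged_proper or word[:1].isupper():
        return "proper"
    # Other: whatever remains.
    return "other"


if __name__ == "__main__":
    toy_dictionary = {"ulster", "gender", "that"}  # hypothetical stand-in dictionary
    for token in ["Ulster", "11283", "out_of", "French", "british", "gendered"]:
        print(token, classify(token, toy_dictionary))
```

Testing the dictionary first is what makes a capitalized ‘Ulster’ come out common, and testing capitalization before the fallback is what guarantees no capitalized word is ever judged other.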
[Figure 1: panels LOB, Brown, BNC, WSJ; series Common, Proper, Other, Special; vertical axis Instances (0% to 100%); horizontal axis 1000 Words]
FIGURE 1: Word Distribution by Class, by Corpus
Spikes and fluctuations on the right in larger lists require explanation. At the low end of a frequency list many words share a rank. Thus while the first word in the BNC occurs 5,776,384 times and the 100th 82,878 times, each is the single instance of a word with that particular frequency. By contrast, 18,630 words occur six times, fully 10 percent of the list. Within each rank words are ordered alphabetically. Initial characters such as ‘&’ or a digit guarantee a word will be classified special. These two factors ensure that the BNC will group together in low frequency ranks sufficiently many special words to produce the spikes seen on the right side of its chart. The effect is minimized here by averaging a sliding window of five adjacent data points to smooth the data.
5.1 Repeated Words
FIGURE 2 graphs words occurring in one, two, three or all four corpora by class. While the strikingly different distribution of repeated common words is accounted for in part by the greater number of instances in this class, it is truly surprising that not one of nearly three thousand 4-occurrence common words ranks below 12,000 and only seven are below 8,000. Though words in LOB or WSJ cannot be ranked below 7,950 or 4,550 (the respective sizes of these lists), it is not hard to imagine some word in all four corpora whose rank in Brown or BNC would pull its overall average below 12,000, since the average last place ranking in the four corpora is 60,049 and the average median is 30,025. Data for 3-occurrence words is similar. This suggests that a word found in multiple corpora must be common, and must be ranked high in the frequency lists involved. It provides empirical evidence for Kilgarriff’s assertion about the top 3,000 words in a frequency list.
[Figure 2: panels Common, Proper, Other, Special; series 4 Occurrences, 3 Occurrences, 2 Occurrences, 1 Occurrence; vertical axis Duplicates / 1000 Words; horizontal axis 1000 Words]
FIGURE 2. Repeated Entries per 1000 Words, by Class
5.2 Rank and Standard Deviation
Average statistics for the distribution of words in more than one corpus are charted by class in FIGURE 3. To facilitate comparison the overall rank is also plotted: this straight COHORT line represents the plot of all words. A class’s RANK line necessarily plots above it.
[Figure 3: panels Common, Proper, Other, Special; series Cohort, Std Dev, Rank; vertical axis 0 to 90000; horizontal axis 1000 Words]
FIGURE 3. Rank and SD per 1000 Words, by Class
Of more interest is STD DEV, a measure of variance among the individual data points averaged to compose a word’s rank. Common differs markedly from the other three classes. Only for this class does the standard deviation fall below the cohort line, maintaining a 1:3 ratio to its rank up to the 34,000th position. The first 6,000 other words approach the same ratio. This can be accounted for by the common words in this class, which are known to tend to the beginning of a list. Special appears to have fewer common words. In contrast, the ratio to rank of proper words is 1:2 throughout, and that of special is 1:1. This means the ranking of common words in individual corpora tends to agree far more than does the ranking of proper words. The pseudo-classes special and other share the characteristics of proper to a greater or lesser degree.
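The two per-word statistics discussed here, average rank and standard deviation, can be illustrated with the TABLE 1 entry for ‘that’. The Python sketch below is not the study's own code; the function name rank_stats is ours. The population form of the standard deviation (statistics.pstdev) is used because it reproduces the 2.28 reported in TABLE 1, whereas the sample form would give a larger value.

```python
from statistics import mean, pstdev
from typing import Dict, Optional, Tuple


def rank_stats(ranks: Dict[str, Optional[int]]) -> Tuple[float, Optional[float]]:
    """Average rank and population standard deviation over the lists containing a word.

    A rank of None means the word is absent from that list. The standard
    deviation is returned only when the word appears in more than one list.
    """
    present = [r for r in ranks.values() if r is not None]
    average = mean(present)
    deviation = pstdev(present) if len(present) > 1 else None
    return average, deviation


if __name__ == "__main__":
    that = {"bnc": 8, "brown": 7, "lob": 9, "wsj": 13}  # ranks from TABLE 1
    average, deviation = rank_stats(that)
    print(average, round(deviation, 2))  # 9.25 2.28
    print(round(deviation / average, 2))  # ratio of standard deviation to rank
```

Dividing the standard deviation by the average rank gives, for a single word, the kind of ratio the text compares across classes.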
4 Occurrences
CLASS        #       BNC     Brown   LOB     WSJ     AVG
common       2,957   .7901   .7078   .7935   .4660   .6894
not common   112     .5298   .4904   .6828   .6229   .5815
proper       96      .4883   .4630   .6630   .6207   .5588

3 Occurrences
CLASS        #       BNC     Brown   LOB     WSJ     AVG
common       4,371   .4946   .4597   .5733   .1742   .4255
  (words)            4,357   4,366   3,801   589
not common   487     .2262   .3286   .2790   .1956   .2574
  (words)            456     443     345     217
proper       389     .2770   .3388   .3496   .3244   .3225
  (words)            362     383     261     161
TABLE 4. Counts and Correlations between Average and Corpus Rank, for Selected Classes
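Each coefficient in TABLE 4 relates the rank of words in one list to their average rank in the other lists. The Python sketch below shows one way such a coefficient might be computed. It assumes a Pearson product-moment correlation, which the text does not specify; the function names and the toy ranks (apart from those for ‘that’, taken from TABLE 1) are hypothetical.

```python
from math import sqrt
from statistics import mean
from typing import Dict, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; needs at least two points and non-constant inputs."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


def rank_vs_others(ranks: Dict[str, Dict[str, int]], target: str) -> float:
    """Correlate each word's rank in `target` with its mean rank in the other lists."""
    xs, ys = [], []
    for by_list in ranks.values():
        if target not in by_list:
            continue  # the word does not occur in the target list
        others = [r for name, r in by_list.items() if name != target]
        if not others:
            continue  # the word occurs in no other list
        xs.append(by_list[target])
        ys.append(mean(others))
    return pearson(xs, ys)


if __name__ == "__main__":
    toy = {
        "that": {"bnc": 8, "brown": 7, "lob": 9, "wsj": 13},      # from TABLE 1
        "alpha": {"bnc": 120, "brown": 95, "lob": 110, "wsj": 300},  # hypothetical
        "beta": {"bnc": 900, "brown": 1500, "lob": 1100},            # hypothetical
    }
    print(round(rank_vs_others(toy, "bnc"), 4))
```

A rank-based coefficient such as Spearman's could be substituted for pearson without changing the surrounding logic.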
5.3 Correlation among Lists
Frequency relationships among the four corpora were quantified by computing coefficients of correlation between the rank of words in one list and their average rank in the others. A word’s rank in BNC was therefore tested against the average of its rank in Brown, LOB and WSJ. The results appear in TABLE 4 together with word counts. Coefficients were calculated for common, proper and not common words appearing in three or all four corpora (not common words are the union of proper, special and other words).
4-occurrence common words correlate quite strongly across corpora, 3-occurrence ones less so. Correlation is somewhat weaker for 3- and 4-occurrence not common and proper words. All are significant at the .0001 level. This phase of analysis has not been entirely successful: many categorizations of the data correlate significantly with one another. The WSJ is at variance with the other three corpora, its not common and proper correlations being greater than its common ones for both 3- and 4-occurrence words. No explanation has yet been found for this, which diminishes the coefficient averages a good deal.
6. CONCLUSIONS & FUTURE WORK
These results reconcile the two points of view expressed in CORPORA: while each frequency list is unique, those based on written, general-purpose corpora have a statistically significant degree of resemblance when their shared vocabulary words are distinguished from proper names and other words. This provides theoretical support for basing learners' dictionaries in part on frequency data and for developing domain-independent lexicons for natural language processing. Frequency findings have application in all areas where statistical language processing is employed: text analysis and categorization, information extraction and text summarization, proper name detection and word sense disambiguation.
Extending the analysis to more complete lists drawn from the current corpora and to entirely new ones would strengthen the results. New corpora might include both general-purpose ones like ICE and CELEX and those with a specialized or spoken basis. Further investigation of certain findings is also needed. A final task would be to reanalyze the data with special and other words reclassified into the two true classes.
7. REFERENCES
CARROLL, J.B., P. DAVIES and B. RICHMAN. 1971. The American Heritage Word Frequency Book. Houghton Mifflin, Boston.
CLEAR, J. 1995. CORPORA list, 24 Nov 1995. http://www.hd.uib.no/corpora/1995-4/.
COBUILD. 1995. The Collins COBUILD English Language Dictionary, J.H. Sinclair et al eds. London.
DUNNING, T. 1995. CORPORA list, 23 & 27 Nov 1995. http://www.hd.uib.no/corpora/1995-4/.
FRANCIS, W.N. and H. KUCERA. 1982. Frequency Analysis of English Usage. Houghton Mifflin, Boston.
HOFLAND, K. and S. JOHANSSON (eds). 1982. Word Frequencies in British and American English. Norwegian Computing Center for the Humanities, Bergen.
KILGARRIFF, A. 1995. CORPORA list, 24 Nov 1995. http://www.hd.uib.no/corpora/1995-4/.
KILGARRIFF, A. 1996a. Which words are particularly characteristic of a text? A survey of statistical approaches. Language Engineering for Document Analysis, Sussex, UK, 33-40.
KILGARRIFF, A. 1996b. Putting frequencies in the dictionary. International Journal of Lexicography, 10(2), 135-155.
KILGARRIFF, A. and R. SALKIE. 1996. Corpus similarity and homogeneity via word frequency. In Proc. EURALEX ’96, Gothenburg, Sweden, 121-130.
LDOCE. 1978. Longman Dictionary of Contemporary English, P. Procter ed. Longman.
LET. 1981. Leuven English Teaching Vocabulary List, L. Engels et al eds. Leuven.
OED. 1997. About the Oxford English Dictionary, Second Edition on CD-ROM. http://www.oup-usa.org/oed/oedcdbroch2.html.
PIEPENBROCK, R. 1995. CORPORA list, 23 Nov 1995. http://www.hd.uib.no/corpora/1995-4/.
THORNDIKE, E.L. and I. LORGE. 1944. The Teacher's Word Book of 30,000 Words. Columbia University, New York.
WARD, G. 1996. Moby Lexicon Project. http://www.dcs.shef.ac.uk/research/ilash/Moby.