EXPLORING NEWSPAPERS: A CASE STUDY IN CORPUS ANALYSIS

George R S Weir
Department of Computer and Information Sciences, University of Strathclyde, Richmond Street, Glasgow, G1 1XH, UK
[email protected]

Nikolaos K. Anagnostou
Department of Computer and Information Sciences, University of Strathclyde, Richmond Street, Glasgow, G1 1XH, UK
[email protected]

ABSTRACT This paper details the analysis of text data taken from a leading Scottish newspaper. In this work, the purpose is twofold. Firstly, we report on the adopted approach and the application of simple corpus analysis tools and techniques. Secondly, we report some of the findings from our analysis of this particular newspaper corpus.

1. INTRODUCTION The increasing availability of user-friendly software applications for corpus analysis means that investigating the characteristics of example corpora is becoming easier. This growth in corpus analysis is matched by the proliferation of published work associated with this activity (e.g., Google Scholar lists 331,000 recent documents for the search term 'corpus analysis'), and increasing numbers of related academic texts (e.g., McEnery and Wilson, 2001; Hunston, 2002). The present work was motivated by a desire to investigate the practical issues associated with establishing and analysing a corpus of text from a leading Scottish newspaper. This was addressed from the perspective of interested 'laymen', since neither of the authors has training or background in linguistics. Consequently, this research activity is rightly seen as a case study that should shed light on the process of corpus analysis and the experience reported here may offer insight to others considering similar expeditions. To this end, we report on our analysis of text corpora from a leading Scottish newspaper. Our corpora contain all article texts from the daily newspaper for 2005. To these corpora, we applied simple analysis techniques as provided by several popular tools. In this process, we quickly appreciated the difficulties in using these tools on large text corpora. This experience led us to revert to a 'simpler' set of analysis tools that, while less sophisticated, were faster and more robust when dealing with large quantities of text. In the following, we describe the process of conducting these analyses and also record some of the results from the analysis.

2. INITIAL DATA The data for our corpus analysis comprised one year of the text content from a leading Scottish daily newspaper. This was provided as a single file of 32,070 articles, making up the full text corpus for the 2005 newspaper issue and only included material that was authored by journalists employed by the newspaper. Other copy, e.g., advertisements, readers' letters, etc., was not part of our data. From the outset, our interest was to derive insights on the dataset as a whole. For this reason we retained the single file context and did not extract the component texts. Each of these components was relatively short (~500 - 600 words) and we felt that comparisons at this micro level would hold less interest than a consideration of the full data set as a single corpus. Broad details of the corpus are given in Table 1, below.

No. of words (tokens):       15,453,587
No. unique words (types):    269,176
No. of sentences:            742,311
Average sentence length:     21
No. of newspaper articles:   32,070
No. of questions:            22,614
Table 1: Newspaper corpus statistics

2.1 Data analysis In order to extract data from this newspaper corpus, we applied our software tools in two phases. The first phase includes five steps, the first of which converts the corpus content to ASCII (or ensures that the version we will use to proceed contains no characters that would not appear in a printed text). This is achieved by applying a filter through which only the specified character sets are permitted. The second step operates on the filtered data from step 1, and extracts all tokens from the corpus. In the third step, the number of tokens is counted. This is followed by a fourth step in which the number of unique words is derived from the data. Finally, for phase one, the fifth step counts the frequency of occurrence for each unique word in the corpus. Phase two of our processing has four component steps. The first step applies a part-of-speech (POS) tagger to the corpus in order to generate a Part-of-speech tagged version of the corpus contents. Once this has completed, step two extracts all tagged tokens from the corpus. The third step counts the number of sentences in the corpus and the final step counts the frequency of occurrence for each unique tagged word in the corpus. These component operations in our two phase processing are shown in Figure 1, below.
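The five steps of phase one can be illustrated with a minimal Python sketch. This is not the software used in the study, which is not named here. The sample text, the token definition (runs of letters with optional internal apostrophes or hyphens), the lower-casing of words before counting, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical sample standing in for the newspaper corpus file.
SAMPLE = "The caf\u00e9 in Glasgow opened today. The Glasgow crowd was large."


def to_ascii(text):
    """Step 1: keep only printable ASCII characters, newlines and tabs."""
    return "".join(ch for ch in text if ch in "\n\t" or 32 <= ord(ch) < 127)


def tokenize(text):
    """Step 2: extract tokens (assumed here to be runs of letters,
    optionally joined by an internal apostrophe or hyphen)."""
    return re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)


def phase_one(text):
    tokens = tokenize(to_ascii(text))
    # Step 5: frequency of each unique word (lower-casing is an assumption).
    frequencies = Counter(token.lower() for token in tokens)
    return {
        "tokens": len(tokens),      # Step 3: number of tokens
        "types": len(frequencies),  # Step 4: number of unique words
        "frequencies": frequencies,
    }


result = phase_one(SAMPLE)
print(result["tokens"], result["types"])
print(result["frequencies"].most_common(3))
```

Filtering character by character keeps memory use predictable, which matters for a single file of this size. A stricter or looser token definition would change the counts, so any comparison between corpora should use the same definition throughout.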

Phase One: (1) Convert corpus to ASCII; (2) Tokenize the corpus; (3) Count tokens in corpus; (4) Extract unique words; (5) Count frequency of unique words

Phase Two: (1) POS tag the corpus; (2) Tokenize the tagged corpus; (3) Count sentences in corpus; (4) Count frequency of POS

Figure 1: Two phase corpus processing

This two phase process generates data that allows us to derive a variety of facts about the newspaper corpus. A selection of these results is presented in the following section.
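The counting steps of phase two can be sketched in the same spirit. The sketch below starts from text that has already been tagged, since the tagger used in the study is not named here. The `word/TAG` layout, the Penn Treebank style tag names, and the use of the `.` tag to mark a sentence boundary are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical tagger output: whitespace-separated word/TAG pairs.
TAGGED = "Glasgow/NNP is/VBZ large/JJ ./. She/PRP likes/VBZ Glasgow/NNP ./."


def split_tagged(text):
    """Step 2: extract (word, tag) pairs, splitting at the last slash."""
    pairs = []
    for item in text.split():
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


pairs = split_tagged(TAGGED)

# Step 3: count sentences (assumes the tag "." marks a sentence end).
sentence_count = sum(1 for _, tag in pairs if tag == ".")

# Step 4: frequency of each unique tagged word, plus totals per tag.
tagged_word_freq = Counter(pairs)
tag_freq = Counter(tag for _, tag in pairs)

print(sentence_count)
print(tagged_word_freq.most_common(2))
print(tag_freq.most_common(3))
```

Keeping the word and its tag together as one key means that the same word tagged in two different ways is counted separately, which is what makes later checks for tagging errors possible.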

3. DIMENSIONS OF ANALYSIS While there are many possible dimensions for analysis of this data, we selected three aspects that we considered interesting. The first dimension identifies a set of top ten terms in several categories. Our second dimension extracts insights on the variety of gender specific terms within the newspaper corpus. The third dimension contrasts characteristics of the newspaper corpus with a reference corpus (the British National Corpus).

3.1 Top tens In this first dimension, we reviewed the priority for several categories of term. Priority in this case equates to frequency of occurrence. For each of the following topics, we derived the top ten terms in order of frequency of occurrence in the 2005 newspaper corpus.

3.1.1 Countries Our first topic was countries and a list of country names was derived from the set of terms identified as proper nouns by the POS tagging process. The resultant country ranking, with absolute frequency, is shown in Table 2, below.

Rank  Country   Frequency
1     Scotland  32,906
2     Britain   5,168
3     England   5,055
4     Iraq      3,262
5     Ireland   2,187
6     Africa    2,100
7     America   1,837
8     China     1,444
9     Wales     1,393
10    Italy     1,288
Table 2: Top Ten Countries

Clearly, there may be many different reasons for the resultant ranking. In some instances, e.g., the prevalence of ‘Scotland’ in the list of countries, the result might be expected, although the vastly greater frequency of occurrence may surprise. (The figures indicate that ‘Scotland’ is ~six times ‘more important’ than ‘Britain’ and ‘England’.) In other instances, such as the appearance of ‘Italy’ in the top ten countries, the explanation is less obvious. Perhaps the significant presence of Italian immigrants in Scotland may hold the key.
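A ranking of this kind can be sketched as a filter over proper-noun frequencies followed by a sort. The frequencies below are copied from the tables in this paper, but the `COUNTRY_NAMES` set and the `top_n` helper are hypothetical. How the country names were actually separated from other proper nouns is not described, so a hand-made name list is assumed.

```python
from collections import Counter

# Proper-noun frequencies (values taken from the tables in this paper).
proper_noun_freq = Counter({
    "Scotland": 32_906,
    "Glasgow": 17_738,   # a city, so it should be filtered out
    "Britain": 5_168,
    "England": 5_055,
    "Gordon": 4_022,     # a forename, so it should be filtered out
    "Iraq": 3_262,
})

# Hypothetical hand-made list of names that count as countries.
COUNTRY_NAMES = {"Scotland", "Britain", "England", "Iraq", "Ireland"}


def top_n(frequencies, allowed_names, n=10):
    """Rank the allowed names by frequency, highest first."""
    selected = Counter(
        {name: count for name, count in frequencies.items()
         if name in allowed_names}
    )
    return selected.most_common(n)


for rank, (name, count) in enumerate(top_n(proper_noun_freq, COUNTRY_NAMES), 1):
    print(rank, name, f"{count:,}")
```

The same helper serves for towns and cities or forenames by swapping in a different name list.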

3.1.2 Towns and Cities As our second ranked category, we selected ‘Towns and Cities’. Once again, we derived a list of the top ten occurrences of this type by reviewing terms identified as proper nouns by the POS tagging process. The result of this selection is shown in Table 3, below.

Rank  Town/City      Frequency
1     Glasgow        17,738
2     Edinburgh      10,806
3     London         6,639
4     Aberdeen       3,426
5     Dundee         2,524
6     [St.] Andrews  1,208
7     Manchester     1,055
8     Motherwell     1,028
9     Stirling       1,009
10    Inverness      999
Table 3: Top Ten Towns and Cities

Notably, this ranking has eight Scottish locations within the top ten, with London and Manchester the sole English representation. The top twenty listing for this category contains a further six Scottish locations and two further English representations. Only two ‘foreign’ cities appear in the top twenty. The first ‘foreign’ city is Washington (with a frequency of 629) coming in at number 18, the other is Milan (with a frequency of 553) at number 20. The ranking and frequency of locations in this list suggest that this newspaper maintains a primarily Scottish focus. Furthermore, while the dominance of ‘Glasgow’ is to be expected for a Glasgow-based newspaper, the order of the initial four Scottish cities correctly reflects the order of population for Scottish cities. The Scottish Abstract of Statistics for the 1991 census, published by the Government Statistical Service, details the most populous Scottish towns and cities as: Glasgow (population: 662,954), Edinburgh (population: 401,910), Aberdeen (population: 189,707) and Dundee (population: 158,981).

3.1.3 Male Forenames Our third category for ranking was male forenames. We anticipated that the evident Scottish orientation of the newspaper would be reflected in names that are common in Scotland. This was only partially borne out by the resultant top ten, as shown in Table 4, below.

Rank  Forename  Frequency
1     John      7,451
2     David     6,137
3     Gordon    4,022
4     Paul      3,523
5     Michael   3,317
6     George    3,292
7     Ian       3,162
8     Murray    2,986
9     Tony      2,815
10    Mark      2,785
Table 4: Top Ten Male Forenames

Rank  Forename
1     Lewis
2     Jack
3     Callum
4     James
5     Ryan
6     Cameron
7     Kyle
8     Jamie
9     Daniel
10    Matthew
Table 5: Top Ten Male Forenames for 2005 (Register Office for Scotland)

Of the top ten male forenames, only three are distinctively Scottish (Gordon, Ian and Murray). This suggests that the occurrence of male forenames does not reflect name distribution in Scotland. This conclusion is supported by data drawn from the General Register Office for Scotland. Table 5 lists the top ten registered given names for males in the year 2005. Note that the ordered list of registered male forenames in Table 5 has no names in common with the newspaper results shown in Table 4. This signifies that those males named in the newspaper during 2005 do not reflect the distribution of names in the wider population. As a further contrast, we might add that Table 5 lists five names that are characteristically Scottish (Lewis, Callum, Cameron, Kyle and Jamie).

3.1.4 Female Forenames In similar fashion, we extracted the most commonly occurring female names from the 2005 newspaper corpus and compared this top ten with data obtained from the Register Office for Scotland. Female names in the top ten from the newspaper corpus (Table 6) had no specifically Scottish instances. When compared with the registered female names for 2005 (Table 7), we find that only one name (Lucy) appears in both lists. This further indicates that individuals named in the newspaper are not, in name, representative of the general population.

Rank  Forename   Frequency
1     Margaret   848
2     Mary       843
3     Alison     651
4     Lucy       633
5     Catherine  612
6     Susan      592
7     Elizabeth  587
8     Anne       581
9     Helen      555
10    Victoria   510
Table 6: Top Ten Female Forenames

Rank  Forename
1     Sophie
2     Emma
3     Ellie
4     Amy
5     Erin
6     Lucy
7     Katy
8     Chloe
9     Rebecca
10    Emily
Table 7: Top Ten Female Forenames for 2005 (Register Office for Scotland)

A counterpoint to this name comparison is that we may be comparing the newspaper name sets with the wrong year of publicly registered names. After all, the top ten names registered in 2005 were names given to children born in that year, whereas those named in the newspaper of 2005 were likely to be adults. On this line of thought, a proper comparison might contrast the top ten names from the 2005 newspaper with registered names of those who were born, say, in 1985. Unfortunately, lists for this period are not currently available online from the Scottish Register, so we cannot yet pursue this line of speculation.

3.2 Gender considerations Following the direction of research reported in Nakamura (2007), we sought to compare the presence of male and female references in the content of our newspaper corpus. To this end, we noted the frequency of occurrence for personal pronouns and quantified the proportion of male to female referents. The data for individual pronouns are shown in Table 8, below, and the comparison of male to female usage is detailed in Table 9.

Pronoun   Count    %
he        100,254  40
his       76,301   31
she       26,079   11
her       25,230   10
him       15,459   6
himself   3,688    1
herself   808      <1
hers      69       <1
Total:    247,888
Table 8: Gender specific pronouns

          Male     Female
Total:    195,702  52,186
Table 9: Aggregated male/female pronouns

These figures indicate that males are mentioned almost four times more often than females in the newspaper corpus. The scale of this disparity is exceeded by the frequency measures shown earlier for male and female forenames. In that context (see Table 4 and Table 6, above), we found that the top ten male forenames had a collective frequency of 39,490, while the top ten female forenames had a collective frequency of 6,412. This shows a ratio of approximately 6 male references to 1 female reference. This is disproportionate when considered against the gender ratio in the Scottish population. The 2001 census data (available from Scotland’s Census Results OnLine at http://www.scrol.gov.uk/) shows a Scottish population of 5,062,011 with a marginally smaller percentage of males (48.05) to females (51.95).
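The arithmetic behind these ratios can be reproduced directly from the published counts. The sketch below uses the pronoun counts from Table 8 and the forename totals quoted above. The grouping of pronouns into male and female sets, and the variable names, are introduced here for illustration.

```python
# Pronoun counts copied from Table 8.
pronoun_counts = {
    "he": 100_254, "his": 76_301, "she": 26_079, "her": 25_230,
    "him": 15_459, "himself": 3_688, "herself": 808, "hers": 69,
}

MALE_PRONOUNS = {"he", "his", "him", "himself"}

male_total = sum(c for p, c in pronoun_counts.items() if p in MALE_PRONOUNS)
female_total = sum(c for p, c in pronoun_counts.items() if p not in MALE_PRONOUNS)
pronoun_ratio = male_total / female_total

# Collective frequencies of the top ten forenames (Tables 4 and 6).
male_forenames = 39_490
female_forenames = 6_412
forename_ratio = male_forenames / female_forenames

print(f"pronouns:  {male_total:,} male to {female_total:,} female "
      f"(ratio {pronoun_ratio:.2f})")
print(f"forenames: {male_forenames:,} male to {female_forenames:,} female "
      f"(ratio {forename_ratio:.2f})")
```

The pronoun ratio comes out a little under four to one and the forename ratio a little over six to one, in line with the figures stated in the text.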

3.3 BNC Comparisons The purpose of the BNC comparison is to determine the extent to which the newspaper corpus resembles characteristics of the BNC reference corpus (Burnard, 2000). At the outset, this may seem an odd consideration. Why should a collection of newspaper texts show any similarity in character with a corpus intended to be general? Perhaps we should presume an absence of similarity and review the results of our comparison in this light. Our comparison of the 2005 newspaper corpus with the BNC is based primarily upon analysis of the part-of-speech content of each corpus. In this context, analysis of the newspaper data revealed the parts-of-speech counts for unique words (types) and all word occurrences (tokens), indicated in Table 10, below. Table 11 shows this data as percentages of the whole corpus.

POS           types    tokens
Adjectives    24,323   1,138,098
Adverbs       4,343    605,707
Nouns         134,600  5,018,944
Verbs         35,048   2,644,704
Determiners   229      1,531,069
Pronouns      373      750,205
Prepositions  1,125    1,850,521
Others        69,135   1,914,339
Table 10: POS counts for Newspaper Corpus

POS           types %  tokens %
Adjectives    9.04     7.36
Adverbs       1.61     3.92
Nouns         50.00    32.48
Verbs         13.02    17.11
Determiners   0.09     9.91
Pronouns      0.14     4.85
Prepositions  0.42     11.97
Others        25.68    12.39
Table 11: POS % for Newspaper Corpus
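The percentages in Table 11 follow from the counts in Table 10 by dividing each count by its column total. A short sketch of that derivation is given below. The counts are copied from Table 10, while the data structure and names are chosen here for illustration.

```python
# (types, tokens) per part-of-speech category, copied from Table 10.
POS_COUNTS = {
    "Adjectives": (24_323, 1_138_098),
    "Adverbs": (4_343, 605_707),
    "Nouns": (134_600, 5_018_944),
    "Verbs": (35_048, 2_644_704),
    "Determiners": (229, 1_531_069),
    "Pronouns": (373, 750_205),
    "Prepositions": (1_125, 1_850_521),
    "Others": (69_135, 1_914_339),
}

total_types = sum(types for types, _ in POS_COUNTS.values())
total_tokens = sum(tokens for _, tokens in POS_COUNTS.values())

# Percentage share of each category, rounded to two decimal places.
pos_percent = {
    pos: (round(100 * types / total_types, 2),
          round(100 * tokens / total_tokens, 2))
    for pos, (types, tokens) in POS_COUNTS.items()
}

print(f"{'POS':<13}{'types %':>9}{'tokens %':>10}")
for pos, (types_pct, tokens_pct) in pos_percent.items():
    print(f"{pos:<13}{types_pct:>9.2f}{tokens_pct:>10.2f}")
```

The column totals also act as a consistency check, since they should equal the numbers of types and tokens reported in Table 1.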

This data is represented graphically in the pie charts of Figures 2 and 3, below.

[Pie chart; segments: Adjectives, Adverbs, Nouns, Verbs, Determiners, Pronouns, Prepositions, Others]
Figure 2: Percentage of unique words (types) in Newspaper Corpus

[Pie chart; segments: Adjectives, Adverbs, Nouns, Verbs, Determiners, Pronouns, Prepositions, Others]
Figure 3: Percentage of total words (tokens) in Newspaper Corpus

In order to compare these data with the BNC, we counted the number of occurrences for each part-of-speech in the BNC and derived the percentage values indicated in Table 12, below.

POS                %
Nouns              26.62
Adjectives         7.82
Adverbs            6.05
Verbs              19.10
Personal pronouns  4.97
Determiners        2.38
Prepositions       11.16
Others             21.89
Table 12: POS % for BNC

These statistics allow a direct comparison between the newspaper corpus and the BNC in terms of the percentage contribution made by specific parts-of-speech. This comparison is shown in Figure 4, below.

[Bar chart; x-axis: POS by word type (Nouns, Adjectives, Adverbs, Verbs, Personal pronouns, Determiners, Prepositions, Others); y-axis: % (0 to 35); series: BNC, Newspaper]
Figure 4: Comparison of Newspaper Corpus and BNC by % POS

3.3.1 Comparison results As noted above, we should not be surprised to find variation between the newspaper and the BNC in terms of word frequency or POS distribution. Yet, the results from our comparison suggest a fairly strong parallel between the two corpora for most of their part-of-speech components. There are three notable exceptions. Firstly, the newspaper corpus has a greater proportion of nouns than the BNC (32% against 27%). Secondly, the BNC has a greater contribution from ‘non-specified’ parts-of-speech than does the newspaper corpus (22% against 12%). Finally, there is a wide difference in the proportion of determiners, with the newspaper showing roughly four times the proportion found in the BNC (9.91% against 2.38%). These disparities led us to re-examine in detail the newspaper corpus data from which the figures were derived. This revealed interesting facts about the process of corpus tagging. In particular, we reviewed the sets of nouns and determiners identified for the newspaper data. As a result, we found a variety of words that had been mis-tagged by the POS tagging software. For instance, the word ‘hers’ was identified consistently either as a plural common noun or as a proper noun. In addition, a large number of low frequency words, many apparently typographical errors in the original newspaper data, were also classed as proper nouns. Finally, the set of determiners identified by the tagging software included many terms composed of two conjoined words. This also arose from typing errors in the source data, in this instance missing spaces between adjacent words. Although we have not yet attempted further cleaning of the source data as a basis for re-calculating the POS proportions for the newspaper corpus, indications are that such steps are likely to bring the distribution of parts-of-speech even more in line with the BNC. This, in turn, suggests that the newspaper corpus is broadly ‘general’ in its POS distribution.
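The comparison can be made explicit by computing, for each category, the difference in percentage points and the ratio between the two corpora. The sketch below uses the token percentages from Table 11 and the BNC percentages from Table 12. The five-point threshold for flagging a category is an assumption chosen for illustration, and the newspaper category ‘Pronouns’ is paired with the BNC category ‘Personal pronouns’, which may not cover exactly the same tags.

```python
# Token percentages for the newspaper corpus (Table 11).
NEWSPAPER = {
    "Nouns": 32.48, "Adjectives": 7.36, "Adverbs": 3.92, "Verbs": 17.11,
    "Pronouns": 4.85, "Determiners": 9.91, "Prepositions": 11.97,
    "Others": 12.39,
}

# Percentages for the BNC (Table 12); "Personal pronouns" is stored
# under "Pronouns" so that the two tables share keys.
BNC = {
    "Nouns": 26.62, "Adjectives": 7.82, "Adverbs": 6.05, "Verbs": 19.10,
    "Pronouns": 4.97, "Determiners": 2.38, "Prepositions": 11.16,
    "Others": 21.89,
}

THRESHOLD = 5.0  # percentage points; an illustrative assumption

comparison = {
    pos: (round(NEWSPAPER[pos] - BNC[pos], 2),
          round(NEWSPAPER[pos] / BNC[pos], 2))
    for pos in NEWSPAPER
}
flagged = [pos for pos, (diff, _) in comparison.items()
           if abs(diff) > THRESHOLD]

for pos, (diff, ratio) in comparison.items():
    marker = "*" if pos in flagged else " "
    print(f"{marker} {pos:<13}{diff:>7.2f} points, ratio {ratio:.2f}")
```

With this threshold, the flagged categories are nouns, determiners and ‘others’, and every remaining category differs by fewer than three percentage points.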

4. SUMMARY AND CONCLUSIONS In the foregoing, we have described our approach to analysing corpus data taken from one year of a leading Scottish daily newspaper. Based upon our tagging, counting and data extraction, we detailed several dimensions of results. The most frequently cited countries, towns and cities, male and female forenames were noted and discussed. Thereafter, we considered the gender-specific references within the newspaper corpus and noted the significant disparity between male and female referents. Finally, we compared the data statistics from our newspaper corpus against data drawn from the British National Corpus. Aside from three aspects of parts-of-speech, there appeared to be a plausible match in distribution of parts-of-speech between the two corpora. Through this exercise in corpus analysis, we gained insight on the potentially rich data that lies within such a source corpus. On the negative side, we encountered problems arising from incorrect tagging on parts-of-speech. This offers two lessons. Firstly, it indicates a need to review the capabilities of the publicly available POS tagger that we employed in this work - with an eye to sourcing a more effective alternative. Secondly, this reveals inherent difficulties in working with a corpus that contains errors. Given the significant scale of such corpora, data integrity is difficult to validate in advance of its analysis. In consequence, there is an evident need for caution when drawing conclusions that may be influenced by noise in the source data.

REFERENCES

Burnard, L. 2000. Reference Guide for the British National Corpus (World Edition). URL: http://www.natcorp.ox.ac.uk/docs/userManual/urg.pdf Last Accessed 8 Jul. 2007.

Hunston, S. 2002. Corpora in Applied Linguistics. Cambridge: Cambridge University Press.

McEnery, A.M., Wilson, A. 2001. Corpus Linguistics. Edinburgh: Edinburgh University Press.

Nakamura, T. 2007. A Feminist Study of English Textbooks published in Japan. In Texts, Textbooks and Readability, G. Weir and T. Ozasa (Eds.), Strathclyde University Publishing, Glasgow, pp. 1-8.